Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [218]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [219]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [220]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [221]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [222]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [223]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [224]:
df['Date'] = pd.to_datetime(df.Date)

In [225]:
up_date = []
for date in df['Date']:
    up_date.append(date.year)
df['Date'] = up_date

In [226]:
import numpy as np
for i in df.iloc[:,21:-1].columns:
    df[str(i)] = df[str(i)].map({'X':1,'x':1,np.nan:0})

In [227]:
def fill_with_mean(df,columns):
    for col in columns:
        df[col] = df[col].fillna(df[col].mean())

fill_list = ['Length','Circum','Volume','Yelp','Google']
fill_with_mean(df,fill_list)

cols_to_drop = ['Chips','Mass (g)','Density (g/mL)']
df.drop(columns=cols_to_drop,inplace=True)

In [228]:
train = df.loc[(df['Date'] <= 2016)]
val = df.loc[(df['Date'] == 2017)]
test = df.loc[(df['Date'] >= 2018)]

In [229]:
target = 'Great'
y_train = train[target]
y_val = val[target]
y_test = test[target]

In [230]:
# baseline:
# Majority class of burritos are not great.
import seaborn as sns
import matplotlib.pyplot as plt
# sns.countplot(y_train)

In [231]:
# Guesses all burritos are not-great:
majority_class = y_train.mode()[0]
y_pred = [majority_class]*len(y_train)
# y_pred

In [232]:
# now lets get an accuracy score for the baseline:
from sklearn.metrics import accuracy_score
print(
    f"The accuracy of the baseline model: {int(round(accuracy_score(y_train,y_pred),2)*100)}%")

The accuracy of the baseline model: 59%


In [None]:
def drop_cols_on_all_dfs(df_list,col_to_drop):
    for i in df_list:
        i.drop(columns=col_to_drop,inplace=True)

drop_cols_on_all_dfs([train,val,test],'Date')    

In [240]:
features = train.columns

X_train = train[features]
X_val = val[features]
X_test = test[features]

In [None]:
import category_encoders as ce 
encoder = ce.OneHotEncoder(use_cat_names=True)

X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)
X_test_encoded = encoder.transform(X_test)

X_train_encoded

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') #try dif perams
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

In [243]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed) 
X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

In [245]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(X_train_scaled,y_train)
y_pred = model.predict(X_val_scaled)

validation_score = accuracy_score(y_val,y_pred)


1.0

In [247]:
y_pred = model.predict(X_test_scaled)

test_score = accuracy_score(y_test,y_pred)
test_score

1.0

In [248]:
pd.Series(model.coef_[0], X_train_encoded.columns)

Burrito_California     0.138768
Burrito_Carnitas      -0.003128
Burrito_Asada         -0.020400
Burrito_Other         -0.142864
Burrito_Surf & Turf    0.032452
Yelp                   0.055299
Google                -0.000864
Cost                   0.192120
Hunger                 0.094560
Length                 0.028574
Circum                 0.034331
Volume                 0.044090
Tortilla               0.254914
Temp                   0.228093
Meat                   0.564108
Fillings               0.467731
Meat:filling           0.467215
Uniformity             0.101787
Salsa                  0.142057
Synergy                0.628253
Wrap                   0.076537
Unreliable             0.125295
NonSD                  0.028424
Beef                  -0.120210
Pico                  -0.095569
Guac                   0.117085
Cheese                 0.005002
Fries                 -0.084506
Sour cream             0.008035
Pork                   0.040380
Chicken               -0.072825
Shrimp  