Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [97]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [98]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [99]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [100]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
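
The sequential `df.loc[...]` assignments above mean that a later assignment overwrites an earlier one when a name matches more than one keyword (for example, a name containing both "california" and "asada" ends up as 'Asada'). As a rough sketch of the same rule written in one step, the hypothetical helper `simplify_burrito` below uses `np.select`, which picks the first matching condition, so the conditions are listed in reverse order of the assignments. The toy names are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

def simplify_burrito(names: pd.Series) -> pd.Series:
    """Collapse free-text burrito names into five categories (illustrative helper)."""
    lowered = names.str.lower()
    # np.select takes the FIRST condition that is True, so the order here is
    # the reverse of the sequential .loc assignments (where the last one wins).
    conditions = [
        lowered.str.contains('carnitas', na=False),
        lowered.str.contains('surf', na=False),
        lowered.str.contains('asada', na=False),
        lowered.str.contains('california', na=False),
    ]
    labels = ['Carnitas', 'Surf & Turf', 'Asada', 'California']
    simplified = np.select(conditions, labels, default='Other')
    return pd.Series(simplified, index=names.index)

# Made-up names, only to exercise each branch
toy_names = pd.Series(['California', 'Carne asada', 'Surf and turf',
                       'Carnitas', 'Al pastor', 'California asada'])
toy_categories = simplify_burrito(toy_names).tolist()
print(toy_categories)
```

Writing the rule as an ordered list of conditions makes the precedence between overlapping keywords explicit rather than a side effect of assignment order.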

In [101]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [102]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [103]:
# Take a look at the data's missing values
df.isnull().sum()

Burrito             0
Date                0
Yelp              334
Google            334
Chips             395
Cost                7
Hunger              3
Mass (g)          399
Density (g/mL)    399
Length            138
Circum            140
Volume            140
Tortilla            0
Temp               20
Meat               14
Fillings            3
Meat:filling        9
Uniformity          2
Salsa              25
Synergy             2
Wrap                3
Unreliable        388
NonSD             414
Beef              242
Pico              263
Guac              267
Cheese            262
Fries             294
Sour cream        329
Pork              370
Chicken           400
Shrimp            400
Fish              415
Rice              385
Beans             386
Lettuce           410
Tomato            414
Bell peper        414
Carrots           420
Cabbage           413
Sauce             383
Salsa.1           414
Cilantro          406
Onion             404
Taquito           417
Pineapple 

In [104]:
# Convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'])

In [105]:
# Make the date the index
df = df.set_index(df['Date'])

In [106]:
# Look at the head
df.head()

Date (index),Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
2016-01-18,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-24,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-24,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-24,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-27,California,2016-01-27,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [107]:
df['Burrito'].value_counts()

California     169
Other          156
Asada           43
Surf & Turf     28
Carnitas        25
Name: Burrito, dtype: int64

In [108]:
# Look at the class counts of the target, which the baseline is built from
df['Great'].value_counts()

False    239
True     182
Name: Great, dtype: int64

In [109]:
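# Majority-class baseline: the accuracy of always predicting the most common class
print(f"Baseline accuracy (majority class): {df['Great'].value_counts(normalize=True).max():.4f}")

Baseline accuracy (majority class): 0.5677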
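
A majority-class baseline can also be scored with the same scikit-learn interface used for real models, which keeps the comparison like-for-like once train and validation splits exist. The sketch below uses `DummyClassifier(strategy='most_frequent')` on small made-up label arrays (`y_train_toy`, `y_val_toy` are illustrative, not the burrito data): it learns the most common class from the training labels and is then scored on the validation labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 0 = not great, 1 = great
y_train_toy = np.array([0, 0, 0, 1, 1])
y_val_toy = np.array([0, 1, 1, 0])

# DummyClassifier ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)  # learns that class 0 is most common

baseline_val_accuracy = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print('Baseline validation accuracy:', baseline_val_accuracy)
```

Fitting the baseline on the training labels only, rather than on the full dataset, avoids letting validation or test labels influence the number a model has to beat.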


In [110]:
X = df[['Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Cost', 'Hunger', 'Burrito']]
X

Date (index),Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Cost,Hunger,Burrito
2016-01-18,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,6.49,3.0,California
2016-01-24,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,5.45,3.5,California
2016-01-24,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,4.85,1.5,Carnitas
2016-01-24,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,5.25,2.0,Asada
2016-01-27,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,6.59,4.0,California
...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-27,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,6.00,1.0,Other
2019-08-27,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,6.00,4.0,Other
2019-08-27,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,7.90,3.0,California
2019-08-27,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,7.90,3.0,Other


In [111]:
r = X.index

In [112]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

In [113]:
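# Separate the feature matrix X from the target y
print(X.shape)
from sklearn.preprocessing import LabelEncoder
# LabelEncoder maps the boolean target to integers: False -> 0, True -> 1
le = LabelEncoder()
df['Great_Binary'] = le.fit_transform(df['Great'])
y = df['Great_Binary']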

(421, 12)


In [114]:
# Train data: reviews from 2016 and earlier
cutoff = X.index.year <= 2016

X_train = X[cutoff]
y_train = y[cutoff]

In [115]:
# Validation data: reviews from 2017
cutoff_2 = X.index.year == 2017

X_val = X[cutoff_2]
y_val = y[cutoff_2]

In [116]:
# Test data: reviews from 2018 and later
cutoff_3 = X.index.year >= 2018

X_test = X[cutoff_3]
y_test = y[cutoff_3]
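
The three year-based masks above can be wrapped in one function, which makes it easy to check that every row lands in exactly one split. This is a minimal sketch assuming a DataFrame with a `DatetimeIndex`; the helper name `split_by_year`, its parameters, and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def split_by_year(frame, train_end=2016, val_year=2017):
    """Split a DatetimeIndex-ed frame into train/validation/test by calendar year."""
    years = frame.index.year
    train = frame[years <= train_end]   # train_end and earlier
    val = frame[years == val_year]      # the validation year only
    test = frame[years > val_year]      # everything after the validation year
    return train, val, test

# Made-up rows, one or two per period
toy = pd.DataFrame(
    {'Cost': [5.0, 6.0, 7.0, 8.0]},
    index=pd.to_datetime(['2015-06-01', '2016-12-10', '2017-03-05', '2019-08-27']),
)
train_toy, val_toy, test_toy = split_by_year(toy)

# Sanity check: the splits cover every row exactly once
split_sizes = (len(train_toy), len(val_toy), len(test_toy))
print(split_sizes)
```

Splitting by time rather than at random mirrors how the model would be used: it is trained on past reviews and judged on later ones.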

In [117]:
X_train

Date (index),Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Cost,Hunger,Burrito
2016-01-18,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,6.49,3.0,California
2016-01-24,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,5.45,3.5,California
2016-01-24,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,4.85,1.5,Carnitas
2016-01-24,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,5.25,2.0,Asada
2016-01-27,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,6.59,4.0,California
...,...,...,...,...,...,...,...,...,...,...,...,...
2016-12-02,4.0,1.5,2.0,3.0,4.2,4.0,3.0,2.0,4.5,5.65,3.0,California
2016-12-02,4.5,5.0,2.0,2.0,2.5,3.5,3.0,2.5,3.0,5.49,3.0,Other
2016-12-10,3.5,2.5,3.0,3.3,1.4,2.3,2.2,3.3,4.5,7.75,4.0,California
2016-12-10,4.0,4.5,2.0,2.0,3.5,3.5,2.0,2.0,4.0,7.75,4.0,Asada


In [122]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from category_encoders import OneHotEncoder


In [123]:
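model = make_pipeline(
    OneHotEncoder(use_cat_names=True),   # one-hot encode the 'Burrito' category
    SimpleImputer(),                     # fill missing values with the column mean
    StandardScaler(),                    # put features on a common scale
    LogisticRegression(random_state=42)
)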

In [124]:
model.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(cols=['Burrito'], drop_invariant=False,
                               handle_missing='value', handle_unknown='value',
                               return_df=True, use_cat_names=True, verbose=0)),
                ('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', ran

In [125]:
y_val.shape

(85,)

In [126]:
X_val.shape

(85, 12)

In [127]:
logi = model.predict(X_test)

In [128]:
# Logistic regression metrics
from sklearn import metrics

print('logistic test accuracy score: ', metrics.accuracy_score(y_test, logi))

logistic test accuracy score:  0.7631578947368421


In [129]:
logi = model.predict(X_val)

In [130]:
# Accuracy on the 2017 validation split (a single hold-out set, not k-fold cross-validation)
print('logistic validation accuracy score: ', metrics.accuracy_score(y_val, logi))

logistic validation accuracy score:  0.8235294117647058


In [131]:
# Alternative score method
print('Training Accuracy Score:', model.score(X_train, y_train))
print('Validation Accuracy Score:', model.score(X_val, y_val))
print('Test Accuracy Score:', model.score(X_test, y_test))

Training Accuracy Score: 0.889261744966443
Validation Accuracy Score: 0.8235294117647058
Test Accuracy Score: 0.7631578947368421
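
For the "Get and plot your coefficients" stretch goal, the fitted estimator can be pulled out of a `make_pipeline` object through `named_steps`, using the lowercased class name as the key (the same `'logisticregression'` key appears in the pipeline repr above). The sketch below uses synthetic data and a pipeline without the encoder step so that it is self-contained; the column names and the rule generating `toy_y` are made up. With the notebook's own pipeline, the feature names would instead come from the columns the encoder step outputs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on 'Synergy' only; 'Cost' is noise
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    'Synergy': rng.uniform(1, 5, 200),
    'Cost': rng.uniform(4, 10, 200),
})
toy_y = (toy_X['Synergy'] > 3.5).astype(int)
toy_X.loc[toy_X.index[::25], 'Cost'] = np.nan  # a few missing values to impute

toy_model = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression(random_state=42),
)
toy_model.fit(toy_X, toy_y)

# One coefficient per (scaled) feature for the positive class
coefficients = pd.Series(
    toy_model.named_steps['logisticregression'].coef_[0],
    index=toy_X.columns,
).sort_values()
print(coefficients)
```

Because the features are standardized before the logistic regression, the coefficient magnitudes are roughly comparable across features. Calling `coefficients.plot.barh()` would draw them as a horizontal bar chart if matplotlib is installed.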


## Try XGBClassifier

In [138]:
from xgboost import XGBClassifier
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    XGBClassifier()
)

In [139]:
model.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(cols=['Burrito'], drop_invariant=False,
                               handle_missing='value', handle_unknown='value',
                               return_df=True, use_cat_names=True, verbose=0)),
                ('simpleimputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('standardscaler',
                 StandardScaler(copy...
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=1, missing=None,
                               n_estimators=100, n_jobs=1, nthread=Non

In [140]:
logi = model.predict(X_test)

In [141]:
print('XGBoost test accuracy score: ', metrics.accuracy_score(y_test, logi))

XGBoost test accuracy score:  0.7368421052631579


In [142]:
logi = model.predict(X_val)

In [143]:
print('XGBoost validation accuracy score: ', metrics.accuracy_score(y_val, logi))

XGBoost validation accuracy score:  0.8352941176470589


In [144]:
# Alternative score method
print('Training Accuracy Score:', model.score(X_train, y_train))
print('Validation Accuracy Score:', model.score(X_val, y_val))
print('Test Accuracy Score:', model.score(X_test, y_test))

Training Accuracy Score: 0.9899328859060402
Validation Accuracy Score: 0.8352941176470589
Test Accuracy Score: 0.7368421052631579
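
One way to read the two sets of scores side by side is to look at the gap between training and validation accuracy for each model. The short sketch below only re-uses the accuracy values printed above and the validation size of 85 rows reported by `y_val.shape`; the dictionary layout and variable names are illustrative.

```python
# Accuracy values copied from the notebook outputs above
scores = {
    'LogisticRegression': {'train': 0.889261744966443, 'val': 0.8235294117647058},
    'XGBClassifier': {'train': 0.9899328859060402, 'val': 0.8352941176470589},
}
n_val = 85  # rows in the 2017 validation split

# Train-minus-validation gap: a larger gap suggests more overfitting
gaps = {name: s['train'] - s['val'] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f'{name}: train-validation gap = {gap:.3f}')

# How many validation predictions separate the two models?
val_difference = scores['XGBClassifier']['val'] - scores['LogisticRegression']['val']
predictions_apart = round(val_difference * n_val)
print('Validation predictions apart:', predictions_apart)
```

On these numbers the XGBClassifier fits the training data almost perfectly but its validation edge over logistic regression amounts to a single prediction out of 85, so the validation scores alone do not clearly favor either model.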
