Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Make exploratory visualizations.
- [x] Do one-hot encoding.
- [x] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [x] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [8]:
# check out the resulting shape and columns
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [10]:
# convert Great from boolean to 0/1
df['Great'] = df['Great'].astype(int)
df['Great'].describe()

count    421.000000
mean       0.432304
std        0.495985
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Great, dtype: float64

In [0]:
# to-do list
#  Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
#  Begin with baselines for classification.
#  Use scikit-learn for logistic regression. (build pipeline)
#  Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
#  Get your model's test accuracy. (One time, at the end.)

In [0]:
# train/validate/test split
# Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
# first will make Date an actual datetime object
# (infer_datetime_format is deprecated in pandas 2.x, where format inference is the default)
df['Date'] = pd.to_datetime(df['Date'])

In [21]:
# split into train/val/test
train_cutoff = pd.to_datetime('2016-12-31')
test_cutoff = pd.to_datetime('2018-01-01')

train = df[df['Date'] <= train_cutoff]
test = df[df['Date'] >= test_cutoff]
val = df[(df['Date'] > train_cutoff) & (df['Date'] < test_cutoff)]

train.shape, val.shape, test.shape

((298, 59), (85, 59), (38, 59))
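
The same split can also be written in terms of the review year, which mirrors the assignment wording directly. This is a minimal sketch on a toy frame; `toy`, `train_toy`, `val_toy`, and `test_toy` are hypothetical names and the dates are made up, not taken from the burrito data.

```python
import pandas as pd

# Toy stand-in for the burrito reviews (hypothetical data, not the real file).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-12-31', '2018-01-01', '2019-03-15']),
    'Great': [0, 1, 0, 1, 1, 0],
})

year = toy['Date'].dt.year
train_toy = toy[year <= 2016]   # 2016 & earlier
val_toy = toy[year == 2017]     # 2017 only
test_toy = toy[year >= 2018]    # 2018 & later

# Every row should land in exactly one split.
assert len(train_toy) + len(val_toy) + len(test_toy) == len(toy)
print(train_toy.shape, val_toy.shape, test_toy.shape)
```

Splitting on the year avoids reasoning about inclusive versus exclusive cutoff dates, and the row-count check guards against rows falling between splits.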

In [22]:
# baseline for classification
# what is % of "great" burritos?
train['Great'].mean()

0.40939597315436244

### Only about 41% of the training burritos are 'Great', so the majority class is 0 (not great). Our baseline predicts 0 for every observation, and its accuracy is simply the share of not-great burritos (about 59% on the training set).

In [23]:
from sklearn.metrics import accuracy_score
baseline_preds = [0] * len(train)
# accuracy_score expects (y_true, y_pred)
accuracy_score(train['Great'], baseline_preds)

0.5906040268456376

In [24]:
baseline_preds = [0] * len(val)
accuracy_score(val['Great'], baseline_preds)

0.5529411764705883
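
A majority-class baseline like the one above can also be built with scikit-learn's `DummyClassifier`, which learns the most frequent class from the training labels instead of hard-coding 0. This is a swapped-in alternative to the hand-built list, sketched on made-up labels; `X_toy`, `y_toy`, and `baseline` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 "not great" (0) and 4 "great" (1).
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Because the dummy model follows the usual `fit`/`score` interface, it could be scored on a validation set the same way as any other estimator.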

In [0]:
# encode
# impute
# scale
# feature selection?
# fit
# predict

In [0]:
#  Use scikit-learn for logistic regression. (build pipeline)
from sklearn.pipeline import Pipeline
from category_encoders.one_hot import OneHotEncoder
from sklearn.impute import SimpleImputer 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

In [0]:
# setting X and y for train
target = "Great"
features = train.columns.drop([target, "Date"])
X_train = train[features]
y_train = train[target]

In [33]:
# encoding train set
encoder = OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_train_encoded.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Yelp,Google,Chips_nan,Chips_x,Chips_X,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable_nan,Unreliable_x,NonSD_nan,NonSD_x,NonSD_X,Beef_x,Beef_nan,Beef_X,Pico_x,Pico_nan,Pico_X,Guac_x,Guac_nan,Guac_X,...,Sauce_X,Salsa.1_nan,Salsa.1_x,Salsa.1_X,Cilantro_nan,Cilantro_x,Cilantro_X,Onion_nan,Onion_x,Onion_X,Taquito_nan,Taquito_x,Taquito_X,Pineapple_nan,Pineapple_x,Pineapple_X,Ham_nan,Ham_x,Chile relleno_nan,Chile relleno_x,Nopales_nan,Nopales_x,Lobster_nan,Lobster_x,Queso,Egg_nan,Egg_x,Mushroom_nan,Mushroom_x,Bacon_nan,Bacon_x,Sushi_nan,Sushi_x,Avocado_nan,Avocado_x,Corn_nan,Corn_x,Corn_X,Zucchini_nan,Zucchini_x
0,1,0,0,0,0,3.5,4.2,1,0,0,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
1,1,0,0,0,0,3.5,3.3,1,0,0,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
2,0,1,0,0,0,,,1,0,0,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
3,0,0,1,0,0,,,1,0,0,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
4,1,0,0,0,0,4.0,3.8,0,1,0,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
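
The encoded columns above include both `Chips_x` and `Chips_X`, which suggests the ingredient flags were entered in mixed case and are being treated as separate categories. One possible cleanup is sketched below, assuming that an `x` in either case means "present" and a missing value means "absent" (an assumption about the dataset, not something the notebook confirms). The `toy` frame and `flag_cols` list are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the ingredient flags (hypothetical values).
toy = pd.DataFrame({
    'Chips': ['x', 'X', np.nan, 'x'],
    'Beef': [np.nan, 'x', 'X', np.nan],
})

flag_cols = ['Chips', 'Beef']
# Lowercase each flag, then compare with 'x'; missing values compare as False.
flags = toy[flag_cols].apply(lambda col: col.str.lower().eq('x').astype(int))
print(flags)
```

Each flag then becomes a single 0/1 column rather than up to three one-hot columns (`_nan`, `_x`, `_X`).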


In [0]:
# impute train set
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)

In [0]:
# scale train set 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
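
The steps above call `fit_transform` on the training set only, so the validation and test sets can later be transformed with statistics learned from training data. A small sketch of what that means for the scaler, on made-up numbers (`train_toy`, `val_toy`, and `scaler_toy` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1.0], [2.0], [3.0]])   # hypothetical training column
val_toy = np.array([[2.0], [4.0]])            # hypothetical validation column

scaler_toy = StandardScaler()
train_scaled = scaler_toy.fit_transform(train_toy)  # learns the training mean and std
val_scaled = scaler_toy.transform(val_toy)          # reuses the training statistics

print(scaler_toy.mean_, val_scaled.ravel())
```

The validation value 2.0 maps to 0 because it equals the training mean, not the validation mean; no information from the validation set enters the fitted scaler.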

In [43]:
# fit model to train set
model = LogisticRegressionCV(max_iter=1000)
model.fit(X_train_scaled, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=1000, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [44]:
# run everything on validation set and get score 
X_val = val[features]
y_val = val[target]

X_val_encoded = encoder.transform(X_val)
X_val_imputed = imputer.transform(X_val_encoded)
X_val_scaled = scaler.transform(X_val_imputed)

model.score(X_val_scaled, y_val)

0.7647058823529411

## Model did better than baseline! (by ~21 percentage points)
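
The gap between the validation baseline and the model can be expressed in percentage points or as a relative improvement, and the two numbers differ noticeably. A quick check using the two validation accuracies reported above:

```python
baseline_acc = 0.5529411764705883   # majority-class baseline on validation
model_acc = 0.7647058823529411      # logistic regression on validation

point_gain = (model_acc - baseline_acc) * 100          # percentage points
relative_gain = (model_acc / baseline_acc - 1) * 100   # percent relative to baseline

print(f'{point_gain:.1f} percentage points, {relative_gain:.1f}% relative')
# → 21.2 percentage points, 38.3% relative
```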

In [45]:
# now build a pipeline
encoder = OneHotEncoder(use_cat_names=True)
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
model = LogisticRegressionCV(max_iter=1000)
burrito_pipeline = Pipeline([
    ('encode', encoder),
    ('impute', imputer),
    ('scale', scaler),
    ('model', model),
])

burrito_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('encode',
                 OneHotEncoder(cols=['Burrito', 'Chips', 'Unreliable', 'NonSD',
                                     'Beef', 'Pico', 'Guac', 'Cheese', 'Fries',
                                     'Sour cream', 'Pork', 'Chicken', 'Shrimp',
                                     'Fish', 'Rice', 'Beans', 'Lettuce',
                                     'Tomato', 'Bell peper', 'Carrots',
                                     'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro',
                                     'Onion', 'Taquito', 'Pineapple', 'Ham',
                                     'Chile relleno', 'Nopales', ...],
                               drop_inva...
                               verbose=0)),
                ('scale',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('model',
                 LogisticRegressionCV(Cs=10, class_weight=None, cv=None,
                                      dual=False,

In [46]:
# got val score in one function call
burrito_pipeline.score(X_val, y_val)

0.7647058823529411

In [47]:
# want to add some feature selection 
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k=20)
encoder = OneHotEncoder(use_cat_names=True)
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
model = LogisticRegressionCV(max_iter=1000)
burrito_pipeline = Pipeline([
    ('encode', encoder),
    ('impute', imputer),
    ('scale', scaler),
    ('selector', selector),
    ('model', model),
])

burrito_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('encode',
                 OneHotEncoder(cols=['Burrito', 'Chips', 'Unreliable', 'NonSD',
                                     'Beef', 'Pico', 'Guac', 'Cheese', 'Fries',
                                     'Sour cream', 'Pork', 'Chicken', 'Shrimp',
                                     'Fish', 'Rice', 'Beans', 'Lettuce',
                                     'Tomato', 'Bell peper', 'Carrots',
                                     'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro',
                                     'Onion', 'Taquito', 'Pineapple', 'Ham',
                                     'Chile relleno', 'Nopales', ...],
                               drop_inva...
                ('selector',
                 SelectKBest(k=20,
                             score_func=<function f_classif at 0x7fb845ab7d08>)),
                ('model',
                 LogisticRegressionCV(Cs=10, class_weight=None, cv=None,
                                      dual=False,

In [48]:
burrito_pipeline.score(X_val, y_val)

0.8705882352941177

### Adding a feature selector to our pipeline raised the validation accuracy by ~11 percentage points!
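
To see which features a selector kept and how the model weights them, the fitted steps can be pulled out of a pipeline through `named_steps`. Below is a minimal sketch on synthetic data; `toy_pipeline`, `X_toy`, and `y_toy` are hypothetical, the column names are borrowed only for readability, and the sketch uses plain `LogisticRegression` with no encoder step to stay small.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features; the target depends mostly on 'Synergy'.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 5)),
                     columns=['Tortilla', 'Meat', 'Salsa', 'Synergy', 'Wrap'])
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy['Synergy'] + 0.5 * X_toy['Meat'] + noise > 0).astype(int)

toy_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('selector', SelectKBest(k=3)),
    ('model', LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(X_toy, y_toy)

# Boolean mask of the columns that survived feature selection.
mask = toy_pipeline.named_steps['selector'].get_support()
selected = X_toy.columns[mask]

# One coefficient per selected feature (binary problem, so a single row).
coefs = pd.Series(toy_pipeline.named_steps['model'].coef_[0],
                  index=selected).sort_values()
print(coefs)
```

If matplotlib is available, `coefs.plot.barh()` would turn the series into a horizontal bar chart. With an encoder in the pipeline, the mask would need to be applied to the encoded column names rather than the raw ones.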

In [49]:
# will now do a single test of our final pipeline on the test set
X_test = test[features]
y_test = test[target]

burrito_pipeline.score(X_test, y_test)

0.7631578947368421

## Our final pipeline got an accuracy score of 76.3% on our test set, down from 87.1% on validation.
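
The test split holds only 38 reviews, so each single prediction moves the accuracy by a few points. A rough sketch of that granularity, using the reported test accuracy and a simple binomial standard error (a crude approximation at this sample size, offered only as an illustration):

```python
import math

n_test = 38                      # rows in the test split, from test.shape
test_acc = 0.7631578947368421    # reported test accuracy

n_correct = round(test_acc * n_test)                     # correctly classified reviews
per_row = 100 / n_test                                   # points per single review
std_err = math.sqrt(test_acc * (1 - test_acc) / n_test)  # binomial standard error

print(n_correct, round(per_row, 1), round(std_err * 100, 1))
# → 29 2.6 6.9
```

With roughly 7 points of sampling noise on 38 reviews, the gap between validation and test accuracy should be read cautiously.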