Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
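
As a possible starting point for the pipelines stretch goal, here is a minimal sketch that chains mean imputation, feature scaling, and logistic regression with `make_pipeline`. The arrays `X_demo` and `y_demo` are made-up stand-ins rather than the burrito data, and the choice of steps is an assumption about what a reasonable pipeline might contain.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (two numeric features, some values missing)
X_demo = np.array([
    [1.0, 6.5],
    [2.0, np.nan],
    [3.0, 7.0],
    [4.0, 7.5],
    [4.5, np.nan],
    [5.0, 8.0],
])
y_demo = np.array([False, False, False, True, True, True])

# Each step is fit on the data passed to .fit() and then re-applied,
# unchanged, whenever the pipeline predicts or scores
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(X_demo, y_demo)

# For a classifier at the end of the pipeline, .score() is mean accuracy
demo_accuracy = pipeline.score(X_demo, y_demo)
print('Accuracy on the toy data', demo_accuracy)
```

With the real data, the pipeline would be fit on the training split and scored on the validation split, so the imputer and scaler never see validation rows while fitting.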

In [92]:
%%capture
import sys

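# If you're on Colab (or in any Jupyter kernel):
# each module name needs its own `in` test; a bare non-empty string is always truthy
if 'google.colab' in sys.modules or 'jupyter_client' in sys.modules: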
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [93]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [94]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [95]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [96]:
# drop the columns where all values are NaN
df = df.dropna(axis=1, how='all')

In [97]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [98]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [99]:
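# Drop some columns to prevent "leakage":
# 'overall' is the rating that the 'Great' target was derived from, and
# 'Rec' appears to be the reviewer's recommendation, given together with it.
# Neither would be known when predicting for a burrito that is not yet rated.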
df = df.drop(columns=['Rec', 'overall'])
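
To see why a column like `overall` has to go, here is a small sketch using only made-up ratings (the list `overall_demo` is hypothetical, not taken from the dataset). A "model" that is allowed to look at the column the target was computed from can simply repeat the rule, so its perfect score says nothing about what makes a burrito great.

```python
# Hypothetical overall ratings on a 5 point scale
overall_demo = [2.5, 3.8, 4.0, 4.5, 3.0, 4.2]

# The target is derived from the rating, as in the cell that defines 'Great'
great_demo = [rating >= 4 for rating in overall_demo]

# A leaky "model" that sees the rating just re-applies the same rule
leaky_pred = [rating >= 4 for rating in overall_demo]

matches = sum(pred == truth for pred, truth in zip(leaky_pred, great_demo))
leaky_accuracy = matches / len(great_demo)
print(leaky_accuracy)  # 1.0
```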

### 1. Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [100]:
# change date string to datetime object
print('before', df['Date'].describe())
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
print('after', df['Date'].describe())

before count           421
unique          169
top       8/30/2016
freq             29
Name: Date, dtype: object
after count                     421
unique                    169
top       2016-08-30 00:00:00
freq                       29
first     2011-05-16 00:00:00
last      2026-04-25 00:00:00
Name: Date, dtype: object


In [101]:
df['Date'].head()

0   2016-01-18
1   2016-01-24
2   2016-01-24
3   2016-01-24
4   2016-01-27
Name: Date, dtype: datetime64[ns]

In [102]:
df.shape

(421, 58)

In [103]:
# split into train, validation, and test sets by review date:
# train = 2016 & earlier, validation = 2017, test = 2018 & later
cond_train = df['Date'] < '2017-01-01'
cond_valid = (df['Date'] > '2016-12-31') & (df['Date'] < '2018-01-01')
cond_test = (df['Date'] > '2017-12-31')
train = df[cond_train]
valid = df[cond_valid]
test = df[cond_test]

print('train', train['Date'].head())
print('validation', valid['Date'].head())
print('test', test['Date'].head())
print('train.shape', train.shape, 'valid.shape', valid.shape)

train 0   2016-01-18
1   2016-01-24
2   2016-01-24
3   2016-01-24
4   2016-01-27
Name: Date, dtype: datetime64[ns]
validation 301   2017-01-04
302   2017-01-04
303   2017-01-07
304   2017-01-07
305   2017-01-10
Name: Date, dtype: datetime64[ns]
test 77    2026-04-25
386   2018-01-02
387   2018-01-09
388   2018-01-12
389   2018-01-12
Name: Date, dtype: datetime64[ns]
train.shape (298, 58) valid.shape (85, 58)
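
An alternative way to express the same split is to compare calendar years through the `.dt.year` accessor instead of comparing against date strings. This is only a sketch on a hypothetical six-row frame named `demo`; the boundaries mirror the assignment (2016 & earlier, 2017, 2018 & later).

```python
import pandas as pd

# Hypothetical review dates, two per split
demo = pd.DataFrame({
    'Date': pd.to_datetime([
        '2016-08-30', '2016-12-31',
        '2017-01-04', '2017-12-31',
        '2018-01-02', '2019-05-01',
    ])
})

year = demo['Date'].dt.year
train_demo = demo[year <= 2016]
valid_demo = demo[year == 2017]
test_demo = demo[year >= 2018]

# Every row should land in exactly one split
assert len(train_demo) + len(valid_demo) + len(test_demo) == len(demo)
print(len(train_demo), len(valid_demo), len(test_demo))  # 2 2 2
```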


### 2. Begin with baselines for classification.

In [104]:
# determine the majority class: 'Great' or not
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [105]:
# what if we guessed the majority class for every prediction?
majority_rate = y_train.mode()[0]
y_pred_train = [majority_rate]*len(y_train)

In [106]:
# use a classification metric: accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred_train)

0.5906040268456376

In [107]:
y_valid = valid[target]
y_pred = [majority_rate]*len(y_valid)
accuracy_score(y_valid, y_pred)

0.5529411764705883
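
The same majority-class baseline can also be written with scikit-learn's `DummyClassifier`, which keeps the fit/score interface used by real models. The labels below are hypothetical, and the single constant feature is a placeholder, since this strategy ignores the features entirely.

```python
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 'not Great' is the majority class in training
y_train_demo = [False, False, False, True, True]
y_valid_demo = [False, True, True, False]

# Placeholder features; 'most_frequent' never looks at them
X_train_demo = [[0]] * len(y_train_demo)
X_valid_demo = [[0]] * len(y_valid_demo)

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_demo, y_train_demo)

# Always predicts False, so it is right on 2 of the 4 validation labels
baseline_accuracy = baseline.score(X_valid_demo, y_valid_demo)
print(baseline_accuracy)  # 0.5
```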

### 3. Use scikit-learn for logistic regression.


In [108]:
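# Note: 'Great' is boolean, so select_dtypes(exclude='number') returns it too,
# and with only 2 unique values it passes the cardinality filter below.
# The target has to be kept out of the inputs when a model is fit.
numerics = train.select_dtypes(include='number').columns.tolist()
categoricals = train.select_dtypes(exclude='number').columns.tolist()
low_cardinality_categories = [col for col in categoricals if train[col].nunique() <= 10]

features = numerics + low_cardinality_categories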

X_train = train[features]
y_train = train[target]

X_valid = valid[features]
y_valid = valid[target]
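
One thing worth checking in a dtype-based selection like this is where boolean columns end up. The sketch below uses a hypothetical three-column frame (`demo`) to show that `select_dtypes(exclude='number')` returns boolean columns alongside the string ones, and one way to build a feature list that leaves the target out.

```python
import pandas as pd

# Hypothetical frame: one numeric, one string, and one boolean target column
demo = pd.DataFrame({
    'Cost': [6.5, 7.0, 8.0],
    'Chips': ['x', None, 'x'],
    'Great': [False, True, True],
})
target_demo = 'Great'

# bool is not a 'number' dtype, so the target shows up among the non-numerics
non_numeric = demo.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)  # ['Chips', 'Great']

# Build the feature list with the target explicitly excluded
features_demo = [col for col in demo.columns if col != target_demo]
print(features_demo)  # ['Cost', 'Chips']
```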

In [109]:
X_train.shape, X_valid.shape

((298, 57), (85, 57))

In [110]:
import category_encoders as ce

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_enc = encoder.fit_transform(X_train)
X_valid_enc = encoder.transform(X_valid)

In [111]:
X_train_enc.shape, X_valid_enc.shape

((298, 123), (85, 123))
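
Column names such as `Chips_x` and `Chips_X` in the encoded data suggest that the ingredient markers are recorded in mixed case, so one ingredient can be split across two indicator columns. Here is a sketch of the effect, and of lower-casing before encoding. It uses `pd.get_dummies` on a hypothetical Series instead of `category_encoders`, purely to keep the example small.

```python
import pandas as pd

# Hypothetical ingredient column with mixed-case markers and a missing value
chips = pd.Series(['x', 'X', None, 'x'], name='Chips')

# Encoding the raw values gives separate indicators for 'x', 'X', and missing
raw = pd.get_dummies(chips, prefix='Chips', dummy_na=True)

# Lower-casing first merges 'x' and 'X' into a single indicator
clean = pd.get_dummies(chips.str.lower(), prefix='Chips', dummy_na=True)

print(raw.shape[1], clean.shape[1])  # 3 2
```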

In [112]:
X_train_enc.describe()
# Mass and Density are all NaN in the training set

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Chips_nan,Chips_x,Chips_X,Unreliable_nan,Unreliable_x,NonSD_nan,NonSD_x,NonSD_X,Beef_x,Beef_nan,Beef_X,Pico_x,Pico_nan,Pico_X,Guac_x,Guac_nan,Guac_X,Cheese_x,Cheese_nan,Cheese_X,Fries_x,Fries_nan,Fries_X,Sour cream_nan,Sour cream_x,Sour cream_X,Pork_nan,Pork_x,Pork_X,Chicken_nan,Chicken_x,Chicken_X,Shrimp_nan,Shrimp_x,Shrimp_X,Fish_nan,Fish_x,Fish_X,Rice_nan,Rice_x,Rice_X,Beans_nan,Beans_x,Beans_X,Lettuce_nan,Lettuce_x,Lettuce_X,Tomato_nan,Tomato_x,Tomato_X,Bell peper_nan,Bell peper_x,Bell peper_X,Carrots_nan,Carrots_x,Cabbage_nan,Cabbage_x,Cabbage_X,Sauce_nan,Sauce_x,Sauce_X,Salsa.1_nan,Salsa.1_x,Salsa.1_X,Cilantro_nan,Cilantro_x,Cilantro_X,Onion_nan,Onion_x,Onion_X,Taquito_nan,Taquito_x,Taquito_X,Pineapple_nan,Pineapple_x,Pineapple_X,Ham_nan,Ham_x,Chile relleno_nan,Chile relleno_x,Nopales_nan,Nopales_x,Lobster_nan,Lobster_x,Egg_nan,Egg_x,Mushroom_nan,Mushroom_x,Bacon_nan,Bacon_x,Sushi_nan,Sushi_x,Avocado_nan,Avocado_x,Corn_nan,Corn_x,Corn_X,Zucchini_nan,Zucchini_x
count,71.0,71.0,292.0,297.0,0.0,0.0,175.0,174.0,174.0,298.0,283.0,288.0,297.0,292.0,296.0,278.0,296.0,296.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0
mean,3.897183,4.142254,6.896781,3.445286,,,19.829886,22.042241,0.77092,3.472315,3.70636,3.551215,3.519024,3.52887,3.395946,3.32464,3.540203,3.955068,0.395973,0.04698,0.11745,0.369128,0.07047,0.926174,0.063758,0.010067,0.909396,0.090604,0.983221,0.010067,0.006711,0.436242,0.436242,0.127517,0.385906,0.520134,0.09396,0.338926,0.533557,0.127517,0.40604,0.5,0.09396,0.325503,0.600671,0.073826,0.714765,0.211409,0.073826,0.855705,0.097315,0.04698,0.932886,0.063758,0.003356,0.932886,0.057047,0.010067,0.983221,0.010067,0.006711,0.889262,0.080537,0.030201,0.892617,0.080537,0.026846,0.963087,0.030201,0.006711,0.97651,0.016779,0.006711,0.97651,0.013423,0.010067,0.996644,0.003356,0.97651,0.016779,0.006711,0.875839,0.110738,0.013423,0.979866,0.016779,0.003356,0.949664,0.030201,0.020134,0.942953,0.030201,0.026846,0.986577,0.010067,0.003356,0.97651,0.016779,0.006711,0.996644,0.003356,0.986577,0.013423,0.986577,0.013423,0.996644,0.003356,0.986577,0.013423,0.989933,0.010067,0.989933,0.010067,0.993289,0.006711,0.956376,0.043624,0.993289,0.003356,0.003356,0.996644,0.003356
std,0.47868,0.371738,1.211412,0.85215,,,2.081275,1.685043,0.137833,0.797606,0.991897,0.869483,0.850348,1.040457,1.089044,0.971226,0.922426,1.167341,0.489881,0.211952,0.322497,0.48338,0.256368,0.261927,0.244733,0.099997,0.287528,0.287528,0.128657,0.099997,0.081785,0.496752,0.496752,0.334112,0.487627,0.500435,0.292263,0.474141,0.499712,0.334112,0.491918,0.500841,0.292263,0.469351,0.490584,0.261927,0.452286,0.408995,0.261927,0.35198,0.296885,0.211952,0.25064,0.244733,0.057928,0.25064,0.232322,0.099997,0.128657,0.099997,0.081785,0.314336,0.27258,0.171429,0.31012,0.27258,0.161904,0.188865,0.171429,0.081785,0.151708,0.128657,0.081785,0.151708,0.11527,0.099997,0.057928,0.057928,0.151708,0.128657,0.081785,0.33032,0.314336,0.11527,0.140696,0.128657,0.057928,0.219004,0.171429,0.140696,0.232322,0.171429,0.161904,0.11527,0.099997,0.057928,0.151708,0.128657,0.081785,0.057928,0.057928,0.11527,0.11527,0.11527,0.11527,0.057928,0.057928,0.11527,0.11527,0.099997,0.099997,0.099997,0.099997,0.081785,0.081785,0.204601,0.204601,0.081785,0.057928,0.057928,0.057928,0.057928
min,2.5,2.9,2.99,0.5,,,15.0,17.0,0.4,1.4,1.0,1.0,1.0,0.5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.5,4.0,6.25,3.0,,,18.5,21.0,0.6625,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
50%,4.0,4.2,6.85,3.5,,,19.5,22.0,0.75,3.5,4.0,3.5,3.5,4.0,3.5,3.5,3.75,4.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.5,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
75%,4.0,4.4,7.5,4.0,,,21.0,23.0,0.87,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
max,4.5,4.9,11.95,5.0,,,26.0,27.0,1.24,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [113]:
X_valid_enc.isnull().sum()

Yelp                   72
Google                 72
Cost                    1
Hunger                  2
Mass (g)               63
Density (g/mL)         63
Length                 11
Circum                 11
Volume                 11
Tortilla                0
Temp                    5
Meat                    2
Fillings                1
Meat:filling            2
Uniformity              0
Salsa                   2
Synergy                 0
Wrap                    0
Burrito_California      0
Burrito_Carnitas        0
Burrito_Asada           0
Burrito_Other           0
Burrito_Surf & Turf     0
Chips_nan               0
Chips_x                 0
Chips_X                 0
Unreliable_nan          0
Unreliable_x            0
NonSD_nan               0
NonSD_x                 0
NonSD_X                 0
Beef_x                  0
Beef_nan                0
Beef_X                  0
Pico_x                  0
Pico_nan                0
Pico_X                  0
Guac_x                  0
Guac_nan    

In [114]:
# drop Mass and Density: both are entirely NaN in the training set,
# so there is no training mean to impute them with
print('before', X_train_enc.shape, X_valid_enc.shape)
X_train_enc = X_train_enc.drop(['Mass (g)', 'Density (g/mL)'], axis=1)
X_valid_enc = X_valid_enc.drop(['Mass (g)', 'Density (g/mL)'], axis=1)
print('after', X_train_enc.shape, X_valid_enc.shape)

before (298, 123) (85, 123)
after (298, 121) (85, 121)


In [115]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_imp = imputer.fit_transform(X_train_enc)
X_valid_imp = imputer.transform(X_valid_enc)

In [116]:
print('before impute', X_train_enc.shape, X_valid_enc.shape)
print('After impute', X_train_imp.shape, X_valid_imp.shape)

before impute (298, 121) (85, 121)
After impute (298, 121) (85, 121)


In [117]:
X_train_imp = pd.DataFrame(X_train_imp, columns=X_train_enc.columns)
X_valid_imp = pd.DataFrame(X_valid_imp, columns=X_valid_enc.columns)

In [118]:
train.describe()

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap
count,71.0,71.0,292.0,297.0,0.0,0.0,175.0,174.0,174.0,298.0,283.0,288.0,297.0,292.0,296.0,278.0,296.0,296.0
mean,3.897183,4.142254,6.896781,3.445286,,,19.829886,22.042241,0.77092,3.472315,3.70636,3.551215,3.519024,3.52887,3.395946,3.32464,3.540203,3.955068
std,0.47868,0.371738,1.211412,0.85215,,,2.081275,1.685043,0.137833,0.797606,0.991897,0.869483,0.850348,1.040457,1.089044,0.971226,0.922426,1.167341
min,2.5,2.9,2.99,0.5,,,15.0,17.0,0.4,1.4,1.0,1.0,1.0,0.5,1.0,0.0,1.0,0.0
25%,3.5,4.0,6.25,3.0,,,18.5,21.0,0.6625,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5
50%,4.0,4.2,6.85,3.5,,,19.5,22.0,0.75,3.5,4.0,3.5,3.5,4.0,3.5,3.5,3.75,4.0
75%,4.0,4.4,7.5,4.0,,,21.0,23.0,0.87,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0
max,4.5,4.9,11.95,5.0,,,26.0,27.0,1.24,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [119]:
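# 1. Import estimator class
from sklearn.linear_model import LogisticRegression

# 2. Drop the target if it is among the features ('Great' is boolean, so the
#    dtype-based selection above can pick it up as a categorical)
X_train_fit = X_train_imp.drop(columns=[target], errors='ignore')
X_valid_fit = X_valid_imp.drop(columns=[target], errors='ignore')

# 3. Instantiate this class and fit; a classifier's .score() is mean accuracy
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_fit, y_train)
print('Validation Accuracy', log_reg.score(X_valid_fit, y_valid))

# LinearRegression.score() returns R^2 instead; fit with the target left in,
# it reported 0.9999847440752175 here, which signals leakage, not skill.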
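
What `.score()` reports depends on the estimator: for `LogisticRegression` it is mean accuracy, while for `LinearRegression` it is the R^2 coefficient of determination. A quick sketch on made-up data (`X_demo` and `y_demo` are hypothetical) compares each against the matching metric function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Hypothetical one-feature data with a binary label
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_demo = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X_demo, y_demo)
reg = LinearRegression().fit(X_demo, y_demo)

# Classifier: .score() matches accuracy_score on the predicted labels
clf_score = clf.score(X_demo, y_demo)
clf_accuracy = accuracy_score(y_demo, clf.predict(X_demo))

# Regressor: .score() matches r2_score on the continuous predictions
reg_score = reg.score(X_demo, y_demo)
reg_r2 = r2_score(y_demo, reg.predict(X_demo))

print('classifier score', clf_score, 'accuracy', clf_accuracy)
print('regressor score', reg_score, 'R^2', reg_r2)
```

An accuracy on 85 validation rows can only take values of the form k/85, which is one quick way to spot that a number is not an accuracy.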
