Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [5]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [6]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [7]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [8]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [9]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [12]:
# Convert Date column to datetime 
df['Date'] = pd.to_datetime(df['Date'])
df['Date'].dtype

dtype('<M8[ns]')

In [21]:
df['Date'].describe()

count                     421
unique                    169
top       2016-08-30 00:00:00
freq                       29
first     2011-05-16 00:00:00
last      2026-04-25 00:00:00
Name: Date, dtype: object

In [14]:
# Split into train/validate/test
t_six_and_below = df['Date'].dt.year <= 2016
t_sev = df['Date'].dt.year == 2017
t_eig = df['Date'].dt.year >= 2018
train = df[t_six_and_below].copy()
val = df[t_sev].copy()
test = df[t_eig].copy()
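
The three year-based masks above should place every review in exactly one subset. A minimal sketch of that check, using a toy frame with made-up dates (`toy`, `is_train`, `is_val`, `is_test`, and `split_sizes` are illustrative names, not part of the assignment):

```python
import pandas as pd

# Toy stand-in for the burrito data; these dates are made up for illustration.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-03-01', '2016-12-15', '2017-06-30',
                            '2018-01-02', '2019-08-27']),
})

year = toy['Date'].dt.year
is_train = year <= 2016
is_val = year == 2017
is_test = year >= 2018

# Each row should land in exactly one subset.
membership = is_train.astype(int) + is_val.astype(int) + is_test.astype(int)
assert (membership == 1).all()

split_sizes = {'train': int(is_train.sum()),
               'val': int(is_val.sum()),
               'test': int(is_test.sum())}
print(split_sizes)
```

A row with a missing date would fail all three comparisons and silently drop out of every subset, so the same check is worth running on the real masks.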

### Start with baselines
Determine majority class

In [44]:
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In the training set, roughly 59% of burritos were not rated "Great", so the majority class is `False`.

In [49]:
# Guessing majority class for every prediction
m_class = y_train.mode()[0]
y_pred = [m_class] * len(y_train)

In [52]:
# Base rate
from sklearn.metrics import accuracy_score as accScore

y_val = val[target]
y_val_pred = [m_class] * len(y_val)

print("Training data accuracy:", accScore(y_train, y_pred)) 
print("Validation data accuracy:", accScore(y_val, y_val_pred))

Training data accuracy: 0.5906040268456376
Validation data accuracy: 0.5529411764705883


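The model needs to beat the majority-class baseline: about 59% accuracy on the training data and about 55% on the validation data.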
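
The majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. This is a sketch on made-up labels, not the notebook's data; `y_train_toy`, `y_val_toy`, and `baseline` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 6 of 10 training labels are False.
y_train_toy = np.array([False] * 6 + [True] * 4)
y_val_toy = np.array([False, True, True, False])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_acc = accuracy_score(y_train_toy, baseline.predict(X_train_toy))
val_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(train_acc, val_acc)
```

Because the majority class is learned from the training labels only, the validation accuracy can differ from the training accuracy, as it does in the outputs above.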

### Use scikit-learn for classification


In [63]:
# View non-numeric columns to help select features
(train.describe(exclude='number').T
 .sort_values(by='unique', ascending=False))

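| | count | unique | top | freq | first | last |
| --- | --- | --- | --- | --- | --- | --- |
| Date | 298 | 110 | 2016-08-30 00:00:00 | 29 | 2011-05-16 | 2016-12-15 |
| Burrito | 298 | 5 | California | 118 | NaT | NaT |
| Rice | 33 | 2 | x | 24 | NaT | NaT |
| Corn | 2 | 2 | x | 1 | NaT | NaT |
| Pineapple | 7 | 2 | x | 5 | NaT | NaT |
| Taquito | 4 | 2 | x | 3 | NaT | NaT |
| Onion | 17 | 2 | x | 9 | NaT | NaT |
| Cilantro | 15 | 2 | x | 9 | NaT | NaT |
| Salsa.1 | 6 | 2 | x | 5 | NaT | NaT |
| Sauce | 37 | 2 | x | 33 | NaT | NaT |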


In [65]:
train.describe()

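| | Yelp | Google | Cost | Hunger | Mass (g) | Density (g/mL) | Length | Circum | Volume | Tortilla | Temp | Meat | Fillings | Meat:filling | Uniformity | Salsa | Synergy | Wrap | Queso |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 71.0 | 71.0 | 292.0 | 297.0 | 0.0 | 0.0 | 175.0 | 174.0 | 174.0 | 298.0 | 283.0 | 288.0 | 297.0 | 292.0 | 296.0 | 278.0 | 296.0 | 296.0 | 0.0 |
| mean | 3.897183 | 4.142254 | 6.896781 | 3.445286 | NaN | NaN | 19.829886 | 22.042241 | 0.77092 | 3.472315 | 3.70636 | 3.551215 | 3.519024 | 3.52887 | 3.395946 | 3.32464 | 3.540203 | 3.955068 | NaN |
| std | 0.47868 | 0.371738 | 1.211412 | 0.85215 | NaN | NaN | 2.081275 | 1.685043 | 0.137833 | 0.797606 | 0.991897 | 0.869483 | 0.850348 | 1.040457 | 1.089044 | 0.971226 | 0.922426 | 1.167341 | NaN |
| min | 2.5 | 2.9 | 2.99 | 0.5 | NaN | NaN | 15.0 | 17.0 | 0.4 | 1.4 | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.0 | NaN |
| 25% | 3.5 | 4.0 | 6.25 | 3.0 | NaN | NaN | 18.5 | 21.0 | 0.6625 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 2.5 | 2.5 | 3.0 | 3.5 | NaN |
| 50% | 4.0 | 4.2 | 6.85 | 3.5 | NaN | NaN | 19.5 | 22.0 | 0.75 | 3.5 | 4.0 | 3.5 | 3.5 | 4.0 | 3.5 | 3.5 | 3.75 | 4.0 | NaN |
| 75% | 4.0 | 4.4 | 7.5 | 4.0 | NaN | NaN | 21.0 | 23.0 | 0.87 | 4.0 | 4.5 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5.0 | NaN |
| max | 4.5 | 4.9 | 11.95 | 5.0 | NaN | NaN | 26.0 | 27.0 | 1.24 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | NaN |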


In [84]:
# Select features
target = 'Great'
unhelpful = ['Density (g/mL)', 'Mass (g)', 'Queso', 'Date']
features = train.columns.drop([target] + unhelpful)
features

Index(['Burrito', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger', 'Length',
       'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Unreliable',
       'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Sour cream',
       'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce',
       'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa.1',
       'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno',
       'Nopales', 'Lobster', 'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado',
       'Corn', 'Zucchini'],
      dtype='object')

In [85]:
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)

(298, 54) (298,)
(85, 54) (85,)


In [87]:
# Imports for the logistic regression workflow
# (this run does not yet replace upper case "X" entries)
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

In [88]:
# encode
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

In [92]:
# impute
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)
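
The imputer is fit on the training data only, so validation rows are filled with training-set means. A minimal sketch of that behavior on toy arrays (`train_toy`, `val_toy`, and `imputer_toy` are illustrative names):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One toy feature with a missing value in each split.
train_toy = np.array([[1.0], [3.0], [np.nan]])
val_toy = np.array([[np.nan], [10.0]])

imputer_toy = SimpleImputer(strategy='mean')
train_filled = imputer_toy.fit_transform(train_toy)  # learns mean = 2.0
val_filled = imputer_toy.transform(val_toy)          # reuses the training mean

print(imputer_toy.statistics_)
print(val_filled.ravel())
```

The missing validation value becomes 2.0, the training mean, rather than 10.0, the validation mean. That keeps information from the validation set out of the fitted transformer.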

In [95]:
# scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [114]:
# Logistic Regression
model = LogisticRegressionCV(max_iter=500, cv=3)
model.fit(X_train_scaled, y_train)
print("Validation Set Score:", model.score(X_val_scaled, y_val))

Validation Set Score: 0.8823529411764706
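
Two of the stretch goals, pipelines and coefficients, fit naturally together. The sketch below chains imputation, scaling, and a model in one scikit-learn `Pipeline`, then reads the coefficients by feature name. It makes several assumptions: the data is synthetic, the column names `signal` and `noise` are made up, plain `LogisticRegression` is swapped in for `LogisticRegressionCV` to keep the example small, and the one-hot encoding step is left out because the toy features are all numeric.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends only on 'signal'.
rng = np.random.default_rng(0)
n = 80
X_toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y_toy = X_toy['signal'] > 0

# Knock out a few values so the imputer has something to do.
X_toy.loc[X_toy.index % 10 == 0, 'noise'] = np.nan

# Each step is fit on the data passed to pipe.fit and reused at predict time.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=500)),
])
pipe.fit(X_toy, y_toy)

# Pair each coefficient with its column name for inspection or plotting.
coefs = pd.Series(pipe.named_steps['model'].coef_[0], index=X_toy.columns)
print(coefs.sort_values())
```

With the real data, calling `pipe.score(X_val, y_val)` would apply the already-fitted steps to the validation features in order, which removes the need to carry separate `_encoded`, `_imputed`, and `_scaled` variables for each split.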


In [115]:
# Test set accuracy
X_test = test[features]
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)
y_test = test[target]

print("Test Set Score:", model.score(X_test_scaled, y_test))

Test Set Score: 0.7631578947368421


On the test set, the logistic regression model reached about 76% accuracy, roughly 17 percentage points above the 59% majority-class baseline from the training data.
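
The gap over the baseline can be checked directly from the scores printed above. The helper `points_above` is a hypothetical name introduced for this sketch.

```python
# Scores copied from the notebook outputs above.
train_baseline = 0.5906040268456376
val_baseline = 0.5529411764705883
val_score = 0.8823529411764706
test_score = 0.7631578947368421


def points_above(score, baseline):
    """Difference between two accuracies, in percentage points."""
    return (score - baseline) * 100


val_gain = points_above(val_score, val_baseline)
test_gain_vs_train_baseline = points_above(test_score, train_baseline)
print(round(val_gain, 1), round(test_gain_vs_train_baseline, 1))
```

The notebook does not compute a majority-class accuracy on the test set itself, so the test comparison here uses the training baseline. A like-for-like figure would compare against the test set's own majority-class rate.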