Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- ✔️ Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- ✔️ Begin with baselines for classification.
- ✔️ Use scikit-learn for logistic regression.
- ✔️ Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- ✔️ Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH + 'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5-point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high-cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

### Do train/validate/test split.

Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
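
As a minimal sketch of the year-based split described above, the toy frame below stands in for the burrito reviews. The dates, labels, and the `toy_*` names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the burrito reviews (all values are made up).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-03-15',
                            '2018-01-02', '2019-07-04']),
    'Great': [True, False, True, False, True],
})

year = toy['Date'].dt.year
toy_train = toy[year <= 2016]   # 2016 & earlier
toy_val = toy[year == 2017]     # 2017 only
toy_test = toy[year >= 2018]    # 2018 & later

# Every row should land in exactly one split.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_val), len(toy_test))  # 2 1 2
```

Note that a row with a missing date would fail all three comparisons and silently drop out of every split, so it is worth checking the row counts as the assertion does here.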

In [0]:
# Drop 'Queso': every value in this column is NaN
df = df.drop(columns='Queso')

In [0]:
# infer_datetime_format is deprecated in pandas 2.x, so let pandas parse the dates directly
df['Date'] = pd.to_datetime(df['Date'])
df_train = df[df['Date'].dt.year < 2017]
df_val = df[df['Date'].dt.year == 2017]
df_test = df[df['Date'].dt.year > 2017]
df_train = df_train.drop(columns='Date')
df_val = df_val.drop(columns='Date')
df_test = df_test.drop(columns='Date')

### Begin with baselines for classification.

In [9]:
y_col = 'Great'
# Every column except the target; DummyClassifier ignores the feature values
X_cols = df_train.drop(columns='Great').columns

# Simple majority baseline
from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy='most_frequent')
model.fit(df_train[X_cols], df_train[y_col])

# how'd we do?
valscore = model.score(df_val[X_cols], df_val[y_col])
testscore = model.score(df_test[X_cols], df_test[y_col])
print('Simple Majority Baseline')
print(f'Validation Score: {valscore:.2f}')
print(f'Test Score: {testscore:.2f}')

Simple Majority Baseline
Validation Score: 0.55
Test Score: 0.42
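
The majority-class baseline can also be worked out by hand, which helps explain how a score below 0.5 (like the 0.42 test score above) can arise: the majority class is chosen from the training labels, and its share in a later split may be smaller. The sketch below uses hypothetical labels, not the burrito data, and the `*_toy` names are made up for illustration.

```python
import pandas as pd

# Hypothetical labels (not the burrito data).
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, True, False, True])

# The majority class is learned from the training labels only.
majority_class = y_train_toy.mode()[0]

# Baseline accuracy is the share of evaluation labels equal to that class.
baseline_val_accuracy = (y_val_toy == majority_class).mean()
print(majority_class, baseline_val_accuracy)  # False 0.25
```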


### Use scikit-learn for logistic regression.

In [0]:
# only use numeric columns
X_cols = ['Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum',
          'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
          'Uniformity', 'Salsa', 'Synergy', 'Wrap']

In [0]:
X_train = df_train[X_cols]
X_val = df_val[X_cols]
X_test = df_test[X_cols]
y_train = df_train[y_col]
y_val = df_val[y_col]
y_test = df_test[y_col]

In [0]:
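# Impute NaNs with column means (SimpleImputer's default strategy).
# Fit on the training set only, then reuse those means for val and test.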
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train = imputer.fit_transform(X_train, y_train)
X_val = imputer.transform(X_val)
X_test = imputer.transform(X_test)
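
To see what the imputation step does, here is a small sketch on made-up arrays (the `*_toy` names and values are illustrative only). It assumes the default `strategy='mean'`, as in the cell above: the column means are computed from the training rows and then reused to fill gaps in other splits.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two features with some missing values.
train_toy = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [np.nan, 30.0]])
val_toy = np.array([[np.nan, np.nan]])

toy_imputer = SimpleImputer()  # default strategy='mean'
train_filled = toy_imputer.fit_transform(train_toy)  # learns the training means
val_filled = toy_imputer.transform(val_toy)          # reuses them, no refit

print(toy_imputer.statistics_)  # column means of the training data: 2.0 and 20.0
print(val_filled)               # the missing validation values become 2.0 and 20.0
```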

In [17]:
# use k-fold cross validation
from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV(cv=5, max_iter=1000)
model.fit(X_train, y_train)

# how'd we do?
valscore = model.score(X_val, y_val)
testscore = model.score(X_test, y_test)
print('Logistic Regression w/ CV')
print(f'Validation Score: {valscore:.2f}')
print(f'Test Score: {testscore:.2f}')

Logistic Regression w/ CV
Validation Score: 0.88
Test Score: 0.79
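
For the feature scaling, pipeline, and coefficient stretch goals, one possible starting point is sketched below. It runs on synthetic data from `make_classification` rather than the burrito features, uses plain `LogisticRegression` instead of `LogisticRegressionCV` to keep it short, and the names `X_toy`, `y_toy`, and `pipe` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the burrito reviews), with a few NaNs injected.
X_toy, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Impute, scale, and fit in one estimator, so each step is fit on training rows only.
pipe = make_pipeline(SimpleImputer(), StandardScaler(),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_toy[:150], y_toy[:150])
holdout_accuracy = pipe.score(X_toy[150:], y_toy[150:])

# make_pipeline names each step after its lowercased class name.
coefs = pipe.named_steps['logisticregression'].coef_[0]
print(f'Holdout accuracy: {holdout_accuracy:.2f}')
print(coefs)
```

Because the features are standardized before fitting, the coefficient magnitudes are on a comparable scale, which makes them easier to compare or plot side by side.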
