<a href="https://colab.research.google.com/github/JaimieOnigkeit/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Jaimie_Onigkeit_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

## Data Exploration & Cleaning

In [0]:
import numpy as np
import pandas as pd

In [15]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,1.0,1,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,1.0,1,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,1,1.0,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,1.0,1,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,1.0,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,1.0,1,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
df.dtypes

In [0]:
df['Date'] = pd.to_datetime(df['Date'])

df.dtypes

In [0]:
df.isnull().sum()

In [19]:
df = df.replace('x', 1)
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,1.0,1,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,1.0,1,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,1,1.0,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,1.0,1,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,1.0,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,1.0,1,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
df = df.replace(np.NaN, 0)
df.head()

In [0]:
df.dtypes

In [33]:
df.describe()

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,Carrots,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Zucchini
count,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0,421.0
mean,0.803325,0.861283,6.949834,3.470428,28.541568,0.035288,13.469881,14.774703,0.524941,3.519477,3.603325,3.5,3.514608,3.50981,3.412708,3.171734,3.569952,3.951544,0.078385,0.002375,0.004751,0.009501,0.009501,0.002375,0.0,0.011876,0.007126,0.007126,0.004751,0.030879,0.002375
std,1.590489,1.698009,1.746725,0.861041,125.907378,0.15153,9.570803,10.541694,0.391316,0.794438,1.250767,1.04265,0.850634,1.114702,1.092065,1.199845,0.91851,1.163504,0.269096,0.048737,0.068842,0.097125,0.097125,0.048737,0.0,0.108459,0.084214,0.084214,0.068842,0.173195,0.048737
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,6.25,3.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,6.95,3.5,0.0,0.0,18.5,21.0,0.68,3.5,4.0,3.7,3.5,4.0,3.5,3.5,3.8,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,7.84,4.0,0.0,0.0,20.5,22.5,0.83,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,4.5,5.0,25.0,5.0,925.0,0.865672,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0


## Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.


In [27]:
cutoff1 = '2016-12-31'
cutoff2 = '2018-01-01'
train = df[df.Date <= cutoff1]
validate = df[(df.Date > cutoff1) & (df['Date'] < cutoff2)]
test = df[df.Date >= cutoff2]

print(train.shape, validate.shape, test.shape)

(298, 59) (85, 59) (38, 59)


##  Begin with baselines for classification.


I ran this a bunch of different times with different features. 

In [193]:
target = 'Great'
features = ['Uniformity', 'Fillings', 'Meat', 'Temp', 'Tortilla', 'Hunger','Avocado', 'Salsa']
y_train = train[target]
X_train = train[features]
y_validate = validate[target]
X_validate = validate[features]
y_test = test[target]
X_test = test[features]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
majority_class=y_train.mode()[0]
y_pred = [majority_class] * len(y_train)


In [195]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)

0.5906040268456376

##  Use scikit-learn for logistic regression.


In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_validate_encoded = encoder.transform(X_validate)

In [198]:
X_train_encoded

Unnamed: 0,Uniformity,Fillings,Meat,Temp,Tortilla,Hunger,Avocado,Salsa
0,4.0,3.5,3.0,5.0,3.0,3.0,0.0,4.0
1,4.0,2.5,2.5,3.5,2.0,3.5,0.0,3.5
2,4.0,3.0,2.5,2.0,3.0,1.5,0.0,3.0
3,5.0,3.0,3.5,2.0,3.0,2.0,0.0,4.0
4,5.0,3.5,4.0,5.0,4.0,4.0,0.0,2.5
...,...,...,...,...,...,...,...,...
296,4.0,3.0,2.0,1.5,4.0,3.0,0.0,3.0
297,3.5,2.0,2.0,5.0,4.5,3.0,0.0,3.0
298,2.3,3.3,3.0,2.5,3.5,4.0,0.0,2.2
299,3.5,2.0,2.0,4.5,4.0,4.0,0.0,2.0


In [0]:
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_validate_imputed = imputer.transform(X_validate_encoded)

In [200]:
X_train_imputed

array([[4. , 3.5, 3. , ..., 3. , 0. , 4. ],
       [4. , 2.5, 2.5, ..., 3.5, 0. , 3.5],
       [4. , 3. , 2.5, ..., 1.5, 0. , 3. ],
       ...,
       [2.3, 3.3, 3. , ..., 4. , 0. , 2.2],
       [3.5, 2. , 2. , ..., 4. , 0. , 2. ],
       [4.3, 3. , 4. , ..., 3.7, 0. , 0. ]])

In [0]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_validate_scaled = scaler.fit_transform(X_validate_imputed)

In [202]:
X_train_scaled

array([[ 0.56045325, -0.00827779, -0.40516734, ..., -0.49719437,
        -0.21357443,  0.71733617],
       [ 0.56045325, -1.15561741, -0.87406003, ...,  0.07597361,
        -0.21357443,  0.31814629],
       [ 0.56045325, -0.5819476 , -0.87406003, ..., -2.21669832,
        -0.21357443, -0.08104358],
       ...,
       [-0.95949116, -0.23774571, -0.40516734, ...,  0.64914159,
        -0.21357443, -0.71974738],
       [ 0.11341077, -1.72928722, -1.34295272, ...,  0.64914159,
        -0.21357443, -0.87942333],
       [ 0.82867873, -0.5819476 ,  0.53261804, ...,  0.3052408 ,
        -0.21357443, -2.47618282]])

In [203]:
model = LogisticRegressionCV()
model.fit(X_train_scaled, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

##  Get your model's validation accuracy. (Multiple times if you try multiple iterations.)


In [204]:
y_pred = model.predict(X_validate_scaled)
accuracy_score(y_validate, y_pred)

0.8588235294117647

##  Get your model's test accuracy. (One time, at the end.)


In [205]:
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.5789473684210527

I must have overfitted my model. 