<a href="https://colab.research.google.com/github/economicactivist/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [0]:
#Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [582]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
Burrito           421 non-null object
Date              421 non-null object
Yelp              87 non-null float64
Google            87 non-null float64
Chips             26 non-null object
Cost              414 non-null float64
Hunger            418 non-null float64
Mass (g)          22 non-null float64
Density (g/mL)    22 non-null float64
Length            283 non-null float64
Circum            281 non-null float64
Volume            281 non-null float64
Tortilla          421 non-null float64
Temp              401 non-null float64
Meat              407 non-null float64
Fillings          418 non-null float64
Meat:filling      412 non-null float64
Uniformity        419 non-null float64
Salsa             396 non-null float64
Synergy           419 non-null float64
Wrap              418 non-null float64
Unreliable        33 non-null object
NonSD             7 non-null object
Beef       

In [583]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


###There are a lot of NaN values. Most topping columns would work better as binary values. The "Queso" column should be dropped altogether because it is entirely NaN (and it also doesn't logically correspond to the "Cheese" column).

In [0]:
no_queso = df.drop("Queso", axis=1)

In [0]:
# Missing topping markers become the integer 0; "x", "X", and "Yes" become 1
filled_na = no_queso.loc[:, "Unreliable":].fillna(0).replace(["x", "X", "Yes"], 1)

In [0]:
no_queso.loc[:, "Unreliable":]=filled_na

In [587]:
no_queso.shape, df.shape

((421, 58), (421, 59))
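
The binary conversion described above can also be sketched on a small toy frame. This is only an illustration under assumptions: the values below are invented, and `notna()` treats any recorded marker as "present", so a column with explicit "No" entries (such as `Chips`) would need separate handling.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few topping columns (values are made up for illustration).
toppings = pd.DataFrame({
    "Beef": ["x", np.nan, "X"],
    "Guac": [np.nan, "Yes", np.nan],
    "Queso": [np.nan, np.nan, np.nan],
})

# Drop columns that are entirely missing, as was done with "Queso".
toppings = toppings.dropna(axis="columns", how="all")

# Any recorded marker ("x", "X", "Yes") becomes 1; a missing value becomes 0.
binary = toppings.notna().astype(int)
print(binary)
```

Storing integers rather than a mix of strings and numbers keeps the columns numeric, which is what the imputer and the model expect later on.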

In [588]:
df.Chips.unique()   # many different ways of saying "yes" in the Chips column

array([nan, 'x', 'X', 'Yes', 'No'], dtype=object)

In [0]:
import numpy as np

no_queso = no_queso.replace(['x', 'X', 'Yes'], 1).replace('No', np.nan)

In [590]:
no_queso.head()  

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
4,California,1/27/2016,4.0,3.8,1.0,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True


###The rest of the float columns are possible candidates for imputing, but first I need to split the data based on year. Train: 2016 & earlier, Validate: 2017, Test: 2018 & later

In [0]:
#Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [0]:
no_queso.Date = pd.to_datetime(no_queso.Date)

In [593]:
no_queso.Date.min()

Timestamp('2011-05-16 00:00:00')

In [0]:
train = no_queso[no_queso.Date.dt.year<=2016]

In [0]:
validate = no_queso[no_queso.Date.dt.year==2017]

In [0]:
test = no_queso[no_queso.Date.dt.year>=2018]

In [597]:
test[test.Date.dt.year==2026]

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
77,California,2026-04-25,,,,8.0,4.0,,,21.59,,,4.5,5.0,5.0,5.0,4.5,5.0,3.0,5.0,5.0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True


In [0]:
test = test.drop(77, axis=0)   # remove the row whose date was entered as the year 2026

In [0]:
y_train = train.Great
X_train = train.drop(y_train.name, axis=1)
y_valid = validate.Great
X_valid = validate.drop(y_valid.name, axis=1)
y_test = test.Great
X_test = test.drop(y_test.name, axis=1)

In [600]:
pd.Series([y_train, X_train, y_valid, X_valid, y_test, X_test]).apply(len)

0    298
1    298
2     85
3     85
4     37
5     37
dtype: int64
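
The year-based split, including the guard against an implausible date, could be sketched as follows. The toy dates and the `last_plausible_year` cutoff are assumptions made for illustration; the notebook itself drops the mis-dated row by its index label.

```python
import pandas as pd

# Toy reviews; the dates (including the 2026 entry) are invented for illustration.
reviews = pd.DataFrame({
    "Date": pd.to_datetime(["2015-06-01", "2016-12-31", "2017-03-15",
                            "2018-01-02", "2019-07-04", "2026-04-25"]),
    "Great": [True, False, True, False, True, True],
})

year = reviews["Date"].dt.year
last_plausible_year = 2019  # hypothetical cutoff, not taken from the dataset

# Rows dated after the cutoff are treated as data-entry errors and excluded.
plausible = year <= last_plausible_year

train = reviews[plausible & (year <= 2016)]
validate = reviews[plausible & (year == 2017)]
test = reviews[plausible & (year >= 2018)]

print(len(train), len(validate), len(test))
```

Filtering by a cutoff avoids hard-coding a row label, at the cost of having to choose the cutoff by hand.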

###Baselines for classification 

In [601]:
y_train.mean()

0.40939597315436244

If I guessed that every burrito in the training set was "Great", I'd be correct about 40.9% of the time. The majority class is "not Great", so the majority-class baseline accuracy is about 59.1%.
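
A classification baseline is usually the accuracy of always predicting the most frequent class. A minimal sketch on an invented toy target (not the real `y_train`) shows one way to read it off with pandas:

```python
import pandas as pd

# Toy target with a class imbalance (values invented for illustration).
y_toy = pd.Series([True, False, False, True, False,
                   False, False, True, False, False])

# Share of each class; the largest share is the majority-class baseline accuracy.
class_shares = y_toy.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(majority_class, baseline_accuracy)
```

Applied to a real target, the same two lines give the number that a trained model has to beat.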

In [0]:
#Use scikit-learn for logistic regression.

In [603]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV

log_reg = LogisticRegression(max_iter=1000)
log_reg2 = LogisticRegression(max_iter=1000)
log_cv = LogisticRegressionCV(max_iter=10000)
log_cv2 = LogisticRegressionCV(max_iter=10000)


# Impute missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
train_features =  X_train.loc[:, "Yelp":"Zucchini"]
valid_features =  X_valid.loc[:, "Yelp":"Zucchini"]
test_features =  X_test.loc[:, "Yelp":"Zucchini"]

X_train_imputed = imputer.fit_transform(train_features)
X_val_imputed = imputer.transform(valid_features)
X_test_imputed = imputer.transform(test_features)
# Stack the training and validation data for the second pair of models
X_train_val = np.append(X_train_imputed, X_val_imputed, axis=0)
y_train_val = np.append(y_train, y_valid, axis=0)
# Fit the models
log_reg.fit(X_train_imputed, y_train)
log_cv.fit(X_train_imputed, y_train)
log_reg2.fit(X_train_val, y_train_val)
log_cv2.fit(X_train_val, y_train_val)
# Apply the models to the validation data.
# The predictions look like this ...
print(log_reg.predict(X_val_imputed))
print(log_cv.predict(X_val_imputed))


[ True False  True  True False False  True  True False  True  True False
 False False  True False False  True  True  True  True  True False  True
  True  True  True  True  True  True  True False False  True False False
  True  True False False False  True  True  True  True  True False False
  True False False False  True  True False False False False False False
 False False False False False  True  True  True  True False  True  True
  True  True False False False  True  True  True  True  True False  True
  True]
[False False  True  True False False  True  True False  True  True False
 False False  True False False  True  True  True  True  True False  True
  True  True  True False  True  True  True False False False False False
  True  True False False False False  True  True  True  True False False
  True False False False  True  True False False False False False False
 False False False False False  True  True  True  True False  True  True
  True  True False False False  True  True 
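
The impute-then-fit steps above can also be chained in a scikit-learn pipeline, one of the stretch goals, with feature scaling added in between. The sketch below uses synthetic data rather than the burrito features, so the printed accuracy says nothing about this dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the burrito features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] + 0.5 * X[:, 1] > 0
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

X_fit, X_holdout = X[:90], X[90:]
y_fit, y_holdout = y[:90], y[90:]

# Each step is fit on the training rows only and then reused on held-out rows.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_fit, y_fit)

holdout_accuracy = pipeline.score(X_holdout, y_holdout)
print(f"Holdout accuracy: {holdout_accuracy:.3f}")
```

Bundling the steps means `fit` and `score` apply the imputer and scaler consistently, which removes the need to track separate `*_imputed` arrays.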

###Logistic Regression Accuracy for Validation Set

In [604]:
log_reg.score(X_val_imputed, y_valid)

0.8

###Logistic Regression CV Accuracy for Validation Set

In [605]:
log_cv.score(X_val_imputed, y_valid)

0.8470588235294118

###Logistic Regression Accuracy for Test Set


In [606]:
log_reg.score(X_test_imputed, y_test)

0.7837837837837838

###Logistic Regression CV Accuracy for Test Set

In [607]:
log_cv.score(X_test_imputed, y_test)

0.7837837837837838

###Logistic Regression Test Accuracy, Trained on Combined Training & Validation Sets

In [608]:
log_reg2.score(X_test_imputed, y_test)

0.7297297297297297

###Logistic Regression CV Test Accuracy, Trained on Combined Training & Validation Sets

In [609]:
log_cv2.score(X_test_imputed, y_test)

0.8108108108108109
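
For the "get and plot your coefficients" stretch goal, one possible approach is to pair the fitted model's `coef_` array with the feature names in a pandas Series. The sketch below is an assumption-laden illustration: the feature names are borrowed from the dataset, but the values and the target are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features; only the column names come from the burrito dataset.
rng = np.random.default_rng(0)
feature_names = ["Tortilla", "Meat", "Salsa"]
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_names)

# Invented target: driven up by "Meat" and down by "Salsa", plus noise.
noise = rng.normal(scale=0.5, size=200)
y = (2.0 * X["Meat"] - 1.0 * X["Salsa"] + noise) > 0

model = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (1, n_features) for a binary target; take the single row.
coefficients = pd.Series(model.coef_[0], index=feature_names).sort_values()
print(coefficients)
```

With matplotlib installed, `coefficients.plot.barh()` would draw the sorted coefficients as a horizontal bar chart. Coefficients are only comparable across features when the features share a scale, so scaling first is advisable.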