
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# Imports

In [7]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
import numpy as np

# Inspect

In [8]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Date            421 non-null    object 
 2   Yelp            87 non-null     float64
 3   Google          87 non-null     float64
 4   Chips           26 non-null     object 
 5   Cost            414 non-null    float64
 6   Hunger          418 non-null    float64
 7   Mass (g)        22 non-null     float64
 8   Density (g/mL)  22 non-null     float64
 9   Length          283 non-null    float64
 10  Circum          281 non-null    float64
 11  Volume          281 non-null    float64
 12  Tortilla        421 non-null    float64
 13  Temp            401 non-null    float64
 14  Meat            407 non-null    float64
 15  Fillings        418 non-null    float64
 16  Meat:filling    412 non-null    float64
 17  Uniformity      419 non-null    flo

In [10]:
df.columns.to_list()

['Burrito',
 'Date',
 'Yelp',
 'Google',
 'Chips',
 'Cost',
 'Hunger',
 'Mass (g)',
 'Density (g/mL)',
 'Length',
 'Circum',
 'Volume',
 'Tortilla',
 'Temp',
 'Meat',
 'Fillings',
 'Meat:filling',
 'Uniformity',
 'Salsa',
 'Synergy',
 'Wrap',
 'Unreliable',
 'NonSD',
 'Beef',
 'Pico',
 'Guac',
 'Cheese',
 'Fries',
 'Sour cream',
 'Pork',
 'Chicken',
 'Shrimp',
 'Fish',
 'Rice',
 'Beans',
 'Lettuce',
 'Tomato',
 'Bell peper',
 'Carrots',
 'Cabbage',
 'Sauce',
 'Salsa.1',
 'Cilantro',
 'Onion',
 'Taquito',
 'Pineapple',
 'Ham',
 'Chile relleno',
 'Nopales',
 'Lobster',
 'Queso',
 'Egg',
 'Mushroom',
 'Bacon',
 'Sushi',
 'Avocado',
 'Corn',
 'Zucchini',
 'Great']

# Clean

In [11]:
df.drop(['Unreliable','NonSD','Beef','Pico','Guac','Cheese','Fries','Sour cream',
         'Pork','Chicken','Shrimp','Fish','Rice','Beans','Lettuce','Tomato','Bell peper',
         'Carrots','Cabbage','Sauce','Salsa.1','Cilantro','Onion','Taquito','Pineapple',
         'Ham','Chile relleno','Nopales','Lobster','Queso','Egg', 'Mushroom','Bacon',
         'Sushi','Avocado','Corn','Zucchini','Density (g/mL)','Mass (g)','Chips'],
        axis=1, inplace=True)

In [12]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,California,1/18/2016,3.5,4.2,6.49,3.0,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,False
1,California,1/24/2016,3.5,3.3,5.45,3.5,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,False
2,Carnitas,1/24/2016,,,4.85,1.5,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,False
3,Asada,1/24/2016,,,5.25,2.0,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,False
4,California,1/27/2016,4.0,3.8,6.59,4.0,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,True


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Burrito       421 non-null    object 
 1   Date          421 non-null    object 
 2   Yelp          87 non-null     float64
 3   Google        87 non-null     float64
 4   Cost          414 non-null    float64
 5   Hunger        418 non-null    float64
 6   Length        283 non-null    float64
 7   Circum        281 non-null    float64
 8   Volume        281 non-null    float64
 9   Tortilla      421 non-null    float64
 10  Temp          401 non-null    float64
 11  Meat          407 non-null    float64
 12  Fillings      418 non-null    float64
 13  Meat:filling  412 non-null    float64
 14  Uniformity    419 non-null    float64
 15  Salsa         396 non-null    float64
 16  Synergy       419 non-null    float64
 17  Wrap          418 non-null    float64
 18  Great         421 non-null    

# Subset

In [14]:
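# Split plan: train on 2016 & earlier, validate on 2017, test on 2018 & later.
df['Date'] = pd.to_datetime(df['Date'])
# Encode the boolean target as 0/1. Assigning the result back avoids
# calling replace(..., inplace=True) on a column selection.
df['Great'] = df['Great'].astype(int)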

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Burrito       421 non-null    object        
 1   Date          421 non-null    datetime64[ns]
 2   Yelp          87 non-null     float64       
 3   Google        87 non-null     float64       
 4   Cost          414 non-null    float64       
 5   Hunger        418 non-null    float64       
 6   Length        283 non-null    float64       
 7   Circum        281 non-null    float64       
 8   Volume        281 non-null    float64       
 9   Tortilla      421 non-null    float64       
 10  Temp          401 non-null    float64       
 11  Meat          407 non-null    float64       
 12  Fillings      418 non-null    float64       
 13  Meat:filling  412 non-null    float64       
 14  Uniformity    419 non-null    float64       
 15  Salsa         396 non-null    float64   

In [16]:
train = df[df['Date'].dt.year < 2017].copy()
val = df[df['Date'].dt.year == 2017].copy()
test = df[df['Date'].dt.year > 2017].copy()

In [17]:
X_train = train.drop(['Great', 'Date'], axis=1)
y_train = train['Great']

X_val = val.drop(['Great', 'Date'], axis=1)
y_val = val['Great']

X_test = test.drop(['Great', 'Date'], axis=1)
y_test = test['Great']
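
The year-based split above can be wrapped in a small helper and sanity-checked on toy data. This is only a sketch: `split_by_year`, the `toy` frame, and its values are made up for illustration and are not part of the burrito dataset.

```python
import pandas as pd

# Toy data standing in for the reviews (dates and labels are invented).
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-06-01", "2016-03-10", "2016-11-02",
        "2017-01-15", "2017-08-30",
        "2018-02-20", "2019-05-05",
    ]),
    "Great": [0, 1, 0, 1, 1, 0, 1],
})

def split_by_year(frame, date_col="Date", val_year=2017):
    """Return (train, val, test): rows before, in, and after val_year."""
    year = frame[date_col].dt.year
    train = frame[year < val_year].copy()
    val = frame[year == val_year].copy()
    test = frame[year > val_year].copy()
    return train, val, test

toy_train, toy_val, toy_test = split_by_year(toy)

# Every row should land in exactly one of the three pieces.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

Checking that the three pieces add up to the full frame is a cheap way to catch rows with missing or unparsed dates, which would silently fall out of all three comparisons.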

In [18]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 300
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Burrito       298 non-null    object 
 1   Yelp          71 non-null     float64
 2   Google        71 non-null     float64
 3   Cost          292 non-null    float64
 4   Hunger        297 non-null    float64
 5   Length        175 non-null    float64
 6   Circum        174 non-null    float64
 7   Volume        174 non-null    float64
 8   Tortilla      298 non-null    float64
 9   Temp          283 non-null    float64
 10  Meat          288 non-null    float64
 11  Fillings      297 non-null    float64
 12  Meat:filling  292 non-null    float64
 13  Uniformity    296 non-null    float64
 14  Salsa         278 non-null    float64
 15  Synergy       296 non-null    float64
 16  Wrap          296 non-null    float64
dtypes: float64(16), object(1)
memory usage: 41.9+ KB


# Pipeline 

In [19]:
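column_trans = make_column_transformer(
    # One-hot encode the burrito type; ignore categories unseen during fit
    # so validation/test rows with a new category do not raise an error.
    (OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    remainder='passthrough'  # pass the numeric columns through unchanged
)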

In [20]:
log_reg = LogisticRegression(max_iter=500)
imputer = SimpleImputer()

In [21]:
pipe = make_pipeline(column_trans,imputer,log_reg)

In [22]:
pipe.fit(X_train,y_train);

In [23]:
print('Train accuracy:', pipe.score(X_train, y_train))
print('Val accuracy:', pipe.score(X_val, y_val))
print('Test accuracy:', pipe.score(X_test,y_test))

Train accuracy: 0.8859060402684564
Val accuracy: 0.8352941176470589
Test accuracy: 0.7894736842105263
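
One of the stretch goals is feature scaling. A minimal sketch of how a scaler could slot into a pipeline like this one, between the imputer and the model, is shown below. It runs on synthetic data from `make_classification` with a random split (the synthetic data has no dates), so the names `X_demo`, `scaled_pipe`, and the printed accuracy are illustrative only and say nothing about the burrito results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the burrito reviews.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[::10, 0] = np.nan  # add a few missing values so the imputer has work to do

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Impute first (StandardScaler would otherwise carry NaNs through),
# then scale, then fit the classifier.
scaled_pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression(max_iter=500),
)
scaled_pipe.fit(X_tr, y_tr)
val_acc = scaled_pipe.score(X_va, y_va)

# Inspect what the model actually sees: all steps except the classifier.
train_scaled = scaled_pipe[:-1].transform(X_tr)
print("Column means after scaling:", train_scaled.mean(axis=0).round(6))
print("Val accuracy:", val_acc)
```

Because the scaler lives inside the pipeline, its mean and variance are learned from the training rows only and then reused on validation rows, which avoids leaking validation statistics into training.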


In [24]:
# Majority-class baseline: the accuracy of always predicting the most common class.
print(f"Baseline: {y_train.value_counts(normalize=True).max()}")

Baseline: 0.5906040268456376
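
The baseline above is measured on the training labels. To compare it directly with a validation accuracy, the same majority-class guess can be scored on validation labels instead. The sketch below does this two ways on made-up labels (`y_tr_demo` and `y_va_demo` are hypothetical): by hand, and with scikit-learn's `DummyClassifier`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration; class 0 is the training majority.
y_tr_demo = pd.Series([0, 0, 0, 1, 1])
y_va_demo = pd.Series([0, 1, 1, 0])

# By hand: predict the training majority class for every validation row.
majority = y_tr_demo.mode()[0]
manual_baseline = accuracy_score(y_va_demo, [majority] * len(y_va_demo))

# Same idea with DummyClassifier; it ignores the features, so
# placeholder columns of zeros are enough.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(np.zeros((len(y_tr_demo), 1)), y_tr_demo)
dummy_baseline = dummy.score(np.zeros((len(y_va_demo), 1)), y_va_demo)

print("Manual baseline:", manual_baseline)
print("Dummy baseline:", dummy_baseline)
```

Both numbers agree by construction; the `DummyClassifier` version is convenient because it exposes the same `fit`/`score` interface as the real model.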


# Select K Best

In [25]:
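# Keep the 5 features with the highest ANOVA F-scores
# (f_classif is SelectKBest's default score function).
selector = SelectKBest(k=5)
lr = LogisticRegression()
select_pipe = make_pipeline(column_trans, imputer, selector, lr)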

In [26]:
select_pipe.fit(X_train,y_train);

In [27]:
print('Train accuracy:', select_pipe.score(X_train, y_train))
print('Val accuracy:', select_pipe.score(X_val, y_val))
print('Test accuracy:', select_pipe.score(X_test,y_test))

Train accuracy: 0.8791946308724832
Val accuracy: 0.8705882352941177
Test accuracy: 0.7894736842105263
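
Another stretch goal is to get the model's coefficients. A rough sketch of how the selected features and their coefficients could be read back out of a fitted `SelectKBest` pipeline follows. It uses synthetic, numeric-only data with hypothetical column names (`feat_0` ... `feat_5`) rather than the burrito pipeline, whose one-hot step changes the column order and would need its own name mapping.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with made-up column names.
X_arr, y_sel = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_sel = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

demo_pipe = make_pipeline(
    SimpleImputer(),
    SelectKBest(k=3),
    LogisticRegression(max_iter=500),
)
demo_pipe.fit(X_sel, y_sel)

# make_pipeline names each step after its lowercased class name.
mask = demo_pipe.named_steps["selectkbest"].get_support()
selected = X_sel.columns[mask]

# For a binary problem coef_ has shape (1, n_selected_features).
coefs = pd.Series(
    demo_pipe.named_steps["logisticregression"].coef_[0],
    index=selected,
)
print(coefs.sort_values())
```

From here, `coefs.sort_values().plot.barh()` would give a quick coefficient plot if matplotlib is installed. Note that coefficients are only comparable across features when the features share a scale.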
