
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.
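
The year-based split in the checklist above could be sketched roughly like this. The `reviews` frame and its values are made up for illustration; only the month/day/year `Date` format mirrors the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the burrito reviews (the real file has many more columns).
reviews = pd.DataFrame({
    "Date": ["1/18/2016", "5/2/2015", "3/4/2017", "11/30/2017", "1/2/2018", "6/9/2019"],
    "overall": [3.5, 4.2, 4.0, 2.9, 4.5, 3.0],
})

# Parse the dates once, then split on the calendar year.
reviews["Date"] = pd.to_datetime(reviews["Date"], format="%m/%d/%Y")
year = reviews["Date"].dt.year

train = reviews[year <= 2016]     # 2016 and earlier
validate = reviews[year == 2017]  # 2017 only
test = reviews[year >= 2018]      # 2018 and later

print(len(train), len(validate), len(test))
```

Splitting by time rather than at random keeps later reviews out of the training data, which is closer to how the model would be used on future burritos.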


## Stretch Goals

- [x] Add your own stretch goal(s)! **Ingredient, Type, Cat Rating**
- [x] Make exploratory visualizations.
- [x] Do one-hot encoding.
- [x] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [x] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california', case=False)
asada = df['Burrito'].str.contains('asada', case=False)
# Group the alternation so it matches "surf & turf" or "surf and turf" as a whole phrase
surf = df['Burrito'].str.contains(r'surf (?:&|and) turf', case=False)
carnitas = df['Burrito'].str.contains('carnitas', case=False)

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Carne Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
df['Burrito'].value_counts()

California     179
Other          156
Carne Asada     43
Carnitas        25
Surf & Turf     18
Name: Burrito, dtype: int64

### Data Cleaning

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall', 'Yelp', 'Google', 'Hunger', 'Unreliable'])

In [0]:
# Remove more features that don't seem interesting
df = df.drop(columns=['Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Cost', 'NonSD'])
df = df.drop(columns=['Lobster', 'Queso', 'Zucchini', 'Carrots'])

In [0]:
df_ratings = df.copy() # will use this later

df = df.drop(columns=[
    'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 
    'Uniformity', 'Salsa', 'Synergy', 'Wrap',
])

In [0]:
def make_binary(item):
    """ custom encoding """
    return 1 if item in ('x', 'X') else 0

In [0]:
fillings = [
    'Chips', 'Beef', 'Pico', 'Guac', 'Cheese',
    'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice',
    'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Cabbage',
    'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple',
    'Ham', 'Chile relleno', 'Nopales', 'Egg',
    'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn',
]
for col in fillings:
    df[col] = df[col].apply(make_binary)

In [0]:
df['Great'] = df['Great'].apply(int)

In [0]:
df.head(10)

Unnamed: 0,Burrito,Date,Chips,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Great
0,California,1/18/2016,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,California,1/24/2016,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Carnitas,1/24/2016,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Carne Asada,1/24/2016,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,California,1/27/2016,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
5,Other,1/28/2016,0,0,0,1,1,0,1,0,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,California,1/30/2016,0,1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Carnitas,1/30/2016,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,California,2/1/2016,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,Carne Asada,2/6/2016,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
df = df.dropna()

### Split: Train, Validate, Test

In [0]:
burrito = df.copy()
burrito = burrito.drop(columns=['Burrito'])
burrito['Date'] = burrito['Date'].apply(lambda date: int(date.split('/')[2]))

burrito_train = burrito[burrito['Date'] < 2017].drop(columns=['Date'])
burrito_validate = burrito[burrito['Date'] == 2017].drop(columns=['Date'])
burrito_test = burrito[burrito['Date'] > 2017].drop(columns=['Date'])

target_train = burrito_train['Great']
target_validate = burrito_validate['Great']
target_test = burrito_test['Great']

burrito_train = burrito_train.drop(columns=['Great'])
burrito_validate = burrito_validate.drop(columns=['Great'])
burrito_test = burrito_test.drop(columns=['Great'])

burrito_train.head()

Unnamed: 0,Chips,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Egg,Mushroom,Bacon,Sushi,Avocado,Corn
0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Reality Check

In [0]:
print(target_train.shape)
print(target_validate.shape)
print(target_test.shape)
print()
print(burrito_train.shape)
print(burrito_validate.shape)
print(burrito_test.shape)

(298,)
(85,)
(38,)

(298, 32)
(85, 32)
(38, 32)


### Baseline: Overall

In [0]:
# mode() returns a Series; take its first entry as the majority class
obv = burrito['Great'].mode()[0]
if not obv:
    print("Obvious Choice: Not Great")
else:
    print("Obvious Choice: Great")

Obvious Choice: Not Great


In [0]:
greatness = burrito['Great'].mean()
print(f"Percentage Greatness: {greatness*100:.2f}%")

Percentage Greatness: 43.23%


In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error

model = LogisticRegression(solver='lbfgs')
model.fit(burrito_train, target_train)
target_v_pred = model.predict(burrito_validate)
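# MAE on 0/1 labels is the misclassification rate (1 - accuracy).
# Baseline: always predict the majority class (0, Not Great), scored on the training labels.
mae = mean_absolute_error(target_train, [0] * len(target_train))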
print(f"Baseline MAE: {mae:.5f}")

mae = mean_absolute_error(target_validate, target_v_pred)
print(f"Validation MAE: {mae:.5f}")

target_t_pred = model.predict(burrito_test)
mae = mean_absolute_error(target_test, target_t_pred)
print(f"Test MAE: {mae:.5f}")

print("\n*lower is better")

Baseline MAE: 0.40940
Validation MAE: 0.45882
Test MAE: 0.42105

*lower is better


## Ingredients Summary
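The validation MAE (0.459) is higher than the majority-class baseline MAE (0.409, measured on the training labels), so apparently an ingredient list alone is not enough to determine whether a burrito is Great. Obviously!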
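
Because the target is 0/1, the MAE reported above is the same thing as the error rate, so accuracy is just one minus it. A minimal sketch of that relationship, with the majority-class baseline scored on the validation labels instead of the training labels; the label arrays here are hypothetical, not taken from the burrito data:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical labels: 1 = Great, 0 = Not Great.
y_train = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_val = np.array([1, 0, 0, 1, 0, 1])
y_val_pred = np.array([1, 0, 1, 1, 0, 0])  # stand-in for model.predict(...)

# Learn the majority class from the training labels only...
majority = int(np.bincount(y_train).argmax())
# ...then score that constant prediction on the validation labels.
baseline_pred = np.full_like(y_val, majority)

baseline_acc = accuracy_score(y_val, baseline_pred)
model_acc = accuracy_score(y_val, y_val_pred)
model_mae = mean_absolute_error(y_val, y_val_pred)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(f"Model MAE:         {model_mae:.3f}  (equals 1 - accuracy)")
```

Scoring the baseline and the model on the same split makes the two numbers directly comparable.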

# Take 2: Just Type

In [0]:
burrito = df.copy()
burrito = burrito.drop(columns=fillings)

burrito['Date'] = burrito['Date'].apply(lambda date: int(date.split('/')[2]))

burrito_train = burrito[burrito['Date'] < 2017].drop(columns=['Date'])
burrito_validate = burrito[burrito['Date'] == 2017].drop(columns=['Date'])
burrito_test = burrito[burrito['Date'] > 2017].drop(columns=['Date'])

target_train = burrito_train['Great']
target_validate = burrito_validate['Great']
target_test = burrito_test['Great']

burrito_train = burrito_train.drop(columns=['Great'])
burrito_validate = burrito_validate.drop(columns=['Great'])
burrito_test = burrito_test.drop(columns=['Great'])

burrito_train.head()

Unnamed: 0,Burrito
0,California
1,California
2,Carnitas
3,Carne Asada
4,California


### One-hot Encoding

In [0]:
import category_encoders as ce

encoder = ce.OneHotEncoder(use_cat_names=True)
burrito_train = encoder.fit_transform(burrito_train)
burrito_validate = encoder.transform(burrito_validate)
burrito_test = encoder.transform(burrito_test)
burrito_train.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Carne Asada,Burrito_Other,Burrito_Surf & Turf
0,1,0,0,0,0
1,1,0,0,0,0
2,0,1,0,0,0
3,0,0,1,0,0
4,1,0,0,0,0


### Reality Check

In [0]:
print(target_train.shape)
print(target_validate.shape)
print(target_test.shape)
print()
print(burrito_train.shape)
print(burrito_validate.shape)
print(burrito_test.shape)

(298,)
(85,)
(38,)

(298, 5)
(85, 5)
(38, 5)


In [0]:
model = LogisticRegression(solver='lbfgs')
model.fit(burrito_train, target_train)
target_v_pred = model.predict(burrito_validate)
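# MAE on 0/1 labels is the misclassification rate (1 - accuracy).
# Baseline: always predict the majority class (0, Not Great), scored on the training labels.
mae = mean_absolute_error(target_train, [0] * len(target_train))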
print(f"Baseline MAE: {mae:.5f}")

mae = mean_absolute_error(target_validate, target_v_pred)
print(f"Validation MAE: {mae:.5f}")

target_t_pred = model.predict(burrito_test)
mae = mean_absolute_error(target_test, target_t_pred)
print(f"Test MAE: {mae:.5f}")

print("\n*lower is better")

Baseline MAE: 0.40940
Validation MAE: 0.44706
Test MAE: 0.57895

*lower is better


## Type Summary
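The validation MAE (0.447) is again higher than the majority-class baseline (0.409), and the test MAE (0.579) is higher still, so apparently the type alone is not enough to determine whether a burrito is Great. Obviously!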
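
One way to see why a single categorical feature is so limiting: every burrito of the same type has identical inputs, so any classifier can only make one prediction per type. The sketch below works out that per-type prediction directly from the training rates. The `train` frame is invented for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical training rows: burrito type and 0/1 Great label.
train = pd.DataFrame({
    "Burrito": ["California", "California", "California",
                "Other", "Other", "Carnitas", "Carnitas"],
    "Great": [1, 1, 0, 0, 0, 1, 0],
})

# Share of Great burritos within each type.
rate_by_type = train.groupby("Burrito")["Great"].mean()

# A type-only classifier can do no better than picking the majority label per type.
predicted_great = rate_by_type > 0.5

print(rate_by_type)
print(predicted_great)
```

If the Great rate sits near 50% for most types, as the chart later in this notebook suggests, this kind of model has very little to work with.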

# Take 3: Ratings

In [0]:
df2 = df_ratings.drop(columns=fillings)
df2['Great'] = df2['Great'].apply(int)

df2.head()

Unnamed: 0,Burrito,Date,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,California,1/18/2016,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0
1,California,1/24/2016,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0
2,Carnitas,1/24/2016,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0
3,Carne Asada,1/24/2016,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0
4,California,1/27/2016,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1


In [0]:
burrito = df2.copy()
burrito = burrito.dropna().drop(columns=['Burrito'])
burrito['Date'] = burrito['Date'].apply(lambda date: int(date.split('/')[2]))

burrito_train = burrito[burrito['Date'] < 2017].drop(columns=['Date'])
burrito_validate = burrito[burrito['Date'] == 2017].drop(columns=['Date'])
burrito_test = burrito[burrito['Date'] > 2017].drop(columns=['Date'])

target_train = burrito_train['Great']
target_validate = burrito_validate['Great']
target_test = burrito_test['Great']

burrito_train = burrito_train.drop(columns=['Great'])
burrito_validate = burrito_validate.drop(columns=['Great'])
burrito_test = burrito_test.drop(columns=['Great'])

burrito_train.head()

Unnamed: 0,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap
0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0
1,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0
2,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0
3,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0
4,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0


### Reality Check

In [0]:
print(target_train.shape)
print(target_validate.shape)
print(target_test.shape)
print()
print(burrito_train.shape)
print(burrito_validate.shape)
print(burrito_test.shape)

(254,)
(76,)
(32,)

(254, 9)
(76, 9)
(32, 9)


### Baseline: Ratings

In [0]:
obv = burrito['Great'].mode()[0]
if not obv:
    print(f"Obvious Choice: Not Great")
else:
    print(f"Obvious Choice: Great")

Obvious Choice: Not Great


In [0]:
greatness = burrito['Great'].mean()
print(f"Percentage Greatness: {greatness*100:.2f}%")

Percentage Greatness: 41.99%


## Logistic Regression: Ratings

In [0]:
model = LogisticRegression(solver='lbfgs')
model.fit(burrito_train, target_train)
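# MAE on 0/1 labels is the misclassification rate (1 - accuracy).
# Baseline: always predict the majority class (0, Not Great), scored on the training labels.
mae = mean_absolute_error(target_train, [0] * len(target_train))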
print(f"Baseline MAE: {mae:.5f}")

target_v_pred = model.predict(burrito_validate)
mae = mean_absolute_error(target_validate, target_v_pred)
print(f"Validation MAE: {mae:.5f}")

target_t_pred = model.predict(burrito_test)
mae = mean_absolute_error(target_test, target_t_pred)
print(f"Test MAE: {mae:.5f}")

print("\n*lower is better")

Baseline MAE: 0.38976
Validation MAE: 0.14474
Test MAE: 0.28125

*lower is better


## Ratings Summary
Using the per-dimension ratings (Tortilla, Temp, Meat, Fillings, and so on) is far better than an ingredient list or the type alone: the validation MAE drops to 0.145 against a baseline of 0.390. Obviously!
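
The unchecked stretch goal, getting the coefficients, might look something like the sketch below. It uses synthetic data with two made-up rating columns (named `Synergy` and `Wrap` only for flavor) and a target that depends on the first one, so the numbers say nothing about real burritos. Scaling first is an assumption made here so the coefficient sizes are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic ratings on a 1-5 scale (illustrative only).
X = pd.DataFrame({
    "Synergy": rng.uniform(1, 5, 200),
    "Wrap": rng.uniform(1, 5, 200),
})
# Synthetic target driven by Synergy plus a little noise.
y = (X["Synergy"] + rng.normal(0, 0.3, 200) > 3.5).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0],
    index=X.columns,
).sort_values()
print(coefs)

# coefs.plot.barh() would draw them as a horizontal bar chart (needs matplotlib).
```

On the real ratings, the same few lines after `fit` would show which dimensions push a burrito toward Great and which pull it away.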

# Visualizations

#### Is one type of Burrito best?

In [0]:
import altair as alt
alt.renderers.enable('colab')

RendererRegistry.enable('colab')

In [0]:
df2['Great'] = df2['Great'].apply(lambda itm: 'Great' if itm else 'Not')

In [0]:
burritos_types = alt.Chart(
    df2, 
    title="Burritos by Greatness", 
    width=120, 
    height=300
).mark_circle(size=250).encode(
    x=alt.X('Great:O', title=""),
    y=alt.Y('Burrito:N', title="", sort='-color'),
    color=alt.Color(
        'count(Burrito):Q', 
        title='Count', 
        scale=alt.Scale(scheme='plasma')
    )
)
print()
burritos_types




## Final Thoughts

Most of the data seems evenly split within each category, save one. The 'Other' category has significantly more 'Not Great' Burritos, as indicated in the graph above. This makes sense - the more popular burritos... are, wait for it... more popular. Maybe the people of San Diego like to try new and rare things, but in the end, they still prefer the old favorites.

Further study could include taking a closer look at the 'Other' category and trying to see if there is more to the story. Are there certain ingredients that give a high probability of scoring one way or the other? What could that mean to a Burrito chef? Also, why do Californians like fries in Burritos?
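
A first pass at the ingredient question could compare the Great rate of burritos with and without each ingredient. This is only a sketch: the `other` frame, its two ingredient columns, and its values are hypothetical stand-ins for the 'Other' rows.

```python
import pandas as pd

# Hypothetical 'Other' burritos: 1 = ingredient present, Great is the 0/1 target.
other = pd.DataFrame({
    "Guac": [1, 1, 0, 0, 1, 0],
    "Fries": [0, 1, 1, 0, 0, 1],
    "Great": [1, 1, 0, 0, 1, 0],
})

ingredients = ["Guac", "Fries"]
diff = {}
for col in ingredients:
    with_rate = other.loc[other[col] == 1, "Great"].mean()
    without_rate = other.loc[other[col] == 0, "Great"].mean()
    # Positive means burritos with the ingredient are Great more often.
    diff[col] = with_rate - without_rate

diff = pd.Series(diff).sort_values(ascending=False)
print(diff)
```

With only a few hundred reviews, rare ingredients will have very noisy rates, so differences like these are hints rather than conclusions.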

## Random Forest Pipeline


In [0]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [0]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    StandardScaler(),
    RandomForestClassifier(n_jobs=-1, random_state=42)
)
pipeline.fit(burrito_train, target_train)
# This scores the held-out test split, so label it as test accuracy.
print(f"Test: {100*pipeline.score(burrito_test, target_test):.2f}%")

Test: 78.12%
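
To keep the validation and test numbers apart, the pipeline can be scored on each split separately and labeled accordingly. The sketch below uses synthetic data and hypothetical split names (`X_val`, `X_test`, and so on) in place of the burrito frames.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic ratings and target (illustrative only).
X = rng.uniform(1, 5, size=(300, 3))
y = (X[:, 0] + X[:, 1] > 6).astype(int)

# Fixed, ordered split standing in for train / validate / test by year.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_test, y_test = X[250:], y[250:]

pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(X_train, y_train)

# Use the validation score while iterating on the model...
val_acc = pipe.score(X_val, y_val)
print(f"Validation accuracy: {100 * val_acc:.2f}%")

# ...and look at the test score once, at the end.
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {100 * test_acc:.2f}%")
```

Tree-based models are insensitive to feature scaling, so the `StandardScaler` step matters more if the final estimator is later swapped for a linear model.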
