<a href="https://colab.research.google.com/github/jacob-torres/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Jacob_Torres_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error
from category_encoders.one_hot import OneHotEncoder

In [None]:
# Load data downloaded from https://srcole.github.io/100burritos/
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

## EDA and Feature Engineering

In [None]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [None]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
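
The same consolidation can be sketched with `np.select`, which makes the precedence between overlapping patterns explicit. The burrito names below are made up for illustration. The condition order is an assumption chosen to mirror the cell above, where later assignments overwrite earlier ones (so a name containing both 'california' and 'asada' ends up as 'Asada').

```python
import numpy as np
import pandas as pd

# Hypothetical raw burrito names, for illustration only
raw = pd.Series(['California', 'Carne asada', 'Surf & Turf',
                 'Carnitas', 'Al pastor', 'California Asada'])
name = raw.str.lower()

# np.select takes the first matching condition, so list the patterns
# from highest to lowest priority.
conditions = [
    name.str.contains('carnitas'),
    name.str.contains('surf'),
    name.str.contains('asada'),
    name.str.contains('california'),
]
labels = ['Carnitas', 'Surf & Turf', 'Asada', 'California']

# Anything that matches no pattern falls back to 'Other'
burrito = pd.Series(np.select(conditions, labels, default='Other'),
                    index=raw.index)
print(burrito.tolist())
```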

In [None]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [None]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [None]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Date            421 non-null    object 
 2   Yelp            87 non-null     float64
 3   Google          87 non-null     float64
 4   Chips           26 non-null     object 
 5   Cost            414 non-null    float64
 6   Hunger          418 non-null    float64
 7   Mass (g)        22 non-null     float64
 8   Density (g/mL)  22 non-null     float64
 9   Length          283 non-null    float64
 10  Circum          281 non-null    float64
 11  Volume          281 non-null    float64
 12  Tortilla        421 non-null    float64
 13  Temp            401 non-null    float64
 14  Meat            407 non-null    float64
 15  Fillings        418 non-null    float64
 16  Meat:filling    412 non-null    float64
 17  Uniformity      419 non-null    flo

In [None]:
# Convert date column to datetime
# (no copy needed; infer_datetime_format is deprecated in pandas 2.x)
df['Date'] = pd.to_datetime(df['Date'])
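
If the date strings are known to follow one layout, passing it explicitly is a stricter alternative. The sketch below assumes month/day/year strings like the ones shown in `df.head()`. The sample values, including the malformed one, are hypothetical.

```python
import pandas as pd

# Hypothetical date strings in month/day/year order
dates = pd.Series(['1/18/2016', '1/24/2016', '12/5/2017', 'not a date'])

# An explicit format avoids ambiguous day/month guessing;
# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a cheap check that no reviews were silently lost before splitting by year.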

In [None]:
# Drop columns where at least half of the values are null
max_nulls = df.shape[0] / 2
null_counts = df.isnull().sum()
null_cols = null_counts[null_counts >= max_nulls].index

df = df.drop(columns=null_cols)
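
pandas can express the same rule through the `thresh` argument of `DataFrame.dropna`, which states how many non-null values a column needs in order to be kept. A minimal sketch on a toy frame (the column names and values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is half null and 'c' is mostly null
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# Keep a column only if more than half of its values are present,
# i.e. drop it when nulls >= half the rows.
min_non_null = len(toy) // 2 + 1
trimmed = toy.dropna(axis=1, thresh=min_non_null)
print(trimmed.columns.tolist())
```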

In [None]:
# Encode great column from bool to int (0 or 1)
df['Great'] = df['Great'].astype('int')

In [None]:
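# Encode burrito column to int.
# The keys must match the labels assigned above, including
# 'Surf & Turf' and 'Other'; any unmatched label becomes NaN.
df['Burrito'] = df['Burrito'].map(
    {'California': 0, 'Asada': 1, 'Surf & Turf': 2, 'Carnitas': 3, 'Other': 4}
)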

In [None]:
df.head()

Unnamed: 0,Burrito,Date,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,0.0,2016-01-18,6.49,3.0,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0
1,0.0,2016-01-24,5.45,3.5,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0
2,3.0,2016-01-24,4.85,1.5,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0
3,1.0,2016-01-24,5.25,2.0,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0
4,0.0,2016-01-27,6.59,4.0,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1


In [None]:
df.describe()

Unnamed: 0,Burrito,Cost,Hunger,Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
count,237.0,414.0,418.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,421.0
mean,0.49789,7.067343,3.495335,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,0.432304
std,0.94162,1.506742,0.812069,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,0.495985
min,0.0,2.99,0.5,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,0.0
25%,0.0,6.25,3.0,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,0.0
50%,0.0,6.99,3.5,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,0.0
75%,1.0,7.88,4.0,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,1.0
max,3.0,25.0,5.0,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0


## Training, Validating, and Testing

In [None]:
# Split data into train, validate, and test sets by review year:
# train on 2016 and earlier, validate on 2017, test on 2018 and later
train_mask = df['Date'].dt.year <= 2016
val_mask = df['Date'].dt.year == 2017
test_mask = df['Date'].dt.year >= 2018

train_df = df[train_mask]
val_df = df[val_mask]
test_df = df[test_mask]

# Define the feature columns and the target column
features = ['Burrito', 'Cost', 'Hunger', 'Length',
            'Circum', 'Volume', 'Tortilla', 'Temp',
            'Meat', 'Fillings', 'Meat:filling', 'Uniformity',
            'Salsa', 'Synergy', 'Wrap']
target = 'Great'

X_train = train_df[features]
y_train = train_df[target]
X_val = val_df[features]
y_val = val_df[target]
X_test = test_df[features]
y_test = test_df[target]

print(f"""Train/validation/test proportions
  Training data: {len(train_df) / len(df):.1%}
  Validation data: {len(val_df) / len(df):.1%}
  Testing data: {len(test_df) / len(df):.1%}
""")

Train/validation/test proportions
  Training data: 70.8%
  Validation data: 20.2%
  Testing data: 9.0%
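
The assignment asks for a chronological split: 2016 and earlier for training, 2017 for validation, 2018 and later for testing. A minimal sketch of that rule as a reusable function follows. `split_by_year` is a hypothetical helper, not part of pandas, and it assumes the validation year directly follows the last training year.

```python
import pandas as pd

def split_by_year(frame, date_col, last_train_year, val_year):
    """Hypothetical helper: chronological train/validation/test split."""
    year = frame[date_col].dt.year
    train = frame[year <= last_train_year]
    val = frame[year == val_year]
    test = frame[year > val_year]
    return train, val, test

# Toy data, for illustration only
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-03-10', '2017-07-04',
                            '2018-01-15', '2019-05-20']),
    'Great': [0, 1, 0, 1, 1],
})
train, val, test = split_by_year(toy, 'Date', last_train_year=2016, val_year=2017)

# Every row should land in exactly one split
assert len(train) + len(val) + len(test) == len(toy)
print(len(train), len(val), len(test))
```

Checking that the three pieces add up to the full frame catches rows that fall outside every mask.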



In [None]:
# Class counts for the Great column; the share of the
# majority class (0, not Great) is the baseline accuracy
df['Great'].value_counts()

0    239
1    182
Name: Great, dtype: int64

In [None]:
print(f"The baseline is {df['Great'].value_counts()[0] / len(df['Great'])}")

The baseline is 0.5676959619952494
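
A majority-class baseline is usually learned from the training labels and then scored on the validation labels, so it is measured the same way as the model it is compared against. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the `*_toy` arrays stand in for `y_train` and `y_val` and are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train and y_val
y_train_toy = np.array([0, 0, 0, 1, 1])
y_val_toy = np.array([0, 1, 1, 0])

# DummyClassifier ignores the features, so zeros are enough here
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# Always predict the most frequent training class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

val_baseline = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(val_baseline)
```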


In [None]:
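# Train a logistic regression model on the features.
# Several features contain NaNs, which LogisticRegression rejects,
# so impute missing values (column mean) and scale before fitting.
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

y_train = np.ravel(y_train)
y_val = np.ravel(y_val)

model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression()
)
model.fit(X_train, y_train)

# Score the classifier with accuracy rather than r^2 or MAE,
# which are regression metrics
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))

print(f"""
Training accuracy: {train_acc}
Validation accuracy: {val_acc}
""")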
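
One of the stretch goals is to get the model's coefficients. A minimal sketch on synthetic data, assuming the model is a scikit-learn pipeline that imputes, scales, and then fits a logistic regression. The feature names are borrowed from the burrito data, but the values are randomly generated and only 'Synergy' is made to drive the label.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the burrito features: 'Synergy' drives the label,
# 'Cost' is noise, and a few 'Cost' values are missing.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'Synergy': rng.normal(size=200),
    'Cost': rng.normal(size=200),
})
y_toy = (X_toy['Synergy'] > 0).astype(int)
X_toy.iloc[::25, 1] = np.nan

pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
pipe.fit(X_toy, y_toy)

# The final pipeline step holds the fitted coefficients (one row for a
# binary problem), in the same order as the input columns.
log_reg = pipe.named_steps['logisticregression']
coefficients = pd.Series(log_reg.coef_[0], index=X_toy.columns)
print(coefficients.sort_values())
```

Because the features are standardized inside the pipeline, the coefficient magnitudes are roughly comparable across columns, which makes a sorted bar plot of `coefficients` easier to read.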