<a href="https://colab.research.google.com/github/ttped/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Trevor_Pedersen_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Date            421 non-null    object 
 2   Yelp            87 non-null     float64
 3   Google          87 non-null     float64
 4   Chips           26 non-null     object 
 5   Cost            414 non-null    float64
 6   Hunger          418 non-null    float64
 7   Mass (g)        22 non-null     float64
 8   Density (g/mL)  22 non-null     float64
 9   Length          283 non-null    float64
 10  Circum          281 non-null    float64
 11  Volume          281 non-null    float64
 12  Tortilla        421 non-null    float64
 13  Temp            401 non-null    float64
 14  Meat            407 non-null    float64
 15  Fillings        418 non-null    float64
 16  Meat:filling    412 non-null    float64
 17  Uniformity      419 non-null    flo

In [9]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [14]:
#Change index to date and cast to date
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').sort_index()
df.head()

Unnamed: 0_level_0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1
2011-05-16,Other,,,,8.00,4.0,,,,,,3.0,,2.0,3.0,2.0,3.0,2.0,3.0,2.0,x,,x,x,x,,,x,,,,,,,x,,,,,,,,,,,,,,,,,,,,,,,False
2015-04-20,Other,,,,,4.0,,,,,,5.0,,5.0,5.0,5.0,4.0,5.0,5.0,5.0,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
2016-01-18,California,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-24,Asada,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2016-01-24,California,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-08-27,Asada,,,,6.75,3.0,,,19.00,25.0,0.94,3.0,4.0,4.0,3.0,4.0,4.0,3.0,3.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
2019-08-27,Other,,,,5.50,2.0,,,19.00,23.0,0.80,4.5,5.0,5.0,3.5,4.0,4.5,4.0,4.9,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
2019-08-27,Other,,,,6.00,3.0,,,17.50,21.5,0.64,4.0,4.0,4.5,4.0,3.0,3.0,4.5,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
2019-08-27,Other,,,,5.50,3.5,,,17.00,21.3,0.61,3.0,5.0,4.3,4.0,4.9,3.8,3.0,4.5,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [179]:
#Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
#Makes categorical variables into 1 or 0
from category_encoders import OneHotEncoder
#Fills NaN values
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import numpy as np


In [182]:
#Just testing things here, plz ignore
def model(df, target, features_arr):
  mask = df.index.map(lambda x: True if (x.year == 2016) else False)
  mask_validate = df.index.map(lambda x: True if x.year == 2017 else False)
  mask_test = df.index.map(lambda x: True if x.year >= 2018 else False)

  X = df[features_arr]
  y = df[target]

  X_train = X[mask]
  y_train = y[mask]

  X_val = X[mask_validate]
  y_val = y[mask_validate]

  #X_test = X[mask_test]
  #y_test = y[mask_test]

  #The fast way to do the above
  #X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  #assert X_train.shape[0] + X_val.shape[0] == X.shape[0]

  #Baseline accuracy is determined by the majority class
  print('Baseline accuracy:', y_train.value_counts(normalize=True).max())

  #Creates a model that will automatically apply the 3 transformers below when calling .transform and .fit on our data 
  #Purpose is to remove the tedious proccess of assigning multiple variables I.E. X_train, XT_train, XTT_train, and so on
  model = make_pipeline(
      OneHotEncoder(use_cat_names=True), #Transformer
      SimpleImputer(), #Transformer
      LogisticRegression() #This is the predictor, you can only have 1 predictor, and the predictor must be at the end
  )

  transformer = OneHotEncoder(use_cat_names=True)
  transformer.fit(X_train)
  XT_train = transformer.transform(X_train)

  transformer_2 = SimpleImputer()
  transformer_2.fit(XT_train)
  XTT_train = transformer_2.transform(XT_train)

  model = LogisticRegression()
  model.fit(XTT_train, y_train)

  #model.fit(X_train, y_train)
 
  #XT_train = model.fit_transform(X_train, y_train)
  #XT_val = model.fit_transform(X_val, y_val)
  
  #model.fit(XT_train, y_train)
  #model.fit(X_train, y_train)

  y_pred_train = model.predict(XTT_train)
  #y_pred = model.predict(XT_val)

  print("Accuracy score:", accuracy_score(y_train, y_pred_train))

  #Get mean absolute error for the test set.
  #print("Training Mean absolute error:", mean_absolute_error(y_train, y_pred_train))
  #print("Validation Mean absolute error:", mean_absolute_error(y_val, y_pred), '\n')

  #print("Traning RMSE:", mean_squared_error(y_train, y_pred_train, squared=False))
  #print("Validation RMSE:", mean_squared_error(y_val, y_pred, squared=False), '\n')

  #print("Training R-squared score is:", model.score(XT_train, y_train))
  #print("Validation R-squared is:", model.score(XT_val, y_val), '\n')

In [183]:
temp = df[(df['Cost'] != np.NaN) & (df['Hunger'] != np.NaN)]
temp = temp[~temp['Great'].isna()]
temp = temp[~temp['Cost'].isna()]
temp = temp[~temp['Hunger'].isna()]
model(temp, 'Great', ['Cost'])

Baseline accuracy: 0.596551724137931
Accuracy score: 0.593103448275862


In [196]:
#Fixed model
def model2(df, target, features_arr):
  mask = df.index.map(lambda x: True if (x.year == 2016) else False)
  mask_validate = df.index.map(lambda x: True if x.year == 2017 else False)
  mask_test = df.index.map(lambda x: True if x.year >= 2018 else False)

  X = df[features_arr]
  y = df[target]

  X_train = X[mask]
  y_train = y[mask]

  X_val = X[mask_validate]
  y_val = y[mask_validate]

  X_test = X[mask_test]
  y_test = y[mask_test]

  #The fast way to do the above
  #X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
  #assert X_train.shape[0] + X_val.shape[0] == X.shape[0]

  #Baseline accuracy is determined by the majority class
  print('Baseline accuracy:', y_train.value_counts(normalize=True).max())

  #Creates a model that will automatically apply the 3 transformers below when calling .transform and .fit on our data 
  #Purpose is to remove the tedious proccess of assigning multiple variables I.E. X_train, XT_train, XTT_train, and so on
  model = make_pipeline(
      OneHotEncoder(use_cat_names=True), #Transformer
      SimpleImputer(), #Transformer
      LogisticRegression() #This is the predictor, you can only have 1 predictor, and the predictor must be at the end
  )
 
  model.fit(X_train, y_train)
  model.fit(X_val, y_val)

  y_pred_train = model.predict(X_train)
  y_pred = model.predict(X_val)
  y_pred_test = model.predict(X_test)

  print("Training accuracy score:", accuracy_score(y_train, y_pred_train))
  print("Validation accuracy score:", accuracy_score(y_val, y_pred))
  print("Test accuracy score:", accuracy_score(y_test, y_pred_test))


In [198]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421 entries, 2011-05-16 to 2026-04-25
Data columns (total 58 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Burrito         421 non-null    object 
 1   Yelp            87 non-null     float64
 2   Google          87 non-null     float64
 3   Chips           26 non-null     object 
 4   Cost            414 non-null    float64
 5   Hunger          418 non-null    float64
 6   Mass (g)        22 non-null     float64
 7   Density (g/mL)  22 non-null     float64
 8   Length          283 non-null    float64
 9   Circum          281 non-null    float64
 10  Volume          281 non-null    float64
 11  Tortilla        421 non-null    float64
 12  Temp            401 non-null    float64
 13  Meat            407 non-null    float64
 14  Fillings        418 non-null    float64
 15  Meat:filling    412 non-null    float64
 16  Uniformity      419 non-null    float64
 17  Salsa           

In [203]:
temp = df[(df['Cost'] != np.NaN) & (df['Hunger'] != np.NaN)]
temp = temp[~temp['Great'].isna()]
temp = temp[~temp['Cost'].isna()]
temp = temp[~temp['Hunger'].isna()]

features = df.columns.map(lambda x: x if x not in ['Great'] else np.NaN)
features = features.dropna()

model2(temp, 'Great', features)

Baseline accuracy: 0.596551724137931


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.803448275862069
Training accuracy score: 0.803448275862069
Validation accuracy score: 0.975609756097561
Test accuracy score: 0.7894736842105263
