<a href="https://colab.research.google.com/github/ssbyrne89/DS-Unit-2-Linear-Models/blob/master/DSPT5_HW_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
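
As a rough illustration of the one-hot encoding, feature scaling, and pipeline stretch goals, here is a minimal sketch. The `toy` DataFrame and its values are hypothetical stand-ins for the burrito data, and it uses scikit-learn's own `OneHotEncoder` rather than `category_encoders`, so treat it as one possible arrangement rather than the expected solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column, one numeric column, one target.
toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other', 'California', 'Asada', 'Other'] * 5,
    'Cost': np.linspace(5.0, 9.0, 30),
    'Great': [True, False, False, True, True, False] * 5,
})
X, y = toy[['Burrito', 'Cost']], toy['Great']

# One-hot encode the categorical column and standardize the numeric one.
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    ('scale', StandardScaler(), ['Cost']),
])

# Chaining the steps means fit() and score() apply them in the same order every time.
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(X, y)
train_accuracy = pipeline.score(X, y)
```

Because the preprocessing lives inside the pipeline, calling `pipeline.score` on a validation or test frame reuses the encoder and scaler fitted on the training data.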

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall', 'Queso'])

In [0]:
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

In [255]:
df['Date'].describe()

count                     421
unique                    169
top       2016-08-30 00:00:00
freq                       29
first     2011-05-16 00:00:00
last      2026-04-25 00:00:00
Name: Date, dtype: object

In [0]:
# These columns use 'x' or 'X' as a marker and are otherwise empty;
# convert them to 1/0 flags.
flag_cols = ['Chips', 'Unreliable', 'NonSD', 'Beef', 'Pico',
             'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp',
             'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
             'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
             'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg',
             'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']
df[flag_cols] = df[flag_cols].fillna(0).replace('x', 1).replace('X', 1)

In [0]:
df = df.fillna(0, axis=1)

In [258]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,0,6.49,3.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
1,California,2016-01-24,3.5,3.3,0,5.45,3.5,0.0,0.0,0.0,0.0,0.0,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
2,Carnitas,2016-01-24,0.0,0.0,0,4.85,1.5,0.0,0.0,0.0,0.0,0.0,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
3,Asada,2016-01-24,0.0,0.0,0,5.25,2.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,False
4,California,2016-01-27,4.0,3.8,1,6.59,4.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,True


# Do train/validate/test split.
# Train on reviews from 2016 & earlier.
# Validate on 2017. Test on 2018 & later.

In [0]:
train = df[df.Date.dt.year <= 2016]
val = df[df.Date.dt.year == 2017]
test = df[df.Date.dt.year >= 2018]

In [260]:
len(train['Date']), len(val['Date']), len(test['Date'])

(298, 85, 38)

# Begin with baselines for classification.

In [0]:
from pandas_profiling import ProfileReport


In [0]:
#ProfileReport(train)

In [263]:
# Determine the majority class
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
majority_class = y_train.mode()[0]
y_pred_train = [majority_class]*len(y_train)

In [265]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred_train)

0.5906040268456376

In [266]:
y_val = val[target]
y_pred = [majority_class]*len(y_val)
accuracy_score(y_val, y_pred)

0.5529411764705883
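
The majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. This is a minimal sketch on hypothetical labels (`y_train_toy`, `y_val_toy`), not the burrito data; the placeholder feature columns are ignored by the `most_frequent` strategy.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: False is the majority class in the training split.
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, False, True, False])

# 'most_frequent' ignores the features, so a placeholder column is enough.
X_train_toy = [[0]] * len(y_train_toy)
X_val_toy = [[0]] * len(y_val_toy)

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

# Every validation row is predicted as the training majority class (False).
baseline_val_accuracy = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
```

Any model worth keeping should beat this number on the validation split.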

In [0]:
### BASELINE ABOVE

In [268]:
train.describe(exclude='number')

Unnamed: 0,Burrito,Date,Chips,Great
count,298,298,298.0,298
unique,5,110,2.0,2
top,California,2016-08-30 00:00:00,0.0,False
freq,118,29,276.0,176
first,,2011-05-16 00:00:00,,
last,,2016-12-15 00:00:00,,


In [269]:
train.describe()

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
count,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0,298.0
mean,0.928523,0.986913,6.757919,3.433725,0.0,0.0,11.645067,12.870302,0.450134,3.472315,3.519799,3.432047,3.507215,3.457819,3.373154,3.10151,3.516443,3.928523,0.090604,0.016779,0.563758,0.479866,0.466443,0.5,0.399329,0.285235,0.144295,0.067114,0.067114,0.016779,0.110738,0.107383,0.036913,0.02349,0.02349,0.003356,0.02349,0.124161,0.020134,0.050336,0.057047,0.013423,0.02349,0.003356,0.013423,0.013423,0.003356,0.013423,0.010067,0.010067,0.006711,0.043624,0.006711,0.003356
std,1.679213,1.776823,1.542546,0.873812,0.0,0.0,9.908151,10.958878,0.394904,0.797606,1.262157,1.068136,0.873048,1.143324,1.120343,1.254644,0.963831,1.207535,0.287528,0.128657,0.496752,0.500435,0.499712,0.500841,0.490584,0.452286,0.35198,0.25064,0.25064,0.128657,0.314336,0.31012,0.188865,0.151708,0.151708,0.057928,0.151708,0.33032,0.140696,0.219004,0.232322,0.11527,0.151708,0.057928,0.11527,0.11527,0.057928,0.11527,0.099997,0.099997,0.081785,0.204601,0.081785,0.057928
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,6.25,3.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,6.725,3.5,0.0,0.0,17.89,20.5,0.64,3.5,4.0,3.5,3.5,4.0,3.5,3.0,3.725,4.0,0.0,0.0,1.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,7.5,4.0,0.0,0.0,20.0,22.0,0.77,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,4.5,4.9,11.95,5.0,0.0,0.0,26.0,27.0,1.24,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
features = train[
      ['Chips', 'Unreliable', 'NonSD', 'Beef', 'Pico',
       'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp',
       'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg',
       'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']
      ]

In [271]:
train[features.astype(bool)].drop(columns=['Date'])

Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,1,,,,,,,,,,,,,,,,,,,1.0,1.0,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
297,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
298,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
299,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [272]:
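# `features` is a DataFrame of training rows, so indexing with
# `features.astype(bool)` masks cell by cell: flag cells equal to 1 are kept
# and every other cell becomes NaN.
X_train = train[features.astype(bool)].drop(columns=['Date', 'Great'])
y_train = train[target]
# The mask carries the training index, which shares no rows with `val`,
# so every cell of X_val becomes NaN.
X_val = val[features.astype(bool)].drop(columns=['Date', 'Great'])
y_val = val[target]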

X_train.shape, X_val.shape

((298, 56), (85, 56))
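
The cell above indexes `train` and `val` with a boolean DataFrame. A minimal sketch, using hypothetical toy frames, of how pandas treats that kind of key: it aligns the mask on row and column labels and blanks out every cell that is not `True`, which is different from selecting columns by name.

```python
import pandas as pd

# Hypothetical toy splits with disjoint row labels, like train and val.
train_toy = pd.DataFrame({'Cost': [6.5, 7.0], 'Beef': [1, 0]}, index=[0, 1])
val_toy = pd.DataFrame({'Cost': [8.0, 5.5], 'Beef': [1, 1]}, index=[2, 3])

# A boolean DataFrame built from the training rows only.
mask = train_toy[['Beef']].astype(bool)

# Cell-wise masking: only cells where the aligned mask is True survive.
masked_train = train_toy[mask]
# No row labels in common with the mask, so every cell is NaN.
masked_val = val_toy[mask]

# Selecting by a list of column names keeps each split's own values.
cols = ['Beef']
selected_val = val_toy[cols]
```

If the goal is to use the ingredient flags as features, selecting with a list of column names is the form that carries the validation and test values through unchanged.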

In [273]:
features

Unnamed: 0,Chips,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
297,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
298,0,0,0,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
299,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [274]:
X_train

Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,1,,,,,,,,,,,,,,,,,,,1.0,1.0,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
297,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
298,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,
299,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [275]:
train.shape, val.shape, test.shape, df.shape

((298, 58), (85, 58), (38, 58), (421, 58))

# Use scikit-learn for logistic regression.

In [0]:

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

encoder = ce.one_hot.OneHotEncoder(use_cat_names=True)
X_train_enc = encoder.fit_transform(X_train.fillna(0))
X_val_enc = encoder.transform(X_val.fillna(0))

In [277]:
X_train_enc.shape, X_val_enc.shape

((298, 56), (85, 56))

In [278]:
X_val_enc.head()

Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
301,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
302,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
303,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
304,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
305,0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [279]:
imputer = SimpleImputer()
X_train_imp = imputer.fit_transform(X_train_enc)
X_val_imp = imputer.transform(X_val_enc)
X_train_imp.shape, X_val_imp.shape

((298, 56), (85, 56))

In [0]:
X_train_imp = pd.DataFrame(X_train_imp, columns=X_train_enc.columns)
X_val_imp = pd.DataFrame(X_val_imp, columns = X_val_enc.columns)

In [0]:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train_imp)
X_val_sc = scaler.transform(X_val_imp)

In [0]:
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train_enc.columns)
X_val_sc = pd.DataFrame(X_val_sc, columns = X_val_enc.columns)

In [283]:
X_train_sc

Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,1.069526,1.0,1.226459,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
1,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,1.069526,1.0,1.226459,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
2,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,-1.136797,1.041113,1.069526,-1.0,-0.815355,-0.631713,2.435207,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
3,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,1.069526,-1.0,-0.815355,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
4,0.0,0.0,0.0,3.541956,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,-0.934994,1.0,1.226459,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,-1.136797,-0.960511,-0.934994,-1.0,-0.815355,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
294,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,-1.136797,-0.960511,-0.934994,-1.0,-0.815355,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
295,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,-0.934994,1.0,1.226459,1.582998,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026
296,0.0,0.0,0.0,-0.282330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.315644,-0.130632,0.879664,1.041113,1.069526,-1.0,-0.815355,-0.631713,-0.410643,-0.268221,-0.268221,-0.130632,-0.352886,-0.346844,-0.195774,-0.155097,-0.155097,-0.058026,-0.155097,-0.376514,-0.143346,-0.230225,-0.245964,-0.116642,-0.155097,-0.058026,-0.116642,-0.116642,-0.058026,-0.116642,-0.100844,-0.100844,-0.082199,-0.213574,-0.082199,-0.058026


In [284]:
model = LogisticRegressionCV()
model.fit(X_train_sc, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
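
The warning above suggests raising `max_iter`. A minimal sketch of doing so, on synthetic data (`X_toy` and `y_toy` are made up for illustration); the value 1000 is an arbitrary assumption, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data: the first column drives the label, plus some noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200)) > 0

# Give the lbfgs solver more room than the default of 100 iterations.
model_toy = LogisticRegressionCV(max_iter=1000)
model_toy.fit(X_toy, y_toy)

# n_iter_ records the iterations actually used for each fold and C value.
iterations_used = model_toy.n_iter_.max()
```

If `iterations_used` still reaches the limit on real data, constant or near-constant columns in the feature matrix are worth checking before raising the cap further.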

In [285]:
print(f'Validation Score: {model.score(X_val_sc, y_val)}')

Validation Score: 0.5529411764705883


In [286]:
coefs = pd.Series(model.coef_[0], X_train_sc.columns)
coefs

Burrito           0.000000
Yelp              0.000000
Google            0.000000
Chips            -0.017468
Cost              0.000000
Hunger            0.000000
Mass (g)          0.000000
Density (g/mL)    0.000000
Length            0.000000
Circum            0.000000
Volume            0.000000
Tortilla          0.000000
Temp              0.000000
Meat              0.000000
Fillings          0.000000
Meat:filling      0.000000
Uniformity        0.000000
Salsa             0.000000
Synergy           0.000000
Wrap              0.000000
Unreliable        0.143771
NonSD            -0.025262
Beef             -0.041177
Pico             -0.069633
Guac              0.019125
Cheese            0.007641
Fries             0.012302
Sour cream        0.019878
Pork             -0.021640
Chicken          -0.057567
Shrimp           -0.049140
Fish             -0.002980
Rice             -0.046371
Beans            -0.098631
Lettuce          -0.013192
Tomato           -0.015255
Bell peper       -0.048896
C

In [290]:
coefs.sort_values().plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x7fce747d7f60>

In [291]:
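# `features` is a DataFrame, not a list of column names, so `test[features]`
# raises a ValueError. Mask `test` with its own flag columns instead, which
# yields the same 56 columns the fitted encoder, imputer, and scaler expect.
test_mask = test[features.columns].astype(bool)
X_test = test[test_mask].drop(columns=['Date', 'Great'])
X_test_enc = encoder.transform(X_test.fillna(0))
X_test_imp = imputer.transform(X_test_enc)
X_test_scaled = scaler.transform(X_test_imp)
X_test_scaled

In [292]:
y_pred = model.predict(X_test_scaled)
y_pred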