<a href="https://colab.research.google.com/github/rsmecking/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Ryan_Mecking_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
import warnings
warnings.filterwarnings("ignore")

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [0]:
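# Parse the review dates (format inference is the default in recent pandas,
# where the infer_datetime_format argument is deprecated).
df['Date'] = pd.to_datetime(df['Date'])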

In [0]:
# df['Date']

In [0]:
# Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

# I tried to create variables to make the datetime split easier, but ended up doing it the hard way for 'val'.
time_2017 = pd.to_datetime('2017-01-01')
time_2018 = pd.to_datetime('2018-01-01')

train = df[df.Date < time_2017]
val = df[(df['Date'] >= time_2017) & (df['Date'] < time_2018)]
test  = df[df.Date >= time_2018]
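
The same date-based split can also be written against the year component of the `Date` column, which avoids building the cutoff timestamps by hand. A minimal sketch, assuming a small made-up frame (`toy` and its values are illustrative, not the burrito data):

```python
import pandas as pd

# Hypothetical toy data standing in for the burrito reviews.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-08-15', '2018-01-01', '2019-03-02']),
    'Great': [True, False, True, False, True, False],
})

year = toy['Date'].dt.year

# Train on 2016 & earlier, validate on 2017, test on 2018 & later.
toy_train = toy[year <= 2016]
toy_val = toy[year == 2017]
toy_test = toy[year >= 2018]

print(len(toy_train), len(toy_val), len(toy_test))
```

As with the timestamp comparisons, rows whose date is missing would fall into none of the three splits.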

In [11]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [12]:
train.shape, val.shape, test.shape

((298, 59), (85, 59), (38, 59))

In [0]:
# df.dtypes

In [0]:
# target = 'Great'
# features = ['Yelp', 'Google', 'Chips', 'Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)', 
#             'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Fillings', 
#             'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap',  
#             'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 
#             'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce', 
#             'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa', 'Cilantro',
#             'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 
#             'Queso', 'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

In [0]:
target = 'Great'
high_cardinality = ['Burrito', 'Date']
features = train.columns.drop([target] + high_cardinality)

In [0]:
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

In [17]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((298, 56), (298,), (85, 56), (85,))

In [18]:
# Begin with baselines for classification. 
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
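from sklearn.metrics import accuracy_score

# Majority-class baseline: predict the most frequent training class for every row.
# (Here the mode is False, i.e. "not Great", despite the variable name.)
great_burrito = y_train.mode()[0]
y_pred = [great_burrito] * len(y_train)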

In [20]:
accuracy_score(y_train, y_pred)

0.5906040268456376
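
A majority-class baseline can also be scored on the validation set, using the class learned from the training labels only. A sketch using scikit-learn's `DummyClassifier` with small made-up label arrays (`toy_y_train` and `toy_y_val` are illustrative, not the burrito data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: the majority class in training is False.
toy_y_train = np.array([False, False, False, True, True])
toy_y_val = np.array([False, True, True, False])

# DummyClassifier ignores the features, so a placeholder column is enough.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_val = np.zeros((len(toy_y_val), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)

baseline_val_accuracy = accuracy_score(toy_y_val, baseline.predict(toy_X_val))
print('Baseline validation accuracy', baseline_val_accuracy)
```

Any model worth keeping should beat this number on the same validation set.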

In [0]:
# y_pred = [great_burrito] * len(y_val)
# accuracy_score(y_val, y_pred)

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

In [28]:
# Encode
encoder = ce.OneHotEncoder(use_cat_names=True)

X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)
X_train_encoded

Unnamed: 0,Yelp,Google,Chips_nan,Chips_x,Chips_X,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable_nan,Unreliable_x,NonSD_nan,NonSD_x,NonSD_X,Beef_x,Beef_nan,Beef_X,Pico_x,Pico_nan,Pico_X,Guac_x,Guac_nan,Guac_X,Cheese_x,Cheese_nan,Cheese_X,Fries_x,Fries_nan,...,Sauce_X,Salsa.1_nan,Salsa.1_x,Salsa.1_X,Cilantro_nan,Cilantro_x,Cilantro_X,Onion_nan,Onion_x,Onion_X,Taquito_nan,Taquito_x,Taquito_X,Pineapple_nan,Pineapple_x,Pineapple_X,Ham_nan,Ham_x,Chile relleno_nan,Chile relleno_x,Nopales_nan,Nopales_x,Lobster_nan,Lobster_x,Queso,Egg_nan,Egg_x,Mushroom_nan,Mushroom_x,Bacon_nan,Bacon_x,Sushi_nan,Sushi_x,Avocado_nan,Avocado_x,Corn_nan,Corn_x,Corn_X,Zucchini_nan,Zucchini_x
0,3.5,4.2,1,0,0,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
1,3.5,3.3,1,0,0,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
2,,,1,0,0,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
3,,,1,0,0,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
4,4.0,3.8,0,1,0,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,4.0,4.3,1,0,0,5.65,3.0,,,19.5,22.0,0.75,4.0,1.5,2.0,3.0,4.2,4.0,3.0,2.0,4.5,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
297,,,1,0,0,5.49,3.0,,,19.0,20.5,0.64,4.5,5.0,2.0,2.0,2.5,3.5,3.0,2.5,3.0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
298,3.5,3.7,1,0,0,7.75,4.0,,,20.0,21.0,0.70,3.5,2.5,3.0,3.3,1.4,2.3,2.2,3.3,4.5,1,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0
299,,,1,0,0,7.75,4.0,,,19.5,21.0,0.68,4.0,4.5,2.0,2.0,3.5,3.5,2.0,2.0,4.0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,,1,0,1,0,1,0,1,0,1,0,1,0,0,1,0


In [27]:
X_val_encoded.shape

(85, 81)
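
The encoded columns above include pairs such as `Chips_x` and `Chips_X`, which suggests the ingredient flags are inconsistently cased in the raw data. One option is to normalize the flags to 0/1 before encoding. A sketch on a made-up frame, under the assumption that a missing flag means the ingredient is absent (that assumption, and the `toy` data, are not from the source):

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns mimicking the 'x' / 'X' / missing ingredient flags.
toy = pd.DataFrame({
    'Beef': ['x', 'X', np.nan, 'x'],
    'Guac': [np.nan, 'x', 'X', np.nan],
    'Cost': [6.49, 5.45, 4.85, 5.25],
})

flag_columns = ['Beef', 'Guac']

# Treat any case of 'x' as present (1); everything else, including NaN, as absent (0).
toy_flags = toy[flag_columns].apply(lambda col: col.str.lower().eq('x').astype(int))
toy[flag_columns] = toy_flags

print(toy)
```

With flags stored this way, each ingredient would occupy a single numeric column instead of up to three one-hot columns.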

In [29]:
# Use scikit-learn for logistic regression.
# First, fit a linear regression on the binary target for comparison.
# 1. Import estimator class
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class
linear_reg = LinearRegression()

# 3. Arrange the feature matrices: impute missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

# 4. Fit the model
linear_reg.fit(X_train_imputed, y_train)

# 5. Apply the model to new data.
# The predictions look like this ...
linear_reg.predict(X_val_imputed)

array([ 6.69921875e-01,  4.68750000e-01,  7.14843750e-01,  6.46484375e-01,
       -1.95312500e-03, -4.10156250e-02,  9.78515625e-01,  6.34765625e-01,
        1.77734375e-01,  8.65234375e-01,  8.18359375e-01,  4.49218750e-02,
        4.04296875e-01,  3.41796875e-01,  1.06054688e+00,  4.21875000e-01,
        4.12109375e-01,  7.48046875e-01,  7.55859375e-01,  1.22656250e+00,
        7.03125000e-01,  4.84375000e-01,  1.32812500e-01,  6.30859375e-01,
        6.67968750e-01,  7.48046875e-01,  6.56250000e-01,  6.15234375e-01,
        5.83984375e-01,  8.06640625e-01,  4.55078125e-01,  1.86418744e+12,
        5.31250000e-01,  5.07812500e-01,  2.73437500e-01,  3.88671875e-01,
        5.52734375e-01,  6.30859375e-01,  1.32812500e-01,  1.26953125e-01,
        3.67187500e-01, -1.86418744e+12,  5.97656250e-01,  8.02734375e-01,
        7.07031250e-01,  6.40625000e-01,  3.37890625e-01,  2.63671875e-01,
        8.39843750e-01, -1.54296875e-01,  3.39843750e-01,  5.11718750e-01,
        8.78906250e-01,  
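
Linear regression outputs are unbounded, so values like the negative numbers and the values above 1 in this array cannot be read as probabilities. Logistic regression instead models a linear score and passes it through the sigmoid function, which maps any real number into (0, 1). A minimal sketch with made-up scores (`scores` is illustrative, not model output):

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores, including values outside [0, 1].
scores = np.array([-3.0, -0.15, 0.0, 1.23, 4.0])
probabilities = sigmoid(scores)

# Classify as 'Great' when the estimated probability is at least 0.5.
predicted_great = probabilities >= 0.5

print(probabilities.round(3))
print(predicted_great)
```

A score of 0 corresponds to a probability of exactly 0.5, which is the usual decision boundary.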

In [30]:
# Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)
print('Validation Accuracy', log_reg.score(X_val_imputed, y_val))

Validation Accuracy 0.8235294117647058
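
The impute, scale, and fit steps could also be chained with a scikit-learn pipeline, which keeps the fit/transform bookkeeping in one object and makes the coefficients easy to inspect. A sketch on synthetic data (the feature names reuse two column names for flavor, but the values and the resulting numbers are made up and say nothing about the burrito dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 'Synergy' drives the label, 'Cost' is noise.
toy_X = pd.DataFrame({
    'Synergy': rng.uniform(1, 5, size=200),
    'Cost': rng.uniform(4, 10, size=200),
})
toy_y = toy_X['Synergy'] > 3.5

# Knock out a few values to exercise the imputer.
toy_X.loc[::25, 'Cost'] = np.nan

pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(solver='lbfgs'),
)
pipeline.fit(toy_X, toy_y)

# Coefficients are on the standardized scale, so their magnitudes are comparable.
model = pipeline.named_steps['logisticregression']
coefficients = pd.Series(model.coef_[0], index=toy_X.columns)
print(coefficients.sort_values())
print('Training accuracy', pipeline.score(toy_X, toy_y))
```

Because the pipeline is fit only on the data passed to `fit`, calling `pipeline.score` on a held-out set would reuse the training means and scales rather than recomputing them.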


In [35]:
X_train_val = pd.concat([X_train, X_val], join='inner')
y_train_val = pd.concat([y_train, y_val], join='inner')
X_train_val.shape, y_train_val.shape

((383, 56), (383,))

In [36]:
# Baseline: class distribution of the combined train + validation labels
y_train_val.value_counts(normalize=True)

False    0.582245
True     0.417755
Name: Great, dtype: float64

In [0]:
# Encode
encoder = ce.OneHotEncoder(use_cat_names=True)

X_train_val_encoded = encoder.fit_transform(X_train_val)
X_test_encoded = encoder.transform(X_test)

In [39]:
# Refit on train + validation, then apply to the test set.
# As before, start with a linear regression for comparison.
# 1. Import estimator class
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class
linear_reg = LinearRegression()

# 3. Arrange the feature matrices: impute missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train_val_imputed = imputer.fit_transform(X_train_val_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

# 4. Fit the model
linear_reg.fit(X_train_val_imputed, y_train_val)

# 5. Apply the model to new data.
# The predictions look like this ...
linear_reg.predict(X_test_imputed)

array([1.33453484, 0.89994849, 1.30250449, 1.07685602, 0.2847337 ,
       0.52625176, 0.77035215, 1.05367508, 0.65818962, 0.84023485,
       0.7369771 , 0.61915607, 0.53245733, 0.64297591, 0.66936175,
       0.58249514, 0.66871886, 0.15291938, 0.04833986, 0.17765588,
       0.26414402, 0.79518265, 0.46807205, 0.60711878, 0.66890209,
       0.45802793, 0.42352856, 1.18932959, 0.70116031, 0.90230094,
       0.4019065 , 0.64935519, 0.7289261 , 0.19619488, 0.68587423,
       0.59248162, 1.11358282, 0.7894795 ])

In [40]:
# Get your model's test accuracy. (One time, at the end.)
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_val_imputed, y_train_val)
print('Test Accuracy', log_reg.score(X_test_imputed, y_test))

Test Accuracy 0.7631578947368421
