<a href="https://colab.research.google.com/github/mherbert93/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [x] Do one-hot encoding.
- [x] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [x] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
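The category cleanup above could also be written as a single `np.select` call. This is a minimal sketch on a few made-up burrito names (hypothetical, not rows from the dataset). Because `np.select` takes the first matching condition while the sequential `.loc` assignments above let the last match win, the conditions are listed in reverse order to keep the same result for names that match more than one keyword.

```python
import numpy as np
import pandas as pd

# Made-up names for illustration only (not taken from burritos.csv).
names = pd.Series(['California', 'Carne Asada', 'Surf and Turf',
                   'Carnitas', 'Al Pastor', 'California Carnitas']).str.lower()

# np.select uses the FIRST condition that is True, so list the keywords
# in reverse of the assignment order used in the cell above.
conditions = [names.str.contains('carnitas'),
              names.str.contains('surf'),
              names.str.contains('asada'),
              names.str.contains('california')]
labels = ['Carnitas', 'Surf & Turf', 'Asada', 'California']

cleaned = pd.Series(np.select(conditions, labels, default='Other'),
                    index=names.index)
print(cleaned.tolist())
```

On the real data this would replace the five `df.loc[...]` assignments with one assignment to `df['Burrito']`.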

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [51]:
df.head(1)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [0]:
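# infer_datetime_format is deprecated in recent pandas; the format is inferred by default
df['Date'] = pd.to_datetime(df['Date'])

# .copy() makes each split an independent DataFrame, so later edits do not
# trigger SettingWithCopyWarning
train = df[df['Date'].dt.year <= 2016].copy()
validation = df[df['Date'].dt.year == 2017].copy()
test = df[df['Date'].dt.year >= 2018].copy()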

In [53]:
train.shape, validation.shape, test.shape #note the small test set

((298, 59), (85, 59), (38, 59))

In [54]:
# Drop Date: it was only needed for the split. Reassigning (instead of
# inplace=True on a slice) avoids SettingWithCopyWarning.
train = train.drop(columns='Date')
train.describe(include='all')


Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
count,298,71.0,71.0,22,292.0,297.0,0.0,0.0,175.0,174.0,174.0,298.0,283.0,288.0,297.0,292.0,296.0,278.0,296.0,296.0,27,5,168,143,139,149,119,85,43,20,20,5,33,32,11,7,7,1,7,37,6,15,17,4,7,1,4,4,1,0.0,4,3,3,2,13,2,1,298
unique,5,,,2,,,,,,,,,,,,,,,,,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,1,1,1,1,,1,1,1,1,1,2,1,2
top,California,,,x,,,,,,,,,,,,,,,,,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,,x,x,x,x,x,X,x,False
freq,118,,,19,,,,,,,,,,,,,,,,,27,3,130,115,101,121,97,63,29,19,17,3,24,24,9,5,4,1,5,33,5,9,9,3,5,1,4,4,1,,4,3,3,2,13,1,1,176
mean,,3.897183,4.142254,,6.896781,3.445286,,,19.829886,22.042241,0.77092,3.472315,3.70636,3.551215,3.519024,3.52887,3.395946,3.32464,3.540203,3.955068,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,,0.47868,0.371738,,1.211412,0.85215,,,2.081275,1.685043,0.137833,0.797606,0.991897,0.869483,0.850348,1.040457,1.089044,0.971226,0.922426,1.167341,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,,2.5,2.9,,2.99,0.5,,,15.0,17.0,0.4,1.4,1.0,1.0,1.0,0.5,1.0,0.0,1.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,,3.5,4.0,,6.25,3.0,,,18.5,21.0,0.6625,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,,4.0,4.2,,6.85,3.5,,,19.5,22.0,0.75,3.5,4.0,3.5,3.5,4.0,3.5,3.5,3.75,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,,4.0,4.4,,7.5,4.0,,,21.0,23.0,0.87,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [55]:
train.isnull().all()  # which columns are entirely null?

Burrito           False
Yelp              False
Google            False
Chips             False
Cost              False
Hunger            False
Mass (g)           True
Density (g/mL)     True
Length            False
Circum            False
Volume            False
Tortilla          False
Temp              False
Meat              False
Fillings          False
Meat:filling      False
Uniformity        False
Salsa             False
Synergy           False
Wrap              False
Unreliable        False
NonSD             False
Beef              False
Pico              False
Guac              False
Cheese            False
Fries             False
Sour cream        False
Pork              False
Chicken           False
Shrimp            False
Fish              False
Rice              False
Beans             False
Lettuce           False
Tomato            False
Bell peper        False
Carrots           False
Cabbage           False
Sauce             False
Salsa.1           False
Cilantro        

In [0]:
# Drop the target and the columns that are entirely null in the training set
features = train.drop(['Great', 'Mass (g)', 'Density (g/mL)', 'Queso'], axis=1).columns
target = 'Great'
y_train = train[target]
X_train = train[features]
y_validation = validation[target]
X_validation = validation[features]
y_test = test[target]
X_test = test[features]

majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)


In [57]:
from sklearn.metrics import accuracy_score

print("Train dataset baseline accuracy is: ", accuracy_score(y_train, y_pred))

Train dataset baseline accuracy is:  0.5906040268456376


In [58]:
y_pred = [majority_class] * len(y_validation)

print("Validation dataset baseline accuracy is: ", accuracy_score(y_validation, y_pred))

Validation dataset baseline accuracy is:  0.5529411764705883
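The majority-class baseline computed by hand above can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses toy labels (hypothetical, not the burrito data) so that it runs on its own; with the notebook's variables it would be fitted on `X_train`, `y_train` and scored on `X_validation`, `y_validation`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): 6 False and 4 True in training.
y_train_toy = np.array([False] * 6 + [True] * 4)
y_val_toy = np.array([False, True, True, False, False])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_baseline = accuracy_score(y_train_toy, baseline.predict(X_train_toy))
val_baseline = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(train_baseline, val_baseline)
```

Using an estimator for the baseline keeps it interchangeable with the real model, for example inside the same scoring code.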


In [59]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
import warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning, module='sklearn')
warnings.filterwarnings(action='ignore', category=RuntimeWarning, module='scipy')



for k in range(1, len(X_train.columns)+1):

    train_pipeline = Pipeline([('encoder', ce.OneHotEncoder(use_cat_names=True)),
                               ('imputer', SimpleImputer()),
                               ('scaler', StandardScaler()),
                               ('kbest', SelectKBest(score_func=f_regression, k=k)),
                               ('model', LogisticRegression())])

    train_pipeline.fit(X_train, y_train)

    y_pred = train_pipeline.predict(X_validation)
    print("Validation accuracy score, with", k, "features: ", accuracy_score(y_validation, y_pred))

Validation accuracy score, with 1 features:  0.8117647058823529
Validation accuracy score, with 2 features:  0.8470588235294118
Validation accuracy score, with 3 features:  0.8705882352941177
Validation accuracy score, with 4 features:  0.9176470588235294
Validation accuracy score, with 5 features:  0.8705882352941177
Validation accuracy score, with 6 features:  0.8941176470588236
Validation accuracy score, with 7 features:  0.8823529411764706
Validation accuracy score, with 8 features:  0.8470588235294118
Validation accuracy score, with 9 features:  0.8705882352941177
Validation accuracy score, with 10 features:  0.8705882352941177
Validation accuracy score, with 11 features:  0.8470588235294118
Validation accuracy score, with 12 features:  0.8470588235294118
Validation accuracy score, with 13 features:  0.8470588235294118
Validation accuracy score, with 14 features:  0.8470588235294118
Validation accuracy score, with 15 features:  0.8352941176470589
Validation accuracy score, with 16
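Rather than reading the best `k` off the printed list, the scores can be collected in a dict and the best value picked programmatically. This is a minimal sketch on synthetic data from `make_classification` (an assumption, standing in for the burrito features), and it uses `f_classif`, scikit-learn's ANOVA F-test for classification targets, where the cell above uses `f_regression`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the burrito reviews).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
X_tr, y_tr = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

# Validation accuracy for every candidate k.
scores = {}
for k in range(1, X_tr.shape[1] + 1):
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('kbest', SelectKBest(score_func=f_classif, k=k)),
                     ('model', LogisticRegression())])
    pipe.fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_val, pipe.predict(X_val))

# max() returns the first maximum, so ties resolve to the smallest k.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

With a validation set as small as the one here (85 rows), neighbouring values of `k` often differ by only one or two predictions, so the chosen `k` should be read as approximate.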

In [60]:
# Imports repeated so this cell also runs on its own
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

train_pipeline = Pipeline([('encoder', ce.OneHotEncoder(use_cat_names=True)),
                           ('imputer', SimpleImputer()),
                           ('scaler', StandardScaler()),
                           ('kbest', SelectKBest(score_func=f_regression, k=4)),  # k=4 had the best validation accuracy above
                           ('model', LogisticRegression())])

train_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('encoder',
                 OneHotEncoder(cols=['Burrito', 'Chips', 'Unreliable', 'NonSD',
                                     'Beef', 'Pico', 'Guac', 'Cheese', 'Fries',
                                     'Sour cream', 'Pork', 'Chicken', 'Shrimp',
                                     'Fish', 'Rice', 'Beans', 'Lettuce',
                                     'Tomato', 'Bell peper', 'Carrots',
                                     'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro',
                                     'Onion', 'Taquito', 'Pineapple', 'Ham',
                                     'Chile relleno', 'Nopales', ...],
                               drop_inv...
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kbest',
                 SelectKBest(k=4,
                             score_func=<function f_regression at 0x7fc5303e8d08>)),
                ('model',
                 LogisticRegression(C=1.0, clas
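For the "Get and plot your coefficients" stretch goal, the fitted steps of a pipeline are available through `named_steps`. The sketch below shows one way to pair each coefficient with the feature that `SelectKBest` kept. It runs on synthetic data with made-up column names (`feature_0` and so on are hypothetical) and omits the encoder; in this notebook the names would have to come from the one-hot encoded columns instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names.
feature_names = [f'feature_{i}' for i in range(8)]
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=feature_names)

pipe = Pipeline([('imputer', SimpleImputer()),
                 ('scaler', StandardScaler()),
                 ('kbest', SelectKBest(score_func=f_classif, k=4)),
                 ('model', LogisticRegression())])
pipe.fit(X, y)

# Boolean mask of the columns SelectKBest kept, in input order.
mask = pipe.named_steps['kbest'].get_support()
selected = X.columns[mask]

# coef_ has shape (1, n_selected) for a binary target.
coefficients = pd.Series(pipe.named_steps['model'].coef_[0], index=selected)
print(coefficients.sort_values())
```

With matplotlib installed, `coefficients.sort_values().plot.barh()` would draw the coefficients as a horizontal bar chart.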

In [63]:
print("Testing accuracy score is: ", train_pipeline.score(X_test, y_test))

Testing accuracy score is:  0.7368421052631579
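The test set has only 38 rows, so this score corresponds to 28 correct predictions out of 38 and carries a wide margin of error. A rough sketch of that margin, assuming a normal approximation to the binomial (which is itself coarse at this sample size):

```python
import math

n_test = 38                      # test rows, from the shapes printed earlier
accuracy = 0.7368421052631579    # test accuracy reported above
n_correct = round(accuracy * n_test)

# Normal-approximation standard error of a proportion.
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval: accuracy +/- 1.96 standard errors.
margin = 1.96 * std_error
lower, upper = accuracy - margin, accuracy + margin
print(n_correct, round(std_error, 3), round(lower, 2), round(upper, 2))
```

Under this approximation the interval spans roughly 0.60 to 0.88, so the gap between the validation and test scores is hard to interpret with this little test data.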
