
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
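
The date-based split in the first checklist item could be sketched roughly as follows. This is a minimal sketch, not the assignment's reference solution: the helper name `split_by_year` and the tiny `demo` frame are hypothetical, and it assumes the `Date` column holds strings that `pd.to_datetime` can parse.

```python
import pandas as pd

def split_by_year(df, date_col="Date"):
    """Split into train (2016 and earlier), validation (2017), test (2018 and later)."""
    years = pd.to_datetime(df[date_col]).dt.year
    train = df[years <= 2016]
    val = df[years == 2017]
    test = df[years >= 2018]
    return train, val, test

# Tiny made-up frame standing in for the burrito reviews (illustrative only)
demo = pd.DataFrame({
    "Date": ["5/6/2016", "11/6/2016", "8/9/2017", "1/2/2018", "3/4/2019"],
    "Great": [True, False, True, False, True],
})
train, val, test = split_by_year(demo)
print(len(train), len(val), len(test))
```

Splitting on time rather than at random keeps later reviews out of the training data, which matches the train/validate/test years named in the checklist.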


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
import numpy as np
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4
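
With a boolean target like `Great`, a common classification baseline is to always predict the majority class. The following is a small sketch of that idea; the helper name `majority_baseline_accuracy` and the demo Series are hypothetical and are not taken from the burrito data.

```python
import pandas as pd

def majority_baseline_accuracy(y_train, y_eval):
    """Return the majority class in y_train and its accuracy on y_eval."""
    majority = y_train.mode()[0]
    return majority, (y_eval == majority).mean()

# Made-up labels standing in for the 'Great' column (illustrative only)
y_train_demo = pd.Series([False, False, True, False, True])
y_val_demo = pd.Series([True, False, False, True])

majority_class, baseline_acc = majority_baseline_accuracy(y_train_demo, y_val_demo)
print(majority_class, baseline_acc)
```

A model's validation accuracy is only informative when compared against this kind of baseline.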

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [7]:
df.sample(10)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
276,Other,11/6/2016,4.0,4.4,X,7.35,2.3,,,19.5,23.5,0.86,4.0,4.0,3.5,2.0,1.5,2.0,2.0,3.0,1.2,,,,,X,,,,,,,X,,,,X,,,X,,,,,,,,,,,,,,,,,,,False
88,Other,5/6/2016,,,,8.95,5.0,,,17.0,22.0,0.65,4.0,4.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
202,Surf & Turf,8/30/2016,,,,8.25,1.0,,,,,,3.0,2.0,4.0,3.0,5.0,5.0,4.0,3.5,5.0,,,x,x,,x,x,,,,x,,,,,,,,,x,,,,,,,,,,,,,,,x,,,False
64,Asada,4/14/2016,,,,7.89,3.5,,,,,,3.0,2.0,4.5,4.0,4.0,1.5,4.5,4.5,4.5,,,X,X,X,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
191,California,8/27/2016,4.0,4.1,x,5.99,4.0,,,18.5,21.0,0.65,4.0,4.5,2.5,2.5,2.5,1.5,3.0,4.0,5.0,,,X,,,X,X,X,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
294,Asada,11/26/2016,,,,6.49,3.0,,,17.5,17.0,0.4,3.5,4.0,3.0,4.0,3.5,4.0,3.0,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
47,Other,3/21/2016,,,,6.95,4.5,,,,,,2.0,5.0,3.5,4.0,2.5,5.0,3.5,2.5,4.0,,,x,x,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
369,California,8/9/2017,,,,8.35,4.5,,,21.0,23.0,0.88,3.0,3.0,4.0,3.0,4.0,2.5,4.0,4.5,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
350,California,5/30/2017,,,,7.0,3.5,,,23.0,21.5,0.85,3.0,3.0,2.0,2.0,3.0,2.0,3.0,2.0,3.0,,,x,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
61,Surf & Turf,4/7/2016,,,,7.45,3.5,,,,,,3.0,5.0,3.5,2.5,3.0,2.5,3.75,3.0,4.0,,,x,,,x,,,,,x,,x,,,,,,,,,,,,,,,,,,,,,,,,,False


In [8]:
# Drop the Mass and Density columns because of their NaN values.
df = df.drop(columns=['Mass (g)', 'Density (g/mL)'])

In [9]:
df.columns

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Unreliable',
       'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Sour cream',
       'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce',
       'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa.1',
       'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno',
       'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom', 'Bacon', 'Sushi',
       'Avocado', 'Corn', 'Zucchini', 'Great'],
      dtype='object')

In [10]:
# Drop rows with missing values in Length, Circum, or Hunger.
df = df.dropna(subset=['Length', 'Circum', 'Hunger'])

In [11]:
# More imports (numpy was already imported above)
from sklearn.model_selection import train_test_split

In [12]:
features = ['Length', 'Circum']
target = 'Great'

In [17]:
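# Use every row in df as the "validation" data. This set also contains
# the training rows, so it is not a true holdout.
y_val = df[target]
X_val = df[features]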

In [14]:
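# test_size=0.7 holds out 70% of the rows as the test set,
# so the model trains on the remaining 30%.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.7, random_state=42)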

In [15]:
y_train.sample(10)

285     True
356    False
210    False
82      True
189    False
323    False
185     True
251    False
78     False
182     True
Name: Great, dtype: bool

In [18]:
# Impute missing values with the column mean.
# Fit the imputer on the training data only, then apply it to the validation data.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

In [19]:
X_train_imputed

array([[25.   , 23.   ],
       [18.5  , 22.5  ],
       [20.   , 22.   ],
       [22.   , 25.   ],
       [24.5  , 22.7  ],
       [20.   , 22.   ],
       [19.5  , 20.5  ],
       [17.5  , 23.5  ],
       [19.   , 20.   ],
       [19.5  , 24.5  ],
       [25.   , 22.   ],
       [19.   , 23.5  ],
       [21.   , 21.   ],
       [22.   , 22.   ],
       [19.   , 21.5  ],
       [19.5  , 21.5  ],
       [19.5  , 25.5  ],
       [20.5  , 21.5  ],
       [18.   , 21.5  ],
       [18.5  , 21.   ],
       [15.5  , 19.5  ],
       [18.   , 20.   ],
       [20.   , 23.   ],
       [19.   , 22.   ],
       [20.   , 21.   ],
       [20.   , 20.   ],
       [17.   , 21.5  ],
       [20.   , 23.5  ],
       [17.   , 20.   ],
       [18.5  , 21.5  ],
       [22.   , 22.   ],
       [20.5  , 21.5  ],
       [18.5  , 21.   ],
       [16.5  , 25.   ],
       [17.   , 22.   ],
       [19.   , 24.   ],
       [21.   , 19.5  ],
       [20.   , 22.   ],
       [17.78 , 22.225],
       [20.5  , 22.5  ],


In [20]:
# Import SelectKBest and the f_classif score function
# (ANOVA F-test, suited to a classification target).
from sklearn.feature_selection import f_classif, SelectKBest

# With only two features, k=2 keeps both of them.
selector = SelectKBest(score_func=f_classif, k=2)

X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

In [21]:

y_train

394     True
139    False
197    False
368     True
390     True
       ...  
319     True
171    False
211    False
414     True
207     True
Name: Great, Length: 83, dtype: bool

In [22]:
# Import estimator class
from sklearn.linear_model import LinearRegression

# Instantiate this class
linear_reg = LinearRegression()
# Fit the model. The boolean target is treated as 0/1.
linear_reg.fit(X_train_imputed, y_train)

# Apply the model to new data. The predictions are continuous scores,
# not class labels.
linear_reg.predict(X_val_imputed)

array([0.50490176, 0.43818137, 0.50631287, 0.69351555, 0.64706766,
       0.65720391, 0.54262452, 0.467179  , 0.54685786, 0.59471687,
       0.38914179, 0.59189464, 0.603442  , 0.39173348, 0.60061977,
       0.555583  , 0.43818137, 0.43420617, 0.61948115, 0.58034728,
       0.61807004, 0.66592904, 0.58034728, 0.73406054, 0.70506292,
       0.77037219, 0.5687353 , 0.46859012, 0.55276077, 0.47731525,
       0.56289701, 0.63975365, 0.3990475 , 0.41932   , 0.31346574,
       0.61075602, 0.62018671, 0.22929509, 0.62230338, 0.23802022,
       0.51644912, 0.5483982 , 0.55417188, 0.3990475 , 0.33373823,
       0.82836744, 0.50490176, 0.47731525, 0.34246336, 0.49476551,
       0.39032237, 0.39032237, 0.49476551, 0.48462927, 0.61075602,
       0.28587923, 0.27715409, 0.603442  , 0.51644912, 0.58316951,
       0.60793379, 0.59920866, 0.59920866, 0.64706766, 0.44549539,
       0.67606529, 0.47449302, 0.58907241, 0.51075297, 0.57021103,
       0.58907241, 0.57162215, 0.46435677, 0.50926433, 0.40777

In [23]:
from sklearn.linear_model import LogisticRegression

# Fit logistic regression and report accuracy on the validation data
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)
print(log_reg.score(X_val_imputed, y_val))

0.5842293906810035


In [29]:
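# Mean predicted probability across both classes on the test set.
# Each row of predict_proba sums to 1, so this is always 0.5 for a binary
# model; it is not an accuracy. Use log_reg.score(X_test, y_test) for that.
log_reg.predict_proba(X_test).mean()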

0.5
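
The 0.5 above follows from how `predict_proba` is laid out: each row holds one probability per class and sums to 1, so the mean over a two-column output is 0.5 regardless of the model. The sketch below contrasts that with an actual accuracy computation. The data is synthetic and the names `X_demo`, `y_demo`, `X_holdout`, and `y_holdout` are hypothetical, so the printed accuracy says nothing about the burrito model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0

# Train on the first 150 rows and hold out the remaining 50
model = LogisticRegression().fit(X_demo[:150], y_demo[:150])
X_holdout, y_holdout = X_demo[150:], y_demo[150:]

# Mean over both probability columns: 0.5 up to floating-point error
proba_mean = model.predict_proba(X_holdout).mean()

# Accuracy: fraction of holdout rows whose predicted class matches the label
accuracy = accuracy_score(y_holdout, model.predict(X_holdout))
print(proba_mean, accuracy)
```

`model.score(X_holdout, y_holdout)` returns the same accuracy value as `accuracy_score` here.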