<a href="https://colab.research.google.com/github/tyleretheridge/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS14_214_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression



## Assignment 🌯



You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.




## Stretch Goals



- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

## Work

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
df.dtypes

Location        object
Burrito         object
Date            object
Neighborhood    object
Address         object
                 ...  
Bacon           object
Sushi           object
Avocado         object
Corn            object
Zucchini        object
Length: 66, dtype: object

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [8]:
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [25]:
df['Volume'].value_counts(dropna=False)

NaN     140
0.77     19
0.65     17
0.75     13
0.68     13
       ... 
1.24      1
0.40      1
0.98      1
0.47      1
1.03      1
Name: Volume, Length: 65, dtype: int64

In [9]:
# Do train/validate/test split. 
# Train on reviews from 2016 & earlier. 
# Validate on 2017. 
# Test on 2018 & later.

# Cast date dtype
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)


train_date = pd.to_datetime('1/1/2017', infer_datetime_format=True)
test_date = pd.to_datetime('1/1/2018', infer_datetime_format=True)
print(df.Date)


0     2016-01-18
1     2016-01-24
2     2016-01-24
3     2016-01-24
4     2016-01-27
         ...    
418   2019-08-27
419   2019-08-27
420   2019-08-27
421   2019-08-27
422   2019-08-27
Name: Date, Length: 421, dtype: datetime64[ns]


In [0]:
# Date based splitting
train = df[df['Date'] < train_date]
val = df[(df['Date'] >= train_date) & (df['Date'] < test_date)]
test = df[df['Date'] >= test_date]

In [0]:
# Equivalent split using the calendar year of each review
train = df[df['Date'].dt.year <= 2016]
val = df[df['Date'].dt.year == 2017]
test = df[df['Date'].dt.year >= 2018]
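
One way to check that the timestamp-based split and the year-based split select the same rows is to compare them directly. The sketch below uses a small made-up frame (`toy`, with hypothetical dates placed on the period boundaries) rather than the burrito data, and the names `train_cutoff` and `test_cutoff` are illustrative only.

```python
import pandas as pd

# Toy frame with hypothetical dates on the period boundaries (not the burrito data)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-12-31', '2018-01-01', '2019-08-27']),
})

train_cutoff = pd.Timestamp('2017-01-01')
test_cutoff = pd.Timestamp('2018-01-01')

# Split by comparing timestamps against the cutoffs
train_a = toy[toy['Date'] < train_cutoff]
val_a = toy[(toy['Date'] >= train_cutoff) & (toy['Date'] < test_cutoff)]
test_a = toy[toy['Date'] >= test_cutoff]

# Split by calendar year
train_b = toy[toy['Date'].dt.year <= 2016]
val_b = toy[toy['Date'].dt.year == 2017]
test_b = toy[toy['Date'].dt.year >= 2018]

# True only if every pair of subsets has identical rows and index
splits_match = (train_a.equals(train_b)
                and val_a.equals(val_b)
                and test_a.equals(test_b))
print(splits_match)
```

The same comparison could be run on `df` itself by building both versions of `train`, `val`, and `test` under different names before overwriting them.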

In [14]:
train.head(), val.head(), test.head()

(      Burrito       Date  Yelp  Google Chips  ...  Sushi  Avocado  Corn  Zucchini  Great
 0  California 2016-01-18   3.5     4.2   NaN  ...    NaN      NaN   NaN       NaN  False
 1  California 2016-01-24   3.5     3.3   NaN  ...    NaN      NaN   NaN       NaN  False
 2    Carnitas 2016-01-24   NaN     NaN   NaN  ...    NaN      NaN   NaN       NaN  False
 3       Asada 2016-01-24   NaN     NaN   NaN  ...    NaN      NaN   NaN       NaN  False
 4  California 2016-01-27   4.0     3.8     x  ...    NaN      NaN   NaN       NaN   True
 
 [5 rows x 59 columns],
         Burrito       Date  Yelp  Google  ... Avocado  Corn  Zucchini  Great
 301  California 2017-01-04   NaN     NaN  ...     NaN   NaN       NaN  False
 302       Other 2017-01-04   NaN     NaN  ...     NaN   NaN       NaN  False
 303       Other 2017-01-07   NaN     NaN  ...     NaN   NaN       NaN  False
 304       Other 2017-01-07   NaN     NaN  ...     NaN   NaN       NaN  False
 305       Other 2017-01-10   NaN     NaN  .

## Regression

Baseline Classification Model

In [15]:
# Begin with baselines for classification.
target = 'Great'
y_train = train[target]

# Look at value distributions for target
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

`value_counts` shows that the majority class is `False`.  
A baseline model that predicts `False` for every observation is therefore correct on 59.1% of the training set.
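
The majority-class baseline can also be scored on a held-out set, which gives a fairer number to compare the model's validation accuracy against. This is a minimal sketch with made-up labels (`y_train_toy` and `y_val_toy` are hypothetical, not the burrito data):

```python
import pandas as pd

# Hypothetical labels standing in for the training and validation targets
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, False, True, False])

# Most frequent label in the training data
majority_class = y_train_toy.mode()[0]

# Predict that label for every validation row and measure the hit rate
baseline_pred = pd.Series([majority_class] * len(y_val_toy))
baseline_val_acc = (baseline_pred == y_val_toy).mean()
print(majority_class, baseline_val_acc)
```

In the notebook, the same idea would apply with `train['Great']` and `val['Great']` in place of the toy series.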

scikit-learn Logistic Regression

In [0]:
# Plan: split > impute > scale > fit logistic regression

In [0]:
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

In [0]:
features = ['Cost', 'Tortilla', 'Temp', 'Meat', 
            'Fillings','Meat:filling','Uniformity',
            'Salsa', 'Synergy', 'Wrap']
target = 'Great'

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

y_test = test[target]

In [31]:
X_train.head()

Unnamed: 0,Cost,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap
0,6.49,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0
1,5.45,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0
2,4.85,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0
3,5.25,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0
4,6.59,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0


In [32]:
print(X_train.shape)
X_train.isnull().sum()

(298, 10)


Cost             6
Tortilla         0
Temp            15
Meat            10
Fillings         1
Meat:filling     6
Uniformity       2
Salsa           20
Synergy          2
Wrap             2
dtype: int64

In [41]:
# Instantiate imputer
imputer = SimpleImputer(strategy='mean')

# Impute on feature matrices using mean strategy
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

np.isnan(X_train_imputed).sum()

0
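
The imputer is fitted on the training matrix only and then reused on the validation matrix, so missing validation values are filled with the training mean rather than the validation mean. A small sketch on made-up single-column arrays (hypothetical values, not the burrito features) illustrates this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-feature matrices with missing entries
X_train_toy = np.array([[1.0], [3.0], [np.nan]])
X_val_toy = np.array([[np.nan], [10.0]])

imputer_toy = SimpleImputer(strategy='mean')
X_train_filled = imputer_toy.fit_transform(X_train_toy)  # learns the mean of 1.0 and 3.0
X_val_filled = imputer_toy.transform(X_val_toy)          # reuses that training mean

print(imputer_toy.statistics_)   # per-column means learned from the training data
print(X_val_filled.ravel())
```

Calling `fit_transform` on the validation or test matrix instead would let information from those sets leak into preprocessing.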

In [0]:
# Perform scaling 

# Instantiate Scaler
scaler = StandardScaler()

# Scale imputed feature matrices
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [43]:
# Instantiate model
model = LogisticRegressionCV()

# Perform fit
model.fit(X_train_scaled, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [0]:
# Create predictions based on model using validation feature matrix
y_pred = model.predict(X_val_scaled)

In [48]:
# Calculate accuracy score for validation
from sklearn.metrics import accuracy_score

val_acc = accuracy_score(y_val,y_pred)
print("The model's validation accuracy is: ", val_acc)

The model's validation accuracy is:  0.8941176470588236


In [0]:
# Create test set matrix and target vector
X_test = test[features]
y_test = test[target]

# Process test data set with imputer and scaler
X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)


# Use the model already fitted on the training data to predict on the test set
y_pred = model.predict(X_test_scaled)





In [50]:
# Calculate accuracy score for test data
test_acc = accuracy_score(y_test, y_pred)
print("The model's test accuracy is: ", test_acc)

The model's test accuracy is:  0.7631578947368421
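
The impute, scale, and fit steps above could also be chained with a scikit-learn pipeline, one of the stretch goals. The following is only a sketch on synthetic data (`X_toy`, `y_toy`, and the 90/30 split are assumptions for illustration, not the burrito reviews), using the same three estimators as this notebook:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the burrito reviews)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy[:, 0] + X_toy[:, 1] > 0            # boolean target, like 'Great'
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan   # sprinkle in missing values

X_fit, X_holdout = X_toy[:90], X_toy[90:]
y_fit, y_holdout = y_toy[:90], y_toy[90:]

# Each step is fitted on the training rows only, then reused for scoring
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegressionCV(),
)
pipeline.fit(X_fit, y_fit)
holdout_acc = pipeline.score(X_holdout, y_holdout)
print(holdout_acc)
```

With the real data, `pipeline.fit(X_train, y_train)` followed by `pipeline.score(X_val, y_val)` would replace the separate imputer, scaler, and model calls, since the pipeline applies `transform` to new data with the statistics learned during `fit`.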
