<a href="https://colab.research.google.com/github/lopez-isaac/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:

import pandas as pd
# Build the path from DATA_PATH so the same cell works on Colab and locally
df = pd.read_csv(DATA_PATH + 'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# code start

In [0]:
print(df.shape)
df.head()

(421, 14)


Unnamed: 0,Great,Meat,Length,Circum,Volume,Tortilla,Temp,Fillings,Meat:filling,Uniformity,Wrap,Synergy,Salsa,Date
0,False,3.0,20.038233,22.135765,0.786477,3.0,5.0,3.5,4.0,4.0,4.0,4.0,4.0,2016-01-18
1,False,2.5,20.038233,22.135765,0.786477,2.0,3.5,2.5,2.0,4.0,5.0,2.5,3.5,2016-01-24
2,False,2.5,20.038233,22.135765,0.786477,3.0,2.0,3.0,4.5,4.0,5.0,3.0,3.0,2016-01-24
3,False,3.5,20.038233,22.135765,0.786477,3.0,2.0,3.0,4.0,5.0,5.0,4.0,4.0,2016-01-24
4,True,4.0,20.038233,22.135765,0.786477,4.0,5.0,3.5,4.5,5.0,4.0,4.5,2.5,2016-01-27


In [0]:
# Convert the Date column to datetime
df["Date"] = pd.to_datetime(df["Date"])

In [0]:
# Keep the target, the numeric rating/size features, and the date
columns = ['Great', 'Meat', 'Length', 'Circum', 'Volume', 'Tortilla', 'Temp',
           'Fillings', 'Meat:filling', 'Uniformity', 'Wrap', 'Synergy',
           'Salsa', 'Date']
df = df[columns]

In [0]:
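# Fill missing values with each column's mean.
# Note: the mean is computed over all rows, before the train/validation/test split.
df = df.apply(lambda x: x.fillna(x.mean()))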

In [0]:
df.isnull().sum()

Great           0
Meat            0
Hunger          0
Length          0
Circum          0
Volume          0
Tortilla        0
Temp            0
Fillings        0
Meat:filling    0
Uniformity      0
Wrap            0
Synergy         0
Salsa           0
Date            0
dtype: int64

#### Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [0]:
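# Split by date into train, validation, and test sets.
# pd.datetime was removed in pandas 2.0, so use pd.Timestamp instead.
split_date1 = pd.Timestamp('2016-12-31')
split_date2 = pd.Timestamp('2018-01-01')

# Train: reviews from 2016 and earlier
train = df.loc[df["Date"] <= split_date1]

# Validation: reviews from 2017
vali = df.loc[(df["Date"] > split_date1) & (df["Date"] < split_date2)]

# Test: reviews from 2018 and later
test = df.loc[df["Date"] >= split_date2]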
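
An alternative way to express the same split is to compare calendar years directly, which avoids choosing boundary timestamps. The sketch below uses a small made-up frame (`toy`, not the burrito data) purely to illustrate the idea; the column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the burrito data (values are made up).
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2015-06-01", "2016-12-31", "2017-03-15",
        "2017-12-31", "2018-01-01", "2019-07-04",
    ]),
    "Great": [True, False, True, False, True, False],
})

# Compare on the calendar year rather than on boundary timestamps.
year = toy["Date"].dt.year
toy_train = toy[year <= 2016]   # 2016 and earlier
toy_val = toy[year == 2017]     # 2017 only
toy_test = toy[year >= 2018]    # 2018 and later

# Every row should land in exactly one split.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

Checking that the three pieces add up to the original row count is a cheap guard against rows with missing dates silently dropping out of every split.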

#### Begin with baselines for classification.

In [0]:
print(train.shape)
train.head()

(298, 15)


Unnamed: 0,Great,Meat,Hunger,Length,Circum,Volume,Tortilla,Temp,Fillings,Meat:filling,Uniformity,Wrap,Synergy,Salsa,Date
0,False,3.0,3.0,20.038233,22.135765,0.786477,3.0,5.0,3.5,4.0,4.0,4.0,4.0,4.0,2016-01-18
1,False,2.5,3.5,20.038233,22.135765,0.786477,2.0,3.5,2.5,2.0,4.0,5.0,2.5,3.5,2016-01-24
2,False,2.5,1.5,20.038233,22.135765,0.786477,3.0,2.0,3.0,4.5,4.0,5.0,3.0,3.0,2016-01-24
3,False,3.5,2.0,20.038233,22.135765,0.786477,3.0,2.0,3.0,4.0,5.0,5.0,4.0,4.0,2016-01-24
4,True,4.0,4.0,20.038233,22.135765,0.786477,4.0,5.0,3.5,4.5,5.0,4.0,4.5,2.5,2016-01-27


In [0]:
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64
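
The output above shows that the majority class in the training set is `False` (about 59% of rows), so a model that always predicts `False` would be right about 59% of the time on the training data. A minimal sketch of turning that idea into a baseline accuracy score follows. It uses made-up toy labels (`toy_y_train`, `toy_y_val`) rather than the notebook's data, so the printed number is not a result for the burrito dataset.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy labels standing in for y_train and the validation target (made up).
toy_y_train = pd.Series([False, False, False, True, True])
toy_y_val = pd.Series([False, True, True, False])

# The majority-class baseline always predicts the most common training label.
majority_class = toy_y_train.mode()[0]
baseline_pred = [majority_class] * len(toy_y_val)

# Accuracy of that constant prediction on the held-out labels.
baseline_acc = accuracy_score(toy_y_val, baseline_pred)
print(majority_class, baseline_acc)
```

Any fitted model should beat this number on the validation set before it is worth keeping.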

#### Use scikit-learn for logistic regression.

In [0]:
# Import the needed libraries
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import category_encoders as ce

In [0]:
# make features and target
target = "Great"
features = train.columns.drop(["Date","Great"])
y_train = train[target]
x_train = train[features]

x_vali = vali[features]
y_vali = vali[target]

In [0]:
# One-hot encode any categorical features
encoder = ce.OneHotEncoder(use_cat_names=True)

x_train_encoded = encoder.fit_transform(x_train)
x_vali_encoded = encoder.transform(x_vali)


In [0]:
# Impute missing values with the column mean (SimpleImputer's default strategy)
imputer = SimpleImputer()

x_train_imputed = imputer.fit_transform(x_train_encoded)
x_val_imputed = imputer.transform(x_vali_encoded)

In [0]:
# Standardize the features using statistics fitted on the training set
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train_imputed)
x_vali_scaled = scaler.transform(x_val_imputed)

#### Get your model's validation accuracy. (Multiple times if you try multiple iterations.)

In [0]:
model = LogisticRegressionCV(random_state=42)
model.fit(x_train_scaled, y_train)
print('Validation Accuracy', model.score(x_vali_scaled, y_vali))

Validation Accuracy 0.8705882352941177
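
The impute, scale, and fit steps above could also be chained with a scikit-learn pipeline, one of the stretch goals. The sketch below is an illustration under assumptions, not the notebook's actual run: it uses synthetic data from `make_classification` with a few values blanked out, and the variable names (`X_tr`, `X_va`, `pipe`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the burrito features (not the real data).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X[::10, 0] = np.nan  # blank out a few values so the imputer has work to do

# Simple ordered split standing in for train/validation.
X_tr, X_va = X[:150], X[150:]
y_tr, y_va = y[:150], y[150:]

# Each step is fitted on the training data only, then reused on validation.
pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegressionCV(random_state=42),
)
pipe.fit(X_tr, y_tr)

val_acc = pipe.score(X_va, y_va)
print("Validation accuracy (synthetic data):", val_acc)
```

Bundling the steps this way means a single `pipe.score(...)` call applies the same fitted transformations to any later split, which removes the need to track separate `*_imputed` and `*_scaled` arrays.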




#### Get your model's test accuracy. (One time, at the end.)

In [0]:
# Apply the fitted transformations to the test set
x_test = test[features]
y_test = test[target]
x_test_encoded = encoder.transform(x_test)
x_test_imputed = imputer.transform(x_test_encoded)
x_test_scaled = scaler.transform(x_test_imputed)
y_pred = model.predict(x_test_scaled)

In [0]:
print('Test Accuracy', model.score(x_test_scaled, y_test))

Test Accuracy 0.7894736842105263
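
For the "get and plot your coefficients" stretch goal, one possible approach is to pair each fitted coefficient with its feature name in a pandas Series. The sketch below is self-contained and uses synthetic data with hypothetical feature names (`f0` to `f3`) and a plain `LogisticRegression`. In the notebook itself, the analogous expression would presumably be `pd.Series(model.coef_[0], index=features)`, which is an assumption rather than something the notebook ran.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names for a synthetic dataset (not the burrito data).
feature_names = ["f0", "f1", "f2", "f3"]
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# For a binary target, coef_ has shape (1, n_features); take the single row.
coefs = pd.Series(clf.coef_[0], index=feature_names).sort_values()
print(coefs)

# With matplotlib installed, coefs.plot.barh() would draw a horizontal bar chart.
```

Because the features are standardized, a larger absolute coefficient suggests a stronger association with the predicted class, though correlated features can share or trade off weight.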
