
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [9]:
# Split into train / validation / test sets by review date

# Parse the Date column as datetimes
# (infer_datetime_format is deprecated as of pandas 2.0, so it is omitted)
df['Date'] = pd.to_datetime(df['Date'])

# Train: 2016 and earlier. Validate: 2017. Test: 2018 through 2020.
train = df[df['Date'] <= '2016-12-31']
val = df[(df['Date'] >= '2017-01-01') & (df['Date'] <= '2017-12-31')]
test = df[(df['Date'] >= '2018-01-01') & (df['Date'] <= '2020-12-31')]

# Verify shapes
train.shape, val.shape, test.shape

((298, 59), (85, 59), (37, 59))
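The date-based split above can also be written as a small reusable helper. This is only a sketch on made-up rows: `split_by_year` and the `toy` frame are hypothetical names, not part of the assignment, and unlike the cell above it puts no upper bound on the test years.

```python
import pandas as pd

def split_by_year(frame, date_col, val_year):
    """Split rows by year: train before val_year, val in val_year, test after it."""
    years = pd.to_datetime(frame[date_col]).dt.year
    train = frame[years < val_year]
    val = frame[years == val_year]
    test = frame[years > val_year]
    return train, val, test

# Made-up rows for illustration only (not the burrito data).
toy = pd.DataFrame({
    'Date': ['2015-05-01', '2016-12-31', '2017-01-01',
             '2017-06-15', '2018-02-02', '2019-03-03'],
    'Great': [True, False, True, False, True, True],
})

toy_train, toy_val, toy_test = split_by_year(toy, 'Date', val_year=2017)
print(len(toy_train), len(toy_val), len(toy_test))
```

Splitting on the year alone avoids hard-coding the first and last day of each year. Rows with a missing date fall into none of the three sets.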

In [10]:
# Mean burrito length (mean() skips missing values by default)
df['Length'].mean()

20.038233215547702

In [11]:
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
# Baseline: always predict the majority class of the training set
majority = y_train.mode()[0]
y_pred = [majority] * len(y_train)

In [13]:
# Import accuracy_score and compute the majority-class baseline accuracy to beat
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)

0.5906040268456376
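The baseline above is scored on the training labels. To compare it with a validation score, the same majority class can be scored against held-out labels instead. Here is a minimal sketch using only the standard library and made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not something from the assignment.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority for label in eval_labels)
    return majority, hits / len(eval_labels)

# Made-up labels for illustration only (not the burrito data).
toy_train_labels = [False, False, False, True, True]
toy_val_labels = [False, True, True, False]

majority_label, val_baseline = majority_baseline_accuracy(
    toy_train_labels, toy_val_labels
)
print(majority_label, val_baseline)  # False 0.5
```

In the notebook, the equivalent would be to pass `[majority] * len(y_val)` to `accuracy_score` together with `y_val`, once `y_val` is defined.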

In [0]:
# Import everything else I'll need
import category_encoders as ce
from sklearn.linear_model import LogisticRegressionCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Decide on some features, define my X's and y's for train/val
features = ['Burrito', 'Cost', 'Hunger', 'Length', 'Tortilla', 'Temp',
            'Uniformity', 'Salsa', 'Synergy', 'Wrap']

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

In [25]:
# one-hot encode features
encoder = ce.OneHotEncoder(use_cat_names=True)

X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

X_train_encoded.shape, X_val_encoded.shape

((298, 14), (85, 14))

In [0]:
# impute missing values for train/val sets
imputer = SimpleImputer()

X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [0]:
# Scale train/val sets
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [29]:
model = LogisticRegressionCV(cv=5, random_state=42)

model.fit(X_train_scaled, y_train)
# Validation set score
print('Val score:', model.score(X_val_scaled, y_val))

Val score: 0.7647058823529411
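The encode, impute, scale, and fit steps above could be chained with scikit-learn pipelines, one of the stretch goals. The sketch below is not the notebook's exact setup: it runs on made-up data, swaps `category_encoders` for scikit-learn's own `OneHotEncoder`, uses plain `LogisticRegression` instead of `LogisticRegressionCV`, and scales only the numeric columns. The `toy` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only (not the burrito data).
n = 60
cost = np.linspace(5.0, 9.0, n)
cost[::10] = np.nan  # a few missing values for the imputer to fill
toy = pd.DataFrame({
    'Burrito': np.tile(['California', 'Asada', 'Other'], n // 3),
    'Cost': cost,
    'Synergy': np.tile([1.5, 2.5, 3.5, 4.5], n // 4),
})
toy_y = toy['Synergy'] >= 3.5

# One-hot encode the categorical column; impute then scale the numeric ones.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    (make_pipeline(SimpleImputer(), StandardScaler()), ['Cost', 'Synergy']),
)
pipeline = make_pipeline(preprocess, LogisticRegression())

# Fit on the first 40 rows, score on the remaining 20.
pipeline.fit(toy.iloc[:40], toy_y.iloc[:40])
val_accuracy = pipeline.score(toy.iloc[40:], toy_y.iloc[40:])
print('Toy validation accuracy:', val_accuracy)
```

With a pipeline, every preprocessing step is fitted on the training rows only and then reused as-is on the validation and test rows. This is the same discipline the separate `fit_transform` and `transform` calls above follow by hand.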


In [35]:
# Test accuracy time: apply the already-fitted encoder, imputer, and scaler to the test set, all in one cell
X_test = test[features]
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)
y_pred = model.predict(X_test_scaled)
y_test = test[target]
print('Test Acc. Score:', model.score(X_test_scaled, y_test))

Test Acc. Score: 0.7567567567567568
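For the "get and plot your coefficients" stretch goal, one option is to pair each fitted coefficient with its column name in a pandas Series. The sketch below uses made-up data and hypothetical names (`X_toy`, `toy_model`). In the notebook, the analogous inputs would presumably be `model.coef_[0]` and `X_train_encoded.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up features for illustration only (not the burrito data).
n = 40
X_toy = pd.DataFrame({
    'Synergy': np.tile([1.5, 2.5, 3.5, 4.5], n // 4),
    'Cost': np.tile([6.0, 8.0, 7.0, 7.0, 8.0], n // 5),
})
y_toy = X_toy['Synergy'] >= 3.5

toy_model = LogisticRegression().fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary target, so take row 0.
coefficients = pd.Series(toy_model.coef_[0], index=X_toy.columns).sort_values()
print(coefficients)
```

Calling `coefficients.plot.barh()` would then draw a horizontal bar chart if matplotlib is installed. Coefficients are only comparable across features when the inputs share a scale, which is why the notebook standardizes its features first.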
