<a href="https://colab.research.google.com/github/EvidenceN/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Evidence.N%20Answers_Assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/burritos/burritos.csv")

In [55]:
df.head()
df.shape

(423, 66)

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
df.head()

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
df.head()

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [0]:
df.head()

In [0]:
# 'Great' is stored as booleans (True/False). Converting it to 0/1 is optional:
# the operations below accept booleans as well as numbers.

#df['Great'] = df['Great'].replace({True: 1, False: 0})

In [0]:
df.head()

In [65]:
# Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
df['Date'] = pd.to_datetime(df['Date'])
train = df[df.Date.dt.year <= 2016]
test = df[df.Date.dt.year >= 2018]
val = df[df.Date.dt.year == 2017]
val['Date'].dt.year.value_counts()

2017    85
Name: Date, dtype: int64

In [77]:
#  Begin with baselines for classification.
# Baseline = majority class = mode

target = 'Great'

y_train = train[target]
y_train_class = y_train.value_counts(normalize=True)
print(f'Burrito rating:\n{y_train_class}')
# Only 41% of the training burritos have a rating of 4 or higher;
# 59% have a rating below 4.
# Training baseline = the most frequent class (the mode) in the data.

majority_class = y_train.mode()[0]

y_pred = [majority_class] * len(y_train)

# how accurate is our prediction?
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_train, y_pred)
print(f"accuracy: {accuracy}")

Burrito rating:
False    0.590604
True     0.409396
Name: Great, dtype: float64
accuracy: 0.5906040268456376
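
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which makes it easy to score the baseline on a validation set as well. This is a minimal sketch on made-up labels (the `y_toy_*` and `X_toy_*` arrays are hypothetical, not the burrito data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical, not the burrito data): 6 False and 4 True
y_toy_train = np.array([False] * 6 + [True] * 4)
y_toy_val = np.array([False, True, True, False, False])

# The "most_frequent" strategy ignores the features, so zeros are enough
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_val = np.zeros((len(y_toy_val), 1))

# Learn the majority class from the training labels only
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy_train, y_toy_train)

# Score the constant prediction on both splits
train_acc = accuracy_score(y_toy_train, baseline.predict(X_toy_train))
val_acc = accuracy_score(y_toy_val, baseline.predict(X_toy_val))
print(f"train baseline: {train_acc:.2f}, validation baseline: {val_acc:.2f}")
```

Scoring the baseline on the validation split gives a like-for-like number to compare against a model's validation accuracy.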


The steps for logistic regression: fit this sequence of transformers and estimator, in order:

- [category_encoders.one_hot.OneHotEncoder](https://contrib.scikit-learn.org/categorical-encoding/onehot.html)
- [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
- [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
- [sklearn.linear_model.LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
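
One way to chain steps like these is a scikit-learn pipeline, which fits each transformer on the training data only and reuses the fitted steps at prediction time. The sketch below is an illustration on synthetic numeric data, not the burrito dataset: the one-hot encoding step is left out because the toy features are already numeric, and all `*_toy`, `X_fit` and `X_holdout` names are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data (an assumption for illustration): 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0  # labels computed before adding NaNs
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # knock out about 5% of values

# Simple ordered split standing in for train/validation
X_fit, y_fit = X_toy[:150], y_toy[:150]
X_holdout, y_holdout = X_toy[150:], y_toy[150:]

# Impute (mean by default) -> standardize -> cross-validated logistic regression
pipeline = make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    LogisticRegressionCV(cv=5),
)
pipeline.fit(X_fit, y_fit)

holdout_accuracy = pipeline.score(X_holdout, y_holdout)
print(f"holdout accuracy: {holdout_accuracy:.3f}")
```

With a pipeline, a single `fit` call and a single `score` call replace the separate `fit_transform` and `transform` calls for each step.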

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

In [0]:
# Check the cardinality of the non-numeric columns before encoding.
train.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

In [0]:
# Define the target and the feature columns used for training and validation
target = 'Great'
features = train.columns.drop([target, 'Date'])

In [0]:
x_train = train[features]
x_val = val[features]
y_train = train[target]
y_val = val[target]

In [118]:
# encoding the categorical variables
encoder = ce.OneHotEncoder(use_cat_names=True)
x_train_encoded = encoder.fit_transform(x_train)
x_val_encoded = encoder.transform(x_val)

# imputing NaN values (SimpleImputer fills with the column mean by default)
imputer = SimpleImputer()
x_train_imputed = imputer.fit_transform(x_train_encoded)
x_val_imputed = imputer.transform(x_val_encoded)

# standardizing the data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_imputed)
x_val_scaled = scaler.transform(x_val_imputed)

# Fit logistic regression with built-in cross-validation.
# cv=5 requests 5-fold cross-validation. This has been the default since
# scikit-learn 0.22, but it is set explicitly here.
# n_jobs = number of CPU cores to use:
# n_jobs=1 means use one core, n_jobs=-1 means use all cores.
# random_state=42 seeds the data shuffling done by the 'sag', 'saga' and
# 'liblinear' solvers, which makes their results reproducible.
# With the default 'lbfgs' solver it has no effect.
model = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=42)
model.fit(x_train_scaled, y_train)

# checking the validation accuracy of the model. 
validation = model.score(x_val_scaled, y_val)
print(f'Validation Score: {validation}')

# accuracy score of validation data
y_val_pred = model.predict(x_val_scaled)
y_val_accuracy = accuracy_score(y_val, y_val_pred)

print(f'Accuracy Score for validation: {y_val_accuracy}')

# This validation accuracy of about 76% beats the 59% majority-class
# baseline computed on the training data earlier.
# model.score and accuracy_score return exactly the same value for a classifier.

Validation Score: 0.7647058823529411
Accuracy Score for validation: 0.7647058823529411


In [112]:
# getting the coefficients and intercept

coef = model.coef_[0]
intercept = model.intercept_
intercept

array([-1.27035802])
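
To see how the coefficients and intercept turn into predictions, the following sketch recomputes the predicted probabilities by hand with the logistic (sigmoid) function and compares them with `predict_proba`. It uses a small synthetic binary problem and a plain `LogisticRegression`; the names `toy_model`, `X_toy` and `y_toy` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=100) > 0

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Log-odds are a linear function of the features: X @ coef + intercept
log_odds = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]

# The sigmoid maps log-odds to the probability of the positive class
manual_proba = 1 / (1 + np.exp(-log_odds))

# Column 1 of predict_proba is the probability of the positive class (True)
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

A positive coefficient raises the log-odds of the positive class as its feature increases, and a negative coefficient lowers them. When features are standardized, coefficient magnitudes are easier to compare across features.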

In [119]:
# test accuracy (computed once, at the end)
y_test = test[target]
x_test = test[features]
x_test_encoded = encoder.transform(x_test)
x_test_imputed = imputer.transform(x_test_encoded)
x_test_scaled = scaler.transform(x_test_imputed)
y_pred = model.predict(x_test_scaled)
test_score = model.score(x_test_scaled, y_test)
print('Test Validation Accuracy:', test_score)
y_test_accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy Score for Testing: {y_test_accuracy}')

Test Validation Accuracy: 0.7368421052631579
Accuracy Score for Testing: 0.7368421052631579
