<a href="https://colab.research.google.com/github/maiali13/DS-Unit-2-Linear-Models/blob/master/M_Ali_DS13_U2_S1_Regression_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [8]:
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
df['Date'] = pd.to_datetime(df['Date'])

Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [0]:
train = df[df.Date.dt.year <= 2016]

val = df[df.Date.dt.year == 2017]

test = df[df.Date.dt.year >= 2018]

In [12]:
val['Date'].dt.year.value_counts()

2017    85
Name: Date, dtype: int64
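
The year-based split above can be sanity-checked by confirming that every row lands in exactly one subset. The following is a minimal sketch on a toy frame; the rows and the `toy_*` names are made up for illustration and are not taken from the burrito data.

```python
import pandas as pd

# Toy stand-in for the reviews (hypothetical rows, not real data).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-01-18', '2017-03-02',
                            '2017-11-20', '2018-02-14', '2019-08-27']),
    'Great': [False, True, True, False, True, False],
})

# Same rule as the notebook: <= 2016 train, == 2017 validate, >= 2018 test.
year = toy['Date'].dt.year
toy_train = toy[year <= 2016]
toy_val = toy[year == 2017]
toy_test = toy[year >= 2018]

# The three subsets should partition the frame: sizes add up, no row is lost.
sizes = (len(toy_train), len(toy_val), len(toy_test))
assert sum(sizes) == len(toy)
print(sizes)
```

Running the same size check on `train`, `val`, and `test` against `len(df)` would also reveal rows whose `Date` failed to parse, since those fall into none of the three subsets.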

Begin with baselines for classification.


Get your model's validation accuracy. (Multiple times if you try multiple iterations.)

Stretch goal: one-hot encoding.


In [20]:
target = 'Great'

y_train = train[target]
y_val = val[target]
# Training baseline: the share of the most frequent class in y_train.
# This is the accuracy of always predicting that class, whichever one it is.
baseline = y_train.value_counts(normalize=True).max()

# The % format spec renders the fraction as a percentage.
print(f'Majority-class baseline accuracy for {target}: {baseline:.2%}')

Majority-class baseline accuracy for Great: 59.06%


In [0]:
rating = y_train.mode()[0]
y_pred = [rating] * len(y_train)

In [24]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_train, y_pred)

print(f'Baseline training accuracy: {acc:,.3}')

Baseline training accuracy: 0.591
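
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which gives the baseline a `fit`/`predict` interface like any other model. This is a sketch on hypothetical labels (`y_toy`, `X_toy` are invented here), not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of 10 are False, so the majority class is False.
y_toy = np.array([False] * 6 + [True] * 4)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# 'most_frequent' always predicts the most common label seen during fit.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)

dummy_acc = accuracy_score(y_toy, dummy.predict(X_toy))
print(f'Majority-class baseline accuracy: {dummy_acc:.2f}')
```

Fitting the dummy on the training labels and scoring it on the validation labels would give a baseline that is directly comparable to the model's validation accuracy.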


In [0]:
features = train.columns.drop(['Date'] + [target]) #remember target must be removed!

x_train = train[features]
x_val = val[features]

In [0]:
train.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

In [0]:
import category_encoders as ce
#encode all cat variables
encoder = ce.OneHotEncoder(use_cat_names=True)
x_train_ce = encoder.fit_transform(x_train)
x_val_ce = encoder.transform(x_val)

In [0]:
from sklearn.impute import SimpleImputer
#impute the NaNs
imputer = SimpleImputer()
x_train_im = imputer.fit_transform(x_train_ce)
x_val_im = imputer.transform(x_val_ce)

In [0]:
from sklearn.preprocessing import StandardScaler
#standardize/'scale'
scaler = StandardScaler()
x_train_sc = scaler.fit_transform(x_train_im)
x_val_sc = scaler.transform(x_val_im)

Use scikit-learn for logistic regression.

In [57]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train_sc, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [58]:
# Validation accuracy of the model.
# Stored under a new name so the `val` DataFrame is not overwritten.
val_score = model.score(x_val_sc, y_val)
print(f'Validation score: {val_score}')

Validation score: 0.7647058823529411


In [59]:
# Same validation accuracy, computed explicitly from the predictions
y_val_pred = model.predict(x_val_sc)
y_val_acc = accuracy_score(y_val, y_val_pred)
print(f'Accuracy score for validation data: {y_val_acc}')
# Well above the 0.59 majority-class baseline measured on the training set

Accuracy score for validation data: 0.7647058823529411
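
The impute, scale, and fit steps above could be chained with a scikit-learn pipeline, one of the listed stretch goals. Below is a rough sketch on synthetic data; the arrays `X_demo` and `y_demo` are invented for illustration, and the `category_encoders` step is left out so the example depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the burrito reviews).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0

# Knock out roughly 10% of the values so the imputer has work to do.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# One object that imputes, scales, and fits, in that order.
pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())

# First 150 rows play the role of train, the last 50 the role of validation.
pipe.fit(X_demo[:150], y_demo[:150])
pipe_val_acc = pipe.score(X_demo[150:], y_demo[150:])
print(f'Pipeline validation accuracy: {pipe_val_acc:.3f}')
```

Because the pipeline fits its transformers only inside `fit`, calling `score` or `predict` on validation or test rows reuses the training-set statistics, which is what the separate `transform` calls do by hand.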


Get your model's test accuracy. (One time, at the end.)

In [60]:
# Test accuracy, computed once, reusing the transformers fitted on the training set
x_test = test[features]
y_test = test[target]

x_test_ce = encoder.transform(x_test)
x_test_im = imputer.transform(x_test_ce)
x_test_sc = scaler.transform(x_test_im)

y_pred = model.predict(x_test_sc)
test_score = model.score(x_test_sc, y_test)
print(f'Test accuracy: {test_score}')

Test accuracy: 0.7631578947368421


In [61]:
y_test_acc = accuracy_score(y_test, y_pred)
print(f'Accuracy Score for Testing Data: {y_test_acc}')

Accuracy Score for Testing Data: 0.7631578947368421
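
For the "get and plot your coefficients" stretch goal, one possible approach is to pair each coefficient with its feature name in a pandas Series. The sketch below uses synthetic values; the column names are borrowed from the dataset purely as labels, and `X_syn`, `y_syn`, and `clf` are hypothetical names, not objects from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption): 'Meat' is built to matter most, 'Cost' not at all.
rng = np.random.default_rng(42)
names = ['Tortilla', 'Meat', 'Cost']
X_syn = pd.DataFrame(rng.normal(size=(300, 3)), columns=names)
noise = rng.normal(scale=0.5, size=300)
y_syn = (2.0 * X_syn['Meat'] + 0.5 * X_syn['Tortilla'] + noise) > 0

# Scale first so coefficient magnitudes are comparable across features.
clf = LogisticRegression().fit(StandardScaler().fit_transform(X_syn), y_syn)

# coef_ has shape (1, n_features) for a binary target; take the single row.
coefs = pd.Series(clf.coef_[0], index=names).sort_values()
print(coefs)
# coefs.plot.barh() would draw a bar chart, assuming matplotlib is installed.
```

In the notebook itself, the feature names after one-hot encoding would come from the columns of `x_train_ce` rather than from `features`, since encoding adds columns.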
