<a href="https://colab.research.google.com/github/willstauffernorris/DS-Unit-2-Linear-Models/blob/master/Copy_of_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
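
The feature scaling and pipeline goals above fit together naturally. Here is a minimal sketch of how imputing, scaling, and logistic regression could be chained with `make_pipeline`. It runs on synthetic data, not the burrito dataset, and every variable name in it is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the burrito reviews): two numeric features
# and a binary target that depends on both of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * X[:, 1] > 0
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Each step is fit on the training data only; .score() then applies the
# already-fitted imputer and scaler to the validation data before predicting.
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(X_train, y_train)
val_accuracy = pipeline.score(X_val, y_val)
print(f'Validation accuracy: {val_accuracy:.3f}')
```

With a pipeline there are no separate `_imputed` and `_scaled` arrays to keep in sync, which makes it harder to accidentally score a model on data that went through a different preprocessing path.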

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
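
One of the stretch goals is one-hot encoding, and the cleaned `Burrito` column is the obvious candidate. A minimal sketch of what that could look like with `pd.get_dummies`, on a made-up toy frame (`toy` and `encoded` are illustrative names, not part of this notebook):

```python
import pandas as pd

# Toy frame with a few of the cleaned category labels (values are made up).
toy = pd.DataFrame({'Burrito': ['California', 'Asada', 'Other', 'California']})

# One indicator column per category, named Burrito_<category>.
encoded = pd.get_dummies(toy, columns=['Burrito'], prefix='Burrito')
print(list(encoded.columns))
print(encoded['Burrito_California'].sum())
```

`get_dummies` only creates columns for categories it sees, so train and validation frames encoded separately can end up with different columns. A fitted encoder (for example scikit-learn's `OneHotEncoder`) is one way to avoid that.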

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [106]:
df

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418,Other,8/27/2019,,,,6.00,1.0,,,17.0,20.5,0.57,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
419,Other,8/27/2019,,,,6.00,4.0,,,19.0,26.0,1.02,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
420,California,8/27/2019,,,,7.90,3.0,,,20.0,22.0,0.77,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
421,Other,8/27/2019,,,,7.90,3.0,,,22.5,24.5,1.07,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


 # Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [107]:
# infer_datetime_format is deprecated in pandas 2.x, where format inference is the default
df['Date'] = pd.to_datetime(df['Date'])
type(df['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [108]:
train = df[df['Date'] < '1/1/2017' ]
print(train.shape)
val = df[(df['Date'] < '1/1/2018') & (df['Date'] >= '1/1/2017')]
print(val.shape)
test = df[df['Date'] >= '1/1/2018']
print(test.shape)

print(f'The shape of the OG dataframe is {df.shape}')
# The three splits should add up to the full row count
print(len(train) + len(val) + len(test))

(298, 59)
(85, 59)
(38, 59)
The shape of the OG dataframe is (421, 59)
421
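
An alternative way to write the same split is to compare calendar years instead of date strings. This is only a sketch on a made-up toy frame (the `toy_*` names and the dates are illustrative), with a check that the three pieces partition the rows:

```python
import pandas as pd

# Toy frame standing in for the burrito data (values are made up).
toy = pd.DataFrame({
    'Date': pd.to_datetime([
        '2015-06-01', '2016-12-31', '2017-01-01',
        '2017-12-31', '2018-01-01', '2019-08-27',
    ]),
    'Great': [True, False, False, True, True, False],
})

year = toy['Date'].dt.year
toy_train = toy[year <= 2016]  # 2016 & earlier
toy_val = toy[year == 2017]    # 2017 only
toy_test = toy[year >= 2018]   # 2018 & later

# Every row should land in exactly one split.
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_val), len(toy_test))  # 2 2 2
```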


# Begin with baselines for classification.


In [109]:
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

### If you always guessed a burrito was not 'Great', you would be right about 59% of the time on the training set.

False    0.590604
True     0.409396
Name: Great, dtype: float64
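
The number a model actually has to beat is the majority-class baseline scored on the validation set, not on train. A small sketch of that calculation, using made-up labels in place of `y_train` and `y_val` (the `*_toy` names are illustrative):

```python
import pandas as pd

# Made-up labels standing in for y_train and y_val.
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([False, True, True, False])

# The baseline always predicts the most frequent class seen in training.
majority_class = y_train_toy.mode()[0]

# Accuracy of that constant guess on the validation labels.
baseline_val_accuracy = (y_val_toy == majority_class).mean()
print(majority_class, baseline_val_accuracy)
```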

 # Use scikit-learn for logistic regression.
 

In [0]:
from sklearn.linear_model import LogisticRegression

In [111]:
train.describe()

##Looking at the numerical features

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso
count,71.0,71.0,292.0,297.0,0.0,0.0,175.0,174.0,174.0,298.0,283.0,288.0,297.0,292.0,296.0,278.0,296.0,296.0,0.0
mean,3.897183,4.142254,6.896781,3.445286,,,19.829886,22.042241,0.77092,3.472315,3.70636,3.551215,3.519024,3.52887,3.395946,3.32464,3.540203,3.955068,
std,0.47868,0.371738,1.211412,0.85215,,,2.081275,1.685043,0.137833,0.797606,0.991897,0.869483,0.850348,1.040457,1.089044,0.971226,0.922426,1.167341,
min,2.5,2.9,2.99,0.5,,,15.0,17.0,0.4,1.4,1.0,1.0,1.0,0.5,1.0,0.0,1.0,0.0,
25%,3.5,4.0,6.25,3.0,,,18.5,21.0,0.6625,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,
50%,4.0,4.2,6.85,3.5,,,19.5,22.0,0.75,3.5,4.0,3.5,3.5,4.0,3.5,3.5,3.75,4.0,
75%,4.0,4.4,7.5,4.0,,,21.0,23.0,0.87,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,
max,4.5,4.9,11.95,5.0,,,26.0,27.0,1.24,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,


In [112]:
## This is hard to read and not too useful: Volume is continuous, so almost every
## distinct value becomes its own tiny group and the per-group 'Great' rate is noisy.
train.groupby('Volume')[target].mean().sort_values()

Volume
0.40    0.000000
1.16    0.000000
1.07    0.000000
1.00    0.000000
0.96    0.000000
0.91    0.000000
0.89    0.000000
0.81    0.000000
0.78    0.000000
1.17    0.000000
0.76    0.000000
0.69    0.000000
1.24    0.000000
0.50    0.000000
0.51    0.000000
0.55    0.000000
0.62    0.000000
0.57    0.000000
0.58    0.000000
0.61    0.000000
0.64    0.200000
0.83    0.200000
0.75    0.222222
0.90    0.250000
0.88    0.250000
0.86    0.250000
0.84    0.250000
0.60    0.250000
0.68    0.250000
0.72    0.250000
0.66    0.333333
0.92    0.333333
1.01    0.333333
0.70    0.333333
0.77    0.363636
0.74    0.400000
0.65    0.428571
0.95    0.500000
0.54    0.500000
0.87    0.571429
0.85    0.625000
0.67    0.666667
0.73    0.750000
0.93    0.800000
1.05    1.000000
0.97    1.000000
0.59    1.000000
0.71    1.000000
0.63    1.000000
0.82    1.000000
0.80    1.000000
0.79    1.000000
0.94    1.000000
0.56    1.000000
Name: Great, dtype: float64

In [0]:
## Just guessing at some good features here

features = ['Yelp', 'Google']

## Let's try it based solely on 3rd-party reviews

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)



In [185]:
##Fitting the model
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [186]:
log_reg.predict(X_val_imputed)

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])
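
When a classifier predicts almost everything as one class, like the array above, a confusion matrix shows where the errors go more clearly than accuracy does. A minimal sketch with made-up labels (not the notebook's `y_val`):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: the "model" here rarely predicts True.
y_true_toy = [False, False, False, True, True, True]
y_pred_toy = [False, False, False, False, False, True]

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[False, True])
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 0 2 1
```

Here the two false negatives are 'Great' burritos the model missed, which is the failure mode to expect from a model that mostly predicts `False`.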

In [187]:
log_reg.coef_

## Yelp has the larger positive coefficient (0.57 vs. 0.30 for Google).
## These are log-odds coefficients on unscaled features, not correlations.

array([[0.56985211, 0.29980575]])
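
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. A short sketch using the two values copied from the output above (the loop and variable names are just for illustration):

```python
import numpy as np

features = ['Yelp', 'Google']
coef = np.array([[0.56985211, 0.29980575]])  # copied from log_reg.coef_ above

# exp(coefficient) is the factor by which the odds of 'Great' are multiplied
# when that feature goes up by one unit, holding the other feature fixed.
odds_ratios = np.exp(coef[0])
for name, ratio in zip(features, odds_ratios):
    print(f'{name}: odds multiply by {ratio:.2f} per +1 rating point')
```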

In [0]:
## Applying StandardScaler to the features before fitting the model

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
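
The scaler is fit on the training data only and then reused on validation. A quick sanity check of what that implies, sketched on synthetic numbers (the `*_toy` names are made up): the training columns come out with mean 0 and standard deviation 1, while the validation columns are only close to that.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic ratings-like data, not the burrito features.
rng = np.random.default_rng(42)
X_train_toy = rng.normal(loc=4.0, scale=0.5, size=(100, 2))
X_val_toy = rng.normal(loc=4.0, scale=0.5, size=(30, 2))

scaler_toy = StandardScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from train
X_val_toy_scaled = scaler_toy.transform(X_val_toy)          # reuses train's mean/std

print(X_train_toy_scaled.mean(axis=0).round(6), X_train_toy_scaled.std(axis=0).round(6))
print(X_val_toy_scaled.mean(axis=0).round(2), X_val_toy_scaled.std(axis=0).round(2))
```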

In [190]:
from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV()
model.fit(X_train_scaled, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

# Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
 

In [191]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_val_scaled)
accuracy_score(y_val, y_pred)

0.5529411764705883

In [192]:
print(f'For the {len(features)} features {features}, the Validation Accuracy is:')
model.score(X_val_scaled, y_val)

## The shortcut method: model.score() predicts and computes accuracy in one call

For the 2 features ['Yelp', 'Google'], the Validation Accuracy is:


0.5529411764705883

# Get your model's test accuracy. (One time, at the end.)
 

In [193]:
X_test = test[features]
X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)
y_pred = model.predict(X_test_scaled)
y_test = test[target]
y_test.shape

(38,)

In [197]:
print(f'For the {len(features)} features {features}, the **Test Accuracy** is:')
model.score(X_test_scaled, y_test)

For the 2 features ['Yelp', 'Google'], the **Test Accuracy** is:


0.42105263157894735

# Commit your notebook to your fork of the GitHub repo.

# Adding a few more features to my training model.


In [165]:
features = ['Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum']
            #'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings']

## Now try cost, hunger, and the physical measurements instead of 3rd-party reviews

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)

log_reg.predict(X_val_imputed)

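# Only 4 coefficients come back for the 6 features: 'Mass (g)' and 'Density (g/mL)'
# have no observed values in train (count 0.0 in train.describe() above), and
# SimpleImputer drops all-missing columns by default. The 4 remaining columns
# are Cost, Hunger, Length, Circum.
log_reg.coef_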

array([[0.24009568, 0.40290208, 0.04436039, 0.04912141]])
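
Four coefficients for six features is worth a closer look. By default, `SimpleImputer` discards any column that had no observed values when it was fit, which would explain the count here. A small sketch on a made-up frame that reproduces the effect and recovers which columns survive (`toy`, `imputer_toy`, and `kept` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame: 'Mass (g)' is entirely missing, like in the training split.
toy = pd.DataFrame({
    'Cost': [6.5, np.nan, 7.0],
    'Mass (g)': [np.nan, np.nan, np.nan],
    'Length': [19.0, 20.0, np.nan],
})

imputer_toy = SimpleImputer()  # default strategy='mean'
toy_imputed = imputer_toy.fit_transform(toy)

# statistics_ holds the per-column fill value; it is NaN for all-missing columns,
# so it can be used to line the surviving columns back up with their names.
kept = [col for col, stat in zip(toy.columns, imputer_toy.statistics_)
        if not np.isnan(stat)]
print(toy_imputed.shape, kept)
```

Lining up `kept` with `coef_` avoids attributing a coefficient to the wrong feature once columns have been dropped.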

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [167]:
from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV()
model.fit(X_train_scaled, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [168]:
print(f'For the {len(features)} features {features}, the Validation Accuracy is:')
model.score(X_val_scaled, y_val)

For the 6 features ['Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)', 'Length', 'Circum'], the Validation Accuracy is:


0.5529411764705883

In [0]:
X_test = test[features]
X_test_imputed = imputer.transform(X_test)
X_test_scaled = scaler.transform(X_test_imputed)
y_pred = model.predict(X_test_scaled)

In [198]:
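# NOTE: the output recorded below names the 2 features ['Yelp', 'Google'],
# so this cell was last run with the earlier feature list. Re-run it after
# the 6-feature cells above to get the test accuracy for these features.
print(f'For the {len(features)} features {features}, the **Test Accuracy** is:')
model.score(X_test_scaled, y_test)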

For the 2 features ['Yelp', 'Google'], the **Test Accuracy** is:


0.42105263157894735

In [0]:
#train.info()

In [153]:
train.shape

(298, 59)

In [0]:
#test.info()

In [154]:
test.shape

(38, 59)