Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [19]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [20]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Yelp,87.0,3.887356,0.475396,2.5,3.5,4.0,4.0,4.5
Google,87.0,4.167816,0.373698,2.9,4.0,4.2,4.4,5.0
Cost,416.0,7.065216,1.503645,2.99,6.25,6.99,7.86,25.0
Hunger,420.0,3.496095,0.811466,0.5,3.0,3.5,4.0,5.0
Mass (g),22.0,546.181818,144.445619,350.0,450.0,540.0,595.0,925.0
Density (g/mL),22.0,0.675277,0.080468,0.56,0.619485,0.658099,0.721726,0.865672
Length,284.0,20.046901,2.084957,15.0,18.5,20.0,21.5,26.0
Circum,282.0,22.131738,1.777526,17.0,21.0,22.0,23.0,29.0
Volume,282.0,0.786489,0.15226,0.4,0.68,0.77,0.88,1.54
Tortilla,423.0,3.519385,0.793301,1.0,3.0,3.5,4.0,5.0


In [21]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Location,423,108,Taco Villa,28
Burrito,423,132,California,101
Date,423,169,8/30/2016,29
Neighborhood,92,41,Clairemont,9
Address,88,87,9500 Gilman Dr,2
URL,87,86,https://www.yelp.com/biz/el-dorado-mexican-foo...,2
Chips,26,4,x,21
Rec,233,6,Yes,157
Reviewer,422,106,Scott,147
Notes,146,145,Bland,2


In [22]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [23]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [24]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [25]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])
df=df.dropna(axis=1, how='all')

In [48]:
object_columns=[]
for column in df.columns: 
    if df[column].dtype=='object':
        print(column)
        print(df[column].value_counts(), '\n\n')
        object_columns.append(column)

Burrito
California     169
Other          156
Asada           43
Surf & Turf     28
Carnitas        25
Name: Burrito, dtype: int64 


Date
8/30/2016    29
8/27/2019     9
6/24/2016     9
5/13/2016     7
5/6/2016      7
             ..
2/23/2017     1
3/16/2017     1
7/19/2017     1
1/2/2018      1
6/9/2016      1
Name: Date, Length: 169, dtype: int64 


Chips
x      21
X       3
Yes     1
No      1
Name: Chips, dtype: int64 


Unreliable
x    33
Name: Unreliable, dtype: int64 


NonSD
x    5
X    2
Name: NonSD, dtype: int64 


Beef
x    137
X     42
Name: Beef, dtype: int64 


Pico
x    127
X     31
Name: Pico, dtype: int64 


Guac
x    114
X     40
Name: Guac, dtype: int64 


Cheese
x    128
X     31
Name: Cheese, dtype: int64 


Fries
x    102
X     25
Name: Fries, dtype: int64 


Sour cream
x    67
X    25
Name: Sour cream, dtype: int64 


Pork
x    36
X    15
Name: Pork, dtype: int64 


Chicken
x    20
X     1
Name: Chicken, dtype: int64 


Shrimp
x    17
X     4
Name: Shrimp, dtyp

In [64]:
import numpy as np

for column in object_columns[2:]:
    df[column]=df[column].str.lower()
    df[column]=df[column].str.replace('x', '1')
    df[column]=    df[column].str.replace('yes', '1')
    df[column]=df[column].str.replace('no', '0')
#    df[column]=df[column].str.replace(np.nan, '0')
    df[column]=pd.to_numeric(df[column])
    print(column)
    print(df[column].value_counts(), '\n\n')
    
    
df

Chips
1.0    25
0.0     1
Name: Chips, dtype: int64 


Unreliable
1.0    33
Name: Unreliable, dtype: int64 


NonSD
1.0    7
Name: NonSD, dtype: int64 


Beef
1.0    179
Name: Beef, dtype: int64 


Pico
1.0    158
Name: Pico, dtype: int64 


Guac
1.0    154
Name: Guac, dtype: int64 


Cheese
1.0    159
Name: Cheese, dtype: int64 


Fries
1.0    127
Name: Fries, dtype: int64 


Sour cream
1.0    92
Name: Sour cream, dtype: int64 


Pork
1.0    51
Name: Pork, dtype: int64 


Chicken
1.0    21
Name: Chicken, dtype: int64 


Shrimp
1.0    21
Name: Shrimp, dtype: int64 


Fish
1.0    6
Name: Fish, dtype: int64 


Rice
1.0    36
Name: Rice, dtype: int64 


Beans
1.0    35
Name: Beans, dtype: int64 


Lettuce
1.0    11
Name: Lettuce, dtype: int64 


Tomato
1.0    7
Name: Tomato, dtype: int64 


Bell peper
1.0    7
Name: Bell peper, dtype: int64 


Carrots
1.0    1
Name: Carrots, dtype: int64 


Cabbage
1.0    8
Name: Cabbage, dtype: int64 


Sauce
1.0    38
Name: Sauce, dtype: int64 


Salsa.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: htt

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,...,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,...,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,...,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,...,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,...,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,1.0,6.59,4.0,,,,...,,,,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418,Other,8/27/2019,,,,6.00,1.0,,,17.0,...,,,,,,,,,,False
419,Other,8/27/2019,,,,6.00,4.0,,,19.0,...,,,,,,,,,,True
420,California,8/27/2019,,,,7.90,3.0,,,20.0,...,,,,,,,,,,False
421,Other,8/27/2019,,,,7.90,3.0,,,22.5,...,,,,,,,,,,True


In [68]:
df.describe(include='object')

Unnamed: 0,Burrito,Date
count,421,421
unique,5,169
top,California,8/30/2016
freq,169,29


#### Now I can do a test/train split.
Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [81]:

train = df[pd.to_datetime(df['Date']).dt.year<2017]
val=df[pd.to_datetime(df['Date']).dt.year==2017]
test= df[pd.to_datetime(df['Date']).dt.year>2017]

train.shape, val.shape, test.shape

((298, 58), (85, 58), (38, 58))

In [110]:
target='Great'
features=train.columns.drop('Date')

X_train=train[features]
X_val=val[features]
X_test=test[features]

y_train=train[target]
y_val=val[target]
y_test=test[target]

<h3>Begin with baselines for classification</h3>
Determine majority class


In [111]:
majority_class=y_train.mode()
y_pred=[majority_class]*len(y_train)

Use classification metric: accuracy 

In [112]:
from sklearn.metrics import accuracy_score
accuracy_train= accuracy_score(y_train, y_pred)

y_pred=[majority_class]*len(y_val)
accuracy_val = accuracy_score(y_val, y_pred)

print(f'Baseline: majority class of our data set is {majority_class[0]}, \nour model accuracy is TRAIN: {accuracy_train:.2f}, \nVAL: {accuracy_val:.2f}')


Baseline: majority class of our data set is False, 
our model accuracy is TRAIN: 0.59, 
VAL: 0.55


<h3>Use scikit-learn for logistic regression.</h3>

<h4>First, transformer and estimator sequence</h4>
<li>category_encoders.one_hot.OneHotEncoder</li>
<li>sklearn.impute.SimpleImputer</li>
<li>sklearn.preprocessing.StandardScaler</li>
<li>sklearn.linear_model.LogisticRegressionCV</li>

In [113]:

import category_encoders as ce 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV


In [114]:
#Encoding Burrito column

X_train.head()

Unnamed: 0,Burrito,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,...,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,3.5,4.2,,6.49,3.0,,,,,...,,,,,,,,,,False
1,California,3.5,3.3,,5.45,3.5,,,,,...,,,,,,,,,,False
2,Carnitas,,,,4.85,1.5,,,,,...,,,,,,,,,,False
3,Asada,,,,5.25,2.0,,,,,...,,,,,,,,,,False
4,California,4.0,3.8,1.0,6.59,4.0,,,,,...,,,,,,,,,,True


In [116]:
encoder=ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)


In [117]:
X_train_encoded.head()

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Yelp,Google,Chips,Cost,Hunger,...,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,1,0,0,0,0,3.5,4.2,,6.49,3.0,...,,,,,,,,,,False
1,1,0,0,0,0,3.5,3.3,,5.45,3.5,...,,,,,,,,,,False
2,0,1,0,0,0,,,,4.85,1.5,...,,,,,,,,,,False
3,0,0,1,0,0,,,,5.25,2.0,...,,,,,,,,,,False
4,1,0,0,0,0,4.0,3.8,1.0,6.59,4.0,...,,,,,,,,,,True
