Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Other imports
from sklearn.model_selection import train_test_split
import numpy

In [3]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [4]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [5]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [6]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [7]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [8]:
# Look at data
df

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418,Other,8/27/2019,,,,6.00,1.0,,,17.0,20.5,0.57,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
419,Other,8/27/2019,,,,6.00,4.0,,,19.0,26.0,1.02,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
420,California,8/27/2019,,,,7.90,3.0,,,20.0,22.0,0.77,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
421,Other,8/27/2019,,,,7.90,3.0,,,22.5,24.5,1.07,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


## **Do train/validate/test split:**
--------------------
Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [9]:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(['Date'])
df = df.sort_index()

In [13]:
df.index.unique()

DatetimeIndex(['2011-05-16', '2015-04-20', '2016-01-18', '2016-01-24',
               '2016-01-27', '2016-01-28', '2016-01-30', '2016-02-01',
               '2016-02-06', '2016-02-11',
               ...
               '2018-04-20', '2018-04-21', '2018-04-29', '2018-05-01',
               '2018-05-07', '2018-05-22', '2018-11-06', '2019-08-24',
               '2019-08-27', '2026-04-25'],
              dtype='datetime64[ns]', name='Date', length=169, freq=None)

In [None]:
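# Date is the sorted DatetimeIndex, so label slicing with .loc selects date ranges.
# Note: the index contains one 2026-04-25 entry (likely a data-entry typo);
# with this split it falls into test.
train = df.loc[:'2016-12-31']              # 2016 & earlier
val   = df.loc['2017-01-01':'2017-12-31']  # 2017
test  = df.loc['2018-01-01':]              # 2018 & later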

In [9]:
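# Same split written as boolean masks on the index year.
# The earlier version of this cell compared against '2016-30-12' (not a valid
# date), never used val1, and gave val no lower bound, so the parts overlapped:
# it reported 79 + 111 + 310 = 500 rows from a 421-row frame.
# If Date still holds raw 'M/D/YYYY' strings, comparisons are also lexicographic
# rather than chronological, so convert to datetimes first (as done above).
year  = df.index.year
train = df[year <= 2016]
val   = df[year == 2017]
test  = df[year >= 2018]

# The three parts should now partition df with no overlap
assert len(train) + len(val) + len(test) == len(df)
train.shape, val.shape, test.shape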

In [9]:
df['Date'].unique()

array(['1/18/2016', '1/24/2016', '1/27/2016', '1/28/2016', '1/30/2016',
       '2/1/2016', '2/6/2016', '2/11/2016', '2/12/2016', '2/14/2016',
       '2/17/2016', '2/24/2016', '2/28/2016', '2/29/2016', '3/3/2016',
       '3/8/2016', '3/11/2016', '3/14/2016', '3/17/2016', '3/18/2016',
       '3/19/2016', '3/20/2016', '3/21/2016', '3/23/2016', '3/30/2016',
       '4/2/2016', '4/3/2016', '4/7/2016', '4/9/2016', '4/14/2016',
       '4/15/2016', '4/24/2016', '4/25/2026', '4/27/2016', '5/4/2016',
       '5/6/2016', '5/9/2016', '5/12/2016', '5/13/2016', '5/15/2016',
       '5/18/2016', '5/20/2016', '5/22/2016', '5/16/2016', '5/16/2011',
       '4/20/2015', '5/23/2016', '5/24/2016', '5/26/2016', '5/27/2016',
       '5/29/2016', '6/1/2016', '5/21/2016', '6/20/2016', '5/31/2016',
       '5/5/2016', '5/19/2016', '5/28/2016', '6/3/2016', '6/5/2016',
       '6/6/2016', '6/8/2016', '6/9/2016', '6/11/2016', '6/16/2016',
       '6/23/2016', '6/24/2016', '8/1/2016', '8/6/2016', '8/9/2016',
       '8/10/

## **Begin with baselines for classification:**
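
A common baseline for classification is to always predict the majority class from the training set. The sketch below shows the idea on a small made-up target; `y_train` and `y_val` here are hypothetical stand-ins for the `Great` column of the real `train` and `val` splits, not values from the burrito data.

```python
import pandas as pd

# Toy stand-ins for the 'Great' target (assumption: not real burrito data)
y_train = pd.Series([False, False, True, False, True, False])
y_val = pd.Series([False, True, False, False])

# Most frequent class in the training set
majority_class = y_train.mode()[0]

# Baseline: predict the majority class for every validation row
baseline_pred = pd.Series([majority_class] * len(y_val), index=y_val.index)
baseline_accuracy = (baseline_pred == y_val).mean()

print('Majority class:', majority_class)
print('Baseline validation accuracy:', baseline_accuracy)
```

Any model worth keeping should beat this number on the validation set.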

## **Use scikit-learn for logistic regression:**
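
One possible way to set this up is a small pipeline that imputes missing values, scales the features, and fits `LogisticRegression`. This is a minimal sketch under assumptions: the rows are made up, and the three feature names (`Tortilla`, `Meat`, `Synergy`) are just a hand-picked subset of the numeric columns shown in the data preview above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows (assumption: illustrative values, not real reviews)
X_train = pd.DataFrame({
    'Tortilla': [3.0, 2.0, 4.0, 5.0, 2.5, 4.5],
    'Meat':     [3.0, 2.5, 4.0, np.nan, 2.0, 4.5],
    'Synergy':  [4.0, 2.5, 4.5, 5.0, 2.0, 4.0],
})
y_train = pd.Series([False, False, True, True, False, True])

# Impute NaNs with the column mean, standardize, then fit logistic regression
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_train, y_train)

pred = model.predict(X_train)         # hard class predictions
proba = model.predict_proba(X_train)  # one probability column per class
print(pred)
print(proba.round(2))
```

Imputation matters here because many burrito columns contain `NaN`, which `LogisticRegression` does not accept directly.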

## **Get your model's validation accuracy:**
(Multiple times if you try multiple iterations.)
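
When trying several iterations, it can help to record each one's validation accuracy side by side. The sketch below does this for a few values of the regularization strength `C`; the data is synthetic (generated with NumPy) and the list of `C` values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption: not the burrito features)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
y_train = X_train[:, 0] + X_train[:, 1] > 0
X_val = rng.normal(size=(30, 3))
y_val = X_val[:, 0] + X_val[:, 1] > 0

# Fit one model per setting and score each on the validation set only
val_scores = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    val_scores[C] = accuracy_score(y_val, model.predict(X_val))

best_C = max(val_scores, key=val_scores.get)
print(val_scores)
print('Best C on validation:', best_C)
```

The test set stays untouched during this loop; only validation accuracy drives the choice between iterations.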

## **Get your model's test accuracy:**
(One time, at the end.)
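
Once the model choice is final, the test set is scored a single time. A minimal sketch, again on synthetic data, with the hypothetical helper `make_split` standing in for the real train and test splits:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_split(n):
    """Hypothetical helper: generate n synthetic rows and a boolean target."""
    X = rng.normal(size=(n, 3))
    y = X[:, 0] - X[:, 2] > 0
    return X, y

X_train, y_train = make_split(80)
X_test, y_test = make_split(40)

# Fit the final, already-chosen model
final_model = LogisticRegression()
final_model.fit(X_train, y_train)

# Score on test exactly once; .score returns accuracy for classifiers
test_accuracy = final_model.score(X_test, y_test)

# Majority-class baseline on the same test rows, for context
majority = bool(np.bincount(y_train.astype(int)).argmax())
baseline_test_accuracy = (y_test == majority).mean()

print('Test accuracy:', test_accuracy)
print('Baseline test accuracy:', baseline_test_accuracy)
```

Reporting the baseline next to the test score makes it easier to judge how much the model actually adds.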