Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
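
The checklist above can be sketched end to end on a tiny made-up table. The column names `Date`, `Tortilla`, `Meat` and `Great` mirror the burrito data, but the rows, the helper name `split_by_year`, and the choice of two features are assumptions made only for illustration. Treat it as a minimal sketch of the workflow, not as the assignment's solution.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy rows (made up) standing in for the burrito reviews.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-01-18', '2016-03-02',
                            '2016-08-30', '2016-11-14', '2017-02-01',
                            '2017-05-09', '2017-09-21', '2018-04-04',
                            '2019-08-27']),
    'Tortilla': [2.0, 3.0, 4.5, 4.0, 2.5, 4.5, 2.0, 3.0, 4.0, 2.5],
    'Meat':     [2.5, 3.0, 4.0, 4.5, 2.0, 4.0, 2.5, 3.5, 4.5, 2.0],
    'Great':    [False, False, True, True, False,
                 True, False, False, True, False],
})


def split_by_year(frame):
    """Hypothetical helper: train <= 2016, validate on 2017, test >= 2018."""
    year = frame['Date'].dt.year
    return (frame[year <= 2016].copy(),
            frame[year == 2017].copy(),
            frame[year >= 2018].copy())


train_toy, val_toy, test_toy = split_by_year(toy)
features, target = ['Tortilla', 'Meat'], 'Great'

# Baseline: always predict the majority class seen in training.
majority = train_toy[target].mode()[0]
baseline_val_acc = (val_toy[target] == majority).mean()

# Logistic regression fit on the training rows only, scored on validation.
model = LogisticRegression()
model.fit(train_toy[features], train_toy[target])
val_acc = accuracy_score(val_toy[target], model.predict(val_toy[features]))

print(f'baseline validation accuracy: {baseline_val_acc:.2f}')
print(f'model validation accuracy:    {val_acc:.2f}')
```

The test rows stay untouched until the very end, which matches the "one time, at the end" item in the checklist.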


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
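
Several of these stretch goals (one-hot encoding, feature scaling, coefficients, pipelines) can be combined in one scikit-learn pipeline. The sketch below uses made-up rows; only the column names echo the burrito data, and scikit-learn's `OneHotEncoder` is an assumption here (the `category_encoders` package installed below would be another option).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; one missing Cost shows why imputation comes first.
X_toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other',
                'California', 'Carnitas', 'Other'],
    'Cost': [6.49, 5.25, 7.90, None, 4.85, 6.00],
    'Synergy': [4.0, 4.0, 2.0, 4.5, 3.0, 5.0],
})
y_toy = pd.Series([False, False, False, True, False, True], name='Great')

numeric = ['Cost', 'Synergy']
categorical = ['Burrito']

# Impute then scale the numeric columns; one-hot encode the category.
preprocess = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()),
     numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

pipe = make_pipeline(preprocess, LogisticRegression())
pipe.fit(X_toy, y_toy)

# One coefficient per transformed column:
# 2 scaled numeric columns + 4 one-hot burrito categories.
coefs = pipe.named_steps['logisticregression'].coef_[0]
print(len(coefs))
```

Because the preprocessing lives inside the pipeline, the same fitted object can later be applied to validation or test rows without refitting the scaler or encoder on them.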

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [46]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
pd.set_option('display.max_columns', 999)
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [47]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [48]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
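
One detail of the cell above is easy to miss: the four masks are computed before any label is assigned, and the `.loc` assignments run in order, so a name that matches several keywords ends up with the last matching label. A small sketch with invented burrito names (the `is_*` variable names are chosen here only to avoid clobbering the notebook's own masks):

```python
import pandas as pd

# Invented names; 'California carnitas' matches two keywords.
toy = pd.DataFrame({'Burrito': ['California Everything', 'Carne asada',
                                'Surf and turf', 'California carnitas',
                                'Bean and cheese']})
toy['Burrito'] = toy['Burrito'].str.lower()

is_california = toy['Burrito'].str.contains('california')
is_asada = toy['Burrito'].str.contains('asada')
is_surf = toy['Burrito'].str.contains('surf')
is_carnitas = toy['Burrito'].str.contains('carnitas')

# Same assignment order as the cell above: later labels overwrite earlier ones.
toy.loc[is_california, 'Burrito'] = 'California'
toy.loc[is_asada, 'Burrito'] = 'Asada'
toy.loc[is_surf, 'Burrito'] = 'Surf & Turf'
toy.loc[is_carnitas, 'Burrito'] = 'Carnitas'
toy.loc[~is_california & ~is_asada & ~is_surf & ~is_carnitas,
        'Burrito'] = 'Other'

labels = toy['Burrito'].tolist()
print(labels)
# ['California', 'Asada', 'Surf & Turf', 'Carnitas', 'Other']
```

Here 'California carnitas' is labelled 'Carnitas' rather than 'California', because the carnitas assignment runs last.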

In [49]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [50]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [51]:
df['Date'] = pd.to_datetime(df['Date'])

In [52]:
from IPython.display import display
display(df.columns)
display(df)

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini',
       'Great'],
      dtype='object')

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418,Other,2019-08-27,,,,6.00,1.0,,,17.0,20.5,0.57,5.0,4.0,3.5,,4.0,4.0,2.0,2.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
419,Other,2019-08-27,,,,6.00,4.0,,,19.0,26.0,1.02,4.0,5.0,,3.5,4.0,4.0,5.0,4.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True
420,California,2019-08-27,,,,7.90,3.0,,,20.0,22.0,0.77,4.0,4.0,4.0,3.7,3.0,2.0,3.5,4.0,4.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
421,Other,2019-08-27,,,,7.90,3.0,,,22.5,24.5,1.07,5.0,2.0,5.0,5.0,5.0,2.0,5.0,5.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [53]:
# Split by review year: train on 2016 and earlier, validate on 2017, test on 2018 and later.
train = df[df['Date'].dt.year < 2017]
val = df[df['Date'].dt.year == 2017]
test = df[df['Date'].dt.year > 2017]
df.shape, train.shape, val.shape, test.shape

((421, 59), (298, 59), (85, 59), (38, 59))

In [59]:
for c in ['Burrito', 'Date', 'Chips', 'Unreliable', 'NonSD', 'Beef', 'Pico',
          'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp',
          'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
          'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
          'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg',
          'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini', 'Great']:
    display(train[c].value_counts(dropna=False))

California     118
Other          110
Asada           35
Surf & Turf     21
Carnitas        14
Name: Burrito, dtype: int64

2016-08-30    29
2016-06-24     9
2016-05-13     7
2016-05-06     7
2016-04-15     7
              ..
2016-10-07     1
2016-09-26     1
2016-11-14     1
2016-09-13     1
2016-12-15     1
Name: Date, Length: 110, dtype: int64

False    276
True      22
Name: Chips, dtype: int64

False    271
True      27
Name: Unreliable, dtype: int64

False    293
True       5
Name: NonSD, dtype: int64

True     168
False    130
Name: Beef, dtype: int64

False    155
True     143
Name: Pico, dtype: int64

False    159
True     139
Name: Guac, dtype: int64

True     149
False    149
Name: Cheese, dtype: int64

False    179
True     119
Name: Fries, dtype: int64

False    213
True      85
Name: Sour cream, dtype: int64

False    255
True      43
Name: Pork, dtype: int64

False    278
True      20
Name: Chicken, dtype: int64

False    278
True      20
Name: Shrimp, dtype: int64

False    293
True       5
Name: Fish, dtype: int64

False    265
True      33
Name: Rice, dtype: int64

False    266
True      32
Name: Beans, dtype: int64

False    287
True      11
Name: Lettuce, dtype: int64

False    291
True       7
Name: Tomato, dtype: int64

False    291
True       7
Name: Bell peper, dtype: int64

False    297
True       1
Name: Carrots, dtype: int64

False    291
True       7
Name: Cabbage, dtype: int64

False    261
True      37
Name: Sauce, dtype: int64

False    292
True       6
Name: Salsa.1, dtype: int64

False    283
True      15
Name: Cilantro, dtype: int64

False    281
True      17
Name: Onion, dtype: int64

False    294
True       4
Name: Taquito, dtype: int64

False    291
True       7
Name: Pineapple, dtype: int64

False    297
True       1
Name: Ham, dtype: int64

False    294
True       4
Name: Chile relleno, dtype: int64

False    294
True       4
Name: Nopales, dtype: int64

False    297
True       1
Name: Lobster, dtype: int64

False    294
True       4
Name: Egg, dtype: int64

False    295
True       3
Name: Mushroom, dtype: int64

False    295
True       3
Name: Bacon, dtype: int64

False    296
True       2
Name: Sushi, dtype: int64

False    285
True      13
Name: Avocado, dtype: int64

False    296
True       2
Name: Corn, dtype: int64

False    297
True       1
Name: Zucchini, dtype: int64

False    176
True     122
Name: Great, dtype: int64

In [58]:
import numpy as np

# Work on an explicit copy so the assignments below do not raise
# pandas' SettingWithCopyWarning (train was sliced from df).
train = train.copy()

# Map the ingredient flags ('x', 'X', 'Yes', 'No', missing) to booleans;
# any value that is not a key of mp passes through unchanged.
mp = {'x': True, 'X': True, np.nan: False, 'Yes': True, 'No': False}
for c in train:
    train[c] = train[c].map(lambda x: mp[x] if x in mp else x)
train


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,False,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,False,False,True,True,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
1,California,2016-01-24,3.5,3.3,False,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,False,False,True,True,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
2,Carnitas,2016-01-24,,,False,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,False,False,False,True,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
3,Asada,2016-01-24,,,False,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,False,False,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
4,California,2016-01-27,4.0,3.8,True,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,False,False,True,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,California,2016-12-02,4.0,4.3,False,5.65,3.0,,,19.5,22.0,0.75,4.0,1.5,2.0,3.0,4.2,4.0,3.0,2.0,4.5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
297,Other,2016-12-02,,,False,5.49,3.0,,,19.0,20.5,0.64,4.5,5.0,2.0,2.0,2.5,3.5,3.0,2.5,3.0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
298,California,2016-12-10,3.5,3.7,False,7.75,4.0,,,20.0,21.0,0.70,3.5,2.5,3.0,3.3,1.4,2.3,2.2,3.3,4.5,False,False,True,True,False,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False
299,Asada,2016-12-10,,,False,7.75,4.0,,,19.5,21.0,0.68,4.0,4.5,2.0,2.0,3.5,3.5,2.0,2.0,4.0,False,False,True,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,False,False,False,False,False,False,False,False


In [60]:
# 'Date' is a datetime column and cannot be summed
# (TypeError: DatetimeIndex cannot perform the operation sum), so leave it out.
for c in train.drop(columns='Date'):
    train[c].sum()

In [63]:
train = train.drop(['Mass (g)', 'Density (g/mL)', 'Queso'], axis=1)

In [69]:
for c in train:
    display(train[c].value_counts())

California     118
Other          110
Asada           35
Surf & Turf     21
Carnitas        14
Name: Burrito, dtype: int64

2016-08-30    29
2016-06-24     9
2016-05-13     7
2016-05-06     7
2016-04-15     7
              ..
2016-10-07     1
2016-09-26     1
2016-11-14     1
2016-09-13     1
2016-12-15     1
Name: Date, Length: 110, dtype: int64

4.0    34
4.5    15
3.5    14
3.0     5
2.5     2
4.2     1
Name: Yelp, dtype: int64

4.4    11
4.1    11
4.2     9
4.0     8
3.9     5
4.3     5
4.7     3
4.6     3
3.8     3
4.5     3
3.4     2
4.9     2
3.7     2
3.3     2
2.9     1
3.5     1
Name: Google, dtype: int64

False    276
True      22
Name: Chips, dtype: int64

6.25    38
6.99    16
7.50    15
8.25    15
7.49    13
        ..
6.02     1
3.99     1
6.70     1
4.59     1
6.30     1
Name: Cost, Length: 80, dtype: int64

4.00    85
3.00    59
3.50    56
2.00    21
2.50    18
4.50    16
5.00    16
1.00     6
3.75     3
4.30     2
3.70     2
3.20     2
1.50     2
2.80     1
4.75     1
4.25     1
0.50     1
3.80     1
4.10     1
2.30     1
2.20     1
3.90     1
Name: Hunger, dtype: int64

20.00    20
19.00    19
18.50    16
18.00    16
19.50    13
20.50    13
22.00    10
17.00     9
22.50     9
23.00     9
21.00     8
21.50     7
17.50     7
16.50     5
23.50     3
25.50     2
24.00     1
26.00     1
25.00     1
20.75     1
15.00     1
15.50     1
17.78     1
16.00     1
17.70     1
Name: Length, dtype: int64

22.000    29
21.000    19
21.500    17
23.000    15
23.500    15
20.000    14
20.500    13
22.500    11
24.000     7
25.000     6
24.500     5
25.500     4
19.500     4
19.000     3
22.125     1
22.225     1
20.800     1
21.750     1
26.000     1
26.500     1
18.000     1
27.000     1
17.000     1
18.500     1
21.200     1
22.750     1
Name: Circum, dtype: int64

0.65    14
0.77    11
0.75     9
0.68     8
0.85     8
0.87     7
0.70     6
0.74     5
0.83     5
0.64     5
0.93     5
0.86     4
0.73     4
0.72     4
0.88     4
0.84     4
0.60     4
0.90     4
0.79     3
1.01     3
0.57     3
0.66     3
0.94     3
0.96     3
0.69     3
0.92     3
0.67     3
0.62     2
0.82     2
0.97     2
0.71     2
0.91     2
0.63     2
0.51     2
0.95     2
0.54     2
0.59     1
0.50     1
1.00     1
0.76     1
0.78     1
0.56     1
0.58     1
0.81     1
0.40     1
1.07     1
1.24     1
0.55     1
0.61     1
0.89     1
1.16     1
0.80     1
1.17     1
1.05     1
Name: Volume, dtype: int64

4.00    92
3.00    72
3.50    45
2.00    21
4.50    19
5.00    15
2.50    15
1.50     6
3.80     5
3.60     2
1.40     1
2.80     1
3.20     1
2.10     1
4.80     1
3.75     1
Name: Tortilla, dtype: int64

4.0    70
5.0    51
3.0    40
4.5    35
2.5    25
3.5    25
2.0    20
1.5     3
1.0     3
3.8     3
4.7     2
4.4     1
1.3     1
3.6     1
3.2     1
1.9     1
3.7     1
Name: Temp, dtype: int64

4.00    81
3.00    51
3.50    47
5.00    25
4.50    25
2.00    19
2.50    16
1.50     5
1.00     3
3.70     2
2.75     2
3.80     2
3.30     2
4.20     2
3.75     1
2.60     1
4.70     1
2.70     1
2.80     1
3.20     1
Name: Meat, dtype: int64

4.00    81
3.00    66
3.50    45
5.00    25
4.50    21
2.00    20
2.50    18
2.80     4
1.00     3
4.20     2
2.40     2
4.70     2
1.50     2
3.40     1
2.75     1
4.30     1
3.20     1
3.30     1
4.40     1
Name: Fillings, dtype: int64

4.00    83
3.00    39
3.50    32
5.00    32
4.50    30
2.00    21
2.50    18
1.50    10
1.00     9
3.75     3
4.70     3
2.90     1
3.40     1
3.60     1
0.50     1
4.80     1
4.20     1
2.80     1
3.20     1
1.40     1
3.80     1
3.70     1
3.78     1
Name: Meat:filling, dtype: int64

4.0    80
3.0    38
2.0    37
3.5    33
5.0    30
4.5    30
2.5    16
1.0    12
1.5    10
2.4     2
4.3     1
4.2     1
2.3     1
1.6     1
2.7     1
3.2     1
4.4     1
2.2     1
Name: Uniformity, dtype: int64

3.00    58
4.00    56
3.50    39
2.00    29
2.50    27
4.50    25
5.00    18
1.50     7
1.00     4
4.30     2
4.20     2
3.80     2
4.75     1
0.00     1
2.75     1
2.20     1
3.75     1
0.50     1
3.20     1
3.70     1
1.80     1
Name: Salsa, dtype: int64

4.00    81
3.00    49
3.50    36
4.50    33
5.00    24
2.00    23
2.50    21
1.50     6
1.00     4
3.80     3
3.75     2
4.70     2
3.70     2
3.40     1
2.90     1
4.30     1
2.80     1
2.30     1
3.30     1
4.90     1
4.40     1
2.70     1
4.20     1
Name: Synergy, dtype: int64

5.0    98
4.0    64
4.5    49
3.0    30
3.5    12
2.0    12
1.5     8
1.0     8
2.5     5
0.5     4
1.2     1
2.2     1
3.8     1
2.6     1
3.9     1
0.0     1
Name: Wrap, dtype: int64

False    271
True      27
Name: Unreliable, dtype: int64

False    293
True       5
Name: NonSD, dtype: int64

True     168
False    130
Name: Beef, dtype: int64

False    155
True     143
Name: Pico, dtype: int64

False    159
True     139
Name: Guac, dtype: int64

True     149
False    149
Name: Cheese, dtype: int64

False    179
True     119
Name: Fries, dtype: int64

False    213
True      85
Name: Sour cream, dtype: int64

False    255
True      43
Name: Pork, dtype: int64

False    278
True      20
Name: Chicken, dtype: int64

False    278
True      20
Name: Shrimp, dtype: int64

False    293
True       5
Name: Fish, dtype: int64

False    265
True      33
Name: Rice, dtype: int64

False    266
True      32
Name: Beans, dtype: int64

False    287
True      11
Name: Lettuce, dtype: int64

False    291
True       7
Name: Tomato, dtype: int64

False    291
True       7
Name: Bell peper, dtype: int64

False    297
True       1
Name: Carrots, dtype: int64

False    291
True       7
Name: Cabbage, dtype: int64

False    261
True      37
Name: Sauce, dtype: int64

False    292
True       6
Name: Salsa.1, dtype: int64

False    283
True      15
Name: Cilantro, dtype: int64

False    281
True      17
Name: Onion, dtype: int64

False    294
True       4
Name: Taquito, dtype: int64

False    291
True       7
Name: Pineapple, dtype: int64

False    297
True       1
Name: Ham, dtype: int64

False    294
True       4
Name: Chile relleno, dtype: int64

False    294
True       4
Name: Nopales, dtype: int64

False    297
True       1
Name: Lobster, dtype: int64

False    294
True       4
Name: Egg, dtype: int64

False    295
True       3
Name: Mushroom, dtype: int64

False    295
True       3
Name: Bacon, dtype: int64

False    296
True       2
Name: Sushi, dtype: int64

False    285
True      13
Name: Avocado, dtype: int64

False    296
True       2
Name: Corn, dtype: int64

False    297
True       1
Name: Zucchini, dtype: int64

False    176
True     122
Name: Great, dtype: int64
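
The `Great` counts above are enough to state the majority-class baseline for the training set. A short sketch, with the two counts copied by hand from that output (the variable names are made up for this example):

```python
# Class counts copied from the `Great` value counts above.
n_not_great, n_great = 176, 122

# The majority class is whichever label is more common in training.
majority_class = n_great > n_not_great  # False: 'not Great' is more common

# Accuracy of always predicting the majority class on the training rows.
baseline_accuracy = max(n_not_great, n_great) / (n_not_great + n_great)
print(f'majority class: {majority_class}, '
      f'training baseline accuracy: {baseline_accuracy:.3f}')
```

Always guessing "not Great" is right for about 59% of the training reviews, so a logistic regression model should beat roughly that figure on the validation set to be worth keeping. The validation baseline itself has to be computed from the 2017 rows.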