<a href="https://colab.research.google.com/github/ssbyrne89/DS-Unit-2-Linear-Models/blob/master/DSPT5_HW_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
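
The stretch goals above (one-hot encoding, feature scaling, coefficients, pipelines) can be combined in one estimator. The following is a minimal sketch, not the assignment's reference solution: the `toy` DataFrame and its values are made up for illustration, and it uses scikit-learn's own `OneHotEncoder` rather than `category_encoders` so that the block stays self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the burrito data; the values are invented.
toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other', 'California', 'Carnitas', 'Other'],
    'Cost': [6.5, 5.25, None, 7.0, 4.85, 6.0],
    'Tortilla': [3.0, 4.0, 2.5, 4.5, 3.0, 2.0],
    'Great': [False, True, False, True, False, False],
})
X, y = toy[['Burrito', 'Cost', 'Tortilla']], toy['Great']

# One-hot encode the categorical column; impute and scale the numeric ones.
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()),
     ['Cost', 'Tortilla']),
])

# The pipeline applies the same preprocessing in fit, predict, and score.
model = make_pipeline(preprocess, LogisticRegression(solver='lbfgs'))
model.fit(X, y)

train_accuracy = model.score(X, y)
# One coefficient per encoded feature: 4 burrito categories + 2 numeric columns.
coefs = model.named_steps['logisticregression'].coef_[0]
```

Because the imputer and scaler live inside the pipeline, their statistics are learned from the training data only and reused unchanged on validation and test data.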

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [0]:
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)

In [82]:
df['Date'].describe()

count                     421
unique                    169
top       2016-08-30 00:00:00
freq                       29
first     2011-05-16 00:00:00
last      2026-04-25 00:00:00
Name: Date, dtype: object

# Do train/validate/test split.
# Train on reviews from 2016 & earlier.
# Validate on 2017. Test on 2018 & later.

In [0]:
train = df[df.Date.dt.year <= 2016]
val = df[df.Date.dt.year == 2017]
test = df[df.Date.dt.year >= 2018]

In [93]:
len(train['Date']) + len(val['Date']) + len(test['Date'])

421
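
The time-based split above can also be wrapped in a small helper and checked for overlap. This is a sketch under assumptions: `split_by_year` and the `toy` frame are hypothetical names introduced here, and the cutoff years mirror the ones stated in the assignment.

```python
import pandas as pd

def split_by_year(frame, date_col='Date', last_train_year=2016, val_year=2017):
    """Split rows into train (<= last_train_year), val (== val_year), test (later)."""
    years = frame[date_col].dt.year
    train = frame[years <= last_train_year]
    val = frame[years == val_year]
    test = frame[years > val_year]
    return train, val, test

# Hypothetical dates, one or two per period.
toy = pd.DataFrame({'Date': pd.to_datetime([
    '2015-03-01', '2016-12-15', '2017-06-30', '2018-01-02', '2019-05-05'])})

toy_train, toy_val, toy_test = split_by_year(toy)

# Every row lands in exactly one split.
all_index = toy_train.index.append(toy_val.index).append(toy_test.index)
assert len(all_index) == len(toy) and all_index.is_unique
```

Checking the combined index for uniqueness catches overlapping conditions, which a plain length sum would only catch by accident.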

# Begin with baselines for classification.

In [0]:
from pandas_profiling import ProfileReport


In [0]:
ProfileReport(train)

In [101]:
# Determine the majority class
target = 'Great'
y_train = train[target]
y_train.value_counts(normalize=True)

False    0.590604
True     0.409396
Name: Great, dtype: float64

In [0]:
majority_class = y_train.mode()[0]
y_pred_train = [majority_class]*len(y_train)

In [103]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred_train)

0.5906040268456376

In [104]:
y_val = val[target]
y_pred = [majority_class]*len(y_val)
accuracy_score(y_val, y_pred)

0.5529411764705883
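
The same majority-class baseline can be expressed with scikit-learn's `DummyClassifier`, which keeps the baseline behind the usual `fit`/`predict` interface. This is a sketch on made-up labels; `y_train_toy` and `y_val_toy` are hypothetical stand-ins, not the burrito data.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: the majority class in training is False.
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, False, True, False])

# DummyClassifier ignores the features, so a placeholder column is enough.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit([[0]] * len(y_train_toy), y_train_toy)

# The baseline predicts the training majority class for every validation row.
val_pred = baseline.predict([[0]] * len(y_val_toy))
baseline_val_accuracy = accuracy_score(y_val_toy, val_pred)
```

Any model worth keeping should beat this number on the validation set.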

In [123]:
train.describe(exclude='number')

Unnamed: 0,Burrito,Date,Chips,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
count,298,298,22,27,5,168,143,139,149,119,85,43,20,20,5,33,32,11,7,7,1,7,37,6,15,17,4,7,1,4,4,1,4,3,3,2,13,2,1,298
unique,5,110,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,1,2
top,California,2016-08-30 00:00:00,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,False
freq,118,29,19,27,3,130,115,101,121,97,63,29,19,17,3,24,24,9,5,4,1,5,33,5,9,9,3,5,1,4,4,1,4,3,3,2,13,1,1,176
first,,2011-05-16 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
last,,2016-12-15 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [117]:
train.Burrito

0      California
1      California
2        Carnitas
3           Asada
4      California
          ...    
296    California
297         Other
298    California
299         Asada
300         Other
Name: Burrito, Length: 298, dtype: object

In [99]:
train.shape, val.shape, test.shape, df.shape

((298, 59), (85, 59), (38, 59), (421, 59))

# Use scikit-learn for logistic regression.

In [127]:
from pandas_profiling import ProfileReport
ProfileReport(train)


[Output of ProfileReport(train), flattened from the rendered HTML report. Overview tables:]

Dataset info,Value
Number of variables,60
Number of observations,298
Total Missing (%),65.1%
Total size in memory,137.8 KiB
Average record size in memory,473.4 B

Variable type,Count
Numeric,17
Categorical,38
Boolean,1
Date,1
Text (Unique),0
Rejected,3
Unsupported,0

[Per-variable tables omitted: the flattened output lost the variable names. They show that most categorical columns hold only 'x', 'X', or missing values, and that the target Great is True for 122 of 298 training rows (mean 0.4094).]

[Sample: first five rows of train]
Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,2016-01-24,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,2016-01-24,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,2016-01-27,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [0]:
features = ['Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length',
       'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling',
       'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD',
       'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork',
       'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper',
       'Carrots', 'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro',
       'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno',
       'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom',
       'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']
X_train = train[features]
X_val = val[features]

In [111]:
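from sklearn.impute import SimpleImputer

# The original cell raised a ValueError, most likely because SimpleImputer's
# default mean strategy cannot handle the 'x'/'X' string flags in Chips,
# Unreliable, NonSD, and the ingredient columns. Encode each flag as
# 1 (marked) or 0 (missing); treating a missing flag as absent is an assumption.
flag_cols = X_train.select_dtypes(exclude='number').columns
X_train_encoded = X_train.assign(**{c: X_train[c].notna().astype(int) for c in flag_cols})
X_val_encoded = X_val.assign(**{c: X_val[c].notna().astype(int) for c in flag_cols})

# Fill the remaining numeric gaps with training-set means.
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [105]:
from sklearn.linear_model import LogisticRegression

# Fit on the imputed training data, then score on the 2017 validation set.
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)
print('Validation Accuracy', log_reg.score(X_val_imputed, y_val))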
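
The remaining checklist items, inspecting coefficients and scoring once on held-out data, might look like the sketch below. It runs on synthetic data rather than the burrito reviews: `X_toy`, `y_toy`, and the 80/20 row split are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on Tortilla and Meat, not on Cost.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)),
                     columns=['Tortilla', 'Meat', 'Cost'])
y_toy = (X_toy['Tortilla'] + X_toy['Meat']) > 0

# Hold out the last 20 rows to stand in for a test set.
X_fit, X_holdout = X_toy.iloc[:80], X_toy.iloc[80:]
y_fit, y_holdout = y_toy.iloc[:80], y_toy.iloc[80:]

model = LogisticRegression(solver='lbfgs').fit(X_fit, y_fit)

# Label each coefficient with its feature name and sort for readability.
coefs = pd.Series(model.coef_[0], index=X_toy.columns).sort_values()

# Score on the held-out rows once, after all modeling choices are final.
holdout_accuracy = model.score(X_holdout, y_holdout)

# In a notebook, coefs.plot.barh() would draw the sorted coefficients.
```

Coefficients are only comparable across features when the inputs share a scale, which is one reason to standardize before fitting.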