Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
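
A minimal sketch of the date-based split and a majority-class baseline from the checklist above. The `toy` frame and its values are hypothetical stand-ins for the burrito data, and the variable names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the burrito data.
toy = pd.DataFrame({
    'Date': ['1/18/2016', '4/7/2017', '5/22/2018', '8/27/2019', '12/2/2015'],
    'Great': [False, True, True, False, False],
})

# Parse the month/day/year strings so the year can be compared numerically.
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y')
year = toy['Date'].dt.year

train = toy[year <= 2016]   # 2016 and earlier
val = toy[year == 2017]     # 2017 only
test = toy[year >= 2018]    # 2018 and later

# Majority-class baseline: always predict the most common training label.
majority = train['Great'].mode()[0]
baseline_val_accuracy = (val['Great'] == majority).mean()
print(len(train), len(val), len(test), majority, baseline_val_accuracy)
```

Any model worth keeping should beat the baseline accuracy on the validation set; the test set is held back until the very end.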


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
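
Several of these stretch goals can be combined in one scikit-learn pipeline. The sketch below is only an illustration under assumptions: the data is a hypothetical toy frame, and it uses scikit-learn's own `OneHotEncoder` rather than the `category_encoders` package installed later in the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; a real run would use the train/validation frames.
X_train = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'Other', 'California', 'Carnitas', 'Other'],
    'Cost': [6.49, 5.25, None, 7.9, 4.85, 8.25],
    'Synergy': [4.0, 2.5, 3.0, 4.5, 3.0, 5.0],
})
y_train = pd.Series([True, False, False, True, False, True])

numeric = ['Cost', 'Synergy']
categorical = ['Burrito']

# Impute and scale the numeric columns; one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ('num', make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
# One coefficient per scaled numeric column plus one per one-hot category.
coefficients = model.named_steps['logisticregression'].coef_[0]
print(train_accuracy, coefficients.shape)
```

Because the scaler and encoder are fit inside the pipeline, calling `model.score` on validation data reuses the training statistics instead of refitting them.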

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [13]:
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [12]:
df.describe()

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso
count,87.0,87.0,414.0,418.0,22.0,22.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,0.0
mean,3.887356,4.167816,7.067343,3.495335,546.181818,0.675277,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,
std,0.475396,0.373698,1.506742,0.812069,144.445619,0.080468,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,
min,2.5,2.9,2.99,0.5,350.0,0.56,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,
25%,3.5,4.0,6.25,3.0,450.0,0.619485,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,
50%,4.0,4.2,6.99,3.5,540.0,0.658099,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,
75%,4.0,4.4,7.88,4.0,595.0,0.721726,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,
max,4.5,5.0,25.0,5.0,925.0,0.865672,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,


In [10]:
# Well, let's check out this newfangled profiling. It seems like a pretty great
# tool.
import pandas_profiling
profile = pandas_profiling.ProfileReport(df)
profile



Dataset info,Value
Number of variables,60
Number of observations,421
Total Missing (%),65.0%
Total size in memory,194.6 KiB
Average record size in memory,473.3 B

Variable type,Count
Numeric,17
Categorical,39
Boolean,1
Date,0
Text (Unique),0
Rejected,3
Unsupported,0

Variable,Index (shown as Unnamed: 0)
Distinct count,421
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,211.18
Minimum,0
Maximum,422
Zeros (%),0.2%

0,1
Minimum,0
5-th percentile,21
Q1,105
Median,212
Q3,317
95-th percentile,401
Maximum,422
Range,422
Interquartile range,212

0,1
Standard deviation,122.52
Coef of variation,0.58014
Kurtosis,-1.2069
Mean,211.18
MAD,106.07
Skewness,-0.0043514
Sum,88908
Variance,15010
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
422,1,0.2%,
156,1,0.2%,
132,1,0.2%,
133,1,0.2%,
134,1,0.2%,
135,1,0.2%,
136,1,0.2%,
137,1,0.2%,
138,1,0.2%,
139,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.2%,
1,1,0.2%,
2,1,0.2%,
3,1,0.2%,
4,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
418,1,0.2%,
419,1,0.2%,
420,1,0.2%,
421,1,0.2%,
422,1,0.2%,

Variable,Burrito
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0

0,1
California,169
Other,156
Asada,43
Other values (2),53

Value,Count,Frequency (%),Unnamed: 3
California,169,40.1%,
Other,156,37.1%,
Asada,43,10.2%,
Surf & Turf,28,6.7%,
Carnitas,25,5.9%,

Variable,Date
Distinct count,169
Unique (%),40.1%
Missing (%),0.0%
Missing (n),0

0,1
8/30/2016,29
6/24/2016,9
8/27/2019,9
Other values (166),374

Value,Count,Frequency (%),Unnamed: 3
8/30/2016,29,6.9%,
6/24/2016,9,2.1%,
8/27/2019,9,2.1%,
5/6/2016,7,1.7%,
4/15/2016,7,1.7%,
5/13/2016,7,1.7%,
4/7/2017,6,1.4%,
3/21/2016,6,1.4%,
6/3/2016,6,1.4%,
5/22/2018,5,1.2%,

Variable,Yelp
Distinct count,7
Unique (%),1.7%
Missing (%),79.3%
Missing (n),334
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.8874
Minimum,2.5
Maximum,4.5
Zeros (%),0.0%

0,1
Minimum,2.5
5-th percentile,3.0
Q1,3.5
Median,4.0
Q3,4.0
95-th percentile,4.5
Maximum,4.5
Range,2.0
Interquartile range,0.5

0,1
Standard deviation,0.4754
Coef of variation,0.12229
Kurtosis,0.48244
Mean,3.8874
MAD,0.36686
Skewness,-0.78091
Sum,338.2
Variance,0.226
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,41,9.7%,
4.5,18,4.3%,
3.5,18,4.3%,
3.0,7,1.7%,
2.5,2,0.5%,
4.2,1,0.2%,
(Missing),334,79.3%,

Value,Count,Frequency (%),Unnamed: 3
2.5,2,0.5%,
3.0,7,1.7%,
3.5,18,4.3%,
4.0,41,9.7%,
4.2,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
3.0,7,1.7%,
3.5,18,4.3%,
4.0,41,9.7%,
4.2,1,0.2%,
4.5,18,4.3%,

Variable,Google
Distinct count,19
Unique (%),4.5%
Missing (%),79.3%
Missing (n),334
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.1678
Minimum,2.9
Maximum,5
Zeros (%),0.0%

0,1
Minimum,2.9
5-th percentile,3.43
Q1,4.0
Median,4.2
Q3,4.4
95-th percentile,4.7
Maximum,5.0
Range,2.1
Interquartile range,0.4

0,1
Standard deviation,0.3737
Coef of variation,0.089663
Kurtosis,1.1153
Mean,4.1678
MAD,0.28231
Skewness,-0.54038
Sum,362.6
Variance,0.13965
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.4,12,2.9%,
4.1,12,2.9%,
4.2,11,2.6%,
4.0,9,2.1%,
4.3,6,1.4%,
3.9,6,1.4%,
4.5,6,1.4%,
3.8,6,1.4%,
4.7,4,1.0%,
4.6,3,0.7%,

Value,Count,Frequency (%),Unnamed: 3
2.9,1,0.2%,
3.3,2,0.5%,
3.4,2,0.5%,
3.5,1,0.2%,
3.7,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
4.6,3,0.7%,
4.7,4,1.0%,
4.8,1,0.2%,
4.9,2,0.5%,
5.0,1,0.2%,

Variable,Chips
Distinct count,5
Unique (%),1.2%
Missing (%),93.8%
Missing (n),395

0,1
x,21
X,3
Yes,1
(Missing),395

Value,Count,Frequency (%),Unnamed: 3
x,21,5.0%,
X,3,0.7%,
Yes,1,0.2%,
No,1,0.2%,
(Missing),395,93.8%,

Variable,Cost
Distinct count,99
Unique (%),23.5%
Missing (%),1.7%
Missing (n),7
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.0673
Minimum,2.99
Maximum,25
Zeros (%),0.0%

0,1
Minimum,2.99
5-th percentile,5.25
Q1,6.25
Median,6.99
Q3,7.88
95-th percentile,9.0
Maximum,25.0
Range,22.01
Interquartile range,1.63

0,1
Standard deviation,1.5067
Coef of variation,0.2132
Kurtosis,48.352
Mean,7.0673
MAD,0.9721
Skewness,4.3618
Sum,2925.9
Variance,2.2703
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
6.25,40,9.5%,
5.99,23,5.5%,
6.99,23,5.5%,
7.9,20,4.8%,
7.5,17,4.0%,
8.25,17,4.0%,
7.0,16,3.8%,
7.49,15,3.6%,
6.6,15,3.6%,
6.5,14,3.3%,

Value,Count,Frequency (%),Unnamed: 3
2.99,1,0.2%,
3.5,1,0.2%,
3.75,1,0.2%,
3.99,1,0.2%,
4.19,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
10.5,1,0.2%,
11.5,1,0.2%,
11.75,1,0.2%,
11.95,2,0.5%,
25.0,1,0.2%,

Variable,Hunger
Distinct count,26
Unique (%),6.2%
Missing (%),0.7%
Missing (n),3
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.4953
Minimum,0.5
Maximum,5
Zeros (%),0.0%

0,1
Minimum,0.5
5-th percentile,2.0
Q1,3.0
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.5
Interquartile range,1.0

0,1
Standard deviation,0.81207
Coef of variation,0.23233
Kurtosis,0.85698
Mean,3.4953
MAD,0.61923
Skewness,-0.6989
Sum,1461.1
Variance,0.65946
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,126,29.9%,
3.5,82,19.5%,
3.0,80,19.0%,
2.0,26,6.2%,
5.0,22,5.2%,
2.5,22,5.2%,
4.5,21,5.0%,
1.0,7,1.7%,
3.2,4,1.0%,
4.3,4,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.5,1,0.2%,
1.0,7,1.7%,
1.5,2,0.5%,
2.0,26,6.2%,
2.2,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.3,4,1.0%,
4.4,1,0.2%,
4.5,21,5.0%,
4.75,1,0.2%,
5.0,22,5.2%,

Variable,Mass (g)
Distinct count,19
Unique (%),4.5%
Missing (%),94.8%
Missing (n),399
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,546.18
Minimum,350
Maximum,925
Zeros (%),0.0%

0,1
Minimum,350.0
5-th percentile,415.25
Q1,450.0
Median,540.0
Q3,595.0
95-th percentile,905.75
Maximum,925.0
Range,575.0
Interquartile range,145.0

0,1
Standard deviation,144.45
Coef of variation,0.26446
Kurtosis,2.7742
Mean,546.18
MAD,98.926
Skewness,1.5636
Sum,12016
Variance,20865
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
450.0,3,0.7%,
550.0,2,0.5%,
540.0,2,0.5%,
430.0,1,0.2%,
520.0,1,0.2%,
425.0,1,0.2%,
476.0,1,0.2%,
560.0,1,0.2%,
920.0,1,0.2%,
635.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
350.0,1,0.2%,
415.0,1,0.2%,
420.0,1,0.2%,
425.0,1,0.2%,
430.0,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
610.0,1,0.2%,
620.0,1,0.2%,
635.0,1,0.2%,
920.0,1,0.2%,
925.0,1,0.2%,

Variable,Density (g/mL)
Distinct count,22
Unique (%),5.2%
Missing (%),94.8%
Missing (n),399
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.67528
Minimum,0.56
Maximum,0.86567
Zeros (%),0.0%

0,1
Minimum,0.56
5-th percentile,0.56724
Q1,0.61949
Median,0.6581
Q3,0.72173
95-th percentile,0.78563
Maximum,0.86567
Range,0.30567
Interquartile range,0.10224

0,1
Standard deviation,0.080468
Coef of variation,0.11916
Kurtosis,-0.16921
Mean,0.67528
MAD,0.066504
Skewness,0.6531
Sum,14.856
Variance,0.0064751
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
0.625,2,0.5%,
0.7857142856999999,1,0.2%,
0.703125,1,0.2%,
0.6875,1,0.2%,
0.6323529412,1,0.2%,
0.6176470588,1,0.2%,
0.56,1,0.2%,
0.6326530612,1,0.2%,
0.8656716418000001,1,0.2%,
0.7083333333,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.56,1,0.2%,
0.5656565657,1,0.2%,
0.5974025974,1,0.2%,
0.6006493506,1,0.2%,
0.6136363636,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.7446808511,1,0.2%,
0.7792207792,1,0.2%,
0.7839506173,1,0.2%,
0.7857142856999999,1,0.2%,
0.8656716418000001,1,0.2%,

Variable,Length
Distinct count,30
Unique (%),7.1%
Missing (%),32.8%
Missing (n),138
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,20.038
Minimum,15
Maximum,26
Zeros (%),0.0%

0,1
Minimum,15.0
5-th percentile,17.0
Q1,18.5
Median,20.0
Q3,21.5
95-th percentile,23.5
Maximum,26.0
Range,11.0
Interquartile range,3.0

0,1
Standard deviation,2.0835
Coef of variation,0.10398
Kurtosis,-0.28315
Mean,20.038
MAD,1.681
Skewness,0.25651
Sum,5670.8
Variance,4.341
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
20.0,34,8.1%,
19.0,29,6.9%,
18.0,24,5.7%,
18.5,23,5.5%,
19.5,21,5.0%,
21.0,19,4.5%,
20.5,17,4.0%,
22.0,16,3.8%,
23.0,16,3.8%,
22.5,16,3.8%,

Value,Count,Frequency (%),Unnamed: 3
15.0,2,0.5%,
15.5,2,0.5%,
16.0,1,0.2%,
16.5,5,1.2%,
17.0,14,3.3%,

Value,Count,Frequency (%),Unnamed: 3
24.0,4,1.0%,
24.5,2,0.5%,
25.0,2,0.5%,
25.5,2,0.5%,
26.0,1,0.2%,

Variable,Circum
Correlation,0.92562

Variable,Volume
Correlation,0.92942

Variable,Tortilla
Distinct count,18
Unique (%),4.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.5195
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,2.0
Q1,3.0
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.0
Interquartile range,1.0

0,1
Standard deviation,0.79444
Coef of variation,0.22573
Kurtosis,-0.017905
Mean,3.5195
MAD,0.63717
Skewness,-0.38845
Sum,1481.7
Variance,0.63113
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,130,30.9%,
3.0,100,23.8%,
3.5,63,15.0%,
4.5,30,7.1%,
2.0,27,6.4%,
5.0,25,5.9%,
2.5,20,4.8%,
3.8,7,1.7%,
1.5,6,1.4%,
4.2,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,1,0.2%,
1.4,1,0.2%,
1.5,6,1.4%,
2.0,27,6.4%,
2.1,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.0,130,30.9%,
4.2,2,0.5%,
4.5,30,7.1%,
4.8,1,0.2%,
5.0,25,5.9%,

Variable,Temp
Distinct count,19
Unique (%),4.5%
Missing (%),4.8%
Missing (n),20
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.783
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,2.0
Q1,3.0
Median,4.0
Q3,4.5
95-th percentile,5.0
Maximum,5.0
Range,4.0
Interquartile range,1.5

0,1
Standard deviation,0.98034
Coef of variation,0.25914
Kurtosis,-0.4254
Mean,3.783
MAD,0.81274
Skewness,-0.59353
Sum,1517
Variance,0.96106
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,101,24.0%,
5.0,80,19.0%,
3.0,58,13.8%,
4.5,56,13.3%,
3.5,30,7.1%,
2.5,28,6.7%,
2.0,28,6.7%,
1.0,4,1.0%,
1.5,3,0.7%,
3.8,3,0.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,4,1.0%,
1.3,1,0.2%,
1.5,3,0.7%,
1.9,1,0.2%,
2.0,28,6.7%,

Value,Count,Frequency (%),Unnamed: 3
4.2,1,0.2%,
4.4,2,0.5%,
4.5,56,13.3%,
4.7,2,0.5%,
5.0,80,19.0%,

Variable,Meat
Distinct count,24
Unique (%),5.7%
Missing (%),3.3%
Missing (n),14
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.6204
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,2.0
Q1,3.0
Median,3.8
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.0
Interquartile range,1.0

0,1
Standard deviation,0.82925
Coef of variation,0.22905
Kurtosis,0.20758
Mean,3.6204
MAD,0.65836
Skewness,-0.54946
Sum,1473.5
Variance,0.68766
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,120,28.5%,
3.5,69,16.4%,
3.0,62,14.7%,
4.5,37,8.8%,
5.0,35,8.3%,
2.0,25,5.9%,
2.5,18,4.3%,
1.5,5,1.2%,
3.8,5,1.2%,
4.2,4,1.0%,

Value,Count,Frequency (%),Unnamed: 3
1.0,3,0.7%,
1.5,5,1.2%,
2.0,25,5.9%,
2.5,18,4.3%,
2.6,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
4.2,4,1.0%,
4.3,3,0.7%,
4.5,37,8.8%,
4.7,1,0.2%,
5.0,35,8.3%,

Variable,Fillings
Distinct count,23
Unique (%),5.5%
Missing (%),0.7%
Missing (n),3
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.5398
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,2.0
Q1,3.0
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.0
Interquartile range,1.0

0,1
Standard deviation,0.79955
Coef of variation,0.22587
Kurtosis,0.10127
Mean,3.5398
MAD,0.63508
Skewness,-0.32638
Sum,1479.7
Variance,0.63928
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,115,27.3%,
3.0,91,21.6%,
3.5,74,17.6%,
5.0,31,7.4%,
4.5,27,6.4%,
2.0,24,5.7%,
2.5,22,5.2%,
3.8,6,1.4%,
2.8,4,1.0%,
1.5,3,0.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,3,0.7%,
1.5,3,0.7%,
2.0,24,5.7%,
2.4,2,0.5%,
2.5,22,5.2%,

Value,Count,Frequency (%),Unnamed: 3
4.3,2,0.5%,
4.4,1,0.2%,
4.5,27,6.4%,
4.7,2,0.5%,
5.0,31,7.4%,

Variable,Meat:filling
Distinct count,26
Unique (%),6.2%
Missing (%),2.1%
Missing (n),9
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.5865
Minimum,0.5
Maximum,5
Zeros (%),0.0%

0,1
Minimum,0.5
5-th percentile,1.5
Q1,3.0
Median,4.0
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.5
Interquartile range,1.0

0,1
Standard deviation,0.99706
Coef of variation,0.278
Kurtosis,0.081909
Mean,3.5865
MAD,0.80097
Skewness,-0.71321
Sum,1477.6
Variance,0.99412
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,114,27.1%,
3.0,63,15.0%,
4.5,48,11.4%,
5.0,46,10.9%,
3.5,43,10.2%,
2.5,24,5.7%,
2.0,23,5.5%,
1.5,11,2.6%,
1.0,11,2.6%,
3.8,4,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.5,1,0.2%,
1.0,11,2.6%,
1.4,1,0.2%,
1.5,11,2.6%,
1.8,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.5,48,11.4%,
4.7,3,0.7%,
4.8,1,0.2%,
4.9,1,0.2%,
5.0,46,10.9%,

Variable,Uniformity
Distinct count,29
Unique (%),6.9%
Missing (%),0.5%
Missing (n),2
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.429
Minimum,0
Maximum,5
Zeros (%),0.2%

0,1
Minimum,0.0
5-th percentile,1.5
Q1,2.6
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,5.0
Interquartile range,1.4

0,1
Standard deviation,1.0688
Coef of variation,0.31169
Kurtosis,-0.44547
Mean,3.429
MAD,0.88646
Skewness,-0.53849
Sum,1436.8
Variance,1.1423
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,110,26.1%,
3.0,54,12.8%,
2.0,48,11.4%,
3.5,48,11.4%,
4.5,44,10.5%,
5.0,41,9.7%,
2.5,24,5.7%,
1.0,15,3.6%,
1.5,11,2.6%,
3.8,2,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.2%,
1.0,15,3.6%,
1.5,11,2.6%,
1.6,1,0.2%,
1.8,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.4,2,0.5%,
4.5,44,10.5%,
4.6,1,0.2%,
4.75,1,0.2%,
5.0,41,9.7%,

Variable,Salsa
Distinct count,28
Unique (%),6.7%
Missing (%),5.9%
Missing (n),25
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.372
Minimum,0
Maximum,5
Zeros (%),0.2%

0,1
Minimum,0.0
5-th percentile,2.0
Q1,3.0
Median,3.5
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,5.0
Interquartile range,1.0

0,1
Standard deviation,0.92404
Coef of variation,0.27403
Kurtosis,0.0027793
Mean,3.372
MAD,0.75449
Skewness,-0.42729
Sum,1335.3
Variance,0.85385
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,86,20.4%,
3.0,84,20.0%,
3.5,55,13.1%,
2.0,38,9.0%,
2.5,33,7.8%,
4.5,31,7.4%,
5.0,23,5.5%,
1.5,8,1.9%,
3.8,6,1.4%,
1.0,5,1.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.2%,
0.5,1,0.2%,
1.0,5,1.2%,
1.5,8,1.9%,
1.8,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.5,31,7.4%,
4.7,1,0.2%,
4.75,1,0.2%,
4.8,1,0.2%,
5.0,23,5.5%,

Variable,Synergy
Distinct count,28
Unique (%),6.7%
Missing (%),0.5%
Missing (n),2
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.587
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,2.0
Q1,3.0
Median,3.8
Q3,4.0
95-th percentile,5.0
Maximum,5.0
Range,4.0
Interquartile range,1.0

0,1
Standard deviation,0.88681
Coef of variation,0.24723
Kurtosis,-0.020224
Mean,3.587
MAD,0.71617
Skewness,-0.57215
Sum,1502.9
Variance,0.78643
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,118,28.0%,
3.0,66,15.7%,
3.5,55,13.1%,
4.5,42,10.0%,
5.0,33,7.8%,
2.0,28,6.7%,
2.5,24,5.7%,
3.8,8,1.9%,
1.5,7,1.7%,
1.0,5,1.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,5,1.2%,
1.5,7,1.7%,
2.0,28,6.7%,
2.2,1,0.2%,
2.3,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
4.6,1,0.2%,
4.7,2,0.5%,
4.8,2,0.5%,
4.9,2,0.5%,
5.0,33,7.8%,

Variable,Wrap
Distinct count,24
Unique (%),5.7%
Missing (%),0.7%
Missing (n),3
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.9799
Minimum,0
Maximum,5
Zeros (%),0.2%

0,1
Minimum,0.0
5-th percentile,1.5
Q1,3.5
Median,4.0
Q3,5.0
95-th percentile,5.0
Maximum,5.0
Range,5.0
Interquartile range,1.5

0,1
Standard deviation,1.1182
Coef of variation,0.28096
Kurtosis,0.94208
Mean,3.9799
MAD,0.85334
Skewness,-1.2402
Sum,1663.6
Variance,1.2503
Memory size,3.4 KiB

Value,Count,Frequency (%),Unnamed: 3
5.0,137,32.5%,
4.0,89,21.1%,
4.5,65,15.4%,
3.0,41,9.7%,
3.5,20,4.8%,
2.0,18,4.3%,
1.5,11,2.6%,
1.0,9,2.1%,
2.5,8,1.9%,
0.5,4,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1,0.2%,
0.5,4,1.0%,
1.0,9,2.1%,
1.2,1,0.2%,
1.5,11,2.6%,

Value,Count,Frequency (%),Unnamed: 3
4.2,1,0.2%,
4.3,1,0.2%,
4.5,65,15.4%,
4.8,3,0.7%,
5.0,137,32.5%,

Variable,Unreliable
Distinct count,2
Unique (%),0.5%
Missing (%),92.2%
Missing (n),388

0,1
x,33
(Missing),388

Value,Count,Frequency (%),Unnamed: 3
x,33,7.8%,
(Missing),388,92.2%,

Variable,NonSD
Distinct count,3
Unique (%),0.7%
Missing (%),98.3%
Missing (n),414

0,1
x,5
X,2
(Missing),414

Value,Count,Frequency (%),Unnamed: 3
x,5,1.2%,
X,2,0.5%,
(Missing),414,98.3%,

Variable,Beef
Distinct count,3
Unique (%),0.7%
Missing (%),57.5%
Missing (n),242

0,1
x,137
X,42
(Missing),242

Value,Count,Frequency (%),Unnamed: 3
x,137,32.5%,
X,42,10.0%,
(Missing),242,57.5%,

Variable,Pico
Distinct count,3
Unique (%),0.7%
Missing (%),62.5%
Missing (n),263

0,1
x,127
X,31
(Missing),263

Value,Count,Frequency (%),Unnamed: 3
x,127,30.2%,
X,31,7.4%,
(Missing),263,62.5%,

Variable,Guac
Distinct count,3
Unique (%),0.7%
Missing (%),63.4%
Missing (n),267

0,1
x,114
X,40
(Missing),267

Value,Count,Frequency (%),Unnamed: 3
x,114,27.1%,
X,40,9.5%,
(Missing),267,63.4%,

Variable,Cheese
Distinct count,3
Unique (%),0.7%
Missing (%),62.2%
Missing (n),262

0,1
x,128
X,31
(Missing),262

Value,Count,Frequency (%),Unnamed: 3
x,128,30.4%,
X,31,7.4%,
(Missing),262,62.2%,

Variable,Fries
Distinct count,3
Unique (%),0.7%
Missing (%),69.8%
Missing (n),294

0,1
x,102
X,25
(Missing),294

Value,Count,Frequency (%),Unnamed: 3
x,102,24.2%,
X,25,5.9%,
(Missing),294,69.8%,

Variable,Sour cream
Distinct count,3
Unique (%),0.7%
Missing (%),78.1%
Missing (n),329

0,1
x,67
X,25
(Missing),329

Value,Count,Frequency (%),Unnamed: 3
x,67,15.9%,
X,25,5.9%,
(Missing),329,78.1%,

Variable,Pork
Distinct count,3
Unique (%),0.7%
Missing (%),87.9%
Missing (n),370

0,1
x,36
X,15
(Missing),370

Value,Count,Frequency (%),Unnamed: 3
x,36,8.6%,
X,15,3.6%,
(Missing),370,87.9%,

Variable,Chicken
Distinct count,3
Unique (%),0.7%
Missing (%),95.0%
Missing (n),400

0,1
x,20
X,1
(Missing),400

Value,Count,Frequency (%),Unnamed: 3
x,20,4.8%,
X,1,0.2%,
(Missing),400,95.0%,

Variable,Shrimp
Distinct count,3
Unique (%),0.7%
Missing (%),95.0%
Missing (n),400

0,1
x,17
X,4
(Missing),400

Value,Count,Frequency (%),Unnamed: 3
x,17,4.0%,
X,4,1.0%,
(Missing),400,95.0%,

Variable,Fish
Distinct count,3
Unique (%),0.7%
Missing (%),98.6%
Missing (n),415

0,1
x,4
X,2
(Missing),415

Value,Count,Frequency (%),Unnamed: 3
x,4,1.0%,
X,2,0.5%,
(Missing),415,98.6%,

Variable,Rice
Distinct count,3
Unique (%),0.7%
Missing (%),91.4%
Missing (n),385

0,1
x,26
X,10
(Missing),385

Value,Count,Frequency (%),Unnamed: 3
x,26,6.2%,
X,10,2.4%,
(Missing),385,91.4%,

Variable,Beans
Distinct count,3
Unique (%),0.7%
Missing (%),91.7%
Missing (n),386

0,1
x,27
X,8
(Missing),386

Value,Count,Frequency (%),Unnamed: 3
x,27,6.4%,
X,8,1.9%,
(Missing),386,91.7%,

Variable,Lettuce
Distinct count,3
Unique (%),0.7%
Missing (%),97.4%
Missing (n),410

0,1
x,9
X,2
(Missing),410

Value,Count,Frequency (%),Unnamed: 3
x,9,2.1%,
X,2,0.5%,
(Missing),410,97.4%,

Variable,Tomato
Distinct count,3
Unique (%),0.7%
Missing (%),98.3%
Missing (n),414

0,1
x,5
X,2
(Missing),414

Value,Count,Frequency (%),Unnamed: 3
x,5,1.2%,
X,2,0.5%,
(Missing),414,98.3%,

Variable,Bell peper
Distinct count,3
Unique (%),0.7%
Missing (%),98.3%
Missing (n),414

0,1
x,4
X,3
(Missing),414

Value,Count,Frequency (%),Unnamed: 3
x,4,1.0%,
X,3,0.7%,
(Missing),414,98.3%,

Variable,Carrots
Distinct count,2
Unique (%),0.5%
Missing (%),99.8%
Missing (n),420

0,1
x,1
(Missing),420

Value,Count,Frequency (%),Unnamed: 3
x,1,0.2%,
(Missing),420,99.8%,

Variable,Cabbage
Distinct count,3
Unique (%),0.7%
Missing (%),98.1%
Missing (n),413

0,1
x,6
X,2
(Missing),413

Value,Count,Frequency (%),Unnamed: 3
x,6,1.4%,
X,2,0.5%,
(Missing),413,98.1%,

Variable,Sauce
Distinct count,3
Unique (%),0.7%
Missing (%),91.0%
Missing (n),383

0,1
x,33
X,5
(Missing),383

Value,Count,Frequency (%),Unnamed: 3
x,33,7.8%,
X,5,1.2%,
(Missing),383,91.0%,

Variable,Salsa.1
Distinct count,3
Unique (%),0.7%
Missing (%),98.3%
Missing (n),414

0,1
x,6
X,1
(Missing),414

Value,Count,Frequency (%),Unnamed: 3
x,6,1.4%,
X,1,0.2%,
(Missing),414,98.3%,

Variable,Cilantro
Distinct count,3
Unique (%),0.7%
Missing (%),96.4%
Missing (n),406

0,1
x,9
X,6
(Missing),406

Value,Count,Frequency (%),Unnamed: 3
x,9,2.1%,
X,6,1.4%,
(Missing),406,96.4%,

Variable,Onion
Distinct count,3
Unique (%),0.7%
Missing (%),96.0%
Missing (n),404

0,1
x,9
X,8
(Missing),404

Value,Count,Frequency (%),Unnamed: 3
x,9,2.1%,
X,8,1.9%,
(Missing),404,96.0%,

Variable,Taquito
Distinct count,3
Unique (%),0.7%
Missing (%),99.0%
Missing (n),417

0,1
x,3
X,1
(Missing),417

Value,Count,Frequency (%),Unnamed: 3
x,3,0.7%,
X,1,0.2%,
(Missing),417,99.0%,

Variable,Pineapple
Distinct count,3
Unique (%),0.7%
Missing (%),98.3%
Missing (n),414

0,1
x,5
X,2
(Missing),414

Value,Count,Frequency (%),Unnamed: 3
x,5,1.2%,
X,2,0.5%,
(Missing),414,98.3%,

Variable,Ham
Distinct count,2
Unique (%),0.5%
Missing (%),99.5%
Missing (n),419

0,1
x,2
(Missing),419

Value,Count,Frequency (%),Unnamed: 3
x,2,0.5%,
(Missing),419,99.5%,

Variable,Chile relleno
Distinct count,2
Unique (%),0.5%
Missing (%),99.0%
Missing (n),417

0,1
x,4
(Missing),417

Value,Count,Frequency (%),Unnamed: 3
x,4,1.0%,
(Missing),417,99.0%,

Variable,Nopales
Distinct count,2
Unique (%),0.5%
Missing (%),99.0%
Missing (n),417

0,1
x,4
(Missing),417

Value,Count,Frequency (%),Unnamed: 3
x,4,1.0%,
(Missing),417,99.0%,

Variable,Lobster
Distinct count,2
Unique (%),0.5%
Missing (%),99.8%
Missing (n),420

0,1
x,1
(Missing),420

Value,Count,Frequency (%),Unnamed: 3
x,1,0.2%,
(Missing),420,99.8%,

Variable,Queso
Constant value,

Variable,Egg
Distinct count,2
Unique (%),0.5%
Missing (%),98.8%
Missing (n),416

0,1
x,5
(Missing),416

Value,Count,Frequency (%),Unnamed: 3
x,5,1.2%,
(Missing),416,98.8%,

Variable,Mushroom
Distinct count,2
Unique (%),0.5%
Missing (%),99.3%
Missing (n),418

0,1
x,3
(Missing),418

Value,Count,Frequency (%),Unnamed: 3
x,3,0.7%,
(Missing),418,99.3%,

Variable,Bacon
Distinct count,2
Unique (%),0.5%
Missing (%),99.3%
Missing (n),418

0,1
x,3
(Missing),418

Value,Count,Frequency (%),Unnamed: 3
x,3,0.7%,
(Missing),418,99.3%,

Variable,Sushi
Distinct count,2
Unique (%),0.5%
Missing (%),99.5%
Missing (n),419

0,1
x,2
(Missing),419

Value,Count,Frequency (%),Unnamed: 3
x,2,0.5%,
(Missing),419,99.5%,

Variable,Avocado
Distinct count,2
Unique (%),0.5%
Missing (%),96.9%
Missing (n),408

0,1
x,13
(Missing),408

Value,Count,Frequency (%),Unnamed: 3
x,13,3.1%,
(Missing),408,96.9%,

Variable,Corn
Distinct count,3
Unique (%),0.7%
Missing (%),99.3%
Missing (n),418

0,1
x,2
X,1
(Missing),418

Value,Count,Frequency (%),Unnamed: 3
x,2,0.5%,
X,1,0.2%,
(Missing),418,99.3%,

Variable,Zucchini
Distinct count,2
Unique (%),0.5%
Missing (%),99.8%
Missing (n),420

0,1
x,1
(Missing),420

Value,Count,Frequency (%),Unnamed: 3
x,1,0.2%,
(Missing),420,99.8%,

Variable,Great
Distinct count,2
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.4323

0,1
True,182
(Missing),239

Value,Count,Frequency (%),Unnamed: 3
True,182,43.2%,
(Missing),239,56.8%,



## Initial Observations
A) The Yelp and Google features will cause leakage. They are straight-up scores of the burrito, so we need to drop them.<br><br>
B) Next, a lot of boolean columns represent their values with x = true and NaN = false. Let's go ahead and turn those into actual boolean columns.<br><br>
C) A few columns look worthless anyway: mass, density, length, circumference, volume, and queso. They have too much missing data to be really useful.<br><br>
~~D) There might also be some overlap between the Meat, Fillings, and Meat:filling columns. I should take a closer look at those. They might be a meat score, a filling score, and a meat-to-filling ratio?~~ Further investigation of the link confirmed this: a meat rating, a filling rating, and a ratio column.<br><br>
E) Salsa has both a rating column and a boolean column. Maybe just rename those.
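
One way to back up observation C with numbers is to compute the share of missing values per column. This is a minimal sketch on a hypothetical toy frame; the 0.5 cutoff is an arbitrary assumption for illustration, not a threshold taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror a few from the burrito data.
toy = pd.DataFrame({
    'Mass (g)': [np.nan, np.nan, np.nan, 540.0],
    'Cost': [6.49, 5.45, np.nan, 6.59],
    'Tortilla': [3.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values in each column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above the (arbitrary) cutoff are candidates for dropping.
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_empty)
```

Running the same two lines on the real DataFrame would show which columns cross whatever cutoff you choose.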

In [15]:
# Maybe this is silly, but I still like to leave the original DataFrame as
# intact as possible.

# A) Drop the leaky columns
leakers = ['Yelp', 'Google']
# C) Drop the mostly empty columns
junk = ['Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Queso']


# drop() returns a new DataFrame, so df itself is left untouched
working = df.drop(columns=leakers + junk)
print(working.shape)
working.head()

(421, 51)


Unnamed: 0,Burrito,Date,Chips,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,4.85,1.5,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,5.25,2.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,x,6.59,4.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [17]:
# E) Changing column names is easy. Look at that sassy question mark.
working = working.rename(columns={'Salsa': 'Salsa Score', 'Salsa.1': 'Salsa?',
                                  'Meat:filling': 'Meat:Filling Ratio'})
working.head()

Unnamed: 0,Burrito,Date,Chips,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:Filling Ratio,Uniformity,Salsa Score,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa?,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,4.85,1.5,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,5.25,2.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,x,6.59,4.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [18]:
working.columns

Index(['Burrito', 'Date', 'Chips', 'Cost', 'Hunger', 'Tortilla', 'Temp',
       'Meat', 'Fillings', 'Meat:Filling Ratio', 'Uniformity', 'Salsa Score',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa?', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg',
       'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini', 'Great'],
      dtype='object')

In [20]:
# B) Let's fix the booleans.
# At first I thought one-hot encoding might work for this, but because our
# booleans are very sloppy, I think that would just make things worse.
# Instead, I should write some logic to clean them up.

col_bool = ['Chips', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa?', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Egg',
       'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

for column in col_bool:
  print(working[column].value_counts())

x      21
X       3
Yes     1
No      1
Name: Chips, dtype: int64
x    33
Name: Unreliable, dtype: int64
x    5
X    2
Name: NonSD, dtype: int64
x    137
X     42
Name: Beef, dtype: int64
x    127
X     31
Name: Pico, dtype: int64
x    114
X     40
Name: Guac, dtype: int64
x    128
X     31
Name: Cheese, dtype: int64
x    102
X     25
Name: Fries, dtype: int64
x    67
X    25
Name: Sour cream, dtype: int64
x    36
X    15
Name: Pork, dtype: int64
x    20
X     1
Name: Chicken, dtype: int64
x    17
X     4
Name: Shrimp, dtype: int64
x    4
X    2
Name: Fish, dtype: int64
x    26
X    10
Name: Rice, dtype: int64
x    27
X     8
Name: Beans, dtype: int64
x    9
X    2
Name: Lettuce, dtype: int64
x    5
X    2
Name: Tomato, dtype: int64
x    4
X    3
Name: Bell peper, dtype: int64
x    1
Name: Carrots, dtype: int64
x    6
X    2
Name: Cabbage, dtype: int64
x    33
X     5
Name: Sauce, dtype: int64
x    6
X    1
Name: Salsa?, dtype: int64
x    9
X    6
Name: Cilantro, dtype: int64
x    9
X 

In [21]:
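# Okay, so I have a few different values representing true and false.

trues = ['x', 'X', 'Yes']

# Caution: working[col_bool] returns a copy, so these in-place calls modify
# that temporary copy and leave `working` itself unchanged.
working[col_bool].replace(trues, 1, inplace = True)
working[col_bool].replace('No', 0, inplace = True)
working[col_bool].fillna(0, inplace = True)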

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  **kwargs
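
The warning above, together with the unchanged value counts in the next cell, indicates that the in-place calls acted on a temporary copy rather than on `working`. A minimal sketch of an assignment-based alternative follows. The `toy` frame and `flag_cols` name are hypothetical, and using `isin` in place of the three `replace`/`fillna` calls is a swapped-in technique that assumes every value other than 'x', 'X', or 'Yes' should count as false.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same messy markers as the ingredient columns.
toy = pd.DataFrame({
    'Chips': ['x', 'X', 'Yes', 'No', np.nan],
    'Beef': ['x', np.nan, 'X', np.nan, 'x'],
    'Cost': [6.49, 5.45, 4.85, 5.25, 6.59],
})
flag_cols = ['Chips', 'Beef']
trues = ['x', 'X', 'Yes']

# isin() is True only for the listed markers; 'No' and NaN both become False.
# Assigning the result back writes to the original frame rather than a copy.
toy[flag_cols] = toy[flag_cols].isin(trues).astype(int)
print(toy)
```

The same pattern should carry over to `working[col_bool]`, after which rerunning the `value_counts()` loop would show only 0s and 1s.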


In [22]:
for column in col_bool:
  print(working[column].value_counts())

x      21
X       3
Yes     1
No      1
Name: Chips, dtype: int64
x    33
Name: Unreliable, dtype: int64
x    5
X    2
Name: NonSD, dtype: int64
x    137
X     42
Name: Beef, dtype: int64
x    127
X     31
Name: Pico, dtype: int64
x    114
X     40
Name: Guac, dtype: int64
x    128
X     31
Name: Cheese, dtype: int64
x    102
X     25
Name: Fries, dtype: int64
x    67
X    25
Name: Sour cream, dtype: int64
x    36
X    15
Name: Pork, dtype: int64
x    20
X     1
Name: Chicken, dtype: int64
x    17
X     4
Name: Shrimp, dtype: int64
x    4
X    2
Name: Fish, dtype: int64
x    26
X    10
Name: Rice, dtype: int64
x    27
X     8
Name: Beans, dtype: int64
x    9
X    2
Name: Lettuce, dtype: int64
x    5
X    2
Name: Tomato, dtype: int64
x    4
X    3
Name: Bell peper, dtype: int64
x    1
Name: Carrots, dtype: int64
x    6
X    2
Name: Cabbage, dtype: int64
x    33
X     5
Name: Sauce, dtype: int64
x    6
X    1
Name: Salsa?, dtype: int64
x    9
X    6
Name: Cilantro, dtype: int64
x    9
X 