<a href="https://colab.research.google.com/github/jacobpad/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS12_214A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression



---


## Assignment 🌯 ***CLASSIFICATION - A OR B - GREAT OR NOT***


---


You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

# This code was given to start with

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# Start with my imports

In [0]:
# A bunch of imports, yes, it's probably overkill
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import math
import plotly.express as px
import category_encoders as ce
import datetime # import and manipulate datetime
from IPython.display import display, HTML
from warnings import filterwarnings
filterwarnings('ignore')
from sklearn.metrics import mean_absolute_error, r2_score, accuracy_score
from sklearn.linear_model import Ridge, LinearRegression, RidgeCV, LogisticRegressionCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
%matplotlib inline

In [0]:
# Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
# Begin with baselines for classification.
# Use scikit-learn for logistic regression.
# Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
# Get your model's test accuracy. (One time, at the end.)
# Commit your notebook to your fork of the GitHub repo.

# Check some things out

In [9]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [10]:
df.describe()

Unnamed: 0,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso
count,87.0,87.0,414.0,418.0,22.0,22.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,0.0
mean,3.887356,4.167816,7.067343,3.495335,546.181818,0.675277,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,
std,0.475396,0.373698,1.506742,0.812069,144.445619,0.080468,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,
min,2.5,2.9,2.99,0.5,350.0,0.56,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,
25%,3.5,4.0,6.25,3.0,450.0,0.619485,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,
50%,4.0,4.2,6.99,3.5,540.0,0.658099,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,
75%,4.0,4.4,7.88,4.0,595.0,0.721726,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,
max,4.5,5.0,25.0,5.0,925.0,0.865672,26.0,29.0,1.54,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,


In [11]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Yelp,87.0,3.887356,0.475396,2.5,3.5,4.0,4.0,4.5
Google,87.0,4.167816,0.373698,2.9,4.0,4.2,4.4,5.0
Cost,414.0,7.067343,1.506742,2.99,6.25,6.99,7.88,25.0
Hunger,418.0,3.495335,0.812069,0.5,3.0,3.5,4.0,5.0
Mass (g),22.0,546.181818,144.445619,350.0,450.0,540.0,595.0,925.0
Density (g/mL),22.0,0.675277,0.080468,0.56,0.619485,0.658099,0.721726,0.865672
Length,283.0,20.038233,2.083518,15.0,18.5,20.0,21.5,26.0
Circum,281.0,22.135765,1.779408,17.0,21.0,22.0,23.0,29.0
Volume,281.0,0.786477,0.152531,0.4,0.68,0.77,0.88,1.54
Tortilla,421.0,3.519477,0.794438,1.0,3.0,3.5,4.0,5.0


In [12]:
df.describe(exclude='number')

Unnamed: 0,Burrito,Date,Chips,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
count,421,421,26,33,7,179,158,154,159,127,92,51,21,21,6,36,35,11,7,7,1,8,38,7,15,17,4,7,2,4,4,1,5,3,3,2,13,3,1,421
unique,5,169,4,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,1,2
top,California,8/30/2016,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,False
freq,169,29,21,33,5,137,127,114,128,102,67,36,20,17,4,26,27,9,5,4,1,6,33,6,9,9,3,5,2,4,4,1,5,3,3,2,13,2,1,239


In [13]:
df.describe(exclude='number').T

Unnamed: 0,count,unique,top,freq
Burrito,421,5,California,169
Date,421,169,8/30/2016,29
Chips,26,4,x,21
Unreliable,33,1,x,33
NonSD,7,2,x,5
Beef,179,2,x,137
Pico,158,2,x,127
Guac,154,2,x,114
Cheese,159,2,x,128
Fries,127,2,x,102


In [14]:
df.columns

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini',
       'Great'],
      dtype='object')

In [15]:
df.isnull().sum()

Burrito             0
Date                0
Yelp              334
Google            334
Chips             395
Cost                7
Hunger              3
Mass (g)          399
Density (g/mL)    399
Length            138
Circum            140
Volume            140
Tortilla            0
Temp               20
Meat               14
Fillings            3
Meat:filling        9
Uniformity          2
Salsa              25
Synergy             2
Wrap                3
Unreliable        388
NonSD             414
Beef              242
Pico              263
Guac              267
Cheese            262
Fries             294
Sour cream        329
Pork              370
Chicken           400
Shrimp            400
Fish              415
Rice              385
Beans             386
Lettuce           410
Tomato            414
Bell peper        414
Carrots           420
Cabbage           413
Sauce             383
Salsa.1           414
Cilantro          406
Onion             404
Taquito           417
Pineapple 

In [0]:
# df.isnull().any()

In [17]:
df.describe(include='all')

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
count,421,421,87.0,87.0,26,414.0,418.0,22.0,22.0,283.0,281.0,281.0,421.0,401.0,407.0,418.0,412.0,419.0,396.0,419.0,418.0,33,7,179,158,154,159,127,92,51,21,21,6,36,35,11,7,7,1,8,38,7,15,17,4,7,2,4,4,1,0.0,5,3,3,2,13,3,1,421
unique,5,169,,,4,,,,,,,,,,,,,,,,,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,2,1,1,1,1,,1,1,1,1,1,2,1,2
top,California,8/30/2016,,,x,,,,,,,,,,,,,,,,,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,,x,x,x,x,x,x,x,False
freq,169,29,,,21,,,,,,,,,,,,,,,,,33,5,137,127,114,128,102,67,36,20,17,4,26,27,9,5,4,1,6,33,6,9,9,3,5,2,4,4,1,,5,3,3,2,13,2,1,239
mean,,,3.887356,4.167816,,7.067343,3.495335,546.181818,0.675277,20.038233,22.135765,0.786477,3.519477,3.783042,3.620393,3.539833,3.586481,3.428998,3.37197,3.586993,3.979904,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
std,,,0.475396,0.373698,,1.506742,0.812069,144.445619,0.080468,2.083518,1.779408,0.152531,0.794438,0.980338,0.829254,0.799549,0.997057,1.068794,0.924037,0.886807,1.118185,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
min,,,2.5,2.9,,2.99,0.5,350.0,0.56,15.0,17.0,0.4,1.0,1.0,1.0,1.0,0.5,0.0,0.0,1.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
25%,,,3.5,4.0,,6.25,3.0,450.0,0.619485,18.5,21.0,0.68,3.0,3.0,3.0,3.0,3.0,2.6,3.0,3.0,3.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
50%,,,4.0,4.2,,6.99,3.5,540.0,0.658099,20.0,22.0,0.77,3.5,4.0,3.8,3.5,4.0,3.5,3.5,3.8,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
75%,,,4.0,4.4,,7.88,4.0,595.0,0.721726,21.5,23.0,0.88,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [18]:
df.dtypes

Burrito            object
Date               object
Yelp              float64
Google            float64
Chips              object
Cost              float64
Hunger            float64
Mass (g)          float64
Density (g/mL)    float64
Length            float64
Circum            float64
Volume            float64
Tortilla          float64
Temp              float64
Meat              float64
Fillings          float64
Meat:filling      float64
Uniformity        float64
Salsa             float64
Synergy           float64
Wrap              float64
Unreliable         object
NonSD              object
Beef               object
Pico               object
Guac               object
Cheese             object
Fries              object
Sour cream         object
Pork               object
Chicken            object
Shrimp             object
Fish               object
Rice               object
Beans              object
Lettuce            object
Tomato             object
Bell peper         object
Carrots     

In [19]:
# Sum Null values by column and sort from least to greatest
# but first change the display
pd.set_option('display.max_rows', 60)
df.isnull().sum().sort_values()

Burrito             0
Tortilla            0
Great               0
Date                0
Synergy             2
Uniformity          2
Hunger              3
Wrap                3
Fillings            3
Cost                7
Meat:filling        9
Meat               14
Temp               20
Salsa              25
Length            138
Circum            140
Volume            140
Beef              242
Cheese            262
Pico              263
Guac              267
Fries             294
Sour cream        329
Yelp              334
Google            334
Pork              370
Sauce             383
Rice              385
Beans             386
Unreliable        388
Chips             395
Density (g/mL)    399
Mass (g)          399
Chicken           400
Shrimp            400
Onion             404
Cilantro          406
Avocado           408
Lettuce           410
Cabbage           413
Bell peper        414
Salsa.1           414
NonSD             414
Pineapple         414
Tomato            414
Fish      

In [20]:
df.shape

(421, 59)

In [21]:
df['Queso'].value_counts().sum()

0

In [22]:
df['Queso'].sort_values().unique()

array([nan])

In [23]:
df['Queso'].sort_values()

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
418   NaN
419   NaN
420   NaN
421   NaN
422   NaN
Name: Queso, Length: 421, dtype: float64

In [24]:
df['NonSD'].unique()

array([nan, 'x', 'X'], dtype=object)

# Clean up the data

In [25]:
# Keep only the columns with fewer than 100 missing values
columns = [column for column in df.columns if df[column].isnull().sum() < 100]

columns

['Burrito',
 'Date',
 'Cost',
 'Hunger',
 'Tortilla',
 'Temp',
 'Meat',
 'Fillings',
 'Meat:filling',
 'Uniformity',
 'Salsa',
 'Synergy',
 'Wrap',
 'Great']

In [26]:
df[columns].isnull().sum()

Burrito          0
Date             0
Cost             7
Hunger           3
Tortilla         0
Temp            20
Meat            14
Fillings         3
Meat:filling     9
Uniformity       2
Salsa           25
Synergy          2
Wrap             3
Great            0
dtype: int64

In [27]:
df2 = df[columns].dropna()
df2.head(2)

Unnamed: 0,Burrito,Date,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Great
0,California,1/18/2016,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,False
1,California,1/24/2016,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,False


In [28]:
df2[columns].isnull().sum()

Burrito         0
Date            0
Cost            0
Hunger          0
Tortilla        0
Temp            0
Meat            0
Fillings        0
Meat:filling    0
Uniformity      0
Salsa           0
Synergy         0
Wrap            0
Great           0
dtype: int64

In [29]:
df2[columns].shape

(356, 14)

In [0]:
# Don't use this - it's just an example
# get rid of '?' and change it to NaN by overwriting the original
# df = df.replace({'?':np.NaN, 'n':0,'y':1})
# df.head()
# https://github.com/jacobpad/DS-Unit-1-Sprint-2-Statistics/blob/master/module1/LS_DS12_121_Statistics_Probability_Assignment.ipynb

In [0]:
# Get rid of 'NaN's' and replace w/ 0's 
# Get rid of 'x' and replace with 1
# Get rid of 'X' and replace with 1
# by overwriting the original
# df = df.replace({np.NaN:0, 'x':1,'X':1})
# df.head()

# Fix the date

In [0]:
# Convert the 'Date' column from strings to datetimes

In [33]:
# See what it is
print(df2['Date'].head(1)) # it's an object

0    1/18/2016
Name: Date, dtype: object


In [34]:
# Now make it something usable
# Change the date datatype
print('Before changing --',df2['Date'].dtypes)
df2['Date'] = pd.to_datetime(df2['Date'])
print('Post   changing --',df2['Date'].dtypes)

Before changing -- object
Post   changing -- datetime64[ns]
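
The conversion above lets pandas infer the date layout. A minimal sketch of the same step with an explicit format string, assuming every value follows the month/day/year layout seen in `df2['Date'].head(1)`. The sample strings here are hypothetical stand-ins, not rows read from the CSV:

```python
import pandas as pd

# Hypothetical sample in the same month/day/year layout as the 'Date' column
dates = pd.Series(['1/18/2016', '8/30/2016', '12/29/2017'])

# An explicit format removes any month-first vs day-first ambiguity
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

If any value did not match the format, `pd.to_datetime` would raise an error rather than guess, which can be a useful check on messy data.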


# **Train/Validate/Test Split** - Separate the train, validate & test sets - Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later

## Fix and set the date

In [35]:
# Set the Train
start_date = '01-01-1900'
end_date = '12-31-2016'
train = df2[(df2['Date'] > start_date) & (df2['Date'] <= end_date)]
print(train['Date'].min())
print(train['Date'].max())

2016-01-18 00:00:00
2016-12-10 00:00:00


In [36]:
# Build the validation set: reviews dated within 2017
# (>= so that a review dated January 1 would not be dropped)
start_date = '01-01-2017'
end_date = '12-31-2017'
validate = df2[(df2['Date'] >= start_date) & (df2['Date'] <= end_date)]
print(validate['Date'].min())
print(validate['Date'].max())

2017-01-07 00:00:00
2017-12-29 00:00:00


In [37]:
# Build the test set: reviews from 2018 onward
# (>= so that a review dated January 1 would not be dropped)
start_date = '01-01-2018'
test = df2[(df2['Date'] >= start_date)]
print(test['Date'].min())
print(test['Date'].max())

2018-01-02 00:00:00
2026-04-25 00:00:00


In [38]:
# Print the shapes
print('Train    -',train.shape)
print('Validate -',validate.shape)
print('Test     -',test.shape)

Train    - (250, 14)
Validate - (74, 14)
Test     - (32, 14)
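
The same three-way split can also be written in terms of the review year, which avoids string date boundaries. This is a minimal sketch on a hypothetical five-row frame (`demo` and the `*_demo` names are made up for illustration), assuming the `Date` column has already been converted to datetimes:

```python
import pandas as pd

# Hypothetical mini-frame; only the 'Date' column matters for the split
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2016-05-01', '2017-01-01', '2017-12-31',
                            '2018-01-01', '2019-03-02']),
    'Great': [True, False, True, False, True],
})

year = demo['Date'].dt.year

train_demo = demo[year <= 2016]     # 2016 & earlier
validate_demo = demo[year == 2017]  # 2017 only
test_demo = demo[year >= 2018]      # 2018 & later

print(len(train_demo), len(validate_demo), len(test_demo))
```

Because the three masks cover every year exactly once, the split sizes should add up to the number of rows in the original frame, which is a quick sanity check.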


## Moving on

In [0]:
# Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
# Begin with baselines for classification.
# Use scikit-learn for logistic regression.
# Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
# Get your model's test accuracy. (One time, at the end.)
# Commit your notebook to your fork of the GitHub repo.

In [40]:
train.columns

Index(['Burrito', 'Date', 'Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat',
       'Fillings', 'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap',
       'Great'],
      dtype='object')

In [41]:
train.describe(include='all').T

Unnamed: 0,count,unique,top,freq,first,last,mean,std,min,25%,50%,75%,max
Burrito,250,5.0,California,106.0,NaT,NaT,,,,,,,
Date,250,98.0,2016-08-30 00:00:00,25.0,2016-01-18,2016-12-10,,,,,,,
Cost,250,,,,NaT,NaT,6.96016,1.13894,3.99,6.25,6.95,7.5,11.95
Hunger,250,,,,NaT,NaT,3.45,0.823756,0.5,3.0,3.5,4.0,5.0
Tortilla,250,,,,NaT,NaT,3.457,0.764683,1.4,3.0,3.5,4.0,5.0
Temp,250,,,,NaT,NaT,3.6724,0.997426,1.0,3.0,4.0,4.5,5.0
Meat,250,,,,NaT,NaT,3.5472,0.843158,1.0,3.0,3.5,4.0,5.0
Fillings,250,,,,NaT,NaT,3.4998,0.800282,1.0,3.0,3.5,4.0,5.0
Meat:filling,250,,,,NaT,NaT,3.4942,1.02114,1.0,3.0,3.75,4.0,5.0
Uniformity,250,,,,NaT,NaT,3.34,1.1049,1.0,2.5,3.5,4.0,5.0


In [42]:
train.describe(exclude='number').T

Unnamed: 0,count,unique,top,freq,first,last
Burrito,250,5,California,106,NaT,NaT
Date,250,98,2016-08-30 00:00:00,25,2016-01-18,2016-12-10
Great,250,2,False,154,NaT,NaT


In [43]:
train.describe(include='number').columns

Index(['Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap'],
      dtype='object')

In [0]:
# Set features and target
features = ['Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']

target = 'Great'

In [0]:
# Separate features and target for each split
X_train = train[features]
y_train = train[target]

X_validate = validate[features]
y_validate = validate[target]

X_test = test[features]
y_test = test[target]

In [46]:
# Encode the categorical features with one-hot encoding
# (every selected feature is numeric, so the encoder leaves the columns unchanged)
encoder = ce.OneHotEncoder(use_cat_names=True)

# Use the encoder to transform X_train
X_train = encoder.fit_transform(X_train)
print(X_train.shape)
X_train.head(2)

(250, 11)


Unnamed: 0,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap
0,6.49,3.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0
1,5.45,3.5,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0


In [47]:
# Checking for NaNs
X_train.isnull().sum()

Cost            0
Hunger          0
Tortilla        0
Temp            0
Meat            0
Fillings        0
Meat:filling    0
Uniformity      0
Salsa           0
Synergy         0
Wrap            0
dtype: int64

In [48]:
# No values are missing after dropna(), so imputing changes nothing here; kept for practice.
# In general, missing values must be imputed before fitting a logistic regression.
imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

# Train data with imputed values
X_train_imputed

array([[6.49, 3.  , 3.  , ..., 4.  , 4.  , 4.  ],
       [5.45, 3.5 , 2.  , ..., 3.5 , 2.5 , 5.  ],
       [4.85, 1.5 , 3.  , ..., 3.  , 3.  , 5.  ],
       ...,
       [5.49, 3.  , 4.5 , ..., 3.  , 2.5 , 3.  ],
       [7.75, 4.  , 3.5 , ..., 2.2 , 3.3 , 4.5 ],
       [7.75, 4.  , 4.  , ..., 2.  , 2.  , 4.  ]])

In [49]:
# The validation and test data need to be one hot encoded and imputed as well
X_validate = encoder.transform(X_validate)
print(X_validate.head(3),'\n\n\n')

# Reuse the means learned from the training data (transform, not fit_transform)
X_validate_imputed = imputer.transform(X_validate)

# Validation data with imputed values
X_validate_imputed

     Cost  Hunger  Tortilla  Temp  ...  Uniformity  Salsa  Synergy  Wrap
303  8.50     3.9       3.0   4.5  ...         4.0    4.3      4.2   5.0
304  7.90     4.0       3.5   4.0  ...         4.5    4.0      3.8   4.8
305  4.99     3.5       2.5   4.5  ...         3.0    2.0      2.0   4.0

[3 rows x 11 columns] 





array([[ 8.5 ,  3.9 ,  3.  ,  4.5 ,  4.1 ,  3.  ,  3.7 ,  4.  ,  4.3 ,
         4.2 ,  5.  ],
       [ 7.9 ,  4.  ,  3.5 ,  4.  ,  4.  ,  3.  ,  4.  ,  4.5 ,  4.  ,
         3.8 ,  4.8 ],
       [ 4.99,  3.5 ,  2.5 ,  4.5 ,  3.  ,  2.5 ,  3.  ,  3.  ,  2.  ,
         2.  ,  4.  ],
       [ 7.29,  3.5 ,  3.  ,  2.  ,  3.5 ,  3.5 ,  3.  ,  2.5 ,  3.7 ,
         3.2 ,  4.2 ],
       [ 7.89,  3.  ,  4.  ,  5.  ,  4.5 ,  4.  ,  4.5 ,  3.5 ,  3.  ,
         4.5 ,  2.5 ],
       [ 7.49,  3.7 ,  4.  ,  3.5 ,  3.9 ,  4.  ,  3.7 ,  2.  ,  3.5 ,
         4.  ,  4.  ],
       [ 7.9 ,  3.5 ,  4.5 ,  5.  ,  4.  ,  3.5 ,  4.  ,  2.8 ,  4.  ,
         4.5 ,  3.7 ],
       [ 7.9 ,  2.  ,  4.2 ,  4.  ,  4.2 ,  3.8 ,  4.  ,  3.  ,  3.8 ,
         4.5 ,  5.  ],
       [ 5.99,  4.5 ,  2.  ,  4.  ,  2.5 ,  3.5 ,  4.5 ,  3.  ,  2.  ,
         1.5 ,  3.4 ],
       [ 6.99,  3.5 ,  2.5 ,  4.5 ,  4.  ,  4.  ,  2.  ,  3.  ,  3.  ,
         3.  ,  4.  ],
       [ 6.85,  3.5 ,  3.  ,  2.  ,  3.5 ,  3.5 ,  4.  ,  3.

In [50]:
X_test = encoder.transform(X_test)
X_test.head(3)

Unnamed: 0,Cost,Hunger,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap
77,8.0,4.0,4.5,5.0,5.0,5.0,4.5,5.0,3.0,5.0,5.0
386,7.25,4.0,4.0,5.0,4.0,5.0,5.0,3.0,3.0,4.0,5.0
387,4.19,3.0,3.0,5.0,2.0,2.0,4.0,1.0,4.0,3.0,4.0


In [51]:
# Reuse the means learned from the training data (transform, not fit_transform)
X_test_imputed = imputer.transform(X_test)
# Test data with imputed values
X_test_imputed

array([[8.  , 4.  , 4.5 , 5.  , 5.  , 5.  , 4.5 , 5.  , 3.  , 5.  , 5.  ],
       [7.25, 4.  , 4.  , 5.  , 4.  , 5.  , 5.  , 3.  , 3.  , 4.  , 5.  ],
       [4.19, 3.  , 3.  , 5.  , 2.  , 2.  , 4.  , 1.  , 4.  , 3.  , 4.  ],
       [7.  , 5.  , 5.  , 5.  , 5.  , 5.  , 5.  , 5.  , 4.  , 5.  , 5.  ],
       [8.5 , 4.  , 4.  , 4.  , 3.  , 3.5 , 1.  , 2.  , 3.  , 3.  , 1.  ],
       [7.2 , 3.  , 4.  , 5.  , 4.  , 3.  , 3.  , 3.  , 4.  , 3.  , 4.  ],
       [5.99, 3.  , 3.5 , 5.  , 4.3 , 3.5 , 5.  , 4.  , 3.  , 3.8 , 2.  ],
       [5.99, 3.5 , 4.  , 4.5 , 5.  , 4.5 , 5.  , 4.  , 4.  , 4.5 , 4.  ],
       [5.99, 2.  , 2.  , 3.5 , 4.5 , 4.  , 4.  , 2.  , 3.  , 4.  , 2.  ],
       [8.99, 4.  , 4.5 , 4.5 , 4.  , 4.  , 3.  , 4.  , 3.5 , 4.  , 3.  ],
       [5.99, 3.5 , 4.  , 4.5 , 3.5 , 3.  , 4.5 , 3.  , 3.  , 2.5 , 2.5 ],
       [7.5 , 4.  , 4.  , 4.  , 3.5 , 4.2 , 4.5 , 4.3 , 3.  , 4.  , 4.5 ],
       [5.99, 3.  , 2.  , 5.  , 4.5 , 3.5 , 4.5 , 4.5 , 2.5 , 3.5 , 1.5 ],
       [5.99, 5.  , 4.  ,

In [52]:
print('Train shape:      ', len(X_train_imputed), len(X_train_imputed[0]))
print('Validation shape: ', len(X_validate_imputed),'', len(X_validate_imputed[0]))
print('Test shape:       ', len(X_test_imputed),'', len(X_test_imputed[0]))

Train shape:       250 11
Validation shape:  74  11
Test shape:        32  11


# Baseline

In [53]:
# Target = 'Great'
# Baseline accuracy - percentage correct if just guessing the most
#   common 'Great' classification (True or False)
baseline = train['Great'].value_counts(normalize=True).max()
print("Train Accuracy:", baseline*100,'%')

Train Accuracy: 61.6 %
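
The figure above is the majority-class rate on the training set. To compare like with like against the validation accuracy, the same baseline can be scored on the validation targets. A minimal sketch, using short hypothetical series (`y_train_demo`, `y_validate_demo`) in place of the real `y_train` and `y_validate`:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical targets standing in for y_train and y_validate
y_train_demo = pd.Series([False, False, False, True, True])
y_validate_demo = pd.Series([True, False, True, False])

# Majority class learned from the training targets only
majority_class = y_train_demo.mode()[0]

# Predict that class for every validation row and score it
baseline_pred = [majority_class] * len(y_validate_demo)
baseline_validate = accuracy_score(y_validate_demo, baseline_pred)

print(majority_class, baseline_validate)
```

The majority class is taken from the training targets so that nothing about the validation labels leaks into the baseline.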


# Standardizing before fitting the logistic regression model

In [0]:
# Standardizing before fitting: fit the scaler on the training data only,
# then apply the same transformation to the validation and test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_validate_scaled = scaler.transform(X_validate_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Logistic Regression

In [55]:
# Creating logistic regression model
model = LogisticRegressionCV()

# Fitting model onto data
model.fit(X_train_scaled, y_train)

# Getting predicted targets of the validation set
y_pred = model.predict(X_validate_scaled)

# Checking the accuracy of the validation set
log_reg_validate = accuracy_score(y_validate, y_pred)
print("Validation Accuracy:", log_reg_validate)

Validation Accuracy: 0.8648648648648649
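
One of the stretch goals is to get the model coefficients. A minimal sketch of how that might look, assuming a fitted binary logistic regression. It uses synthetic data and plain `LogisticRegression` rather than the `LogisticRegressionCV` model fitted above, and the column names `signal` and `noise` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the target depends on 'signal' only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_demo = X_demo['signal'] > 0

demo_model = LogisticRegression().fit(X_demo, y_demo)

# For a binary target, coef_ has shape (1, n_features)
coefficients = pd.Series(demo_model.coef_[0], index=X_demo.columns).sort_values()
print(coefficients)
```

With standardized inputs, as in the scaled training data above, coefficient magnitudes are easier to compare across features; `coefficients.plot.barh()` would be one way to plot them.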


In [56]:
# See the shape
print('Train shape:', len(X_train_scaled), len(X_train_scaled[0]))
print('Validation shape:', len(X_validate_scaled), len(X_validate_scaled[0]))

Train shape: 250 11
Validation shape: 74 11


In [0]:
# Recombine training and validation features
# (imputed but not scaled, unlike the validation run above)
X_complete_train = np.concatenate([X_train_imputed, X_validate_imputed])

# Recombine training and validation targets
y_complete_train = pd.concat([y_train, y_validate], axis=0)

# Check accuracy on testing data

In [58]:
# Creating model
model = LogisticRegressionCV()

# Fitting model onto data
model.fit(X_complete_train, y_complete_train)

# Getting predicted targets
y_test_pred = model.predict(X_test_imputed)

# Checking the accuracy of the test set
log_reg_test = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", log_reg_test)

Test Accuracy: 0.8125
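
The impute, scale and fit steps above are applied by hand to each split, which makes it easy to skip one of them. A scikit-learn pipeline (one of the stretch goals) bundles them so that every split goes through the same preprocessing. This is a minimal sketch on synthetic data, not the burrito reviews, and it swaps in plain `LogisticRegression` for `LogisticRegressionCV` to keep the example small:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 2 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1]) > 0

# Hold out the last 50 rows, mimicking a train/holdout split
X_fit, X_holdout = X_demo[:150], X_demo[150:]
y_fit, y_holdout = y_demo[:150], y_demo[150:]

# Imputer and scaler are fitted on X_fit only, then reused at predict time
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
pipeline.fit(X_fit, y_fit)

holdout_accuracy = pipeline.score(X_holdout, y_holdout)
print(holdout_accuracy)
```

Calling `pipeline.score` or `pipeline.predict` on new data runs the fitted imputer and scaler first, so the model never sees features on a different scale from the one it was trained on.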


# Results

In [59]:
print("Baseline Train Accuracy:", baseline)
print("Validation Accuracy:", log_reg_validate)
print("Test Accuracy:", log_reg_test)

Baseline Train Accuracy: 0.616
Validation Accuracy: 0.8648648648648649
Test Accuracy: 0.8125


In [0]:
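# Test accuracy (0.8125) came in below validation accuracy (about 0.865),
# but both are well above the 0.616 majority-class baseline from the training set.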
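
With only 74 validation rows and 32 test rows, some of the gap between the two accuracies may be sampling noise. A rough way to gauge that is the binomial standard error of each accuracy. This is a back-of-the-envelope sketch, assuming each prediction is an independent trial; `accuracy_standard_error` is a hypothetical helper, not part of scikit-learn:

```python
import math

def accuracy_standard_error(accuracy, n_rows):
    """Binomial standard error of an accuracy estimated from n_rows predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_rows)

# Accuracies and split sizes reported in the notebook output above
se_validate = accuracy_standard_error(0.8648648648648649, 74)
se_test = accuracy_standard_error(0.8125, 32)

print(round(se_validate, 3), round(se_test, 3))
```

The test standard error comes out at roughly 0.07, which is larger than the gap of about 0.05 between the validation and test accuracies, so the difference is not strong evidence that the model generalizes worse.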