<a href="https://colab.research.google.com/github/repoocsov/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [x] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [x] Get your model's test accuracy. (One time, at the end.)
- [x] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [x] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [x] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
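
The category clean-up above can also be written as an ordered list of rules. The following is a minimal sketch, not the notebook's method: `simplify_burrito` is a hypothetical helper and the example names are made up. Because `np.select` keeps the first matching rule, the rules are listed in reverse of the assignment order above, where later assignments overwrite earlier ones.

```python
import numpy as np
import pandas as pd

def simplify_burrito(names: pd.Series) -> pd.Series:
    """Collapse free-text burrito names into five coarse categories."""
    lowered = names.str.lower()
    # First matching rule wins, so the highest-priority keyword comes first.
    rules = [
        ('carnitas', 'Carnitas'),
        ('surf', 'Surf & Turf'),
        ('asada', 'Asada'),
        ('california', 'California'),
    ]
    conditions = [
        lowered.str.contains(key, na=False, regex=False).to_numpy()
        for key, _ in rules
    ]
    labels = [label for _, label in rules]
    # Anything that matches no keyword falls into 'Other'.
    return pd.Series(np.select(conditions, labels, default='Other'),
                     index=names.index)

# Made-up names, for illustration only.
example = pd.Series(['California Everything', 'Carne asada',
                     'Surf and turf', 'Veggie', 'Carnitas'])
simplified = simplify_burrito(example)
print(simplified.tolist())
```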

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [7]:
"""
    BEGIN ASSIGNMENT HERE
"""
df.head(3)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [8]:
"""Train/validation/test split"""

# Convert to datetime and check for missing values
df['Date'] = pd.to_datetime(df['Date'])
df['Date'].isnull().sum()

0

In [9]:
# Training data
start_date = '01-01-1900'
end_date = '01-01-2017'
train = df[(df['Date'] > start_date) & (df['Date'] < end_date)]

print(train['Date'].min())
print(train['Date'].max())

2011-05-16 00:00:00
2016-12-15 00:00:00


In [10]:
# Validation data
start_date = '01-01-2017'
end_date = '01-01-2018'
val = df[(df['Date'] >= start_date) & (df['Date'] < end_date)]

print(val['Date'].min())
print(val['Date'].max())

2017-01-04 00:00:00
2017-12-29 00:00:00


In [11]:
# Testing data
start_date = '01-01-2018'
end_date = '01-01-2021'
test = df[(df['Date'] >= start_date) & (df['Date'] < end_date)]

print(test['Date'].min())
print(test['Date'].max())

2018-01-02 00:00:00
2019-08-27 00:00:00
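
The three date windows above can also be expressed directly in terms of the review year. This is a small sketch on toy data, assuming the `Date` column has already been converted with `pd.to_datetime`; `split_by_year` is a hypothetical helper, not part of the assignment code.

```python
import pandas as pd

def split_by_year(frame, date_col='Date'):
    """Split into train (<= 2016), validation (2017) and test (>= 2018)."""
    years = frame[date_col].dt.year
    train_part = frame[years <= 2016]
    val_part = frame[years == 2017]
    test_part = frame[years >= 2018]
    return train_part, val_part, test_part

# Toy frame with one row near each boundary (illustrative values only).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-12-15', '2017-01-04',
                            '2017-12-29', '2018-01-02']),
    'Great': [False, True, True, False],
})
train_toy, val_toy, test_toy = split_by_year(toy)

# Every row with a valid date lands in exactly one split.
print(len(train_toy), len(val_toy), len(test_toy))
```

Rows with a missing date would fall into none of the splits, which is why the null check before splitting matters.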


In [12]:
# Numeric columns
train.describe(include='number').columns

Index(['Yelp', 'Google', 'Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)',
       'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings',
       'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Queso'],
      dtype='object')

In [13]:
# Categorical data
print(train.describe(exclude='number').T)

# Features with low enough cardinality to be candidate inputs for our model
features = ['Burrito', 'Chips', 'NonSD', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce',
'Tomato', 'Bell peper', 'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Yelp', 'Google', 'Cost', 'Hunger', 'Mass (g)', 'Density (g/mL)',
'Length', 'Circum', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Queso']

target = 'Great'

              count unique                  top freq      first       last
Burrito         298      5           California  118        NaT        NaT
Date            298    110  2016-08-30 00:00:00   29 2011-05-16 2016-12-15
Chips            22      2                    x   19        NaT        NaT
Unreliable       27      1                    x   27        NaT        NaT
NonSD             5      2                    x    3        NaT        NaT
Beef            168      2                    x  130        NaT        NaT
Pico            143      2                    x  115        NaT        NaT
Guac            139      2                    x  101        NaT        NaT
Cheese          149      2                    x  121        NaT        NaT
Fries           119      2                    x   97        NaT        NaT
Sour cream       85      2                    x   63        NaT        NaT
Pork             43      2                    x   29        NaT        NaT
Chicken          20      

In [0]:
# Separating features and target
X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

In [15]:
# Encoding the categorical features with one-hot encoding
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)

# Using the encoder to transform X_train
X_train = encoder.fit_transform(X_train)
X_train.head(3)

Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Chips_nan,Chips_x,Chips_X,NonSD_nan,NonSD_x,NonSD_X,Beef_x,Beef_nan,Beef_X,Pico_x,Pico_nan,Pico_X,Guac_x,Guac_nan,Guac_X,Cheese_x,Cheese_nan,Cheese_X,Fries_x,Fries_nan,Fries_X,Sour cream_nan,Sour cream_x,Sour cream_X,Pork_nan,Pork_x,Pork_X,Chicken_nan,Chicken_x,Chicken_X,Shrimp_nan,Shrimp_x,Shrimp_X,Fish_nan,Fish_x,...,Cabbage_nan,Cabbage_x,Cabbage_X,Sauce_nan,Sauce_x,Sauce_X,Salsa.1_nan,Salsa.1_x,Salsa.1_X,Cilantro_nan,Cilantro_x,Cilantro_X,Onion_nan,Onion_x,Onion_X,Taquito_nan,Taquito_x,Taquito_X,Pineapple_nan,Pineapple_x,Pineapple_X,Yelp,Google,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Queso
0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,...,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,3.5,4.2,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,
1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,...,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,3.5,3.3,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,
2,0,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,...,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,
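
The encoded frame above has separate `_x` and `_X` columns for each ingredient, which suggests the flags were entered in both lower and upper case. One option, shown here as a sketch and not something the notebook does, is to map the flags to 0/1 before encoding so each ingredient becomes a single column. `flags_to_binary` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def flags_to_binary(frame, columns):
    """Return a copy where 'x'/'X' flags become 1 and everything else 0."""
    out = frame.copy()
    for col in columns:
        # Missing values are not strings, so they map to 0.
        out[col] = out[col].map(
            lambda v: int(isinstance(v, str) and v.strip().lower() == 'x')
        )
    return out

# Toy ingredient flags with mixed case and missing entries.
toy_flags = pd.DataFrame({
    'Beef': ['x', 'X', None],
    'Pico': [None, 'x', 'x'],
})
binary_flags = flags_to_binary(toy_flags, ['Beef', 'Pico'])
print(binary_flags)
```

This treats a missing flag as "ingredient absent", which is an assumption about how the reviews were recorded.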


In [16]:
# Checking for nans
pd.options.display.max_rows = 999
X_train.isnull().sum()

Burrito_California       0
Burrito_Carnitas         0
Burrito_Asada            0
Burrito_Other            0
Burrito_Surf & Turf      0
Chips_nan                0
Chips_x                  0
Chips_X                  0
NonSD_nan                0
NonSD_x                  0
NonSD_X                  0
Beef_x                   0
Beef_nan                 0
Beef_X                   0
Pico_x                   0
Pico_nan                 0
Pico_X                   0
Guac_x                   0
Guac_nan                 0
Guac_X                   0
Cheese_x                 0
Cheese_nan               0
Cheese_X                 0
Fries_x                  0
Fries_nan                0
Fries_X                  0
Sour cream_nan           0
Sour cream_x             0
Sour cream_X             0
Pork_nan                 0
Pork_x                   0
Pork_X                   0
Chicken_nan              0
Chicken_x                0
Chicken_X                0
Shrimp_nan               0
Shrimp_x                 0
S

In [17]:
# Need to impute the missing values before fitting our logistic regression
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)

# Train data with imputed values
X_train_imputed

array([[1.        , 0.        , 0.        , ..., 4.        , 4.        ,
        4.        ],
       [1.        , 0.        , 0.        , ..., 3.5       , 2.5       ,
        5.        ],
       [0.        , 1.        , 0.        , ..., 3.        , 3.        ,
        5.        ],
       ...,
       [1.        , 0.        , 0.        , ..., 2.2       , 3.3       ,
        4.5       ],
       [0.        , 0.        , 1.        , ..., 2.        , 2.        ,
        4.        ],
       [0.        , 0.        , 0.        , ..., 3.32464029, 3.8       ,
        2.        ]])
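
To make the idea behind mean imputation concrete, here is a NumPy-only sketch on toy arrays: the column means are learned from the training rows and then reused for any other split. `fill_with` is a hypothetical helper; the notebook itself relies on `SimpleImputer`.

```python
import numpy as np

# Toy data: NaN marks a missing measurement.
train_toy = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [np.nan, 8.0]])
val_toy = np.array([[np.nan, np.nan],
                    [2.0, 5.0]])

# Column means computed from the training rows only, ignoring NaNs.
train_means = np.nanmean(train_toy, axis=0)

def fill_with(X, fill_values):
    """Replace NaNs in each column with the matching fill value."""
    return np.where(np.isnan(X), fill_values, X)

train_filled = fill_with(train_toy, train_means)
val_filled = fill_with(val_toy, train_means)  # reuses the training means
print(val_filled)
```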

In [18]:
# The validation and test data need to be one hot encoded and imputed as well
X_val = encoder.transform(X_val)
X_val.head(3)

# Impute with the means learned from the training data (transform, not fit_transform)
X_val_imputed = imputer.transform(X_val)

# Validation data with imputed values
X_val_imputed

array([[1.  , 0.  , 0.  , ..., 1.5 , 3.5 , 4.5 ],
       [0.  , 0.  , 0.  , ..., 4.2 , 3.75, 5.  ],
       [0.  , 0.  , 0.  , ..., 4.3 , 4.2 , 5.  ],
       ...,
       [0.  , 0.  , 0.  , ..., 3.5 , 4.  , 2.  ],
       [1.  , 0.  , 0.  , ..., 3.5 , 4.3 , 4.5 ],
       [1.  , 0.  , 0.  , ..., 5.  , 5.  , 3.  ]])

In [19]:
X_test = encoder.transform(X_test)
X_test.head(3)

# Impute with the means learned from the training data (transform, not fit_transform)
X_test_imputed = imputer.transform(X_test)

# Test data with imputed values
X_test_imputed

array([[1. , 0. , 0. , ..., 3. , 4. , 5. ],
       [0. , 0. , 0. , ..., 4. , 3. , 4. ],
       [1. , 0. , 0. , ..., 4. , 5. , 5. ],
       ...,
       [1. , 0. , 0. , ..., 3.5, 4. , 4.5],
       [0. , 0. , 0. , ..., 5. , 5. , 2. ],
       [0. , 0. , 0. , ..., 3. , 4.5, 4. ]])

In [20]:
print('Train shape:', len(X_train_imputed), len(X_train_imputed[0]))
print('Validation shape:', len(X_val_imputed), len(X_val_imputed[0]))
print('Test shape:', len(X_test_imputed), len(X_test_imputed[0]))

Train shape: 298 93
Validation shape: 85 93
Test shape: 37 93


In [21]:
""" BASELINE ACCURACY """
# Target is 'Great'
# Baseline accuracy is the fraction correct when always guessing the most common 'Great' class (True or False)
baseline = train['Great'].value_counts(normalize=True).max()
print("Train Accuracy:", baseline)

# A useful model has to beat this majority-class baseline of about 59%

Train Accuracy: 0.5906040268456376
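
The baseline above is measured on the training rows. For a like-for-like comparison with a validation score, the majority class learned from the training labels can be scored on another split. This is a sketch with made-up labels; `majority_baseline` is a hypothetical helper.

```python
import pandas as pd

def majority_baseline(y_fit, y_eval):
    """Learn the most common class on y_fit, then score it on y_eval."""
    majority_class = y_fit.mode()[0]
    accuracy = (y_eval == majority_class).mean()
    return majority_class, accuracy

# Made-up labels, for illustration only.
y_fit_toy = pd.Series([False, False, False, True, True])
y_eval_toy = pd.Series([True, False, True, False])

majority, baseline_accuracy = majority_baseline(y_fit_toy, y_eval_toy)
print(majority, baseline_accuracy)
```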


In [0]:
# Standardizing before fitting our logistic regression model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
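
Standardizing means subtracting each column's training mean and dividing by its training standard deviation, then applying those same numbers to every other split. The NumPy sketch below illustrates that arithmetic on toy arrays; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, and the notebook itself uses `StandardScaler`.

```python
import numpy as np

def fit_standardizer(X_fit):
    """Learn per-column mean and population standard deviation."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    # Avoid dividing by zero for constant columns.
    std[std == 0] = 1.0
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with statistics learned elsewhere."""
    return (X - mean) / std

# Toy data: the second column is constant in the fitting rows.
X_fit_toy = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_new_toy = np.array([[3.0, 12.0]])

mean_toy, std_toy = fit_standardizer(X_fit_toy)
X_fit_scaled_toy = apply_standardizer(X_fit_toy, mean_toy, std_toy)
X_new_scaled_toy = apply_standardizer(X_new_toy, mean_toy, std_toy)
print(X_new_scaled_toy)
```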

In [23]:
""" LOGISTIC REGRESSION """
from warnings import filterwarnings
filterwarnings('ignore')
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import accuracy_score

# Creating logistic regression model
model = LogisticRegressionCV()

# Fitting model onto data
model.fit(X_train_scaled, y_train)

# Getting predicted targets of the validation set
y_pred = model.predict(X_val_scaled)

# Checking the accuracy of the validation set
log_reg = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", log_reg)

Validation Accuracy: 0.7647058823529411
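
The impute, scale and fit steps can be chained with a scikit-learn pipeline so the same transformations are always applied in the same order. The sketch below runs on synthetic data, not the burrito reviews. It uses plain `LogisticRegression` and leaves out the one-hot encoder to stay self-contained, and all variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first two columns.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0
# Knock out roughly 10% of the entries to mimic missing measurements.
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

X_fit, X_eval = X_toy[:150], X_toy[150:]
y_fit, y_eval = y_toy[:150], y_toy[150:]

pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
# Fitting learns the imputation means, scaling statistics and model
# coefficients from the fitting rows only.
pipeline.fit(X_fit, y_fit)

eval_accuracy = pipeline.score(X_eval, y_eval)
coefficients = pipeline.named_steps['logisticregression'].coef_[0]
print('Held-out accuracy:', eval_accuracy)
print('Coefficients:', coefficients)
```

Because the scaler lives inside the pipeline, refitting on a larger training set and predicting on a test set cannot accidentally mix scaled and unscaled features.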


In [24]:
"""
  BEFORE CHECKING THE MODEL ACCURACY ON THE TEST DATA, CAN THE MODEL BE IMPROVED?
"""
print('Train shape:', len(X_train_scaled), len(X_train_scaled[0]))
print('Validation shape:', len(X_val_scaled), len(X_val_scaled[0]))

Train shape: 298 93
Validation shape: 85 93


In [0]:
import numpy as np

# Recombining training and validation features (imputed, but not standardized)
X_complete_train = np.concatenate([X_train_imputed, X_val_imputed])

# Recombining training and validation targets
y_complete_train = pd.concat([y_train, y_val], axis=0)

In [50]:
""" CHECKING ACCURACY ON TESTING DATA """

# Creating model
model = LogisticRegressionCV()

# Fitting model onto data
model.fit(X_complete_train, y_complete_train)

# Getting predicted targets
y_test_pred = model.predict(X_test_imputed)

# Checking the accuracy on the test set
log_reg = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", log_reg)

Test Accuracy: 0.7837837837837838


In [46]:
"""
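  Highest accuracy in this notebook: about 0.784 on the test set, versus about 0.765 on validation and a 0.591 baseline.
  The final model was refit on train + validation using imputed but unscaled features, and the test set has only 37 rows,
  so the gain over the validation score cannot be attributed to the extra training data alone.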
"""

(383, 93)