Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

## Preprocessing

First, I am looking for features we don't need, starting with the high-cardinality ones. High cardinality means that a column contains a large percentage of unique values. Low cardinality means that a column's values repeat often. I will also look for and deal with nulls, create new features as needed, and make sure the data types are correct. The question is: "How accurately can you predict whether a burrito is rated 'Great'?" In other words, what constitutes a "great" burrito? Let me look at the columns to see what we are working with.
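To make "high cardinality" concrete, one option is to count the distinct values in each column and divide by the number of rows. The sketch below runs on a small made-up frame; `toy` and its values are invented for illustration and are not taken from the burrito data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'California', 'Carnitas'],
    'Reviewer': ['Ann', 'Bo', 'Cy', 'Di'],
})

# Number of distinct values in each column
cardinality = toy.nunique()

# Share of rows that hold a distinct value; close to 1.0 means high cardinality
unique_share = cardinality / len(toy)
print(unique_share.sort_values(ascending=False))
```

A column whose share is near 1.0 (like the hypothetical `Reviewer` here) is a candidate for dropping before encoding.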

In [7]:
df.columns

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini',
       'Great'],
      dtype='object')

***What do we NOT need here? Before we do any feature engineering, let's look for nulls and check the data types.***

In [8]:
df.isna().sum()

Burrito             0
Date                0
Yelp              334
Google            334
Chips             395
Cost                7
Hunger              3
Mass (g)          399
Density (g/mL)    399
Length            138
Circum            140
Volume            140
Tortilla            0
Temp               20
Meat               14
Fillings            3
Meat:filling        9
Uniformity          2
Salsa              25
Synergy             2
Wrap                3
Unreliable        388
NonSD             414
Beef              242
Pico              263
Guac              267
Cheese            262
Fries             294
Sour cream        329
Pork              370
Chicken           400
Shrimp            400
Fish              415
Rice              385
Beans             386
Lettuce           410
Tomato            414
Bell peper        414
Carrots           420
Cabbage           413
Sauce             383
Salsa.1           414
Cilantro          406
Onion             404
Taquito           417
Pineapple 

***I can tell you right away that these nulls can be safely converted to zeros.***

In [9]:
df = df.fillna(0)

***Let's take a look at the data types and make sure everything is in line.***

In [10]:
df.dtypes

Burrito            object
Date               object
Yelp              float64
Google            float64
Chips              object
Cost              float64
Hunger            float64
Mass (g)          float64
Density (g/mL)    float64
Length            float64
Circum            float64
Volume            float64
Tortilla          float64
Temp              float64
Meat              float64
Fillings          float64
Meat:filling      float64
Uniformity        float64
Salsa             float64
Synergy           float64
Wrap              float64
Unreliable         object
NonSD              object
Beef               object
Pico               object
Guac               object
Cheese             object
Fries              object
Sour cream         object
Pork               object
Chicken            object
Shrimp             object
Fish               object
Rice               object
Beans              object
Lettuce            object
Tomato             object
Bell peper         object
Carrots     

***We need to deal with the date feature: convert it to datetime and create a year feature.***

In [11]:
# Convert to datetime
df['Date'] = df['Date'].astype('datetime64[ns]')

# Make year feature
df['Year'] = df['Date'].dt.year

***Let's take a look at the head and columns again to see what we can live without, what we need, and what we can engineer.***

In [12]:
df.head()

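|   | Burrito | Date | Yelp | Google | Chips | Cost | Hunger | Mass (g) | Density (g/mL) | Length | ... | Queso | Egg | Mushroom | Bacon | Sushi | Avocado | Corn | Zucchini | Great | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 2016-01-18 | 3.5 | 4.2 | 0 | 6.49 | 3.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 1 | California | 2016-01-24 | 3.5 | 3.3 | 0 | 5.45 | 3.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 2 | Carnitas | 2016-01-24 | 0.0 | 0.0 | 0 | 4.85 | 1.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 3 | Asada | 2016-01-24 | 0.0 | 0.0 | 0 | 5.25 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 4 | California | 2016-01-27 | 4.0 | 3.8 | x | 6.59 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | 2016 |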


In [13]:
df.columns

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini',
       'Great', 'Year'],
      dtype='object')

***We can divide these columns into several groups, excluding Burrito, Date, Great, and Year.***

1. **Ratings** \['Yelp', 'Google'\]

2. **Toppings / Fillings** \['Meat', 'Fillings', 'Meat:filling', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish', 'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots', 'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito', 'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso', 'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini'\]

3. **Other** \['Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Temp', 'Wrap', 'Unreliable', 'NonSD', 'Uniformity', 'Synergy'\]

***Of course we also have Date, Year, and Burrito. Some of the columns in "Other" I know, but others, like Circum and NonSD, are less clear. Our target is Great, because we are trying to predict the greatness of a burrito. Which of these other columns would help us do that? The ratings would not, though they could help us check our results (is our prediction in accord with how people rated these?). I considered keeping them and combining them into one feature called "rating" by taking the mean of both, but those columns had many nulls (which we converted to 0), so I will just drop them. To make this quicker, I will drop everything in "Ratings" and "Other".***
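The problem with averaging after the nulls became 0 is that a missing rating drags the mean down. As a sketch of the alternative (an assumption, not what this notebook does), a row-wise mean taken before any `fillna(0)` skips missing values by default. The `toy` frame and the `rating` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings with gaps, invented for illustration only
toy = pd.DataFrame({
    'Yelp': [3.5, np.nan, np.nan],
    'Google': [4.5, 4.0, np.nan],
})

# DataFrame.mean(axis=1) ignores NaN by default (skipna=True),
# so a row with one rating keeps that rating instead of being halved
toy['rating'] = toy[['Yelp', 'Google']].mean(axis=1)
print(toy)
```

Rows where both ratings are missing stay NaN and would still need a decision (drop, impute, or flag).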

In [14]:
# Lists from above
ratings = ['Yelp', 'Google']
other = ['Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Temp', 'Wrap', 'Unreliable', 'NonSD', 'Uniformity', 'Synergy' ]

# Drop the columns
df = df.drop(ratings + other, axis=1)

# Let's make all column names lowercase
df.columns = df.columns.str.lower()

In [15]:
# Make sure they have been deleted
# and that the names are now lowercase
df.columns

Index(['burrito', 'date', 'chips', 'cost', 'hunger', 'tortilla', 'meat',
       'fillings', 'meat:filling', 'salsa', 'beef', 'pico', 'guac', 'cheese',
       'fries', 'sour cream', 'pork', 'chicken', 'shrimp', 'fish', 'rice',
       'beans', 'lettuce', 'tomato', 'bell peper', 'carrots', 'cabbage',
       'sauce', 'salsa.1', 'cilantro', 'onion', 'taquito', 'pineapple', 'ham',
       'chile relleno', 'nopales', 'lobster', 'queso', 'egg', 'mushroom',
       'bacon', 'sushi', 'avocado', 'corn', 'zucchini', 'great', 'year'],
      dtype='object')

In [16]:
# Look at the head
df.head()

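|   | burrito | date | chips | cost | hunger | tortilla | meat | fillings | meat:filling | salsa | ... | queso | egg | mushroom | bacon | sushi | avocado | corn | zucchini | great | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 2016-01-18 | 0 | 6.49 | 3.0 | 3.0 | 3.0 | 3.5 | 4.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 1 | California | 2016-01-24 | 0 | 5.45 | 3.5 | 2.0 | 2.5 | 2.5 | 2.0 | 3.5 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 2 | Carnitas | 2016-01-24 | 0 | 4.85 | 1.5 | 3.0 | 2.5 | 3.0 | 4.5 | 3.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 3 | Asada | 2016-01-24 | 0 | 5.25 | 2.0 | 3.0 | 3.5 | 3.0 | 4.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | 2016 |
| 4 | California | 2016-01-27 | x | 6.59 | 4.0 | 4.0 | 4.0 | 3.5 | 4.5 | 2.5 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | 2016 |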


In [30]:
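# I spotted an 'x' in chips; the helper below treats it as
# "has chips" (1) and anything else as 0. Are there x's anywhere
# else in the dataframe? Let's iterate over the toppings and make
# sure things are binary.
# NOTE: 'meat', 'fillings', and 'meat:filling' hold numeric ratings,
# not 'x' markers, so this pass sets them all to 0.

# Borrowed this function from somewhere else.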
def tobinary(item):
    return 1 if item in ('x', 'X') else 0


toppings = ['meat', 'fillings', 'meat:filling', 'beef', 'pico', 
            'guac', 'cheese', 'fries', 'sour cream', 'pork', 
            'chicken', 'shrimp', 'fish', 'rice', 'beans', 'lettuce', 
            'tomato', 'bell peper', 'carrots', 'cabbage', 'sauce', 
            'salsa.1', 'cilantro', 'onion', 'taquito', 'pineapple',
            'ham', 'chile relleno', 'nopales', 'lobster', 'queso', 
            'egg', 'mushroom', 'bacon', 'sushi', 'avocado', 
            'corn', 'zucchini', 'chips']
         
for col in toppings:
    df[col] = df[col].apply(tobinary)

# Get "great" into binary as well
df['great'] = df['great'].apply(int)

In [31]:
df.dtypes

burrito                  object
date             datetime64[ns]
chips                     int64
cost                    float64
hunger                  float64
tortilla                float64
meat                      int64
fillings                  int64
meat:filling              int64
salsa                   float64
beef                      int64
pico                      int64
guac                      int64
cheese                    int64
fries                     int64
sour cream                int64
pork                      int64
chicken                   int64
shrimp                    int64
fish                      int64
rice                      int64
beans                     int64
lettuce                   int64
tomato                    int64
bell peper                int64
carrots                   int64
cabbage                   int64
sauce                     int64
salsa.1                   int64
cilantro                  int64
onion                     int64
taquito 
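The question "are there x's anywhere else?" can also be answered directly instead of listing the columns by hand. The following is a minimal sketch on invented data (the `toy` frame is hypothetical): it finds every column that contains an `'x'` or `'X'` marker and converts only those columns to 0/1, leaving numeric columns untouched.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'chips': [0, 'x', 0],
    'beef': ['x', 'X', 0],
    'cost': [6.49, 5.45, 4.85],
})

# True for each column that holds at least one 'x' or 'X'
has_marker = toy.isin(['x', 'X']).any()
marker_cols = has_marker[has_marker].index.tolist()
print(marker_cols)

# Vectorized conversion: marker -> 1, everything else -> 0
toy[marker_cols] = toy[marker_cols].isin(['x', 'X']).astype(int)
print(toy)
```

Because only the detected columns are converted, a numeric column such as `cost` keeps its original values.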

***I think we are ready to move to objective 1.*** 

## Objective 1

***Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.***

In [34]:
# Let's drop the 'date' column; we only need the year we created earlier.
df = df.drop('date', axis=1)

In [38]:
# Main split
train = df[df['year'] <= 2016]
validate = df[df['year'] == 2017]
test = df[df['year'] >= 2018]

# Target
target_train = train['great']
target_validate = validate['great']
target_test = test['great']

# Features: each split minus the target
test_train = train.drop(columns=['great'])
test_validate = validate.drop(columns=['great'])
test_test = test.drop(columns=['great'])

train.head()

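|   | burrito | chips | cost | hunger | tortilla | meat | fillings | meat:filling | salsa | beef | ... | queso | egg | mushroom | bacon | sushi | avocado | corn | zucchini | great | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 0 | 6.49 | 3.0 | 3.0 | 0 | 0 | 0 | 4.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2016 |
| 1 | California | 0 | 5.45 | 3.5 | 2.0 | 0 | 0 | 0 | 3.5 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2016 |
| 2 | Carnitas | 0 | 4.85 | 1.5 | 3.0 | 0 | 0 | 0 | 3.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2016 |
| 3 | Asada | 0 | 5.25 | 2.0 | 3.0 | 0 | 0 | 0 | 4.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2016 |
| 4 | California | 1 | 6.59 | 4.0 | 4.0 | 0 | 0 | 0 | 2.5 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2016 |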
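The year-based split can also be wrapped in a small helper so that each split's features and target stay paired. This is only a sketch under assumptions: `split_by_year` is a hypothetical name, and the `toy` frame is invented so the example runs on its own.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'year': [2015, 2016, 2017, 2017, 2018, 2019],
    'cost': [5.0, 6.5, 7.0, 6.0, 8.0, 7.5],
    'great': [0, 1, 1, 0, 1, 0],
})

def split_by_year(df, target='great'):
    """Return a dict of (features, target) pairs keyed by split name."""
    masks = {
        'train': df['year'] <= 2016,
        'val': df['year'] == 2017,
        'test': df['year'] >= 2018,
    }
    return {
        name: (df.loc[mask].drop(columns=[target]), df.loc[mask, target])
        for name, mask in masks.items()
    }

splits = split_by_year(toy)
X_train, y_train = splits['train']

# Sanity check: every row lands in exactly one split
total_rows = sum(len(X) for X, _ in splits.values())
print(total_rows == len(toy))
```

The three masks cover every integer year without overlap, which is what keeps later reviews out of the training data.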


## Objective 2

***Begin with baselines for classification.***
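A common baseline for classification is to always predict the majority class from the training set and see how often that is right on the validation set. The sketch below uses made-up labels (`y_train` and `y_val` here are invented, not the burrito targets), so the printed number says nothing about the real data.

```python
import pandas as pd

# Toy labels, invented for illustration only
y_train = pd.Series([0, 0, 1, 0, 1])
y_val = pd.Series([1, 0, 0, 1])

# Most frequent class in the training labels
majority_class = y_train.mode()[0]

# Accuracy of always guessing that class on the validation labels
baseline_acc = (y_val == majority_class).mean()
print(f'Majority class: {majority_class}, baseline accuracy: {baseline_acc:.2f}')
```

Any model worth keeping should beat this number on the same validation set.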

## Objective 3

***Use scikit-learn for logistic regression.***
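As a rough sketch of this step, a scikit-learn pipeline can scale the features and fit `LogisticRegression` in one object. The data below is synthetic (generated with NumPy, not the burrito reviews), and the real frame would first need its string `burrito` column encoded, for example with one-hot encoding.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

# Scale, then fit a logistic regression with default settings
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
coefs = model.named_steps['logisticregression'].coef_[0]
print(coefs)
```

The coefficients are on the scaled features, so their signs and relative sizes can be compared across columns.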

## Objective 4

***Get your model's validation accuracy. (Multiple times if you try multiple iterations.)***
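Validation accuracy can be computed once per iteration, which is what makes it safe to compare settings. The sketch below tries a few values of the regularization strength `C` on synthetic data (invented here, not the burrito reviews) and records the validation accuracy for each; the choice of `C` values is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# One validation score per candidate setting
results = {}
for C in (0.1, 1.0, 10.0):
    model = LogisticRegression(C=C).fit(X_train, y_train)
    results[C] = accuracy_score(y_val, model.predict(X_val))
    print(f'C={C}: validation accuracy {results[C]:.3f}')

best_C = max(results, key=results.get)
```

Only the validation set is used for this comparison; the test set stays untouched.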

## Objective 5

***Get your model's test accuracy. (One time, at the end.)***
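The test score is meant to be computed a single time, after the model and its settings are final. A minimal sketch, again on synthetic data rather than the burrito reviews, with `final_model` as a hypothetical name for the chosen model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]  # held out until this point

# Fit the chosen model, then score it on the test set exactly once
final_model = LogisticRegression().fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f'Test accuracy: {test_acc:.3f}')
```

For classifiers, `score` returns mean accuracy, so this is directly comparable to the baseline and validation numbers.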