Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
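
As a rough illustration of the feature scaling and pipeline stretch goals, the sketch below chains mean imputation, standard scaling, and logistic regression with scikit-learn's `make_pipeline`. The data is synthetic and stands in for the burrito features; the names `X_demo`, `y_demo`, `pipe`, and `val_acc`, the 150/50 split, and the 10% missing rate are all assumptions made for this example only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the burrito dataset): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0

# Blank out roughly 10% of the values to mimic missing ratings.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Ordered split: first 150 rows for training, last 50 for validation.
X_tr, X_va = X_demo[:150], X_demo[150:]
y_tr, y_va = y_demo[:150], y_demo[150:]

# Each step is fit on the training rows only, then reused on validation.
pipe = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(X_tr, y_tr)

val_acc = pipe.score(X_va, y_va)
print(f"Validation accuracy: {val_acc:.2f}")
```

Because the imputer and scaler live inside the pipeline, their statistics come from the training rows alone, which avoids leaking validation information into preprocessing.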

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [280]:
print(df.shape)
df.head(2)

(421, 67)


Unnamed: 0,Location,Burrito,Date,Neighborhood,Address,URL,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,overall,Rec,Reviewer,Notes,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,Donato's taco shop,California,1/18/2016,Miramar,6780 Miramar Rd,http://donatostacoshop.net/,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,3.8,,Scott,good fries: 4/5,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,Oscar's Mexican food,California,1/24/2016,San Marcos,225 S Rancho Santa Fe Rd,http://www.yelp.com/biz/oscars-mexican-food-sa...,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,3.0,,Scott,Fries: 3/5; too little meat,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False


In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [283]:
df['Date'].dtype

dtype('O')
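
The `dtype('O')` result means the dates are still stored as strings. A minimal sketch of parsing them with an explicit format is shown here; it assumes every date follows the month/day/year layout seen in the first rows (for example `1/18/2016`), which has not been verified for the whole file. The series `dates` is toy data, and its third value is made up.

```python
import pandas as pd

# Toy dates in the same month/day/year layout as the rows shown earlier.
dates = pd.Series(['1/18/2016', '1/24/2016', '5/3/2017'])

# An explicit format makes the parse unambiguous (month first, then day).
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed.dt.year.tolist())   # [2016, 2016, 2017]
print(parsed.dt.month.tolist())  # [1, 1, 5]
```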

In [0]:
# Convert the Date column from strings to datetime objects.
# (infer_datetime_format is deprecated in pandas 2.x, so it is omitted here.)
df['Date'] = pd.to_datetime(df['Date'])

In [0]:
# Split the data into train, validation, and test sets by review year:
# train on 2016 and earlier, validate on 2017, test on 2018 and later.
train = df[df['Date'].dt.year <= 2016]
val = df[df['Date'].dt.year == 2017]
test = df[df['Date'].dt.year >= 2018]

In [286]:
# show the shapes
print(f"Train shape {train.shape}")
print(f"Validate shape {val.shape}")
print(f"Test shape {test.shape}")

Train shape (298, 59)
Validate shape (85, 59)
Test shape (38, 59)
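
The three row counts above add up to 298 + 85 + 38 = 421, which matches the 421 rows reported by `df.shape` earlier, so no review fell outside the three year ranges. The sketch below shows the same kind of sanity check on a toy frame; the rows and the names `toy`, `toy_train`, `toy_val`, and `toy_test` are made up for illustration.

```python
import pandas as pd

# Toy rows (made up for illustration), two per period.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-08-15', '2018-02-02', '2019-03-03']),
    'Great': [True, False, True, False, True, False],
})

year = toy['Date'].dt.year
toy_train = toy[year <= 2016]
toy_val = toy[year == 2017]
toy_test = toy[year >= 2018]

# Every row should land in exactly one split.
total = len(toy_train) + len(toy_val) + len(toy_test)
print(len(toy_train), len(toy_val), len(toy_test), total)  # 2 2 2 6

# The splits should be ordered in time, with no overlap between them.
ordered = toy_train['Date'].max() < toy_val['Date'].min() <= toy_val['Date'].max() < toy_test['Date'].min()
print(ordered)  # True
```

A row whose date failed to parse would match none of the three conditions, so comparing the total against the original row count is a cheap way to catch silently dropped rows.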


In [0]:
# Separate the target from the train, validation, and test sets
y_train = train['Great']
y_val = val['Great']
y_test = test['Great']

# Drop the target column from the feature DataFrames
train = train.drop('Great', axis=1)
val = val.drop('Great', axis=1)
test = test.drop('Great', axis=1)


In [288]:
# New shape of the train, val and the test
print(f"Train: {train.shape}")
print(f"Val:  {val.shape}")
print(f"Test:  {test.shape}")

Train: (298, 58)
Val:  (85, 58)
Test:  (38, 58)


In [289]:
# Baseline: find the majority class of the "Great" target

y_train.value_counts(normalize=True)
# False is the majority class, at about 59 percent of the training rows

False    0.590604
True     0.409396
Name: Great, dtype: float64
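
The majority share is also the accuracy obtained by always predicting the majority class on the training set. Since the class balance can differ from year to year, the same constant prediction can be scored on the validation targets as well. The sketch below shows this by hand on made-up targets; `y_tr` and `y_va` are toy stand-ins for the notebook's `y_train` and `y_val`.

```python
import pandas as pd

# Toy targets (made up); the notebook's real targets are y_train and y_val.
y_tr = pd.Series([False, False, False, True, True])
y_va = pd.Series([True, False, True, True])

# The majority class is learned from the training targets only.
majority = y_tr.mode()[0]

# Accuracy of predicting that one class for every row.
train_baseline = (y_tr == majority).mean()
val_baseline = (y_va == majority).mean()

print(majority)        # False
print(train_baseline)  # 0.6
print(val_baseline)    # 0.25
```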

In [0]:
# importing some of the sklearn modules
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.linear_model import LogisticRegressionCV, LinearRegression
from sklearn.dummy import DummyClassifier

In [291]:
# Compare three DummyClassifier baselines on the training set:
# 'prior' and 'most_frequent' always predict the majority class, while
# 'stratified' guesses at random in proportion to the class frequencies
t = 'prior'
u = 'stratified'
v = 'most_frequent'

model = DummyClassifier(strategy=t, random_state=42)
model.fit(train, y_train)
y_pred = model.predict(train)
acc = accuracy_score(y_train, y_pred)

print(f"Baseline for the {t} is {acc:.2f}")

model = DummyClassifier(strategy=u, random_state=42)
model.fit(train, y_train)
y_pred = model.predict(train)
acc = accuracy_score(y_train, y_pred)

print(f"Baseline for the {u} is {acc:.2f}")

model = DummyClassifier(strategy=v, random_state=42)
model.fit(train, y_train)
y_pred = model.predict(train)
acc = accuracy_score(y_train, y_pred)

print(f"Baseline for the {v} is {acc:.2f}")


# So the majority-class baseline is about 59 percent accuracy

Baseline for the prior is 0.59
Baseline for the stratified is 0.45
Baseline for the most_frequent is 0.59
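
The stratified score of 0.45 comes from a single random draw of predictions. Its expected accuracy can be worked out from the class shares printed earlier: a random guess is correct when a `False` row receives a `False` guess or a `True` row receives a `True` guess. The short calculation below uses the two shares from the `value_counts` output; the variable names are only for this example.

```python
# Class shares taken from the value_counts output earlier in the notebook.
p_false = 0.590604
p_true = 0.409396

# Stratified guessing draws each prediction independently of the true label,
# so the chance of a match is P(False)^2 + P(True)^2.
expected_stratified = p_false ** 2 + p_true ** 2

# Always predicting the majority class is right whenever the row is False.
expected_majority = max(p_false, p_true)

print(f"{expected_stratified:.3f}")  # 0.516
print(f"{expected_majority:.3f}")    # 0.591
```

On average, then, stratified guessing scores below the majority-class strategy, which is why the majority-class figure is the more useful bar to beat.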


In [292]:
# Count the null values per column to help choose which features to keep

print(train.shape)
train.isnull().sum()


(298, 58)


Burrito             0
Date                0
Yelp              227
Google            227
Chips             276
Cost                6
Hunger              1
Mass (g)          298
Density (g/mL)    298
Length            123
Circum            124
Volume            124
Tortilla            0
Temp               15
Meat               10
Fillings            1
Meat:filling        6
Uniformity          2
Salsa              20
Synergy             2
Wrap                2
Unreliable        271
NonSD             293
Beef              130
Pico              155
Guac              159
Cheese            149
Fries             179
Sour cream        213
Pork              255
Chicken           278
Shrimp            278
Fish              293
Rice              265
Beans             266
Lettuce           287
Tomato            291
Bell peper        291
Carrots           297
Cabbage           291
Sauce             261
Salsa.1           292
Cilantro          283
Onion             281
Taquito           294
Pineapple 
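
Null counts like these can be turned into a candidate drop list by looking at the share of missing values per column. The sketch below does this on a toy frame; the 0.5 cutoff and the names `toy`, `missing_share`, and `mostly_missing` are assumptions for illustration, not the rule used in this notebook. Note that for the `x`-marked ingredient columns a null appears to mean "absent" rather than "unknown", so a cutoff like this is better suited to the numeric columns.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): one complete column, one sparse, one fully empty.
toy = pd.DataFrame({
    'Tortilla': [3.0, 2.0, 4.0, 3.5],
    'Yelp': [3.5, np.nan, np.nan, np.nan],
    'Queso': [np.nan] * 4,
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = toy.isnull().mean()

threshold = 0.5  # hypothetical cutoff, chosen only for this example
mostly_missing = missing_share[missing_share > threshold].index.tolist()

print(missing_share.to_dict())  # {'Tortilla': 0.0, 'Yelp': 0.75, 'Queso': 1.0}
print(mostly_missing)           # ['Yelp', 'Queso']
```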

In [293]:
train['Tortilla'].value_counts()

4.00    92
3.00    72
3.50    45
2.00    21
4.50    19
5.00    15
2.50    15
1.50     6
3.80     5
3.60     2
1.40     1
2.80     1
3.20     1
2.10     1
4.80     1
3.75     1
Name: Tortilla, dtype: int64

In [294]:
train['Unreliable'].value_counts(dropna=False)

NaN    271
x       27
Name: Unreliable, dtype: int64
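
Columns like this one hold either an `x` or nothing, so they behave as yes/no flags. One possible alternative to filling in a placeholder string and one-hot encoding later is to map them straight to 0/1 integers. The sketch below assumes that any value equal to `x` after trimming and lowercasing means "present" and everything else means "absent"; the frame `toy` and its messy values are made up.

```python
import numpy as np
import pandas as pd

# Toy marker columns (made up), including messy variants of 'x'.
toy = pd.DataFrame({
    'Beef': ['x', np.nan, 'X', 'x '],
    'Pico': [np.nan, 'x', np.nan, np.nan],
})

# Normalize each value, then flag rows whose marker is 'x'.
# isin() returns False for missing values, so NaN becomes 0.
flags = toy.apply(
    lambda col: col.str.strip().str.lower().isin(['x'])
).astype(int)

print(flags['Beef'].tolist())  # [1, 0, 1, 1]
print(flags['Pico'].tolist())  # [0, 1, 0, 0]
```

With this encoding each flag stays a single numeric column, whereas one-hot encoding an `x`/`n` string column would produce two columns per flag.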

In [295]:
train['Length'].value_counts(dropna=False)

NaN      123
20.00     20
19.00     19
18.50     16
18.00     16
19.50     13
20.50     13
22.00     10
23.00      9
22.50      9
17.00      9
21.00      8
17.50      7
21.50      7
16.50      5
23.50      3
25.50      2
17.78      1
20.75      1
15.00      1
16.00      1
25.00      1
15.50      1
26.00      1
24.00      1
17.70      1
Name: Length, dtype: int64

In [296]:
train['Meat:filling'].value_counts(dropna=False)

4.00    83
3.00    39
3.50    32
5.00    32
4.50    30
2.00    21
2.50    18
1.50    10
1.00     9
NaN      6
4.70     3
3.75     3
3.78     1
2.90     1
3.40     1
3.60     1
0.50     1
4.80     1
4.20     1
2.80     1
3.20     1
1.40     1
3.80     1
3.70     1
Name: Meat:filling, dtype: int64

In [297]:
train.head(2)

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,California,2016-01-24,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [0]:
# These are the columns where NaN will be changed to 'n' so that
# they can then be treated as categorical
theBinaryCol = ['Chips','Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

In [299]:
# Queso is missing in every training row, so it will be dropped
train['Queso'].value_counts(dropna=False)

NaN    298
Name: Queso, dtype: int64

In [300]:
# Looking at the binary features to see the dtypes
train[theBinaryCol].dtypes

Chips            object
Unreliable       object
NonSD            object
Beef             object
Pico             object
Guac             object
Cheese           object
Fries            object
Sour cream       object
Pork             object
Chicken          object
Shrimp           object
Fish             object
Rice             object
Beans            object
Lettuce          object
Tomato           object
Bell peper       object
Carrots          object
Cabbage          object
Sauce            object
Salsa.1          object
Cilantro         object
Onion            object
Taquito          object
Pineapple        object
Ham              object
Chile relleno    object
Nopales          object
Lobster          object
Egg              object
Mushroom         object
Bacon            object
Sushi            object
Avocado          object
Corn             object
Zucchini         object
dtype: object

In [0]:

# These are the columns that we will drop
dropCols = ['Yelp', 'Google', 'Mass (g)', 'Density (g/mL)',
            'Length', 'Circum', 'Volume', 'Queso']


In [0]:
# Helper functions for the binary columns:
# replace NaN with 'n', then normalize every value to a
# lowercase string with no surrounding whitespace

# Helper used by removeNAN
def fixString(theString):
  aStr = str(theString)
  aStr = aStr.lower()
  aStr = aStr.strip()
  return aStr

def removeNAN(dataFrame):
  theCopy = dataFrame.copy()
  for col in theBinaryCol:
    theCopy[col] = theCopy[col].fillna(value='n')
    # Normalize every value with fixString
    theCopy[col] = theCopy[col].apply(fixString)
  return theCopy

In [0]:
# Function to clean the data.
# Some NaN values will remain and still need to be imputed,
# and the categorical features can then be one-hot encoded.
def prepareData(dataFrame):
  dataFrame = removeNAN(dataFrame)
  dataFrame = dataFrame.drop(labels=dropCols, axis=1)
  return dataFrame

In [0]:
# Prepare the train, validation, and test sets
X_train = prepareData(train)
X_val = prepareData(val)
X_test = prepareData(test)

In [305]:
print(train.shape)
train.head(1)

(298, 58)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini
0,California,2016-01-18,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [0]:
# importing the category encoder to do one hot encoding
import category_encoders as ce
from sklearn.impute import SimpleImputer

In [310]:
# One-hot encode: fit on train only, then apply the same encoding to val
encoder = ce.OneHotEncoder(use_cat_names=True)
x_train_encoded = encoder.fit_transform(X_train)
x_val_encoded = encoder.transform(X_val)
print(x_train_encoded.shape, x_val_encoded.shape)

(298, 91) (85, 91)
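
Both encoded frames have 91 columns because the encoder was fit on the training data and then reused on validation. The sketch below illustrates the same idea with scikit-learn's `OneHotEncoder`, swapped in here for `category_encoders` so the example depends only on scikit-learn. The toy frames are made up, and `handle_unknown='ignore'` is one possible choice for categories that appear only in validation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): 'Carnitas' appears in validation but not in training.
toy_train = pd.DataFrame({'Burrito': ['California', 'Asada', 'Other', 'California']})
toy_val = pd.DataFrame({'Burrito': ['Carnitas', 'Asada']})

# Fit on train only; unseen validation categories become all-zero rows.
enc = OneHotEncoder(handle_unknown='ignore')
train_enc = enc.fit_transform(toy_train).toarray()
val_enc = enc.transform(toy_val).toarray()

print(train_enc.shape, val_enc.shape)  # (4, 3) (2, 3)
print(val_enc[0].tolist())             # [0.0, 0.0, 0.0]
print(val_enc[1].tolist())             # [1.0, 0.0, 0.0]
```

Fitting a second encoder on the validation frame would instead produce a different set of columns, which a model trained on the first encoding could not use.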


In [0]:
# Will impute some of the missing data
imputer = SimpleImputer(strategy='mean')
x