<a href="https://colab.research.google.com/github/joeyMckinney/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/josiah_mckinney_LS_DS_214_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'
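All four masks above are computed before any assignment, so a name that matches more than one keyword ends up with whichever label is assigned last. As a minimal sketch of an alternative, `numpy.select` gives priority to the first matching condition instead. The sample names below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical burrito names (illustration only, not from the dataset)
names = pd.Series(['California', 'carne asada', 'Surf and Turf',
                   'carnitas', 'california asada', 'veggie'])
lower = names.str.lower()

# Conditions are checked in order; the first one that matches wins
conditions = [
    lower.str.contains('california'),
    lower.str.contains('asada'),
    lower.str.contains('surf'),
    lower.str.contains('carnitas'),
]
labels = ['California', 'Asada', 'Surf & Turf', 'Carnitas']

# Anything matching none of the conditions falls back to 'Other'
cleaned = pd.Series(np.select(conditions, labels, default='Other'),
                    index=names.index)
print(cleaned.tolist())
```

With this ordering, the hypothetical 'california asada' is labeled 'California'; reordering `conditions` and `labels` changes which keyword takes priority.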

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# Wrangle Data

In [7]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [8]:
df.shape

(421, 59)

In [9]:
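# Keep the burrito type, the date, the rating dimensions, and the target.
# .copy() makes df2 an independent DataFrame, which avoids SettingWithCopyWarning
# when columns are reassigned below.
df2 = df[['Burrito', 'Date', 'Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings',
          'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'Great']].copy()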

In [10]:
df2.head()

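| | Burrito | Date | Hunger | Tortilla | Temp | Meat | Fillings | Meat:filling | Uniformity | Salsa | Synergy | Wrap | Great |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 1/18/2016 | 3.0 | 3.0 | 5.0 | 3.0 | 3.5 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | False |
| 1 | California | 1/24/2016 | 3.5 | 2.0 | 3.5 | 2.5 | 2.5 | 2.0 | 4.0 | 3.5 | 2.5 | 5.0 | False |
| 2 | Carnitas | 1/24/2016 | 1.5 | 3.0 | 2.0 | 2.5 | 3.0 | 4.5 | 4.0 | 3.0 | 3.0 | 5.0 | False |
| 3 | Asada | 1/24/2016 | 2.0 | 3.0 | 2.0 | 3.5 | 3.0 | 4.0 | 5.0 | 4.0 | 4.0 | 5.0 | False |
| 4 | California | 1/27/2016 | 4.0 | 4.0 | 5.0 | 4.0 | 3.5 | 4.5 | 5.0 | 2.5 | 4.5 | 4.0 | True |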


In [11]:
# Convert the Date column from strings to datetime objects
df2['Date'] = pd.to_datetime(df2['Date'])



In [12]:
df2.isnull().sum()

Burrito          0
Date             0
Hunger           3
Tortilla         0
Temp            20
Meat            14
Fillings         3
Meat:filling     9
Uniformity       2
Salsa           25
Synergy          2
Wrap             3
Great            0
dtype: int64

In [13]:
# Fill missing ratings with each column's mean.
# Note: the means are computed over all rows, including the reviews
# that later become the validation and test sets.
rating_cols = ['Hunger', 'Temp', 'Meat', 'Fillings', 'Meat:filling',
               'Uniformity', 'Salsa', 'Synergy', 'Wrap']
df2[rating_cols] = df2[rating_cols].fillna(df2[rating_cols].mean())


In [14]:
df2.isnull().sum()

Burrito         0
Date            0
Hunger          0
Tortilla        0
Temp            0
Meat            0
Fillings        0
Meat:filling    0
Uniformity      0
Salsa           0
Synergy         0
Wrap            0
Great           0
dtype: int64
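The fill step above uses means taken over every row, so reviews that later land in the validation and test sets influence the values imputed into the training data. A minimal sketch of one alternative, assuming the data has already been split: learn the means from the training rows only and reuse them for the other splits. The two small frames are hypothetical stand-ins, not the burrito data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the train and validation feature frames
train = pd.DataFrame({'Temp': [4.0, np.nan, 2.0], 'Salsa': [3.0, 5.0, np.nan]})
val = pd.DataFrame({'Temp': [np.nan, 5.0], 'Salsa': [1.0, np.nan]})

# Column means learned from the training rows only (NaN is skipped by default)
train_means = train.mean()

# The same training means fill the gaps in every split
train_filled = train.fillna(train_means)
val_filled = val.fillna(train_means)

print(train_means)
print(val_filled)
```

The same idea can be placed inside a scikit-learn pipeline with an imputer step, so the means are refit whenever the pipeline is fit.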

# Splitting data

In [15]:
df.head()

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [16]:
df2.shape

(421, 13)

In [17]:
y = df2[['Great', 'Date']]
X = df2[['Burrito', 'Date', 'Hunger', 'Tortilla', 'Temp', 'Meat', 'Fillings',
         'Meat:filling', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']]

In [52]:
# Split the data into three groups: train (2016 and earlier),
# validation (2017), and test (2018 and later)
cutoff = pd.Timestamp('2017-01-01')
cutoff2 = pd.Timestamp('2018-01-01')
mask = df2.Date < cutoff
# >= keeps reviews dated exactly on January 1 in the later group
mask2 = (df2.Date >= cutoff) & (df2.Date < cutoff2)
mask3 = df2.Date >= cutoff2
X_train, y_train = X.loc[mask], y.loc[mask]
X_val, y_val = X.loc[mask2], y.loc[mask2]
X_test, y_test = X.loc[mask3], y.loc[mask3]
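A time-based split is only safe if every review lands in exactly one group. One way to make that easy to check is to split on calendar year, which mirrors the assignment wording (2016 and earlier, 2017, 2018 and later). This is a sketch with made-up dates, including two that fall exactly on January 1.

```python
import pandas as pd

# Hypothetical review dates, two of them exactly on a year boundary
dates = pd.to_datetime(pd.Series(
    ['2016-12-31', '2017-01-01', '2017-06-15', '2018-01-01', '2019-03-02']))
year = dates.dt.year

train_mask = year <= 2016
val_mask = year == 2017
test_mask = year >= 2018

# Each row should belong to exactly one of the three groups
membership = (train_mask.astype(int) + val_mask.astype(int)
              + test_mask.astype(int))
all_rows_assigned_once = bool(membership.eq(1).all())

print(train_mask.sum(), val_mask.sum(), test_mask.sum(), all_rows_assigned_once)
```

On the real data, the equivalent check is that the three split sizes add up to the number of rows in `df2`.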


In [53]:
X_train.shape

(298, 12)

In [54]:
X_val.shape

(85, 12)

In [55]:
X_test.shape

(38, 12)

In [56]:
X_val.head()

| | Burrito | Date | Hunger | Tortilla | Temp | Meat | Fillings | Meat:filling | Uniformity | Salsa | Synergy | Wrap |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 301 | California | 2017-01-04 | 3.495335 | 4.0 | 4.5 | 4.0 | 3.5 | 3.5 | 5.0 | 1.5 | 3.5 | 4.5 |
| 302 | Other | 2017-01-04 | 3.495335 | 4.0 | 2.0 | 3.620393 | 4.0 | 3.586481 | 4.6 | 4.2 | 3.75 | 5.0 |
| 303 | Other | 2017-01-07 | 3.9 | 3.0 | 4.5 | 4.1 | 3.0 | 3.7 | 4.0 | 4.3 | 4.2 | 5.0 |
| 304 | Other | 2017-01-07 | 4.0 | 3.5 | 4.0 | 4.0 | 3.0 | 4.0 | 4.5 | 4.0 | 3.8 | 4.8 |
| 305 | Other | 2017-01-10 | 3.5 | 2.5 | 4.5 | 3.0 | 2.5 | 3.0 | 3.0 | 2.0 | 2.0 | 4.0 |


# Baseline

In [57]:
# Baseline accuracy: the share of the majority class in the training set
baseline = y_train['Great'].value_counts(normalize=True).max()

In [58]:
baseline

0.5906040268456376
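The figure above is the majority-class share within the training set. To compare it fairly against a validation score, it can help to ask how often the training majority class occurs in the validation labels. A minimal sketch, using short made-up label series in place of `y_train['Great']` and `y_val['Great']`:

```python
import pandas as pd

# Hypothetical labels standing in for the real training and validation targets
train_labels = pd.Series([False, False, False, True, True])
val_labels = pd.Series([True, False, True, True])

# The majority class is learned from the training labels only
majority_class = train_labels.mode()[0]

# Accuracy of always predicting that class, on each split
train_baseline = (train_labels == majority_class).mean()
val_baseline = (val_labels == majority_class).mean()

print(majority_class, train_baseline, val_baseline)
```

In this toy case the baseline looks acceptable on the training labels but poor on validation, which is the kind of shift a single training-set number can hide.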

# Making the model

In [59]:
from sklearn.linear_model import LogisticRegression
from category_encoders import OneHotEncoder
from sklearn.pipeline import Pipeline

In [60]:
# Drop the Date column from the features and keep the target as a 1-D Series,
# which is the shape scikit-learn expects for y
X_train = X_train.drop(columns=['Date'])
y_train = y_train['Great']

In [72]:
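# One-hot encode the burrito type, then fit a logistic regression classifier.
# cols must be passed by keyword: the first positional argument of
# category_encoders' OneHotEncoder is verbose, not cols.
model = Pipeline([
    ('one_hot', OneHotEncoder(cols=['Burrito'], use_cat_names=True)),
    ('classifier', LogisticRegression())
])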

model.fit(X_train, y_train);

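Two of the stretch goals, feature scaling and inspecting coefficients, fit naturally into a pipeline like the one above. The following is a sketch on synthetic data rather than the burrito reviews: the column names are borrowed from the dataset, but the values are random and the target is constructed to depend on `Synergy` alone. The step names `scaler` and `classifier` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic ratings on a 1-5 scale (not real data)
X_demo = pd.DataFrame({
    'Synergy': rng.uniform(1, 5, 200),
    'Wrap': rng.uniform(1, 5, 200),
})
# By construction, the target depends on Synergy plus a little noise
y_demo = (X_demo['Synergy'] + 0.1 * rng.normal(size=200)) > 3

demo_model = Pipeline([
    ('scaler', StandardScaler()),           # put features on a common scale
    ('classifier', LogisticRegression()),
])
demo_model.fit(X_demo, y_demo)

# Pull the fitted coefficients out of the final pipeline step
coefficients = pd.Series(
    demo_model.named_steps['classifier'].coef_[0],
    index=X_demo.columns,
).sort_values()
print(coefficients)
```

Because the features are standardized first, coefficient magnitudes are roughly comparable across columns. Calling `coefficients.plot.barh()` would draw them as a horizontal bar chart, assuming matplotlib is installed.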


# Checking metrics

In [74]:
#baseline accuracy score
baseline

0.5906040268456376

In [63]:
# Training accuracy score
model.score(X_train, y_train)

0.8825503355704698

In [69]:
# Drop the Date column from the validation features and keep the target as a 1-D Series
X_val = X_val.drop(columns=['Date'])
y_val = y_val['Great']

In [70]:
#validation accuracy score
model.score(X_val, y_val)

0.8705882352941177

In [71]:
# Drop the Date column from the test features and keep the target as a 1-D Series
X_test = X_test.drop(columns=['Date'])
y_test = y_test['Great']

In [73]:
#test data accuracy score
model.score(X_test, y_test)

0.7105263157894737
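The test score is noticeably lower than the validation score, but the test set has only 38 rows, so a single accuracy number carries a lot of uncertainty. As a rough sketch, the code below converts each reported accuracy back into a count of correct predictions and attaches a normal-approximation standard error, sqrt(p(1 - p) / n). That formula assumes the reviews are independent, which is a simplification here.

```python
import math

# (accuracy, number of rows) copied from the outputs above
scores = {
    'train': (0.8825503355704698, 298),
    'validation': (0.8705882352941177, 85),
    'test': (0.7105263157894737, 38),
}

summary = {}
for name, (accuracy, n_rows) in scores.items():
    n_correct = round(accuracy * n_rows)
    # Normal-approximation standard error of a proportion
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_rows)
    summary[name] = (n_correct, std_error)
    print(f'{name}: {n_correct}/{n_rows} correct, '
          f'accuracy {accuracy:.3f} +/- {std_error:.3f}')
```

The standard error on the test set comes out several times larger than on the training set, so part of the gap between validation and test accuracy may simply reflect the small sample.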