
Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv',
                 parse_dates=['Date'],
                 index_col='Date')

In [3]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [4]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [5]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [6]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

# Drop every column that contains at least one NaN value
df = df.dropna(axis=1)

In [7]:
df.head()




Date,Burrito,Tortilla,Great
2016-01-18,California,3.0,False
2016-01-24,California,2.0,False
2016-01-24,Carnitas,3.0,False
2016-01-24,Asada,3.0,False
2016-01-27,California,4.0,True


# Split

In [8]:
# Split our target vector from our feature matrix
target = 'Great'
y = df[target]
X = df.drop(columns=target)

In [9]:
# Split our data by date: train on 2016 and earlier,
# validate on 2017, test on 2018 and later

cutoff1 = '2016-12-31'
cutoff2 = '2017-12-31'


mask1 = X.index <= cutoff1
# Strict inequality on the lower bound so 2016-12-31 is not in both train and validation
mask2 = (X.index > cutoff1) & (X.index <= cutoff2)
mask3 = X.index > cutoff2

X_train, y_train = X.loc[mask1], y.loc[mask1]
X_val, y_val = X.loc[mask2], y.loc[mask2]
X_test, y_test = X.loc[mask3], y.loc[mask3]
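
As a rough check on the split logic, here is a minimal sketch of a year-based split on a hypothetical toy frame (the `toy` data and its dates are invented for illustration, not taken from the burrito dataset). Comparing on `index.year` is one way to get masks that cannot overlap at the cutoff dates.

```python
import pandas as pd

# Hypothetical toy frame with a DatetimeIndex, standing in for the burrito data
dates = pd.to_datetime(['2016-05-01', '2016-12-31', '2017-01-01',
                        '2017-12-31', '2018-01-01', '2019-06-15'])
toy = pd.DataFrame({'Tortilla': [3.0, 4.0, 2.5, 3.5, 4.5, 3.0]}, index=dates)

# Comparing on the year yields three disjoint masks that cover every row
train_mask = toy.index.year <= 2016
val_mask = toy.index.year == 2017
test_mask = toy.index.year >= 2018

toy_train, toy_val, toy_test = toy[train_mask], toy[val_mask], toy[test_mask]

# Every row lands in exactly one split
assert len(toy_train) + len(toy_val) + len(toy_test) == len(toy)
```

The boundary dates are the useful cases here: 2016-12-31 belongs only to the training split and 2017-12-31 only to the validation split.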


# Establish Baseline


*   This is a **classification problem**, so we use the frequency of the **majority class** as the **baseline accuracy score**.




In [10]:
print('Baseline accuracy:', y_train.value_counts(normalize=True).max())

Baseline accuracy: 0.5906040268456376
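
To make the baseline concrete, here is a minimal sketch on hypothetical labels (`y_demo` is invented for illustration). It shows that always predicting the majority class gives the same accuracy as the largest normalized class frequency.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical labels: 3 of 5 are False, so False is the majority class
y_demo = pd.Series([False, False, True, False, True])

# Predict the majority class for every row
majority_class = y_demo.mode()[0]
baseline_pred = [majority_class] * len(y_demo)
baseline_acc = accuracy_score(y_demo, baseline_pred)

# Same number as the value_counts shortcut used in the notebook
shortcut = y_demo.value_counts(normalize=True).max()
print(baseline_acc, shortcut)
```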


# Build Model



*   OneHotEncoder
*   SimpleImputer
*   StandardScaler



In [11]:
from category_encoders import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# The assignment calls for LogisticRegression; this notebook fits LinearRegression instead
from sklearn.linear_model import LinearRegression




In [12]:
model = make_pipeline(
    OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LinearRegression()
)

In [13]:
model.fit(X_train, y_train);



# Check Metrics

In [14]:
# For a LinearRegression pipeline, .score() returns R^2, not classification accuracy
print('Training R^2:', model.score(X_train, y_train))
print('Validation R^2:', model.score(X_val, y_val))

Training R^2: 0.1615402510785321
Validation R^2: 0.08185004652776362
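
Since the assignment asks for logistic regression, a classifier version of the pipeline might look like the sketch below. It is not the notebook's fitted model: it assumes scikit-learn's own `OneHotEncoder` inside a column transformer (swapped in for `category_encoders` so the block is self-contained) and uses hypothetical toy data (`X_demo`, `y_demo`). For a classifier, `score` returns accuracy.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data shaped like the notebook's two remaining features
X_demo = pd.DataFrame({
    'Burrito': ['California', 'Asada', 'California', 'Carnitas', 'Other', 'Asada'],
    'Tortilla': [3.0, 2.0, 4.5, 3.0, 4.0, 5.0],
})
y_demo = pd.Series([False, False, True, False, True, True])

# One-hot encode the categorical column, scale the numeric one
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
    (StandardScaler(), ['Tortilla']),
)
clf = make_pipeline(preprocess, LogisticRegression())
clf.fit(X_demo, y_demo)

preds = clf.predict(X_demo)      # class labels, not continuous values
acc = clf.score(X_demo, y_demo)  # mean accuracy for a classifier
print(preds, acc)
```

Swapping `LinearRegression()` for `LogisticRegression()` in the notebook's own pipeline would be the smaller change; the recorded scores would need to be regenerated afterwards.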


# Predict

In [15]:
y_pred = model.predict(X_test)

In [16]:
y_pred

array([0.72625788, 0.61200007, 0.2545782 , 0.84027156, 0.48309382,
       0.49871882, 0.38836726, 0.50262507, 0.04559382, 0.72625788,
       0.72625788, 0.48309382, 0.48309382, 0.04559382, 0.50262507,
       0.56634578, 0.26898249, 0.49774226, 0.38348445, 0.49774226,
       0.15985163, 0.50262507, 0.48309382, 0.71136531, 0.71136531,
       0.36883601, 0.2545782 , 0.84027156, 0.61200007, 0.59735163,
       0.2702032 , 0.48309382, 0.72625788, 0.71136531, 0.48309382,
       0.61200007, 0.71339813, 0.25655428])
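
These predictions are continuous values rather than `True`/`False` labels. One way to turn them into class predictions, sketched here with an assumed cutoff of 0.5 (the cutoff is an illustrative choice, not something the notebook specifies), is a simple threshold:

```python
import numpy as np

# First five values copied from the y_pred output above
scores = np.array([0.72625788, 0.61200007, 0.2545782, 0.84027156, 0.48309382])

# Assumed cutoff: treat a prediction of 0.5 or higher as 'Great'
labels = scores >= 0.5
print(labels.tolist())  # [True, True, False, True, False]
```

Labels produced this way could then be compared with `y_test` using `sklearn.metrics.accuracy_score` to get an actual accuracy figure.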

# Check Test Metrics

In [17]:
# Filling some of the NaN values with 0, instead of dropping those columns,
# might have improved this model. Something to try next time.
print('Test R^2:', model.score(X_test, y_test))

Test R^2: 0.06278850407135284
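
The note about NaN handling can be illustrated with a small sketch. Because `df.dropna(axis=1)` removed every column with a missing value, only `Burrito` and `Tortilla` reached the model, and the `SimpleImputer` step had nothing left to fill. The frame below is hypothetical (the `Salsa` values are invented), and mean imputation is shown as one alternative to filling with 0.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: 'Tortilla' is complete, 'Salsa' has one missing rating
demo = pd.DataFrame({'Tortilla': [3.0, 4.0, 2.0],
                     'Salsa': [4.0, np.nan, 2.0]})

# dropna(axis=1) discards the whole 'Salsa' column
dropped = demo.dropna(axis=1)

# Mean imputation keeps the column and fills the gap with the column mean
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(list(dropped.columns))       # ['Tortilla']
print(imputed['Salsa'].tolist())   # [4.0, 3.0, 2.0]
```

Leaving the NaN-containing columns in place and letting the pipeline's imputer handle them would keep more of the rating dimensions available as features.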
