Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.
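
The checklist above can be sketched end to end on toy data. This is a minimal illustration, not the assignment solution: the `toy` frame, its `Tortilla` and `Meat` columns, and the rule that generates `Great` are all hypothetical stand-ins for the burrito reviews.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the burrito reviews.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Year": rng.choice([2015, 2016, 2017, 2018], size=n),
    "Tortilla": rng.uniform(1, 5, size=n),
    "Meat": rng.uniform(1, 5, size=n),
})
# Made-up target: "Great" when the noisy average of the two scores is high.
noisy_mean = toy[["Tortilla", "Meat"]].mean(axis=1) + rng.normal(0, 0.3, size=n)
toy["Great"] = noisy_mean >= 3.5

# Time-based split: train on the past, validate and test on later years.
toy_train = toy[toy["Year"] <= 2016]
toy_val = toy[toy["Year"] == 2017]
toy_test = toy[toy["Year"] >= 2018]

features = ["Tortilla", "Meat"]
target = "Great"

# Majority-class baseline, scored on the validation set.
majority = toy_train[target].mode()[0]
baseline_acc = accuracy_score(toy_val[target], [majority] * len(toy_val))

# Logistic regression, fit on the training rows only.
model = LogisticRegression()
model.fit(toy_train[features], toy_train[target])
val_acc = model.score(toy_val[features], toy_val[target])

print(f"baseline validation accuracy: {baseline_acc:.3f}")
print(f"model validation accuracy:    {val_acc:.3f}")
```

The test rows are split off but deliberately left unscored here, matching the "one time, at the end" item in the checklist.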


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [0]:
#imports
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

In [0]:
# Make a new Year column in order to split the dataset by date.
# (infer_datetime_format is deprecated in pandas 2.x, so it is omitted here.)
df['Year'] = pd.to_datetime(df['Date']).dt.year

In [0]:
#splitting the dataset
train = df[df['Year'] <= 2016]
validate = df[df['Year'] == 2017]
test = df[df['Year'] >= 2018]

In [0]:
# Majority-class baseline: False (not 'Great') is the more common class in train
target = 'Great'
y_train = train[target]

majority_class = y_train.mode()[0]
y_pred_train = [majority_class] * len(y_train)

In [11]:
accuracy_score(y_train, y_pred_train)

0.5906040268456376

In [12]:
#baseline accuracy on the validation set
y_val = validate[target]
y_pred_val = [majority_class] * len(y_val)
accuracy_score(y_val, y_pred_val)

0.5529411764705883
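
The validation baseline comes out lower than the training baseline because the share of the training-set majority class differs between the two periods. The same baseline can also be expressed with scikit-learn's `DummyClassifier`, which may be convenient when comparing it against other estimators. The sketch below uses small hypothetical target vectors, not the burrito data.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy targets, made up for illustration.
y_train_toy = pd.Series([False, False, False, True, True])
y_val_toy = pd.Series([True, False, True, False])

# Class proportions: the largest share equals the majority-class
# accuracy on the training set.
print(y_train_toy.value_counts(normalize=True))

# DummyClassifier ignores the features, so placeholder zeros are enough.
X_train_dummy = np.zeros((len(y_train_toy), 1))
X_val_dummy = np.zeros((len(y_val_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_dummy, y_train_toy)

train_acc = baseline.score(X_train_dummy, y_train_toy)
val_acc = baseline.score(X_val_dummy, y_val_toy)
print(train_acc, val_acc)
```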

In [0]:
empty_columns = ['Mass (g)', 'Density (g/mL)', 'Queso']
date_data = ['Date', 'Year']
features = train.columns.drop([target] + empty_columns + date_data)

In [0]:
# Many of the features appear to be binary flags for whether an ingredient was present.
# They are coded as 'x', 'X', or NaN.
# I'm assuming that 'x' and 'X' mean the same thing.

X_train = train[features]
X_train_binary = X_train.copy()
for column in X_train_binary.columns:
    column_mode = X_train_binary[column].mode()[0]
    if column_mode in ('x', 'X'):
        # 1 if the ingredient is marked, 0 otherwise (including NaN)
        X_train_binary[column] = X_train_binary[column].astype(str).str.lower().eq('x').astype(int)

In [15]:
# I think it's important to keep whether a Yelp or Google review was made,
# so turn those ratings into strings and one-hot encode them.
import category_encoders as ce

X_train_pre_one_hot = X_train_binary.copy()
X_train_pre_one_hot['Yelp'] = X_train_pre_one_hot['Yelp'].apply(str)
X_train_pre_one_hot['Google'] = X_train_pre_one_hot['Google'].apply(str)

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_post_one_hot = encoder.fit_transform(X_train_pre_one_hot)

In [0]:
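# Fill the remaining missing values with each column's mean (SimpleImputer's default)
imputer = SimpleImputer()
X_train_post_impute = imputer.fit_transform(X_train_post_one_hot)
# Keep the original column names and row index so rows still line up with y_train
X_train_post_impute = pd.DataFrame(X_train_post_impute,
                                   columns=X_train_post_one_hot.columns,
                                   index=X_train_post_one_hot.index)
X_train_post_impute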
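
A natural next step is to reuse the fitted transformers on the validation set with `transform` (not `fit_transform`), so that validation rows are filled with statistics learned from the training rows only, and then to fit the logistic regression. The sketch below shows that pattern on hypothetical toy frames. The `Cost` and `Fries` columns and all values are assumptions for illustration, not results from the burrito data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy frames standing in for the encoded train/validation features.
X_train_toy = pd.DataFrame({"Cost": [6.0, np.nan, 8.0, 7.0],
                            "Fries": [1, 0, 1, 0]})
y_train_toy = pd.Series([True, False, True, False])
X_val_toy = pd.DataFrame({"Cost": [np.nan, 9.0], "Fries": [0, 1]})
y_val_toy = pd.Series([False, True])

# Learn the column means from the training rows only.
toy_imputer = SimpleImputer()
X_train_imp = pd.DataFrame(toy_imputer.fit_transform(X_train_toy),
                           columns=X_train_toy.columns,
                           index=X_train_toy.index)

# Apply the already-fitted imputer to validation: transform, not fit_transform.
X_val_imp = pd.DataFrame(toy_imputer.transform(X_val_toy),
                         columns=X_val_toy.columns,
                         index=X_val_toy.index)

toy_model = LogisticRegression()
toy_model.fit(X_train_imp, y_train_toy)
toy_val_acc = toy_model.score(X_val_imp, y_val_toy)
print(f"validation accuracy: {toy_val_acc:.3f}")
```

The missing validation `Cost` is filled with the training mean (7.0 here), not with a mean computed from the validation rows.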