<a href="https://colab.research.google.com/github/Struth-Rourke/DS-Unit-2-Linear-Models/blob/master/module4-logistic-regression/Assignment_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')
# df.head()

In [0]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset = ['overall'])
df['Great'] = df['overall'] >= 4

In [0]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [0]:
# Drop some high cardinality categoricals
df = df.drop(columns = ['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [0]:
# Drop some columns to prevent "leakage"
df = df.drop(columns = ['Rec', 'overall'])

In [360]:
# Viewing the df
print(df.shape)
df.head()

(421, 59)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [361]:
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format = True)
df['Date_Year'] = df['Date'].dt.year

df.replace('x', 1, inplace = True)
df.replace('X', 1, inplace = True)
df.replace('Yes', 1, inplace = True)
df.replace('No', 0, inplace = True)
df.fillna(0, inplace = True)
# Alternative: impute column means for selected columns instead of filling every NaN with 0


print(df.shape)
df.head()

(421, 60)


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great,Date_Year
0,California,2016-01-18,3.5,4.2,0.0,6.49,3.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,2016
1,California,2016-01-24,3.5,3.3,0.0,5.45,3.5,0.0,0.0,0.0,0.0,0.0,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,2016
2,Carnitas,2016-01-24,0.0,0.0,0.0,4.85,1.5,0.0,0.0,0.0,0.0,0.0,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,2016
3,Asada,2016-01-24,0.0,0.0,0.0,5.25,2.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,2016
4,California,2016-01-27,4.0,3.8,1.0,6.59,4.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,True,2016
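Filling every missing value with 0 makes a burrito with no Yelp or Google rating look the same as one rated 0. The comment in the cell above mentions mean imputation as an alternative. Here is a minimal sketch of that idea, assuming scikit-learn's `SimpleImputer` and a small made-up frame (`toy_train` and `toy_val` are hypothetical, not the burrito data). The means are learned from the training rows only and then reused for validation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: NaN marks a burrito with no Yelp/Google rating.
toy_train = pd.DataFrame({'Yelp': [3.5, np.nan, 4.0],
                          'Google': [4.2, 3.8, np.nan]})
toy_val = pd.DataFrame({'Yelp': [np.nan], 'Google': [4.5]})

imputer = SimpleImputer(strategy='mean')

# Learn the column means from the training rows only, then reuse them.
toy_train_imp = pd.DataFrame(imputer.fit_transform(toy_train),
                             columns=toy_train.columns)
toy_val_imp = pd.DataFrame(imputer.transform(toy_val),
                           columns=toy_val.columns)

print(toy_train_imp)
print(toy_val_imp)
```

In the toy data the missing validation Yelp value is filled with the training mean of 3.5 and 4.0, rather than with 0.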


In [0]:
### Changing all the columns from an Object to a Float

# List of columns that need to be changed
# (the dataset spells the column 'Bell peper')
#lst = ['NonSD','Beef','Pico','Guac','Cheese','Fries','Sour cream','Pork','Chicken',
#       'Shrimp','Fish','Rice','Beans','Lettuce','Tomato','Bell peper','Cabbage',
#       'Sauce','Salsa.1','Cilantro','Onion','Taquito','Pineapple','Ham','Corn', 'Chips']
#
# Convert each listed column (pass the Series, not the column name)
#for column in lst:
#  df[column] = pd.to_numeric(df[column], downcast = 'float')

In [363]:
# Viewing the df to make sure it was changed
df.info(verbose = True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 60 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Burrito         421 non-null    object        
 1   Date            421 non-null    datetime64[ns]
 2   Yelp            421 non-null    float64       
 3   Google          421 non-null    float64       
 4   Chips           421 non-null    float64       
 5   Cost            421 non-null    float64       
 6   Hunger          421 non-null    float64       
 7   Mass (g)        421 non-null    float64       
 8   Density (g/mL)  421 non-null    float64       
 9   Length          421 non-null    float64       
 10  Circum          421 non-null    float64       
 11  Volume          421 non-null    float64       
 12  Tortilla        421 non-null    float64       
 13  Temp            421 non-null    float64       
 14  Meat            421 non-null    float64       
 15  Fillin

In [364]:
# Viewing the df
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Yelp,421.0,0.803325,1.590489,0.0,0.0,0.0,0.0,4.5
Google,421.0,0.861283,1.698009,0.0,0.0,0.0,0.0,5.0
Chips,421.0,0.059382,0.23662,0.0,0.0,0.0,0.0,1.0
Cost,421.0,6.949834,1.746725,0.0,6.25,6.95,7.84,25.0
Hunger,421.0,3.470428,0.861041,0.0,3.0,3.5,4.0,5.0
Mass (g),421.0,28.541568,125.907378,0.0,0.0,0.0,0.0,925.0
Density (g/mL),421.0,0.035288,0.15153,0.0,0.0,0.0,0.0,0.865672
Length,421.0,13.469881,9.570803,0.0,0.0,18.5,20.5,26.0
Circum,421.0,14.774703,10.541694,0.0,0.0,21.0,22.5,29.0
Volume,421.0,0.524941,0.391316,0.0,0.0,0.68,0.83,1.54


In [365]:
train = df[df['Date_Year'] <= 2016]
val = df[df['Date_Year'] == 2017]
test = df[df['Date_Year'] >= 2018]

print(train.shape, val.shape, test.shape)

## CHECK:
# train['Date_Year'].value_counts()
# val['Date_Year'].value_counts()
# test['Date_Year'].value_counts()

(298, 60) (85, 60) (38, 60)


In [366]:
# Baseline
from sklearn.metrics import accuracy_score

target = 'Great'
y_train = train[target]
#y_train.value_counts(normalize = True)

majority_class = y_train.mode()[0]
y_pred_train = [majority_class] * len(y_train)

accuracy_score(y_train, y_pred_train)

0.5906040268456376
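The baseline above is scored on the training set. The same majority-class rule can also be scored on the validation set, which is the number a model's validation accuracy should beat. Here is a minimal sketch using scikit-learn's `DummyClassifier`. The labels are hypothetical stand-ins for `y_train` and `y_val`, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train and y_val.
y_train_toy = np.array([False, False, False, True, True])
y_val_toy = np.array([False, True, True, False])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

train_acc = accuracy_score(y_train_toy, baseline.predict(X_train_toy))
val_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(train_acc, val_acc)
```

The majority class is always taken from the training labels, so the validation baseline can differ from the training baseline when the class balance shifts between years.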

In [367]:
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

### DISCRETIONARY ATTEMPT NUMBER: 1

# Features and Target
features = ['Yelp', 'Google', 'Cost', 'Mass (g)', 'Density (g/mL)', 'Length', 
            'Volume', 'Synergy']
target = 'Great'

# X Features
X_train = train[features]
X_val = val[features]
X_test = test[features]

# y Target
y_train = train[target]
y_val = val[target]
y_test = test[target]

# Shape 
print(X_train.shape, X_val.shape, X_test.shape)
print('\n')

# Standard Scaler
scaler = StandardScaler()

X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

# Fit Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_sc, y_train)

# Predict Logistic Regression Model
y_pred_train = log_reg.predict(X_train_sc)
y_pred_val = log_reg.predict(X_val_sc)

# Accuracy on each split
print(f'Accuracy (Training): {log_reg.score(X_train_sc, y_train)}')
print(f'Accuracy (Validation): {log_reg.score(X_val_sc, y_val)}')
print(f'Accuracy (Testing): {log_reg.score(X_test_sc, y_test)}')

# Coefficients
print('\n')
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.columns)
coefs = pd.Series(log_reg.coef_[0], X_train_sc.columns)
print('Coefficients:')
print(coefs)

# Intercept
print('\n')
print('Intercept:', log_reg.intercept_)

(298, 8) (85, 8) (38, 8)


Accuracy (Training): 0.8154362416107382
Accuracy (Validation): 0.8
Accuracy (Testing): 0.7368421052631579


Coefficients:
Yelp              0.111974
Google           -0.282610
Cost              0.309911
Mass (g)          0.000000
Density (g/mL)    0.000000
Length           -0.103693
Volume           -0.025264
Synergy           2.541572
dtype: float64


Intercept: [-0.92947901]
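Because the features were standardized, each coefficient is the change in the log-odds of 'Great' for a one-standard-deviation increase in that feature. Exponentiating turns that into an odds ratio, which is often easier to read. The sketch below copies the coefficients from the printed output above. It is an illustration of the conversion, not a re-fit of the model.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the printed output above (standardized features).
coefs = pd.Series({
    'Yelp': 0.111974,
    'Google': -0.282610,
    'Cost': 0.309911,
    'Mass (g)': 0.000000,
    'Density (g/mL)': 0.000000,
    'Length': -0.103693,
    'Volume': -0.025264,
    'Synergy': 2.541572,
})

# exp(coefficient) = multiplicative change in the odds per +1 standard deviation.
odds_ratios = np.exp(coefs).sort_values(ascending=False)
print(odds_ratios.round(2))
```

Synergy dominates: one standard deviation more Synergy multiplies the odds of 'Great' by about 12.7. An odds ratio of exactly 1 (coefficient 0) for `Mass (g)` and `Density (g/mL)` is what one would expect if those columns are constant in the training split. That is an assumption here, since the notebook does not check it.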


In [368]:
train.columns

Index(['Burrito', 'Date', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
       'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
       'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
       'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini',
       'Great', 'Date_Year'],
      dtype='object')

In [369]:
## DISCRETIONARY ATTEMPT NUMBER: 2

# Features and Target
features = ['Yelp', 'Cost', 'Synergy', 'Burrito', 'Hunger']
target = 'Great'


# X Features
X_train = train[features]
X_val = val[features]
X_test = test[features]

# y Target
y_train = train[target]
y_val = val[target]
y_test = test[target]


# Shape 
print(X_train.shape, X_val.shape, X_test.shape)
print('\n')


# Category Encoder
encoder = ce.OneHotEncoder(cols = 'Burrito', use_cat_names = True)

X_train_enc = encoder.fit_transform(X_train)
X_val_enc = encoder.transform(X_val)
X_test_enc = encoder.transform(X_test)


# Standard Scaler
scaler = StandardScaler()

X_train_sc = scaler.fit_transform(X_train_enc)
X_val_sc = scaler.transform(X_val_enc)
X_test_sc = scaler.transform(X_test_enc)

# Fit Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_sc, y_train)

# Predict Logistic Regression Model
y_pred_train = log_reg.predict(X_train_sc)
y_pred_val = log_reg.predict(X_val_sc)

# Accuracy on each split
print(f'Accuracy (Training): {log_reg.score(X_train_sc, y_train)}')
print(f'Accuracy (Validation): {log_reg.score(X_val_sc, y_val)}')
print(f'Accuracy (Testing): {log_reg.score(X_test_sc, y_test)}')

# Coefficients
print('\n')
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train_enc.columns)
coefs = pd.Series(log_reg.coef_[0], X_train_enc.columns)
print('Coefficients:')
print(coefs)

# Intercept
print('\n')
print('Intercept:', log_reg.intercept_)

(298, 5) (85, 5) (38, 5)


Accuracy (Training): 0.8322147651006712
Accuracy (Validation): 0.7764705882352941
Accuracy (Testing): 0.7368421052631579


Coefficients:
Yelp                  -0.194717
Cost                   0.306453
Synergy                2.540105
Burrito_California     0.150236
Burrito_Carnitas       0.138112
Burrito_Asada          0.019765
Burrito_Other         -0.180309
Burrito_Surf & Turf   -0.086155
Hunger                 0.179344
dtype: float64


Intercept: [-0.95456062]
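The cell above one-hot encodes `Burrito` with `category_encoders` and then standardizes every column, including the 0/1 dummies. As a point of comparison, here is a minimal sketch of the same two steps using only scikit-learn, with a `ColumnTransformer` that scales the numeric column and leaves the dummies as 0/1. The toy frames and the choice not to scale the dummies are assumptions for illustration, not what the notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; 'Carnitas' appears only in validation.
toy_train = pd.DataFrame({'Burrito': ['California', 'Asada', 'Other'],
                          'Cost': [6.49, 5.25, 7.00]})
toy_val = pd.DataFrame({'Burrito': ['Carnitas'], 'Cost': [4.85]})

preprocess = ColumnTransformer(
    transformers=[
        # Unseen categories become an all-zero row instead of raising an error.
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['Burrito']),
        ('scale', StandardScaler(), ['Cost']),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_toy = preprocess.fit_transform(toy_train)  # fit on training rows only
X_val_toy = preprocess.transform(toy_val)

print(X_train_toy.shape)
print(X_val_toy)
```

The three training categories become three dummy columns, followed by the scaled `Cost`. The unseen validation category encodes as zeros in all three dummy columns.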


In [370]:
## Using SKLearn Feature Selection

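from sklearn.feature_selection import f_regression, SelectKBest
# Note: 'Great' is a binary class label; f_classif is the usual score function for classification targets
selector = SelectKBest(score_func = f_regression, k = 10)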

# Target and Features
target = 'Great'
features = ['Burrito', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
            'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
            'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
            'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
            'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
            'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
            'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
            'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
            'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

# X Features
X_train = train[features]
X_val = val[features]

# y Target
y_train = train[target]
y_val = val[target]


# Category Encoder
encoder = ce.OneHotEncoder(cols = 'Burrito', use_cat_names = True)

X_train = encoder.fit_transform(X_train)
X_val = encoder.transform(X_val)

# Applying Select KBest to CE features
X_train_selected = selector.fit_transform(X_train, y_train)
X_val_selected = selector.transform(X_val)


# Which features were selected?
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features Selected:')
for name in selected_names:
  print(name)

print('\n')
print('Features not selected:')
for name in unselected_names:
  print(name)

Features Selected:
Tortilla
Temp
Meat
Fillings
Meat:filling
Uniformity
Salsa
Synergy
Unreliable
Beans


Features not selected:
Burrito_California
Burrito_Carnitas
Burrito_Asada
Burrito_Other
Burrito_Surf & Turf
Yelp
Google
Chips
Cost
Hunger
Mass (g)
Density (g/mL)
Length
Circum
Volume
Wrap
NonSD
Beef
Pico
Guac
Cheese
Fries
Sour cream
Pork
Chicken
Shrimp
Fish
Rice
Lettuce
Tomato
Bell peper
Carrots
Cabbage
Sauce
Salsa.1
Cilantro
Onion
Taquito
Pineapple
Ham
Chile relleno
Nopales
Lobster
Queso
Egg
Mushroom
Bacon
Sushi
Avocado
Corn
Zucchini
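The selector above scores features with `f_regression`, although 'Great' is a class label. For a two-class target, the regression F-test and the ANOVA F-test (`f_classif`) should give the same statistic, so the selected features should not change. Here is a quick check of that claim on synthetic data from `make_classification` (an assumption; this is not the burrito data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Synthetic binary classification problem, for illustration only.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# Univariate F-statistics from both score functions.
F_clf, _ = f_classif(X, y)
F_reg, _ = f_regression(X, y)
print(np.allclose(F_clf, F_reg))

# Same selection step as in the notebook, with the classification scorer.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))
```

With more than two classes the two score functions differ, so `f_classif` is the safer default for classification targets.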




In [371]:
# Features and Target -- KBest = 10
features = ['Tortilla','Temp','Meat','Fillings','Meat:filling','Uniformity',
            'Salsa','Synergy','Unreliable','Beans']
target = 'Great'


# X Features
X_train = train[features]
X_val = val[features]
X_test = test[features]

# y Target
y_train = train[target]
y_val = val[target]
y_test = test[target]


# Shape 
print(X_train.shape, X_val.shape, X_test.shape)
print('\n')

# Standard Scaler
scaler = StandardScaler()

X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

# Fit Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_sc, y_train)

# Predict Logistic Regression Model
y_pred_train = log_reg.predict(X_train_sc)
y_pred_val = log_reg.predict(X_val_sc)

# Accuracy on each split
print(f'Accuracy (Training): {log_reg.score(X_train_sc, y_train)}')
print(f'Accuracy (Validation): {log_reg.score(X_val_sc, y_val)}')
print(f'Accuracy (Testing): {log_reg.score(X_test_sc, y_test)}')

# Coefficients
print('\n')
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.columns)
coefs = pd.Series(log_reg.coef_[0], X_train.columns)
print('Coefficients:')
print(coefs)

# Intercept
print('\n')
print('Intercept:', log_reg.intercept_)

(298, 10) (85, 10) (38, 10)


Accuracy (Training): 0.8825503355704698
Accuracy (Validation): 0.8117647058823529
Accuracy (Testing): 0.7631578947368421


Coefficients:
Tortilla        0.680834
Temp            0.476858
Meat            0.565198
Fillings        0.959758
Meat:filling    0.969165
Uniformity      0.104169
Salsa           0.382278
Synergy         1.824106
Unreliable      0.899465
Beans          -0.246271
dtype: float64


Intercept: [-1.10570213]


In [372]:
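from sklearn.feature_selection import f_regression, SelectKBest
# Note: 'Great' is a binary class label; f_classif is the usual score function for classification targets
selector = SelectKBest(score_func = f_regression, k = 5)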

# Target and Features
target = 'Great'
features = ['Burrito', 'Yelp', 'Google', 'Chips', 'Cost', 'Hunger',
            'Mass (g)', 'Density (g/mL)', 'Length', 'Circum', 'Volume', 'Tortilla',
            'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity', 'Salsa',
            'Synergy', 'Wrap', 'Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
            'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
            'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
            'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
            'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster', 'Queso',
            'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']

# X Features
X_train = train[features]
X_val = val[features]

# y Target
y_train = train[target]
y_val = val[target]


# Category Encoder
encoder = ce.OneHotEncoder(cols = 'Burrito', use_cat_names = True)

X_train = encoder.fit_transform(X_train)
X_val = encoder.transform(X_val)

# Applying Select KBest to CE features
X_train_selected = selector.fit_transform(X_train, y_train)
X_val_selected = selector.transform(X_val)


# Which features were selected?
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features Selected:')
for name in selected_names:
  print(name)

print('\n')
print('Features not selected:')
for name in unselected_names:
  print(name)

Features Selected:
Tortilla
Meat
Fillings
Meat:filling
Synergy


Features not selected:
Burrito_California
Burrito_Carnitas
Burrito_Asada
Burrito_Other
Burrito_Surf & Turf
Yelp
Google
Chips
Cost
Hunger
Mass (g)
Density (g/mL)
Length
Circum
Volume
Temp
Uniformity
Salsa
Wrap
Unreliable
NonSD
Beef
Pico
Guac
Cheese
Fries
Sour cream
Pork
Chicken
Shrimp
Fish
Rice
Beans
Lettuce
Tomato
Bell peper
Carrots
Cabbage
Sauce
Salsa.1
Cilantro
Onion
Taquito
Pineapple
Ham
Chile relleno
Nopales
Lobster
Queso
Egg
Mushroom
Bacon
Sushi
Avocado
Corn
Zucchini




In [373]:
# Features and Target -- the 5 SelectKBest features plus 'Temp' (6 features total)
features = ['Tortilla','Temp','Meat','Fillings','Meat:filling','Synergy']
target = 'Great'


# X Features
X_train = train[features]
X_val = val[features]
X_test = test[features]

# y Target
y_train = train[target]
y_val = val[target]
y_test = test[target]


# Shape 
print(X_train.shape, X_val.shape, X_test.shape)
print('\n')

# Standard Scaler
scaler = StandardScaler()

X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
X_test_sc = scaler.transform(X_test)

# Fit Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_sc, y_train)

# Predict Logistic Regression Model
y_pred_train = log_reg.predict(X_train_sc)
y_pred_val = log_reg.predict(X_val_sc)

# Accuracy on each split
print(f'Accuracy (Training): {log_reg.score(X_train_sc, y_train)}')
print(f'Accuracy (Validation): {log_reg.score(X_val_sc, y_val)}')
print(f'Accuracy (Testing): {log_reg.score(X_test_sc, y_test)}')

# Coefficients
print('\n')
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.columns)
coefs = pd.Series(log_reg.coef_[0], X_train.columns)
print('Coefficients:')
print(coefs)

# Intercept
print('\n')
print('Intercept:', log_reg.intercept_)

(298, 6) (85, 6) (38, 6)


Accuracy (Training): 0.8657718120805369
Accuracy (Validation): 0.8352941176470589
Accuracy (Testing): 0.7631578947368421


Coefficients:
Tortilla        0.696623
Temp            0.297783
Meat            0.495169
Fillings        0.924741
Meat:filling    0.851854
Synergy         1.962821
dtype: float64


Intercept: [-1.17200819]
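The cells above repeat the select, scale and fit steps by hand for each value of `k`, and print test accuracy on every attempt. The assignment checklist asks for test accuracy once, at the end. Here is a minimal sketch of one way to structure that: wrap the steps in a scikit-learn pipeline, compare values of `k` on validation accuracy only, and leave the test set untouched until a final choice is made. The synthetic data, the random split and the candidate `k` values are assumptions for illustration. The notebook itself splits by year.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and validation splits.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42)

val_scores = {}
for k in (5, 10, 20):
    # Each step is fit on the training split only, inside pipe.fit().
    pipe = make_pipeline(SelectKBest(f_classif, k=k),
                         StandardScaler(),
                         LogisticRegression())
    pipe.fit(X_tr, y_tr)
    val_scores[k] = pipe.score(X_va, y_va)  # validation accuracy

# Pick k using validation accuracy; the test set is not consulted here.
best_k = max(val_scores, key=val_scores.get)
print(val_scores, best_k)
```

Once `best_k` is chosen, the pipeline can be refit and scored on the test split a single time.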
