<a href="https://colab.research.google.com/github/tallywiesenberg/DS-Unit-2-Applied-Modeling/blob/master/DS_Sprint_Challenge_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_

# Applied Modeling Sprint Challenge: Predict Chicago food inspections 🍔

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 2010 to March 2019. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to install packages in Colab:

In [132]:
import sys

if 'google.colab' in sys.modules:
    # Install packages in Colab
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pandas-profiling==2.*
    !pip install pdpbox
    !pip install shap



#### Run this cell to load the data:

In [0]:
import pandas as pd

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding.

_To earn a score of 3 for this part, find and explain leakage. The dataset has a feature that will give you an ROC AUC score > 0.90 if you process and use the feature. Find the leakage and explain why the feature shouldn't be used in a real-world model to predict the results of future inspections._

### Part 2: Modeling

**Fit a model** with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation or do a three-way split (train/validate/test) and **estimate your ROC AUC** validation score.

Use your model to **predict probabilities** for the test set. **Get an ROC AUC test score >= 0.60.**

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70 (without using the feature with leakage)._


### Part 3: Visualization

Make visualizations for model interpretation. (You may use any libraries.) Choose two of these types:

- Permutation Importances
- Partial Dependence Plot, 1 feature isolation
- Partial Dependence Plot, 2 features interaction
- Shapley Values

_To earn a score of 3 for this part, make all four of these visualization types._

## Part 1: Preprocessing

> You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding.

In [134]:
train.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Violations,Latitude,Longitude,Location,Fail
0,2088270,"TOM YUM RICE & NOODLE, INC.",TOM YUM CAFE,2354911.0,Restaurant,Risk 1 (High),608 W BARRY,CHICAGO,IL,60657.0,2017-09-15T00:00:00,Canvass,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...,41.938007,-87.644755,"{'longitude': '-87.6447545707008', 'latitude':...",1
1,555268,FILLING STATION & CONVENIENCE STORE,FILLING STATION & CONVENIENCE STORE,1044901.0,Grocery Store,Risk 3 (Low),6646-6658 S WESTERN AVE,CHICAGO,IL,60636.0,2011-10-20T00:00:00,Complaint Re-Inspection,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.772402,-87.683603,"{'longitude': '-87.68360273081268', 'latitude'...",0


In [135]:
train.select_dtypes('object').nunique() > 50     #examining categorical features with more than 50 unique values

DBA Name            True
AKA Name            True
Facility Type       True
Risk               False
Address             True
City               False
State              False
Inspection Date     True
Inspection Type     True
Violations          True
Location            True
dtype: bool

In [0]:
#columns with high cardinality to drop
columns_drop = ['DBA Name', 'AKA Name','Address', 'Inspection Date', 'Inspection Type', 'Violations', 'Location',
                'Risk', 'Fail',                             #risk and fail (target) shouldn't leak into features
                'Inspection ID', 'License #']               #IDs aren't useful

In [137]:
columns_drop

['DBA Name',
 'AKA Name',
 'Address',
 'Inspection Date',
 'Inspection Type',
 'Violations',
 'Location',
 'Risk',
 'Fail',
 'Inspection ID',
 'License #']

In [0]:
features = train.columns[~train.columns.isin(columns_drop)] #list of features to use for prediction



## Part 2: Modeling

> **Fit a model** with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation or do a three-way split (train/validate/test) and **estimate your ROC AUC** validation score.
>
> Use your model to **predict probabilities** for the test set. **Get an ROC AUC test score >= 0.60.**

In [0]:
#train test split

from sklearn.model_selection import train_test_split
train, val = train_test_split(train, 
                              train_size = 0.8, test_size=0.2,
                              stratify = train['Fail'],
                              random_state=42)

In [0]:
X_train = train[features]
X_val = val[features]
X_test = test[features]

y_train = train['Fail']
y_val = val['Fail']
y_test = test['Fail']

In [141]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

#make pipeline
pipeline = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(),
                         StandardScaler(),
                         RandomForestClassifier(n_estimators = 500,
                                                n_jobs=-1,
                                                random_state=42))
#fit pipeline
pipeline.fit(X_train, y_train)

y_pred_proba = pipeline.predict_proba(X_val)[:, 1]

from sklearn.metrics import roc_auc_score

roc_auc_score(y_val, y_pred_proba)

0.538155962599317

In [0]:
##RANDOM SEARCH

from sklearn.model_selection import RandomizedSearchCV
#select hyperparameters
hyperparameters = {'simpleimputer__strategy': ['mean', 'median'],
                   'randomforestclassifier__max_depth': range(0, 50, 2),
                   'randomforestclassifier__min_samples_split': range(0, 500, 5),
                   'randomforestclassifier__min_samples_leaf': range(0, 500, 5)}
#apply search
search = RandomizedSearchCV(pipeline,
                            hyperparameters,
                            random_state = 42,
                            n_iter = 20,
                            cv = 5)
#fit search to trian set
best_model = search.fit(X_train, y_train)

In [143]:
y_pred_proba = best_model.predict_proba(X_val)[:, 1]

roc_auc_score(y_val, y_pred_proba)
best_model.best_params_

{'randomforestclassifier__max_depth': 12,
 'randomforestclassifier__min_samples_leaf': 45,
 'randomforestclassifier__min_samples_split': 395,
 'simpleimputer__strategy': 'mean'}

## Part 3: Visualization

> Make visualizations for model interpretation. (You may use any libraries.) Choose two of these types:
>
> - Permutation Importances
> - Partial Dependence Plot, 1 feature isolation
> - Partial Dependence Plot, 2 features interaction
> - Shapley Values

In [144]:
##PERMUTATION IMPORTANCE W/ ELI5

#pipeline without model (to use eli5)
small_pipeline = make_pipeline(OrdinalEncoder(),
                               SimpleImputer(),
                               StandardScaler())

#transform X_train and X_test
X_train_transformed = small_pipeline.fit_transform(X_train)
X_val_transformed = small_pipeline.transform(X_val)

#isolated random forest
model = RandomForestClassifier(max_depth=12,
  min_samples_leaf=45,
  min_samples_split=395)

model.fit(X_train_transformed, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=12, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=45, min_samples_split=395,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [145]:
from eli5.sklearn import PermutationImportance
import eli5
#instantiate permuter
permuter = PermutationImportance(model,
                                 scoring = 'accuracy',
                                 n_iter = 5,
                                 random_state=42)

#fit permuter to validation set
permuter.fit(X_val_transformed, y_val)

features = X_val.columns.tolist()

#show weights
eli5.show_weights(permuter,
                  top=None,
                  feature_names = features)

  rel_weight = (abs(weight) / weight_range) ** 0.7


Weight,Feature
0  ± 0.0000,Longitude
0  ± 0.0000,Latitude
0  ± 0.0000,Zip
0  ± 0.0000,State
0  ± 0.0000,City
0  ± 0.0000,Facility Type


In [150]:
X_val.columns

Index(['Facility Type', 'City', 'State', 'Zip', 'Latitude', 'Longitude'], dtype='object')

In [156]:
## PARTIAL DEPENDENCE PLOTS

from pdpbox.pdp import pdp_isolate, pdp_plot

feature = 'Facility Type'

isolated = pdp_isolate(
    model = model,
    dataset = X_val,
    model_features = X_val.columns,
    feature =feature)

pdp_plot(isolated, feature_name=feature)

TypeError: ignored

#Leak

I believe the column that would hurt the model in the real world is the "Risk" column. The problem with this column is that it implies that the inspectors have prior insight into the cleanliness of the restaurant, while the purpose of the model is to predict the cleanliness of the restaurant (whether they pass the test or not). I had a similar problem in my Unit 2 Build project. I was building a model to test whether car accidents would result in major injury or fatality (Y/N), and a series of attributes counting the number of fatalities/major injuries/minor injuries per pedestrian/cyclist/driver leaked into my model, giving me a ver accruate model that was useless because the most important feature could not be retrieved prior to real world accidents.