
_Lambda School Data Science Unit 2_

# Sprint Challenge: Practicing & Understanding Predictive Modeling

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 2010 to March 2019. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [0]:
import pandas as pd

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

In [0]:
train = train.drop(columns=["State", "Violations"])
test = test.drop(columns=["State", "Violations"])

In [0]:
y_train = train["Fail"] == 1
X_train = train.drop(columns=["Fail"])

y_test = test["Fail"] == 1
X_test = test.drop(columns=["Fail"])

In [0]:
!pip install category_encoders



In [0]:
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

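# Encode categorical columns as integers, standardize every column,
# then fill any remaining missing values with the column mean.
preprocessor = make_pipeline(ce.OrdinalEncoder(), 
                             StandardScaler(), 
                             SimpleImputer())

# Fit the preprocessing steps on the training features only.
X_train = preprocessor.fit_transform(X_train)

X_train = pd.DataFrame(X_train)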


In [0]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

scores = cross_validate(RandomForestClassifier(max_depth=5, n_estimators=100), 
                        X_train, 
                        y_train, 
                        scoring="roc_auc",
                        cv=3, 
                        return_train_score=True, 
                        return_estimator=True)
pd.DataFrame(scores)

| | fit_time | score_time | estimator | test_score | train_score |
|---|---|---|---|---|---|
| 0 | 3.833824 | 0.131105 | (DecisionTreeClassifier(class_weight=None, cri... | 0.680477 | 0.703544 |
| 1 | 3.834558 | 0.124022 | (DecisionTreeClassifier(class_weight=None, cri... | 0.682972 | 0.700089 |
| 2 | 3.827459 | 0.130577 | (DecisionTreeClassifier(class_weight=None, cri... | 0.685247 | 0.698975 |


In [0]:
print("ROC AUC Cross Validation Score:", scores["test_score"].mean())

ROC AUC Cross Validation Score: 0.6828985311803061


In [0]:
from sklearn.model_selection import RandomizedSearchCV

# 3 x 3 = 9 combinations, so n_iter=9 below tries every one of them.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 5, 6]
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_distributions, 
    n_iter=9, 
    cv=3, 
    scoring="roc_auc", 
    verbose=10, 
    return_train_score=True)

search.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] n_estimators=100, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  n_estimators=100, max_depth=4, score=0.6730114085549883, total=   3.8s
[CV] n_estimators=100, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.0s remaining:    0.0s


[CV]  n_estimators=100, max_depth=4, score=0.6759275876633415, total=   2.5s
[CV] n_estimators=100, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.7s remaining:    0.0s


[CV]  n_estimators=100, max_depth=4, score=0.6763935085801869, total=   2.5s
[CV] n_estimators=200, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    9.4s remaining:    0.0s


[CV]  n_estimators=200, max_depth=4, score=0.6741386816115323, total=   4.8s
[CV] n_estimators=200, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   14.6s remaining:    0.0s


[CV]  n_estimators=200, max_depth=4, score=0.6763108374663085, total=   4.8s
[CV] n_estimators=200, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   19.9s remaining:    0.0s


[CV]  n_estimators=200, max_depth=4, score=0.6801801904579438, total=   4.8s
[CV] n_estimators=300, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   25.1s remaining:    0.0s


[CV]  n_estimators=300, max_depth=4, score=0.673164332924467, total=   7.2s
[CV] n_estimators=300, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   32.9s remaining:    0.0s


[CV]  n_estimators=300, max_depth=4, score=0.6764348920346429, total=   7.3s
[CV] n_estimators=300, max_depth=4 ...................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   40.7s remaining:    0.0s


[CV]  n_estimators=300, max_depth=4, score=0.6797566109458888, total=   7.2s
[CV] n_estimators=100, max_depth=5 ...................................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   48.5s remaining:    0.0s


[CV]  n_estimators=100, max_depth=5, score=0.6802493875711598, total=   3.0s
[CV] n_estimators=100, max_depth=5 ...................................
[CV]  n_estimators=100, max_depth=5, score=0.6862493893066617, total=   2.9s
[CV] n_estimators=100, max_depth=5 ...................................
[CV]  n_estimators=100, max_depth=5, score=0.688271848887689, total=   3.0s
[CV] n_estimators=200, max_depth=5 ...................................
[CV]  n_estimators=200, max_depth=5, score=0.6822535117578675, total=   5.9s
[CV] n_estimators=200, max_depth=5 ...................................
[CV]  n_estimators=200, max_depth=5, score=0.6863484413712623, total=   5.8s
[CV] n_estimators=200, max_depth=5 ...................................
[CV]  n_estimators=200, max_depth=5, score=0.6886092302869061, total=   5.7s
[CV] n_estimators=300, max_depth=5 ...................................
[CV]  n_estimators=300, max_depth=5, score=0.6825687288993277, total=   8.8s
[CV] n_estimators=300, max_depth=5 .

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.8min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=9, n_jobs=None,
          param_distributions={'n_estimators': [100, 200, 300], 'max_depth': [4, 5, 6]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='roc_auc', verbose=10)

In [0]:
print('Cross Validation ROC_AUC:', search.best_score_)

Cross Validation ROC_AUC: 0.6910549079114173


In [0]:
from sklearn.metrics import roc_auc_score

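best = search.best_estimator_
# Apply the preprocessing fitted on the training set to the raw test features.
X_test_processed = preprocessor.transform(X_test)
y_pred_proba = best.predict_proba(X_test_processed)[:, 1]
print('Test ROC AUC:', roc_auc_score(y_test, y_pred_proba))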

Test ROC AUC: 0.6963621275115393


### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, find and explain leakage. The dataset has a feature that will give you an ROC AUC score > 0.90 if you process and use the feature. Find the leakage and explain why the feature shouldn't be used in a real-world model to predict the results of future inspections._
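
One possible way to screen for leakage is to score each feature on its own: a single column that reaches a very high ROC AUC by itself deserves a closer look at when and how it is recorded. The sketch below runs on synthetic data, and the column names `honest_feature` and `leaky_feature` are hypothetical. It does not identify which column of the Chicago dataset leaks.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic binary target standing in for `Fail` (assumption, not the real data).
y = rng.integers(0, 2, size=n)

# Hypothetical features: one only weakly related to the target, and one that
# is effectively the outcome itself plus a little noise.
demo = pd.DataFrame({
    "honest_feature": y + rng.normal(0, 2.0, size=n),
    "leaky_feature": y + rng.normal(0, 0.2, size=n),
})

def single_feature_auc(frame, target):
    """Return the ROC AUC each numeric column achieves when used alone as a score."""
    aucs = {}
    for col in frame.columns:
        auc = roc_auc_score(target, frame[col])
        # A feature that ranks the classes in reverse is just as informative.
        aucs[col] = max(auc, 1 - auc)
    return pd.Series(aucs).sort_values(ascending=False)

aucs = single_feature_auc(demo, y)
print(aucs)
```

A high single-feature score is only a hint. The deciding question is whether the value would be known before the inspection result exists.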

### Part 2: Modeling

**Fit a model** with the train set. (You may use scikit-learn, xgboost, or any other library.) **Use cross-validation** to **do hyperparameter optimization**, and **estimate your ROC AUC** validation score.

Use your model to **predict probabilities** for the test set. **Get an ROC AUC test score >= 0.60.**

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70 (without using the feature with leakage)._
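
A minimal sketch of these steps, assuming synthetic data from `make_classification` in place of the inspection data. Putting the preprocessing and the model in one pipeline lets the search refit both inside every fold. The names `demo_search`, `X_tr`, and `X_te` are illustrative, and the step prefix `randomforestclassifier__` comes from the automatic naming in `make_pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): a real run would use the train and test sets.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

# Preprocessing and model in one estimator, so each fold fits its own imputer.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=42),
)

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_distributions = {
    "randomforestclassifier__n_estimators": [25, 50],
    "randomforestclassifier__max_depth": [3, 5, None],
}

demo_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
demo_search.fit(X_tr, y_tr)

# Cross-validated estimate, then a single check on held-out data.
test_auc = roc_auc_score(y_te, demo_search.predict_proba(X_te)[:, 1])
print("CV ROC AUC:", demo_search.best_score_)
print("Test ROC AUC:", test_auc)
```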


### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:

- Feature Importances
- Permutation Importances
- Partial Dependence Plot
- Shapley Values

_To earn a score of 3 for this part, make at least two of these visualization types._
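
As one illustration, permutation importances can be computed with `sklearn.inspection.permutation_importance`. The sketch below assumes synthetic data in which only the first two columns carry signal; the `feature_0` to `feature_5` names are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): with shuffle=False the two informative
# columns come first and the remaining four are pure noise.
X, y = make_classification(n_samples=800, n_features=6, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model.fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in ROC AUC.
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.sort_values().plot.barh()` would turn this series into a horizontal bar chart.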

### Part 4: Gradient Descent

Answer both of these questions:

- What does Gradient Descent seek to minimize?
- What is the "Learning Rate" and what is its function?

One sentence is sufficient for each.

_To earn a score of 3 for this part, go above and beyond. Show depth of understanding and mastery of intuition in your answers._
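
To make the two questions concrete, here is a small sketch of gradient descent fitting a line by minimizing mean squared error. It uses plain Python and made-up data generated from `y = 2x + 1`. The function name and default values are illustrative choices, not part of the challenge.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error of the current line on every point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = mean(error ** 2) with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient; the learning rate scales the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # close to the true slope 2 and intercept 1
```

The quantity being driven down is the cost function (here, mean squared error). The learning rate controls how far each update moves: too small and convergence takes many more steps, too large and the updates can overshoot the minimum and diverge.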