
_Lambda School Data Science Unit 2_
 
 # Sprint Challenge: Practicing & Understanding Predictive Modeling

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 2010 to March 2019. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [0]:
import pandas as pd

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, find and explain leakage. The dataset has a feature that will give you an ROC AUC score > 0.90 if you process and use the feature. Find the leakage and explain why the feature shouldn't be used in a real-world model to predict the results of future inspections._
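
As one possible approach to the encoding step, the sketch below normalizes a categorical column, collapses rare levels into an `Other` bucket, and one-hot encodes the result with pandas. The toy frame, the `min_count` threshold, and the helper name `collapse_rare` are assumptions made for illustration; they are not part of the challenge data or instructions.

```python
import pandas as pd

# Toy stand-in for a column such as 'Facility Type' (values are illustrative)
toy = pd.DataFrame({
    'Facility Type': ['Restaurant', 'restaurant', 'Grocery Store', 'TAVERN',
                      'Restaurant', None, 'Grocery Store', 'KIOSK'],
})

def collapse_rare(series, min_count=2, other='Other'):
    """Normalize case, fill missing values, and bucket rare levels together."""
    cleaned = series.fillna('Unknown').str.strip().str.title()
    counts = cleaned.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent levels as-is; replace rare ones with the shared bucket
    return cleaned.where(~cleaned.isin(rare), other)

toy['Facility Type'] = collapse_rare(toy['Facility Type'])

# One indicator column per remaining level
encoded = pd.get_dummies(toy, columns=['Facility Type'], prefix='Facility')
print(encoded.columns.tolist())
```

Bucketing rare levels first keeps the one-hot matrix small, which matters for columns with many one-off spellings.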

#### Preliminary Exploration

In [0]:
!pip install category-encoders
!pip install eli5
!pip install shap

In [3]:
train.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Violations', 'Latitude', 'Longitude', 'Location',
       'Fail'],
      dtype='object')

In [51]:
train.isnull().sum()

Inspection ID        0
DBA Name             0
AKA Name           623
License #            5
Facility Type      224
Risk                12
Address              0
City                53
State               10
Zip                 26
Inspection Date      0
Inspection Type      1
Violations           0
Latitude           198
Longitude          198
Location           198
Fail                 0
dtype: int64

In [91]:
train['Facility Type'].value_counts()

Restaurant                           34235
Grocery Store                         6889
School                                3876
Bakery                                 842
Daycare (2 - 6 Years)                  830
Children's Services Facility           802
Daycare Above and Under 2 Years        656
Long Term Care                         394
Catering                               304
Mobile Food Dispenser                  278
Liquor                                 260
Daycare Combo 1586                     227
Wholesale                              199
Golden Diner                           162
Mobile Food Preparer                   159
Hospital                               141
TAVERN                                  87
Shared Kitchen User (Long Term)         68
Daycare (Under 2 Years)                 65
Special Event                           60
GAS STATION                             33
KIOSK                                   33
BANQUET HALL                            30
Shelter    

In [95]:
train.City.value_counts()

CHICAGO              51427
Chicago                 88
chicago                 34
CCHICAGO                16
SCHAUMBURG               6
CHicago                  5
ELK GROVE VILLAGE        4
MAYWOOD                  4
CHESTNUT STREET          3
CICERO                   3
OAK PARK                 2
ROSEMONT                 2
EAST HAZEL CREST         2
SKOKIE                   2
NAPERVILLE               2
ELMHURST                 2
ALSIP                    2
NILES NILES              2
CHICAGOCHICAGO           2
EVANSTON                 1
CHARLES A HAYES          1
CHICAGO HEIGHTS          1
HIGHLAND PARK            1
BERWYN                   1
BRIDGEVIEW               1
CHICAGOHICAGO            1
SCHILLER PARK            1
BROADVIEW                1
SUMMIT                   1
TINLEY PARK              1
WORTH                    1
LAKE BLUFF               1
CHCHICAGO                1
CHICAGOI                 1
BEDFORD PARK             1
OLYMPIA FIELDS           1
STREAMWOOD               1
B

In [97]:
train.State.value_counts()

IL    51627
Name: State, dtype: int64

In [0]:
import category_encoders as ce

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

def wrangle(df): 
  df = df.copy()
  # Drop high-cardinality or uninformative columns
  df = df.drop(columns = ['DBA Name', 'AKA Name', 'License #', 'Address', 'Location', 'City', 'State'])
  
  # Disabled: bin violations by their leading violation number
#   df.Violations = df.Violations.fillna(0).apply(str)
#   def clean_violation(entry):
#       return int(entry.split('.')[0])
#   df.Violations = df.Violations.apply(clean_violation)
  
  #Rename risk categories:
  risk_dict = {'Risk 1 (High)': 1, 'Risk 2 (Medium)': 2, 'Risk 3 (Low)': 3}
  df.Risk = df.Risk.replace(risk_dict)
  
  # Drop rows with a missing value in any remaining column
  df = df.dropna()
  
  # One-hot encode risk
  one_hot = pd.get_dummies(df['Risk'], prefix = 'Risk')
  df = df.join(one_hot)
  
  df = df.drop(columns=["Risk"])
  
  df['Inspection Date'] = pd.to_datetime(df['Inspection Date'])
  
  return df

train = wrangle(train)
test = wrangle(test)

In [119]:
train['Inspection Type'].value_counts()

Canvass                                   22972
License                                    5082
Complaint                                  4792
Canvass Re-Inspection                      4363
Complaint Re-Inspection                    1512
Short Form Complaint                       1472
License Re-Inspection                      1134
Suspected Food Poisoning                    200
Tag Removal                                 114
Consultation                                104
License-Task Force                           99
Recent Inspection                            44
Complaint-Fire                               42
Suspected Food Poisoning Re-inspection       38
Task Force Liquor 1475                       33
Short Form Fire-Complaint                    22
Special Events (Festivals)                   14
Complaint-Fire Re-inspection                  9
Package Liquor 1474                           4
LICENSE RENEWAL FOR DAYCARE                   1
KIDS CAFE                               

##### NOT FINISHED

In [0]:
features = ['Facility Type', 'Risk', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Violations', 'Latitude', 'Longitude', 'Location','Fail']

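**`Violations` is leaky: violations are recorded during the inspection itself, so the column would not be available when predicting the outcome of a future inspection. Some violations are also immediate fails, so a non-empty `Violations` value carries direct information about whether the establishment failed.**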
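
To make the leakage concrete, here is a minimal sketch that turns violation text into a single number (how many violations were cited) and scores that one feature against the target. The toy rows, the assumed string layout (numbered entries separated by `|`), and the helper `violation_codes` are illustrative assumptions, not the notebook's code or real inspection records.

```python
import re

import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy rows; the text layout "<number>. <description> | <number>. ..." is assumed
toy = pd.DataFrame({
    'Violations': [
        None,
        '32. description - Comments: text | 34. description - Comments: text',
        '3. description - Comments: text | 18. description | 35. description',
        '38. description - Comments: text',
        None,
        '2. description - Comments: text | 29. description - Comments: text',
    ],
    'Fail': [0, 0, 1, 0, 0, 1],
})

# A violation number appears at the start of the string or right after a '|'
CODE_PATTERN = re.compile(r'(?:^|\|)\s*(\d+)\.')

def violation_codes(text):
    """Return the list of violation numbers cited in one inspection."""
    if pd.isna(text):
        return []
    return [int(code) for code in CODE_PATTERN.findall(text)]

toy['n_violations'] = toy['Violations'].apply(lambda t: len(violation_codes(t)))

# Score the single derived feature directly against the target
auc = roc_auc_score(toy['Fail'], toy['n_violations'])
print('ROC AUC from violation count alone:', auc)
```

A high score from one column that is only filled in during the inspection is the signature of leakage: the feature describes the outcome rather than predicting it.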

In [0]:
train.Violations = train.Violations.fillna(0)

In [49]:
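# Per-class sums of the numeric location columns (call assumed from the output shown)
train.groupby('Fail')[['Zip', 'Latitude', 'Longitude']].sum()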

               Zip   Latitude   Longitude
Fail
0     2332384000.0  1605312.0  -3360755.0
1      813646300.0   560615.8  -1173706.0


In [0]:
train.Violations.value_counts()

In [31]:
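# Extract the leading violation number from a violation entry
def clean_violation(value):
  # str() also handles the integer 0 used to fill missing entries
  return int(str(value).split('.')[0])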



3
32
35
38
34


In [10]:
train[train.Violations!=0].Fail.value_counts()

0    29939
1    12322
Name: Fail, dtype: int64


### Part 2: Modeling

**Fit a model** with the train set. (You may use scikit-learn, xgboost, or any other library.) **Use cross-validation** to **do hyperparameter optimization**, and **estimate your ROC AUC** validation score.

Use your model to **predict probabilities** for the test set. **Get an ROC AUC test score >= 0.60.**

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70 (without using the feature with leakage)._
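
The workflow described above (cross-validated hyperparameter search, a validation ROC AUC estimate, then predicted probabilities on held-out data) can be sketched as follows. This is a minimal example on synthetic data from `make_classification`, standing in for the inspection features; the parameter grid and sample sizes are arbitrary assumptions chosen to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the features and the Fail target
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': [50, 100],
        'max_depth': [3, 5, None],
        'criterion': ['gini', 'entropy'],
    },
    n_iter=5,
    cv=3,
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_tr, y_tr)

print('Best params:', search.best_params_)
print('Cross-validated ROC AUC:', round(search.best_score_, 3))

# ROC AUC ranks scores, so use the predicted probability of class 1,
# not the hard 0/1 labels from predict()
test_proba = search.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_proba)
print('Test ROC AUC:', round(test_auc, 3))
```

Because `refit=True` is the default, the search object retrains the best configuration on the full training split and can be used directly for prediction.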

In [104]:
train.columns

Index(['Inspection ID', 'Facility Type', 'Zip', 'Inspection Date',
       'Inspection Type', 'Violations', 'Latitude', 'Longitude', 'Fail',
       'Risk_1.0', 'Risk_2.0', 'Risk_3.0'],
      dtype='object')

In [0]:
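features = ['Risk_1.0', 'Risk_2.0', 'Risk_3.0', 'Zip']
target = 'Fail'

# Drop incomplete rows from features and target together so X and y stay aligned
train_model = train.dropna(subset=features + [target])
test_model = test.dropna(subset=features + [target])

X_train, y_train = train_model[features], train_model[target]
X_test, y_test = test_model[features], test_model[target]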

In [121]:
X_train.isnull().sum()

Risk_1.0    0
Risk_2.0    0
Risk_3.0    0
Zip         0
dtype: int64

In [122]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_distributions = {
    'n_estimators': [100, 200], 
    'max_depth': [4, 5], 
    # Classifier split criteria; 'mse' and 'mae' are regressor criteria and raise a ValueError here
    'criterion': ['gini', 'entropy']
}

gridsearch = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42), 
    param_distributions=param_distributions, 
    n_iter=8, 
    cv=3, 
    scoring='roc_auc', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

gridsearch.fit(X_train, y_train)

### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:

- Feature Importances
- Permutation Importances
- Partial Dependence Plot
- Shapley Values

_To earn a score of 3 for this part, make at least two of these visualization types._
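
As a hedged illustration of two of the listed plot types, the sketch below draws impurity-based feature importances next to permutation importances for a random forest. It uses synthetic data and placeholder feature names (`feature_0` and so on), which are assumptions; with the real data, the fitted model and validation split would take their place.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f'feature_{i}' for i in range(X.shape[1])]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: drop in validation ROC AUC when one column is shuffled
perm = permutation_importance(model, X_val, y_val, scoring='roc_auc',
                              n_repeats=5, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
axes[0].barh(names, model.feature_importances_)
axes[0].set_title('Impurity-based importances')
axes[1].barh(names, perm.importances_mean, xerr=perm.importances_std)
axes[1].set_title('Permutation importances (validation ROC AUC)')
fig.tight_layout()
plt.show()
```

The two panels can disagree: impurity-based importances are computed on the training data, while permutation importances here are measured on the held-out split.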

### Part 4: Gradient Descent

Answer both of these two questions:

- What does Gradient Descent seek to minimize?
- What is the "Learning Rate" and what is its function?

One sentence is sufficient for each.

_To earn a score of 3 for this part, go above and beyond. Show depth of understanding and mastery of intuition in your answers._

1. Gradient descent seeks to minimize a cost function (for example, mean squared error) over the model's parameters. It works iteratively: at each step it computes the gradient of the cost at the current parameters and moves in the direction of steepest descent, which is the negative gradient.

2. The learning rate is the scalar that multiplies the gradient to set the size of each step. For example, with a learning rate of 0.1, each update moves the parameters along the negative gradient by 0.1 times the gradient's magnitude. Too small a rate makes convergence slow, while too large a rate can overshoot the minimum and diverge.
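
The two answers above can be illustrated with a minimal sketch: plain gradient descent on a one-dimensional toy cost, run with three learning rates. The cost function, starting point, and step count are assumptions picked only to show the behavior.

```python
def gradient_descent(grad, start, learning_rate, n_steps):
    """Repeatedly step against the gradient; return every visited point."""
    x = start
    path = [x]
    for _ in range(n_steps):
        # Update rule: move by learning_rate times the gradient, downhill
        x = x - learning_rate * grad(x)
        path.append(x)
    return path

# Toy cost J(x) = (x - 3)^2, with gradient 2 * (x - 3) and minimum at x = 3
def grad(x):
    return 2 * (x - 3)

for lr in (0.01, 0.1, 1.1):
    final = gradient_descent(grad, start=0.0, learning_rate=lr, n_steps=50)[-1]
    print(f'learning_rate={lr}: x after 50 steps = {final:.4f}')
```

With the smallest rate, the iterate is still well short of the minimum after 50 steps; the middle rate lands very close to x = 3; the largest rate overshoots further on every step and moves away from the minimum.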