# Building a Random Forest Model to Predict Default

We will use the popular Python machine learning package scikit-learn to build a random forest model that predicts the probability of default for a given loan. After you launch a notebook container, all the resources you need are automatically available within this notebook.

We will run the following steps to build our random forest model.
1. Pull training data
2. Optimize hyperparameters
3. Fit random forest

In [2]:


import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
import pickle  # replaces Python 2's cPickle; Python 3 uses the C implementation automatically
from IPython.display import clear_output, Image

from s3_connect import s3_connect

tmp_localdir = '/home/jupyter/'


# Pull training data
Running this cell will pull the Lending Club data into memory.

In [5]:
# Set up S3 connection object
s3_conn = s3_connect(access=os.environ['AWS_CLOUD_BUCKET_KEY'],
                     secret=os.environ['AWS_CLOUD_BUCKET_SECRET_KEY'],
                     bucketname='ds-site-static-assets')

# For demo training set, use this:
dat = s3_conn.pull_pickle_from_s3(key='ds-examples/loan-risk/data/demo_data.p',tmp_localdir='~')

Grabbed ds-examples/loan-risk/data/demo_data.p from S3. Local file ds-examples/loan-risk/data/demo_data.p is now available.


## Let's make sure our data is in the right format.

Below are the first 10 rows of the dataframe. Each row represents a loan, and each column represents a loan feature.

In [33]:
# Display first 10 rows
dat['X_train'].head(10)

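| | loan_amnt | int_rate | dti | annual_inc | delinq_2yrs | open_acc | revol_util | term_ 36 months | term_ 60 months | purpose_car | ... | addr_state_WA | addr_state_WI | addr_state_WV | addr_state_WY | home_ownership_ANY | home_ownership_MORTGAGE | home_ownership_NONE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 107265 | 14000.0 | 17.57 | 21.6 | 82000.0 | 1.0 | 24.0 | 43.8 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 84758 | 12000.0 | 23.43 | 27.04 | 40000.0 | 0.0 | 18.0 | 66.9 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 70313 | 5500.0 | 19.52 | 21.72 | 20000.0 | 0.0 | 11.0 | 77.3 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 114535 | 6825.0 | 14.99 | 13.84 | 45000.0 | 0.0 | 8.0 | 24.8 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 192027 | 18000.0 | 6.89 | 11.34 | 89000.0 | 0.0 | 5.0 | 81.7 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 49131 | 4500.0 | 15.22 | 21.11 | 100859.0 | 0.0 | 16.0 | 61.7 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 90056 | 16000.0 | 12.12 | 17.74 | 62500.0 | 0.0 | 5.0 | 89.3 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 139025 | 30000.0 | 15.8 | 22.54 | 80000.0 | 2.0 | 9.0 | 90.1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 262853 | 25000.0 | 18.25 | 22.42 | 72300.0 | 0.0 | 13.0 | 54.6 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 108457 | 11225.0 | 18.99 | 26.96 | 43000.0 | 0.0 | 10.0 | 63.4 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |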
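
Beyond eyeballing the first rows, a few programmatic checks can confirm the frame is ready for scikit-learn. The sketch below is only illustrative: `X_sample` is a hypothetical two-row stand-in built from the first two loans shown above, restricted to four of the columns, and the three checks (numeric dtypes, no missing values, one-hot term columns summing to 1) are assumptions about what "the right format" means here.

```python
import pandas as pd

# Hypothetical stand-in: first two loans shown above, four columns only.
X_sample = pd.DataFrame(
    {
        "loan_amnt": [14000.0, 12000.0],
        "int_rate": [17.57, 23.43],
        "term_ 36 months": [1, 0],
        "term_ 60 months": [0, 1],
    },
    index=[107265, 84758],
)

# Every feature should be numeric for RandomForestClassifier.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in X_sample.dtypes)

# No missing values anywhere in the frame.
no_missing = not X_sample.isnull().values.any()

# Each loan should have exactly one loan term flagged.
term_cols = ["term_ 36 months", "term_ 60 months"]
one_hot_ok = bool((X_sample[term_cols].sum(axis=1) == 1).all())

print(all_numeric, no_missing, one_hot_ok)
```

The same three checks could be run against `dat['X_train']` directly once the data has been pulled.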


# Build Model

## Optimize Hyperparameters
To improve the model, let's tune the hyperparameters of our random forest before we fit it. We will use RandomizedSearchCV. In contrast to GridSearchCV, it does not try every combination of hyperparameter values; instead, it samples a fixed number of parameter settings from the specified distributions. The number of settings tried is given by n_iter, which we set to 10. As a result, this method is computationally much faster than GridSearchCV. The tradeoff is that the best combination may be missed, but the resulting accuracy is usually deemed "good enough" if the number of search iterations is sufficient.
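
To give a feel for the difference in cost, here is a rough standard-library sketch that counts how many settings an exhaustive grid would contain for a discretized version of the search space used in this notebook, and then draws only `n_iter` random settings from it. This is not how scikit-learn implements its search; `param_grid`, `sampled`, and the sampling scheme are assumptions made purely for illustration.

```python
import itertools
import random

# Discretized version of the search space used in this notebook
# (sp_randint(low, high) draws integers in [low, high - 1]).
param_grid = {
    "max_depth": [3, None],
    "max_features": list(range(1, 11)),
    "min_samples_split": list(range(2, 11)),
    "min_samples_leaf": list(range(1, 11)),
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}
names = list(param_grid)

# An exhaustive grid search would evaluate every combination.
n_grid = len(list(itertools.product(*(param_grid[n] for n in names))))

# A randomized search evaluates only n_iter independently drawn settings.
rng = random.Random(0)
n_iter = 10
sampled = [{n: rng.choice(param_grid[n]) for n in names} for _ in range(n_iter)]

print("grid settings:", n_grid)
print("sampled settings:", len(sampled))
```

Each setting is then cross-validated, so the total number of model fits scales with the number of settings multiplied by the number of folds.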

The output below describes the RandomizedSearchCV run. Most importantly for us, the fitted object stores the set of hyperparameters that gave the best 3-fold cross-validation accuracy, available as random_search.best_params_.

In [34]:

# Initialize classifier object
clf = RandomForestClassifier()

# Specify hyperparameter space to optimize
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Run randomized search
n_iter_search = 10
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
random_search.fit(dat['X_train'], dat['y_train'])

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'bootstrap': [True, False], 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11bbc9250>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10a744e10>, 'criterion': ['gini', 'entropy'], 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10a744690>, 'max_depth': [3, None]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=
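
The printed representation does not list the winning settings itself, so it can help to read them from the fitted search object. The sketch below runs a small search on synthetic data and inspects `best_params_`, `best_score_`, and `cv_results_`. The data from `make_classification`, the reduced search space, and the explicit `cv=3` and `random_state` values are assumptions for illustration, not the notebook's actual run.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the loan data (assumption: not the Lending Club set).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions={
        "max_depth": [3, None],
        "max_features": randint(1, 11),
        "min_samples_leaf": randint(1, 11),
    },
    n_iter=5,        # number of sampled settings
    cv=3,            # 3-fold cross-validation, set explicitly here
    random_state=0,  # makes the sampled settings reproducible
)
search.fit(X, y)

best_params = search.best_params_  # dict of the best sampled setting
best_score = search.best_score_    # mean cross-validated accuracy of that setting
n_tried = len(search.cv_results_["params"])

print(best_params)
print(round(best_score, 3), n_tried)
```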

## Fit Random Forest
We will now use the best hyperparameters determined above to fit our random forest model. 

In [35]:
# Initialize a new random forest classifier object that contains the optimized hyperparameters determined above.
clf = RandomForestClassifier(**random_search.best_params_)

# Fit the random forest using the Lending Club data.
clf.fit(dat['X_train'], dat['y_train'])




RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features=4, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=4,
            min_samples_split=10, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
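
Since the goal is a probability of default rather than a hard label, the fitted forest would typically be queried with `predict_proba`. The following sketch shows that call on synthetic data. It assumes that class label `1` means "default", which the notebook does not state, and the train/test split and hyperparameter values are likewise illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data; label 1 is assumed to mean "default".
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=10, min_samples_leaf=4, random_state=0)
model.fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in model.classes_.
positive_col = list(model.classes_).index(1)
proba_default = model.predict_proba(X_test)[:, positive_col]

print(proba_default[:5])
```

Looking up the column through `model.classes_` avoids relying on the positive class always being the second column.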

In [36]:
s3_conn.push_file_to_s3(pickle.dumps(clf), key='demos/loan-risk/models/RF_demo.p', string=True)

Sent string to S3 with key 'demos/loan-risk/models/RF_demo.p'
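
Before relying on the uploaded artifact, it can be worth confirming that the pickled model round-trips cleanly. The sketch below serializes a small forest with `pickle.dumps`, restores it with `pickle.loads`, and compares predictions. The synthetic data and model are stand-ins, and no S3 call is made here.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

payload = pickle.dumps(model)     # bytes that could be uploaded as-is
restored = pickle.loads(payload)  # only unpickle data from a trusted source

# The restored model should reproduce the original predictions exactly.
same = bool((restored.predict(X) == model.predict(X)).all())

print(type(payload).__name__, same)
```

A pickled scikit-learn model is generally only safe to load with the same scikit-learn version that produced it, so recording that version alongside the file is a reasonable precaution.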


# Next Steps: Deploy model
Now that we have built a model to predict loan default, let's deploy the model so it can be called via an API.
You can view a previously deployed API at:
### https://demo.datascience.com/project/optimizing-your-investment-strategy/outputs/loan-risk-predictor-64730/versions/1

# Next Steps: Report insights and methodology

### <a href="https://demo.datascience.com/project/optimizing-your-investment-strategy/outputs/optimizing-loan-selection-UG9zdFR5cGU6MTU4" target="_blank">Now that our model for predicting loan default is deployed, let's make a report detailing our loan selection methodology to keep our stakeholders informed.</a>