# Random Forest 

In this notebook we explore publicly available data from [LendingClub.com](www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to fund borrowers whose profile indicates a high probability of paying you back. We will build a model that helps predict this.


We will use lending data from 2007-2010 and try to classify whether or not the borrower paid back their loan in full (recorded in the not.fully.paid column). Use the loan_data.csv file already provided.

Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
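
Two of the columns above are stored in transformed units: int.rate is a proportion and log.annual.inc is a natural log. The sketch below shows one way to convert them back to more familiar units. The rows are made-up values, and the `annual.inc` and `int.rate.pct` column names are hypothetical helpers that do not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse two column names from the list above.
toy = pd.DataFrame({
    "int.rate": [0.11, 0.1357],
    "log.annual.inc": [np.log(50_000), np.log(85_000)],
})

# Undo the natural log to recover the self-reported income (hypothetical column name).
toy["annual.inc"] = np.exp(toy["log.annual.inc"])

# Express the interest rate as a percentage (hypothetical column name).
toy["int.rate.pct"] = toy["int.rate"] * 100

print(toy.round(2))
```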

## Import Libraries

**Import the usual libraries.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

**Use pandas to read loan_data.csv as a dataframe called loans.**

In [None]:
loans = pd.read_csv('loan_data.csv')

**Check out the info(), head(), and describe() methods on loans.**

In [None]:
loans.info()

In [None]:
loans.describe()

In [None]:
loans.head()

## Exploratory Data Analysis

**Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.**

In [None]:
plt.figure(figsize=(10,6))
# Overlay the FICO histograms of the two credit.policy groups
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='blue',
                                              bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')

**Create a similar figure, except this time select by the not.fully.paid column.**

In [None]:
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='blue',
                                              bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

**Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.**

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x='purpose',hue='not.fully.paid',data=loans,palette='Set1')

**Let's see the trend between FICO score and interest rate with a jointplot.**

In [None]:
sns.jointplot(x='fico',y='int.rate',data=loans,color='purple')

**Create lmplots to see whether the trend differs between not.fully.paid and credit.policy. Check the documentation for `lmplot()`.**

In [None]:
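# lmplot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only leave an empty figure behind
sns.lmplot(y='int.rate',x='fico',data=loans,hue='credit.policy',
           col='not.fully.paid',palette='Set1')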

## Setting up the Data
**Check loans.info() again.**

In [None]:
loans.info()

## Categorical Features

Notice that the **purpose** column is categorical.

We need to transform the **purpose** column into dummy variables (using `pd.get_dummies`) so that sklearn can work with it.


**Create a list of 1 element containing the string 'purpose'. Call this list cat_feats.**

In [None]:
cat_feats = ['purpose']

**Now use `pd.get_dummies()` to create a larger dataframe in which the purpose column is replaced by new dummy-variable feature columns. Call this dataframe final_data. Check the documentation to see how `pd.get_dummies()` works.**

In [None]:
final_data = pd.get_dummies(loans,columns=cat_feats,drop_first=True)

In [None]:
final_data.info()
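
To see what `pd.get_dummies()` does with `drop_first=True`, here is a small sketch on a made-up dataframe. The `toy` and `encoded` names and all values are illustrative assumptions, not part of the loan data.

```python
import pandas as pd

# Made-up rows with one numeric and one categorical column.
toy = pd.DataFrame({
    "fico": [700, 650, 720, 680],
    "purpose": ["credit_card", "all_other", "educational", "credit_card"],
})

# Each category becomes its own indicator column; drop_first=True drops the
# first category in sorted order ("all_other"), which then acts as the baseline.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row whose dummy columns are all zero (or False) belongs to the dropped baseline category. Dropping one level avoids carrying a column that is fully determined by the others.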

## Train Test Split

Now split the data into a training set and a testing set.

**Use sklearn's train_test_split to hold out 30% of the rows for testing.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = final_data.drop('not.fully.paid',axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
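
As a quick sanity check on what a split produces, the sketch below runs `train_test_split` on made-up data and inspects the sizes and class proportions of each part. The `stratify` argument is an optional variation that the split above does not use. It is shown here only as one way to keep the class ratio the same in both parts, and all `*_toy`, `*_tr` and `*_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 20% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy is an optional extra, not part of the notebook's own split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```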

## Training a Decision Tree Model

Let's start by training a single decision tree first.

**Import DecisionTreeClassifier.**

In [None]:
from sklearn.tree import DecisionTreeClassifier

**Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.**

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

## Predictions and Evaluation of Decision Tree
**Create predictions from the test set, then compute a confusion matrix, an accuracy score and a classification report.**

In [None]:
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
# Confusion matrix
print(confusion_matrix(y_test,predictions))

In [None]:
# Test accuracy score
print('Test accuracy score: '+ str(accuracy_score(y_test,predictions)))

In [None]:
# Classification report
print(classification_report(y_test,predictions))
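
To make the link between the confusion matrix and the numbers in the classification report concrete, the sketch below derives accuracy, precision and recall by hand on a handful of made-up labels. `y_true` and `y_pred` are illustrative lists, not model output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 is the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, sklearn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)

# The hand-computed values agree with sklearn's own helpers.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Accuracy alone can look good while recall for the positive class stays low, which is why the report is worth reading per class.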

## Training the Random Forest model

**Create an instance of the RandomForestClassifier class and fit it to our training data.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=600)

In [None]:
rfc.fit(X_train,y_train)

## Predictions and Evaluation

Let's predict the y_test values and evaluate our model.

**Predict the class of not.fully.paid for the X_test data.**

In [None]:
pred = rfc.predict(X_test)

**Compute confusion matrix, accuracy score and classification report.**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
# Confusion matrix
print(confusion_matrix(y_test,pred))

In [None]:
# Test accuracy score
print('Test accuracy score: '+ str(accuracy_score(y_test,pred)))

In [None]:
# Classification report
print(classification_report(y_test,pred))

## Grid Search

#### Create a parameter grid. Focus on these hyperparameters: number of estimators, maximum depth, and maximum features. You can also investigate minimum samples per leaf, minimum samples per split, and bootstrap. Spend some time researching which values or value ranges make sense for each parameter.

In [None]:
# Create the parameter grid
param_grid_rfc = {
    'bootstrap': [True],
    'oob_score': [True],
    'max_depth': [100,110,120, 130],
    'max_features': [0.33],
    'min_samples_leaf': [1,2],
    'min_samples_split': [2,4],
    'n_estimators': [400, 600, 1000]
}
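
Before launching a search, it can help to count how many models the grid implies. The sketch below repeats the grid from the cell above and uses `ParameterGrid` to enumerate the combinations. The `n_folds = 5` value is an assumption that matches scikit-learn's default cross-validation in recent versions.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, repeated so this block runs on its own.
param_grid_rfc = {
    'bootstrap': [True],
    'oob_score': [True],
    'max_depth': [100, 110, 120, 130],
    'max_features': [0.33],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 4],
    'n_estimators': [400, 600, 1000]
}

# Number of distinct hyperparameter combinations.
n_candidates = len(ParameterGrid(param_grid_rfc))

# Assumed number of cross-validation folds (the default when cv=None).
n_folds = 5

print(n_candidates)            # combinations to evaluate
print(n_candidates * n_folds)  # model fits, not counting the final refit
```

Each of those fits trains several hundred trees, so it is worth trimming the grid or trying it on a subsample first if runtime matters.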

#### Create a base classifier.

In [None]:
rfc_base = RandomForestClassifier()

#### Perform a grid search.

In [None]:
from sklearn.model_selection import GridSearchCV
# cv=None uses scikit-learn's default cross-validation (5-fold since version 0.22)
rfc_grid = GridSearchCV(estimator=rfc_base, param_grid=param_grid_rfc, cv=None)

#### Fit the estimator to the training data (this will take a while).

In [None]:
rfc_grid.fit(X_train, y_train)

#### Check the best estimator and score.

In [None]:
rfc_best_estimator = rfc_grid.best_estimator_
rfc_best_score = rfc_grid.best_score_

print(rfc_best_estimator)
print(rfc_best_score)
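
Besides `best_estimator_` and `best_score_`, a fitted search also exposes `best_params_` and a full `cv_results_` table. The sketch below shows them on a deliberately tiny grid and synthetic data so that it runs in seconds. The `small_grid`, `search` and `results` names, the data and the grid values are all illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the loan dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Tiny grid so the search finishes quickly.
small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_toy, y_toy)

# Winning combination as a plain dictionary.
print(search.best_params_)

# One row per combination, with the mean cross-validated score and its rank.
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```

Note that `best_score_` is the mean cross-validated score on the training folds, so it is not directly comparable to the accuracy measured on the held-out test set.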

#### Predict on the test data.

In [None]:
pred_grid = rfc_grid.predict(X_test)

**Compute confusion matrix, accuracy score and classification report.**

In [None]:
# Confusion matrix
print(confusion_matrix(y_test,pred_grid))

In [None]:
# Test accuracy score
print('Test accuracy score: '+ str(accuracy_score(y_test,pred_grid)))

In [None]:
# Classification report
print(classification_report(y_test,pred_grid))