# Name(s) Group:
# Class: Knowledge Discovery and Data Mining
# Date: Summer 2020

# Decision Trees and Random Forest Models applied to LendingClub data set 

## Knowledge Discovery in Databases - Data Mining
## DIRECTIONS
### Review all code and markdown.  Provide responses to questions at the end of the notebook (markdown, code)
### Turn in this notebook via Canvas in the assignment area

For this exercise, we will explore publicly available data from [LendingClub.com](www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). We will build a model that predicts the risk of lending money to someone from a wide range of credit-related data. We use lending data from 2007-2010 and try to classify and predict **whether or not the borrower paid back their loan in full.**

Here are what the columns in the data set represent:

* **credit.policy**: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* **purpose**: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* **int.rate**: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* **installment**: The monthly installments owed by the borrower if the loan is funded.
* **log.annual.inc**: The natural log of the self-reported annual income of the borrower.
* **dti**: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* **fico**: The FICO credit score of the borrower.
* **days.with.cr.line**: The number of days the borrower has had a credit line.
* **revol.bal**: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* **revol.util**: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* **inq.last.6mths**: The borrower's number of inquiries by creditors in the last 6 months.
* **delinq.2yrs**: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* **pub.rec**: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* **not.fully.paid**: The quantity of interest for classification: 1 if the borrower did not pay back the loan in full, and 0 otherwise.

# Import Libraries and data set

**Import the usual libraries for pandas and plotting**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Get the Data

**Use pandas to read loan_data.csv**

In [None]:
df = pd.read_csv('loan_data.csv')

### Check out the info(), describe(), and head() methods on df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
print("Following is an analysis of credit approval status. 1 means approved credit, 0 means not approved.")
print(df['credit.policy'].value_counts())

# Exploratory Data Analysis

### Histogram of FICO scores by credit approval status

In [None]:
df[df['credit.policy']==1]['fico'].plot.hist(bins=30,alpha=0.5,color='blue', label='Credit.Policy=1')
df[df['credit.policy']==0]['fico'].plot.hist(bins=30,alpha=0.5, color='red', label='Credit.Policy=0')
plt.legend(fontsize=15)
plt.title("Histogram of FICO score by approved or disapproved credit policies", fontsize=16)
plt.xlabel("FICO score", fontsize=14)

### Do various factors differ by credit approval status?

In [None]:
sns.boxplot(x=df['credit.policy'],y=df['int.rate'])
plt.title("Interest rate varies between risky and non-risky borrowers", fontsize=15)
plt.xlabel("Credit policy",fontsize=15)
plt.ylabel("Interest rate",fontsize=15)

In [None]:
sns.boxplot(x=df['credit.policy'],y=df['log.annual.inc'])
plt.title("Income level does not make a big difference in credit approval odds", fontsize=15)
plt.xlabel("Credit policy",fontsize=15)
plt.ylabel("Log. annual income",fontsize=15)

In [None]:
sns.boxplot(x=df['credit.policy'],y=df['days.with.cr.line'])
plt.title("Credit-approved users have slightly more days with a credit line", fontsize=15)
plt.xlabel("Credit policy",fontsize=15)
plt.ylabel("Days with credit line",fontsize=15)

In [None]:
sns.boxplot(x=df['credit.policy'],y=df['dti'])
plt.title("Debt-to-income level does not make a big difference in credit approval odds", fontsize=15)
plt.xlabel("Credit policy",fontsize=15)
plt.ylabel("Debt-to-income ratio",fontsize=15)

### Countplot bar chart of loans by purpose, with the color hue defined by not.fully.paid

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='purpose',hue='not.fully.paid',data=df, palette='Set1')
plt.title("Bar chart of loan purpose colored by not fully paid status", fontsize=17)
plt.xlabel("Purpose", fontsize=15)

### Trend between FICO score and interest rate

In [None]:
sns.jointplot(x='fico',y='int.rate',data=df, color='purple', height=12)

### LM plot to see if the trend differed between not.fully.paid and credit.policy

In [None]:
# lmplot builds its own figure, so a separate plt.figure() call is not needed
sns.lmplot(y='int.rate',x='fico',data=df,hue='credit.policy',
           col='not.fully.paid',palette='Set1',height=6)

# Setting up the Data
## Categorical Features

The **purpose** column is categorical. We transform it into dummy variables so that sklearn will be able to understand it.

In [None]:
# columns= names the column to encode (a bare positional list would be read as the prefix)
df_final = pd.get_dummies(df,columns=['purpose'],drop_first=True)

In [None]:
df_final.head()
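
To make the effect of dummy encoding concrete, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from loan_data.csv; the point is only to show which columns `get_dummies` produces when `drop_first=True`.

```python
import pandas as pd

# Toy frame (hypothetical values, not taken from loan_data.csv)
toy = pd.DataFrame({
    "fico": [700, 650, 720],
    "purpose": ["credit_card", "all_other", "educational"],
})

# drop_first=True drops the first category in sorted order ("all_other"),
# so a row with all dummy columns equal to 0 means "all_other"
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids carrying a redundant column, since the dropped category is implied when every remaining dummy is 0.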

## Train Test Split
For supervised learning, we predict whether or not the borrower paid their loan in full, using not.fully.paid as the target. The train-test split is 70%/30%: we train on 70% of the data, including the target variable, and then use the model to predict the held-out test set.

We did not deal with missing values here, but if you do, remember to derive any fill values (for example, a column mean) from the train set only. Using information from the test set leaks knowledge of the "future" data into the model, which behaves much like overfitting: the measured accuracy looks very good because the model has effectively seen all of the data, but when you apply the model to actual new data the accuracy can be poor.
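
A minimal sketch of that rule, assuming mean imputation with scikit-learn's `SimpleImputer` and a toy array with hypothetical values (the notebook's data has no missing values, so this is illustration only): the fill value is learned from the training rows and then reused, unchanged, on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature matrix with gaps (hypothetical values, not from loan_data.csv)
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0],
                  [np.nan], [7.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1])

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0)

# Learn the fill value (the column mean) from the training rows only ...
imputer = SimpleImputer(strategy="mean")
Xtr_filled = imputer.fit_transform(Xtr)
# ... then reuse that same value on the test rows without refitting
Xte_filled = imputer.transform(Xte)
```

Calling `fit_transform` on the full data set before splitting would let test rows influence the fill value, which is the leak described above.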

In [None]:
from sklearn.model_selection import train_test_split
X = df_final.drop('not.fully.paid',axis=1)
y = df_final['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [None]:
X.head()

## Training a Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

**Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.**

In [None]:
dtree = DecisionTreeClassifier(criterion='gini',max_depth=None)

In [None]:
dtree.fit(X_train,y_train)

## Predictions and Evaluation of Decision Tree
**Create predictions from the test set and create a classification report and a confusion matrix.**

In [None]:
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,predictions))

In [None]:
cm=confusion_matrix(y_test,predictions)
print(cm)
print("Accuracy of prediction:",round((cm[0,0]+cm[1,1])/cm.sum(),3))

## Training the Random Forest model

Now it's time to train our random forest model.
Why is this better?  Random forest is an ensemble learning method, which helps to avoid the overfitting described earlier.  Ensemble methods create not one but many models (here, many decision trees) and then generally take the mode of their results in order to improve accuracy on new data.  We like to say of random forest, "Nobody knows everything, but everybody knows something": combining multiple models lets us learn even more and improve our predictive capabilities on new data.
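
The "many trees, then take the mode" idea can be sketched by hand. This is a simplified illustration on synthetic data (an assumption, standing in for the loan data), not how the library is implemented internally: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the loan data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(0, 1_000_000)))
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Each row of votes holds one tree's predictions for every sample
votes = np.array([t.predict(X_demo) for t in trees])
# Majority vote (the mode) for 0/1 labels: more than half of the trees say 1
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```

An odd number of trees is used so that a binary vote can never tie.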

Many like to say that Random Forest is what you try first for supervised learning.  It is almost always a good choice. Why?
1)  It can solve classification and regression problems.
2)  It can handle all types of input without need for scaling: binary, categorical, numerical, etc.
3)  It does not require a lot of parameter tweaking to get good results.
4)  It handles missing data well (depending on the implementation; some libraries require missing values to be filled in first).

So . . . is it a SILVER BULLET?  No. In fact, it can seem like a black box to those who like to dig deep into the statistics.

**Create an instance of the RandomForestClassifier class and fit it to our training data from the previous step.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=600)

In [None]:
rfc.fit(X_train, y_train)

## Predictions and Evaluation

Let's predict the y_test values and evaluate our model.

**Predict the class of not.fully.paid for the X_test data.**

In [None]:
rfc_pred = rfc.predict(X_test)

**Now create a classification report from the results.**

In [None]:
cr = classification_report(y_test,rfc_pred)

In [None]:
print(cr)

**Show the Confusion Matrix for the predictions.**

A confusion matrix is a way to evaluate your results alongside the classification report.  Here is some more info:  <a href="https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/"> Data School Confusion Matrix Explanation</a>

In [None]:
cm = confusion_matrix(y_test,rfc_pred)
print(cm)
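
As a small worked example of reading such a matrix, the sketch below unpacks a 2x2 confusion matrix into its four cells and derives accuracy, precision, and recall from them. The labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, where 1 stands for "not fully paid"
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# scikit-learn puts actual classes in rows and predicted classes in columns,
# so ravel() yields true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the loans flagged as 1, how many truly were 1
recall = tp / (tp + fn)     # of the loans that truly were 1, how many were caught
print(tn, fp, fn, tp, accuracy, precision, recall)
```

Here accuracy is 0.7 while recall for class 1 is only 0.5, which shows why the classification report is worth reading next to the overall accuracy.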

### Running a loop with increasing number of trees in the random forest and checking accuracy of confusion matrix

**Criterion 'gini' or 'entropy'**

Your text describes how decision trees are constructed using entropy and information gain.  Entropy is a measure of randomness: we pick decision tree splits that leave less randomness (so that we are more certain of the outcome).  For a two-class outcome measured in bits, entropy is 0 if we are completely certain of the outcome and 1 if we are split 50-50 on it.

As we partition the data down the tree, we try to reduce the disorder (entropy); the reduction achieved by a split is its information gain.  We choose the splits where the information gain is the largest.  Here is a short explanation:  <a href="https://www.youtube.com/watch?v=7VeUPuFGJHk">Decision Trees, entropy and info gain</a>
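
The two criteria can be computed by hand on a tiny example. This is a sketch under stated assumptions: the helper names `entropy`, `gini`, and `information_gain` are hypothetical (they are not scikit-learn functions), entropy is measured in bits, and the candidate split is made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]         # a 50-50 node
left, right = [0, 0, 0, 1], [0, 1, 1, 1]  # a hypothetical candidate split
print(entropy(parent), gini(parent), information_gain(parent, left, right))
```

Both measures are largest for the 50-50 node and fall to 0 for a pure node, which is why the two criteria often lead to similar trees.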

In [None]:
# using gini index first
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=10,max_depth=None,criterion='gini')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
# plotting the results
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (criterion: 'gini')", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)

In [None]:
# using entropy
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=10,max_depth=None,criterion='entropy')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
# plotting the results
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (criterion: 'entropy')", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)

**Fixing max tree depth**

In [None]:
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=10,max_depth=None,criterion='gini')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (max depth: None)", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)

In [None]:
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=10,max_depth=5,criterion='gini')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (max depth: 5)", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)

**Minimum sample split criteria**

In [None]:
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=2,max_depth=None,criterion='gini')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (minimum sample split: 2)", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)

In [None]:
nsimu = 21
accuracy=[0]*nsimu
ntree = [0]*nsimu
for i in range(1,nsimu):
    rfc = RandomForestClassifier(n_estimators=i*5,min_samples_split=20,max_depth=None,criterion='gini')
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    cm = confusion_matrix(y_test,rfc_pred)
    accuracy[i] = (cm[0,0]+cm[1,1])/cm.sum()
    ntree[i]=i*5

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=ntree[1:nsimu],y=accuracy[1:nsimu],s=60,c='red')
plt.title("Number of trees in the Random Forest vs. prediction accuracy (minimum sample split: 20)", fontsize=18)
plt.xlabel("Number of trees", fontsize=15)
plt.ylabel("Prediction accuracy from confusion matrix", fontsize=15)
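
The loops above differ only in the hyperparameters passed to the forest, so they could be folded into one helper. The sketch below is one possible way to do that, not part of the original exercise: `accuracy_vs_trees` is a hypothetical helper name, synthetic data stands in for the loan data, and `random_state` is fixed only to make the sketch repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_vs_trees(X_tr, y_tr, X_te, y_te, tree_counts, **rf_params):
    """Fit one forest per tree count and return the test accuracies (hypothetical helper)."""
    scores = []
    for n in tree_counts:
        model = RandomForestClassifier(n_estimators=n, random_state=0, **rf_params)
        model.fit(X_tr, y_tr)
        # For two classes, accuracy_score equals (cm[0,0] + cm[1,1]) / cm.sum()
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return scores

# Synthetic data stands in for the loan data (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

tree_counts = list(range(5, 30, 5))  # 5, 10, ..., 25
results = {
    depth: accuracy_vs_trees(X_a, y_a, X_b, y_b, tree_counts,
                             max_depth=depth, min_samples_split=10)
    for depth in (None, 5)
}
```

With the notebook's own split, the same call would take `X_train, y_train, X_test, y_test` in place of the synthetic arrays, and each list in `results` could be passed to `plt.scatter` against `tree_counts`.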

QUESTIONS:
1.  Evaluate the decision tree classification report and confusion matrix in your own words.
2.  Did the Random Forest Model show improvement in the prediction accuracy based on the final chart?  
3.  Why is Random Forest better for this predictive model than the basic Decision Tree?

1.

2.

3.