### Decision Tree & Random Forest Modeling
- ### Background:
The goal of these modeling tasks is to identify (i.e., predict) customers whose profile suggests a high probability of paying back their loans. We are given a bank's lending data from 2007 to 2010 for a classification task: predicting whether or not a borrower paid back their loan in full.

- ### Here is what the columns represent:

- credit.policy: 1 if a customer qualifies for a loan; and 0 otherwise.
- purpose: loan purpose.
- int.rate: the interest rate of the loan.
- installment: monthly installment amount to be paid by the borrower.
- log.annual.inc: the natural log of the reported annual income on file for the borrower.
- dti: debt-to-income ratio of the borrower (borrower's total debt divided by annual income).
- fico: the FICO credit score of the borrower at the time of loan application.
- days.with.cr.line: the number of days the borrower has had a credit line.
- revol.bal: borrower's revolving balance.
- revol.util: borrower's revolving line utilization rate (total credit line used relative to total credit available).
- inq.last.6mths: number of inquiries by creditors in the last 6 months.
- delinq.2yrs: number of times the borrower had failed to make monthly payments by the due dates in the past 2 years.
- pub.rec: borrower's number of derogatory public records such as bankruptcy filings, etc.
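
Because `log.annual.inc` is stored as a natural log, it can help to see how to recover the income on its original scale. The following is a minimal sketch on a made-up two-row sample; the values and the `annual_inc` column name are illustrative assumptions, not part of the loans dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the incomes below are invented for illustration only.
sample = pd.DataFrame({
    "log.annual.inc": [np.log(50_000), np.log(85_000)],
})

# np.exp is the inverse of the natural log, so it undoes the transformation.
# "annual_inc" is a hypothetical column name, not a column in loan_data.csv.
sample["annual_inc"] = np.exp(sample["log.annual.inc"])
print(sample)
```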

#### Importing the appropriate libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# read in the loans data file, and call it "loans"
loans = pd.read_csv('loan_data.csv')

#### Before we do any EDA, let's check out the info(), head(), and describe() methods on loans

In [None]:
loans.info()

In [None]:
loans.describe().transpose()

In [None]:
# view some records
loans.head(20)

#### Exploratory Data Analysis
- Let's do some data visualization using seaborn and pandas' built-in plotting capabilities. Feel free to try any other libraries you want; what really matters is the main idea behind each plot.

##### Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome:

In [None]:
plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='Credit.Policy=1')
loans[loans['credit.policy']==0]['fico'].hist(alpha=0.5,color='green',
                                              bins=30,label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')

#### Repeat the previous figure, based on the "not.fully.paid" column instead of "credit.policy"

In [None]:
plt.figure(figsize=(10,6))
loans[loans['not.fully.paid']==1]['fico'].hist(alpha=0.5,color='red',
                                              bins=30,label='not.fully.paid=1')
loans[loans['not.fully.paid']==0]['fico'].hist(alpha=0.5,color='green',
                                              bins=30,label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

#### Let's see a countplot showing loans by purpose, with the color hue defined by "not.fully.paid"

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x='purpose', hue='not.fully.paid', data=loans,palette='Set1')
plt.show()

#### Let's visualize the trend between FICO score and interest rate, using jointplot from seaborn

In [None]:
sns.jointplot(x='fico',y='int.rate',data=loans,color='purple')
plt.show()

#### We will use seaborn's lmplot to see if the trend differs between "not.fully.paid" and "credit.policy"

In [None]:
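# lmplot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only produce an extra empty figure.
sns.lmplot(y='int.rate', x='fico', data=loans, hue='credit.policy', col='not.fully.paid', palette='Set1')
plt.show()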

#### Setting up the data for modeling :
- Let's prepare our dataset for both Decision Tree and Random Forest Classification Models!

In [None]:
# A quick check on our data:
loans.info()

#### Notice that the "purpose" column is categorical, so we have to transform it into dummy variables for use in sklearn. We will do this with pandas' get_dummies() function, passing a list of column names so that the same step can handle any other categorical features in our loans dataset.

In [None]:
# First, we create a list of 1 element containing the string 'purpose', and name it cat_feats:
cat_feats = ['purpose']

In [None]:
# Next, we use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a larger dataframe 
# that has new feature columns with all the dummy variables we need. We will call this dataframe as final_data:
final_data = pd.get_dummies(loans, columns=cat_feats,drop_first=True)
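
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy dataframe. The rows and the purpose labels are made up for illustration; only the `pd.get_dummies` call mirrors the one used above.

```python
import pandas as pd

# Toy data: values and purpose labels are illustrative, not taken from loan_data.csv.
toy = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation", "educational", "credit_card"],
    "fico": [700, 680, 720, 690],
})

# One dummy column is created per category, then the first category
# (in sorted order) is dropped to avoid a redundant column.
toy_encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)
print(toy_encoded.columns.tolist())
# → ['fico', 'purpose_debt_consolidation', 'purpose_educational']
```

A row with zeros (or False) in both dummy columns therefore corresponds to the dropped category, `credit_card` in this toy example.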

In [None]:
# Let's see what final_data looks like:
final_data.info()

#### Train Test Split
- Now we go ahead and split our data into training and testing datasets!

In [None]:
# As before, we use sklearn to split our data into training and testing datasets:
from sklearn.model_selection import train_test_split

In [None]:
X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
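
If the two classes of the target turn out to be imbalanced, one option (not used in the split above) is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both subsets. A minimal sketch on synthetic labels, where the 80/20 class ratio and the `_toy` names are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones (an assumed ratio).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```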

#### Training a Decision Tree Model:
- We start by training a single decision tree.

In [None]:
# Import DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Next, we create an instance of DecisionTreeClassifier() called dtree and then train it!
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

#### Predictions and Evaluation of our Decision Tree Model
- Create predictions from the test set and create a classification report and a confusion matrix.

In [None]:
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
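
As a reminder of how to read scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes. The sketch below uses five made-up labels (not the loans data) so that each cell can be checked by hand.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1]
y_hat = [0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)
# → [[2 1]
#    [1 1]]

# For a binary problem, ravel() returns the cells in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # → 2 1 1 1
```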

#### Next, We Train Our Random Forest Model
- We will create an instance of the RandomForestClassifier class and fit it to our training dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators = 600)
rfc.fit(X_train, y_train)

In [None]:
# Let's make some predictions with our random forest!
rfc_pred = rfc.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rfc_pred))

In [None]:
print(classification_report(y_test,rfc_pred))

#### Which model performed better? The Random Forest or The Decision Tree?

In [None]:
# It depends on which metric you are trying to optimize.
# Look at the recall for each class in both models: neither did very well, so more feature engineering might be needed.
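
To compare the two models on a single metric, the per-class recall can be pulled out directly instead of being read off the report. The sketch below uses made-up labels to show the calls; in the notebook, `y_test` with `predictions` or `rfc_pred` would take the place of the toy lists.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 0, 1]

# average=None returns one recall value per class, in sorted label order.
recall_per_class = recall_score(y_true, y_hat, average=None)
print("recall for class 0:", recall_per_class[0])  # 3 of 4 true zeros found → 0.75
print("recall for class 1:", recall_per_class[1])  # 1 of 2 true ones found → 0.5
print("accuracy:", accuracy_score(y_true, y_hat))
```

Accuracy alone can look acceptable while the recall for the minority class stays low, which is why the choice of metric matters here.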