# Capstone project: Loan Default
# Name: Matt McCarron

# Import data

We'll start by importing the loan default data and checking for null values.

In [1]:
import pandas as pd
import numpy as np

#Read in data
loan = pd.read_csv("Loan_default.csv")

#checking for any null values
pd.isnull(loan).sum()

#There are no null values

LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64

# Encode and Split data

Below we split the data into categorical and non-categorical columns, one-hot encode the categorical columns, and then recombine the two. We also remove `LoanID` from the dataset, as it is an identifier and will not be used in model training. Lastly, we separate the features from the response (`Default`) and create training and testing datasets.

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

#non-categorical variables
non_cat = ['LoanID', 'Age', 'Income', 'LoanAmount', 'CreditScore', 'MonthsEmployed', 
           'NumCreditLines', 'InterestRate', 'LoanTerm', 'DTIRatio', 'Default']

#categorical variables
cat = ['Education', 'EmploymentType', 'MaritalStatus', 'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']

#split between categorical data and numeric
non_cat_dat = loan[non_cat]
cat_dat = loan[cat]


#One hot encode data
encoder = OneHotEncoder()
encoder.fit(cat_dat)
cat_dat_encoded = encoder.transform(cat_dat)
cat_dat_encoded = pd.DataFrame(cat_dat_encoded.toarray(), columns=encoder.get_feature_names_out())

#recombine
loan_OHE = non_cat_dat.merge(cat_dat_encoded, left_index=True, right_index=True)

#drop LoanID
loan_OHE = loan_OHE.drop('LoanID', axis = 'columns')

#split between feature and response data
x_loan_OHE = loan_OHE.drop('Default', axis = 'columns')
y_loan_OHE = loan_OHE['Default']

#split data between a train and test set
X_train, X_test, y_train, y_test = train_test_split(x_loan_OHE, y_loan_OHE, test_size=0.2, random_state=42)

# Random Forest Classifier Grid Search

First, we train a random forest classifier. We run a series of grid searches, from coarse to fine, to select the model parameters, scoring each candidate with 3-fold cross validation.

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, f1_score

# Use F1 score as the scoring metric
scorer = make_scorer(f1_score)

# -----
# Coarse-Grained RandomForestClassifier GridSearch
# -----

param_grid = {"max_depth":[4,8,16,32], 
              "n_estimators":[5,10,20,50], 
              "min_samples_split":[2,8,14,20]
}

gs_model = RandomForestClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid, scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'max_depth': 32, 'min_samples_split': 2, 'n_estimators': 5}


In [12]:
# -----
# Refined RandomForestClassifier GridSearch
# -----

param_grid = {"max_depth":[16,24,32], 
              "n_estimators":[5,6,7,8,9,10,11,12,13,14], 
              "min_samples_split":[2,3,4]
}

gs_model = RandomForestClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid,scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'max_depth': 24, 'min_samples_split': 2, 'n_estimators': 5}


In [13]:
# -----
# Final RandomForestClassifier GridSearch
# -----

param_grid = {"max_depth":[20,24,28], 
              "n_estimators":[1,2,3,4,5,6,7,8,9], 
              "min_samples_split":[2,3,4]
}

gs_model = RandomForestClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid,scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'max_depth': 28, 'min_samples_split': 2, 'n_estimators': 1}


The optimal model parameters for the `RandomForestClassifier` class are:

- `max_depth = 28`
- `n_estimators = 1`
- `min_samples_split = 2`
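
For a sense of how much work each of the three searches does, the sketch below counts the candidate combinations and the resulting cross-validation fits. The grids are copied from the cells above; the names `grids`, `cv_folds`, and `fit_counts` are just for this illustration.

```python
from itertools import product

# Parameter grids copied from the three random forest searches above
grids = {
    "coarse": {"max_depth": [4, 8, 16, 32],
               "n_estimators": [5, 10, 20, 50],
               "min_samples_split": [2, 8, 14, 20]},
    "refined": {"max_depth": [16, 24, 32],
                "n_estimators": list(range(5, 15)),
                "min_samples_split": [2, 3, 4]},
    "final": {"max_depth": [20, 24, 28],
              "n_estimators": list(range(1, 10)),
              "min_samples_split": [2, 3, 4]},
}
cv_folds = 3

fit_counts = {}
for name, grid in grids.items():
    # Every combination of parameter values is one candidate model
    n_candidates = len(list(product(*grid.values())))
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fit_counts[name]} fits")
```

With the default `refit=True`, `GridSearchCV` also refits the winning candidate once on the full training set after each search.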

# Gradient Boosting Classifier GridSearch

Next, we train a gradient boosting classifier. As before, we run a series of grid searches, from coarse to fine, to select the model parameters, scoring each candidate with 3-fold cross validation.

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# -----
# Coarse-Grained GradientBoostingClassifier GridSearch
# -----

param_grid = {"max_depth":[4,8,16,32], 
              "n_estimators": [10,20,40,80], 
              "learning_rate":[.25, .5, .75]}

gs_model = GradientBoostingClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid, scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'learning_rate': 0.75, 'max_depth': 32, 'n_estimators': 20}


In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, f1_score

# Use F1 score as the scoring metric
scorer = make_scorer(f1_score)

# -----
# Refined GradientBoostingClassifier GridSearch
# -----

param_grid = {"max_depth":[24,28,32], 
              "n_estimators":[15,20,25,30], 
              "learning_rate":[.6,.75, .9]}

gs_model = GradientBoostingClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid,scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'learning_rate': 0.75, 'max_depth': 28, 'n_estimators': 20}


In [19]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, f1_score

# Use F1 score as the scoring metric
scorer = make_scorer(f1_score)

# -----
# Final GradientBoostingClassifier GridSearch
# -----

param_grid = {"max_depth":[26,28,30], 
              "n_estimators":[17,20,23], 
              "learning_rate":[.7,.75, .8]}

gs_model = GradientBoostingClassifier()

grid_search = GridSearchCV(estimator=gs_model, param_grid=param_grid,scoring=scorer, cv=3)
grid_search.fit(X_train, y_train)

print("The best parameters are: ", grid_search.best_params_)

The best parameters are:  {'learning_rate': 0.75, 'max_depth': 30, 'n_estimators': 20}


The optimal model parameters for the `GradientBoostingClassifier` class are:

- `max_depth = 30`
- `n_estimators = 20`
- `learning_rate = .75`
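
The searches above print only `best_params_`. The sketch below shows one way the cross-validated F1 scores themselves could be inspected. It runs on synthetic data rather than the loan data, and the dataset, the small grid, and the names `X_demo`, `y_demo`, and `demo_search` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the loan data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.85, 0.15], random_state=0)

demo_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "n_estimators": [10, 20]},
    scoring="f1",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of F1 across the 3 folds
demo_results = pd.DataFrame(demo_search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(demo_results.sort_values("rank_test_score"))
print("Best mean CV F1:", demo_search.best_score_)
```

Looking at `mean_test_score` next to `std_test_score` shows whether the winning candidate is clearly ahead or within fold-to-fold noise of its neighbours.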

# Random Forest Classifier Model

In [4]:
from sklearn.ensemble import RandomForestClassifier

#Train random forest classifier based on parameters above
rnd_clf = RandomForestClassifier(min_samples_split = 2, max_depth = 28, n_estimators = 1, random_state = 42)
rnd_clf.fit(X_train, y_train)

# Gradient Boosting Classifier Model

In [3]:
from sklearn.ensemble import GradientBoostingClassifier

#Train gradient boosting classifier based on parameters above
gbrt = GradientBoostingClassifier(learning_rate = .75, max_depth = 30, n_estimators = 20, random_state = 42)
gbrt.fit(X_train, y_train)

# Figure Out Optimal Model

In [6]:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, f1_score, roc_curve, auc, log_loss)

#Use testing data
y_pred_rnd = rnd_clf.predict(X_test)
y_pred_grd = gbrt.predict(X_test)

#Measurement scores for Random Forest
acc_score_rnd = accuracy_score(y_test, y_pred_rnd)
prec_score_rnd = precision_score(y_test, y_pred_rnd)
recall_score_rnd = recall_score(y_test, y_pred_rnd)
cm_rnd = confusion_matrix(y_test, y_pred_rnd)
f1_rnd = f1_score(y_test, y_pred_rnd)
# Calculate ROC curve (from hard 0/1 predictions, so the curve has a single interior point)
fpr_rnd, tpr_rnd, thresholds_rnd = roc_curve(y_test, y_pred_rnd)
# Calculate AUC
roc_auc_rnd = auc(fpr_rnd, tpr_rnd)
probabilities_rnd = rnd_clf.predict_proba(X_test)
logloss_rnd = log_loss(y_test, probabilities_rnd)

#Measurement scores for Gradient Booster
acc_score_grd = accuracy_score(y_test, y_pred_grd)
prec_score_grd = precision_score(y_test, y_pred_grd)
recall_score_grd = recall_score(y_test, y_pred_grd)
cm_grd = confusion_matrix(y_test, y_pred_grd)
f1_grd = f1_score(y_test, y_pred_grd)
# Calculate ROC curve (again from hard 0/1 predictions)
fpr_grd, tpr_grd, thresholds_grd = roc_curve(y_test, y_pred_grd)
# Calculate AUC
roc_auc_grd = auc(fpr_grd, tpr_grd)
probabilities_grd = gbrt.predict_proba(X_test)
logloss_grd = log_loss(y_test, probabilities_grd)

#Exhibit to compare both models
exhibit = [['Random Forest', 'Accuracy', acc_score_rnd],
           ['', 'Precision', prec_score_rnd],
           ['', 'Recall', recall_score_rnd],
           ['', 'F1 score', f1_rnd],
           ['', 'AUC curve', roc_auc_rnd],
           ['', 'Log Loss', logloss_rnd],
           ['Gradient Boosting', 'Accuracy', acc_score_grd],
           ['', 'Precision', prec_score_grd],
           ['', 'Recall', recall_score_grd],
           ['', 'F1 score', f1_grd],
           ['', 'AUC curve', roc_auc_grd],
           ['', 'Log Loss', logloss_grd]]

exhibit = pd.DataFrame(exhibit, columns = ['Models', 'Item', 'Score'])

exhibit.style.hide(axis='index')

| Models | Item | Score |
|---|---|---|
| Random Forest | Accuracy | 0.807167 |
| | Precision | 0.196494 |
| | Recall | 0.21661 |
| | F1 score | 0.206063 |
| | AUC curve | 0.550457 |
| | Log Loss | 6.94786 |
| Gradient Boosting | Accuracy | 0.819013 |
| | Precision | 0.221091 |
| | Recall | 0.224576 |
| | F1 score | 0.22282 |


In the above exhibit, we compare our random forest and gradient boosting models across several metrics. Although we tuned both models on F1 score, comparing them on several metrics gives a more complete picture of how each one performs.
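
One detail worth keeping in mind when reading the AUC row: the evaluation cell passes hard 0/1 predictions to `roc_curve`, so the curve has a single interior point and its area reduces to the average of the true positive rate and the true negative rate. The sketch below reproduces the random forest figure from the confusion matrix counts printed later in this notebook, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout. The function name is hypothetical.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """Trapezoidal area under the ROC points (0,0), (FPR,TPR), (1,1)."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = 0.0
    # Sum the trapezoids between consecutive ROC points
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Counts taken from the random forest confusion matrix
rf_auc = auc_from_hard_labels(tn=39944, fp=5226, fn=4622, tp=1278)
print(round(rf_auc, 6))
```

Passing the positive-class column of `predict_proba(X_test)` to `roc_curve` instead would trace the full curve across thresholds, which is usually the more informative way to compute AUC.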

We chose F1 score as our scoring metric because it takes both precision and recall into account. The default data is imbalanced: around 88% of borrowers do not default and around 12% do. Because of this, accuracy alone is not a reliable metric. We also have no strong reason to favor precision over recall, or the reverse, when predicting loan defaults. We therefore scored our grid searches on F1, while still reviewing all of the metrics above.
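
To illustrate why accuracy is misleading here, the sketch below scores a hypothetical "model" that predicts no default for every applicant. The class counts are the test-set row totals from the confusion matrices in this notebook.

```python
# Test-set class counts, read off the confusion matrices in this notebook
n_no_default = 45170
n_default = 5900

# A "model" that predicts no default for every applicant
tp, fp = 0, 0
fn, tn = n_default, n_no_default

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is 0/0 here; like scikit-learn's default, treat it as 0
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) > 0 else 0.0)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

This do-nothing baseline reaches about 88% accuracy, higher than either trained model in the exhibit, while its F1 score is 0 because it never identifies a single default.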

Based on the metrics above, we will move forward with the gradient boosting model. The two models perform similarly, but gradient boosting edges out the random forest on every metric reported above.

In [6]:
#Confusion matrix
print("RF confusion matrix")
print(cm_rnd)
print("GB confusion matrix")
print(cm_grd)

RF confusion matrix
[[39944  5226]
 [ 4622  1278]]
GB confusion matrix
[[40502  4668]
 [ 4575  1325]]
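
The headline metrics in the exhibit can be recovered directly from these two matrices. A minimal sketch, assuming scikit-learn's `[[TN, FP], [FN, TP]]` layout; `metrics_from_cm` is a hypothetical helper, not part of the notebook.

```python
def metrics_from_cm(cm):
    """Accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the printed confusion matrices above
rf_metrics = metrics_from_cm([[39944, 5226], [4622, 1278]])
gb_metrics = metrics_from_cm([[40502, 4668], [4575, 1325]])

for name, m in [("RF", rf_metrics), ("GB", gb_metrics)]:
    print(name, {k: round(v, 6) for k, v in m.items()})
```

Read this way, the gradient boosting model's edge comes from 558 fewer false positives and 47 more true positives on the same 51,070 test loans.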


# Save model for deployment 

In [7]:
import pickle

#Use pickle to save the trained model for use in the Flask app
with open("GradBoost_model.pkl", "wb") as f:
    pickle.dump(gbrt, f)

# Breakeven Financial Model

I will use the gradient boosting model to estimate the probability that a borrower does not default, then use the formula below to calculate the expected profit from the loan. Expected profit is the interest earned if the borrower does not default, multiplied by the probability of not defaulting, minus the loan amount lost if the borrower does default, multiplied by the probability of default (1 - Probability). The formula treats LoanTerm as a number of months and InterestRate as an annual percentage.

Profit = Probability x LoanAmount x LoanTerm / 12 x InterestRate / 100 - (1-Probability) x LoanAmount

The function that calculates this and is utilized in the model is written below.

In [8]:
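def loan(probability, LoanAmount, LoanTerm, InterestRate):
    """Return 'Approved' if the expected profit is positive, else 'Denied'.

    probability is the predicted probability of NOT defaulting.
    """
    # Note: this name shadows the `loan` DataFrame loaded in the first cell
    # Expected interest earned minus expected principal lost
    profit = probability*LoanAmount*LoanTerm/12*InterestRate/100 - (1-probability)*LoanAmount
    if profit > 0:
        status = 'Approved'
    else:
        status = 'Denied'
    return status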
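
Setting the profit formula to zero gives the repayment probability at which a loan exactly breaks even. A minimal sketch under the same formula; the helper names, the snake_case parameter names, and the example loan are hypothetical and not part of the deployed model.

```python
def expected_profit(probability, loan_amount, loan_term_months, interest_rate_pct):
    """Expected profit under the formula above (probability = P(no default))."""
    interest = loan_amount * loan_term_months / 12 * interest_rate_pct / 100
    return probability * interest - (1 - probability) * loan_amount

def breakeven_probability(loan_term_months, interest_rate_pct):
    """Probability of repayment at which expected profit is exactly zero."""
    # r is the total interest earned per unit lent over the whole term
    r = loan_term_months / 12 * interest_rate_pct / 100
    # Solve p * r - (1 - p) = 0 for p
    return 1 / (1 + r)

# Hypothetical loan: 10,000 over 36 months at 10% a year
p_star = breakeven_probability(36, 10)
print(round(p_star, 4))
print(expected_profit(0.9, 10_000, 36, 10))
print(expected_profit(0.7, 10_000, 36, 10))
```

Because `LoanAmount` cancels out of the breakeven condition, the approval threshold depends only on the term and the rate. For this hypothetical loan it is about 0.77, so a repayment probability of 0.9 would be approved and 0.7 denied.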