# Vehicle Loan Prediction Machine Learning Model

# Random Forest

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, recall_score, roc_curve, auc, precision_score

In [2]:
loan_df = pd.read_csv('../data/vehicle_loans_feat.csv', index_col='UNIQUEID')

Just as we did for Logistic Regression, let's convert our categorical variables to the 'category' data type.

In [3]:
category_cols = ['MANUFACTURER_ID', 'STATE_ID', 'DISBURSAL_MONTH', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE_DESCRIPTION', 'EMPLOYMENT_TYPE']
loan_df[category_cols] = loan_df[category_cols].astype('category')

Bring in the plot_roc_curve and eval_model functions.

In [4]:
def plot_roc_curve(fpr, tpr, roc_auc):
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

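In [15]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

def eval_model(model, x_test, y_test):
    preds = model.predict(x_test)
    probs = model.predict_proba(x_test)

    conf_mat = confusion_matrix(y_test, preds)
    accuracy = accuracy_score(y_test, preds)
    recall = recall_score(y_test, preds)
    precision = precision_score(y_test, preds)
    f1 = f1_score(y_test, preds)

    # Plot the confusion matrix computed above
    ConfusionMatrixDisplay(conf_mat).plot()
    plt.show()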

    #print(conf_mat)
    print("\n")
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F1: ", f1)

    #Show ROC Curve 
    fpr, tpr, threshold = roc_curve(y_test, probs[:,1], pos_label=1)
    roc_auc = auc(fpr, tpr)
    print("AUC: ", roc_auc)

    plot_roc_curve(fpr, tpr, roc_auc)

    results_df = pd.DataFrame()
    results_df['true_class'] = y_test
    results_df['predicted_class'] = list(preds)
    results_df['default_prob'] = probs[:, 1]

    #plot the distribution of probabilities for the estimated classes 
    # sns.distplot is deprecated; kdeplot draws the same density curves
    sns.kdeplot(results_df[results_df['true_class'] == 0]['default_prob'], label="No Default")
    sns.kdeplot(results_df[results_df['true_class'] == 1]['default_prob'], label="Default")
    plt.title('Distribution of Probabilities for Estimated Classes')
    plt.legend(loc='best')
    plt.show()
    
    #see the true class versus predicted class as a percentage
    print(results_df.groupby('true_class')['predicted_class'].value_counts(normalize=True))


## Building The Forest


In [6]:
def encode_and_split(loan_df):
    loan_data_dumm = pd.get_dummies(loan_df, prefix_sep='_', drop_first=True)

    x = loan_data_dumm.drop(['LOAN_DEFAULT'], axis=1)
    y = loan_data_dumm['LOAN_DEFAULT']

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    return x_train, x_test, y_train, y_test
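To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` with `drop_first=True` does to a 'category' column. The toy frame, its values and the `LOAN_AMOUNT` column are made up for illustration; only `EMPLOYMENT_TYPE` and `LOAN_DEFAULT` are names taken from our data.

```python
import pandas as pd

# Toy data: values and the LOAN_AMOUNT column are hypothetical
toy_df = pd.DataFrame({
    "LOAN_DEFAULT": [0, 1, 0, 1],
    "LOAN_AMOUNT": [50000, 62000, 48000, 71000],
    "EMPLOYMENT_TYPE": ["Salaried", "Self employed", "Missing", "Salaried"],
})
toy_df["EMPLOYMENT_TYPE"] = toy_df["EMPLOYMENT_TYPE"].astype("category")

# Numeric columns pass through untouched; the category column becomes
# one indicator column per level, minus the first level ("Missing")
toy_encoded = pd.get_dummies(toy_df, prefix_sep="_", drop_first=True)
print(toy_encoded.columns.tolist())
```

Dropping the first level avoids a redundant column: a row with zeros in both remaining indicator columns must belong to the dropped level.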

Now let's test our new function and create a training and test set for the random forest, this time using the full set of features.

In [7]:
x_train, x_test, y_train, y_test = encode_and_split(loan_df)

In [8]:
print("Training Features Shape", x_train.shape)
print("Training Label Rows", y_train.count())

Training Features Shape (186523, 92)
Training Label Rows 186523


In [9]:
print("Testing Features Shape", x_test.shape)
print("Testing Label Rows", y_test.count())

Testing Features Shape (46631, 92)
Testing Label Rows 46631


In [10]:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

0    0.783099
1    0.216901
Name: LOAN_DEFAULT, dtype: float64
0    0.782248
1    0.217752
Name: LOAN_DEFAULT, dtype: float64


In [23]:
# Fit a random forest with default hyperparameters and evaluate it on the test set
rfc_model = RandomForestClassifier()
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)


Let's take a minute to interpret these results 

### Accuracy 

- ~78%, similar to the simple logistic regression model we built already
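An accuracy of ~78% is close to the share of non-defaults in the test set (about 0.78 in the class balance printed earlier), so it is worth comparing against a model that always predicts "no default". The sketch below does this on synthetic labels drawn with roughly the same class balance; the labels and the placeholder feature matrix are assumptions for illustration, not our loan data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly 22% positives (assumption, mimics our class balance)
rng = np.random.default_rng(42)
toy_labels = (rng.random(10_000) < 0.217).astype(int)
toy_features = np.zeros((len(toy_labels), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class (no default)
baseline = DummyClassifier(strategy="most_frequent").fit(toy_features, toy_labels)
baseline_preds = baseline.predict(toy_features)

baseline_accuracy = accuracy_score(toy_labels, baseline_preds)
baseline_recall = recall_score(toy_labels, baseline_preds)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:", baseline_recall)
```

The baseline reaches similar accuracy without finding a single default, which is why accuracy alone tells us little on imbalanced data like this.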

### Precision 

- 39%, better than the simple logistic regression model, which had ~33%
- More of the instances we classified as defaults actually were defaults
- However, most of the instances we classify as defaults are still not defaults

### Recall 

- Recall has increased dramatically, from 0.03% to 4.5%!
- Random Forest picked up a lot more of the actual positive cases
- It still missed most of them

### F1

- The F1 score has also increased dramatically from 0.0006 to ~0.08! 
- There is a better balance between Precision and Recall for Random Forest
- Although this is still generally poor
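As a quick check, F1 is the harmonic mean of precision and recall, so the reported F1 can be reproduced from the rounded figures above. The helper name `f1_from` is just for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values reported above for the random forest
rf_f1 = f1_from(precision=0.39, recall=0.045)
print(round(rf_f1, 3))  # 0.081
```

Because the harmonic mean is dominated by the smaller of the two values, the very low recall keeps F1 low even though precision improved.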

### AUC 

- The area under the ROC curve has increased very slightly
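One way to read AUC: it is the probability that a randomly chosen default receives a higher score than a randomly chosen non-default. The sketch below checks this on four made-up scores, using the same `roc_curve` and `auc` calls as `eval_model`.

```python
from sklearn.metrics import roc_curve, auc

# Toy labels and predicted default probabilities (made up for illustration)
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(toy_true, toy_scores, pos_label=1)
toy_auc = auc(fpr, tpr)

# Same quantity computed directly: share of (default, non-default)
# pairs where the default gets the higher score
pos_scores = [s for s, t in zip(toy_scores, toy_true) if t == 1]
neg_scores = [s for s, t in zip(toy_scores, toy_true) if t == 0]
pairwise = sum(p > n for p in pos_scores for n in neg_scores) / (len(pos_scores) * len(neg_scores))

print(toy_auc, pairwise)  # 0.75 0.75
```

An AUC of 0.5 corresponds to random ordering, which is why values only slightly above 0.5 indicate weak class separation.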

### Probability Distributions 

- The plot shows poor class separation
- The majority of cases are unlikely to be classified as defaults

Generally, the random forest is better than Logistic Regression, but it is still not doing a good job.

## Overfitting

In [None]:
eval_model(rfc_model, x_train, y_train)

The model scores far better on the training data than on the test data, which tells us our random forest is overfitting.
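A compact way to check for overfitting is to compare one metric on both splits. The sketch below does this with AUC on synthetic data from `make_classification`; the dataset, its parameters and the variable names are assumptions for illustration, not our loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced, noisy data (assumption: stands in for the loan data)
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   weights=[0.78, 0.22], flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Default forest: trees are grown until their leaves are pure
toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(X_tr, y_tr)

train_auc = roc_auc_score(y_tr, toy_model.predict_proba(X_tr)[:, 1])
test_auc = roc_auc_score(y_te, toy_model.predict_proba(X_te)[:, 1])
print("Train AUC:", train_auc)
print("Test AUC: ", test_auc)
```

A large gap between the two numbers suggests the trees have memorised the training rows rather than learned patterns that generalise.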

## Hyperparameters 

### Number of Trees 

Let's manually explore the forest size parameter, n_estimators. Remember that the default value is 100.

In [None]:
rfc_model = RandomForestClassifier(n_estimators=1)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

- With a forest size of 1, the random forest behaves as a standalone decision tree and is unable to distinguish between the two classes
- With an AUC of 0.52, it is only marginally better than a random classifier

Let's see what happens if we increase the number of trees to 10

In [None]:
rfc_model = RandomForestClassifier(n_estimators=10)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

- With a forest size of 10, the model's ability to separate the classes improves, giving an AUC of 0.58
- Multiple peaks on the distribution chart suggest that this is not a very stable model

In [None]:
rfc_model = RandomForestClassifier(n_estimators=100)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

- With 100 estimators the AUC improves from 0.58 to 0.62
- The class distributions look more defined and settled

What about if we increase to 300?

In [None]:
rfc_model = RandomForestClassifier(n_estimators=300)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

Very similar performance to the default value of 100! 

Increasing the size of the forest helps classification performance up to a point

However, it also increases the computational cost of training the model
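The manual exploration above can also be written as a loop. This is a minimal sketch on synthetic data from `make_classification` (an assumption, so the numbers will not match our loan results); it records the test AUC for each forest size.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption)
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   weights=[0.78, 0.22], flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Test AUC for each forest size
aucs_by_size = {}
for n_trees in [1, 10, 100]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    forest.fit(X_tr, y_tr)
    aucs_by_size[n_trees] = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])

for n_trees, score in aucs_by_size.items():
    print(n_trees, "trees -> test AUC", round(score, 3))
```

Fixing `random_state` keeps the comparison repeatable, so differences between runs reflect the forest size rather than random seeds.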

### Maximum Depth

In [None]:
rfc_model = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

We have increased the AUC but the model is failing to identify any loan defaults

Let's take a look at how it performs on the training data

In [None]:
eval_model(rfc_model, x_train, y_train)

As with the test data, the model is not identifying any defaults.

Very similar performance between training and test data tells us we are not overfitting anymore, but the model has very little predictive power

Limiting the tree depth to 5 has probably oversimplified the model and actually given us an underfit model!

Let's try again with a larger max_depth

In [None]:
rfc_model = RandomForestClassifier(n_estimators=100, max_depth=15)
rfc_model.fit(x_train, y_train)

eval_model(rfc_model, x_test, y_test)

A few things to note here! 

We have increased the AUC to ~0.65. This model has the best ability to separate classes that we have seen so far!

It also has a very good precision score of 67%, but we are still identifying very few loan defaults, hence the poor recall.

Let's have a look at the training set performance!

In [None]:
eval_model(rfc_model, x_train, y_train)

Our model does perform better on the training data, so it could be a little overfitted. However, the gap is much less dramatic than before!

We have now limited the complexity of the trees in our forest which has reduced overfitting. 
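To see how depth affects the gap between training and test performance, the comparison can be run for several `max_depth` values at once. This is a sketch on synthetic data from `make_classification` (an assumption, so the exact values differ from our loan results); `None` means the trees are grown without a depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the loan data (assumption)
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   weights=[0.78, 0.22], flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# (train AUC, test AUC) for each depth limit
auc_by_depth = {}
for depth in [2, 5, None]:
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    train_score = roc_auc_score(y_tr, forest.predict_proba(X_tr)[:, 1])
    test_score = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
    auc_by_depth[depth] = (train_score, test_score)

for depth, (train_score, test_score) in auc_by_depth.items():
    print("max_depth =", depth, "| train AUC", round(train_score, 3),
          "| test AUC", round(test_score, 3), "| gap", round(train_score - test_score, 3))
```

A good depth is usually one where the test score is high and the train-test gap stays small; very shallow trees shrink the gap by underfitting, while unlimited depth widens it.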