<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-packages-and-data" data-toc-modified-id="Loading-packages-and-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading packages and data</a></span></li><li><span><a href="#Preparing-data-for-modelling" data-toc-modified-id="Preparing-data-for-modelling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparing data for modelling</a></span><ul class="toc-item"><li><span><a href="#Handle-null-values" data-toc-modified-id="Handle-null-values-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Handle null values</a></span></li><li><span><a href="#One-hot-encode-categorical-variables" data-toc-modified-id="One-hot-encode-categorical-variables-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>One-hot encode categorical variables</a></span></li></ul></li><li><span><a href="#Train-Random-Forest-model,-with-timing-for-comparision-with-R" data-toc-modified-id="Train-Random-Forest-model,-with-timing-for-comparision-with-R-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train Random Forest model, with timing for comparision with R</a></span><ul class="toc-item"><li><span><a href="#Split-the-dataset" data-toc-modified-id="Split-the-dataset-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Split the dataset</a></span></li><li><span><a href="#Fit-the-model" data-toc-modified-id="Fit-the-model-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Fit the model</a></span></li></ul></li><li><span><a href="#Get-feature-importance-from-model" data-toc-modified-id="Get-feature-importance-from-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Get feature importance from model</a></span></li></ul></div>

# Loading packages and data

In [126]:
import pandas as pd
import time
from sklearn.ensemble import RandomForestClassifier

In [127]:
# Reading data from csv file
data = pd.read_csv("credit.csv")

# Preparing data for modelling

Use `info()` to take a glimpse at the variables and their data types.

In [128]:
data.info()
#Python will classify non-numeric columns as objects, limiting

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100514 entries, 0 to 100513
Data columns (total 19 columns):
Loan ID                         100000 non-null object
Customer ID                     100000 non-null object
Loan Status                     100000 non-null object
Current Loan Amount             100000 non-null float64
Term                            100000 non-null object
Credit Score                    80846 non-null float64
Annual Income                   80846 non-null float64
Years in current job            95778 non-null object
Home Ownership                  100000 non-null object
Purpose                         100000 non-null object
Monthly Debt                    100000 non-null float64
Years of Credit History         100000 non-null float64
Months since last delinquent    46859 non-null float64
Number of Open Accounts         100000 non-null float64
Number of Credit Problems       100000 non-null float64
Current Credit Balance          100000 non-null float64
Maxi

## Handle null values

Several columns have fewer non-null entries than the rest, so there are null values to handle. For this demonstration, NaNs in months since last delinquent, credit score, annual income, and years in current job are replaced with zeros. Rows that still contain a NaN in any other column are dropped.

In [129]:
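# Replace NaN values with 0 in the selected columns.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame.
data['Credit Score'] = data['Credit Score'].fillna(0)
data['Annual Income'] = data['Annual Income'].fillna(0)
data['Months since last delinquent'] = data['Months since last delinquent'].fillna(0)
data['Years in current job'] = data['Years in current job'].fillna(0)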

In [130]:
# Drop every row that still contains a NaN in any remaining column
data.dropna(how='any', inplace=True)

Check how many rows are left. Seems good to go.

In [131]:
data.shape

(99794, 19)
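
The fill-then-drop steps above can be sketched on a small frame to make the effect visible. This is a minimal illustration, not the notebook's data: the `toy` frame and its values are made up, and only a subset of the real column names is used.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are made up).
toy = pd.DataFrame({
    "Credit Score": [709.0, np.nan, 741.0, np.nan],
    "Annual Income": [1167493.0, np.nan, 2231892.0, np.nan],
    "Maximum Open Credit": [416746.0, 850784.0, np.nan, np.nan],
})

# Count missing values per column before deciding how to treat them.
null_counts = toy.isna().sum()
print(null_counts)

# Fill the chosen columns with 0 by assigning the result back.
fill_zero_cols = ["Credit Score", "Annual Income"]
toy[fill_zero_cols] = toy[fill_zero_cols].fillna(0)

# Drop any row that still has a missing value in another column.
cleaned = toy.dropna(how="any")
print(cleaned.shape)
```

Rows whose only gaps were in the zero-filled columns survive, while rows missing `Maximum Open Credit` are removed.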

## One-hot encode categorical variables

Before fitting the Random Forest, we drop the identifier columns, which carry no predictive information, and one-hot encode the categorical fields.

In [132]:
# Exclude IDs, for they are not important to the model
data_new = data.drop(['Customer ID','Loan ID'], axis=1)

In [133]:
# Use get_dummies to code categorical variables into 0s and 1s.
data_new = pd.get_dummies(data_new)

In [134]:
data_new.head(1)

| | Current Loan Amount | Credit Score | Annual Income | Monthly Debt | Years of Credit History | Months since last delinquent | Number of Open Accounts | Number of Credit Problems | Current Credit Balance | Maximum Open Credit | ... | Purpose_Medical Bills | Purpose_Other | Purpose_Take a Trip | Purpose_major_purchase | Purpose_moving | Purpose_other | Purpose_renewable_energy | Purpose_small_business | Purpose_vacation | Purpose_wedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 445412.0 | 709.0 | 1167493.0 | 5214.74 | 17.2 | 0.0 | 6.0 | 1.0 | 228190.0 | 416746.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [135]:
# Drop "Loan Status_Charged Off": it is the exact complement of the target
# column "Loan Status_Fully Paid", so keeping it as a feature would leak the label
data_new = data_new.drop('Loan Status_Charged Off', axis=1)
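
To see why the complementary status column has to go, here is a small sketch of what `get_dummies` does to a two-valued column. The `toy` frame and its values are invented for illustration. `dtype=int` is passed only so the output shows 0s and 1s as in this notebook, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with one two-valued target column and one other categorical column.
toy = pd.DataFrame({
    "Loan Status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Monthly Debt": [5214.74, 33295.98, 29200.53],
})

# One-hot encode every non-numeric column.
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))

# The two status columns always sum to 1, so each fully determines the other.
status_sum = encoded["Loan Status_Fully Paid"] + encoded["Loan Status_Charged Off"]
complement_check = bool((status_sum == 1).all())
print(complement_check)

# Keep one status column as the label and remove both from the features.
labels = encoded["Loan Status_Fully Paid"]
features = encoded.drop(columns=["Loan Status_Fully Paid", "Loan Status_Charged Off"])
```

If `Loan Status_Charged Off` stayed among the features, the model could read the answer directly from it.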

# Train Random Forest model, with timing for comparision with R

In [136]:
# Use numpy to convert data to arrays -> this speeds things up
import numpy as np

## Split the dataset
We choose Loan Status as our target (labels). The model will predict whether the loan is fully paid or not.

In [137]:
# Labels are the values we want to predict (Labels are the dependent var)
# We will predict Loan Status with this model
labels = np.array(data_new['Loan Status_Fully Paid'])

# Remove the labels from the set to create features
features = data_new.drop('Loan Status_Fully Paid', axis = 1)

# Saving feature names for later use (for visualization perhaps)
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)

In [138]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# The split function produces 4 arrays, in the below order. -> Create 4 variables to store them.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.25, random_state = 42)

In [139]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (74845, 46)
Training Labels Shape: (74845,)
Testing Features Shape: (24949, 46)
Testing Labels Shape: (24949,)
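
The split above is purely random. When the classes are imbalanced, one option is to pass `stratify` so both parts keep the same class ratio. The sketch below uses synthetic data and an assumed 75/25 class balance, and it is a variation on the notebook's split, not what was run here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (assumed 75% positive class, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = np.array([1] * 300 + [0] * 100)

# Same 75/25 split as in the notebook, with stratification on the labels added.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Positive share in train:", y_train.mean())
print("Positive share in test:", y_test.mean())
```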


## Fit the model

In [140]:
# Instantiate model with 200 decision trees
rf = RandomForestClassifier(n_estimators = 200, random_state = 42)

In [141]:
t = time.time()

# Train the model on training data
rf.fit(train_features, train_labels)

t_now = time.time()

In [142]:
elapsed = t_now - t
print("It took:")
print(round(elapsed,5),"seconds to train the Random Forest")

It took:
85.12242 seconds to train the Random Forest
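
For repeated timing comparisons, the start/stop pattern above can be wrapped in a small helper. This is a sketch with a hypothetical `timed` function (not part of any library). It uses `time.perf_counter()`, which is intended for measuring short durations, in place of `time.time()`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this would be the model fit.
total, elapsed = timed(sum, range(1_000_000))
print(round(elapsed, 5), "seconds")
```

In this notebook the call would look like `_, elapsed = timed(rf.fit, train_features, train_labels)`.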


In [143]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

The mean absolute error (MAE) is:

In [144]:
print('Mean Absolute Error:', round(np.mean(errors), 5))

Mean Absolute Error: 1.56668


In [145]:
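# Express each absolute error as a percentage of the sum of all test labels.
# Note: this is not a true MAPE, which divides each error by its own actual value.
mape = 100 * (errors / sum(test_labels))

# Subtract the mean of those percentages from 100 and display it as "accuracy".
# Because the divisor is so large, the result sits near 100 whatever the predictions are.
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 5), '%.')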

Accuracy: 99.99186 %.


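This is too good to be true, and the cause lies in the metric before anything else. Each absolute error is divided by the sum of all test labels, so the percentages are tiny and `100 - np.mean(mape)` lands near 100 for almost any set of predictions. A mean absolute error above 1 is also impossible for 0/1 labels. It most likely comes from unsigned-integer wraparound when subtracting the `uint8` dummy columns, where 0 - 1 becomes 255. The share of correct predictions, for example from `rf.score(test_features, test_labels)`, is the figure to check first. Only then is it worth checking whether the model has overfitted.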
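
The sketch below shows, on made-up labels, how subtracting unsigned 8-bit arrays can inflate an absolute error and how the share of correct predictions can be computed directly. It assumes the labels are stored as `uint8`, which is what older pandas versions produce from `get_dummies`. The arrays are illustrative, not the notebook's data.

```python
import numpy as np

# Made-up labels and predictions stored as unsigned 8-bit integers.
true_labels = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
predicted = np.array([1, 1, 0, 1, 0], dtype=np.uint8)

# Unsigned subtraction cannot represent -1, so 0 - 1 wraps around to 255.
wrapped_errors = abs(predicted - true_labels)
print("Wrapped errors:", wrapped_errors)

# Casting to a signed type first gives the intended 0/1 errors.
signed_errors = np.abs(predicted.astype(int) - true_labels.astype(int))
print("Signed errors:", signed_errors)

# For a classifier, accuracy is simply the share of matching labels.
accuracy = np.mean(predicted == true_labels)
print("Accuracy:", accuracy)
```

On the fitted model, `rf.score(test_features, test_labels)` or `sklearn.metrics.accuracy_score(test_labels, predictions)` reports the same kind of share.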

# Get feature importance from model
This shows which variables contribute most to the model's predictions. Our team is comparing Python and R from a beginner's perspective, and we note that this step takes more work in Python than it does with the equivalent model in R.

In [146]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Turn this into a DataFrame and sort by importance, for presentation purposes
feature_importances = pd.DataFrame(feature_importances, columns=['features','importance']).sort_values(by="importance",ascending=False)

In [147]:
feature_importances

| | features | importance |
|---|---|---|
| 1 | Credit Score | 0.21 |
| 0 | Current Loan Amount | 0.12 |
| 3 | Monthly Debt | 0.09 |
| 8 | Current Credit Balance | 0.09 |
| 9 | Maximum Open Credit | 0.09 |
| 4 | Years of Credit History | 0.08 |
| 2 | Annual Income | 0.07 |
| 6 | Number of Open Accounts | 0.06 |
| 5 | Months since last delinquent | 0.05 |
| 18 | Years in current job_3 years | 0.01 |
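
A shorter route to the same kind of ranking is to wrap `feature_importances_` in a pandas Series indexed by feature name. The sketch below fits a small forest on synthetic data. The column names `signal`, `noise_a`, and `noise_b` are invented, and the label depends only on `signal`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label is determined by "signal" alone.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort in one step.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance.round(2))
```

In this notebook the features were converted to a NumPy array before fitting, so the index would come from `feature_list` instead of `X.columns`.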
