# Solution Development for User Credit Data Project

Given the default problem in this dataset, lenders would benefit from an application that predicts whether a user will default. Such a tool would help employees decide whether to extend credit or a loan to an applicant.
To meet this need, I will build a model that classifies users as defaulters or non-defaulters.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
#Import Random Forest 
from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score, f1_score, precision_score, recall_score, make_scorer

In [2]:
#Read data from csv
fixed_df = pd.read_csv('data/fixed-data.csv',index_col=0)
pd.set_option('display.float_format', lambda x: '%.5f' % x) #set option to fix annoying scientific notation in pandas

In [3]:
fixed_df.head(5)

Unnamed: 0,payment_ratio,overlimit_percentage,payment_ratio_3month,payment_ratio_6month,years_since_card_issuing,remaining_bill_per_number_of_cards,remaining_bill_per_limit,total_usage_per_limit,total_3mo_usage_per_limit,total_6mo_usage_per_limit,...,outstanding,credit_limit,total_retail_usage,number_of_cards,X,X.1,x,branch_code,default_flag,delinquency_score
0,1.0219,0.0,0.7478,1.0,15.41667,13161.5,0.00376,1e-05,0.01172,0.01781,...,36158,7000000.0,94.0,2,1,1,1-a,I,0,0.0
1,0.0,0.0,0.0,0.0,0.75,0.0,0.0,0.0001,0.0,0.0,...,268691,10000000.0,1012.0,2,2,2,2-a,A,0,0.0
2,1.0,0.0,1.0,1.0091,10.75,0.0,0.0,0.0,0.04052,0.0477,...,6769149,28000000.0,0.0,3,3,3,3-a,A,0,0.0
4,0.9599,0.0,0.9749,0.9984,1.66667,2975932.5,0.59519,0.26666,0.32303,0.13116,...,9402085,10000000.0,2666558.0,2,5,5,5-a,A,0,0.0
6,0.1847,0.0,0.2495,0.1789,4.66667,1657023.0,0.82851,0.0576,0.01875,0.16667,...,3906290,4000000.0,230400.0,2,7,7,7-a,A,0,0.0


In [4]:
fixed_df.describe()

Unnamed: 0,payment_ratio,overlimit_percentage,payment_ratio_3month,payment_ratio_6month,years_since_card_issuing,remaining_bill_per_number_of_cards,remaining_bill_per_limit,total_usage_per_limit,total_3mo_usage_per_limit,total_6mo_usage_per_limit,utilization_6month,outstanding,credit_limit,total_retail_usage,number_of_cards,X,X.1,default_flag,delinquency_score
count,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0,7461.0
mean,0.49218,0.05823,0.53582,0.59363,5.74686,1190503.27634,0.3228,0.0439,0.10054,0.12665,0.4166,3743940.50824,10322677.92521,387402.28803,2.21981,7761.63249,7761.63249,0.07345,0.03163
std,0.47773,0.31863,0.36651,0.40482,3.71463,1617503.0654,0.36469,0.07012,0.10809,0.14228,0.34472,4080476.9162,8514204.36631,654254.53673,0.51746,4555.06907,4555.06907,0.26089,0.34395
min,0.0,0.0,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3000000.0,0.0,1.0,1.0,1.0,0.0,0.0
25%,0.0,0.0,0.204,0.2,2.75,0.0,0.0,0.0,0.0136,0.0129,0.0929,750719.0,5000000.0,0.0,2.0,3799.0,3799.0,0.0,0.0
50%,0.284,0.0,0.5,0.598,5.17,462871.0,0.142,0.004,0.0653,0.0758,0.349,2590903.0,7000000.0,45500.0,2.0,7724.0,7724.0,0.0,0.0
75%,1.0,0.0,0.862,1.0,7.92,1846563.0,0.6583,0.062,0.154,0.195,0.7,5050904.0,13000000.0,502000.0,2.0,11773.0,11773.0,0.0,0.0
max,2.44,2.65,1.9541,2.16,18.8,7743339.0,1.59,0.32,0.544,0.66225,1.77,23423624.0,47000000.0,3533381.0,4.0,15642.0,15642.0,1.0,5.0


In [5]:
fixed_df = fixed_df.drop(['X.1','x'],axis=1)
fixed_df.head(2)

Unnamed: 0,payment_ratio,overlimit_percentage,payment_ratio_3month,payment_ratio_6month,years_since_card_issuing,remaining_bill_per_number_of_cards,remaining_bill_per_limit,total_usage_per_limit,total_3mo_usage_per_limit,total_6mo_usage_per_limit,utilization_6month,outstanding,credit_limit,total_retail_usage,number_of_cards,X,branch_code,default_flag,delinquency_score
0,1.0219,0.0,0.7478,1.0,15.41667,13161.5,0.00376,1e-05,0.01172,0.01781,0.02195,36158,7000000.0,94.0,2,1,I,0,0.0
1,0.0,0.0,0.0,0.0,0.75,0.0,0.0,0.0001,0.0,0.0,0.0003,268691,10000000.0,1012.0,2,2,A,0,0.0


## Problem Statement / Hypothesis

This is a binary classification problem: each user belongs to one of two groups:
    1. Defaulters     (1)
    2. Non-Defaulters (0)
Before I start, I define the baseline result here (following https://machinelearningmastery.com/how-to-get-baseline-results-and-why-they-matter/):

### Baseline Result for Classification Problem
Since the classes are imbalanced, with far more observations for non-defaulters, I select the class with the most observations and use it as the prediction for every user.

In [6]:
print(fixed_df['default_flag'].count()) #Total number of rows after removing outliers
print(fixed_df['default_flag'].value_counts()) #Number of users who default vs who don't default

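# Share of the majority class, computed from the data instead of hard-coded counts
print('Baseline Classification Accuracy:', fixed_df['default_flag'].value_counts().max() / len(fixed_df) * 100)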

7461
0    6913
1     548
Name: default_flag, dtype: int64
Baseline Classification Accuracy: 92.65514006165392
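
As a quick check of what this baseline actually achieves, the sketch below scores an always-predict-non-default rule using only the class counts printed above. The function name `majority_baseline` is my own and is not part of the notebook.

```python
def majority_baseline(n_negative, n_positive):
    """Score a rule that always predicts the majority (negative) class."""
    total = n_negative + n_positive
    accuracy = n_negative / total      # every non-defaulter is right, every defaulter is wrong
    recall_positive = 0 / n_positive   # no defaulter is ever flagged
    return accuracy, recall_positive


acc, rec = majority_baseline(n_negative=6913, n_positive=548)
print(f"accuracy = {acc:.4f}, recall on defaulters = {rec:.4f}")
```

The baseline is hard to beat on accuracy precisely because it ignores the minority class, which is why the evaluation later also looks at precision, recall, and F1.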


In [7]:
#Determine default_flag as my y dataset
y = fixed_df.default_flag.values

## Creating the Model

### Data Preparation

It would be helpful to reduce the number of independent variables, so at this point I check for multicollinearity using the VIF (Variance Inflation Factor).
I use a threshold of VIF > 10 and drop one flagged variable at a time, recomputing the VIF after each drop.

In [9]:
# Build the feature frame for the multicollinearity check:
# keep every column except the identifier-like X column, the categorical branch code, and the target
df_model = fixed_df.drop(columns=['X', 'branch_code', 'default_flag'])

In [10]:
#Checking the VIF/ Multi-collinearity amongst the model dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df_model.values, i) for i in range(df_model.shape[1])]
vif["features"] = df_model.columns
print(vif)

    VIF Factor                            features
0      4.75589                       payment_ratio
1      1.13676                overlimit_percentage
2     11.62869                payment_ratio_3month
3      6.69783                payment_ratio_6month
4      4.00360            years_since_card_issuing
5      9.49250  remaining_bill_per_number_of_cards
6      8.59087            remaining_bill_per_limit
7      4.96582               total_usage_per_limit
8      3.83862           total_3mo_usage_per_limit
9      3.25759           total_6mo_usage_per_limit
10     5.77062                  utilization_6month
11     8.38131                         outstanding
12     4.98473                        credit_limit
13     5.04616                  total_retail_usage
14    11.45693                     number_of_cards
15     1.01766                   delinquency_score


In [11]:
# Drop the feature with the highest VIF above 10 (payment_ratio_3month, 11.63)
df_model = df_model.drop('payment_ratio_3month',axis=1)

In [12]:
#Calculate new VIF
#Checking the VIF/ Multi-collinearity amongst the model dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df_model.values, i) for i in range(df_model.shape[1])]
vif["features"] = df_model.columns
print(vif)

    VIF Factor                            features
0      3.29303                       payment_ratio
1      1.13600                overlimit_percentage
2      4.12117                payment_ratio_6month
3      4.00010            years_since_card_issuing
4      9.48794  remaining_bill_per_number_of_cards
5      8.48698            remaining_bill_per_limit
6      4.96492               total_usage_per_limit
7      3.65325           total_3mo_usage_per_limit
8      3.24599           total_6mo_usage_per_limit
9      5.71856                  utilization_6month
10     8.32397                         outstanding
11     4.95581                        credit_limit
12     5.03382                  total_retail_usage
13    11.34698                     number_of_cards
14     1.01766                   delinquency_score


In [13]:
df_model = df_model.drop('number_of_cards',axis=1)

In [14]:
#Calculate new VIF
#Checking the VIF/ Multi-collinearity amongst the model dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df_model.values, i) for i in range(df_model.shape[1])]
vif["features"] = df_model.columns
print(vif)

    VIF Factor                            features
0      3.20752                       payment_ratio
1      1.13355                overlimit_percentage
2      3.75560                payment_ratio_6month
3      3.09702            years_since_card_issuing
4      8.97238  remaining_bill_per_number_of_cards
5      7.50145            remaining_bill_per_limit
6      4.90205               total_usage_per_limit
7      3.64620           total_3mo_usage_per_limit
8      3.23625           total_6mo_usage_per_limit
9      5.47827                  utilization_6month
10     8.25821                         outstanding
11     3.94774                        credit_limit
12     4.97482                  total_retail_usage
13     1.01178                   delinquency_score


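After dropping payment_ratio_3month and number_of_cards, every remaining feature has a VIF below 10 (the highest is 8.97), so the multicollinearity check passes.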
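
To make explicit what the VIF measures, here is a NumPy-only sketch: each feature is regressed on the remaining ones and VIF = 1 / (1 - R²). Two things are assumptions on my part. First, the sketch adds an intercept to every regression; as far as I can tell, statsmodels' `variance_inflation_factor` regresses on the columns exactly as passed, so the numbers above (computed without a constant column) may differ from an intercept-based VIF. Second, the data here is synthetic and the function name `vif_with_intercept` is hypothetical.

```python
import numpy as np


def vif_with_intercept(X):
    """Return the VIF of each column of X, regressing it on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic data (illustration only): c is almost exactly a + b, d is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
d = rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c, d]))
print(np.round(vifs, 2))
```

The collinear columns a, b, and c should come out far above the threshold of 10, while the unrelated column d stays close to 1.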

### Splitting Data into Train and Test

For our classification problem, I split the data into training and test sets with a 70 % / 30 % ratio, stratified on the target so that both sets keep the same default rate.

In [15]:
# TRAINING (70%) and TESTING (30%)
# Stratified on y
X_train, X_test, y_train, y_test = train_test_split(df_model, y, test_size = 0.30, random_state = 5, stratify=y)

My data is imbalanced and leans heavily towards non-defaulters. I will use the SMOTE method (https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) to balance the classes in the training set only; the test set keeps its original class distribution.

In [16]:
# scale X matrix with StandardScaler
ss = StandardScaler()
X_scaled_train = ss.fit_transform(X_train)

X_scaled_test = ss.transform(X_test)

# Apply SMOTE to the training set only, as the data set is skewed towards non-defaulters
# fit_resample is used here because newer imbalanced-learn releases no longer provide fit_sample
sm = SMOTE(random_state = 3, sampling_strategy = 'minority')
X_scaled_sm_train, y_sm_train = sm.fit_resample(X_scaled_train, y_train)

### Random Forest Classifier Model
I will use a Random Forest Classifier to separate defaulters from non-defaulters. Random Forest is also chosen because it exposes feature importances, which help explain what drives the predictions.

In [17]:
# Base Random Forest classifier with hand-picked parameters
rclf = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=3)

# Fit on the scaled, SMOTE-resampled training set
rclf.fit(X_scaled_sm_train, y_sm_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=3, verbose=0,
                       warm_start=False)

In [18]:
# Use grid search (3-fold CV, scored on recall) to select the best parameters
grid_search_params ={
    'criterion': ['gini'],
    'max_depth': [None,1,5,10],
    'max_features': ['auto', 3, 7 ],
    'n_estimators':[100,200, 500, 1000],
    'random_state':[3]
}

rclf_grid_search = GridSearchCV(RandomForestClassifier(), grid_search_params, n_jobs = -1, verbose = 1, cv = 3, scoring='recall')

In [19]:
rclf_grid_search.fit(X_scaled_sm_train, y_sm_train)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   36.0s
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  2.6min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [20]:
rclf_grid_search.best_params_ # Check the best parameters from the grid search

{'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'n_estimators': 1000,
 'random_state': 3}

In [21]:
# Keep the estimator that grid search refit with the best parameters
rclf_grid_search_best_params = rclf_grid_search.best_estimator_

In [22]:
# Evaluate model performance:
# return accuracy, precision, recall, F1 and AUC for a fitted model on the test set
def eval_perf(model, X_test, y_test):
    model_yhat = model.predict(X_test)
    model_score = model.score(X_test, y_test)  # mean accuracy
    model_f1 = f1_score(y_test, model_yhat, average = 'binary')
    model_precision = precision_score(y_test, model_yhat, average = 'binary')
    model_recall = recall_score(y_test, model_yhat, average = 'binary')
    # AUC is computed from hard 0/1 predictions here, so it equals balanced accuracy;
    # passing model.predict_proba(X_test)[:, 1] instead would give a threshold-free AUC
    model_auc = roc_auc_score(y_test, model_yhat, average = 'macro')
    
    return (model_score, model_precision, model_recall, model_f1, model_auc)
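
The `eval_perf` function passes predicted labels to `roc_auc_score`. The sketch below, on synthetic data, illustrates the difference between that and scoring predicted probabilities: with hard labels the ROC curve has a single interior point, so the AUC collapses to balanced accuracy. The data, the model settings, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10 % positives), illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=3
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=5
)
model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)                      # 0/1 predictions at the 0.5 threshold
default_probability = model.predict_proba(X_te)[:, 1]  # score for the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, default_probability)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from labels: {auc_from_labels:.3f} (balanced accuracy: {balanced_acc:.3f})")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

The probability-based AUC measures how well the model ranks defaulters above non-defaulters across all thresholds, which is usually what is meant by an AUC score.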

In [23]:
score_rclf, precision_rclf, recall_rclf, f1_rclf, auc_rclf = eval_perf(rclf, X_scaled_test, y_test) #Evaluate performance here 
#This is performance evaluation for the normal random forest

In [24]:
print('Results for the normal/base Random Forest Model')
random_forest_classifier_yhat = rclf.predict(X_scaled_test)
print (classification_report(y_test, random_forest_classifier_yhat, labels = [0,1], target_names=['non_default','default']))

Results for the normal/base Random Forest Model
              precision    recall  f1-score   support

 non_default       0.98      0.79      0.87      2075
     default       0.22      0.76      0.35       164

    accuracy                           0.79      2239
   macro avg       0.60      0.78      0.61      2239
weighted avg       0.92      0.79      0.84      2239



In [25]:
# Build the confusion matrix for the base Random Forest as a labelled DataFrame
confusion_matrix_base = pd.DataFrame(confusion_matrix(y_test, random_forest_classifier_yhat, labels = [0,1]),
                      index = ['actual_non_default','actual_default'],
                      columns = ['predict_non_default','predict_default'])
confusion_matrix_base

Unnamed: 0,predict_non_default,predict_default
actual_non_default,1640,435
actual_default,39,125


In [27]:
# Evaluate performance of best params random forest
score_rclf_gs_best, precision_rclf_gs_best, recall_rclf_gs_best, f1_rclf_gs_best, auc_rclf_gs_best = eval_perf(rclf_grid_search_best_params, X_scaled_test, y_test)

print("Results for the Grid Search Random Forest Model")
RF_best_yhat = rclf_grid_search_best_params.predict(X_scaled_test)
print (classification_report(y_test, RF_best_yhat, labels = [0,1], target_names=['non_default','default']))

Results for the Grid Search Random Forest Model
              precision    recall  f1-score   support

 non_default       0.95      0.95      0.95      2075
     default       0.38      0.40      0.39       164

    accuracy                           0.91      2239
   macro avg       0.67      0.68      0.67      2239
weighted avg       0.91      0.91      0.91      2239



In [28]:
# Confusion matrix for the Random Forest with the Grid Search best parameters
confusion_matrix_gs = pd.DataFrame(confusion_matrix(y_test, RF_best_yhat, labels = [0,1]),
                      index = ['actual_non_default','actual_default'],
                      columns = ['predict_non_default','predict_default'])
confusion_matrix_gs

Unnamed: 0,predict_non_default,predict_default
actual_non_default,1969,106
actual_default,98,66


In [29]:
compare_models_df = pd.DataFrame({'Basic Random Forest':[score_rclf, precision_rclf, recall_rclf, f1_rclf, auc_rclf],\
                                     'Grid Search Random Forest':[score_rclf_gs_best, precision_rclf_gs_best, recall_rclf_gs_best, f1_rclf_gs_best, auc_rclf_gs_best]},\
                                    index = ['Accuracy: ','Precision: ','Recall: ','F1 Score: ','AUC Score: '])
compare_models_df

Unnamed: 0,Basic Random Forest,Grid Search Random Forest
Accuracy:,0.7883,0.90889
Precision:,0.22321,0.38372
Recall:,0.7622,0.40244
F1 Score:,0.3453,0.39286
AUC Score:,0.77628,0.67568


- Accuracy is higher for the Grid Search Random Forest model (0.91 vs 0.79).
- Precision is also higher for the Grid Search model (0.38 vs 0.22).
- Recall is the share of actual defaulters that the model correctly flags. It drops sharply in the Grid Search model (0.40 vs 0.76).
- The F1 score is slightly higher for the Grid Search model (0.39 vs 0.35).
- The AUC score is lower for the Grid Search model: 0.68, against 0.78 for the base model.
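
The base model's column in the comparison table can be reproduced directly from its confusion matrix (1640 true negatives, 435 false positives, 39 false negatives, 125 true positives). The helper name `metrics_from_counts` is my own; the sketch only restates the standard metric definitions.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute the headline metrics from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)    # flagged users who really default
    recall = tp / (tp + fn)       # actual defaulters who get flagged
    specificity = tn / (tn + fp)  # actual non-defaulters who are cleared
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


base = metrics_from_counts(tn=1640, fp=435, fn=39, tp=125)
for name, value in base.items():
    print(f"{name}: {value:.5f}")
# accuracy: 0.78830
# precision: 0.22321
# recall: 0.76220
# f1: 0.34530
# balanced_accuracy: 0.77628
```

The balanced accuracy matches the 0.77628 reported as the AUC score, which is what one would expect when AUC is computed from hard predictions.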

Regarding which score to use for model evaluation, recall should be the top priority, because a high recall means the model catches most of the actual defaulters. False negatives are costly: each one is credit extended to someone who cannot pay it back.

We also need to balance the costs of False Positives and False Negatives.
For instance, suppose the lender gives a loan to someone predicted to be a non-defaulter who later defaults (False Negative). This is costly because the lender does not recover the credit.
On the other hand, if the model flags as a defaulter someone who would in fact have paid back (False Positive), the lender loses the potential profit from that customer.
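
This trade-off can be made concrete with a simple cost comparison. The unit costs below are purely hypothetical (a missed defaulter is assumed to cost ten times a wrongly rejected customer); the lender would need to supply real figures. The error counts are 435 false positives and 39 false negatives for the base model, and 106 false positives and 98 false negatives for the Grid Search model, the latter following from its reported precision (0.38372) and recall (0.40244) on 164 actual defaulters.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost for given error counts and unit costs."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical unit costs: one false negative costs as much as ten false positives.
COST_FP = 1.0
COST_FN = 10.0

base_cost = total_cost(fp=435, fn=39, cost_fp=COST_FP, cost_fn=COST_FN)
tuned_cost = total_cost(fp=106, fn=98, cost_fp=COST_FP, cost_fn=COST_FN)

# Cost ratio (FN cost / FP cost) at which both models cost the same:
# 435 + 39 * r = 106 + 98 * r  ->  r = (435 - 106) / (98 - 39)
break_even_ratio = (435 - 106) / (98 - 39)

print(f"base model cost:  {base_cost:.0f}")    # 825
print(f"tuned model cost: {tuned_cost:.0f}")   # 1086
print(f"break-even FN/FP cost ratio: {break_even_ratio:.2f}")  # 5.58
```

Under these assumed costs the base model is cheaper; more generally, it wins whenever a missed defaulter costs more than about 5.6 times a wrongly rejected customer, and the Grid Search model wins below that ratio.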

In [30]:
rclf.feature_importances_

array([0.17439203, 0.00753956, 0.07120882, 0.00770928, 0.02910391,
       0.02961103, 0.10053447, 0.2419262 , 0.02736667, 0.01897217,
       0.07742609, 0.01658144, 0.15824828, 0.03938003])

In [31]:
#get my feature labels for printing
feature_labels = df_model.columns
# print feature importance on my random forest classifier.
list_features = []
for feature in zip(feature_labels, rclf.feature_importances_):
    print(feature)
    list_features.append(feature)

('payment_ratio', 0.17439203278375562)
('overlimit_percentage', 0.007539562228966823)
('payment_ratio_6month', 0.07120881734479163)
('years_since_card_issuing', 0.007709284389305301)
('remaining_bill_per_number_of_cards', 0.02910390983575271)
('remaining_bill_per_limit', 0.02961103458683387)
('total_usage_per_limit', 0.10053447370498568)
('total_3mo_usage_per_limit', 0.24192619914956312)
('total_6mo_usage_per_limit', 0.02736667442860117)
('utilization_6month', 0.01897216699057531)
('outstanding', 0.07742609353463578)
('credit_limit', 0.016581442932270193)
('total_retail_usage', 0.15824827563225774)
('delinquency_score', 0.039380032457704994)


In [32]:
feature_importance_df = pd.DataFrame({'importance': rclf.feature_importances_, 'features':df_model.columns})

In [33]:
feature_importance_df.sort_values('importance', ascending = False).reset_index(drop=True)

Unnamed: 0,importance,features
0,0.24193,total_3mo_usage_per_limit
1,0.17439,payment_ratio
2,0.15825,total_retail_usage
3,0.10053,total_usage_per_limit
4,0.07743,outstanding
5,0.07121,payment_ratio_6month
6,0.03938,delinquency_score
7,0.02961,remaining_bill_per_limit
8,0.0291,remaining_bill_per_number_of_cards
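9,0.02737,total_6mo_usage_per_limit
10,0.01897,utilization_6month
11,0.01658,credit_limit
12,0.00771,years_since_card_issuing
13,0.00754,overlimit_percentage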


The top 3 most important features for classifying someone as a defaulter or non-defaulter, according to the base Random Forest model, are:
    1. Total usage per limit in the last 3 months
    2. Payment ratio in the last month
    3. Total retail usage
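
The ranking above comes from the forest's impurity-based `feature_importances_`, computed on the SMOTE-resampled training data. As a cross-check, permutation importance on held-out data is a common alternative: it measures how much a chosen score drops when one feature is shuffled. The sketch below uses synthetic data and scores on recall; the data, settings, and variable names are assumptions and not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data with 3 informative features out of 6 (illustration only).
X_demo, y_demo = make_classification(
    n_samples=800, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=3,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=5
)
model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the mean drop in recall.
result = permutation_importance(
    model, X_te, y_te, scoring="recall", n_repeats=10, random_state=3
)
ranking = result.importances_mean.argsort()[::-1]  # most important feature first

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

If the two rankings broadly agree on the real data, that adds confidence in the top features; large disagreements would be worth investigating before drawing business conclusions.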

## Summary
- I built 2 Random Forest classification models: a base Random Forest with hand-picked parameters (first model) and one tuned with Grid Search (second model):
    - The first model has better Recall and AUC score
    - The second model has better Accuracy, F1 score, and Precision
- The first model obtained an accuracy of 79%, the second model 91%
- Neither model beat the baseline accuracy defined above (92.65 %). However, the baseline never flags a single defaulter, so accuracy alone is a poor yardstick for this problem
- Usage (total retail usage and total usage per limit over 3 months) and payment ratio are the most important features for predicting Default/Non-Default
- Models should be selected primarily on recall
- The costs of false positives and false negatives still need to be balanced in the prediction model