# Credit Data Example

This example demonstrates how to use a classical tree-based model for classification on imbalanced data, such as predicting default on a loan or credit card.

## Data
The data is a public dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Here we briefly inspect the data, then test two methods: XGBoost and IsolationForest. EDA and feature engineering are still to be added.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 1000)
df = pd.read_excel('default of credit card clients.xls', header=1)

In [2]:
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [3]:
df.shape

(30000, 25)

In [4]:
na_cols = []
for col in df.columns:
    if df[col].count() != df.shape[0]:
        na_cols.append(col)
df[na_cols].count()

Series([], dtype: int64)

There are no missing values.
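
The same check can be written with pandas' built-in missing-value helpers. Below is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not the credit data.

```python
import pandas as pd

# Hypothetical toy frame with one missing value, standing in for `df`.
toy = pd.DataFrame({"LIMIT_BAL": [20000, None, 90000], "AGE": [24, 26, 34]})

# Count missing values per column, then keep only the columns that have any.
na_counts = toy.isna().sum()
na_cols = na_counts[na_counts > 0]
print(na_cols)
```

On a frame without missing values, `na_cols` comes back empty, which corresponds to the empty `Series` shown above.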

In [5]:
df.dtypes

ID                            int64
LIMIT_BAL                     int64
SEX                           int64
EDUCATION                     int64
MARRIAGE                      int64
AGE                           int64
PAY_0                         int64
PAY_2                         int64
PAY_3                         int64
PAY_4                         int64
PAY_5                         int64
PAY_6                         int64
BILL_AMT1                     int64
BILL_AMT2                     int64
BILL_AMT3                     int64
BILL_AMT4                     int64
BILL_AMT5                     int64
BILL_AMT6                     int64
PAY_AMT1                      int64
PAY_AMT2                      int64
PAY_AMT3                      int64
PAY_AMT4                      int64
PAY_AMT5                      int64
PAY_AMT6                      int64
default payment next month    int64
dtype: object

In [6]:
df['default payment next month'].value_counts()

0    23364
1     6636
Name: default payment next month, dtype: int64

The data is imbalanced: about 22% of the rows (6636 of 30000) belong to class 1. The column 'default payment next month' is the label.
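
To make the degree of imbalance concrete, the short calculation below works out the positive rate and the negative-to-positive ratio from the counts printed above. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_neg, n_pos = 23364, 6636

pos_rate = n_pos / (n_neg + n_pos)   # share of rows in class 1
neg_to_pos = n_neg / n_pos           # negatives per positive

print(f"positive rate: {pos_rate:.4f}")
print(f"negative/positive ratio: {neg_to_pos:.2f}")
```

A ratio like this is a common starting value for XGBoost's `scale_pos_weight`; the runs in this example keep that parameter at 1.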

## Prepare for Training

In [7]:
label = df['default payment next month'].values
label

array([1, 1, 0, ..., 1, 1, 1], dtype=int64)

In [8]:
data = df.drop(['default payment next month'],axis=1, inplace=False)

## Split the data statistically
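Since the data is imbalanced, the train/test split needs to preserve the class proportions of the original data (a stratified split). The training portion is then split again in the same way to hold out a validation set (`X_var`, `y_var`).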

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.values, label, test_size=0.2, random_state=42, stratify=label)

X_training, X_var, y_training, y_var = train_test_split(X_train, y_train,\
                                                        test_size=0.2, random_state=42, stratify=y_train)

In [10]:
X_training.shape

(19200, 24)
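
As a quick check of what `stratify` does, the sketch below splits a synthetic label vector with roughly the same class ratio and compares the share of positives in each part. The arrays are made up for illustration and are not the credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 22% positives (roughly the ratio seen above).
y = np.array([1] * 220 + [0] * 780)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, each part keeps the original share of positives.
print(y_tr.mean(), y_te.mean())
```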

## Using XGBoost
XGBoost provides flexible and fast model training.

In [11]:
import os

mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-7.3.0-posix-seh-rt_v5-rev0\\mingw64\\bin'

os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

In [12]:
from xgboost import XGBClassifier

model = XGBClassifier(objective='binary:logistic',n_estimators=200)
model.fit(X_training, y_training, eval_set=[(X_training, y_training), (X_var, y_var)], eval_metric='auc', verbose=10)
print(model)

[0]	validation_0-auc:0.733729	validation_1-auc:0.712399
[10]	validation_0-auc:0.774028	validation_1-auc:0.764049
[20]	validation_0-auc:0.780168	validation_1-auc:0.768258
[30]	validation_0-auc:0.785753	validation_1-auc:0.771366
[40]	validation_0-auc:0.792319	validation_1-auc:0.774259
[50]	validation_0-auc:0.797579	validation_1-auc:0.774792
[60]	validation_0-auc:0.80159	validation_1-auc:0.775595
[70]	validation_0-auc:0.805092	validation_1-auc:0.775335
[80]	validation_0-auc:0.807562	validation_1-auc:0.775902
[90]	validation_0-auc:0.809775	validation_1-auc:0.775219
[100]	validation_0-auc:0.812344	validation_1-auc:0.775658
[110]	validation_0-auc:0.814872	validation_1-auc:0.775924
[120]	validation_0-auc:0.81696	validation_1-auc:0.776366
[130]	validation_0-auc:0.819119	validation_1-auc:0.776506
[140]	validation_0-auc:0.821087	validation_1-auc:0.776762
[150]	validation_0-auc:0.82246	validation_1-auc:0.776422
[160]	validation_0-auc:0.824194	validation_1-auc:0.776326
[170]	validation_0-auc:0.825

We can tune the parameters following a detailed online guide: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
* We only tune the most important parameters.

In [13]:
from sklearn.model_selection import StratifiedKFold
from sklearn.grid_search import GridSearchCV

param_test1 = {
 'max_depth':[3,5,7,9],
 'min_child_weight':[1,3,5]
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=200, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(X_train, y_train)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_



([mean: 0.78217, std: 0.00777, params: {'max_depth': 3, 'min_child_weight': 1},
  mean: 0.78257, std: 0.00729, params: {'max_depth': 3, 'min_child_weight': 3},
  mean: 0.78199, std: 0.00776, params: {'max_depth': 3, 'min_child_weight': 5},
  mean: 0.77852, std: 0.00732, params: {'max_depth': 5, 'min_child_weight': 1},
  mean: 0.77752, std: 0.00712, params: {'max_depth': 5, 'min_child_weight': 3},
  mean: 0.77799, std: 0.00653, params: {'max_depth': 5, 'min_child_weight': 5},
  mean: 0.77390, std: 0.00746, params: {'max_depth': 7, 'min_child_weight': 1},
  mean: 0.77302, std: 0.00581, params: {'max_depth': 7, 'min_child_weight': 3},
  mean: 0.77581, std: 0.00742, params: {'max_depth': 7, 'min_child_weight': 5},
  mean: 0.76912, std: 0.00568, params: {'max_depth': 9, 'min_child_weight': 1},
  mean: 0.76778, std: 0.00651, params: {'max_depth': 9, 'min_child_weight': 3},
  mean: 0.76849, std: 0.00582, params: {'max_depth': 9, 'min_child_weight': 5}],
 {'max_depth': 3, 'min_child_weight': 3

In [14]:
param_test2 = {
 'max_depth':[3,4,5],
 'min_child_weight':[2,3,4]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=200, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(X_train, y_train)
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

([mean: 0.78264, std: 0.00826, params: {'max_depth': 3, 'min_child_weight': 2},
  mean: 0.78257, std: 0.00729, params: {'max_depth': 3, 'min_child_weight': 3},
  mean: 0.78243, std: 0.00699, params: {'max_depth': 3, 'min_child_weight': 4},
  mean: 0.78105, std: 0.00744, params: {'max_depth': 4, 'min_child_weight': 2},
  mean: 0.78120, std: 0.00803, params: {'max_depth': 4, 'min_child_weight': 3},
  mean: 0.78141, std: 0.00741, params: {'max_depth': 4, 'min_child_weight': 4},
  mean: 0.77980, std: 0.00780, params: {'max_depth': 5, 'min_child_weight': 2},
  mean: 0.77752, std: 0.00712, params: {'max_depth': 5, 'min_child_weight': 3},
  mean: 0.77911, std: 0.00758, params: {'max_depth': 5, 'min_child_weight': 4}],
 {'max_depth': 3, 'min_child_weight': 2},
 0.7826416120007113)
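
The grid-search cells above rely on `sklearn.grid_search`, the `iid` argument and `grid_scores_`, all of which belong to older scikit-learn releases and were removed later. The sketch below shows the same kind of search with the current `sklearn.model_selection.GridSearchCV` and `cv_results_`. To keep it dependent on scikit-learn only, it uses `GradientBoostingClassifier` on synthetic data as a stand-in for `XGBClassifier`; the data, the grid values and the use of `min_samples_leaf` in place of `min_child_weight` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (hypothetical stand-in for X_train, y_train).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Stand-in estimator: min_samples_leaf is only loosely comparable to
# XGBoost's min_child_weight, it is not the same parameter.
param_grid = {"max_depth": [2, 3], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ holds the per-candidate scores in current scikit-learn.
for params, mean, std in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_test_score"],
    search.cv_results_["std_test_score"],
):
    print(f"mean: {mean:.5f}, std: {std:.5f}, params: {params}")
print(search.best_params_, search.best_score_)
```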

The grid search selects max_depth = 3 and min_child_weight = 2, with a cross-validated AUC of about 0.78. A max_depth of 4 scores within one standard deviation of that (about 0.781 against 0.783) and is the value used in the next cell. Since parameter tuning does not significantly improve the result, ensembling could be considered to further improve the model if needed.

In [16]:
model = XGBClassifier(objective='binary:logistic', max_depth=4,
                      min_child_weight=2, n_estimators=60)  # fixed at 60 rounds (manual cut-off, not automatic early stopping)
model.fit(X_training, y_training, eval_set=[(X_training, y_training), (X_var, y_var)], eval_metric='auc', verbose=10)
print(model)

[0]	validation_0-auc:0.748791	validation_1-auc:0.730067
[10]	validation_0-auc:0.782635	validation_1-auc:0.76739
[20]	validation_0-auc:0.793164	validation_1-auc:0.773296
[30]	validation_0-auc:0.798885	validation_1-auc:0.775004
[40]	validation_0-auc:0.804768	validation_1-auc:0.776849
[50]	validation_0-auc:0.810866	validation_1-auc:0.77803
[59]	validation_0-auc:0.81517	validation_1-auc:0.777942
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=4, min_child_weight=2, missing=None, n_estimators=60,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
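
The text does not say how the 60-round cut-off was chosen. One possible rule, sketched below, is patience-based stopping applied to the validation AUC logged in the first XGBoost run. The function `pick_round` and the values `min_delta=0.001` and `patience=2` are assumptions chosen for illustration, not the method actually used.

```python
# Validation AUC (validation_1) logged every 10 rounds in the first XGBoost run.
val_auc = {
    0: 0.712399, 10: 0.764049, 20: 0.768258, 30: 0.771366, 40: 0.774259,
    50: 0.774792, 60: 0.775595, 70: 0.775335, 80: 0.775902, 90: 0.775219,
    100: 0.775658, 110: 0.775924, 120: 0.776366, 130: 0.776506,
    140: 0.776762, 150: 0.776422, 160: 0.776326,
}

def pick_round(history, min_delta=0.001, patience=2):
    """Return the last round that improved the score by more than min_delta,
    stopping after `patience` consecutive checkpoints without such a gain."""
    best_round, best_score, stale = None, float("-inf"), 0
    for rnd, score in sorted(history.items()):
        if score > best_score + min_delta:
            best_round, best_score, stale = rnd, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_round

print(pick_round(val_auc))                  # patience rule with min_delta=0.001
print(max(val_auc, key=val_auc.get))        # round with the highest logged AUC
```

With these assumed settings the rule lands on round 60, while `min_delta=0` stops at round 80 and the highest logged AUC occurs at round 140. The gain between rounds 60 and 140 is about 0.001, so the cut-off is a judgment call. XGBoost also offers an `early_stopping_rounds` option that automates this kind of rule.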


A threshold of 0.85 gives the best F1 score.
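
The cell that produced this threshold is not shown. A minimal sketch of one way to search for an F1-maximizing threshold on a validation set follows; the labels, probabilities and the helper `f1_at` are hypothetical and are not the notebook's data or code.

```python
def f1_at(y_true, probs, threshold):
    """F1 of the positive class when predicting 1 for prob > threshold."""
    pred = [int(p > threshold) for p in probs]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.80, 0.55]

# Scan a grid of candidate thresholds and keep the one with the best F1.
thresholds = [t / 20 for t in range(1, 20)]  # 0.05, 0.10, ..., 0.95
best = max(thresholds, key=lambda t: f1_at(y_val, p_val, t))
print(best, f1_at(y_val, p_val, best))
```

The threshold should be chosen on validation data such as `X_var`, not on the test set, so that the final evaluation stays unbiased.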

## Using IsolationForest
Another approach to consider is anomaly detection, which treats the rare default cases as outliers. We can compare it with the XGBoost model.

In [18]:
from sklearn.ensemble import IsolationForest
from sklearn import metrics

clf = IsolationForest()
clf.fit(X_train)
y_pred = clf.predict(X_train)
y_pred = (-y_pred+1)/2

print("Accuracy : %.4g" % metrics.accuracy_score(y_train, y_pred))
print("AUC Score (Train): %f" % metrics.roc_auc_score(y_train, y_pred))
print("F1 Score (Train): %f" % metrics.f1_score(y_train, y_pred))

Accuracy : 0.7361
AUC Score (Train): 0.518998
F1 Score (Train): 0.178493
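
Two details of this cell are worth spelling out: how `(-y_pred+1)/2` converts the `predict` output into 0/1 labels, and what an AUC computed from hard labels measures. The sketch below uses made-up predictions and a small pairwise AUC helper, `auc_pairwise`, which is an illustrative function and not part of scikit-learn.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# IsolationForest.predict returns +1 for inliers and -1 for outliers;
# (-y + 1) / 2 maps -1 -> 1.0 (flagged) and +1 -> 0.0.
raw = [-1, 1, 1, -1, 1, 1, 1, 1]        # hypothetical predict() output
y_hat = [(-r + 1) / 2 for r in raw]
y_true = [1, 1, 1, 0, 0, 0, 0, 0]       # hypothetical ground truth

tpr = sum(1 for y, p in zip(y_true, y_hat) if y == 1 and p == 1) / y_true.count(1)
tnr = sum(1 for y, p in zip(y_true, y_hat) if y == 0 and p == 0) / y_true.count(0)
auc = auc_pairwise(y_true, y_hat)
balanced = (tpr + tnr) / 2
print(auc, balanced)
```

With hard 0/1 predictions the AUC reduces to the mean of the true-positive and true-negative rates. A continuous anomaly score, for example `-clf.score_samples(X)`, would give a ranking-based AUC that is more comparable to the XGBoost AUC computed from `predict_proba`.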


## Final evaluation using the test set

**Baseline: Constant prediction**

In [19]:
import numpy as np
baseline = np.zeros(len(y_test))

print("Accuracy : %.4g" % metrics.accuracy_score(y_test, baseline))
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, baseline))
print("F1 Score (Test): %f" % metrics.f1_score(y_test, baseline))

Accuracy : 0.7788
AUC Score (Test): 0.500000
F1 Score (Test): 0.000000
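
These baseline numbers follow directly from the test-set class counts (4673 negatives and 1327 positives, as reported by `pd.Series(y_test).value_counts()` in this notebook). A short worked calculation:

```python
# Test-set class counts as reported by pd.Series(y_test).value_counts().
n_neg, n_pos = 4673, 1327

# Predicting 0 for every row gets all negatives right and all positives wrong.
accuracy = n_neg / (n_neg + n_pos)

# No positive predictions: TP = 0, FP = 0, FN = all positives.
tp, fp, fn = 0, 0, n_pos
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"{accuracy:.4f}", f1)
```

The AUC is 0.5 because a constant score ties every positive with every negative, so it carries no ranking information. Any useful model therefore has to beat 0.7788 accuracy, which is why accuracy alone says little on this dataset.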




**For XGBoost**

In [20]:
#Predict test set:
threshold = .5
predprob = model.predict_proba(X_test)[:,1]
pred = (predprob > threshold).astype(int)
        
print("Accuracy : %.4g" % metrics.accuracy_score(y_test, pred))
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, predprob))
print("F1 Score (Test): %f" % metrics.f1_score(y_test, pred))

Accuracy : 0.8188
AUC Score (Test): 0.780556
F1 Score (Test): 0.467418


The model reaches an accuracy of 0.82, against 0.78 for the constant baseline, and an AUC of 0.78.

In [22]:
pd.Series(pred).value_counts()

0    5286
1     714
dtype: int64

**For IsolationForest**

In [23]:
y_pred = clf.predict(X_test)
y_pred = (-y_pred+1)/2

print("Accuracy : %.4g" % metrics.accuracy_score(y_test, y_pred))
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, y_pred))
print("F1 Score (Test): %f" % metrics.f1_score(y_test, y_pred))

Accuracy : 0.7408
AUC Score (Test): 0.530912
F1 Score (Test): 0.208651


In [24]:
pd.Series(y_pred).value_counts()

0.0    5362
1.0     638
dtype: int64

In [25]:
pd.Series(y_test).value_counts()

0    4673
1    1327
dtype: int64

IsolationForest performs worse than XGBoost. Both methods also predict fewer class 1 labels (714 and 638) than the ground truth contains (1327).
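
The reported F1 scores and prediction counts are enough to reconstruct each confusion matrix, since F1 = 2·TP / (predicted positives + actual positives). The sketch below does that arithmetic. It assumes the printed F1 values are rounded versions of exact ratios, and `confusion_from_f1` is an illustrative helper.

```python
def confusion_from_f1(f1, n_pred_pos, n_true_pos, n_total):
    """Recover (TP, FP, FN, TN) from F1 = 2*TP / (pred_pos + true_pos)."""
    tp = round(f1 * (n_pred_pos + n_true_pos) / 2)
    fp = n_pred_pos - tp
    fn = n_true_pos - tp
    tn = n_total - tp - fp - fn
    return tp, fp, fn, tn

# Figures copied from the outputs above (test set of 6000 rows, 1327 defaults).
reported = {
    "XGBoost": (0.467418, 714),
    "IsolationForest": (0.208651, 638),
}
for name, (f1, n_pred_pos) in reported.items():
    tp, fp, fn, tn = confusion_from_f1(f1, n_pred_pos, 1327, 6000)
    print(f"{name}: precision={tp / (tp + fp):.3f} recall={tp / (tp + fn):.3f} "
          f"accuracy={(tp + tn) / 6000:.4f}")
```

The accuracies recomputed this way agree with the reported 0.8188 and 0.7408, which supports the reconstruction. Under it, XGBoost recovers roughly 36% of the actual defaults at the 0.5 threshold and IsolationForest roughly 15%.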

## Summary
In this exercise, two models, XGBoost and IsolationForest, are compared. XGBoost gives an AUC of around 0.78 and appears to be a stronger model than IsolationForest. Feature engineering is needed to improve the model further.