# Classification case study - credit card defaulters dataset
The credit card data and its documentation are available [on the UCI website](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

## Reading data

In [1]:
## Importing libraries
import pandas as pd
import numpy as np

credit = pd.read_csv("UCI_Credit_Card.csv")
credit.shape

(30000, 25)

In [2]:
## Checking data sample
credit.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


## Attribute information

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

**Limit_Bal**: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 

**Gender** (1 = male; 2 = female). 

**Education** (1 = graduate school; 2 = university; 3 = high school; 4 = others). 

**Marital status** (1 = married; 2 = single; 3 = others). 

**Age** (year). 

**History of past payment**: past monthly payment records. The measurement scale for the repayment status is:

    -1 = pay duly
    1 = payment delay for one month
    2 = payment delay for two months
    ...
    8 = payment delay for eight months
    9 = payment delay for nine months and above

**Amount of bill statement** (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 

**Amount of previous payment** (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .; X23 = amount paid in April, 2005.



## Data Exploration

### Summary statistics

In [3]:
credit.describe()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,167484.322667,1.603733,1.853133,1.551867,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,...,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567,0.2212
std,8660.398374,129747.661567,0.489129,0.790349,0.52197,9.217904,1.123802,1.197186,1.196868,1.169139,...,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775,0.415062
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,...,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,...,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,15000.5,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,...,19052.0,18104.5,17071.0,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,22500.25,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,...,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,30000.0,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,...,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0,1.0


In [4]:
credit['EDUCATION'].unique()

array([2, 1, 3, 5, 4, 6, 0])

## Important observations and assumptions

1. According to the description, education has only 4 valid levels (1, 2, 3, 4). However, the data contains three additional levels (0, 5, 6). **We will combine (0, 5, 6) into 4 (others).**


2. According to the description, the PAY columns have values ranging from -1 to 9. However, the value -2 also appears. **We will treat -2 as an advance payment status.**


3. According to the description, marriage is supposed to have 3 levels (1, 2, 3). However, we noticed an extra level 0. **We will replace 0 with 3 (others).**
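
One way to spot such undocumented levels is to compare the observed values with the documented codes. The snippet below is a minimal sketch on a small made-up frame: `toy` and the `documented` dictionary are assumptions for illustration and are not part of the notebook. With the real data you would pass `credit` instead of `toy`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `credit` DataFrame.
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4, 0, 5, 6],
    "MARRIAGE": [1, 2, 3, 0, 1, 2, 3],
})

# Documented levels, taken from the attribute description above.
documented = {
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

# Collect every observed level that the documentation does not list.
undocumented = {
    col: sorted(set(toy[col].unique().tolist()) - levels)
    for col, levels in documented.items()
}
print(undocumented)
```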

### Dealing with unknown levels in education column

In [5]:
# Recode the undocumented levels 0, 5 and 6 as 4 (others).
# Assigning the result back avoids relying on inplace=True on a selected column.
credit['EDUCATION'] = credit['EDUCATION'].replace({0: 4, 5: 4, 6: 4})
credit['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
4      468
Name: EDUCATION, dtype: int64

### Dealing with unknown levels in marriage


In [6]:
# Recode the undocumented level 0 as 3 (others)
credit['MARRIAGE'] = credit['MARRIAGE'].replace(0, 3)
credit['MARRIAGE'].value_counts()

2    15964
1    13659
3      377
Name: MARRIAGE, dtype: int64

### Check if the data types are as expected

In [7]:
credit.dtypes

ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                       int64
MARRIAGE                        int64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                     float64
BILL_AMT2                     float64
BILL_AMT3                     float64
BILL_AMT4                     float64
BILL_AMT5                     float64
BILL_AMT6                     float64
PAY_AMT1                      float64
PAY_AMT2                      float64
PAY_AMT3                      float64
PAY_AMT4                      float64
PAY_AMT5                      float64
PAY_AMT6                      float64
default.payment.next.month      int64
dtype: object

### Dropping ID column and storing the list of categorical and numeric columns

In [8]:
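credit.drop(columns=["ID"], inplace=True)
cat_cols = ['SEX', 'EDUCATION', 'MARRIAGE']
# Names of the numeric features: everything except the categorical columns and the target
num_cols = credit.columns.difference(cat_cols + ['default.payment.next.month']).tolist()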

### Check target distribution
**Notice the class imbalance**

In [9]:
credit['default.payment.next.month'].value_counts()

0    23364
1     6636
Name: default.payment.next.month, dtype: int64
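
The imbalance can be quantified directly from these counts. The sketch below copies the two counts from the output above (the variable names are illustrative) and turns them into class shares. The share of class 1 matches the mean of `default.payment.next.month` reported by `describe()`.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 23364, 1: 6636}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"class {label}: {share:.4f}")
```

Roughly 22% of the clients defaulted, so a model can reach about 78% accuracy without identifying a single defaulter.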

## Check missing values
**Fortunately no missing values**

In [10]:
credit.isnull().sum()

LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

In [11]:
credit.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,20000.0,2,2,1,24,2,2,-1,-1,-2,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26,-1,2,0,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34,0,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37,0,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57,-1,0,-1,0,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [12]:
credit.describe()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,167484.322667,1.603733,1.842267,1.557267,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,-0.2662,...,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567,0.2212
std,129747.661567,0.489129,0.744494,0.521405,9.217904,1.123802,1.197186,1.196868,1.169139,1.133187,...,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775,0.415062
min,10000.0,1.0,1.0,1.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,19052.0,18104.5,17071.0,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,...,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,1000000.0,2.0,4.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,...,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0,1.0


### Test-Train split

In [17]:
y = credit['default.payment.next.month']
X = credit[credit.columns.difference(['default.payment.next.month'])]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=124421)
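
Because the target is imbalanced, one option (not used in this notebook) is to pass `stratify=y` so that both partitions keep the same class ratio. The sketch below illustrates this on a made-up target; `X_toy` and `y_toy` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 20 positives out of 100 rows.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```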

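### Data preprocessing after the split
Ideally, do your **data preprocessing** after the train-test split, especially standardization and imputation, and fit those steps on the training set only so that nothing from the test set leaks into training.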

In [18]:
X_train = pd.get_dummies(X_train,columns=cat_cols)
X_test = pd.get_dummies(X_test,columns=cat_cols)

X_train, X_test = X_train.align(X_test, axis=1, fill_value=0)
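
The `align` call matters when a category level occurs in only one of the two partitions. The sketch below shows the effect on two made-up frames (`train` and `test` are hypothetical, not the notebook's data): level 4 of `EDUCATION` is missing from the test rows, so its dummy column has to be added and filled with zeros.

```python
import pandas as pd

# Hypothetical toy partitions: level 4 of EDUCATION only occurs in train.
train = pd.DataFrame({"EDUCATION": [1, 2, 4], "AGE": [24, 26, 34]})
test = pd.DataFrame({"EDUCATION": [1, 2, 2], "AGE": [37, 57, 29]})

train_d = pd.get_dummies(train, columns=["EDUCATION"])
test_d = pd.get_dummies(test, columns=["EDUCATION"])

# Without alignment the two frames have different columns.
print(sorted(set(train_d.columns) - set(test_d.columns)))

# align(axis=1) defaults to an outer join, so both frames end up with the
# union of columns; fill_value=0 fills the dummy that was missing in test.
train_a, test_a = train_d.align(test_d, axis=1, fill_value=0)
print(list(test_a.columns))
```

Note that the outer join also works the other way round: a level that appears only in the test rows becomes an all-zero column in the training frame.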

## ML Modeling

In [None]:
## Importing ML packages and accuracy metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, f1_score, make_scorer, classification_report

### Model 1 - Logistic with original features and default parameters (lr1)

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr1 = LogisticRegression()
lr1.fit(X_train, y_train)

lr1_train_pred = lr1.predict(X_train)
lr1_test_pred = lr1.predict(X_test)

print(classification_report(y_train, lr1_train_pred))
print(classification_report(y_test, lr1_test_pred))



              precision    recall  f1-score   support

           0       0.78      1.00      0.88     18673
           1       0.00      0.00      0.00      5327

    accuracy                           0.78     24000
   macro avg       0.39      0.50      0.44     24000
weighted avg       0.61      0.78      0.68     24000

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      4691
           1       0.00      0.00      0.00      1309

    accuracy                           0.78      6000
   macro avg       0.39      0.50      0.44      6000
weighted avg       0.61      0.78      0.69      6000
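
A recall of 1.00 for class 0 together with 0.00 for class 1 suggests that `lr1` predicts "no default" for every row. The arithmetic below is a small sketch, using only the support values from the test report, of what such a majority-class predictor scores.

```python
# Support values copied from the test-set report above.
n_negative = 4691  # class 0 (no default)
n_positive = 1309  # class 1 (default)

# A classifier that always predicts 0 gets every negative right
# and every positive wrong.
accuracy = n_negative / (n_negative + n_positive)
recall_positive = 0 / n_positive

print(f"accuracy={accuracy:.2f}, recall(1)={recall_positive:.2f}")
```

The 0.78 accuracy therefore says nothing about the model's ability to find defaulters, which is why the remaining models are compared on the class 1 precision, recall and f1-score.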





### Model 2 - Logistic with original features and class_weight='balanced' (lr2)

In [26]:
lr2 = LogisticRegression(class_weight='balanced')
lr2.fit(X_train, y_train)

lr2_train_pred = lr2.predict(X_train)
lr2_test_pred = lr2.predict(X_test)

print(classification_report(y_train, lr2_train_pred))
print(classification_report(y_test, lr2_test_pred))



              precision    recall  f1-score   support

           0       0.86      0.65      0.74     18673
           1       0.34      0.64      0.45      5327

    accuracy                           0.65     24000
   macro avg       0.60      0.65      0.60     24000
weighted avg       0.75      0.65      0.68     24000

              precision    recall  f1-score   support

           0       0.87      0.65      0.75      4691
           1       0.34      0.64      0.44      1309

    accuracy                           0.65      6000
   macro avg       0.60      0.65      0.59      6000
weighted avg       0.75      0.65      0.68      6000
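
For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to the training-set class counts from the report above; the variable names are illustrative.

```python
# Training-set class counts, copied from the support column above.
counts = {0: 18673, 1: 5327}

n_samples = sum(counts.values())
n_classes = len(counts)

# Formula documented for class_weight='balanced' in scikit-learn:
# n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}
print(weights)
```

Each defaulter is weighted roughly 3.5 times as heavily as a non-defaulter, so both classes contribute the same total weight to the loss.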



## Decision Trees

### Model 3 - Decision trees with default parameters
**Notice the overfitting**

In [27]:
from sklearn.tree import DecisionTreeClassifier

dt1 = DecisionTreeClassifier()
dt1.fit(X_train,y_train)

dt1_train_pred = dt1.predict(X_train)
dt1_test_pred = dt1.predict(X_test)

print(classification_report(y_train, dt1_train_pred))
print(classification_report(y_test, dt1_test_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18673
           1       1.00      1.00      1.00      5327

    accuracy                           1.00     24000
   macro avg       1.00      1.00      1.00     24000
weighted avg       1.00      1.00      1.00     24000

              precision    recall  f1-score   support

           0       0.83      0.81      0.82      4691
           1       0.37      0.40      0.38      1309

    accuracy                           0.72      6000
   macro avg       0.60      0.60      0.60      6000
weighted avg       0.73      0.72      0.72      6000
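
The gap between a perfect training report and a much weaker test report is typical of an unrestricted tree. The sketch below reproduces the pattern on synthetic data from `make_classification` (an assumption for illustration, not the credit card data) and compares it with a depth-limited tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the credit card dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# An unrestricted tree keeps splitting until the training data is memorised.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limiting depth trades training fit for a simpler model.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name, model.get_depth(), model.score(X_tr, y_tr), model.score(X_te, y_te))
```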



### Model 4 - Decision tree with max_depth=5

In [28]:
dt2 = DecisionTreeClassifier(max_depth=5)
dt2.fit(X_train,y_train)

dt2_train_pred = dt2.predict(X_train)
dt2_test_pred = dt2.predict(X_test)

print(classification_report(y_train, dt2_train_pred))
print(classification_report(y_test, dt2_test_pred))

              precision    recall  f1-score   support

           0       0.84      0.95      0.89     18673
           1       0.70      0.37      0.49      5327

    accuracy                           0.82     24000
   macro avg       0.77      0.66      0.69     24000
weighted avg       0.81      0.82      0.80     24000

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      4691
           1       0.65      0.35      0.45      1309

    accuracy                           0.82      6000
   macro avg       0.74      0.65      0.67      6000
weighted avg       0.80      0.82      0.79      6000



### Model 5 - Decision tree with max_depth=5 and balanced class weights

In [30]:
dt3 = DecisionTreeClassifier(max_depth=5, class_weight='balanced')
dt3.fit(X_train,y_train)

dt3_train_pred = dt3.predict(X_train)
dt3_test_pred = dt3.predict(X_test)

print(classification_report(y_train, dt3_train_pred))
print(classification_report(y_test, dt3_test_pred))

              precision    recall  f1-score   support

           0       0.89      0.76      0.82     18673
           1       0.44      0.66      0.53      5327

    accuracy                           0.74     24000
   macro avg       0.66      0.71      0.67     24000
weighted avg       0.79      0.74      0.75     24000

              precision    recall  f1-score   support

           0       0.88      0.75      0.81      4691
           1       0.42      0.63      0.50      1309

    accuracy                           0.73      6000
   macro avg       0.65      0.69      0.66      6000
weighted avg       0.78      0.73      0.74      6000



## Plotting the decision tree
**This code block will generate a PDF in your current working directory (the directory from which you started Jupyter Notebook).**

In [31]:
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(dt3, out_file=None, feature_names=X_train.columns, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render("dt3_max_depth_5") 

'dt3_max_depth_5.pdf'
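
If Graphviz is not installed, a plain-text view of the rules may be enough. The sketch below uses `sklearn.tree.export_text`, a different function from the `export_graphviz` call above, on a small synthetic tree; the data and the feature names `f0` to `f3` are assumptions for the example. With the notebook's objects the call would presumably be `export_text(dt3, feature_names=list(X_train.columns))`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_syn, y_syn)

# export_text returns the fitted rules as an indented plain-text string.
rules = export_text(small_tree, feature_names=names)
print(rules)
```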

## Hyper-parameter tuning - Randomized Search with Cross-Validation

In [33]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn import tree

dt = tree.DecisionTreeClassifier(class_weight='balanced') 
param_grid = {'criterion':['gini','entropy'],
             'max_leaf_nodes': np.arange(5,30,1),
             'min_samples_split': np.arange(0.001,0.1,0.001),
             'max_depth':np.arange(3,15,1),
             'min_weight_fraction_leaf':np.arange(0.01,0.25,0.005)}

rsearch = RandomizedSearchCV(estimator=dt, param_distributions=param_grid,n_iter=20,n_jobs=-1)
rsearch.fit(X_train, y_train)

print(rsearch.best_estimator_)
print("Train - Report")
print(classification_report(y_train,rsearch.predict(X_train)))
print("Test - Report")
print(classification_report(y_test,rsearch.predict(X_test)))



DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=12, max_features=None, max_leaf_nodes=23,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=0.032,
                       min_weight_fraction_leaf=0.20999999999999996,
                       presort=False, random_state=None, splitter='best')
Train - Report
              precision    recall  f1-score   support

           0       0.86      0.86      0.86     18673
           1       0.51      0.52      0.51      5327

    accuracy                           0.78     24000
   macro avg       0.68      0.69      0.69     24000
weighted avg       0.78      0.78      0.78     24000

Test - Report
              precision    recall  f1-score   support

           0       0.86      0.85      0.86      4691
           1       0.49      0.51      0.50      1309

    accuracy                           0
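
The search above passes neither `scoring` nor `random_state`, so candidates are ranked by the estimator's default score (accuracy for classifiers) and the sampled combinations change from run to run. The sketch below shows one way to set both explicitly. It runs on synthetic data, and the particular choices (`scoring="f1"`, `cv=3`, `n_iter=5`) are assumptions for illustration rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption for the sketch).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

params = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.arange(3, 8),
    "max_leaf_nodes": np.arange(5, 30),
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_distributions=params,
    n_iter=5,        # number of sampled parameter combinations
    cv=3,            # 3-fold cross-validation
    scoring="f1",    # rank candidates by F1 on the positive class
    random_state=0,  # makes the sampled combinations repeatable
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```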

## Random Forest

In [34]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=3, random_state=123, n_estimators=50, class_weight="balanced")
rf.fit(X_train,y_train)

y_train_pred_rf = rf.predict(X_train)
y_test_pred_rf = rf.predict(X_test)

print(classification_report(y_train, y_train_pred_rf))
print(classification_report(y_test, y_test_pred_rf))

              precision    recall  f1-score   support

           0       0.87      0.84      0.86     18673
           1       0.51      0.58      0.54      5327

    accuracy                           0.78     24000
   macro avg       0.69      0.71      0.70     24000
weighted avg       0.79      0.78      0.79     24000

              precision    recall  f1-score   support

           0       0.88      0.83      0.85      4691
           1       0.49      0.58      0.53      1309

    accuracy                           0.78      6000
   macro avg       0.68      0.71      0.69      6000
weighted avg       0.79      0.78      0.78      6000



## Hyper-parameter tuning for RandomForest
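
The search below ranks candidates with `make_scorer(recall_score)`, so the selected forest is the one with the best cross-validated recall on class 1. The sketch that follows, on synthetic data with a stand-in logistic model (both assumptions for illustration), shows what the wrapper does: it turns a metric of the form `metric(y_true, y_pred)` into a callable that takes `(estimator, X, y)`. The built-in string `scoring="recall"` should behave the same way for a binary target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score

# Synthetic stand-in data and model (assumptions for the sketch).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_syn, y_syn)

# make_scorer wraps a metric(y_true, y_pred) function into a callable
# with the (estimator, X, y) signature that search objects expect.
recall_scorer = make_scorer(recall_score)

via_scorer = recall_scorer(model, X_syn, y_syn)
direct = recall_score(y_syn, model.predict(X_syn))
print(via_scorer, direct)
```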

In [37]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score, f1_score, make_scorer

score_metric = make_scorer(recall_score)
rfc = RandomForestClassifier(class_weight='balanced')

param_grid = {'criterion':['gini','entropy'],
              'n_estimators': np.arange(25,200,25),
              'min_samples_split': np.arange(0.001,0.1,0.01),
             'max_depth':np.arange(3,15,1)}

rsearch = RandomizedSearchCV(estimator=rfc, param_distributions=param_grid,n_iter=20, n_jobs=3,scoring=score_metric)
rsearch.fit(X_train, y_train)
print(rsearch.best_estimator_)
print("Train - Report")
print(classification_report(y_train,rsearch.predict(X_train)))
print("Test - Report")
print(classification_report(y_test,rsearch.predict(X_test)))



RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=11, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=0.09099999999999998,
                       min_weight_fraction_leaf=0.0, n_estimators=25,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Train - Report
              precision    recall  f1-score   support

           0       0.88      0.79      0.83     18673
           1       0.46      0.64      0.54      5327

    accuracy                           0.75     24000
   macro avg       0.67      0.71      0.68     24000
weighted avg       0.79      0.75      0.77     24000

Test - Report
              precision    recall  f1-score   support

           0       0.88      0.78      0.83 