# Models


# Model 4

- One-hot encode the categorical columns with `get_dummies` (`EDUCATION`, `MARRIAGE`, `SEX`, `PAY_1`, `PAY_2`, `PAY_3`, `PAY_4`, `PAY_5`, `PAY_6`).
- Exclude the features `BILL_AMT2`, ..., `BILL_AMT6`.

## Import libraries/packages 

In [1]:
### General libraries ###
import pandas as pd
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import graphviz 
from graphviz import Source
from IPython.display import SVG
import os

##################################

### ML Models ###
from sklearn.linear_model import LogisticRegression
from sklearn import tree
# from sklearn.tree.export import export_text
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.feature_selection import SelectKBest

##################################

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA

### Metrics ###
from yellowbrick.classifier import ConfusionMatrix
from sklearn import metrics
from sklearn.metrics import f1_score,confusion_matrix, mean_squared_error, mean_absolute_error, classification_report, roc_auc_score, roc_curve, precision_score, recall_score

## Part 1: Load and clean the data

In this section we load the data from the CSV file and check for "impurities" such as null values or duplicate rows. If any appear, we remove them from the data set. We also plot the correlations of the class column with all the other columns.

In [2]:
# Load the data.
data = pd.read_csv('default of credit card clients.csv')

# Information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
ID           30000 non-null int64
LIMIT_BAL    30000 non-null int64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null int64
BILL_AMT2    30000 non-null int64
BILL_AMT3    30000 non-null int64
BILL_AMT4    30000 non-null int64
BILL_AMT5    30000 non-null int64
BILL_AMT6    30000 non-null int64
PAY_AMT1     30000 non-null int64
PAY_AMT2     30000 non-null int64
PAY_AMT3     30000 non-null int64
PAY_AMT4     30000 non-null int64
PAY_AMT5     30000 non-null int64
PAY_AMT6     30000 non-null int64
dpnm         30000 non-null int64
dtypes: int64(25)
memory usage: 5.7 MB


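We first recode some category values: `MARRIAGE` value 0 becomes 3, and `EDUCATION` values 0, 5 and 6 become 4. Since the `ID` column is for indexing purposes only, we then remove it from the data set.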

In [3]:
# Replace value '0' with value '3'.
data['MARRIAGE'] = data['MARRIAGE'].replace(0, 3)

# Replace values '0','5' and '6' with value '4'.
data['EDUCATION'] = data['EDUCATION'].replace([0, 5, 6], 4)

In [4]:
# Drop "ID" column.
data = data.drop(['ID'], axis=1)

Now we check for duplicate rows. If there are any, we remove them from the data set, since they provide only redundant information.

In [5]:
# Check for duplicate rows.
print(f"There are {data.duplicated().sum()} duplicate rows in the data set.")

# Remove duplicate rows.
data = data.drop_duplicates()
print("The duplicate rows were removed.")

There are 35 duplicate rows in the data set.
The duplicate rows were removed.


We also check for null values.

In [6]:
# Check for null values (count individual null cells, not columns).
print(
    f"There are {data.isna().sum().sum()} cells with null values in the data set.")

There are 0 cells with null values in the data set.
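
The two pandas idioms for counting nulls measure different things: `isna().any().sum()` counts columns that contain at least one null, while `isna().sum().sum()` counts individual null cells. A minimal sketch on a made-up toy frame (the frame `toy` and its values are hypothetical, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two nulls in column "a", one in column "b", none in "c".
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, 2.0, 3.0],
    "c": [1.0, 2.0, 3.0],
})

cols_with_nulls = int(toy.isna().any().sum())  # columns containing at least one null
null_cells = int(toy.isna().sum().sum())       # individual null cells

print(cols_with_nulls, null_cells)
```

Both counts are zero when the data set has no nulls at all, so the two only differ once missing values are present.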


Below is the plot of the correlation matrix for the data set.
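
A minimal sketch of how the correlations with the class column could be computed and ranked. It runs on synthetic data, and the helper name `class_correlations` is hypothetical; only the column names `LIMIT_BAL`, `AGE` and `dpnm` are borrowed from the data set.

```python
import numpy as np
import pandas as pd


def class_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`,
    ordered by absolute value (strongest first)."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(25)
n = 500
toy = pd.DataFrame({
    "LIMIT_BAL": rng.normal(size=n),
    "AGE": rng.normal(size=n),
})
# Class label built to depend on LIMIT_BAL only.
toy["dpnm"] = (toy["LIMIT_BAL"] + 0.1 * rng.normal(size=n) < 0).astype(int)

ranked = class_correlations(toy, "dpnm")
print(ranked)

# The full matrix could then be drawn with, for example:
#   sns.heatmap(toy.corr(), cmap="coolwarm", center=0)
```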

## Part 2: Pre-processing

In this part we prepare the data for our models. We choose which columns serve as independent variables and which column is the class we want to predict. We then split the data into train and test sets and standardize them.

In [7]:
# One-hot encode the categorical columns.
data = pd.get_dummies(
    data, columns=['EDUCATION','MARRIAGE','SEX','PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])

In [8]:
# Select the feature columns and the class column.
X = data.drop(columns=['dpnm', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])
y = data['dpnm']

In [9]:
len(X.columns)

82
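
The column count grows because `pd.get_dummies` replaces each listed categorical column with one indicator column per distinct value. A small sketch on a hypothetical toy frame (the values are made up; only the column names mirror the data set):

```python
import pandas as pd

# Hypothetical toy frame with two integer-coded categorical columns.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 2, 3, 2],
    "AGE": [24, 35, 41, 29],
})

# Untouched columns come first, followed by the indicator columns
# named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=["SEX", "EDUCATION"])
print(list(encoded.columns))
```

Here three input columns become six: `AGE` is kept, `SEX` expands to two indicators and `EDUCATION` to three.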

In [10]:
X_ = X.copy()
X_clust = X.copy()

## Feature selection

In [11]:
# Best k for CV accuracy:17, 23

In [12]:
X_ = SelectKBest(k=23).fit_transform(X_, y)
# Split to train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_, y, test_size=0.3, random_state=25)

# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
# Initialize a Logistic Regression estimator.
logreg = LogisticRegression(random_state=25, n_jobs=-1)

# Train the estimator.
logreg.fit(X_train, y_train)
# Make predictions.
log_pred = logreg.predict(X_test)
# 10-fold cross-validated metrics on the selected (unscaled) features X_.
cv_logreg = cross_validate(logreg, X_, y, scoring=scoring, cv=10)
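
In the cell above, `StandardScaler` is fitted on the train split only, while `cross_validate` receives the unscaled `X_`, and `SelectKBest` is fitted on all rows before any split. One common way to keep both steps inside each fold is a scikit-learn pipeline. The following is a sketch under assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and `k=10` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and class column.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=8, random_state=25)

# Selection and scaling are re-fitted on the training part of every fold,
# so the held-out part never influences them.
pipe = make_pipeline(
    SelectKBest(k=10),
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=25),
)

scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_pipe = cross_validate(pipe, X_demo, y_demo, scoring=scoring, cv=10)

print({name: round(cv_pipe[f"test_{name}"].mean(), 3) for name in scoring})
```

Passing the pipeline rather than the bare estimator means the cross-validated scores reflect the same preprocessing that the train/test split receives.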

## Perform K-means clustering analysis

In [13]:
X_clust = SelectKBest(k=17).fit_transform(X_clust, y)

In [14]:
# Standardization
sc = StandardScaler()
sc.fit(X_clust)
X_clust = sc.transform(X_clust)
X_clust = pd.DataFrame(X_clust)

In [15]:
model = KMeans(n_clusters=2, random_state=25).fit(X_clust)

In [16]:
X_clust['cluster'] = model.labels_
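
The notebook fixes `n_clusters=2`. The imported `silhouette_score` and `davies_bouldin_score` could be used to compare candidate cluster counts. A minimal sketch on synthetic blobs; the data, the blob centers and the range of `k` are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (made-up centers).
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=25,
)
X_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=25).fit_predict(X_demo)
    # Higher silhouette is better; lower Davies-Bouldin is better.
    scores[k] = (silhouette_score(X_demo, labels),
                 davies_bouldin_score(X_demo, labels))

best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```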

In [17]:
# Split to train and test sets for X_clust.
X_clust_train, X_clust_test, y_clust_train, y_clust_test = train_test_split(
    X_clust, y, test_size=0.3, random_state=25)

In [18]:
# Standardization for X_clust
scaler_ = StandardScaler()
X_clust_train = scaler_.fit_transform(X_clust_train)
X_clust_test = scaler_.transform(X_clust_test)

## Part 3: Modeling

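In this section we build and evaluate three models, each trained once on the `SelectKBest` features (`X_`) and once on the cluster-augmented features (`X_clust`):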
 - Logistic Regression
 - Decision tree
 - Neural network



## Logistic Regression

In [19]:
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

In [20]:
# Initialize a Logistic Regression estimator.
logreg_clust = LogisticRegression(random_state=25, n_jobs=-1)
# Train the estimator for clustering.
logreg_clust.fit(X_clust_train, y_clust_train)
# Make predictions for clustering.
log_clust_pred = logreg_clust.predict(X_clust_test)

cv_logreg_clust = cross_validate(
    logreg_clust, X_clust, y, scoring=scoring, cv=10)

In [21]:
d = {
    'Models': ['Logistic Regression', 'Logistic Regression w/ clustering'],
    'CV Accuracy': [cv_logreg['test_accuracy'].mean(), cv_logreg_clust['test_accuracy'].mean()],
    'CV Precision': [cv_logreg['test_precision'].mean(), cv_logreg_clust['test_precision'].mean()],
    'CV Recall': [cv_logreg['test_recall'].mean(), cv_logreg_clust['test_recall'].mean()],
    'CV F1': [cv_logreg['test_f1'].mean(), cv_logreg_clust['test_f1'].mean()],
    'CV AUC': [cv_logreg['test_roc_auc'].mean(), cv_logreg_clust['test_roc_auc'].mean()]
}

results = pd.DataFrame(data=d).round(3).set_index('Models')
results

| Models                            | CV Accuracy | CV Precision | CV Recall | CV F1 | CV AUC |
|-----------------------------------|:-----------:|:------------:|:---------:|:-----:|:------:|
| Logistic Regression               |    0.78     |    0.068     |   0.009   | 0.016 | 0.638  |
| Logistic Regression w/ clustering |    0.819    |    0.677     |   0.349   | 0.46  | 0.758  |
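
The results dictionary is built by hand for each pair of models. A small helper could produce the same kind of table from any number of `cross_validate` outputs. This is a sketch only: the function name `summarize_cv` is hypothetical, and the score arrays below are made-up numbers used to show the shape of the output.

```python
import numpy as np
import pandas as pd


def summarize_cv(cv_results: dict, metric_names: list) -> pd.DataFrame:
    """Average the per-fold `test_<metric>` arrays returned by
    sklearn's cross_validate, one row per model."""
    rows = {
        model_name: {f"CV {m}": res[f"test_{m}"].mean() for m in metric_names}
        for model_name, res in cv_results.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index").round(3).rename_axis("Models")


# Made-up per-fold scores standing in for real cross_validate outputs.
fake = {
    "model_a": {"test_accuracy": np.array([0.8, 0.9]), "test_f1": np.array([0.4, 0.6])},
    "model_b": {"test_accuracy": np.array([0.7, 0.7]), "test_f1": np.array([0.5, 0.3])},
}

table = summarize_cv(fake, ["accuracy", "f1"])
print(table)
```

With the notebook's variables, a call might look like `summarize_cv({'Logistic Regression': cv_logreg, 'Logistic Regression w/ clustering': cv_logreg_clust}, scoring)`.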


## Decision tree

In [22]:
# Initialize a decision tree estimator.
tr = tree.DecisionTreeClassifier(
    max_depth=30, criterion='gini', random_state=25)

# Train the estimator.
tr.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=30, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=25, splitter='best')

In [23]:
# Make predictions.
tr_pred = tr.predict(X_test)

# CV accuracy.
cv_tr = cross_validate(tr, X_, y, scoring=scoring, cv=10)
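
The held-out predictions (`log_pred`, `tr_pred` and so on) are computed, but only the cross-validated scores are tabulated. A sketch of how such predictions could be scored with functions already imported from `sklearn.metrics`; the label arrays here are made up and merely stand in for `y_test` and `tr_pred`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Made-up labels and predictions standing in for y_test and tr_pred.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Rows are the true class, columns the predicted class.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_true, y_pred, digits=3))

# F1 of the positive class (label 1).
f1 = f1_score(y_true, y_pred)
```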

In [24]:
# Initialize a decision tree estimator.
tr_clust = tree.DecisionTreeClassifier(
    max_depth=30, criterion='gini', random_state=25)

# Train the estimator.
tr_clust.fit(X_clust_train, y_clust_train)

# Make predictions.
tr_clust_pred = tr_clust.predict(X_clust_test)

# CV accuracy.
cv_tr_clust = cross_validate(tr_clust, X_clust, y, scoring=scoring, cv=10)

In [25]:
d = {
    'Models': ['Decision Tree', 'Decision Tree w/ clustering'],
    'CV Accuracy': [cv_tr['test_accuracy'].mean(), cv_tr_clust['test_accuracy'].mean()],
    'CV Precision': [cv_tr['test_precision'].mean(), cv_tr_clust['test_precision'].mean()],
    'CV Recall': [cv_tr['test_recall'].mean(), cv_tr_clust['test_recall'].mean()],
    'CV F1': [cv_tr['test_f1'].mean(), cv_tr_clust['test_f1'].mean()],
    'CV AUC': [cv_tr['test_roc_auc'].mean(), cv_tr_clust['test_roc_auc'].mean()]
}

results = pd.DataFrame(data=d).round(3).set_index('Models')
results

| Models                      | CV Accuracy | CV Precision | CV Recall | CV F1 | CV AUC |
|-----------------------------|:-----------:|:------------:|:---------:|:-----:|:------:|
| Decision Tree               |    0.758    |    0.442     |   0.357   | 0.395 | 0.638  |
| Decision Tree w/ clustering |    0.8      |    0.586     |   0.33    | 0.422 | 0.682  |
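
Both trees use a fixed `max_depth=30`. `GridSearchCV`, which is imported at the top, could be used to compare shallower depths. The following is a hedged sketch on synthetic data: the grid of depths, the `f1` scoring choice and `cv=5` are assumptions for illustration, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and class column.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=25)

depths = [2, 4, 8, 16, 30]
grid = GridSearchCV(
    DecisionTreeClassifier(criterion='gini', random_state=25),
    param_grid={'max_depth': depths},
    scoring='f1',
    cv=5,
)
grid.fit(X_demo, y_demo)

# Depth with the best mean cross-validated F1 and that score.
print(grid.best_params_, round(grid.best_score_, 3))
```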


## Neural network

In [26]:
# Initialize a Multi-layer Perceptron classifier.
mlp = MLPClassifier(hidden_layer_sizes=(12, 5), max_iter=1000,
                    random_state=25, shuffle=True, verbose=False)

# Train the classifier.
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(12, 5), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=1000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=25, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [27]:
# Make predictions.
mlp_pred = mlp.predict(X_test)

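# Cross-validated metrics for the MLP. Pass `mlp` here; with `tr` the call
# re-scores the decision tree, which is why the 'MLP' row in the table below
# repeats the decision-tree scores.
cv_mlp = cross_validate(mlp, X_, y, scoring=scoring, cv=10)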

In [28]:
# Initialize a Multi-layer Perceptron classifier.
mlp_clust = MLPClassifier(hidden_layer_sizes=(12, 5), max_iter=1000,
                    random_state=25, shuffle=True, verbose=False)

# Train the estimator.
mlp_clust.fit(X_clust_train, y_clust_train)

# Make predictions.
mlp_clust_pred = mlp_clust.predict(X_clust_test)

# CV accuracy.
cv_mlp_clust = cross_validate(mlp_clust, X_clust, y, scoring=scoring, cv=10)

In [29]:
d = {
    'Models': ['MLP', 'MLP w/ clustering'],
    'CV Accuracy': [cv_mlp['test_accuracy'].mean(), cv_mlp_clust['test_accuracy'].mean()],
    'CV Precision': [cv_mlp['test_precision'].mean(), cv_mlp_clust['test_precision'].mean()],
    'CV Recall': [cv_mlp['test_recall'].mean(), cv_mlp_clust['test_recall'].mean()],
    'CV F1': [cv_mlp['test_f1'].mean(), cv_mlp_clust['test_f1'].mean()],
    'CV AUC': [cv_mlp['test_roc_auc'].mean(), cv_mlp_clust['test_roc_auc'].mean()]
}

results = pd.DataFrame(data=d).round(3).set_index('Models')
results

| Models            | CV Accuracy | CV Precision | CV Recall | CV F1 | CV AUC |
|-------------------|:-----------:|:------------:|:---------:|:-----:|:------:|
| MLP               |    0.758    |    0.442     |   0.357   | 0.395 | 0.638  |
| MLP w/ clustering |    0.82     |    0.672     |   0.366   | 0.473 | 0.765  |


## Paper results


### 1) Yeh, I.-C. and Lien, C.-H. (2009) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36, 2473-2480.

|                     | Error rate | Accuracy |
|---------------------|:----------:|:--------:|
| Logistic Regression |    0.18    |   0.82   |
| Decision tree       |    0.17    |   0.83   |
| Neural Network      |    0.17    |   0.83   |

### 2) Liu, R.L. (2018) Machine Learning Approaches to Predict Default of Credit Card Clients. Modern Economy, 9, 1828-1838.

|                | Accuracy |   F1   |
|----------------|:--------:|:------:|
|  Decision tree |  0.7973  | 0.4912 |
| Neural Network |  0.8227  | 0.4593 |

***