# CAPSTONE PROJECT: CREDIT CARD DEFAULT




Default of credit card clients Data Set

Abstract: This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods.

Data Set Characteristics:  Multivariate

Number of Instances: 30000

Area: Business

Attribute Characteristics: Integer, Real

Number of Attributes: 24

Associated Tasks:

Classification

Missing Values?

N/A

Source:

Name: I-Cheng Yeh
email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw
institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan.
other contact information: 886-2-26215656 ext. 3181


Data Set Information:

This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel "Sorting Smoothing Method" to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.
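
The calibration check described above, regressing the real probability of default (Y) on the predicted probability (X), can be sketched in a few lines. The probabilities below are made up for illustration and the helper name `calibration_line` is hypothetical; this is a minimal sketch of ordinary least squares, not the authors' code.

```python
# Illustrative only: toy probabilities, not values from the study.
def calibration_line(x, y):
    """Least-squares fit of y = A + B*x; returns (A, B, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

predicted = [0.1, 0.2, 0.4, 0.6, 0.8]      # X: a model's predicted probability
real = [0.12, 0.18, 0.41, 0.63, 0.79]      # Y: estimated "real" probability
A, B, r2 = calibration_line(predicted, real)
print(f"A = {A:.3f}, B = {B:.3f}, R^2 = {r2:.3f}")
```

A well-calibrated model would give A near 0, B near 1 and a high R^2, which is the criterion the study applies to its six methods.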


Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
X2: Gender (1 = male; 2 = female).
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
X4: Marital status (1 = married; 2 = single; 3 = others).
X5: Age (year).
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.




In [None]:
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline 

In [None]:
pwd

In [None]:
#importing the data and looking at the first 10 and last 10 rows
default = pd.read_csv('C:/Users/fb8502oa/Desktop/Projects using Python/default of credit card clients.csv', header = 1)
#a cell only shows its last expression automatically, so display both tables explicitly
display(default.head(10))
display(default.tail(10))

In [None]:
default.dtypes

As the dtypes show, several variables are stored as integers but are really categorical (factor) variables:
EDUCATION, SEX, MARRIAGE, the PAY columns and default payment next month.
Let's look at the target, default payment next month, first.

In [None]:
#looking at default 
import seaborn as sb
#countplot draws one bar per class (distplot is deprecated in recent seaborn versions)
sb.countplot(x='default payment next month', data=default)
plt.show()
#defaulters are the minority class,
#so the data is imbalanced.
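
Because the classes are imbalanced, plain accuracy can look good even for a useless model. The sketch below uses made-up labels (the 78/22 split is an assumption for illustration, not the notebook's actual counts) to show the accuracy a classifier gets by always predicting the majority class.

```python
from collections import Counter

# Toy labels (assumed for illustration): 1 = default, 0 = no default.
labels = [0] * 78 + [1] * 22

counts = Counter(labels)
total = len(labels)
shares = {cls: n / total for cls, n in counts.items()}

# A "model" that always predicts the majority class gets this accuracy for free.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print("class shares:", shares)
print("majority class:", majority_class, "baseline accuracy:", baseline_accuracy)
```

Any model in this notebook should be judged against that baseline, and against precision and recall for the default class, rather than against 0%.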

In [None]:

# Renaming the default variable
default.rename(columns={'default payment next month':'DEFAULT'},inplace=True)

# DATA CLEANING TO FIT SCIKIT-LEARN FORMAT.

In [None]:
#separating EDUCATION into dummy (indicator) features for scikit-learn
#one-hot encoding for Education; code 4 (others) is the baseline with all three indicators at 0
default['GRAD_SCHOOL'] = (default['EDUCATION']==1).astype('int')
default['UNIVERSITY'] = (default['EDUCATION']==2).astype('int')
default['HIGH_SCHOOL'] = (default['EDUCATION']==3).astype('int')
default.drop('EDUCATION', axis =1, inplace = True)
default.head(10)

In [None]:
#turning SEX into a dummy (indicator) feature for scikit-learn
#MALE = 1 for male, 0 for female

default['MALE']= (default['SEX']==1).astype('int')
default.drop('SEX', axis = 1, inplace = True)
default.head(10)

In [None]:
#turning MARRIAGE into a dummy (indicator) feature for scikit-learn
#MARRIED = 1 for married, 0 for single or others
default['MARRIED'] = (default['MARRIAGE']==1).astype('int')
default.drop('MARRIAGE', axis=1, inplace = True)
default.head(10)
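
The manual indicator columns built in the cells above can also be produced with `pandas.get_dummies`. The sketch below runs on a small made-up frame (not the real data) and uses the hypothetical prefix `EDU`; it shows that dropping the baseline category gives the same three columns as the hand-written comparisons.

```python
import pandas as pd

# Toy data (assumed): education codes 1..4 as in the attribute description.
toy = pd.DataFrame({'EDUCATION': [1, 2, 3, 4, 2]})

# Manual indicator columns, as in the cells above (code 4 = others is the baseline).
manual = pd.DataFrame({
    'GRAD_SCHOOL': (toy['EDUCATION'] == 1).astype('int'),
    'UNIVERSITY': (toy['EDUCATION'] == 2).astype('int'),
    'HIGH_SCHOOL': (toy['EDUCATION'] == 3).astype('int'),
})

# The same encoding via get_dummies: one column per code, then drop the baseline.
dummies = pd.get_dummies(toy['EDUCATION'], prefix='EDU', dtype=int)
dummies = dummies.drop(columns='EDU_4')

print(manual)
print(dummies)
```

The manual version keeps readable column names; `get_dummies` is shorter when a variable has many levels.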

In [None]:
#dealing with the pay columns: any value less than or equal to 0 is treated as "not delayed" and set to 0.
#this is an assumption
PAY = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for i in PAY:
    default.loc[default[i]<=0, i] = 0
    
default.head(10)

# MODELING

In [None]:
import itertools
import matplotlib.ticker as ticker
from sklearn import preprocessing
from matplotlib.ticker import NullFormatter

In [None]:
#looking at default values so that we know their real classification
default['DEFAULT'].value_counts()

In [None]:
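#default.columns
#x variables (ID is only a row identifier; it is kept because the prediction cells further down pass it to the model)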
X = default[['ID', 'LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
       'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'GRAD_SCHOOL',
       'UNIVERSITY', 'HIGH_SCHOOL', 'MALE', 'MARRIED']]
X[0:5]


In [None]:
#y variable
y = default['DEFAULT'].values
y[0:10]

In [None]:
#The x variables are on very different scales: the indicators are 0/1 while the amounts run well above 1000,
#so scaling is important
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.25, random_state = 2)
print('Train set: ', X_train.shape, y_train.shape)
print('Test set: ', X_test.shape, y_test.shape)
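
The scaling cell above standardizes the full feature matrix before the split, so the test rows also contribute to the mean and standard deviation. A common alternative, sketched here on random toy data with plain NumPy, is to compute the statistics on the training rows only and reuse them for the test rows. The array names are hypothetical and the positional split stands in for `train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=1000.0, scale=250.0, size=(100, 3))  # toy features

# Simple 75/25 split by position (the notebook uses train_test_split instead).
X_tr, X_te = X_toy[:75], X_toy[75:]

# Statistics come from the training rows only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately centered, which is expected.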

# MODEL 1. K NEAREST NEIGHBOR (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier


In [None]:
#lets start with k= 2
k=2 
#model 
DFneigh = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
DFneigh

In [None]:
yhat = DFneigh.predict(X_test)
yhat[0:5]

In [None]:
#accuracy
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score

print('Train set Accuracy: ', metrics.accuracy_score(y_train, DFneigh.predict(X_train)))
print('Test set Accuracy: ', metrics.accuracy_score(y_test, yhat))
F1_score = f1_score(y_test, yhat, average = 'weighted')
print("the F1 score is: ", F1_score)

In [None]:
from sklearn import metrics
#searching for the best k: the loop tries k = 1 .. ks-1 (raise ks to search a wider range)
ks = 4
mean_acc= np.zeros((ks-1))
std_acc = np.zeros((ks-1))
for n in range (1, ks):
    #Train model and predict
    neighb = KNeighborsClassifier(n_neighbors = n).fit(X_train, y_train)
    #separate name so the k=2 predictions stored in yhat are not overwritten
    yhat_n = neighb.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test,yhat_n)
    #standard error of the accuracy estimate
    std_acc[n-1] = np.std(yhat_n == y_test)/np.sqrt(yhat_n.shape[0])
    
mean_acc


In [None]:

#printing the best k 
print("the best accuracy was with ", mean_acc.max(), "with k = ", mean_acc.argmax()+1)
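
The `std_acc` value stored in the loop above is the standard error of the accuracy estimate. A minimal sketch on made-up hit/miss data (80 correct out of 100 is an assumption for illustration) shows that the expression used in the loop matches the usual closed form for a proportion.

```python
import numpy as np

# Toy outcomes (assumed): 1 where the classifier was right, 0 where it was wrong.
correct = np.array([1] * 80 + [0] * 20)

n = correct.shape[0]
accuracy = correct.mean()

# Form used in the loop above: std of the hit/miss indicator over sqrt(n).
se_loop = np.std(correct) / np.sqrt(n)

# Equivalent closed form for a proportion: sqrt(p * (1 - p) / n).
se_formula = np.sqrt(accuracy * (1 - accuracy) / n)

print(accuracy, se_loop, se_formula)
```

Two values of k whose accuracies differ by less than about one standard error are hard to tell apart on a single test split.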

In [None]:
## ACCURACY REPORT

In [None]:
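#refit with the best k found in the search so the report describes that model
best_k = int(mean_acc.argmax()) + 1
yhat = KNeighborsClassifier(n_neighbors = best_k).fit(X_train, y_train).predict(X_test)
#the report
print(confusion_matrix(y_test, yhat))
print(classification_report(y_test, yhat))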
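
The numbers in the classification report follow directly from the confusion matrix. The sketch below uses a made-up 2x2 matrix in scikit-learn's layout (rows are true classes, columns are predicted classes) to show how accuracy, precision, recall and F1 for the default class are derived.

```python
# Toy confusion matrix (assumed values), rows = true class, columns = predicted class.
#       pred 0  pred 1
cm = [[50, 10],   # true 0 (no default)
      [15, 25]]   # true 1 (default)

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted defaulters, how many really defaulted
recall = tp / (tp + fn)      # of the real defaulters, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```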

# MODEL 2: DECISION TREE

In [None]:
#importing libraries
from sklearn.tree import DecisionTreeClassifier 

In [None]:
DFtree = DecisionTreeClassifier(criterion = "entropy", max_depth = 4)
DFtree

#fitting the model 
DFtree.fit(X_train, y_train)

In [None]:
predTree = DFtree.predict(X_test)
predTree[0:5]
print(y_test[0:5])

In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTree's accuracy: ", metrics.accuracy_score(y_test, predTree))

In [None]:
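#jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it
#and reports the Jaccard index of the positive (default) class
from sklearn.metrics import jaccard_score
print(jaccard_score(y_test, predTree))

from sklearn.metrics import f1_score
F1_score = f1_score(y_test, predTree, average='weighted')
F1_score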

In [None]:
## ACCURACY REPORT

In [None]:
#Evaluation for f1 score
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predTree))
print(classification_report(y_test, predTree))
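
The tree above is grown with `criterion = "entropy"`, meaning each split is chosen to reduce the entropy of the class labels as much as possible. A minimal sketch of that computation on made-up labels follows; the helper names `entropy` and `information_gain` are hypothetical and the split is chosen by hand.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node (assumed): 4 non-defaulters and 4 defaulters, split into two children.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

gain = information_gain(parent, left, right)
print(entropy(parent), gain)
```

A split that separated the classes perfectly would have a gain equal to the parent's entropy.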

# MODEL 3: SUPPORT VECTOR MACHINE

In [None]:
#importing libraries
import pylab as pl
from sklearn import svm

In [None]:
## TRAINING THE MODEL
DFsvm = svm.SVC(kernel = "rbf", gamma = 'scale')
DFsvm.fit(X_train, y_train)

In [None]:
#prediction 
yhat1 =DFsvm.predict(X_test)
yhat1[0:5]

In [None]:
#evaluation of the model using sklearn 
from sklearn import metrics
print("Accuracy is: ", metrics.accuracy_score(y_test, yhat1))

#jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it
#and reports the Jaccard index of the positive (default) class
from sklearn.metrics import jaccard_score
print(jaccard_score(y_test, yhat1))

#finding the f1 score
from sklearn.metrics import f1_score
F1_score = f1_score(y_test, yhat1, average='weighted')
F1_score

In [None]:
#report
print(classification_report(y_test, yhat1))
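
The SVM above uses an RBF kernel with `gamma = 'scale'`, which scikit-learn documents as 1 / (n_features * X.var()). The sketch below reproduces that value and one kernel evaluation with NumPy on random toy data; the function name `rbf_kernel_value` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 4))   # toy feature matrix: 50 rows, 4 features

# gamma='scale' corresponds to 1 / (n_features * X.var()).
gamma = 1.0 / (X_toy.shape[1] * X_toy.var())

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

k_same = rbf_kernel_value(X_toy[0], X_toy[0], gamma)   # identical points
k_diff = rbf_kernel_value(X_toy[0], X_toy[1], gamma)   # two different points

print(gamma, k_same, k_diff)
```

The kernel is 1 for identical points and decays toward 0 as points move apart, so a larger gamma makes the decision boundary more local.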

# MODEL 4: LOGISTIC REGRESSION

In [None]:
#importing libraries
import scipy.optimize as opt
#the data has already been split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
LR = LogisticRegression(C= 0.01, solver = 'liblinear').fit(X_train, y_train)
LR

In [None]:
#prediction 
yhat2 = LR.predict(X_test)
yhat2

In [None]:
#estimates for all classes 
yhat2_prob = LR.predict_proba(X_test)
yhat2_prob

In [None]:

#evaluation 
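#jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it
#and reports the Jaccard index of the positive (default) class
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat2)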

In [None]:
#finding the f1 score
from sklearn.metrics import f1_score
F1_score = f1_score(y_test, yhat2, average='weighted')
F1_score

In [None]:
print(classification_report(y_test, yhat2))
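
For a binary logistic regression, the probability of the positive class is the sigmoid of a weighted sum of the features, and the predicted label depends on where the threshold is placed. The sketch below uses hypothetical coefficients and a hypothetical standardized customer; none of the numbers come from the fitted `LR` model.

```python
from math import exp

def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

# Hypothetical coefficients and intercept for three standardized features.
weights = [0.8, -0.3, 0.1]
bias = -1.2
x = [1.5, 0.0, -0.5]   # one hypothetical standardized customer

z = bias + sum(w * xi for w, xi in zip(weights, x))
p_default = sigmoid(z)

label_at_050 = int(p_default >= 0.5)    # default threshold used by predict
label_at_025 = int(p_default >= 0.25)   # lower threshold favours recall

print(round(p_default, 4), label_at_050, label_at_025)
```

The same customer is labelled "no default" at 0.5 and "default" at 0.25, which is why the choice of threshold matters for recall.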

# MODEL 5: NAIVE BAYES CLASSIFIER

In [None]:
#library
from sklearn.naive_bayes import GaussianNB

In [None]:
NBC = GaussianNB()
NBC.fit(X_train, y_train)

In [None]:
#predictions 
ypred = NBC.predict(X_test)

In [None]:
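#jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it
#and reports the Jaccard index of the positive (default) class
from sklearn.metrics import jaccard_score
jaccard_score(y_test, ypred)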

In [None]:
from sklearn.metrics import f1_score
F1_score = f1_score(y_test, ypred, average='weighted')
F1_score

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, precision_recall_curve
#named metrics_df so it does not shadow the sklearn.metrics module imported earlier
metrics_df = pd.DataFrame(index=['accuracy', 'precision', 'recall'],
                      columns = ['KNNeigh', 'Desc_Trees', 'SVM', 'LogisticReg', 'NaiveB','NeuralNet'])
ypred = NBC.predict(X_test)

#scikit-learn metrics take (y_true, y_pred); swapping them exchanges precision and recall
metrics_df.loc['accuracy', 'NaiveB'] = accuracy_score(y_test, ypred)
metrics_df.loc['precision', 'NaiveB'] = precision_score(y_test, ypred)
metrics_df.loc['recall', 'NaiveB'] = recall_score(y_test, ypred)

# MODEL 6: NEURAL NETWORKS

In [None]:
# libraries 
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)

predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_train,predict_train))
print(classification_report(y_train,predict_train))

In [None]:
## ACCURACY REPORT

In [None]:
print(confusion_matrix(y_test,predict_test))
print(classification_report(y_test,predict_test))
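
To get a feel for the size of the network above, the number of trainable parameters can be counted by hand. The sketch assumes 26 input features (the columns selected for `X`) and a single output unit for the binary target; the helper name `mlp_parameter_count` is hypothetical.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out   # weight matrix + bias vector
    return total

# 26 inputs, three hidden layers of 8 units, 1 output (assumed for a binary target).
n_params = mlp_parameter_count(26, (8, 8, 8))
print(n_params)
```

With a few hundred parameters and several thousand training rows, this is a small network, so comparing the train and test reports above is a reasonable first check for overfitting.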

# THE BEST MODEL

## The best model is logistic regression (metrics in %)

Accuracy 82.17%

Precision 70%

Recall 35%

## Naive Bayes has the best recall (metrics in %)

Accuracy 76.85%

Precision 57.08%

Recall 48.05%

In [None]:
#the curve needs predicted probabilities of default, not hard 0/1 labels
precision_lr, recall_lr, thresholds_lr = precision_recall_curve(y_test, yhat2_prob[:, 1])


In [None]:
#precision and recall as a function of the classification threshold
fig, ax = plt.subplots(figsize=(8,5))
#precision_lr and recall_lr have one more entry than thresholds_lr; the final entry has no threshold
ax.plot(thresholds_lr, precision_lr[:-1], label = 'precision')
ax.plot(thresholds_lr, recall_lr[:-1], label = 'recall')
ax.set_xlabel('Classification threshold')
ax.set_ylabel('precision, recall')
ax.set_title('Logistic regression: precision and recall')
ax.hlines(y=0.6, xmin =0, xmax=1, color ='blue')
ax.legend()
ax.grid();

In [None]:
#lowering the classification threshold from 0.5 to 0.25 to catch more defaulters
yhatz = LR.predict_proba(X_test)[:,1]
ypredtest = (yhatz>=0.25).astype('int')
print(classification_report(y_test,ypredtest ))
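
Lowering the threshold trades precision for recall. The sketch below makes that trade-off concrete on ten made-up customers; the scores and labels are assumptions for illustration and the helper name `precision_recall_at` is hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the positive class at a given threshold."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (assumed): true labels and predicted probabilities of default.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

On this toy data the lower threshold catches every defaulter but flags more paying customers, which is the same pattern the report above shows for the logistic regression.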

# RECALL EXPLANATION

# PREDICTION FOR A CUSTOMER.
NEW DATA.

In [None]:
#RAW DATA
ID = 2
LIMIT_BAL= 6000
AGE= 24                    
BILL_AMT1= 608
BILL_AMT2= 57800
BILL_AMT3= 500                       
BILL_AMT4= 1000
BILL_AMT5= 600
BILL_AMT6= 1000
PAY_AMT1=6000                 
PAY_AMT2= 50
PAY_AMT3= 0
PAY_AMT4= 0
PAY_AMT5= 0
PAY_AMT6=0
MALE=-1   #NOTE: MALE is a 0/1 indicator in the training data; -1 is outside that range
GRAD_SCHOOL= 1 
UNIVERSITY= 0 
HIGH_SCHOOL= 0 
MARRIED= 1
PAY_0= 0 
PAY_2= 0
PAY_3= 0 
PAY_4= 0 
PAY_5= 1
PAY_6= 0

In [None]:
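#default.drop('ID', axis =1)
#the model was trained on standardized features in the column order of X,
#so the new row is put in that order and scaled the same way before predicting
feature_cols = ['ID', 'LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
       'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'GRAD_SCHOOL',
       'UNIVERSITY', 'HIGH_SCHOOL', 'MALE', 'MARRIED']
new_row = pd.DataFrame([[ID,LIMIT_BAL,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,
                         PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,
                         BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,
                         PAY_AMT4,PAY_AMT5,PAY_AMT6,GRAD_SCHOOL,
                         UNIVERSITY,HIGH_SCHOOL,MALE,MARRIED]], columns=feature_cols)
#refit the scaler on the cleaned feature table, which reproduces the scaling used for training
scaler = preprocessing.StandardScaler().fit(default[feature_cols].astype(float))
new_row = scaler.transform(new_row.astype(float))

prediction = LR.predict(new_row)[0]
probability = float(LR.predict_proba(new_row)[0][1])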

#features = [ 'LIMIT_BAL' , 'SEX' , 'EDUCATION' , 'MARRIAGE','AGE','PAY_MAX_SCORE','BILL_AV_AMT', 'PAY_AMT_AV', 'AVAILABLE_CRED_PERCENT']
#prints predictions


##trying with another formula.

In [None]:
def pred(probability):
    #label a customer from the predicted probability of default, using the 0.25 threshold
    if probability >=0.25:
        return 'will default'
    else:
        return 'will pay'

In [None]:
pred(probability)

In [None]:
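from sklearn import preprocessing
from matplotlib.ticker import NullFormatter

#same columns, in the same order, as the X matrix used for training
feature_cols = ['ID', 'LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
       'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'GRAD_SCHOOL',
       'UNIVERSITY', 'HIGH_SCHOOL', 'MALE', 'MARRIED']
#refit the scaler on the cleaned feature table, as was done before training;
#fitting it on a single row would turn every feature into 0
scaler = preprocessing.StandardScaler().fit(default[feature_cols].astype(float))

def ind_prediction(newdata):
    #reorder the Series to the training column order and keep it as one row
    data = newdata[feature_cols].astype(float).to_frame().T
    data[PAY] = data[PAY].clip(lower=0)   #same cleaning as the training data
    data = scaler.transform(data)
    prob = LR.predict_proba(data)[0][1]
    if prob >=0.25:
        return 'default'
    else:
        return 'will pay'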

In [None]:
pay = default[default['DEFAULT']==0]
pay.head(10)

In [None]:
from collections import OrderedDict
new_cust = OrderedDict([('ID', 0),('LIMIT_BAL', 4000), ('AGE', 50), ('BILL_AMT1', 500),
                        ('BILL_AMT2',35509), ('BILL_AMT3',689), ('BILL_AMT4', 0), 
                        ('BILL_AMT5', 0), ('BILL_AMT6',0),('PAY_AMT1',0), ('PAY_AMT2', 35509),
                        ('PAY_AMT3', 0), ('PAY_AMT4',0), ('PAY_AMT5',0), ('PAY_AMT6',0),('MALE',1),
                        ('GRAD_SCHOOL',0), ('UNIVERSITY',1),
                        ('HIGH_SCHOOL',0), ('MARRIED',1), ('PAY_0',-1), ('PAY_2', -1),('PAY_3', -1), 
                        ('PAY_4',0), ('PAY_5', -1), ('PAY_6',0)])
new_cust = pd.Series(new_cust)
ind_prediction(new_cust)