# Introduction

The aim of this project is to analyze clients' behaviour with respect to credit card issuance and repayment. The project uses data from the UCI Machine Learning Repository [found here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). The ultimate goal is to develop a model that predicts which clients will default on their credit card payments and which will not.

The project proceeds through the following stages:

- Exploratory data analysis
- Preprocessing
- Application of neural network algorithm
- Application of different classification algorithms for model generation
- Selection of best prediction model
- Training on entire dataset
- Final notes

The goal of this project is to create a reliable service that banks and other credit card issuers can use to detect clients who are likely to default on card repayments, given selected features of each client. Such a service would help issuers identify likely defaulters beforehand, thereby averting credit losses, reducing the cost of legal action arising from defaults, increasing the productivity of staff who would spend less time verifying clients' creditworthiness, and ensuring a smoother issuance process for credible clients, among other benefits.

The features of the dataset are as follows:
```
ID: ID of each client

LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

SEX: Gender (1=male, 2=female)

EDUCATION: (1 = graduate school; 2 = university; 3 = high school; 0, 4, 5, 6 = others).

MARRIAGE: Marital status (1=married, 2=single, 3=divorced, 0=others)

AGE: Age in years

PAY_0: Repayment status in September, 2005 (-2: No consumption; -1: Paid in full; 0: The use of revolving
credit, 1 = payment delay for one month; 2 = payment delay for two months; . . ., 8 = payment delay for eight months,
9 = payment delay for nine months and above.)

PAY_2: Repayment status in August, 2005 (scale same as above)

PAY_3: Repayment status in July, 2005 (scale same as above)

PAY_4: Repayment status in June, 2005 (scale same as above)

PAY_5: Repayment status in May, 2005 (scale same as above)

PAY_6: Repayment status in April, 2005 (scale same as above)

BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

default.payment.next.month: Default payment (1=yes, 0=no)
```

More explicitly, the columns that are not self-explanatory are described below:  
* PAY_0 and PAY_2–PAY_6: These columns show the repayment status of each credit-card customer listed in the dataset. The six columns cover April 2005 through September 2005, in reverse order. For example, PAY_0 indicates a customer's repayment status in September 2005 and PAY_6 indicates the customer's repayment status in April 2005.  
In each of the six PAY_X columns, the status codes mean the following: -2 = balance paid in full and no transactions this period (we may refer to such an account as 'inactive' for the period); -1 = payment made on time and in full; 0 = the customer paid at least the minimum amount due but not the entire balance, so the account remained in good standing while revolving a balance; 1 = payment delayed by one month; 2 through 8 = payment delayed by two through eight months, respectively; 9 = payment delayed by nine or more months.  
PAY_0 should ideally be renamed to PAY_1. This makes the PAY_X names conform to the naming convention used for the BILL_AMTX and PAY_AMTX columns, and it removes any question about why PAY_0 is followed immediately by PAY_2.

* BILL_AMT1–BILL_AMT6: These columns list the amount billed to each customer from April 2005 through September 2005, in reverse order. The amounts are in New Taiwan (NT) dollars.


* PAY_AMT1–PAY_AMT6: These columns list, in reverse order, the amount that each customer paid back to the credit-card company from April 2005 through September 2005. Each of these amounts was paid to settle the preceding month's bill, either in full or partially. For example, each September 2005 amount was paid to settle the corresponding customer's August 2005 bill. The amounts are in NT dollars.
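
The PAY_X coding described above can be made concrete with a small lookup. The following is only a sketch: `describe_pay_status` and `PAY_STATUS_LABELS` are hypothetical names that are not part of the notebook, and the labels simply restate the code descriptions given above. The sample values are the six repayment codes of the first client in the dataset (ID 1).

```python
# Hypothetical helper (not part of the notebook): labels restate the
# PAY_X code descriptions from the data dictionary.
PAY_STATUS_LABELS = {
    -2: "no consumption (inactive account)",
    -1: "paid in full",
    0: "revolving credit (minimum paid)",
}


def describe_pay_status(code: int) -> str:
    """Return a readable label for a PAY_X repayment status code."""
    if code in PAY_STATUS_LABELS:
        return PAY_STATUS_LABELS[code]
    if 1 <= code <= 8:
        unit = "month" if code == 1 else "months"
        return f"payment delayed {code} {unit}"
    if code == 9:
        return "payment delayed nine months or more"
    raise ValueError(f"unexpected PAY_X code: {code}")


# Repayment codes of client ID 1, from September (PAY_0) back to April (PAY_6)
months = ["Sep", "Aug", "Jul", "Jun", "May", "Apr"]
first_client = [2, 2, -1, -1, -2, -2]

for month, code in zip(months, first_client):
    print(f"{month} 2005: {describe_pay_status(code)}")
```

Read this way, the client was two months late in the two most recent months, after paying in full in June and July and having no activity in April and May.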

### Exploratory Data Analysis (EDA)

For ease of analysis, the EDA will be performed using the `pandas-profiling` package.  
The `pandas-profiling` package generates profile reports (.html or other formats) from a pandas DataFrame. The pandas `df.describe()` function is useful but somewhat basic for serious exploratory data analysis; `pandas_profiling` extends the pandas DataFrame with `df.profile_report()` for quick data analysis and produces a handy visual report that summarizes the dataset. The link to the descriptive report can be found [here]()

In [None]:
!pip install ipython-autotime

%load_ext autotime

time: 2.38 ms (started: 2021-05-29 17:33:08 +00:00)


In [None]:
# connect colab notebook to drive
from google.colab import drive

# mount google drive
drive.mount('/content/gdrive')

# change directory to project's directory
%cd /content/gdrive/My Drive/predict_credit_card_default

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/predict_credit_card_default
time: 8.17 ms (started: 2021-05-29 17:33:10 +00:00)


In [None]:
# uninstall older version of sklearn
# !pip uninstall scikit-learn==0.22.2

time: 889 µs (started: 2021-05-29 17:33:12 +00:00)


In [None]:
# install latest version of scikit learn
# !pip install scikit-learn==0.24.2

time: 807 µs (started: 2021-05-29 17:33:13 +00:00)


In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# load the dataset
data = pd.read_excel('data.xls', header = 1)

# check
data.tail()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
29995,29996,220000,1,3,1,39,0,0,0,0,0,0,188948,192815,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,29997,150000,1,3,2,43,-1,-1,-1,-1,0,0,1683,1828,3502,8979,5190,0,1837,3526,8998,129,0,0,0
29997,29998,30000,1,2,2,37,4,3,2,-1,0,0,3565,3356,2758,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,29999,80000,1,3,1,41,1,-1,0,0,0,-1,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1
29999,30000,50000,1,2,1,46,0,0,0,0,0,0,47929,48905,49764,36535,32428,15313,2078,1800,1430,1000,1000,1000,1


time: 2.39 s (started: 2021-05-29 17:33:14 +00:00)


In [None]:
# rename the 'PAY_0' and 'default payment next month' columns
data.rename(columns={'PAY_0': 'PAY_1',
                     'default payment next month': 'default'},
            inplace = True)

# check
data.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


time: 38.9 ms (started: 2021-05-29 17:33:19 +00:00)


In [None]:
# Check for missing values in the dataset
data.isnull().sum()

ID           0
LIMIT_BAL    0
SEX          0
EDUCATION    0
MARRIAGE     0
AGE          0
PAY_1        0
PAY_2        0
PAY_3        0
PAY_4        0
PAY_5        0
PAY_6        0
BILL_AMT1    0
BILL_AMT2    0
BILL_AMT3    0
BILL_AMT4    0
BILL_AMT5    0
BILL_AMT6    0
PAY_AMT1     0
PAY_AMT2     0
PAY_AMT3     0
PAY_AMT4     0
PAY_AMT5     0
PAY_AMT6     0
default      0
dtype: int64

time: 7.58 ms (started: 2021-05-29 17:33:19 +00:00)


The output above shows that none of the columns contains missing values. The dataset is therefore already clean in this respect, and we can proceed to the exploratory data analysis.

Below we install the latest version of `pandas-profiling` using pip, because the version installed by default depends on a deprecated package.  
After installation, we import the package. Before that, we separate the dependent and independent variables.

In [None]:
# get the dependent and independent variables
X = data.drop(columns=['default', 'ID'])
y = data['default']

time: 12 ms (started: 2021-05-29 17:33:21 +00:00)


In [None]:
#install latest version of pandas profiling
# !pip install pandas-profiling==3.0.0

time: 620 µs (started: 2021-05-29 17:33:22 +00:00)


In [None]:
# import pandas_profiling 
# #create a pandas profile report for the dataset
# profile = pandas_profiling.ProfileReport(data, minimal=True)

# #save the report in a html document
# profile.to_file('credi_card_default_EDA.html')

time: 878 µs (started: 2021-05-29 17:33:22 +00:00)


From the profile report generated, the following observations were discovered:

- The average credit amount, calculated from the `LIMIT_BAL` column, is `167484.3227` NT dollars, with a standard deviation of `129747.6616`. This indicates very large variation in the amount of credit issued to clients, which is also evident in the minimum credit of `1000` and the maximum of `1000000`. The skewness of the `LIMIT_BAL` feature (`0.9928669605`) shows that the variable is skewed to the right, with the majority of credit limits below `500000` NT dollars.

- Females represent 60.4% of the sampled clients in the dataset.
- The largest group of sampled clients is university graduates, at about 46.8% of clients, followed by clients with a graduate school education at 35.3%.
- Single clients represent 53.2% of all clients.
- The majority of clients are under 50 years old; the youngest client is 21 and the oldest is 79. The average age is 35 years.
- The largest share of clients used revolving credit, as shown by the profile report for the `PAY_1` column, where 49.1% of clients have the value `0`.
- The `default` column has 23,364 values in class `0` and 6,636 values in class `1` (about 78% versus 22%). The data is therefore imbalanced towards the `Non-defaulters` class. We will account for this by adjusting class weights when training the models and by evaluating model performance with the `f1` score instead of the `accuracy` score, as the **f1** score is considered a better metric for models trained on highly imbalanced classes.
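
To make the imbalance and the class-weight idea concrete, here is a minimal sketch using only the class counts reported above. The "balanced" formula, `n_samples / (n_classes * class_count)`, is the heuristic scikit-learn documents for `class_weight='balanced'`; the variable names are illustrative and not taken from the notebook.

```python
# Class counts reported for the `default` column
counts = {0: 23364, 1: 6636}

n_samples = sum(counts.values())
n_classes = len(counts)

# Ratio of non-defaulters to defaulters
ratio = counts[0] / counts[1]

# "Balanced" heuristic: n_samples / (n_classes * count_of_class),
# so the rarer class receives the larger weight
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"non-defaulters : defaulters = {ratio:.2f} : 1")
print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

Under this heuristic, a misclassified defaulter costs about 3.5 times as much as a misclassified non-defaulter, which is the same order of magnitude as a hand-set weight of 4 for the minority class.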

# Preprocessing
Since our checks found no missing values in the dataset, we move on to preprocessing the data, using the technique appropriate to each feature.

The preprocessing steps we'll take include:  
- The profile report shows that the `EDUCATION` column has 7 distinct values. Only `1=graduate school, 2=university, 3=high school, 4=others` carry explicit labels, while the rest (0, 5, 6) are unknown, so we'll recode the rows holding these under-represented values to `4` (others).
- Splitting the entire dataset into training and test sets
- Standardization
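
The order of the last two steps matters: the scaling statistics are learned from the training split only and then reused on the test split, so no information from the test set leaks into training. The sketch below illustrates this with the standard library. `fit_standardizer` and `standardize` are hypothetical helpers, and the toy "train" and "test" values are `LIMIT_BAL` figures from the data previews above, not the notebook's actual split. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # avoid dividing by zero for constant columns
    return mean, std


def standardize(column, mean, std):
    """Apply z-score scaling with previously learned statistics."""
    return [(value - mean) / std for value in column]


# Illustrative split of LIMIT_BAL values (not the notebook's real split)
train_limit_bal = [20000, 120000, 90000, 50000, 50000]
test_limit_bal = [220000, 150000]

# Statistics come from the training values only ...
mean, std = fit_standardizer(train_limit_bal)

# ... and are reused unchanged for the test values
train_scaled = standardize(train_limit_bal, mean, std)
test_scaled = standardize(test_limit_bal, mean, std)

print([round(value, 3) for value in train_scaled])
print([round(value, 3) for value in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, which is expected.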

In [None]:
# group the unknown EDUCATION codes (0, 5, 6) into 4 (others);
# .loc with a column label changes only EDUCATION, not every column of the matching rows
X.loc[X['EDUCATION'].isin([0, 5, 6]), 'EDUCATION'] = 4

X['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
4      468
Name: EDUCATION, dtype: int64

time: 21.6 ms (started: 2021-05-29 17:33:24 +00:00)


In [None]:
# import package for splitting data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=33)

# import the standard scaler
from sklearn.preprocessing import StandardScaler

# instantiate the standard scaler; fit on the training set only, then transform both sets
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

# check
X_train_ss

array([[-1.19482966,  0.67320416,  0.21251328, ..., -0.26855851,
        -0.28198535, -0.25332386],
       [-0.73314976, -1.14735997, -1.13124751, ..., -0.14411153,
        -0.31741068, -0.28694982],
       [-1.04093636, -1.14735997,  1.55627407, ..., -0.30100384,
        -0.31741068, -0.28694982],
       ...,
       [-0.42536315, -1.14735997, -1.13124751, ..., -0.18772759,
        -0.17101282, -0.17486329],
       [-0.3484165 ,  0.67320416,  1.55627407, ..., -0.25343409,
        -0.18288835, -0.26677424],
       [ 1.49830313,  0.67320416,  0.21251328, ...,  1.41552363,
        -0.1987895 , -0.16864249]])

time: 420 ms (started: 2021-05-29 17:33:25 +00:00)


The preprocessing steps are now complete for the dataset. Next we'll move on to modelling.

# Modelling

The task at hand is a supervised learning task that requires binary classification algorithms to classify credit card users as defaulters or non-defaulters. Scikit-learn implements numerous classification algorithms. For this project, seven popular classification algorithms will be used for modelling. Each model is evaluated on the test set using accuracy, ROC AUC and the f1 score. The algorithms include:

- Naive Bayes
- Logistic regression
- K-nearest neighbors
- (Kernel) SVM
- Decision tree
- Ensemble learning
- Extreme Gradient Boosting (XGB)

Deep learning techniques will also be used to build a deep neural network for classification with the `tensorflow.keras` library.

The f1 scores of the generated models will be compared, and the model with the highest f1 score will be trained on the entire dataset.


# Applying Neural Networks 

In [None]:
# import the keras module
import keras

# import the sequential model 
from keras.models import Sequential

# import dense layer that connects all nodes in the previous layer to nodes in current layer
from keras.layers import Dense

# import optimizers
from keras.optimizers import SGD

# import dropout
from keras.layers import Dropout

# as recommended in the original paper on Dropout, a constraint is imposed on the weights for each hidden layer,
# ensuring that the maximum norm of the weights does not exceed a value of 3.
# This is done by setting the kernel_constraint argument on the Dense class when constructing the layers.
from keras.constraints import maxnorm

# instantiate the model
model = Sequential()

# Add the input layer  which transforms each input to a 1-dimensional array 
model.add(keras.layers.Flatten(input_shape = (23, )))

# Add the first hidden layer with 2000 nodes, specify activation='relu'
model.add(Dense(2000, activation='relu', kernel_constraint=maxnorm(3)))

# Add the first dropout layer
model.add(Dropout(0.2))

# Add a second hidden Dense layer. This should have 1000 nodes and a 'relu' activation.
model.add(Dense(1000, activation='relu', kernel_constraint=maxnorm(3)))

# Add a second dropout layer
model.add(Dropout(0.2))

# Add a third hidden Dense layer with 500 nodes and a 'relu' activation.
model.add(Dense(500, activation='relu', kernel_constraint=maxnorm(3)))

# Add the third dropout layer
model.add(Dropout(0.2))

# Finally, add an output layer: a Dense layer with a single unit and a sigmoid activation, which outputs the probability of class 1 for binary classification.
model.add(Dense(1, activation='sigmoid'))

# see the summary of the model
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 23)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2000)              48000     
_________________________________________________________________
dropout_3 (Dropout)          (None, 2000)              0         
_________________________________________________________________
dense_5 (Dense)              (None, 1000)              2001000   
_________________________________________________________________
dropout_4 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 500)               500500    
_________________________________________________________________
dropout_5 (Dropout)          (None, 500)              

In [None]:
# Compile model
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.00001, momentum=0.9), metrics=['accuracy'])

time: 15.7 ms (started: 2021-05-27 14:37:31 +00:00)


In [None]:
# Defines model callbacks
model_path = './model_dir/model.h5'
my_callbacks = [keras.callbacks.ModelCheckpoint(model_path, monitor='loss'), 
                keras.callbacks.EarlyStopping(patience=3, monitor='loss')
                ]

time: 2.02 ms (started: 2021-05-27 14:37:33 +00:00)


In [None]:
# fit the model on the train dataset
history = model.fit(X_train_ss, y_train, epochs=100, callbacks=[my_callbacks])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [None]:
loaded_model_1 = keras.models.load_model('./model_dir/model.h5')

pred_prob = loaded_model_1.predict(X_test_ss)
nn_pred = loaded_model_1.predict_classes(X_test_ss).ravel()
nn_acc = accuracy_score(y_test, nn_pred)
nn_f1 = f1_score(y_test, nn_pred)
nn_roc_auc = roc_auc_score(y_test, pred_prob)


print(f'Accuracy score of ANN without class imbalance consideration is {nn_acc}', '\n')
print(f'ROC_AUC score of ANN without class imbalance consideration is {nn_roc_auc}', '\n')
print(f'The f1 score of ANN without class imbalance consideration is {nn_f1}', '\n')
print('The confusion matrix of ANN without class imbalance consideration is:', '\n', f'{confusion_matrix(y_test, nn_pred)}',)



Accuracy score of ANN without class imbalance consideration is 0.81 

ROC_AUC score of ANN without class imbalance consideration is 0.7087027431360287 

The f1 score of ANN without class imbalance consideration is 0.3595505617977528 

The confusion matrix of ANN without class imbalance consideration is: 
 [[2270   67]
 [ 503  160]]
time: 1.49 s (started: 2021-05-27 19:05:41 +00:00)


The confusion matrix above shows that the model is heavily biased against the minority class: it produces 503 false negatives and correctly identifies only 160 of the 663 defaulters.  
The F1 and ROC_AUC scores of 0.36 and 0.71, respectively, are modest, so let's see if we can improve both by adjusting class weights.  
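
The accuracy and F1 figures quoted above can be reproduced directly from the confusion matrix, which also makes it clear why a high accuracy can coexist with a low F1 score. This is a minimal sketch; `binary_metrics` is a hypothetical helper, and it assumes scikit-learn's layout for a binary confusion matrix, `[[TN, FP], [FN, TP]]`.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Confusion matrix of the first network, copied from the output above
ann_cm = [[2270, 67], [503, 160]]
metrics = binary_metrics(ann_cm)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall is what drags the F1 score down here: only 160 of the 663 actual defaulters are found, even though 81% of all predictions are correct.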

# Create a network that takes class imbalance into consideration

In [None]:
# instantiate the model
model_2 = Sequential()

# Add the input layer  which transforms each input to a 1-dimensional array 
model_2.add(keras.layers.Flatten(input_shape = (23, )))

# Add the first hidden layer with 4000 nodes, specify activation='relu'
model_2.add(Dense(4000, activation='relu', kernel_constraint=maxnorm(4)))

# Add the first dropout layer to reduce overfitting and ensure randomization of layers
model_2.add(Dropout(0.2))

# Add a second hidden Dense layer with 2000 nodes and a 'relu' activation.
model_2.add(Dense(2000, activation='relu', kernel_constraint=maxnorm(4)))

# Add a second dropout layer
model_2.add(Dropout(0.2))

# Add a third hidden Dense layer. This should have 1000 nodes and a 'relu' activation.
model_2.add(Dense(1000, activation='relu', kernel_constraint=maxnorm(4)))

# Add the third dropout layer
model_2.add(Dropout(0.2))

# Finally, add an output layer: a Dense layer with a single unit and a sigmoid activation, which outputs the probability of class 1 for binary classification.
model_2.add(Dense(1, activation='sigmoid'))

# see the summary of the model
model_2.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_8 (Flatten)          (None, 23)                0         
_________________________________________________________________
dense_32 (Dense)             (None, 4000)              96000     
_________________________________________________________________
dropout_24 (Dropout)         (None, 4000)              0         
_________________________________________________________________
dense_33 (Dense)             (None, 2000)              8002000   
_________________________________________________________________
dropout_25 (Dropout)         (None, 2000)              0         
_________________________________________________________________
dense_34 (Dense)             (None, 1000)              2001000   
_________________________________________________________________
dropout_26 (Dropout)         (None, 1000)             

In [None]:
# Compile model with the same learning rate as the first network but a higher momentum (0.99)
model_2.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.00001, momentum=0.99), metrics=['accuracy'])

time: 10.7 ms (started: 2021-05-27 17:08:25 +00:00)


In [None]:
# Defines model callbacks
model_path_2 = './model_dir/model_2.h5'
my_callbacks_2 = [keras.callbacks.ModelCheckpoint(model_path_2, save_best_only=True, monitor='loss'), 
                keras.callbacks.EarlyStopping(patience=3, monitor='loss')
                ]


time: 3.07 ms (started: 2021-05-27 17:08:37 +00:00)


In [None]:
# weight the minority class (defaulters) 4x, roughly the inverse of the class ratio (23,364 : 6,636, about 3.5 : 1)
weights = {0:1, 1:4}

# fit the model
history_2 = model_2.fit(X_train_ss, y_train, epochs=100, callbacks=[my_callbacks_2], class_weight=weights)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
time: 1h 11min 38s (started: 2021-05-27 17:08:45 +00:00)


In [None]:
loaded_model_2 = keras.models.load_model('./model_dir/model_2.h5')

pred_prob_2 = loaded_model_2.predict(X_test_ss)
nn_pred_2 = loaded_model_2.predict_classes(X_test_ss).ravel()
# score the second model's own predictions (nn_pred_2), not those of the first model
nn_acc_2 = accuracy_score(y_test, nn_pred_2)
nn_f1_2 = f1_score(y_test, nn_pred_2)
nn_roc_auc_2 = roc_auc_score(y_test, pred_prob_2)


print(f'Accuracy score of ANN with class imbalance consideration is {nn_acc_2}', '\n')
print(f'ROC_AUC score of ANN with class imbalance consideration is {nn_roc_auc_2}', '\n')
print(f'The f1 score of ANN with class imbalance consideration is {nn_f1_2}', '\n')
print('The confusion matrix of ANN with class imbalance consideration is:', '\n', f'{confusion_matrix(y_test, nn_pred_2)}',)



Accuracy score of ANN without class imbalnce consideration is 0.81 

ROC_AUC score of ANN without class imbalnce consideration is 0.783007439505212 

The f1 score of ANN without class imbalance consideration is 0.3595505617977528 

The confusion matrix of ANN without class imbalance consideration is: 
 [[1812  525]
 [ 230  433]]
time: 4.49 s (started: 2021-05-27 19:26:55 +00:00)


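The accuracy and f1 values printed above are identical to those of the first network and do not match the confusion matrix, so they should be disregarded. The confusion matrix itself gives an accuracy of (1812 + 433) / 3000 ≈ 0.75 and an f1 score of 2 × 433 / (2 × 433 + 525 + 230) ≈ 0.53, and it shows that the model now correctly identifies 433 of the 663 defaulters, about 65% of the minority class. The ROC_AUC score increased to 0.78.

Let's train with other classifier algorithms.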

### Multinomial Naive Bayes

Multinomial naive Bayes does not work with negative values, and the standardized array contains negative values, so we'll scale the data with a min-max scaler instead of the standard scaler.
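
As a rough illustration of why min-max scaling removes the negative values, the sketch below rescales one feature to the range [0, 1] using the training minimum and maximum. `fit_min_max` and `min_max_scale` are hypothetical helpers, and the toy "train" and "test" values are `BILL_AMT1` figures from the data previews, not the notebook's actual split.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one feature."""
    return min(column), max(column)


def min_max_scale(column, low, high):
    """Map values so that `low` becomes 0 and `high` becomes 1."""
    span = (high - low) or 1.0  # avoid dividing by zero for constant columns
    return [(value - low) / span for value in column]


# Illustrative split of BILL_AMT1 values (not the notebook's real split)
train_bill = [3913, 2682, 29239, 46990, 8617]
test_bill = [-1645, 47929]

# The range is learned from the training values only
low, high = fit_min_max(train_bill)

train_scaled = min_max_scale(train_bill, low, high)
test_scaled = min_max_scale(test_bill, low, high)

print([round(value, 3) for value in train_scaled])
print([round(value, 3) for value in test_scaled])
```

Every scaled training value lies in [0, 1]. Test values outside the training range, such as the negative bill amount here, can still map below 0 or above 1, so the non-negativity guarantee strictly applies to the data the scaler was fitted on.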

In [None]:
# import the min-max scaler
from sklearn.preprocessing import MinMaxScaler

# import the naive bayes classifier
from sklearn.naive_bayes import MultinomialNB

# import gridsearch cv
from sklearn.model_selection import GridSearchCV

# instantiate the min-max scaler
mms = MinMaxScaler()

X_train_mm = mms.fit_transform(X_train)
X_test_mm = mms.transform(X_test)

# instantiate the classifier with default parameters
mnb_model = MultinomialNB()

# define search space
space_mnb = dict()
space_mnb['alpha'] = [0.1, 0.01, 0.001, 0.0001, 0.00001, 1.0, 2.0]

# grid search the hyper parameter
mnb_search = GridSearchCV(estimator=mnb_model, param_grid=space_mnb, cv=10, n_jobs=-1)


# fit the mnb classifier on the train set
mnb_result = mnb_search.fit(X_train_mm, y_train)

# print the best score from the search
print(f'The best score from the search is {mnb_result.best_score_}', '\n')

# print the best parameter
print(f'The best parameter from the search is {mnb_result.best_params_}')

The best score from the search is 0.7787777777777778 

The best parameter from the search is {'alpha': 0.1}
time: 2.22 s (started: 2021-05-27 18:50:55 +00:00)


Using the best parameter found by the search, we'll now train the model on the training data.

In [None]:
# instantiate the MNB with the resultant alpha parameter
mnb = MultinomialNB(alpha=0.1)

# fit the model on the train data
mnb.fit(X_train_mm, y_train)

# predict on the test set
mnb_pred = mnb.predict(X_test_mm)

mnb_acc = accuracy_score(y_test, mnb_pred)
mnb_f1 = f1_score(y_test, mnb_pred)
mnb_roc_auc = roc_auc_score(y_test, mnb_pred)


print(f'Accuracy score of MNB is {mnb_acc}', '\n')
print(f'ROC_AUC score of MNB is {mnb_roc_auc}', '\n')
print(f'The f1 score MNB is {mnb_f1}', '\n')
print('The confusion matrix of MNB is:', '\n', f'{confusion_matrix(y_test, mnb_pred)}',)

Accuracy score of MNB is 0.779 

ROC_AUC score of MNB is 0.5 

The f1 score MNB is 0.0 

The confusion matrix of MNB is: 
 [[2337    0]
 [ 663    0]]
time: 40.6 ms (started: 2021-05-27 19:34:57 +00:00)


MNB performed worst: it classified every client as a non-defaulter, giving 2,337 true negatives, 663 false negatives and no positive predictions at all, hence the f1 score of 0.

# Logistic Regression

In [None]:
# import logistic regression classifier
from sklearn.linear_model import LogisticRegression

# instantiate the LR classifier with default parameters
lr_model = LogisticRegression()

# define search space
space_lr = dict()
space_lr['solver'] = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
space_lr['penalty'] = ['l1', 'l2', 'elasticnet']
space_lr['C'] = [0.1, 0.5, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 1.0, 2.0, ]

# grid search the hyper parameter
lr_search = GridSearchCV(estimator=lr_model, param_grid=space_lr, cv=10, n_jobs=-1)

# fit the lr classifier on the train set
lr_result = lr_search.fit(X_train_ss, y_train)

# print the best score from the search
print(f'The best score from the search is {lr_result.best_score_}', '\n')

# print the best parameters from the search
print(f'The best parameter from the search is {lr_result.best_params_}')

The best score from the search is 0.8042592592592592 

The best parameter from the search is {'C': 0.5, 'penalty': 'l1', 'solver': 'liblinear'}
time: 3min 26s (started: 2021-05-01 11:47:21 +00:00)


In [None]:
# import logistic regression classifier
from sklearn.linear_model import LogisticRegression

# instantiate the LR with the solver found by the search and a balanced class weight to account for class imbalance
# (C and penalty are left at their defaults here)
lr = LogisticRegression(solver='liblinear', class_weight='balanced')

# fit the model on the train data
lr.fit(X_train_ss, y_train)

# predict on the test set
lr_pred = lr.predict(X_test_ss)

lr_acc = accuracy_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred)
lr_roc_auc = roc_auc_score(y_test, lr_pred)


print(f'Accuracy score of Logistic Regression is {lr_acc}', '\n')
print(f'ROC_AUC score of Logistic Regression is {lr_roc_auc}', '\n')
print(f'The f1 score Logistic Regression is {lr_f1}', '\n')
print('The confusion matrix of Logistic Regression is:', '\n', f'{confusion_matrix(y_test, lr_pred)}',)

Accuracy score of Logistic Regression is 0.6676666666666666 

ROC_AUC score of Logistic Regression is 0.6608261355297524 

The f1 score Logistic Regression is 0.4631125471190091 

The confusion matrix of Logistic Regression is: 
 [[1573  764]
 [ 233  430]]
time: 157 ms (started: 2021-05-27 19:50:35 +00:00)


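With balanced class weights, the Logistic Regression classifier's test accuracy (0.67) is lower than both the cross-validated accuracy from the grid search (0.80), which was run without class weights, and the accuracy of the MNB classifier (0.78).  
But since we're judging by the f1 and ROC_AUC scores, the LR model clearly performs far better than MNB (f1 of 0.46 versus 0.0, ROC_AUC of 0.66 versus 0.5). The ANN with class imbalance consideration still has the higher ROC_AUC (0.78), although that figure was computed from predicted probabilities while the LR figure was computed from hard class labels, so the two are not strictly comparable.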

# K-Nearest Neighbors Classifier

In [None]:
# Import the KNN classifier
from sklearn.neighbors import  KNeighborsClassifier

# instantiate the KNN classifier with default parameters
knn_model = KNeighborsClassifier()

# define search space for hyperparameters
space_knn = dict()
space_knn['n_neighbors'] = [2, 3, 4, 5, 6, 7, 8, 9, 10]
space_knn['weights'] = ['uniform', 'distance']
space_knn['algorithm'] = ['auto', 'ball_tree', 'kd_tree', 'brute']
space_knn['leaf_size'] = [10, 20, 30, 40, 50]

# grid search the hyperparameters
knn_search = GridSearchCV(estimator=knn_model, param_grid=space_knn, cv=10, n_jobs=-1)

# fit the knn classifier on the train set
knn_result = knn_search.fit(X_train_ss, y_train)

# print the best score from the search
print(f'The best score from the search is {knn_result.best_score_}', '\n')

# print the best parameters from the search
print(f'The best parameter from the search is {knn_result.best_params_}')



The best score from the search is 0.8045555555555556 

The best parameter from the search is {'algorithm': 'auto', 'leaf_size': 50, 'n_neighbors': 10, 'weights': 'uniform'}
time: 2h 22min 55s (started: 2021-05-01 12:07:34 +00:00)


In [None]:
# Import the KNN classifier
from sklearn.neighbors import  KNeighborsClassifier

# instantiate the KNN classifier with the resultant hyperparameters
knn = KNeighborsClassifier(algorithm='auto', leaf_size=50, n_neighbors=10, weights='uniform', n_jobs=-1)

# fit the model on the train data
knn.fit(X_train_ss, y_train)

# predict on the test set
knn_pred = knn.predict(X_test_ss)

knn_acc = accuracy_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred)
knn_roc_auc = roc_auc_score(y_test, knn_pred)


print(f'Accuracy score of KNN is {knn_acc}', '\n')
print(f'ROC_AUC score of KNN is {knn_roc_auc}', '\n')
print(f'The f1 score KNN is {knn_f1}', '\n')
print('The confusion matrix of KNN is:', '\n', f'{confusion_matrix(y_test, knn_pred)}',)

Accuracy score of KNN is 0.8163333333333334 

ROC_AUC score of KNN is 0.6357833940330353 

The f1 score KNN is 0.42901554404145076 

The confusion matrix of KNN is: 
 [[2242   95]
 [ 456  207]]
time: 2.8 s (started: 2021-05-27 20:01:40 +00:00)


KNN has the best accuracy score so far, but accuracy is not our metric of interest. The ROC_AUC score of KNN is considerably lower than that of LR and ANN. Its f1 score is also lower than that of LR, but greater than that of ANN and MNB.
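
One caveat on the ROC_AUC figures: `roc_auc_score` is called on hard class predictions in these cells, which evaluates the classifier at a single threshold only. A minimal sketch, assuming a synthetic imbalanced dataset from `make_classification` (not the credit card data) and hypothetical variable names, of how the score can be computed from the positive-class probabilities returned by `predict_proba` instead:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data with roughly a 78/22 class split (an assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: a single operating point
auc_from_labels = roc_auc_score(y_te, knn_demo.predict(X_te))

# AUC from the probability of the positive class: the full ranking
auc_from_scores = roc_auc_score(y_te, knn_demo.predict_proba(X_te)[:, 1])

print(f'AUC from labels: {auc_from_labels:.3f}')
print(f'AUC from probabilities: {auc_from_scores:.3f}')
```

The two numbers generally differ, so scores computed one way should not be compared with scores computed the other way.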

# Support Vector Machine (SVM Classifier)

In [None]:
# import the SVM classifier
from sklearn.svm import SVC

# instantiate the SVC classifier with a hyperparameter to account for imbalanced classes
svc = SVC(class_weight='balanced')

# fit the model on the train data
svc.fit(X_train_ss, y_train)
# predict on the test set
svc_pred = svc.predict(X_test_ss)

svc_acc = accuracy_score(y_test, svc_pred)
svc_f1 = f1_score(y_test, svc_pred)
svc_roc_auc = roc_auc_score(y_test, svc_pred)

print(f'Accuracy score of SVC is {svc_acc}', '\n')
print(f'ROC_AUC score of SVC is {svc_roc_auc}', '\n')
print(f'The f1 score SVC is {svc_f1}', '\n')
print('The confusion matrix of SVC is:', '\n', f'{confusion_matrix(y_test, svc_pred)}')

Accuracy score of SVC is 0.7823333333333333 

ROC_AUC score of SVC is 0.7144374289658593 

The f1 score SVC is 0.5462126476719944 

The confusion matrix of SVC is: 
 [[1954  383]
 [ 270  393]]
time: 60 s (started: 2021-05-29 11:47:21 +00:00)


So far, SVC has the best f1 score of all the algorithms, including the ANN.
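
`class_weight='balanced'` reweights each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. A small sketch of that arithmetic follows. It uses the test-set class counts from the confusion matrix above (2,337 non-defaulters and 663 defaulters) purely as illustrative numbers, since the real weights are derived from the training labels, and the helper name `balanced_class_weights` is hypothetical:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# illustrative labels: 2337 non-defaulters (0) and 663 defaulters (1)
labels_demo = [0] * 2337 + [1] * 663
weights = balanced_class_weights(labels_demo)

for cls, weight in sorted(weights.items()):
    print(f'class {cls}: weight {weight:.4f}')
```

With these counts, a misclassified defaulter costs roughly 3.5 times as much as a misclassified non-defaulter, which is why the balanced models trade accuracy for recall on the minority class.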

# Decision Trees
Use grid search to find the best parameters 

In [None]:
# import decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# instantiate the DT classifier with default parameters
dtc_model = DecisionTreeClassifier()

# define search space for hyperparameters
space_dtc = dict()
space_dtc['criterion'] = ['gini', 'entropy']
space_dtc['splitter'] = ['best', 'random']
space_dtc['min_samples_split'] = [2, 4, 6, 8, 10]
space_dtc['max_features'] = ['auto', 'sqrt', 'log2']

# grid search the hyperparameters
dtc_search = GridSearchCV(estimator=dtc_model, param_grid=space_dtc, cv=10, n_jobs=-1)

# fit the DT classifier on the train set
dtc_result = dtc_search.fit(X_train_ss, y_train)

# print the best score from the search
print(f'The best score from the search is {dtc_result.best_score_}', '\n')

# print the best parameters from the search
print(f'The best parameter from the search is {dtc_result.best_params_}')

The best score from the search is 0.7732962962962964 

The best parameter from the search is {'criterion': 'gini', 'max_features': 'log2', 'min_samples_split': 10, 'splitter': 'random'}
time: 45.2 s (started: 2021-05-03 09:25:26 +00:00)


In [None]:
# import decision tree classifier
from sklearn.tree import DecisionTreeClassifier

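# instantiate the DT classifier with the splitter and max_features found by the search and account for class imbalance
# (criterion='entropy' and the default min_samples_split are used here rather than the search's 'gini' and 10)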
dtc = DecisionTreeClassifier(splitter='random', max_features='log2', class_weight='balanced', criterion='entropy')

# fit the model on the train data
dtc.fit(X_train_ss, y_train)

# predict on the test set
dtc_pred = dtc.predict(X_test_ss)

dtc_acc = accuracy_score(y_test, dtc_pred)
dtc_f1 = f1_score(y_test, dtc_pred)
dtc_roc_auc = roc_auc_score(y_test, dtc_pred)

print(f'Accuracy score of DTC is {dtc_acc}', '\n')
print(f'ROC_AUC score of DTC is {dtc_roc_auc}', '\n')
print(f'The f1 score of DTC is {dtc_f1}', '\n')
print('The confusion matrix of DTC is:', '\n', f'{confusion_matrix(y_test, dtc_pred)}')

Accuracy score of DTC is 0.7376666666666667 

ROC_AUC score of DTC is 0.6285071745692451 

The f1 score of DTC is 0.4217487141807495 

The confusion matrix of DTC is: 
 [[1926  411]
 [ 376  287]]
time: 78.5 ms (started: 2021-05-29 12:10:29 +00:00)


The accuracy score of the decision tree classifier is lower than that of all the preceding models.
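
The grid searches in this notebook rely on the default `scoring`, which for classifiers is accuracy, while the models are compared on f1. A hedged sketch, on synthetic data and with a deliberately small grid (both assumptions for illustration), of passing `scoring='f1'` so that the search optimises the metric of interest:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced data standing in for the scaled training set
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# a small illustrative grid, not the one used in the notebook
param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 10],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight='balanced', random_state=0),
    param_grid=param_grid,
    scoring='f1',  # rank candidates by f1 instead of the default accuracy
    cv=5,
)
search.fit(X_demo, y_demo)

print(f'Best f1 from the search: {search.best_score_:.3f}')
print(f'Best parameters: {search.best_params_}')
```

Setting `class_weight` inside the searched estimator also means the tuned parameters are chosen under the same imbalance handling as the final model.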

# Random Forest Classifier

In [None]:
# import random forest classifier
from sklearn.ensemble import RandomForestClassifier

# instantiate the RF classifier with class weighting, cost-complexity pruning and the entropy criterion
rfc = RandomForestClassifier(max_features='log2', class_weight='balanced', n_jobs=-1, ccp_alpha=0.001, criterion='entropy')

# fit the model on the train data
rfc.fit(X_train_ss, y_train)

# predict on the test set
rfc_pred = rfc.predict(X_test_ss)

rfc_acc = accuracy_score(y_test, rfc_pred)
rfc_f1 = f1_score(y_test, rfc_pred)
rfc_roc_auc = roc_auc_score(y_test, rfc_pred)

print(f'Accuracy score of RFC is {rfc_acc}', '\n')
print(f'ROC_AUC score of RFC is {rfc_roc_auc}', '\n')
print(f'The f1 score of RFC is {rfc_f1}', '\n')
print('The confusion matrix of RFC is:', '\n', f'{confusion_matrix(y_test, rfc_pred)}')

Accuracy score of RFC is 0.7763333333333333 

ROC_AUC score of RFC is 0.7257118903649146 

The f1 score of RFC is 0.5565102445472571 

The confusion matrix of RFC is: 
 [[1908  429]
 [ 242  421]]
time: 13.1 s (started: 2021-05-29 12:22:55 +00:00)


Random Forest has the best f1 score so far on the test set. Its ROC_AUC score was beaten by that of the neural network model by a small margin.

# XGB

In [None]:
# get the ratio of class values in the default column
from collections import Counter

counter = Counter(y)

estimate = counter[0] / counter[1]
estimate

3.5207956600361663

time: 11.2 ms (started: 2021-05-29 17:47:20 +00:00)


In [None]:
import xgboost as xgb

# instantiate an XGB classifier that accounts for class imbalance
xgb_model= xgb.XGBClassifier(scale_pos_weight=round(estimate, 1))

xgb_model.fit(X_train_ss, y_train)

xgb_pred = xgb_model.predict(X_test_ss)

xgb_acc = accuracy_score(y_test, xgb_pred)
xgb_f1 = f1_score(y_test, xgb_pred)
xgb_roc_auc = roc_auc_score(y_test, xgb_pred)


print(f'Accuracy score of XGB is {xgb_acc}', '\n')
print(f'ROC_AUC score of XGB is {xgb_roc_auc}', '\n')
print(f'The f1 score of XGB is {xgb_f1}', '\n')
print('The confusion matrix of XGB is:', '\n', f'{confusion_matrix(y_test, xgb_pred)}',)

Accuracy score of XGB is 0.7713333333333333 

ROC_AUC score of XGB is 0.7333066138472768 

The f1 score of XGB is 0.5625 

The confusion matrix of XGB is: 
 [[1873  464]
 [ 222  441]]
time: 2.72 s (started: 2021-05-29 17:10:21 +00:00)


Both the f1 and ROC_AUC scores of XGBoost are better than those of the Random Forest Classifier.
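
The reported XGB scores can be recovered by hand from the confusion matrix above. The sketch below does the arithmetic in plain Python; it also shows that `roc_auc_score` applied to hard 0/1 predictions equals the mean of recall and specificity. The variable names are chosen for illustration only.

```python
# confusion matrix of XGB on the test set: [[TN, FP], [FN, TP]]
tn, fp = 1873, 464
fn, tp = 222, 441

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # share of predicted defaulters that really default
recall = tp / (tp + fn)       # share of real defaulters that are caught
specificity = tn / (tn + fp)  # share of non-defaulters correctly cleared
f1 = 2 * tp / (2 * tp + fp + fn)

# with hard labels the ROC curve has one interior point, so its area
# is the average of recall and specificity
roc_auc_hard = (recall + specificity) / 2

print(f'accuracy    = {accuracy:.4f}')
print(f'precision   = {precision:.4f}')
print(f'recall      = {recall:.4f}')
print(f'f1          = {f1:.4f}')
print(f'ROC_AUC     = {roc_auc_hard:.4f}')
```

The precision of about 0.49 means that roughly half of the clients flagged as defaulters do not default, which is the price paid for catching about two thirds of the real defaulters.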

# Selection of best prediction model

Of all the applied algorithms, **eXtreme Gradient Boosting** performed best, yielding an **f1 score** of approximately **0.56** on the test set of **3,000** samples. This outperformed **Random Forest's 0.557**, **SVC's 0.546**, **Multinomial Naive Bayes' 0.5**, **Logistic Regression's 0.463**, **KNN's 0.429**, **Decision Tree's 0.422** and the Neural Network's **0.359**. Thus, the XGB classifier was selected for training on the entire dataset.

# Training on entire dataset
The XGB classifier was trained on the entire dataset, so that the model sees as much data as possible and can predict better on unseen data. The **f1 score** is still computed on the test data and compared with the **f1 score** of the XGB model trained on the training data only, to check whether there was an improvement. The accuracy score of the model trained on the entire data is also computed to validate the model's performance.




In [None]:
# check the entire dataset
data.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


time: 41.9 ms (started: 2021-05-29 17:33:57 +00:00)


We'll preprocess the entire data using the following steps:

- Drop the `ID` column
- Isolate the `default` column which is the predicted feature
- Bin values in the `EDUCATION` column
- Scale the features data using standard scaler
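
The steps above can be sketched as a single helper. This is a minimal illustration on a three-row toy frame with only a few of the columns; the function name `preprocess` and the toy values are hypothetical, and the binning is written with `.loc` so that only the `EDUCATION` column is modified:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df):
    """Return (scaled features, target, fitted scaler) for a raw frame."""
    # isolate the predicted feature and drop it together with the ID
    target = df['default']
    features = df.drop(columns=['ID', 'default'])

    # bin the rare EDUCATION codes 0, 5 and 6 into 4 ("others")
    features.loc[features['EDUCATION'].isin([0, 5, 6]), 'EDUCATION'] = 4

    # standardise every feature to zero mean and unit variance
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    return scaled, target, scaler


# toy data with a subset of the real columns (illustrative values)
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'LIMIT_BAL': [20000, 120000, 90000],
    'EDUCATION': [2, 5, 0],
    'AGE': [24, 26, 34],
    'default': [1, 1, 0],
})

scaled_toy, target_toy, scaler_toy = preprocess(toy)
print(scaled_toy.shape)
print(scaler_toy.mean_)
```

Returning the fitted scaler alongside the data makes it straightforward to save it and reuse it on new inputs later.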

#### Drop the ID column

In [None]:
# check the shape of the data
data.shape

(30000, 25)

time: 5.08 ms (started: 2021-05-29 17:34:02 +00:00)


In [None]:
# Isolate the predicted feature
target = data['default']

# check
print(target.shape)

# drop ID and default columns
data.drop(['ID', 'default'], axis=1, inplace=True)

# check
data.head()

(30000,)


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679


time: 39.1 ms (started: 2021-05-29 17:34:03 +00:00)


In [None]:
# check the values of the education column
data['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
5      280
4      123
6       51
0       14
Name: EDUCATION, dtype: int64

time: 7.43 ms (started: 2021-05-29 17:34:04 +00:00)


In [None]:
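# bin values of the education column
# use .loc with the column name so that only EDUCATION is overwritten
# (assigning to data[mask] would set every column of the matching rows to 4)
data.loc[data['EDUCATION'].isin([0, 5, 6]), 'EDUCATION'] = 4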

# check
data['EDUCATION'].value_counts()

2    14030
1    10585
3     4917
4      468
Name: EDUCATION, dtype: int64

time: 13.6 ms (started: 2021-05-29 17:34:05 +00:00)


### Preprocess the entire data using standard scaler and save the scaler

In [None]:
# check the data
data.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679


time: 37.6 ms (started: 2021-05-29 17:34:08 +00:00)


In [None]:
# import the scaler
from sklearn.preprocessing import StandardScaler

# import joblib for saving and loading binary files
from joblib import dump, load

# instantiate the scaler
scaler = StandardScaler()

# scale the features data using standard scaler
scaler.fit(data)

# save the scaler
dump(scaler, 'scaler.joblib')

['scaler.joblib']

time: 33 ms (started: 2021-05-29 17:34:21 +00:00)


In [None]:
# check the scaler mean and standard deviations
print(scaler.mean_, '\n'); scaler.scale_

[ 1.65561502e+05  1.63100000e+00  1.84226667e+00  1.58076667e+00
  3.51064667e+01  3.09666667e-02 -8.39333333e-02 -1.15633333e-01
 -1.70066667e-01 -2.15200000e-01 -2.38500000e-01  5.03164653e+04
  4.83344410e+04  4.62154227e+04  4.25832832e+04  3.97323530e+04
  3.83719462e+04  5.58849837e+03  5.82144120e+03  5.13657327e+03
  4.76963970e+03  4.73512893e+03  5.11648983e+03] 



array([1.30360641e+05, 5.49398762e-01, 7.44482054e-01, 5.80755323e-01,
       9.75462958e+00, 1.19747278e+00, 1.27164796e+00, 1.27318325e+00,
       1.24945748e+00, 1.21784877e+00, 1.23359275e+00, 7.31004605e+04,
       7.06607261e+04, 6.88607994e+04, 6.39480431e+04, 6.04612734e+04,
       5.92820578e+04, 1.64941099e+04, 2.29447389e+04, 1.74765097e+04,
       1.56169251e+04, 1.51522715e+04, 1.75499039e+04])

time: 8.93 ms (started: 2021-05-29 17:34:30 +00:00)


In [None]:
# load the saved scaler
load_scaler = load('scaler.joblib')

# check
load_scaler

StandardScaler()

time: 8.81 ms (started: 2021-05-29 17:34:32 +00:00)


In [None]:
# transform the entire data
scaled_data = load_scaler.transform(data)

# check
scaled_data

array([[-1.11660622,  0.6716433 ,  0.21186989, ..., -0.30541478,
        -0.31250291, -0.29153948],
       [-0.34950351,  0.6716433 ,  0.21186989, ..., -0.24138169,
        -0.31250291, -0.17757874],
       [-0.57963432,  0.6716433 ,  0.21186989, ..., -0.24138169,
        -0.2465062 , -0.00663763],
       ...,
       [-1.03989595, -1.14852825,  0.21186989, ..., -0.03647579,
        -0.1805095 , -0.11490033],
       [-0.65634459, -1.14852825,  1.55508562, ..., -0.18208704,
         3.1829466 , -0.18874689],
       [-0.8864754 , -1.14852825,  0.21186989, ..., -0.24138169,
        -0.2465062 , -0.23455911]])

time: 11.3 ms (started: 2021-05-29 17:34:44 +00:00)


In [None]:
import xgboost as xgb

# instantiate an XGB classifier that accounts for class imbalance
xgb_model_all= xgb.XGBClassifier(scale_pos_weight=round(estimate, 1))

xgb_model_all.fit(scaled_data, y)

xgb_all_pred = xgb_model_all.predict(X_test_ss)

xgb_all_acc = accuracy_score(y_test, xgb_all_pred)
xgb_all_f1 = f1_score(y_test, xgb_all_pred)
xgb_all_roc_auc = roc_auc_score(y_test, xgb_all_pred)


print(f'Accuracy score of XGB Trained on entire data is {xgb_all_acc}', '\n')
print(f'ROC_AUC score of XGB Trained on entire data is {xgb_all_roc_auc}', '\n')
print(f'The f1 score of XGB Trained on entire data is {xgb_all_f1}', '\n')
print('The confusion matrix of XGB Trained on entire data is:', '\n', f'{confusion_matrix(y_test, xgb_all_pred)}')

Accuracy score of XGB Trained on entire data is 0.8006666666666666 

ROC_AUC score of XGB Trained on entire data is 0.720262470545639 

The f1 score of XGB Trained on entire data is 0.5609397944199707 

The confusion matrix of XGB Trained on entire data is: 
 [[2020  317]
 [ 281  382]]
time: 2.91 s (started: 2021-05-29 17:49:56 +00:00)


# Final notes

The above shows an improvement in the accuracy score of XGB when trained on more data: the **accuracy score** improved from **0.771** to **0.801**.

The **f1 score**, however, showed a fractional decline. When the model was trained on just the **training data** of **27,000** observations, the **f1 score** was **0.5625**; when it was trained on the **entire data** of **30,000** observations, the **f1 score** dropped slightly to **0.561**. Note that the test rows are part of the entire data, so the scores of the second model are measured partly on rows it was trained on and are likely optimistic.
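
A cross-validated estimate avoids scoring a model on rows it was fitted on while still using every observation. A rough sketch, assuming synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (neither is what this notebook used); the scaler sits inside the pipeline so that it is refitted on each training fold:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic imbalanced data standing in for the full dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# scaling and the classifier are fitted together on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    GradientBoostingClassifier(random_state=0),
)

# stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipeline, X_demo, y_demo, scoring='f1', cv=cv)

print(f'f1 per fold: {f1_scores.round(3)}')
print(f'mean f1: {f1_scores.mean():.3f} (std {f1_scores.std():.3f})')
```

The mean across folds gives a performance estimate for a model trained on nearly all of the data, and the final model can then be refitted on everything for deployment.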

# Save XGBoost model

In [None]:
dump(xgb_model_all, 'xgb_model.joblib')

['xgb_model.joblib']

time: 14.2 ms (started: 2021-05-29 18:01:38 +00:00)


# Perform Scaling and Check the Class of a New Input

In [None]:
# load model
loaded_model = load('xgb_model.joblib')

# scale the input
Raw_Data =  np.array([50000, 1, 2, 1, 46, 3, 1, 1, 1, 1, 1, 47929, 48905, 49764, 36535, 32428, 15313, 0, 0, 0, 0, 0, 0]).reshape(1, -1)
scaled_input = load_scaler.transform(Raw_Data)

# check
scaled_input

array([[-0.8864754 , -1.14852825,  0.21186989, -1.00001953,  1.1167552 ,
         2.47941613,  0.85238476,  0.87625511,  0.93645977,  0.99782504,
         1.00397801, -0.03266006,  0.00807463,  0.05153262, -0.09458121,
        -0.12081044, -0.38897007, -0.33881782, -0.25371573, -0.29391299,
        -0.30541478, -0.31250291, -0.29153948]])

time: 14.4 ms (started: 2021-05-29 18:03:18 +00:00)


In [None]:
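# get prediction result
result = loaded_model.predict(scaled_input)

print(result)
# predict returns the class label (0 or 1), so compare it directly;
# a probability threshold such as 0.3 would have to be applied to predict_proba instead
if result[0] == 1:
    print('Defaulter')
else:
    print('Non Defaulter')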

[1]
Defaulter
time: 7.05 ms (started: 2021-05-29 18:03:29 +00:00)
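
`predict` already returns class labels, so a custom cut-off such as 0.3 only takes effect when it is applied to the positive-class probability from `predict_proba`. A minimal sketch, assuming synthetic data and a scikit-learn `LogisticRegression` as a stand-in for the saved XGBoost model (`XGBClassifier` exposes the same `predict_proba` method); the helper name `classify` is hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced data and a stand-in model (assumptions for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.78, 0.22], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)


def classify(model, rows, threshold=0.3):
    """Label each row using a custom cut-off on the default probability."""
    proba_default = model.predict_proba(rows)[:, 1]
    return ['Defaulter' if p >= threshold else 'Non Defaulter' for p in proba_default]


labels_demo = classify(clf, X_demo[:5])
print(labels_demo)

# a lower threshold can only flag the same number of clients or more
flagged_03 = classify(clf, X_demo, threshold=0.3).count('Defaulter')
flagged_05 = classify(clf, X_demo, threshold=0.5).count('Defaulter')
print(f'Flagged at 0.3: {flagged_03}, flagged at 0.5: {flagged_05}')
```

Lowering the threshold raises recall on defaulters at the cost of more false alarms, so the value is best chosen on validation data rather than fixed in advance.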
