# Credit Risk Classification

In this notebook, a binary classification model is built and trained on a dataset of credit customers and their transactions to determine whether a customer is high or low risk.


# Outline
- [ 1 - Packages ](#1)
- [ 2 - Preprocessing Data](#2)
  - [ 2.1 Loading and Visualizing the Data](#2.1)
  - [ 2.2 Processing](#2.2)
  - [ 2.3 Features and Labels](#2.3)
- [ 3 - Classification Model](#3)
  - [ 3.1 Training](#3.1)
- [ 4 - Results](#4)

<a name="1"></a>
## 1 - Packages 

Below are all the needed packages for this notebook.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
- [tensorflow](https://www.tensorflow.org) is an end-to-end machine learning platform.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.activations import sigmoid, relu
from tensorflow.keras.callbacks import TensorBoard

<a name="2"></a>
## 2 - Preprocessing Data

The dataset for the model contains customer transaction and demographic data, along with a label indicating whether the customer is considered risky for specific products.
The demographic data has been encoded into numbered features, and the OVD features indicate overdue payment types.

The dataset can be found here: [Credit Risk](https://www.kaggle.com/datasets/praveengovi/credit-risk-classification-dataset?datasetId=637189&sortBy=)
<br/><br/>
<a name="2.1"></a>
### 2.1 Loading and Visualizing the Data

In [2]:
#Load Data
customer_data = pd.read_csv("./Data/customer_data.csv")
payment_data = pd.read_csv("./Data/payment_data.csv")

In [3]:
customer_data.head()

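| | label | id | fea_1 | fea_2 | fea_3 | fea_4 | fea_5 | fea_6 | fea_7 | fea_8 | fea_9 | fea_10 | fea_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 54982665 | 5 | 1245.5 | 3 | 77000.0 | 2 | 15 | 5 | 109 | 5 | 151300 | 244.948974 |
| 1 | 0 | 59004779 | 4 | 1277.0 | 1 | 113000.0 | 2 | 8 | -1 | 100 | 3 | 341759 | 207.17384 |
| 2 | 0 | 58990862 | 7 | 1298.0 | 1 | 110000.0 | 2 | 11 | -1 | 101 | 5 | 72001 | 1.0 |
| 3 | 1 | 58995168 | 7 | 1335.5 | 1 | 151000.0 | 2 | 11 | 5 | 110 | 3 | 60084 | 1.0 |
| 4 | 0 | 54987320 | 7 | NaN | 2 | 59000.0 | 2 | 11 | 5 | 108 | 4 | 450081 | 197.403141 |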


In [4]:
payment_data.head()

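| | id | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | prod_code | prod_limit | update_date | new_balance | highest_balance | report_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58987402 | 0 | 0 | 0 | 0 | 1 | 10 | 16500.0 | 04/12/2016 | 0.0 | NaN | NaN |
| 1 | 58995151 | 0 | 0 | 0 | 0 | 1 | 5 | NaN | 04/12/2016 | 588720.0 | 491100.0 | NaN |
| 2 | 58997200 | 0 | 0 | 0 | 0 | 2 | 5 | NaN | 04/12/2016 | 840000.0 | 700500.0 | 22/04/2016 |
| 3 | 54988608 | 0 | 0 | 0 | 0 | 3 | 10 | 37400.0 | 03/12/2016 | 8425.2 | 7520.0 | 25/04/2016 |
| 4 | 54987763 | 0 | 0 | 0 | 0 | 2 | 10 | NaN | 03/12/2016 | 15147.6 | NaN | 26/04/2016 |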


In [5]:
print("# of items/customers in customer_data: ", len(customer_data))

# of items/customers in customer_data:  1125


In [6]:
print("# of items in payment_data: ", len(payment_data))
print("# of unique ids' in payment_data: ", len(payment_data["id"].unique()))

# of items in payment_data:  8250
# of unique ids' in payment_data:  1125


We can observe the number of customers in our customer data matches the number of unique IDs in the payment data. Let's join our dataframes to have all the transactions for each customer, with the relevant customer information.
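
Matching counts do not by themselves guarantee that both files contain the same IDs, so it can be worth comparing the ID sets directly before joining. The sketch below uses two small made-up frames (`customers` and `payments` are hypothetical stand-ins, not the real files) to show the check and the shape of a join on `id`.

```python
import pandas as pd

# Toy stand-ins for customer_data and payment_data (values are made up).
customers = pd.DataFrame({"id": [101, 102, 103], "label": [1, 0, 0]})
payments = pd.DataFrame({
    "id": [101, 101, 102, 103, 103, 103],
    "new_balance": [0.0, 250.0, 90.5, 10.0, 0.0, 42.0],
})

# Equal counts are not enough, so compare the actual sets of IDs.
ids_match = set(customers["id"]) == set(payments["id"])
print("Same set of IDs in both frames:", ids_match)

# Joining on 'id' repeats each customer's row once per payment record,
# so the customer index values appear several times in the result.
joined = customers.join(payments.set_index("id"), on="id")
print(joined)
print("Rows after join:", len(joined))
```

On the real data the same check would be `set(customer_data["id"]) == set(payment_data["id"])`. The joined frame has one row per payment record, not one row per customer.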

<a name="2.2"></a>
### 2.2 Processing

In [7]:
#Drop unnecessary columns
payment_data = payment_data.drop(['prod_code', 'prod_limit', 'update_date', 'highest_balance', 'report_date'], axis=1)

In [8]:
#Join customer and payment data
data = customer_data.join(payment_data.set_index('id'), on='id')

In [9]:
data

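| | label | id | fea_1 | fea_2 | fea_3 | fea_4 | fea_5 | fea_6 | fea_7 | fea_8 | fea_9 | fea_10 | fea_11 | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | new_balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 54982665 | 5 | 1245.5 | 3 | 77000.0 | 2 | 15 | 5 | 109 | 5 | 151300 | 244.948974 | 0 | 0 | 0 | 0 | 9 | 6657.6 |
| 0 | 1 | 54982665 | 5 | 1245.5 | 3 | 77000.0 | 2 | 15 | 5 | 109 | 5 | 151300 | 244.948974 | 0 | 0 | 0 | 0 | 18 | 153792.0 |
| 0 | 1 | 54982665 | 5 | 1245.5 | 3 | 77000.0 | 2 | 15 | 5 | 109 | 5 | 151300 | 244.948974 | 0 | 0 | 0 | 0 | 1 | 0.0 |
| 0 | 1 | 54982665 | 5 | 1245.5 | 3 | 77000.0 | 2 | 15 | 5 | 109 | 5 | 151300 | 244.948974 | 0 | 2 | 26 | 11906 | 6 | 0.0 |
| 1 | 0 | 59004779 | 4 | 1277.0 | 1 | 113000.0 | 2 | 8 | -1 | 100 | 3 | 341759 | 207.173840 | 0 | 0 | 0 | 0 | 4 | 15120.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1123 | 0 | 58998054 | 4 | 1250.0 | 3 | 137000.0 | 2 | 8 | 5 | 90 | 5 | 72000 | 1.000000 | 0 | 0 | 0 | 0 | 10 | 0.0 |
| 1123 | 0 | 58998054 | 4 | 1250.0 | 3 | 137000.0 | 2 | 8 | 5 | 90 | 5 | 72000 | 1.000000 | 0 | 0 | 0 | 0 | 1 | -121.2 |
| 1124 | 0 | 54989781 | 4 | 1415.0 | 3 | 93000.0 | 2 | 8 | 5 | 113 | 4 | 151300 | 273.861279 | 0 | 0 | 0 | 0 | 12 | 334130.4 |
| 1124 | 0 | 54989781 | 4 | 1415.0 | 3 | 93000.0 | 2 | 8 | 5 | 113 | 4 | 151300 | 273.861279 | 0 | 0 | 0 | 0 | 7 | 456098.4 |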


In [10]:
#Drop ID
data = data.drop(['id'], axis=1)

In [11]:
#Shuffle data
data = data.sample(frac=1, random_state=2)  # fixed seed so the shuffle is reproducible

In [12]:
data

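| | label | fea_1 | fea_2 | fea_3 | fea_4 | fea_5 | fea_6 | fea_7 | fea_8 | fea_9 | fea_10 | fea_11 | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | new_balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 369 | 0 | 5 | 1227.5 | 3 | 35000.0 | 2 | 15 | 4 | 105 | 5 | 510099 | 316.227766 | 0 | 0 | 0 | 0 | 36 | 32505.6 |
| 313 | 0 | 5 | 1236.5 | 1 | 52000.0 | 2 | 15 | 5 | 112 | 4 | 550009 | 172.615179 | 0 | 0 | 0 | 0 | 33 | 17560.8 |
| 814 | 0 | 7 | 1250.0 | 3 | 60000.0 | 2 | 11 | 5 | 114 | 4 | 72001 | 707.106781 | 0 | 0 | 0 | 0 | 12 | 0.0 |
| 253 | 0 | 7 | 1370.0 | 3 | 394000.0 | 2 | 11 | 5 | 112 | 5 | 510009 | 445.038201 | 0 | 0 | 24 | 21075 | 1 | 0.0 |
| 600 | 0 | 7 | 1308.5 | 1 | 110000.0 | 2 | 11 | 5 | 108 | 4 | 60004 | 1.000000 | 0 | 0 | 0 | 0 | 31 | 754.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | 0 | 7 | 1254.5 | 3 | 97000.0 | 2 | 11 | 5 | 83 | 3 | 60017 | 282.842713 | 0 | 0 | 0 | 0 | 7 | 0.0 |
| 347 | 0 | 7 | 1304.0 | 3 | 290000.0 | 2 | 11 | 5 | 111 | 4 | 60000 | 310.383634 | 0 | 0 | 0 | 0 | 1 | 0.0 |
| 905 | 1 | 7 | 1316.0 | 1 | 64000.0 | 2 | 11 | 9 | 87 | 5 | 60079 | 1.000000 | 0 | 0 | 0 | 0 | 1 | 0.0 |
| 352 | 0 | 5 | 1242.5 | 3 | 30000.0 | 2 | 15 | 5 | 80 | 3 | 60050 | 200.000000 | 0 | 0 | 0 | 0 | 8 | 0.0 |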


Before getting the features and labels for the model, let's check for missing values.

In [13]:
#Check for null values
data.isnull().sum()

label             0
fea_1             0
fea_2          1028
fea_3             0
fea_4             0
fea_5             0
fea_6             0
fea_7             0
fea_8             0
fea_9             0
fea_10            0
fea_11            0
OVD_t1            0
OVD_t2            0
OVD_t3            0
OVD_sum           0
pay_normal        0
new_balance       0
dtype: int64

In [14]:
print("Unique values in fea_2: ", len(data["fea_2"].unique()))
print(data["fea_2"].unique())

Unique values in fea_2:  159
[1227.5 1236.5 1250.  1370.  1308.5    nan 1271.  1256.  1338.5 1274.
 1320.5 1314.5 1244.  1259.  1323.5 1352.  1358.  1313.  1220.  1331.
 1286.  1296.5 1310.  1281.5 1376.  1278.5 1206.5 1347.5 1317.5 1211.
 1326.5 1239.5 1269.5 1292.  1241.  1365.5 1341.5 1343.  1356.5 1425.5
 1305.5 1223.  1212.5 1335.5 1337.  1257.5 1245.5 1275.5 1301.  1214.
 1272.5 1311.5 1251.5 1379.  1389.5 1349.  1302.5 1277.  1283.  1290.5
 1230.5 1218.5 1325.  1298.  1307.  1373.  1350.5 1229.  1319.  1224.5
 1419.5 1265.  1304.  1299.5 1293.5 1254.5 1397.  1364.  1263.5 1235.
 1179.5 1163.  1262.  1346.  1367.  1238.  1248.5 1316.  1287.5 1329.5
 1289.  1266.5 1215.5 1322.  1268.  1191.5 1400.  1362.5 1404.5 1205.
 1170.5 1284.5 1200.5 1361.  1280.  1377.5 1328.  1371.5 1415.  1203.5
 1475.  1137.5 1164.5 1340.  1413.5 1233.5 1232.  1226.  1253.  1355.
 1166.  1382.  1388.  1443.5 1334.  1260.5 1190.  1332.5 1386.5 1188.5
 1242.5 1368.5 1295.  1247.  1353.5 1217.  1449.5 1148.

Feature 2 (`fea_2`) has 1,028 missing values. The column has 159 unique entries (counting NaN), and the known values all lie close together, so let's fill the gaps with the column mean.

In [15]:
#Replace null values with column mean
data['fea_2'] = data['fea_2'].fillna(data['fea_2'].mean())
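
To illustrate what the fill does, here is a minimal sketch on a made-up column (`col` is hypothetical, with values only loosely in the range of `fea_2`). It relies on `Series.mean()` skipping NaN by default, so the fill value is the mean of the known entries only.

```python
import numpy as np
import pandas as pd

# Toy column with gaps (values are made up).
col = pd.Series([1250.0, np.nan, 1300.0, 1275.0, np.nan])

# mean() ignores NaN, so this averages the three known values.
fill_value = col.mean()
filled = col.fillna(fill_value)

print("Fill value:", fill_value)
print("Missing after fill:", filled.isnull().sum())
print("Mean after fill:", filled.mean())
```

Mean imputation leaves the column mean unchanged but reduces its variance, since every filled row gets the same value.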

<a name="2.3"></a>
### 2.3 Features and Labels

We can now get the training arrays.

In [16]:
#Get features and labels
X_train = data.drop(['label'], axis=1).to_numpy()
Y_train = data['label'].to_numpy()

In [17]:
print(f"First example features: {X_train[0]}")

First example features: [5.00000000e+00 1.22750000e+03 3.00000000e+00 3.50000000e+04
 2.00000000e+00 1.50000000e+01 4.00000000e+00 1.05000000e+02
 5.00000000e+00 5.10099000e+05 3.16227766e+02 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 3.60000000e+01
 3.25056000e+04]


In [18]:
print(f"First example label: {Y_train[0]}")

First example label: 0


<a name="3"></a>
## 3 - Classification Model

For the classification model, a sequential model will be built with three Dense hidden layers of 64, 128, and 256 neurons, each with a ReLU activation, followed by a Dropout layer to reduce overfitting and a Dense output layer with a sigmoid activation for binary classification.

In [19]:
model = Sequential(
    [   
        Dense(64, activation='relu', input_shape=X_train.shape[1:]),
        
        Dense(128, activation='relu'),
        
        Dense(256, activation='relu'),
        Dropout(0.3),
        
        Dense(1, activation='sigmoid')
    ], name = "my_model" 
)

In [20]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                1152      
                                                                 
 dense_1 (Dense)             (None, 128)               8320      
                                                                 
 dense_2 (Dense)             (None, 256)               33024     
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_3 (Dense)             (None, 1)                 257       
                                                                 
Total params: 42,753
Trainable params: 42,753
Non-trainable params: 0
_________________________________________________________________
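
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic in plain Python, assuming 17 input features (the length of the first example printed earlier); `dense_params` is a helper introduced here for illustration, not part of Keras.

```python
# Dense layer parameters = inputs * units (weights) + units (biases).
def dense_params(n_inputs: int, n_units: int) -> int:
    return n_inputs * n_units + n_units

n_features = 17                  # assumed from the length of X_train[0]
layer_units = [64, 128, 256, 1]  # the four Dense layers in the model

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units  # Dropout adds no parameters and keeps the shape

print(counts)       # [1152, 8320, 33024, 257]
print(sum(counts))  # 42753
```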


<a name="3.1"></a>
### 3.1 Training

In [21]:
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
    metrics=['accuracy']
)

history = model.fit(
    X_train, Y_train,
    batch_size = 32,
    validation_split = 0.3,
    epochs = 10,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
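
With `validation_split = 0.3`, Keras holds out the last 30% of the rows passed to `fit` and trains on the rest. The NumPy sketch below mimics that slicing on made-up arrays; it is an illustration of the idea, and the exact rounding Keras applies at the boundary is not reproduced here.

```python
import numpy as np

# Toy arrays standing in for X_train / Y_train (10 rows, 2 features).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

validation_split = 0.3
n_val = round(len(X) * validation_split)  # rows held out for validation
split_at = len(X) - n_val                 # index of the first held-out row

# The training part is the head of the arrays, the validation part the tail.
X_fit, X_val = X[:split_at], X[split_at:]
y_fit, y_val = y[:split_at], y[split_at:]

print("Training rows:", len(X_fit))
print("Validation rows:", len(X_val))
```

Because the data was shuffled earlier, the tail is effectively a random sample of rows. Note that the joined table has several rows per customer, so rows from the same customer may end up on both sides of the split.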


<a name="4"></a>
## 4 - Results

The loss and accuracy on the training and validation data are close. This suggests that our model did not overfit the training data and generalizes well to new examples. With a validation loss of 0.43 and a validation accuracy of 0.84, the model performs well on this dataset.
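
To make the two reported metrics concrete, the sketch below computes binary cross-entropy and accuracy by hand on five made-up predictions (`y_true` and `y_prob` are hypothetical, not model outputs). It follows the usual definitions and assumes a 0.5 decision threshold.

```python
import numpy as np

# Made-up labels and sigmoid outputs for five examples.
y_true = np.array([0, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.6, 0.4])

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)].
eps = 1e-7  # keeps log() away from 0
p = np.clip(y_prob, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Accuracy: threshold the probabilities at 0.5 and compare with the labels.
y_pred = (y_prob > 0.5).astype(int)
accuracy = np.mean(y_pred == y_true)

print(f"loss={bce:.3f}, accuracy={accuracy:.2f}")
```

When reading an accuracy figure, a useful reference point is the share of the majority class, which could be computed here as `max(Y_train.mean(), 1 - Y_train.mean())`.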