# Case Study: Credit Risk Prediction

This notebook builds a classification model that predicts whether a customer will default on a credit card payment, based on customer attributes.

## Overview
### Objective:
Our goal is to:
- Preprocess the credit risk data using appropriate encoding methods.
- Train and tune predictive models with several algorithms using cross-validation.
- Compare the models on accuracy, precision, recall, and F1-score.
- Identify the best-performing model by these criteria and evaluate it on the test set.
- Compute the optimal classification threshold for the selected model.

### Dataset:
The dataset includes one target variable and 23 predictor variables:

- Target Variable (Y): Indicates whether the customer defaulted on a credit card payment (Yes = 1, No = 0).

- Predictor Variables (X1 to X23):
  - X1: Credit amount (NT dollar).
  - X2: Gender (1 = male; 2 = female).
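  - X3: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others). The training data also contain codes outside this range (observed minimum 0, maximum 6).
  - X4: Marital status (1 = married; 2 = single; 3 = others). The training data also contain the undocumented code 0.
  - X5: Age (years).
  - X6 - X11: Historical monthly repayment statuses (-1 = paid duly, 1-9 = number of months the payment is delayed). The training data also contain the undocumented values -2 and 0, in at least X6 to X10.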
  - X12 - X17: Monthly bill statement amounts (NT dollar).
  - X18 - X23: Amount paid each month (NT dollar).
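
Because the columns carry generic names, it can help to group them by role before preprocessing. The sketch below is one possible grouping based on the description above; the group names are hypothetical and do not come from the dataset.

```python
# Hypothetical grouping of the predictor columns by role, following the
# dataset description. The group names are illustrative only.
column_groups = {
    "credit_amount": ["X1"],
    "demographics": ["X2", "X3", "X4", "X5"],
    "repayment_status": [f"X{i}" for i in range(6, 12)],   # X6 to X11
    "bill_amount": [f"X{i}" for i in range(12, 18)],       # X12 to X17
    "payment_amount": [f"X{i}" for i in range(18, 24)],    # X18 to X23
}

# Flatten the groups to confirm that every predictor is covered exactly once.
all_columns = [col for cols in column_groups.values() for col in cols]
print(len(all_columns), len(set(all_columns)))
```

A grouping like this makes it easy to select, for example, all bill amounts with `X_train[column_groups["bill_amount"]]`.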

### Tasks
1. Load and preprocess the training and test datasets, applying appropriate encodings.
2. Train and tune a model for each algorithm using cross-validation, illustrating the hyperparameter tuning with plots.
3. Select and justify the best-performing model.
4. Evaluate the selected best model on the test set using suitable classification metrics.
5. Compute the optimal probability threshold for classifying defaults to improve the model's classification performance.
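
As a rough illustration of tasks 2 and 5, the sketch below tunes a single hyperparameter with cross-validation and then picks a probability threshold. It makes several assumptions that are not part of this notebook: the data are synthetic (`make_classification`), the model is a logistic regression, and "optimal" is taken to mean the threshold that maximizes F1 on a held-out validation split. Other criteria, such as a cost-weighted error, would give a different threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the credit data (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Task 2 (sketch): tune the regularization strength C with 5-fold CV.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_fit, y_fit)

# Task 5 (sketch): choose the threshold that maximizes F1 on validation data.
proba = search.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)

# precision and recall have one more entry than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_threshold = float(thresholds[np.argmax(f1)])
print(search.best_params_, round(best_threshold, 3))
```

The threshold is chosen on a validation split rather than on the test set, so the final test-set evaluation stays unbiased.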


## Data Loading and Preprocessing


In [None]:
# Imports
import numpy as np
import pandas as pd

# Scikit-learn
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Seed NumPy's global random number generator for reproducibility
np.random.seed(42)

In [5]:
# Load the training and test sets
train_df = pd.read_csv('creditdefault_train.csv')
test_df = pd.read_csv('creditdefault_test.csv')

# Preview the first rows of each set
print(train_df.head())
print(test_df.head())

   Y      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...     X14     X15     X16  \
0  1   20000   2   2   1  24   2   2  -1  -1  ...     689       0       0   
1  0   50000   2   2   1  37   0   0   0   0  ...   49291   28314   28959   
2  0   50000   1   2   1  57  -1   0  -1   0  ...   35835   20940   19146   
3  0   50000   1   1   2  37   0   0   0   0  ...   57608   19394   19619   
4  0  500000   1   1   2  29   0   0   0   0  ...  445007  542653  483003   

      X17    X18    X19    X20    X21    X22    X23  
0       0      0    689      0      0      0      0  
1   29547   2000   2019   1200   1100   1069   1000  
2   19131   2000  36681  10000   9000    689    679  
3   20024   2500   1815    657   1000   1000    800  
4  473944  55000  40000  38000  20239  13750  13770  

[5 rows x 24 columns]
   Y      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...    X14    X15    X16    X17  \
0  1  120000   2   2   2  26  -1   2   0   0  ...   2682   3272   3455   3261   
1  0   90000   2   2   2  34

In [None]:
# Separate the predictors (X) from the target (Y) in each set
X_train = train_df.drop('Y', axis=1)
y_train = train_df['Y']

X_test = test_df.drop('Y', axis=1)
y_test = test_df['Y']

In [57]:
X_train.describe()

| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | ... | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 |
| mean | 167450.245333 | 1.604867 | 1.85 | 1.5562 | 35.367933 | -0.020467 | -0.130933 | -0.163 | -0.214467 | -0.256933 | ... | 47117.562067 | 43077.445667 | 40272.922667 | 38708.685867 | 5615.96 | 5822.059 | 4942.959 | 4997.328867 | 4798.4784 | 5226.421267 |
| std | 130109.925023 | 0.488896 | 0.786686 | 0.522743 | 9.154118 | 1.125048 | 1.198451 | 1.202606 | 1.180578 | 1.148654 | ... | 69182.43494 | 64016.907786 | 60503.339354 | 59212.42541 | 15551.708028 | 21556.75 | 13629.034736 | 16499.349511 | 15463.948485 | 18099.851948 |
| min | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -34041.0 | -170000.0 | -46627.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2733.5 | 2392.75 | 1800.0 | 1200.0 | 1000.0 | 833.0 | 390.0 | 290.0 | 204.0 | 80.0 |
| 50% | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 20165.0 | 19090.5 | 18178.0 | 17177.0 | 2113.0 | 2014.0 | 1809.0 | 1500.0 | 1500.0 | 1500.0 |
| 75% | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 60263.25 | 54599.5 | 50134.75 | 49122.75 | 5023.25 | 5000.0 | 4571.5 | 4048.5 | 4019.5 | 4000.0 |
| max | 800000.0 | 2.0 | 6.0 | 3.0 | 75.0 | 8.0 | 8.0 | 8.0 | 8.0 | 7.0 | ... | 855086.0 | 706864.0 | 587067.0 | 568638.0 | 493358.0 | 1227082.0 | 380478.0 | 528897.0 | 426529.0 | 528666.0 |


In [65]:
# Check for missing values
print(X_train.loc[X_train.isnull().any(axis=1)])
print(X_test.loc[X_test.isnull().any(axis=1)])

Empty DataFrame
Columns: [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23]
Index: []

[0 rows x 23 columns]
Empty DataFrame
Columns: [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23]
Index: []

[0 rows x 23 columns]
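
No rows contain missing values, but the summary statistics show minimum values of 0 for X3 and X4, which the dataset description does not document. The sketch below shows one way to count missing values per column and to list category codes that fall outside the documented ones. It runs on a small toy frame with made-up values, not on the real `X_train`; the names `toy`, `documented_codes`, and `undocumented` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for X_train; the values are illustrative only.
toy = pd.DataFrame(
    {
        "X2": [1, 2, 2, 1],
        "X3": [1, 0, 6, 2],
        "X4": [1, 2, 0, 3],
    }
)

# Missing values per column: a compact alternative to printing affected rows.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Codes listed in the dataset description for each categorical column.
documented_codes = {
    "X2": {1, 2},
    "X3": {1, 2, 3, 4},
    "X4": {1, 2, 3},
}

# For each column, collect the observed codes that are not documented.
undocumented = {
    col: sorted(int(v) for v in set(toy[col]) - codes)
    for col, codes in documented_codes.items()
}
print(undocumented)
```

Running the same check on the real data would show how many rows carry undocumented codes, which informs whether to keep them as separate categories or merge them into "others".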


In [None]:
# X2 = gender, X4 = marital status
categorical_columns = ["X2", "X4"]

X_train_cat = X_train[categorical_columns].copy()
X_train_num = X_train.drop(categorical_columns, axis=1)

X_train_cat.describe()

| | X2 | X4 |
|---|---|---|
| count | 15000.0 | 15000.0 |
| mean | 1.604867 | 1.5562 |
| std | 0.488896 | 0.522743 |
| min | 1.0 | 0.0 |
| 25% | 1.0 | 1.0 |
| 50% | 2.0 | 2.0 |
| 75% | 2.0 | 2.0 |
| max | 2.0 | 3.0 |


In [59]:
# Use one hot encoding for nominal data
one_hot_encoder = OneHotEncoder()

X_train_cat = one_hot_encoder.fit_transform(X_train_cat)

one_hot_encoder.categories_

[array([1, 2], dtype=int64), array([0, 1, 2, 3], dtype=int64)]