# Case Study: Credit Risk Prediction

This notebook builds a classification model that predicts whether a customer will default on a credit card payment, based on customer attributes.

## Overview
### Objective
Our goal is to:
- Preprocess the credit risk data using appropriate encoding methods.
- Train and tune predictive models with multiple algorithms, using cross-validation.
- Compare the models on accuracy, precision, recall, and F1 score.
- Identify the best-performing model on these criteria and evaluate it on the test set.
- Compute the optimal classification threshold for the selected model.

### Dataset
The dataset includes one target variable and 23 predictor variables:

- Target Variable (Y): Indicates whether the customer defaulted on a credit card payment (Yes = 1, No = 0).

- Predictor Variables (X1 to X23):
  - X1: Credit amount (NT dollar).
  - X2: Gender (1 = male; 2 = female).
  - X3: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  - X4: Marital status (1 = married; 2 = single; 3 = others).
  - X5: Age (years).
  - X6 - X11: Historical monthly repayment statuses (-1 = paid duly, 1-9 = months delayed).
  - X12 - X17: Monthly bill statement amounts (NT dollar).
  - X18 - X23: Amount paid each month (NT dollar).

### Tasks
1. Load and preprocess the training and test datasets, applying appropriate encodings.
2. Train and tune each algorithm using cross-validation, illustrating hyperparameter tuning with plots.
3. Select and justify the best-performing model.
4. Evaluate the selected model on the test set using suitable classification metrics.
5. Compute the optimal probability threshold for classifying defaults, to improve classification performance.
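
Task 5 can be sketched in a few lines. The example below is a minimal illustration, not the notebook's implementation: `f1_at_threshold` and `best_threshold` are hypothetical helpers, the labels and probabilities are toy values, and F1 is assumed to be the quantity to maximise. In practice the threshold would normally be chosen on validation data rather than on the test set.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when every probability >= threshold is classified as a default."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return (threshold, f1) for the first candidate with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        score = f1_at_threshold(y_true, y_prob, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy labels and predicted default probabilities (illustrative values only)
toy_y_true = [0, 0, 0, 0, 1, 0, 1, 1]
toy_y_prob = [0.05, 0.12, 0.22, 0.31, 0.36, 0.42, 0.63, 0.91]
toy_candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

chosen_t, chosen_f1 = best_threshold(toy_y_true, toy_y_prob, toy_candidates)
print(chosen_t, round(chosen_f1, 3))
```

With real models, the probabilities would come from the estimator's `predict_proba` output for the positive class.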


## Setup and Data Loading


In [None]:
# Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# For reproducibility: seed NumPy's global RNG, which scikit-learn estimators use when no random_state is given
np.random.seed(42)

In [23]:
# ANSI escape codes for styling console output
underline = "\033[04m"
bold = "\033[01m"
red = "\033[91m"
green = "\033[92m"
blue = "\033[34m"
purple = "\033[95m"
reset = "\033[0m"

In [24]:
# Load train and test data
train_df = pd.read_csv('creditdefault_train.csv')
test_df = pd.read_csv('creditdefault_test.csv')

print(train_df.head())

   Y      X1  X2  X3  X4  X5  X6  X7  X8  X9  ...     X14     X15     X16  \
0  1   20000   2   2   1  24   2   2  -1  -1  ...     689       0       0   
1  0   50000   2   2   1  37   0   0   0   0  ...   49291   28314   28959   
2  0   50000   1   2   1  57  -1   0  -1   0  ...   35835   20940   19146   
3  0   50000   1   1   2  37   0   0   0   0  ...   57608   19394   19619   
4  0  500000   1   1   2  29   0   0   0   0  ...  445007  542653  483003   

      X17    X18    X19    X20    X21    X22    X23  
0       0      0    689      0      0      0      0  
1   29547   2000   2019   1200   1100   1069   1000  
2   19131   2000  36681  10000   9000    689    679  
3   20024   2500   1815    657   1000   1000    800  
4  473944  55000  40000  38000  20239  13750  13770  

[5 rows x 24 columns]


In [25]:
# Separate features from labels
X_train = train_df.drop('Y', axis=1)
y_train = train_df['Y']

X_test = test_df.drop('Y', axis=1)
y_test = test_df['Y']

In [5]:
X_train.describe()

| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | ... | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 | 15000.0 |
| mean | 167450.245333 | 1.604867 | 1.85 | 1.5562 | 35.367933 | -0.020467 | -0.130933 | -0.163 | -0.214467 | -0.256933 | ... | 47117.562067 | 43077.445667 | 40272.922667 | 38708.685867 | 5615.96 | 5822.059 | 4942.959 | 4997.328867 | 4798.4784 | 5226.421267 |
| std | 130109.925023 | 0.488896 | 0.786686 | 0.522743 | 9.154118 | 1.125048 | 1.198451 | 1.202606 | 1.180578 | 1.148654 | ... | 69182.43494 | 64016.907786 | 60503.339354 | 59212.42541 | 15551.708028 | 21556.75 | 13629.034736 | 16499.349511 | 15463.948485 | 18099.851948 |
| min | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -34041.0 | -170000.0 | -46627.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2733.5 | 2392.75 | 1800.0 | 1200.0 | 1000.0 | 833.0 | 390.0 | 290.0 | 204.0 | 80.0 |
| 50% | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 20165.0 | 19090.5 | 18178.0 | 17177.0 | 2113.0 | 2014.0 | 1809.0 | 1500.0 | 1500.0 | 1500.0 |
| 75% | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 60263.25 | 54599.5 | 50134.75 | 49122.75 | 5023.25 | 5000.0 | 4571.5 | 4048.5 | 4019.5 | 4000.0 |
| max | 800000.0 | 2.0 | 6.0 | 3.0 | 75.0 | 8.0 | 8.0 | 8.0 | 8.0 | 7.0 | ... | 855086.0 | 706864.0 | 587067.0 | 568638.0 | 493358.0 | 1227082.0 | 380478.0 | 528897.0 | 426529.0 | 528666.0 |


In [6]:
# Check for missing values
print(X_train.loc[X_train.isnull().any(axis=1)])
print(X_test.loc[X_test.isnull().any(axis=1)])

Empty DataFrame
Columns: [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23]
Index: []

[0 rows x 23 columns]
Empty DataFrame
Columns: [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23]
Index: []

[0 rows x 23 columns]


## Building a Preprocessing Pipeline

In [None]:
def preprocess_data(X: pd.DataFrame) -> np.ndarray:
    """One-hot encode nominal columns, pass ordinal columns through, scale the rest."""
    nominal_features = ["X2", "X4"]  # X2 = gender, X4 = marital status
    # X3 = education level, X6-X11 = repayment statuses (kept as ordered integers)
    ordinal_features = ["X3", "X6", "X7", "X8", "X9", "X10", "X11"]
    # Every remaining column is treated as numerical
    numerical_features = [
        col for col in X.columns
        if col not in nominal_features + ordinal_features
    ]

    full_pipeline = ColumnTransformer([
        ("nom", OneHotEncoder(), nominal_features),
        ("ord", "passthrough", ordinal_features),
        ("num", StandardScaler(), numerical_features),
    ])

    # Note: the transformer is fitted on whatever data is passed in
    return full_pipeline.fit_transform(X)

In [22]:
prepared_X_train = preprocess_data(X_train)

prepared_X_train.shape

(15000, 27)
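
Because `preprocess_data` calls `fit_transform`, applying it to the test set would refit the encoder and scaler on test data. A common alternative is to fit the transformer once on the training data and reuse it for the test data. The following is a minimal sketch under stated assumptions: `build_preprocessor` is a hypothetical helper, the toy frames stand in for the real data, and `handle_unknown="ignore"` is an added choice that is not in the original function.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames standing in for X_train / X_test (illustrative values only)
toy_train = pd.DataFrame({
    "X2": [1, 2, 2, 1],      # nominal
    "X3": [1, 2, 3, 2],      # ordinal
    "X5": [24, 37, 57, 29],  # numerical
})
toy_test = pd.DataFrame({
    "X2": [2, 1],
    "X3": [2, 1],
    "X5": [41, 33],
})


def build_preprocessor(nom_cols, ord_cols, num_cols):
    """Hypothetical helper: returns an unfitted ColumnTransformer."""
    return ColumnTransformer([
        # Ignore categories at transform time that were not seen during fit
        ("nom", OneHotEncoder(handle_unknown="ignore"), nom_cols),
        ("ord", "passthrough", ord_cols),
        ("num", StandardScaler(), num_cols),
    ])


preprocessor = build_preprocessor(["X2"], ["X3"], ["X5"])
toy_prepared_train = preprocessor.fit_transform(toy_train)  # learn statistics here
toy_prepared_test = preprocessor.transform(toy_test)        # reuse them here

print(toy_prepared_train.shape, toy_prepared_test.shape)
```

Fitting only on the training data keeps the test set's column layout and scaling consistent with what the models saw during training.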

## Training Models

Models are compared using the `F1` score.

The F1 score is the harmonic mean of precision and recall, `2 / (1/precision + 1/recall)`, and should be prioritised over accuracy when the dataset has class imbalance.
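
To illustrate why accuracy can mislead on imbalanced data, the sketch below computes the four metrics by hand. The function name `classification_scores` and the toy labels are assumptions made for illustration; they are not part of the notebook.

```python
def classification_scores(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall: 2 / (1/precision + 1/recall)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy imbalanced labels: 8 non-defaults, 2 defaults (illustrative only)
toy_labels = [0] * 8 + [1] * 2

# A classifier that never predicts a default still looks accurate
always_negative = classification_scores(toy_labels, [0] * 10)

# A classifier that finds one default and raises one false alarm
one_hit = classification_scores(toy_labels, [0] * 7 + [1] + [1, 0])

print(always_negative)
print(one_hit)
```

Both toy classifiers reach the same accuracy, but only F1 separates the one that never flags a default from the one that finds some.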

### Helper Function

In [None]:
def model_trainer(
        model, 
        X: np.ndarray,
        y: np.ndarray,
        param_grid: dict,
        validation_folds: int = 5,
        verbose: bool = True,
    ):
    """
    Automatically perform a grid search fitted to the given features and labels
    on an estimator, printing the mean cross validation score for all parameter
    setups and returning the best estimator.
    """
    print(f"{blue}Search Parameters:{reset}")
    for k, v in param_grid.items(): print(f"{k}: {v}")
    print(f"\n{blue}Validation Folds:{reset}\n{validation_folds}")
    print(f"\n{underline}{blue}Performing grid search...{reset}")

    # Perform grid search cross validation using parameter grid
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=validation_folds,
        n_jobs=-1,
        scoring=["accuracy", "precision", "recall", "f1"],
        refit="f1",
    )
    grid_search.fit(X, y)

    # Print the results of the grid search
    if verbose:
        cvres = grid_search.cv_results_
        for i, (mean_f1, mean_pre, mean_rec, mean_acc, params) in enumerate(zip(
            cvres["mean_test_f1"], cvres["mean_test_precision"], cvres["mean_test_recall"],
            cvres["mean_test_accuracy"], cvres["params"])):
            print(f"\n{bold}{blue}{i+1}:{reset} {params}")
            print(f"    Mean F1 Score: {green}{round(mean_f1, 3)}{reset}")
            print(f"    Mean Precision: {green}{round(mean_pre, 3)}{reset}")
            print(f"    Mean Recall: {green}{round(mean_rec, 3)}{reset}")
            print(f"    Mean Accuracy: {green}{round(mean_acc, 3)}{reset}")

    print(f"\n{underline}{blue}BEST RESULT:{reset}\n")
    print(f"{bold}{blue}{grid_search.best_index_+1}:{reset} {grid_search.best_params_}")
    print(f"    Mean F1 Score: {green}{round(grid_search.best_score_, 3)}{reset}\n")

    return grid_search.best_estimator_
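
A grid search evaluates every combination of the listed parameter values on every fold, so the cost grows multiplicatively. The sketch below estimates that cost with the standard library only. `enumerate_grid` is a hypothetical helper, and the sorted-key ordering is an assumption chosen to match the candidate order seen in the printed results.

```python
from itertools import product


def enumerate_grid(param_grid):
    """Yield one dict per parameter combination, iterating keys in sorted order."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Same shape as the first K-nearest-neighbour grid used in this notebook
toy_knn_grid = {
    "n_neighbors": [1, 2, 3, 4, 5, 6],
    "p": [1, 2, 3, 4, 5],
    "weights": ["uniform", "distance"],
}

grid_candidates = list(enumerate_grid(toy_knn_grid))
n_folds = 5
n_cv_fits = len(grid_candidates) * n_folds  # one fit per candidate per fold

print(len(grid_candidates), n_cv_fits)
print(grid_candidates[:2])
```

With `refit` enabled, one additional fit on the full training data follows the cross-validated fits.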

### K-Nearest Neighbour Classifier
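
The `p` parameter tuned in this section is the exponent of the Minkowski distance. As a minimal illustration (the function `minkowski_distance` and the two points are made up for this example), `p=1` gives the Manhattan distance and `p=2` the Euclidean distance:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points for exponent p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
higher_order = minkowski_distance(point_a, point_b, p=5)

print(manhattan, euclidean, higher_order)
```

As `p` grows, the largest single coordinate difference increasingly dominates the distance.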

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k_neighbours = KNeighborsClassifier()

k_neighbours_param_grid = {
    "n_neighbors": [1,2,3,4,5,6], # Number of nearest neighbours to consider
    "p": [1,2,3,4,5], # Power to use for Minkowski distance (1=Manhattan, 2=Euclidean)
    "weights": ["uniform", "distance"], # Weight function to use
}

k_neighbours = model_trainer(
    k_neighbours, 
    prepared_X_train, 
    y_train, 
    k_neighbours_param_grid,
)

k_neighbours

Search Parameters:
n_neighbors: [1, 2, 3, 4, 5, 6]
p: [1, 2, 3, 4, 5]
weights: ['uniform', 'distance']

Validation Folds:
5

Performing grid search...

1: {'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}
    Mean F1 Score: 0.384
    Mean Precision: 0.383
    Mean Recall: 0.387
    Mean Accuracy: 0.726

2: {'n_neighbors': 1, 'p': 1, 'weights': 'distance'}
    Mean F1 Score: 0.384
    Mean Precision: 0.383
    Mean Recall: 0.387
    Mean Accuracy: 0.726

3: {'n_neighbors': 1, 'p': 2, 'weights': 'uniform'}
    Mean F1 Score: 0.38
    Mean Precision: 0.382
    Mean Recall: 0.379
    Mean Accuracy: 0.727

4: {'n_neighbors': 1, 'p': 2, 'weights': 'distance'}
    Mean F1 Score: 0.38
    Mean Precision: 0.382
    Mean Recall: 0.379
    Mean Accuracy: 0.727

...

Because the best score occurred at the largest `n_neighbors` value in the grid (6), larger values may improve performance further. A second grid search therefore tests `n_neighbors` >= 6.

In [None]:
k_neighbours_param_grid = {
    "n_neighbors": [6,7,8,9,10],
    "p": [4,5,6],
    "weights": ["distance"],
}

k_neighbours = model_trainer(
    k_neighbours, 
    prepared_X_train, 
    y_train, 
    k_neighbours_param_grid,
)

k_neighbours

Search Parameters:
n_neighbors: [6, 7, 8, 9, 10]
p: [4, 5, 6]
weights: ['distance']

Validation Folds:
5

Performing grid search...

1: {'n_neighbors': 6, 'p': 4, 'weights': 'distance'}
    Mean F1 Score: 0.431
    Mean Precision: 0.558
    Mean Recall: 0.352
    Mean Accuracy: 0.795

2: {'n_neighbors': 6, 'p': 5, 'weights': 'distance'}
    Mean F1 Score: 0.432
    Mean Precision: 0.56
    Mean Recall: 0.353
    Mean Accuracy: 0.795

3: {'n_neighbors': 6, 'p': 6, 'weights': 'distance'}
    Mean F1 Score: 0.429
    Mean Precision: 0.558
    Mean Recall: 0.349
    Mean Accuracy: 0.795

4: {'n_neighbors': 7, 'p': 4, 'weights': 'distance'}
    Mean F1 Score: 0.437
    Mean Precision: 0.581
    Mean Recall: 0.351
    Mean Accuracy: 0.8

5: {'n_neigh...

### Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()

decision_tree_param_grid = {
    "max_depth": [3, 4, 5, 6], # Pre-pruning, prevents tree from growing further
    "min_samples_split": [2, 3, 4, 5], # Min. samples at a node needed to split further
    "min_samples_leaf": [1, 3, 5, 7], # Min. samples needed at a leaf node
}

decision_tree = model_trainer(
    decision_tree,
    prepared_X_train,
    y_train,
    decision_tree_param_grid
)

decision_tree

Search Parameters:
max_depth: [3, 4, 5, 6]
min_samples_split: [2, 3, 4, 5]
min_samples_leaf: [1, 3, 5, 7]

Validation Folds:
5

Performing grid search...

1: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
    Mean F1 Score: 0.458
    Mean Precision: 0.69
    Mean Recall: 0.345
    Mean Accuracy: 0.82

2: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 3}
    Mean F1 Score: 0.458
    Mean Precision: 0.69
    Mean Recall: 0.345
    Mean Accuracy: 0.82

3: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 4}
    Mean F1 Score: 0.458
    Mean Precision: 0.69
    Mean Recall: 0.345
    Mean Accuracy: 0.82

4: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 5}
    Mean F1 Score: 0.458
    Mean Precision: 0.69
    Mean Recall: ...

In [None]:
decision_tree_param_grid = {
    "max_depth": [5],
    "min_samples_split": [2],
    "min_samples_leaf": [7, 8, 9, 10, 11, 12, 13, 14],
}

decision_tree = model_trainer(
    decision_tree,
    prepared_X_train,
    y_train,
    decision_tree_param_grid
)

decision_tree

Search Parameters:
max_depth: [5]
min_samples_split: [2]
min_samples_leaf: [7, 8, 9, 10, 11, 12, 13, 14]

Validation Folds:
5

Performing grid search...

1: {'max_depth': 5, 'min_samples_leaf': 7, 'min_samples_split': 2}
    Mean F1 Score: 0.483
    Mean Precision: 0.658
    Mean Recall: 0.383
    Mean Accuracy: 0.819

2: {'max_depth': 5, 'min_samples_leaf': 8, 'min_samples_split': 2}
    Mean F1 Score: 0.482
    Mean Precision: 0.659
    Mean Recall: 0.381
    Mean Accuracy: 0.819

3: {'max_depth': 5, 'min_samples_leaf': 9, 'min_samples_split': 2}
    Mean F1 Score: 0.481
    Mean Precision: 0.658
    Mean Recall: 0.38
    Mean Accuracy: 0.819

4: {'max_depth': 5, 'min_samples_leaf': 10, 'min_samples_split': 2}
    Mean F1 Score: 0.478
    Mean Precision: 0.656
    Mean Reca...

### Random Forest Classifier
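
The `max_features` grid in this section mixes `None`, fractions, and the strings `"sqrt"` and `"log2"`. The sketch below shows how many features each setting would consider per split for the 27 prepared columns. `resolve_max_features` is a hypothetical helper that follows the rules described in the scikit-learn documentation; it is not the library's own code, which may differ in detail.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features per split."""
    if max_features is None:
        return n_features  # consider every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # fraction of the features
    return int(max_features)  # already an absolute count


n_prepared_features = 27  # width of prepared_X_train in this notebook
resolved = {
    setting: resolve_max_features(setting, n_prepared_features)
    for setting in [None, 0.5, "sqrt", "log2"]
}
print(resolved)
```

Smaller values make individual trees less correlated with one another, at the cost of weaker individual splits.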

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()

random_forest_param_grid = {
    "n_estimators": [500], # Use a larger number of trees
    "max_features": [ # Number (or fraction) of features considered at each split
        None, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, "sqrt", "log2"
    ],
    "max_depth": [None, 5], # Unlimited depth, or the depth used for the tuned decision tree
    "min_samples_leaf": [7], # Min. samples per leaf used for the tuned decision tree
}

random_forest = model_trainer(
    random_forest,
    prepared_X_train,
    y_train,
    random_forest_param_grid
)

random_forest


Search Parameters:
n_estimators: [500]
max_features: [None, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 'sqrt', 'log2']
max_depth: [None, 5]
min_samples_leaf: [7]

Validation Folds:
5

Performing grid search...

1: {'max_depth': None, 'max_features': None, 'min_samples_leaf': 7, 'n_estimators': 500}
    Mean F1 Score: 0.477
    Mean Precision: 0.676
    Mean Recall: 0.369
    Mean Accuracy: 0.821

2: {'max_depth': None, 'max_features': 0.9, 'min_samples_leaf': 7, 'n_estimators': 500}
    Mean F1 Score: 0.479
    Mean Precision: 0.68
    Mean Recall: 0.37
    Mean Accuracy: 0.822

3: {'max_depth': None, 'max_features': 0.8, 'min_samples_leaf': 7, 'n_estimators': 500}
    Mean F1 Score: 0.477
    Mean Precision: 0.678
    Mean Recall: 0.368
    Mean Accuracy: 0.821

4: {'max_depth': None, 'max_...