**This notebook compares models and imbalance-handling methods to identify the best-performing combination.**

# 1. Imports & Data
- raw_df : Raw Data used to observe baseline results

- select_df : Data saved after cleaning the Recent Application dataset in [early_default_application.ipynb] for model & method selection.

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate, cross_val_predict
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, accuracy_score
from imblearn.over_sampling import SMOTE, RandomOverSampler, SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore")

In [None]:
raw_df = pd.read_csv("../data/application/current_application.csv")

In [None]:
select_df = pd.read_csv("../data/processed/mdoel_selection_data.csv")

# 2. Model & Method Selection

- This notebook compares Random Forest and Light Gradient Boosting Machine (Light GBM), each with 4 different methods for handling the imbalanced target.

- This section provides the detailed process behind the results summarized in the Model & Method Selection part of the [early_default_application.ipynb](early_default_application.ipynb)

## 2.1. Target Imbalance

In [5]:
No_difficulties = select_df["TARGET"].value_counts()[0]
Payment_difficulties = select_df["TARGET"].value_counts()[1]
total_len = len(select_df)

print('No Default:', round(No_difficulties/total_len * 100,2), '%')
print('Default:', round(Payment_difficulties/total_len * 100,2), '%')

No Default: 91.93 %
Default: 8.07 %


=> With this imbalance, a model that predicts "no default" for every case would reach an accuracy close to 92%, so accuracy is a misleading metric here.
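
The point about accuracy can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels with roughly the 92% / 8% split printed above; the counts are illustrative and are not taken from `select_df`.

```python
# Toy labels with roughly the class ratio printed above (92% / 8%).
# The counts are illustrative, not taken from select_df.
y_true = [0] * 92 + [1] * 8

# A "model" that always predicts the majority class (no default).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual defaults that the model catches.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy looks high while not a single default is detected, which is why the rest of the notebook scores models on precision, recall and F1 instead.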


## 2.2. StratifiedKFold

In [6]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {"precision": make_scorer(precision_score, zero_division=0),
          "recall": make_scorer(recall_score, zero_division=0),
          "f1": make_scorer(f1_score, zero_division=0)
          }
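
As a rough illustration of what the three scorers measure and what `zero_division=0` changes, the sketch below computes precision, recall and F1 from confusion counts. The helper name `prf` and the counts are hypothetical; this is not how scikit-learn implements the metrics internally.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion counts.

    A 0/0 ratio is reported as 0.0, mirroring zero_division=0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# A fold where the model never predicts the positive class:
# precision would be 0/0, so it is reported as 0.0.
never_positive = prf(tp=0, fp=0, fn=40)

# A fold with some positive predictions.
some_positive = prf(tp=10, fp=30, fn=30)

print(never_positive)
print(some_positive)
```

With a heavily imbalanced target, folds in which a model predicts no positives at all are plausible, so defining the 0/0 case explicitly keeps the fold averages well defined.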

## 2.3. Methods

- **SMOTE (Synthetic Minority Oversampling Technique)**
    
  - SMOTE creates synthetic minority-class samples by interpolating between real minority samples and their nearest neighbors; by default it continues until the minority class matches the size of the majority class.
    
  - SMOTENC is used when the dataset contains both categorical and numerical features.
    
- **RandomOverSampler**
    
  - RandomOverSampler simply duplicates existing samples from the minority class.
  - It is simpler than SMOTE but can increase the risk of overfitting.
    
- **RandomUnderSampler**
    
  - RandomUnderSampler reduces the number of majority-class samples by randomly selecting a subset; by default the subset is equal in size to the minority class.
    

- **Class Weight for Random Forest**
    
  - Random Forest has a class_weight parameter; setting it to "balanced" weights each class inversely to its frequency.
  - This increases the impact of minority-class samples when node splits are evaluated.
    
- **Is Unbalance for Light GBM**
    
  - Light GBM has a parameter is_unbalance, which automatically adjusts class weights inside the loss function.
  - This increases the influence of the minority class during gradient calculation, reducing bias toward the majority class.
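
The mechanisms listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the libraries' own code: `smote_point`, `balanced_class_weights` and `positive_class_weight` are hypothetical helper names, the class counts are made up, and the n_negative / n_positive ratio shown for `is_unbalance` is an assumption about LightGBM's behaviour that should be checked against the documentation of the installed version.

```python
import random

def smote_point(x, neighbor, rng):
    """One SMOTE-style synthetic point on the segment between a minority
    sample and one of its minority-class neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def balanced_class_weights(counts):
    """Weights of the form n_samples / (n_classes * count), the formula
    scikit-learn documents for class_weight="balanced"."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def positive_class_weight(counts):
    """Ratio n_negative / n_positive.

    Assumption: is_unbalance up-weights the positive class by this ratio.
    """
    return counts[0] / counts[1]

rng = random.Random(42)
# Two made-up minority samples with two numeric features each.
synthetic = smote_point([1.0, 2.0], [3.0, 6.0], rng)

# Hypothetical class counts with roughly a 92% / 8% split.
counts = {0: 9200, 1: 800}
weights = balanced_class_weights(counts)
ratio = positive_class_weight(counts)

print(synthetic)
print(weights)
print(ratio)
```

Both weighting schemes leave the data untouched and change only how much each sample counts, whereas the resampling methods change the training set itself.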

## 2.4. Random Forest

In [7]:
X = select_df.drop(columns=["TARGET","SK_ID_CURR"])
y = select_df["TARGET"]

# One-hot encode the categorical columns for Random Forest.
X_encoded = pd.get_dummies(X, drop_first=False)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42, stratify=y)

In [8]:
model_dicision_rf = []

### 2.4.1. class_weight

In [9]:
rf_cw = RandomForestClassifier(class_weight = "balanced",random_state=42, n_jobs=-1)
cv_cw = cross_validate(rf_cw, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)

results = {
    "model": "class_weight",
    "precision_mean": np.mean(cv_cw["test_precision"]),
    "precision_std": np.std(cv_cw["test_precision"]),
    "recall_mean": np.mean(cv_cw["test_recall"]),
    "recall_std": np.std(cv_cw["test_recall"]),
    "f1_mean": np.mean(cv_cw["test_f1"]),
    "f1_std": np.std(cv_cw["test_f1"]),
}

model_dicision_rf.append(results)
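
The same `results` dictionary is rebuilt in every cell that follows. One possible way to avoid the repetition is a small helper; the sketch below is an assumption about how that could look, not part of the original notebook. The name `summarize_cv` is hypothetical, the fold scores are made up, and the standard library is used so the snippet runs on its own.

```python
from statistics import fmean, pstdev

def summarize_cv(name, cv_result, metrics=("precision", "recall", "f1")):
    """Collapse per-fold scores into one row of <metric>_mean / <metric>_std."""
    row = {"model": name}
    for metric in metrics:
        fold_scores = list(cv_result[f"test_{metric}"])
        row[f"{metric}_mean"] = fmean(fold_scores)
        # Population standard deviation, the same default as np.std (ddof=0).
        row[f"{metric}_std"] = pstdev(fold_scores)
    return row

# Made-up fold scores, only to show the shape of the output.
fake_cv = {
    "test_precision": [0.50, 0.60, 0.40],
    "test_recall": [0.10, 0.20, 0.30],
    "test_f1": [0.20, 0.20, 0.20],
}
row = summarize_cv("example", fake_cv)
print(row)
```

With such a helper, each experiment cell would reduce to a `cross_validate` call followed by a single `append` of the returned row.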

### 2.4.2. SMOTE

In [10]:
smote = {
    "smote_03_rf": SMOTE(sampling_strategy=0.3, random_state=42),
    "smote_05_rf": SMOTE(sampling_strategy=0.5, random_state=42)
    }

rf_sm = RandomForestClassifier(random_state=42, n_jobs=-1)

for name, method in smote.items():
  pipe = Pipeline([("method", method), ("model",rf_sm)])
  cv_sm = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_sm["test_precision"]),
    "precision_std": np.std(cv_sm["test_precision"]),
    "recall_mean": np.mean(cv_sm["test_recall"]),
    "recall_std": np.std(cv_sm["test_recall"]),
    "f1_mean": np.mean(cv_sm["test_f1"]),
    "f1_std": np.std(cv_sm["test_f1"]),
  }
  model_dicision_rf.append(results)
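
For a binary target, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The sketch below turns the ratios used in this notebook into approximate class counts. The fold size is hypothetical, both helper names are made up for illustration, and the library's exact rounding may differ by one sample.

```python
def oversampled_minority(n_majority, ratio):
    """Approximate minority count after over-sampling
    so that minority / majority == ratio."""
    return round(ratio * n_majority)

def undersampled_majority(n_minority, ratio):
    """Approximate majority count after under-sampling
    so that minority / majority == ratio."""
    return round(n_minority / ratio)

# Hypothetical training fold with roughly a 92% / 8% split.
n_majority, n_minority = 9200, 800
current_ratio = n_minority / n_majority  # about 0.087 before resampling

for ratio in (0.3, 0.5, 0.7):
    grown = oversampled_minority(n_majority, ratio)
    shrunk = undersampled_majority(n_minority, ratio)
    print(f"ratio={ratio}: over-sampling -> {grown} minority rows, "
          f"under-sampling -> {shrunk} majority rows")
```

A higher ratio means more synthetic or duplicated minority rows when over-sampling and fewer retained majority rows when under-sampling, which is the trade-off the 0.3 / 0.5 / 0.7 settings explore.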

### 2.4.3. RandomOverSampler

In [11]:
over = {
    "over_05_rf": RandomOverSampler(sampling_strategy=0.5, random_state=42),
    "over_07_rf": RandomOverSampler(sampling_strategy=0.7, random_state=42)
    }

rf_over = RandomForestClassifier(random_state=42, n_jobs=-1)

for name, method in over.items():
  pipe = Pipeline([("method", method), ("model",rf_over)])
  cv_over = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_over["test_precision"]),
    "precision_std": np.std(cv_over["test_precision"]),
    "recall_mean": np.mean(cv_over["test_recall"]),
    "recall_std": np.std(cv_over["test_recall"]),
    "f1_mean": np.mean(cv_over["test_f1"]),
    "f1_std": np.std(cv_over["test_f1"]),
  }
  model_dicision_rf.append(results)

### 2.4.4. RandomUnderSampler

In [12]:

under = {
    "under_03_rf": RandomUnderSampler(sampling_strategy=0.3, random_state=42),
    "under_05_rf": RandomUnderSampler(sampling_strategy=0.5, random_state=42)
    }

rf_under = RandomForestClassifier(random_state=42, n_jobs=-1)

for name, method in under.items():
  pipe = Pipeline([("method", method), ("model",rf_under)])
  cv_under = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_under["test_precision"]),
    "precision_std": np.std(cv_under["test_precision"]),
    "recall_mean": np.mean(cv_under["test_recall"]),
    "recall_std": np.std(cv_under["test_recall"]),
    "f1_mean": np.mean(cv_under["test_f1"]),
    "f1_std": np.std(cv_under["test_f1"]),
  }
  model_dicision_rf.append(results)

### 2.4.5. Random Forest Results

In [13]:
model_scores_rf = pd.DataFrame(model_dicision_rf).round(3)
model_scores_rf

| | model | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|
| 0 | class_weight | 0.534 | 0.169 | 0.001 | 0.001 | 0.002 | 0.001 |
| 1 | smote_03_rf | 0.49 | 0.09 | 0.001 | 0.0 | 0.003 | 0.001 |
| 2 | smote_05_rf | 0.399 | 0.105 | 0.003 | 0.0 | 0.005 | 0.001 |
| 3 | over_05_rf | 0.496 | 0.019 | 0.009 | 0.001 | 0.018 | 0.002 |
| 4 | over_07_rf | 0.457 | 0.057 | 0.011 | 0.003 | 0.021 | 0.006 |
| 5 | under_03_rf | 0.352 | 0.007 | 0.124 | 0.004 | 0.183 | 0.005 |
| 6 | under_05_rf | 0.255 | 0.006 | 0.317 | 0.01 | 0.283 | 0.007 |


## 2.5. Light Gradient Boosting Machine

In [14]:
X = select_df.drop(columns=["TARGET","SK_ID_CURR"])
y = select_df['TARGET']

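# Label-encode the object columns and remember their positions so that
# Light GBM and SMOTENC can treat them as categorical features.
cat = X.select_dtypes(include=["object"]).columns
cat_idx = [X.columns.get_loc(c) for c in cat]
for c in cat:
  X[c] = LabelEncoder().fit_transform(X[c])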

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [15]:
model_decision_lgbm = []

### 2.5.1. is_unbalance

In [16]:
lgbm_iu = LGBMClassifier(is_unbalance=True,categorical_feature=cat_idx, random_state=42)
cv_iu = cross_validate(lgbm_iu, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)

results = {
    "model": "is_unbalance",
    "precision_mean": np.mean(cv_iu["test_precision"]),
    "precision_std": np.std(cv_iu["test_precision"]),
    "recall_mean": np.mean(cv_iu["test_recall"]),
    "recall_std": np.std(cv_iu["test_recall"]),
    "f1_mean": np.mean(cv_iu["test_f1"]),
    "f1_std": np.std(cv_iu["test_f1"]),
}

model_decision_lgbm.append(results)

### 2.5.2. SMOTENC

In [17]:
smotenc = {
    "smote_03_lgbm": SMOTENC(sampling_strategy=0.3,categorical_features=cat_idx, random_state=42),
    "smote_05_lgbm": SMOTENC(sampling_strategy=0.5,categorical_features=cat_idx, random_state=42)
    }

lgbm_sm = LGBMClassifier(categorical_feature=cat_idx, random_state=42)

for name, method in smotenc.items():
  pipe = Pipeline([("method", method), ("model",lgbm_sm)])
  cv_sm = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_sm["test_precision"]),
    "precision_std": np.std(cv_sm["test_precision"]),
    "recall_mean": np.mean(cv_sm["test_recall"]),
    "recall_std": np.std(cv_sm["test_recall"]),
    "f1_mean": np.mean(cv_sm["test_f1"]),
    "f1_std": np.std(cv_sm["test_f1"]),
  }
  model_decision_lgbm.append(results)

### 2.5.3. RandomOverSampler

In [18]:
over = {
    "over_05_lgbm": RandomOverSampler(sampling_strategy=0.5, random_state=42),
    "over_07_lgbm": RandomOverSampler(sampling_strategy=0.7, random_state=42)
    }

lgbm_over = LGBMClassifier(categorical_feature=cat_idx, random_state=42)

for name, method in over.items():
  pipe = Pipeline([("method", method), ("model",lgbm_over)])
  cv_over = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_over["test_precision"]),
    "precision_std": np.std(cv_over["test_precision"]),
    "recall_mean": np.mean(cv_over["test_recall"]),
    "recall_std": np.std(cv_over["test_recall"]),
    "f1_mean": np.mean(cv_over["test_f1"]),
    "f1_std": np.std(cv_over["test_f1"]),
  }
  model_decision_lgbm.append(results)

### 2.5.4. RandomUnderSampler

In [19]:
under = {
    "under_03_lgbm": RandomUnderSampler(sampling_strategy=0.3, random_state=42),
    "under_05_lgbm": RandomUnderSampler(sampling_strategy=0.5, random_state=42)
    }

lgbm_under = LGBMClassifier(categorical_feature=cat_idx,random_state=42)

for name, method in under.items():
  pipe = Pipeline([("method", method), ("model",lgbm_under)])
  cv_under = cross_validate(pipe, X_train,y_train, cv=cv, scoring=scores, n_jobs=-1, return_train_score=False)
  results = {
    "model": name,
    "precision_mean": np.mean(cv_under["test_precision"]),
    "precision_std": np.std(cv_under["test_precision"]),
    "recall_mean": np.mean(cv_under["test_recall"]),
    "recall_std": np.std(cv_under["test_recall"]),
    "f1_mean": np.mean(cv_under["test_f1"]),
    "f1_std": np.std(cv_under["test_f1"]),
  }
  model_decision_lgbm.append(results)

### 2.5.5. Light GBM Results

In [20]:
model_scores_lgbm = pd.DataFrame(model_decision_lgbm).round(3)
model_scores_lgbm

| | model | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std |
|---|---|---|---|---|---|---|---|
| 0 | is_unbalance | 0.171 | 0.001 | 0.67 | 0.005 | 0.272 | 0.002 |
| 1 | smote_03_lgbm | 0.478 | 0.024 | 0.028 | 0.002 | 0.052 | 0.003 |
| 2 | smote_05_lgbm | 0.438 | 0.032 | 0.029 | 0.003 | 0.054 | 0.006 |
| 3 | over_05_lgbm | 0.247 | 0.004 | 0.411 | 0.006 | 0.309 | 0.005 |
| 4 | over_07_lgbm | 0.207 | 0.002 | 0.533 | 0.01 | 0.298 | 0.004 |
| 5 | under_03_lgbm | 0.308 | 0.007 | 0.248 | 0.007 | 0.275 | 0.006 |
| 6 | under_05_lgbm | 0.236 | 0.006 | 0.43 | 0.009 | 0.305 | 0.006 |
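
One way to read the Random Forest and Light GBM results together is to rank every configuration by mean F1. The sketch below hard-codes the `f1_mean` values reported in the two result tables. Ranking by F1 is an assumption made for illustration; ranking by recall or precision would give a different order.

```python
# (model family, method, f1_mean) copied from the two result tables.
rows = [
    ("rf", "class_weight", 0.002),
    ("rf", "smote_03_rf", 0.003),
    ("rf", "smote_05_rf", 0.005),
    ("rf", "over_05_rf", 0.018),
    ("rf", "over_07_rf", 0.021),
    ("rf", "under_03_rf", 0.183),
    ("rf", "under_05_rf", 0.283),
    ("lgbm", "is_unbalance", 0.272),
    ("lgbm", "smote_03_lgbm", 0.052),
    ("lgbm", "smote_05_lgbm", 0.054),
    ("lgbm", "over_05_lgbm", 0.309),
    ("lgbm", "over_07_lgbm", 0.298),
    ("lgbm", "under_03_lgbm", 0.275),
    ("lgbm", "under_05_lgbm", 0.305),
]

# Sort from highest to lowest mean F1 and show the top three.
ranked = sorted(rows, key=lambda r: r[2], reverse=True)
for family, method, f1 in ranked[:3]:
    print(f"{family:5s} {method:15s} f1_mean={f1:.3f}")
```

The top entries are close to each other relative to their reported standard deviations, so the ordering among them should be read with some caution.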


# 3. Baseline Results


The baseline model is a simple model built with minimal preprocessing.
It serves as a reference point before applying feature engineering or optimization.

In [21]:
X = raw_df.drop(columns=["TARGET","SK_ID_CURR"])
y = raw_df['TARGET']

cat = X.select_dtypes(include=["object"]).columns
X[cat] = X[cat].astype("category")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
baseline = LGBMClassifier(random_state=42, verbose=-2)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [23]:
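# Out-of-fold probabilities of the positive class on the training split.
y_proba_train = cross_val_predict(baseline, X_train, y_train, cv=cv, method="predict_proba", n_jobs=-1)[:,1]

# Sweep candidate decision thresholds and record the F1 score at each one.
thresholds = np.linspace(0, 1, 100)

f1_scores = []
for threshold in thresholds:
    y_pred_train = (y_proba_train > threshold).astype(int)
    f1_scores.append(f1_score(y_train, y_pred_train))

# Keep the threshold with the highest F1 and use it for the final predictions.
best_th = thresholds[np.argmax(f1_scores)]
y_pred_train = (y_proba_train > best_th).astype(int)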

accuracy = accuracy_score(y_train, y_pred_train)
precision = precision_score(y_train, y_pred_train)
recall = recall_score(y_train, y_pred_train)
f1 = f1_score(y_train, y_pred_train)
print(f"Accuracy: {accuracy*100:.2f}%, Precision: {precision*100:.2f}%, recall: {recall*100:.2f}%, f1: {f1*100:.2f}%")

Accuracy: 84.84%, Precision: 24.07%, recall: 40.66%, f1: 30.24%
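
The threshold search in the cell above can be illustrated on a toy example. This is a minimal sketch with made-up labels and probabilities; `f1_at` is a hypothetical helper, and the grid only approximates `np.linspace(0, 1, 100)`. Because the threshold is chosen and scored on the same out-of-fold predictions, the resulting F1 is likely to be slightly optimistic.

```python
def f1_at(threshold, y_true, y_proba):
    """F1 of the rule `predict 1 if probability > threshold`."""
    tp = fp = fn = 0
    for label, proba in zip(y_true, y_proba):
        pred = int(proba > threshold)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities for ten applicants.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_proba = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60]

# 100 evenly spaced thresholds between 0 and 1.
thresholds = [i / 99 for i in range(100)]
scores = [f1_at(th, y_true, y_proba) for th in thresholds]

# Index of the first maximum, the same tie-breaking as np.argmax.
best_idx = max(range(len(scores)), key=scores.__getitem__)
best_th = thresholds[best_idx]

print(f"best threshold={best_th:.3f}, F1={scores[best_idx]:.3f}")
print(f"F1 at 0.5={f1_at(0.5, y_true, y_proba):.3f}")
```

On imbalanced data the predicted probabilities of the positive class tend to be low, so the F1-optimal threshold often sits well below 0.5, as it does in this toy case.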
