# Model Development and Validation

This notebook develops and evaluates predictive models for credit default.
A simple, interpretable logistic regression model is used as a regulatory
baseline, followed by a more powerful gradient boosting model. Special
attention is given to class imbalance, validation strategy, and probability
calibration.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV


In [2]:
# Load dataset
df = pd.read_csv("../data/raw/german_credit.csv")

# Drop index column if present
if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])


In [4]:
# Re-create target variable (same logic as EDA)
df["default"] = np.where(
    (df["Credit amount"] > df["Credit amount"].median()) &
    (df["Duration"] > df["Duration"].median()),
    1,
    0
)


In [5]:
# Target
y = df["default"]

# Drop target and non-numeric columns for baseline model
X = df.drop(columns=["default", "Sex", "Housing", "Saving accounts", "Checking account", "Purpose"])


## Baseline Model: Logistic Regression

Logistic regression is used as the baseline model due to its transparency,
interpretability, and widespread regulatory acceptance in credit risk modeling.


In [6]:
log_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        class_weight="balanced",
        max_iter=1000,
        random_state=42
    ))
])


In [7]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

auc_scores = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    log_reg.fit(X_train, y_train)
    y_pred = log_reg.predict_proba(X_test)[:, 1]

    auc_scores.append(roc_auc_score(y_test, y_pred))

np.mean(auc_scores)


0.9792967032967035

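**Observation:**
The logistic regression baseline reaches a mean AUC of about 0.979 across the
five stratified folds and serves as the interpretable benchmark for default
prediction. A value this high is unusual for credit data and reflects how the
proxy target was constructed from `Credit amount` and `Duration`.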
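The interpretability claim can be made concrete by reading the fitted coefficients out of the pipeline. The sketch below is a minimal illustration on synthetic data, not the notebook's own result: `X_demo`, `y_demo`, `demo_pipe`, and the feature names are hypothetical placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature names are hypothetical placeholders.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
demo_pipe.fit(X_demo, y_demo)

# Coefficients live on the standardized scale: each one is the change in
# log-odds per one standard deviation of the corresponding feature.
coefs = demo_pipe.named_steps["model"].coef_.ravel()
odds_ratios = np.exp(coefs)

for name, coef, ratio in zip(feature_names, coefs, odds_ratios):
    print(f"{name}: coef={coef:+.3f}, odds ratio per SD={ratio:.3f}")
```

Applied to the notebook's `log_reg` after fitting, the same `named_steps["model"].coef_` lookup would give one coefficient per column of `X`.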


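Class imbalance is handled with `class_weight="balanced"` in the logistic
regression, which weights each class inversely to its frequency in the
training data. SMOTE oversampling is fitted in the next cell as an alternative.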
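As a worked example of what `"balanced"` means, the weights follow `n_samples / (n_classes * count_per_class)`. The sketch below uses the class counts this notebook reports (650 non-default, 350 default); the array `y_counts` is a stand-in built from those counts, not the real target column.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the notebook's class counts: 650 zeros, 350 ones.
y_counts = np.array([0] * 650 + [1] * 350)

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y_counts) / (2 * np.bincount(y_counts))

# The same weights via scikit-learn's helper
from_sklearn = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=y_counts
)

print(manual)        # weight for class 0, weight for class 1
print(from_sklearn)
```

With these counts a default case weighs roughly 1.43 and a non-default case roughly 0.77, so each class contributes equally to the loss in total.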


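In [8]:
from imblearn.over_sampling import SMOTE

# Alternative to class weighting: oversample the minority class with SMOTE.
# Resampling the full dataset is for illustration only; this model must not be
# scored on X, y, because the synthetic points were built from every row.
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# No class_weight here: the resampled classes are already balanced
log_reg_smote = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42))
])

log_reg_smote.fit(X_smote, y_smote)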


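**Observation:**
Both class weighting and SMOTE are intended to improve sensitivity to default
cases, although this notebook does not measure recall for either approach and
the SMOTE model is fitted without being evaluated. Class weighting is preferred
here because it needs no resampling step inside cross-validation. Both
techniques shift predicted probabilities toward the minority class, which is
one reason the calibration step later in the notebook matters.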
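One way to back the sensitivity claim with a number is to compare out-of-fold recall with and without class weighting. The following is a minimal sketch on synthetic data with a similar 65/35 class split; `X_demo`, `y_demo`, `cv_demo`, and `out_of_fold_recall` are hypothetical names, and the result says nothing about the German credit data itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 65% negatives and 35% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.65, 0.35], random_state=0
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


def out_of_fold_recall(class_weight):
    """Recall on the positive class, using only out-of-fold predictions."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(class_weight=class_weight, max_iter=1000)),
    ])
    preds = cross_val_predict(pipe, X_demo, y_demo, cv=cv_demo)
    return recall_score(y_demo, preds)


recall_plain = out_of_fold_recall(None)
recall_weighted = out_of_fold_recall("balanced")
print(f"recall without weighting: {recall_plain:.3f}")
print(f"recall with class_weight='balanced': {recall_weighted:.3f}")
```

To include SMOTE in the same comparison, the resampler would need to sit inside an `imblearn.pipeline.Pipeline` so that synthetic samples are generated from the training folds only.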


In [9]:
param_grid = {
    "model__C": [0.01, 0.1, 1, 10]
}

grid = GridSearchCV(
    log_reg,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3
)

nested_auc = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_

    y_pred = best_model.predict_proba(X_test)[:, 1]
    nested_auc.append(roc_auc_score(y_test, y_pred))

np.mean(nested_auc)


0.979010989010989

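**Observation:**
Nested cross-validation tunes `C` on inner folds and scores the tuned model on
outer folds that played no part in the tuning, which avoids the optimistic bias
of scoring on the data used for selection. The nested mean AUC (about 0.979) is
essentially the same as the untuned baseline, which suggests that `C` has
little influence on this problem.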
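The explicit loop above can also be written more compactly by passing the `GridSearchCV` object itself to `cross_val_score`. This is a sketch on synthetic data under that assumption; `X_demo`, `y_demo`, `inner_cv`, and `outer_cv` are hypothetical names, and the inner splitter is made explicit rather than left as `cv=3`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Inner loop selects C; outer loop scores the whole tuning procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold refits the grid search on its own training portion.
nested_scores = cross_val_score(search, X_demo, y_demo, scoring="roc_auc", cv=outer_cv)
print(f"nested AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate varies from fold to fold.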


In [10]:
calibrated_model = CalibratedClassifierCV(
    estimator=log_reg,
    method="sigmoid",
    cv=5
)

calibrated_model.fit(X, y)


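**Observation:**
Calibration aims to align predicted probabilities with observed default
frequencies, which is essential for regulatory and business decision-making.
It matters particularly here because `class_weight="balanced"` pushes raw
probabilities toward the minority class. The calibrated model is fitted above,
but its calibration quality is not checked in this notebook.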
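A calibration check could look like the sketch below, which scores a sigmoid-calibrated model on a held-out split using the Brier score and a reliability curve. It runs on synthetic data; `X_demo`, `y_demo`, and the 70/30 split are assumptions made for illustration, not part of the original analysis.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.65, 0.35], random_state=0
)
# Hold out data that neither the model nor the calibrator sees during fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

base = LogisticRegression(class_weight="balanced", max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_fit, y_fit)

proba = calibrated.predict_proba(X_hold)[:, 1]

# Brier score: mean squared error of the predicted probabilities (lower is better).
brier = brier_score_loss(y_hold, proba)

# Reliability curve: observed positive rate versus mean prediction per bin.
frac_pos, mean_pred = calibration_curve(y_hold, proba, n_bins=10)

print(f"Brier score: {brier:.4f}")
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the predicted and observed values in each bin lie close together.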


In [11]:
X.shape, y.value_counts()


((1000, 4),
 default
 0    650
 1    350
 Name: count, dtype: int64)

## Advanced Model: Gradient Boosting (LightGBM)

To improve predictive performance beyond the logistic regression baseline,
a gradient boosting model (LightGBM) is developed. LightGBM is well-suited
for tabular credit data and can capture non-linear relationships while
maintaining strong performance.


In [14]:
from lightgbm import LGBMClassifier
# GridSearchCV and roc_auc_score are already imported in the first cell


In [13]:
pip install lightgbm



In [15]:
lgbm = LGBMClassifier(
    objective="binary",
    class_weight="balanced",
    random_state=42
)


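In [16]:
# Separate name so the logistic regression grid above is not overwritten
lgbm_param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1]
}

grid_lgbm = GridSearchCV(
    lgbm,
    param_grid=lgbm_param_grid,
    scoring="roc_auc",
    cv=3
)

# Note: tuning uses all of X, y, so the later cross-validation of best_lgbm on
# the same data is not a nested estimate and is optimistically biased.
grid_lgbm.fit(X, y)

best_lgbm = grid_lgbm.best_estimator_
grid_lgbm.best_params_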


[LightGBM] [Info] Number of positive: 233, number of negative: 433
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000109 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 297
[LightGBM] [Info] Number of data points in the train set: 666, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000

{'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100}

In [19]:
lgbm_auc = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    best_lgbm.fit(X_train, y_train)
    y_pred = best_lgbm.predict_proba(X_test)[:, 1]

    lgbm_auc.append(roc_auc_score(y_test, y_pred))

np.mean(lgbm_auc)


[LightGBM] [Info] Number of positive: 280, number of negative: 520
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000047 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 333
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000

0.9991208791208791

In [20]:
best_lgbm


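In [21]:
# Re-create the classifier with verbosity=-1 to silence the LightGBM info logs
# on later runs; the already fitted best_lgbm is unaffected
lgbm = LGBMClassifier(
    objective="binary",
    class_weight="balanced",
    random_state=42,
    verbosity=-1
)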


## Model Performance Comparison

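The LightGBM model reaches a higher mean AUC-ROC than the logistic regression
baseline (about 0.999 versus 0.979). The proxy target is a conjunction of two
threshold conditions, which a tree-based model can represent almost exactly,
whereas logistic regression can only approximate it with a linear boundary.
The LightGBM figure is also slightly optimistic, because its hyperparameters
were selected on the full dataset before cross-validation. Logistic regression
retains the advantage of interpretability and regulatory transparency.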


In [22]:
print("Logistic Regression AUC:", np.mean(auc_scores))
print("LightGBM AUC:", np.mean(lgbm_auc))


Logistic Regression AUC: 0.9792967032967035
LightGBM AUC: 0.9991208791208791
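Comparing two means alone hides fold-to-fold variation. A slightly fuller comparison scores both models on identical folds and looks at the per-fold difference, as in the sketch below. It uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM so that it runs without extra dependencies; all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=800, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# A fixed random_state makes the splitter yield the same folds for both models.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
gb = GradientBoostingClassifier(random_state=42)  # stand-in for LightGBM

auc_lr = cross_val_score(lr, X_demo, y_demo, scoring="roc_auc", cv=cv_demo)
auc_gb = cross_val_score(gb, X_demo, y_demo, scoring="roc_auc", cv=cv_demo)

# Paired per-fold differences show whether the gap is consistent across folds.
diff = auc_gb - auc_lr
print(f"logistic regression: {auc_lr.mean():.3f} +/- {auc_lr.std():.3f}")
print(f"gradient boosting:   {auc_gb.mean():.3f} +/- {auc_gb.std():.3f}")
print(f"per-fold difference: {np.round(diff, 3)}")
```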


**Performance Interpretation:**

The high AUC values are largely an artifact of the proxy target: `default` is
a deterministic function of `Credit amount` and `Duration`, and both columns
remain in the feature matrix, so the models are recovering a known rule rather
than predicting an observed outcome. This setup is suitable for academic
demonstration purposes only. In a real-world deployment, performance would be
evaluated on an independent outcome-based default label and a strict temporal
holdout set.
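The effect of building the target from its own predictors can be shown without fitting any model. In the sketch below, a proxy label is created with the same median rule on two synthetic columns, and a hand-made score (the smaller of the two percentile ranks) separates the classes perfectly. The column names and distributions are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for loan amount and duration (continuous, so no ties).
demo = pd.DataFrame({
    "amount": rng.lognormal(mean=8.0, sigma=0.8, size=1000),
    "duration": rng.uniform(4, 72, size=1000),
})

# Same construction as the notebook's proxy target: both values above median.
demo["proxy_default"] = (
    (demo["amount"] > demo["amount"].median())
    & (demo["duration"] > demo["duration"].median())
).astype(int)

# Model-free score: the smaller of the two percentile ranks. It exceeds 0.5
# exactly when both features are above their medians, i.e. when the label is 1.
ranks = pd.concat(
    [demo["amount"].rank(pct=True), demo["duration"].rank(pct=True)], axis=1
)
score = ranks.min(axis=1)

auc = roc_auc_score(demo["proxy_default"], score)
print(f"AUC of the hand-made score: {auc:.3f}")  # 1.000
```

Because a trivial rule already achieves perfect separation, AUC values near 1.0 on such a target say little about a model's ability to predict real defaults.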
