<div style="border:2px solid #4CAF50; border-radius:10px; padding:16px; background:#f9fff9">

<h3>🔬 Notebook 3 — Tuning, Calibration & Interpretability</h3>

**Purpose**  
- Take the best baseline model (LightGBM) forward.  
- Apply cross-validation and hyperparameter tuning for robust probability estimates.  
- Assess calibration improvements (isotonic / Platt scaling).  
- Generate interpretability outputs:  
  - Global feature importance (SHAP, gain, permutation).  
  - Local explanations for individual borrowers.  
- Produce business-aligned outputs: threshold selection, lift/KS charts, and risk segmentation tables.  

**Inputs**  
- <code>models/lgbm_pipeline.joblib</code> (baseline pipeline)  
- <code>models/lgbm_metadata.json</code> (feature groups + deploy threshold + metrics)  
- <code>data/processed/loan_default_full51.parquet</code> (full dataset, 51 columns: 50 features + target)  

**Outputs**  
- Calibrated, tuned LightGBM model  
- Feature importance + SHAP explanations  
- Threshold guidance for probability-based interventions  
- Final artifacts saved for deployment in Notebook 4  

</div>


### <div class="alert alert-info" align = center> Imports</div>

In [1]:
# === Core Imports ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# === Modeling & Validation ===
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import (
    roc_auc_score, log_loss, brier_score_loss,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)

import lightgbm as lgb
import shap
import joblib
import json
import pathlib
import time
import warnings
warnings.filterwarnings("ignore")


# Start a timer to measure the notebook's total execution time.
start_time = time.time()

<div class="alert alert-info"><strong>Load The Data</strong></div>

In [3]:
# === Load Data & Artifacts for Notebook 3 ===

# Paths
datapath = pathlib.Path("data/processed")
modelpath = pathlib.Path("models")

full_file = datapath / "loan_default_full51.parquet"
slim_file = datapath / "loan_default_slim55.parquet"
meta_file = modelpath / "lgbm_metadata.json"
model_file = modelpath / "lgbm_pipeline.joblib"

# Load dataset (full version for modeling)
df = pd.read_parquet(full_file)
print(f"✅ Loaded dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")

# Load metadata
with open(meta_file, "r") as f:
    meta = json.load(f)
print("✅ Loaded metadata:", list(meta.keys()))

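# Target + features
# Note: the metadata loaded above has no "features" key (only "feature_groups"),
# so this falls back to every column except the target.
target = "target_default"
features = meta.get("features", [c for c in df.columns if c != target])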

X = df[features]
y = df[target]

# Train/test split (same random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"✅ Split into train={X_train.shape}, test={X_test.shape}")

# Load baseline LightGBM pipeline (from Notebook 2)
clf = joblib.load(model_file)
print("✅ Baseline LightGBM pipeline loaded")


✅ Loaded dataset: 1,329,272 rows × 51 columns
✅ Loaded metadata: ['model', 'pipeline_file', 'created_at', 'deploy_threshold', 'metrics', 'train_test_sizes', 'feature_groups']
✅ Split into train=(1063417, 50), test=(265855, 50)
✅ Baseline LightGBM pipeline loaded
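
Before tuning, it helps to record the baseline's probability metrics so later changes can be compared against them. The block below is a minimal sketch of such a helper; `probability_metrics` and the toy arrays are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss


def probability_metrics(y_true, p_default):
    """Return ranking and calibration metrics for predicted default probabilities.

    Hypothetical helper: y_true holds 0/1 labels, p_default the predicted
    probability of the positive (default) class.
    """
    y_true = np.asarray(y_true)
    p_default = np.asarray(p_default, dtype=float)
    return {
        "roc_auc": roc_auc_score(y_true, p_default),    # ranking quality
        "log_loss": log_loss(y_true, p_default),        # penalizes confident errors
        "brier": brier_score_loss(y_true, p_default),   # mean squared probability error
    }


# Toy data standing in for (y_test, predicted probabilities).
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
print(probability_metrics(y_toy, p_toy))
```

In this notebook the call would presumably be `probability_metrics(y_test, clf.predict_proba(X_test)[:, 1])`, assuming the loaded pipeline exposes `predict_proba`.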


In [4]:
df.head()

| | loan_amnt | funded_amnt_inv | int_rate | annual_inc | mths_since_last_record | open_acc | pub_rec | out_prncp | out_prncp_inv | total_rec_prncp | ... | mths_since_last_delinq_is_missing | mths_since_last_major_derog_is_missing | annual_inc_joint_is_missing | il_util_is_missing | mths_since_recent_bc_dlq_is_missing | mths_since_recent_revol_delinq_is_missing | target_default | emp_length | emp_length_is_missing | int_rate_trunc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000.0 | 30000.0 | 22.35 | 100000.0 | 84.0 | 11.0 | 1.0 | 0.0 | 0.0 | 30000.0 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 5 years | 0 | 22.35 |
| 1 | 40000.0 | 40000.0 | 16.139999 | 45000.0 | 74.0 | 18.0 | 0.0 | 0.0 | 0.0 | 40000.0 | ... | 1 | 1 | 0 | 0 | 1 | 1 | 0 | < 1 year | 0 | 16.1399 |
| 2 | 20000.0 | 20000.0 | 7.56 | 100000.0 | 74.0 | 9.0 | 0.0 | 0.0 | 0.0 | 20000.0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 10+ years | 0 | 7.5599 |
| 3 | 4500.0 | 4500.0 | 11.31 | 38500.0 | 74.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4500.0 | ... | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 10+ years | 0 | 11.31 |
| 4 | 8425.0 | 8425.0 | 27.27 | 450000.0 | 74.0 | 21.0 | 0.0 | 0.0 | 0.0 | 8425.0 | ... | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 3 years | 0 | 27.27 |


### <div class="alert alert-info" > We will do hyperparameter tuning and cross-validation next </div>

In [2]:
# Stop the timer and report the notebook's total execution time.
end_time = time.time()
print(f"Total execution time: {round(end_time - start_time, 2)} seconds")

Total execution time: 16.57 seconds
