# Model Evaluation and Interpretation

In this notebook, we evaluate the performance of our trained models (GLM & LightGBM) and interpret their behavior using **Dalex**.

**Objectives:**
1. Load pre-trained models and test data.
2. Compare model performance using ROC curves.
3. Analyze Feature Importance to understand key drivers of loan default.
4. Visualize Partial Dependence Plots (PDP) for the top 5 features.

In [1]:
import sys
import joblib
import dalex as dx
import matplotlib.pyplot as plt
from pathlib import Path

# Add project root to sys.path to import from src
project_root = Path.cwd().parent

if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

from src.model_training import load_and_split_data

MODELS_DIR = project_root / "models"
FIGURES_DIR = project_root / "reports" / "figures"
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

## 1. Data and Model Loading

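We load the test dataset (using the same stratified split strategy as training) and the persisted `.joblib` models. Reusing the training-time split parameters reproduces the same held-out test set, so both models are evaluated on data they did not see during training.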
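Why reusing the split parameters matters can be illustrated with a minimal sketch. The `stratified_split` helper below is hypothetical and is not the project's `load_and_split_data`; it only assumes that the real split is stratified by class and driven by a fixed random seed.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Hypothetical helper: split row indices so each class keeps its share in the test set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 non-defaults (0) and 20 defaults (1)
labels = [0] * 80 + [1] * 20
train_a, test_a = stratified_split(labels)
train_b, test_b = stratified_split(labels)

# Same parameters give the same held-out rows, and train/test never overlap
same_split = (train_a, test_a) == (train_b, test_b)
overlap = set(train_a) & set(test_a)
```

If the seed or the test fraction differed between training and evaluation, some test rows could have been seen during training, which would inflate the reported metrics.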

In [2]:
# Use the same parameters as training to ensure the split is identical
print("Loading test data...")
(_, X_test, _, y_test), _, _ = load_and_split_data(max_rows=30000)

# Load saved models
print("Loading saved models...")
glm_path = MODELS_DIR / "glm_model.joblib"
lgbm_path = MODELS_DIR / "lgbm_model.joblib"

if not glm_path.exists() or not lgbm_path.exists():
    raise FileNotFoundError("Saved models not found. Please run src/model_training.py first.")

best_glm = joblib.load(glm_path)
best_lgbm = joblib.load(lgbm_path)
print("Done.")

Loading test data...
Loading data from /Users/wangyifan/Desktop/Work/qfy🐑/25-12-D100/d100_loan_default_prediction/data/processed/cleaned_data.parquet...
⚠️  Downsampling data from 148670 to 30000 rows (Stratified) since the training time may be too long...
Features selected: 9 numerical, 20 categorical.
Total samples for training/testing: 30000
Loading saved models...
Done.


## 2. Model Performance Evaluation (ROC Curve)

We initialize `dalex.Explainer` wrappers for both models. These explainers provide a unified interface for model diagnostics.

Below, we plot the **ROC Curve (Receiver Operating Characteristic)**.
* The closer the curve is to the top-left corner, the better the model.
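* **AUC (Area Under the Curve)** summarizes how well the model separates default from non-default cases: it equals the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation.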
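The ranking interpretation of AUC can be sketched in a few lines of plain Python. This is an illustration on toy scores, not how Dalex computes the metric internally; the function name `roc_auc` is chosen here for the example.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two non-defaults (0) and two defaults (1)
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc(y_toy, scores_toy)  # 3 of the 4 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of rows, so it is only meant for small illustrations; production code would normally rely on a library implementation.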

In [3]:
# Create explainers for both models
print("\nInitializing Dalex explainers...")
exp_glm = dx.Explainer(best_glm, X_test, y_test, label="GLM_ElasticNet", verbose=False)
exp_lgbm = dx.Explainer(best_lgbm, X_test, y_test, label="LightGBM", verbose=False)

# Compute classification metrics and overlay the ROC curves of both models
print("\nGenerating ROC curves...")
mp_glm = exp_glm.model_performance(model_type='classification')
mp_lgbm = exp_lgbm.model_performance(model_type='classification')
mp_lgbm.plot(mp_glm, geom='roc', show=False)


Initializing Dalex explainers...

Generating ROC curves...


## 3. Feature Importance Analysis

We use permutation-based feature importance to identify which variables have the most significant impact on the model's predictions.

* **Method:** We measure how much the loss (1 - AUC) increases when a feature's values are randomly shuffled. A larger increase means the model relies more heavily on that feature.
* We compare the top 10 features for both GLM and LGBM to see if they rely on similar information.
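The permutation procedure described above can be sketched without any modelling library. This is a simplified illustration under stated assumptions: `toy_model` and the toy data are invented for the example, and the sketch reports the average increase in 1 - AUC over the unshuffled baseline, which may differ from how Dalex tabulates its results.

```python
import random


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in (1 - AUC) after shuffling each feature column."""
    rng = random.Random(seed)
    base_loss = 1.0 - auc(y, [predict(row) for row in X])
    result = {}
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            perm_loss = 1.0 - auc(y, [predict(row) for row in X_perm])
            total += perm_loss - base_loss
        result[j] = total / n_repeats
    return result


def toy_model(row):
    """Hypothetical scorer that uses only the first feature."""
    return row[0]


X_toy = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
         [0.6, 3.0], [0.7, 9.0], [0.8, 0.0], [0.9, 7.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
importance = permutation_importance(toy_model, X_toy, y_toy)
```

Because `toy_model` ignores the second feature, shuffling it leaves the loss unchanged and its importance is zero, while shuffling the first feature degrades the ranking. Averaging over several shuffles reduces the noise of any single permutation.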

In [4]:
# Feature importance
print("\nCalculating Feature Importance...")
vi_glm = exp_glm.model_parts(loss_function='1-auc')
vi_lgbm = exp_lgbm.model_parts(loss_function='1-auc')
vi_lgbm.plot(vi_glm, max_vars=10, show=False)


Calculating Feature Importance...


## 4. Partial Dependence Plots (PDP)

Partial Dependence Plots show the **marginal effect** of a feature on the predicted outcome (probability of default), obtained by averaging the model's predictions over the observed values of all other features.

* We select the **Top 5 most important features** from the LightGBM model.
* **Interpretation:**
    * **Increasing trend:** Higher feature values are associated with a higher predicted default risk.
    * **Decreasing trend:** Higher feature values are associated with a lower predicted default risk.
    * **Non-linear patterns:** Complex relationships captured by the tree-based model (LightGBM) that the linear model (GLM) might miss.
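The averaging behind a partial dependence profile can be sketched as follows. This is a minimal illustration, not the Dalex implementation; `toy_model`, the toy rows, and the grid are assumptions made for the example.

```python
def partial_dependence(predict, X, feature, grid):
    """For each grid value, force the feature to that value in every row and average the predictions."""
    profile = []
    for value in grid:
        preds = []
        for row in X:
            modified = list(row)       # copy so the original data is untouched
            modified[feature] = value  # all other features keep their observed values
            preds.append(predict(modified))
        profile.append(sum(preds) / len(preds))
    return profile


def toy_model(row):
    """Hypothetical linear scorer used only for illustration."""
    return 0.2 * row[0] + 0.1 * row[1]


X_toy = [[0.0, 1.0], [1.0, 3.0]]
grid = [0.0, 1.0, 2.0]
pdp_feature_0 = partial_dependence(toy_model, X_toy, feature=0, grid=grid)
```

For this linear scorer the profile rises by the same amount at every grid step, which is the straight-line shape a GLM tends to produce on its link scale. A tree-based model evaluated the same way can yield steps or plateaus instead.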

In [5]:
# Partial Dependence Plots (PDP)
print("\nIdentifying top features for PDP...")
top_features_df = vi_lgbm.result[~vi_lgbm.result.variable.isin(['_baseline_', '_full_model_', '_one_look_'])]
top_5_features = top_features_df.sort_values('dropout_loss', ascending=False)['variable'].head(5).tolist()
print(f"Top 5 features (LGBM): {top_5_features}")

# Calculate PDP profiles
print("Generating Partial Dependence Plots (PDP)...")
pdp_glm = exp_glm.model_profile(variables=top_5_features)
pdp_lgbm = exp_lgbm.model_profile(variables=top_5_features)
pdp_lgbm.plot(pdp_glm, show=False)


Identifying top features for PDP...
Top 5 features (LGBM): ['Upfront_charges', 'credit_type', 'LTV', 'loan_purpose', 'open_credit']
Generating Partial Dependence Plots (PDP)...


Calculating ceteris paribus: 100%|██████████| 5/5 [00:00<00:00, 31.58it/s]
Calculating ceteris paribus: 100%|██████████| 5/5 [00:00<00:00, 20.44it/s]
