# Validation & Evaluation Phase

## Overview
This phase checks whether the model is **unbiased, free of systematic errors, and production-ready**. It covers model performance, fairness, and stability across demographic groups.

## Why Validation?

✅ **Performance Verification**: Ensure the model generalizes well to unseen data  
✅ **Bias Detection**: Identify disparities across demographic groups (age, etc.)  
✅ **Error Analysis**: Understand failure modes and potential risks  
✅ **Stability Assessment**: Bootstrap confidence intervals and feature importance  
✅ **Calibration Check**: Verify predicted probabilities are reliable  

## What Gets Evaluated?

### 1. Performance Metrics (`validation_metrics.py`)
- **Accuracy, Precision, Recall, F1-Score**: Overall model quality
- **ROC-AUC, PR-AUC**: Discrimination ability across thresholds
- **Confusion Matrix**: TP/TN/FP/FN breakdown
- **Calibration Curve**: Do predicted probabilities match observed frequencies?
- **Brier Score, Log Loss**: Probability calibration metrics
- **Classification Report**: Per-class precision/recall

**Outputs:** `Model_Results/{ModelName}_metrics_summary.csv`, confusion matrix PNG, ROC/PR/calibration plots
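
As a rough illustration of how the metrics listed above can be computed, here is a minimal sketch using scikit-learn. It is not the code in `validation_metrics.py`: the helper name `summarize_metrics`, the 0.5 decision threshold, and the toy data are assumptions made for this example, and PR-AUC is approximated with average precision.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, brier_score_loss,
    log_loss, confusion_matrix,
)

def summarize_metrics(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: headline metrics for one binary classifier."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Hard labels are derived from probabilities with an assumed threshold.
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "n_samples": int(len(y_true)),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # Threshold-free metrics use the probabilities directly.
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "logloss": log_loss(y_true, y_prob, labels=[0, 1]),
        "tp": int(tp), "tn": int(tn), "fp": int(fp), "fn": int(fn),
    }

# Toy data, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
metrics = summarize_metrics(y_true, y_prob)
print(metrics)
```

Reporting threshold-dependent metrics (precision, recall, F1) next to threshold-free ones (ROC-AUC, Brier score) helps separate ranking quality from the choice of cutoff.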

### 2. Age-Related Features & Fairness (`age_features_engineering.py` + `age_fairness_analysis.py`)

#### Feature Engineering
- **age**: Parsed from DOB or birth_year column
- **age_norm**: Standardized age (z-score: mean=0, std=1)
- **age_sq**: Age squared (for nonlinear effects)
- **age_bin_fixed**: Human-friendly bins: 0-25, 26-45, 46-65, 66+
- **age_bin_q**: Quartile bins (equal frequency)

**Output:** `Model_Results/age_features.csv`
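
The derived columns above can be sketched with pandas as follows. This is an illustrative reconstruction, not the contents of `age_features_engineering.py`: the helper name `add_age_features` is hypothetical, and it assumes a numeric `age` column already exists (parsing from DOB or birth year is left out).

```python
import numpy as np
import pandas as pd

def add_age_features(df, age_col="age"):
    """Hypothetical helper: add the derived age columns described above."""
    out = df.copy()
    age = out[age_col].astype(float)
    # Z-score standardization (population std, so the result has std exactly 1).
    out["age_norm"] = (age - age.mean()) / age.std(ddof=0)
    out["age_sq"] = age ** 2
    # Fixed bins: [0, 25], (25, 45], (45, 65], (65, inf).
    out["age_bin_fixed"] = pd.cut(
        age,
        bins=[0, 25, 45, 65, np.inf],
        labels=["0-25", "26-45", "46-65", "66+"],
        include_lowest=True,
    )
    # Quartile bins as integer codes 0..3; duplicate edges are dropped
    # so heavily tied ages do not raise an error.
    out["age_bin_q"] = pd.qcut(age, q=4, labels=False, duplicates="drop")
    return out

# Toy data, for illustration only.
toy = pd.DataFrame({"age": [22, 30, 47, 70, 25, 45, 65, 66]})
features = add_age_features(toy)
print(features)
```

Fixed bins are easier to explain to stakeholders, while quartile bins keep group sizes balanced, which makes per-group rates less noisy.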

#### Fairness Analysis by Age Group
- **Positive Rate**: Share of cases predicted as default, by age group
- **TPR (True Positive Rate)**: Sensitivity, the share of actual defaults correctly identified
- **FPR (False Positive Rate)**: False alarm rate
- **Precision & Recall**: Per-age-group performance

**Outputs:**  
- `Model_Results/fairness_by_age.csv`: Combined fairness metrics for all models  
- `Model_Results/{ModelName}_fairness_by_age.csv`: Per-model breakdown  
- `Model_Results/age_hist.png`: Age distribution  
- `Model_Results/age_box_by_target.png`: Age vs target (boxplot)  
- `Model_Results/age_posrate_by_agebin.png`: Positive rate by age bin  
- `Model_Results/age_stats.csv`: Descriptive statistics (mean, std, min, max, etc.)  
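
To make the per-group fairness metrics concrete, the sketch below computes positive rate, TPR, FPR, and precision for each age bin from hard predictions. It is an assumption-laden illustration rather than the logic in `age_fairness_analysis.py`; `fairness_by_group` and `_ratio` are hypothetical names, and the labels are toy values.

```python
import pandas as pd

def _ratio(numerator, denominator):
    # Return NaN instead of raising when a group has no relevant cases.
    return numerator / denominator if denominator else float("nan")

def fairness_by_group(y_true, y_pred, groups):
    """Hypothetical helper: confusion-matrix rates per group."""
    frame = pd.DataFrame(
        {"y": list(y_true), "pred": list(y_pred), "group": list(groups)}
    )
    rows = []
    for name, part in frame.groupby("group"):
        tp = int(((part["y"] == 1) & (part["pred"] == 1)).sum())
        fp = int(((part["y"] == 0) & (part["pred"] == 1)).sum())
        fn = int(((part["y"] == 1) & (part["pred"] == 0)).sum())
        tn = int(((part["y"] == 0) & (part["pred"] == 0)).sum())
        rows.append({
            "group": name,
            "n": len(part),
            "positive_rate": _ratio(tp + fp, len(part)),
            "tpr": _ratio(tp, tp + fn),
            "fpr": _ratio(fp, fp + tn),
            "precision": _ratio(tp, tp + fp),
        })
    return pd.DataFrame(rows).set_index("group")

# Toy data, for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]
groups = ["0-25"] * 4 + ["26-45"] * 4
table = fairness_by_group(y_true, y_pred, groups)
tpr_gap = table["tpr"].max() - table["tpr"].min()
print(table)
print("TPR gap:", tpr_gap)
```

The largest gap in TPR or FPR between groups is a simple summary to track. What counts as an acceptable gap is a policy decision, not something this sketch decides.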

## How to Run
### Option: Python Scripts (Standalone)
```python
from explainability.validation_metrics import run_validation
from explainability.age_features_engineering import engineer_age_features
from explainability.age_fairness_analysis import plot_age_distribution, compute_fairness_by_age

# Run validation
run_validation(X_eval, y_eval, models, output_dir='Model_Results')

# Engineer age features
df = engineer_age_features(df, age_col=None, output_dir='Model_Results')

# Fairness analysis
plot_age_distribution(df, output_dir='Model_Results')
compute_fairness_by_age(X, y, models, output_dir='Model_Results')
```

## Interpreting Results

### Red Flags 🚩
- **Large metric gaps by age group** → Potential bias against certain ages  
- **High FPR disparity** → Model may raise false alarms more often for one age group  
- **Poor calibration** → Predicted probabilities don't match actual rates  
- **High variance in bootstrap estimates** → Model may be unstable
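
The calibration red flag can be checked numerically as well as visually. The sketch below is one possible check, not the project's own: it bins predictions with scikit-learn's `calibration_curve` and reports the largest gap between predicted and observed rates. The helper name `calibration_gap`, the bin count, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def calibration_gap(y_true, y_prob, n_bins=5):
    """Hypothetical helper: max |observed rate - mean predicted prob| over bins."""
    frac_pos, mean_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins, strategy="uniform"
    )
    return float(np.max(np.abs(frac_pos - mean_pred)))

# Synthetic data: outcomes are drawn with probability p, so p itself is
# well calibrated, while the distorted scores p**3 are not.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5000)
y_cal = rng.binomial(1, p)

gap_good = calibration_gap(y_cal, p)
gap_bad = calibration_gap(y_cal, p ** 3)
print(f"calibrated gap: {gap_good:.3f}, distorted gap: {gap_bad:.3f}")
```

A small gap means the calibration curve stays near the diagonal. A large gap means the scores should not be read as probabilities without recalibration.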

### Green Flags ✅
- **Similar TPR/FPR across age groups** → Fair treatment across demographics  
- **Calibration curve close to diagonal** → Reliable probability predictions  
- **Tight bootstrap CI (small std)** → Stable predictions  
- **High ROC-AUC + balanced metrics** → Good overall performance
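
The bootstrap stability check mentioned above can be sketched as a percentile confidence interval for ROC-AUC. This is a minimal sketch under stated assumptions, not the project's implementation: `bootstrap_auc_ci`, the number of resamples, and the synthetic scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=500, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap CI for ROC-AUC."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # Resample rows with replacement.
        idx = rng.integers(0, n, size=n)
        if np.unique(y_true[idx]).size < 2:
            continue  # ROC-AUC is undefined when only one class is present
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    scores = np.asarray(scores)
    ci_low, ci_high = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return {
        "mean": float(scores.mean()),
        "std": float(scores.std(ddof=1)),
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
    }

# Synthetic scores, for illustration only: positives score higher on average.
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=200)
p_demo = np.clip(0.3 + 0.4 * y_demo + rng.normal(0, 0.15, size=200), 0, 1)
result = bootstrap_auc_ci(y_demo, p_demo)
print(result)
```

A narrow interval (small `std`) suggests the metric is stable under resampling of the evaluation set. It says nothing about stability under retraining, which would require refitting the model on resampled training data.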

## Output Directory Structure
```
Model_Results/
├── {ModelName}_metrics_summary.csv
├── {ModelName}_confusion_matrix.png
├── {ModelName}_roc.png
├── {ModelName}_pr.png
├── {ModelName}_calibration.png
├── {ModelName}_classification_report.csv
├── age_features.csv
├── age_stats.csv
├── age_hist.png
├── age_box_by_target.png
├── age_posrate_by_agebin.png
├── age_bin_summary.csv
├── age_target_correlation.txt
├── fairness_by_age.csv
└── {ModelName}_fairness_by_age.csv
```

In [1]:
import joblib
import pandas as pd

In [2]:
models = {
    "XGBoost": joblib.load("../Export/xgb_model.joblib"),
    "LightGBM": joblib.load("../Export/lgb_model.joblib"),
    "RandomForest": joblib.load("../Export/rf_model.joblib")
}

configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing.

  setstate(state)
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [3]:
# From a notebook or main script:
from validation_metrics import run_validation
from age_features_engineering import engineer_age_features
from age_fairness_analysis import plot_age_distribution, compute_fairness_by_age
from generate_validation_report import generate_report

df = pd.read_csv("../Export/test_data.csv")
y = df["Actual"]
X = df.drop("Actual", axis=1)

# Run validation
if 'X_eval' in globals():
    run_validation(X_eval, y_eval, models)
else:
    print("X_eval not found - running validation on X")
    run_validation(X, y, models)

# Engineer age features
df = engineer_age_features(df)

# Plot & analyze fairness by age
plot_age_distribution(df)
compute_fairness_by_age(X, y, models)

print("Generating consolidated report...")
generate_report(output_dir='Model_Results', report_name='Validation_Report')

print("✅ All validation complete! Check Model_Results/ folder for outputs.")

X_eval not found - running validation on X
Evaluating -> XGBoost
Evaluating -> LightGBM
Evaluating -> RandomForest

All evaluation artifacts saved under `Model_Results/`

| model | n_samples | accuracy | precision | recall | f1 | roc_auc | brier | logloss |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 45000 | 0.798267 | 0.216937 | 0.773271 | 0.338820 | 0.865556 | 0.139322 | 0.439745 |
| LightGBM | 45000 | 0.799311 | 0.217998 | 0.773936 | 0.340177 | 0.865396 | 0.138296 | 0.436608 |
| RandomForest | 45000 | 0.936778 | 0.618287 | 0.141622 | 0.230457 | 0.863902 | 0.048969 | 0.178026 |

Age features created. Non-null age count: 45000
Saved to Model_Results/age_features.csv
Age distribution plots saved to Model_Results/
Computing fairness by age -> XGBoost
Computing fairness by age -> LightGBM
Computing fairness by age -> RandomForest

Fairness 