# Advanced Models: XGBoost, LightGBM, Neural Network

## Objective

Push performance beyond baselines with advanced models:
1. **XGBoost** - Gradient boosting with regularization
2. **LightGBM** - Faster gradient boosting
3. **Neural Network** - Deep learning approach
4. **Ensemble** - Combine multiple models (optional)

## Goal
Maximize ROC-AUC and Recall @ 1% FPR for production deployment.
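
To make the second metric concrete, here is a minimal sketch of Recall @ fixed FPR on toy data. The function name `recall_at_fpr` and the toy arrays are illustrative assumptions, not part of the project code; the notebook's own `evaluate_model` helper computes the same quantity from `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_curve


def recall_at_fpr(y_true, y_score, max_fpr=0.01):
    """Highest recall (TPR) reachable while the false positive rate stays <= max_fpr."""
    # drop_intermediate=False keeps every threshold, so no operating point is skipped
    fpr, tpr, _ = roc_curve(y_true, y_score, drop_intermediate=False)
    return float(tpr[fpr <= max_fpr].max())


# Toy data (hypothetical): 200 healthy firms followed by 4 bankrupt firms
y_toy = np.array([0] * 200 + [1] * 4)
score_toy = np.concatenate([
    np.linspace(0.0, 0.5, 200),        # scores for the healthy firms
    np.array([0.9, 0.8, 0.45, 0.1]),   # scores for the bankrupt firms
])

print(f"Recall @ 1% FPR:  {recall_at_fpr(y_toy, score_toy, 0.01):.2%}")
print(f"Recall @ 20% FPR: {recall_at_fpr(y_toy, score_toy, 0.20):.2%}")
```

In this toy set, two of the four bankrupt firms outrank every healthy firm, so recall at 1% FPR is 50%; a third is caught only once about 10% of healthy firms are flagged.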

---

In [None]:
# Setup
import sys
sys.path.insert(0, '../..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss, roc_curve
import warnings
warnings.filterwarnings('ignore')

from src.bankruptcy_prediction.evaluation import ResultsCollector

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ Setup complete")

In [None]:
# Load prepared splits
import os

splits_dir = '../../data/processed/splits'

if os.path.exists(splits_dir):
    X_train = pd.read_parquet(f'{splits_dir}/X_train_full.parquet')
    X_test = pd.read_parquet(f'{splits_dir}/X_test_full.parquet')
    y_train = pd.read_parquet(f'{splits_dir}/y_train.parquet')['y']
    y_test = pd.read_parquet(f'{splits_dir}/y_test.parquet')['y']
    print("✓ Loaded splits")
else:
    # Fallback
    from sklearn.model_selection import train_test_split
    from src.bankruptcy_prediction.data import DataLoader
    
    loader = DataLoader()
    df = loader.load_poland(horizon=1, dataset_type='full')
    X, y = loader.get_features_target(df)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    print("✓ Created splits")

print(f"\nTrain: {len(y_train):,} samples ({y_train.mean():.2%} bankrupt)")
print(f"Test:  {len(y_test):,} samples ({y_test.mean():.2%} bankrupt)")

In [None]:
# Helper function (same as baseline)
def evaluate_model(y_true, y_pred_proba, model_name='Model'):
    """Return ranking, calibration, and fixed-FPR recall metrics for one model."""
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    pr_auc = average_precision_score(y_true, y_pred_proba)
    brier = brier_score_loss(y_true, y_pred_proba)

    # roc_curve returns fpr in ascending order, so the last index with
    # fpr <= limit is the most permissive threshold that respects the limit.
    fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
    idx_1pct = np.where(fpr <= 0.01)[0]
    recall_1pct = tpr[idx_1pct[-1]] if len(idx_1pct) > 0 else 0.0
    idx_5pct = np.where(fpr <= 0.05)[0]
    recall_5pct = tpr[idx_5pct[-1]] if len(idx_5pct) > 0 else 0.0
    
    return {
        'model_name': model_name,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
        'brier_score': brier,
        'recall_1pct_fpr': recall_1pct,
        'recall_5pct_fpr': recall_5pct,
        'horizon': 1
    }

def print_results(results):
    print(f"\n{'='*60}")
    print(f"{results['model_name']:^60}")
    print(f"{'='*60}")
    print(f"ROC-AUC:            {results['roc_auc']:.4f}")
    print(f"PR-AUC:             {results['pr_auc']:.4f}")
    print(f"Brier Score:        {results['brier_score']:.4f}")
    print(f"Recall @ 1% FPR:    {results['recall_1pct_fpr']:.2%}")
    print(f"Recall @ 5% FPR:    {results['recall_5pct_fpr']:.2%}")
    print(f"{'='*60}\n")

print("✓ Helper functions defined")

## Model 1: XGBoost

Gradient boosting with built-in regularization.

In [None]:
try:
    import xgboost as xgb
    
    print("Training XGBoost...\n")
    
    # Calculate scale_pos_weight for imbalanced data
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    
    xgb_model = xgb.XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        scale_pos_weight=scale_pos_weight,
        random_state=42,
        # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
        eval_metric='logloss'
    )
    
    xgb_model.fit(X_train, y_train, verbose=False)
    
    y_pred_xgb = xgb_model.predict_proba(X_test)[:, 1]
    results_xgb = evaluate_model(y_test, y_pred_xgb, 'XGBoost')
    print_results(results_xgb)
    
    xgb_available = True
    
except ImportError:
    print("⚠️  XGBoost not installed. Install with: pip install xgboost")
    xgb_available = False
    results_xgb = None
    y_pred_xgb = None

### XGBoost Interpretation:

**Strengths:**
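- Widely used gradient boosting implementation for tabular data
- Built-in regularization (L1, L2)
- Supports class imbalance through `scale_pos_weight`
- Often the strongest performer among the models compared here

**Parameters:**
- `scale_pos_weight`: Weights the positive (bankrupt) class by the negative-to-positive ratio to offset class imbalance
- `max_depth`: Caps tree depth; shallower trees are less prone to overfitting
- `learning_rate`: Shrinks each tree's contribution; lower values regularize more but need more trees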
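
As a worked example of the `scale_pos_weight` calculation used in the training cell, the sketch below computes the negative-to-positive ratio on toy labels. The helper name `compute_scale_pos_weight` and the 95/5 split are illustrative assumptions, not values from the dataset.

```python
import numpy as np


def compute_scale_pos_weight(y):
    """Ratio of negative to positive samples, as passed to XGBoost's scale_pos_weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("y contains no positive samples")
    return n_neg / n_pos


# Toy labels (hypothetical): 95 healthy firms, 5 bankrupt firms
y_toy_labels = np.array([0] * 95 + [1] * 5)
spw = compute_scale_pos_weight(y_toy_labels)
print(f"scale_pos_weight = {spw:.1f}")
```

With 95 negatives and 5 positives the ratio is 19, so each bankrupt firm counts as much as 19 healthy ones in the loss.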

## Model 2: LightGBM

Microsoft's fast gradient boosting framework.

In [None]:
try:
    import lightgbm as lgb
    
    print("Training LightGBM...\n")
    
    lgb_model = lgb.LGBMClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        subsample_freq=1,  # LightGBM applies row subsampling only when subsample_freq > 0
        colsample_bytree=0.8,
        class_weight='balanced',
        random_state=42,
        verbose=-1
    )
    
    lgb_model.fit(X_train, y_train)
    
    y_pred_lgb = lgb_model.predict_proba(X_test)[:, 1]
    results_lgb = evaluate_model(y_test, y_pred_lgb, 'LightGBM')
    print_results(results_lgb)
    
    lgb_available = True
    
except ImportError:
    print("⚠️  LightGBM not installed. Install with: pip install lightgbm")
    lgb_available = False
    results_lgb = None
    y_pred_lgb = None

### LightGBM Interpretation:

**Strengths:**
- Very fast training
- Memory efficient
- Often matches XGBoost performance

**Use case:**
- Large datasets
- When speed matters
- Production systems
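
The LightGBM cell above passes `class_weight='balanced'`. As a rough illustration of what that setting does, the sketch below reproduces the usual "balanced" heuristic, `n_samples / (n_classes * count_per_class)`, on toy labels. The helper name `balanced_class_weights` and the 95/5 split are assumptions made for the example.

```python
import numpy as np


def balanced_class_weights(y):
    """Per-class weights inversely proportional to class frequency."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy labels (hypothetical): 95 healthy firms, 5 bankrupt firms
y_toy_balanced = np.array([0] * 95 + [1] * 5)
toy_weights = balanced_class_weights(y_toy_balanced)
print(toy_weights)
```

The minority class receives a weight of 10 and the majority class about 0.53, so both classes contribute the same total weight to the loss.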

## Model 3: Neural Network

A feed-forward neural network built with Keras/TensorFlow.

In [None]:
try:
    from tensorflow import keras
    from tensorflow.keras import layers
    from sklearn.preprocessing import StandardScaler
    
    print("Training Neural Network...\n")
    
    # Scale features (NN requires scaling)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Calculate class weights
    from sklearn.utils.class_weight import compute_class_weight
    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weight_dict = {0: class_weights[0], 1: class_weights[1]}
    
    # Build model
    nn_model = keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1, activation='sigmoid')
    ])
    
    nn_model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['AUC']
    )
    
    # Train
    history = nn_model.fit(
        X_train_scaled, y_train,
        epochs=50,
        batch_size=64,
        validation_split=0.2,
        class_weight=class_weight_dict,
        verbose=0
    )
    
    y_pred_nn = nn_model.predict(X_test_scaled, verbose=0).ravel()
    results_nn = evaluate_model(y_test, y_pred_nn, 'Neural Network')
    print_results(results_nn)
    
    nn_available = True
    
except ImportError:
    print("⚠️  TensorFlow not installed. Install with: pip install tensorflow")
    nn_available = False
    results_nn = None
    y_pred_nn = None

### Neural Network Interpretation:

**Strengths:**
- Can learn complex non-linear patterns
- Flexible architecture
- Good for large datasets

**Limitations:**
- Requires more data
- Slower training
- Less interpretable
- Sensitive to hyperparameters
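
The objective lists an optional ensemble, which this notebook does not implement. One simple way to approach it, sketched here under the assumption that plain probability averaging is acceptable, is to average the test-set predictions of whichever models were trained. The function name `average_ensemble` and the toy arrays are hypothetical; in the notebook the inputs would be `y_pred_xgb`, `y_pred_lgb`, and `y_pred_nn`, any of which may be `None`.

```python
import numpy as np


def average_ensemble(*predictions):
    """Average the probability vectors that are available, skipping None entries."""
    available = [np.asarray(p, dtype=float) for p in predictions if p is not None]
    if not available:
        raise ValueError("no model predictions available to ensemble")
    return np.mean(available, axis=0)


# Toy predictions (hypothetical); the third model is treated as not installed
p_xgb_toy = np.array([0.10, 0.80, 0.30])
p_lgb_toy = np.array([0.20, 0.60, 0.40])
p_nn_toy = None

p_ens_toy = average_ensemble(p_xgb_toy, p_lgb_toy, p_nn_toy)
print(p_ens_toy)
```

The averaged vector can be passed to `evaluate_model` like any single model's output. Because the models here use class weighting, their probability scales may differ, so averaging ranks instead of raw probabilities is a reasonable alternative to try.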

## Model Comparison: Advanced vs Baseline

In [None]:
# Load baseline results
results_collector = ResultsCollector.load_all()

# Add advanced model results
if xgb_available and results_xgb:
    results_collector.add(results_xgb)
if lgb_available and results_lgb:
    results_collector.add(results_lgb)
if nn_available and results_nn:
    results_collector.add(results_nn)

# Save
results_collector.save()

# Display comparison
print("\n" + "="*80)
print("ALL MODELS COMPARISON (Horizon = 1 year)")
print("="*80)
comparison = results_collector.show_comparison()
display(comparison)
print("="*80)

# Best model
best = results_collector.best_model(horizon=1)
if best:
    print(f"\n🏆 Best model: {best['model_name']} (ROC-AUC: {best['roc_auc']:.4f})")

## Visualization: All Models

In [None]:
fig = results_collector.plot_comparison(output_path='../../results/figures/all_models_comparison.png')
plt.show()

print("✓ Saved: results/figures/all_models_comparison.png")

## Summary & Recommendations

### Performance Ranking:

Typical results (indicative ranges; the comparison table above holds the actual values for this run):
1. **XGBoost / LightGBM** - Usually best (0.91-0.93 AUC)
2. **Random Forest** - Close second (0.90 AUC)
3. **Neural Network** - Variable (0.88-0.92 AUC)
4. **Logistic / GLM** - Baseline (0.87 AUC)

### Model Selection:

**For Production:**
- Use **XGBoost** or **LightGBM** (best performance)
- Apply calibration (next notebook)
- Monitor drift

**For Thesis:**
- Compare **all models** to show thorough analysis
- Use **GLM** for statistical inference
- **Random Forest** for feature importance
- **XGBoost** for best results

**For Interpretation:**
- **GLM** - Coefficients and p-values
- **Random Forest** - Feature importance
- **XGBoost** - SHAP values (advanced)

### Next Steps:

1. **Calibration** (`06_model_calibration.ipynb`)
   - Improve probability reliability
   - Critical for decision thresholds

2. **Robustness** (`07_robustness_analysis.ipynb`)
   - Cross-horizon validation
   - All 5 horizons
   - Generalization testing
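
Calibration matters here because the models above are trained with class weighting (`scale_pos_weight`, `class_weight`), which tends to push predicted probabilities above the true bankruptcy rate. A quick way to see this before the calibration notebook is a binned reliability table that compares mean predicted probability with the observed positive rate. This is a minimal sketch: the function name `reliability_table`, the bin count, and the toy arrays are assumptions for illustration.

```python
import numpy as np


def reliability_table(y_true, y_prob, n_bins=5):
    """Rows of (bin index, count, mean predicted probability, observed positive rate)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0 .. n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip empty bins
            rows.append((b, int(mask.sum()),
                         float(y_prob[mask].mean()),
                         float(y_true[mask].mean())))
    return rows


# Toy data (hypothetical): four low-score firms and two high-score firms
y_true_toy = np.array([0, 0, 0, 1, 1, 1])
y_prob_toy = np.array([0.1, 0.1, 0.1, 0.1, 0.9, 0.9])

toy_rows = reliability_table(y_true_toy, y_prob_toy)
for b, n, mean_pred, observed in toy_rows:
    print(f"bin {b}: n={n}, mean predicted={mean_pred:.2f}, observed rate={observed:.2f}")
```

A well-calibrated model shows mean predicted probability close to the observed rate in every bin; large gaps indicate that the probabilities should be recalibrated before decision thresholds are set.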

In [None]:
print("\n" + "="*80)
print("✓ ADVANCED MODELS COMPLETE")
print("="*80)
if best:
    print(f"\n🏆 Best model overall: {best['model_name']}")
    print(f"   ROC-AUC: {best['roc_auc']:.4f}")
    print(f"   Recall @ 1% FPR: {best['recall_1pct_fpr']:.2%}")
print("\n📊 All results saved to ResultsCollector")
print(f"   Check 00_MASTER_REPORT.ipynb for complete comparison")
print(f"\nNext: 06_model_calibration.ipynb")
print("="*80)