# 06 — Advanced Models: XGBoost-ES + LSTM

Trains two advanced model families and compares them against Phase-5 baselines:

| Model | Type | Notes |
|-------|------|-------|
| XGBoost-ES | Tree ensemble | Early stopping (patience=30, up to 500 rounds) |
| LSTM | Sequence (PyTorch) | 2-layer, hidden=64, seq_len=30, global sliding window |

Both models are trained on two tasks: regression (`latency_us`) and classification (`latency_violation`, i.e. latency > 120 µs).
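
To make the `seq_len=30` sliding-window setup concrete, here is a minimal sketch of how a feature matrix could be cut into overlapping sequences. The function name `make_windows` and the choice to predict the target at the last step of each window are assumptions for illustration; the actual logic lives in `src.models.advanced` and may differ.

```python
import numpy as np

def make_windows(features, targets, seq_len=30):
    """Cut a (n_rows, n_features) array into overlapping windows.

    Hypothetical helper: each window of `seq_len` consecutive rows is paired
    with the target at the window's last row (an assumption, not confirmed).
    """
    features = np.asarray(features, dtype=np.float32)
    targets = np.asarray(targets, dtype=np.float32)
    if len(features) < seq_len:
        raise ValueError("need at least seq_len rows")
    # sliding_window_view appends the window axis last: (n_windows, n_features, seq_len)
    windows = np.lib.stride_tricks.sliding_window_view(features, seq_len, axis=0)
    # Reorder to the (batch, time, feature) layout used by batch_first LSTMs
    X = np.ascontiguousarray(windows.transpose(0, 2, 1))
    y = targets[seq_len - 1:]
    return X, y

X, y = make_windows(np.arange(100).reshape(50, 2), np.arange(50))
print(X.shape, y.shape)
```

With 50 rows and `seq_len=30` this yields 21 windows, so the first 29 rows never receive a prediction of their own.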

In [None]:
import json, warnings, os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

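# '__file__' is quoted on purpose: notebooks do not define __file__, so the path
# resolves against the current directory and this moves up to its parent
# (assumed to be the project root, where src/ lives).
# Run this cell once per kernel session; each re-run moves up another level.
os.chdir(os.path.join(os.path.dirname(os.path.abspath('__file__')), '..'))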
sys.path.insert(0, os.getcwd())
warnings.filterwarnings('ignore', message='X does not have valid feature names')
sns.set_style('whitegrid')
%matplotlib inline

## 1. Train advanced models

In [None]:
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

from src.models.advanced import main as train_advanced
adv_metrics = train_advanced()

## 2. LSTM training curves

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Regression LSTM loss
h = adv_metrics['lstm_reg']['history']
ax1.plot(h['train_loss'], label='train')
ax1.plot(h['val_loss'], label='val')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('MSE Loss')
ax1.set_title(f"LSTM Regression — best epoch {adv_metrics['lstm_reg']['best_epoch']}")
ax1.legend()

# Classification LSTM loss
h = adv_metrics['lstm_clf']['history']
ax2.plot(h['train_loss'], label='train')
ax2.plot(h['val_loss'], label='val')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('BCE Loss')
ax2.set_title(f"LSTM Classification — best epoch {adv_metrics['lstm_clf']['best_epoch']}")
ax2.legend()

fig.suptitle('LSTM Training Curves', fontsize=13)
fig.tight_layout()
fig.savefig('figures/lstm_training_curves.png', dpi=150)
plt.show()

## 3. XGBoost early-stopping curves
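
The curves in this section are easier to read with the stopping rule in mind. The following is a simplified, library-free sketch of patience-based early stopping on a lower-is-better validation metric; `early_stop_round` is a hypothetical helper and is not XGBoost's internal implementation.

```python
def early_stop_round(val_scores, patience=30):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training halts at the first round that is `patience` rounds past the
    best score seen so far; otherwise it runs to the end of `val_scores`.
    """
    best_score = float("inf")
    best_round = -1
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_scores) - 1

# Illustrative curve: improves once, then stays flat for the rest of 500 rounds
scores = [10.0, 9.0] + [9.5] * 498
print(early_stop_round(scores, patience=30))
```

In this made-up curve the best score occurs at round 1 and training halts at round 31, which is one pattern that would produce a stop near 31 rounds with `patience=30`.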

In [None]:
import joblib
xgb_bundle = joblib.load('models/advanced_xgb.joblib')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

evals_reg = xgb_bundle['evals_reg']
for key in evals_reg:
    metric_name = list(evals_reg[key].keys())[0]
    ax1.plot(evals_reg[key][metric_name], label=key)
ax1.set_xlabel('Boosting Round')
ax1.set_ylabel(metric_name.upper())
ax1.set_title(f'XGBoost Regression — best iter {adv_metrics["xgb_reg"]["best_iteration"]}')
ax1.legend()

evals_clf = xgb_bundle['evals_clf']
for key in evals_clf:
    metric_name = list(evals_clf[key].keys())[0]
    ax2.plot(evals_clf[key][metric_name], label=key)
ax2.set_xlabel('Boosting Round')
ax2.set_ylabel(metric_name)
ax2.set_title(f'XGBoost Classification — best iter {adv_metrics["xgb_clf"]["best_iteration"]}')
ax2.legend()

fig.suptitle('XGBoost Early-Stopping Curves', fontsize=13)
fig.tight_layout()
fig.savefig('figures/xgb_es_curves.png', dpi=150)
plt.show()

## 4. Run comparison pipeline

In [None]:
from src.models.compare import main as run_compare
comp = run_compare()

## 5. Comparison bar chart

In [None]:
from IPython.display import Image, display
display(Image(filename='figures/model_comparison.png'))

## 6. Test-set metrics tables

In [None]:
# Regression
reg = comp['regression']
reg_df = pd.DataFrame(reg).T
print('REGRESSION')
display(reg_df[['mae','rmse','r2']].round(4))

# Classification
clf = comp['classification']
clf_df = pd.DataFrame(clf).T
print('\nCLASSIFICATION')
display(clf_df[['accuracy','f1','roc_auc','avg_precision']].round(4))

## 7. Bootstrap test results
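
For readers unfamiliar with the paired bootstrap, the sketch below shows one common way to compare the MAE of two models on the same test rows. The function `paired_bootstrap_mae`, its defaults, and the percentile-based p-value convention are assumptions for illustration; only the returned key names mirror those read from `comp['bootstrap_xgb_reg']`, and the implementation in `src.models.compare` may differ.

```python
import numpy as np

def paired_bootstrap_mae(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Paired bootstrap on the MAE difference (model A minus model B)."""
    y_true = np.asarray(y_true, dtype=float)
    err_a = np.abs(y_true - np.asarray(pred_a, dtype=float))
    err_b = np.abs(y_true - np.asarray(pred_b, dtype=float))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Reuse the same resampled rows for both models so the test stays paired
        idx = rng.integers(0, n, size=n)
        diffs[i] = err_a[idx].mean() - err_b[idx].mean()
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    # Two-sided p-value: share of resamples on the less frequent side of zero, doubled
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return {
        "metric_a": float(err_a.mean()),
        "metric_b": float(err_b.mean()),
        "diff_a_minus_b": float(err_a.mean() - err_b.mean()),
        "p_value": float(p_value),
        "ci_95_lower": float(lower),
        "ci_95_upper": float(upper),
    }
```

A 95% interval that excludes zero and a small p-value indicate that the MAE gap is unlikely to be a resampling artifact; they say nothing about whether the gap is large enough to matter.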

In [None]:
boot = comp['bootstrap_xgb_reg']
print('Bootstrap paired test: Baseline XGB vs Advanced XGB (MAE)')
print(f"  Baseline MAE:  {boot['metric_a']:.4f}")
print(f"  Advanced MAE:  {boot['metric_b']:.4f}")
print(f"  Diff (bl-adv): {boot['diff_a_minus_b']:.4f}")
print(f"  p-value:       {boot['p_value']:.4f}")
print(f"  95% CI:        [{boot['ci_95_lower']:.4f}, {boot['ci_95_upper']:.4f}]")

## Observations

1. **XGBoost-ES regression** early-stopped at ~31 rounds. Additional rounds
   gave no validation benefit on this synthetic data, which is consistent with
   (though not proof of) a near-zero signal.

2. **LSTM regression** converges to MSE ≈ 226 µs² (RMSE ≈ 15 µs). This matches
   the variance expected of a uniformly distributed target, so the LSTM does
   no better than a constant mean predictor.

3. **All models cluster around the same error floor** (MAE ≈ 12 µs).
   On real data with genuine temporal patterns, the LSTM's sequential
   context and the early-stopped XGBoost would be expected to pull ahead
   of the baselines; this synthetic data cannot show that.

4. **Bootstrap p-value**: the difference between baseline and advanced
   XGBoost MAE is statistically detectable but too small to matter in
   practice on this synthetic data.
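
The uniform-distribution comparison in observation 2 can be checked with a few lines of arithmetic. The range width below is inferred from the quoted MSE, not read from the data generator, so treat it as an assumption.

```python
import math

mse = 226.0                          # LSTM MSE quoted in observation 2, in µs²
rmse = math.sqrt(mse)                # about 15.03 µs

# Var(U[a, b]) = (b - a)^2 / 12, so this MSE implies the following range width
implied_width = math.sqrt(12 * mse)  # about 52.08 µs

# Mean absolute deviation of U[a, b] about its mean is (b - a) / 4
implied_mae = implied_width / 4      # about 13.02 µs

print(f"RMSE {rmse:.2f} µs, implied width {implied_width:.2f} µs, implied MAE {implied_mae:.2f} µs")
```

Under that assumption a constant mean predictor would score an MAE of roughly 13 µs, in the same range as the MAE ≈ 12 µs floor noted in observation 3.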