# 05 — Baseline Models

Trains and evaluates baseline models for two tasks:

| Task | Target | Models |
|------|--------|--------|
| Regression | `latency_us` | Mean predictor, Ridge, XGBoost |
| Classification | `latency_violation` (>120 µs) | Logistic Regression, LightGBM |

**Split:** time-ordered 70 / 15 / 15 (train / val / test) on `timestamp_ns`, so the validation and test sets follow the training set in time.
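
The split and the violation target can be sketched on toy data as follows. This is a minimal illustration, not the project's `time_split` or `derive_targets`: the helper `time_split_sketch`, its percentage parameters, and the synthetic frame are assumptions made for the example. Only the 70 / 15 / 15 ordering on `timestamp_ns` and the 120 µs threshold come from this notebook.

```python
import numpy as np
import pandas as pd


def time_split_sketch(df, time_col="timestamp_ns", train_pct=70, val_pct=15):
    """Hypothetical stand-in for `time_split`: sort by time, then cut at 70% and 85%."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises at the cut points.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered.iloc[:train_end], ordered.iloc[train_end:val_end], ordered.iloc[val_end:]


# Toy frame with shuffled timestamps and synthetic latencies (illustrative data only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "timestamp_ns": rng.permutation(1_000),
    "latency_us": rng.uniform(50.0, 200.0, size=1_000),
})
# Binary target: 1 where latency exceeds the 120 µs threshold.
toy["latency_violation"] = (toy["latency_us"] > 120).astype(int)

train, val, test = time_split_sketch(toy)
print(len(train), len(val), len(test))
print(train["timestamp_ns"].max(), "<", val["timestamp_ns"].min())
```

Because the frame is sorted before slicing, no training row is later than any validation or test row. That ordering is what keeps future information out of the fitted transformer and models.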

In [None]:
import json, warnings
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

warnings.filterwarnings("ignore", message="X does not have valid feature names")
sns.set_style("whitegrid")
%matplotlib inline

## 1. Run training (or load saved results)

In [None]:
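import sys, os

# Notebooks do not define __file__, so paths are resolved from the working directory.
# Assumes the notebook starts one level below the repo root (where src/ lives);
# the guard keeps a re-run of this cell from climbing a second level.
if not os.path.isdir("src"):
    os.chdir("..")
sys.path.insert(0, os.getcwd())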

from src.models.baseline import (
    load_train_ready, derive_targets, time_split,
    build_feature_transformer, _get_feature_names,
    train_regression_models, train_classification_models,
    evaluate_on_test, per_device_metrics_top10,
    REG_TARGET, CLF_TARGET, VIOLATION_THRESHOLD_US,
)

# Load & split
df = load_train_ready("data/train_ready.parquet")
df = derive_targets(df)
train_df, val_df, test_df = time_split(df)

# Build features (fit on train)
ct = build_feature_transformer()
X_train = ct.fit_transform(train_df)
X_val   = ct.transform(val_df)
X_test  = ct.transform(test_df)
feature_names = _get_feature_names(ct)

# Targets
y_train_reg, y_val_reg, y_test_reg = train_df[REG_TARGET].values, val_df[REG_TARGET].values, test_df[REG_TARGET].values
y_train_clf, y_val_clf, y_test_clf = train_df[CLF_TARGET].values, val_df[CLF_TARGET].values, test_df[CLF_TARGET].values

print(f"X_train: {X_train.shape}   Features: {len(feature_names)}")
print(f"Violation rate — train:{y_train_clf.mean():.3f}  val:{y_val_clf.mean():.3f}  test:{y_test_clf.mean():.3f}")

In [None]:
# Train models
reg_results = train_regression_models(X_train, y_train_reg, X_val, y_val_reg)
clf_results = train_classification_models(X_train, y_train_clf, X_val, y_val_clf)
test_metrics = evaluate_on_test(reg_results, clf_results, X_test, y_test_reg, y_test_clf)

## 2. Metrics summary tables

In [None]:
# Regression summary
reg_rows = []
for name, res in reg_results.items():
    row = {"model": name}
    for split, m in [("val", res["val_metrics"]), ("test", test_metrics[name])]:
        for k, v in m.items():
            row[f"{split}_{k}"] = round(v, 4)
    reg_rows.append(row)

reg_df = pd.DataFrame(reg_rows).set_index("model")
print("\nREGRESSION — latency_us")
reg_df

In [None]:
# Classification summary
clf_rows = []
for name, res in clf_results.items():
    row = {"model": name}
    for split, m in [("val", res["val_metrics"]), ("test", test_metrics[name])]:
        for k, v in m.items():
            row[f"{split}_{k}"] = round(v, 4)
    clf_rows.append(row)

clf_df = pd.DataFrame(clf_rows).set_index("model")
print("\nCLASSIFICATION — latency_violation (>120 µs)")
clf_df

## 3. Predicted vs Actual — Regression

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)

for ax, (name, res) in zip(axes, reg_results.items()):
    y_pred = res["model"].predict(X_test)
    ax.scatter(y_test_reg, y_pred, alpha=0.15, s=8, edgecolors="none")
    lo, hi = y_test_reg.min(), y_test_reg.max()
    ax.plot([lo, hi], [lo, hi], "r--", lw=1, label="ideal")
    ax.set_title(name)
    ax.set_xlabel("Actual latency_us")
    ax.set_ylabel("Predicted latency_us")
    m = test_metrics[name]
    ax.text(0.05, 0.92, f"R²={m['r2']:.4f}\nMAE={m['mae']:.2f}",
            transform=ax.transAxes, fontsize=8, va="top",
            bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5))

fig.suptitle("Predicted vs Actual — Regression baselines (test set)", fontsize=12)
fig.tight_layout()
os.makedirs("figures", exist_ok=True)  # savefig does not create missing directories
fig.savefig("figures/baseline_pred_vs_actual.png", dpi=150)
plt.show()

## 4. Precision-Recall & ROC curves — Classification

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

for name, res in clf_results.items():
    prob = res["model"].predict_proba(X_test)[:, 1]
    PrecisionRecallDisplay.from_predictions(
        y_test_clf, prob, name=name, ax=ax1
    )
    RocCurveDisplay.from_predictions(
        y_test_clf, prob, name=name, ax=ax2
    )

ax1.set_title("Precision-Recall Curve")
ax1.legend(loc="upper right")
ax2.plot([0, 1], [0, 1], "k--", lw=0.8, label="random")
ax2.set_title("ROC Curve")
ax2.legend(loc="lower right")

fig.suptitle("Classification baselines — latency_violation (test set)", fontsize=12)
fig.tight_layout()
fig.savefig("figures/baseline_pr_roc.png", dpi=150)
plt.show()

## 5. Feature importance — XGBoost regression

In [None]:
xgb_model = reg_results["xgboost_reg"]["model"]
importances = xgb_model.feature_importances_
top_k = 20
idx = np.argsort(importances)[-top_k:]

fig, ax = plt.subplots(figsize=(7, 6))
ax.barh(np.array(feature_names)[idx], importances[idx])
ax.set_xlabel("Importance (gain)")
ax.set_title(f"Top-{top_k} features — XGBoost regression")
fig.tight_layout()
fig.savefig("figures/baseline_xgb_importance.png", dpi=150)
plt.show()

## 6. Residual distribution — XGBoost regression

In [None]:
resid = y_test_reg - xgb_model.predict(X_test)

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(resid, bins=60, edgecolor="white", alpha=0.7)
ax.axvline(0, color="red", ls="--")
ax.set_xlabel("Residual (actual − predicted)")
ax.set_ylabel("Count")
ax.set_title(f"XGBoost residuals (test)\nmean={resid.mean():.2f}, std={resid.std():.2f}")
fig.tight_layout()
fig.savefig("figures/baseline_xgb_residuals.png", dpi=150)
plt.show()

## 7. Load saved metrics JSON

In [None]:
with open("reports/baseline_metrics.json") as f:
    saved = json.load(f)

print("Keys:", list(saved.keys()))
print(f"\nSplit: {saved['split']}")
print(f"Features: {saved['n_features']}")
print(f"Violation threshold: {saved['violation_threshold_us']} µs")
print(f"Violation rate: {saved['violation_rate']}")

## Observations

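1. **All regression R² values are approximately 0 and slightly negative.** The
   features have effectively zero predictive power for `latency_us`. This is
   consistent with the EDA finding that the synthetic data has near-uniform
   distributions and zero correlations. A slightly negative value is expected
   on held-out data: under the usual definition, R² is measured against the
   test-set mean, so even the training-mean predictor scores at or below 0.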

2. **Classification AUC ≈ 0.50.** Both classifiers perform at the random-guess
   level, which is again expected given the data-generating process.

3. **Ridge ≈ Mean Predictor.** Ridge adds no lift over the mean predictor,
   which is consistent with there being no linear relationship between the
   features and `latency_us`.

4. **XGBoost overfits slightly** (worse than the mean predictor on test). With
   no real signal to learn, the tree ensemble fits noise in the training set.

5. These baselines serve as a **reference floor**: any genuine improvement
   from more sophisticated models or better-engineered features should
   beat these numbers by a margin clearly larger than sampling noise on the
   test set.
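
Observations 1 and 2 can be checked on signal-free synthetic data. The sketch below is illustrative only: the data is random noise rather than the project's dataset, and the helpers `r2` and `rank_auc` are hypothetical names written for this example. It shows that a training-mean predictor has test R² equal to $-n(\bar{y}_\text{train} - \bar{y}_\text{test})^2 / SS_\text{tot}$, which is never positive, and that scores unrelated to the label give an AUC near 0.5.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination, measured against the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rank_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    ranks = np.argsort(np.argsort(scores)) + 1  # 1-based rank of each score
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


rng = np.random.default_rng(42)
# Synthetic, signal-free targets (illustrative only, not the project's data).
y_train = rng.uniform(50.0, 200.0, size=7_000)
y_test = rng.uniform(50.0, 200.0, size=1_500)

# Observation 1: a constant prediction equal to the training mean.
const_pred = np.full_like(y_test, y_train.mean())
r2_mean = r2(y_test, const_pred)
# Closed form for a constant prediction c: R² = -n (c - mean(y_test))² / SS_tot.
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_closed = -len(y_test) * (y_train.mean() - y_test.mean()) ** 2 / ss_tot

# Observation 2: scores unrelated to the label give AUC close to 0.5.
y_viol = (y_test > 120).astype(int)
noise_scores = rng.uniform(size=len(y_test))
auc_noise = rank_auc(y_viol, noise_scores)

print(f"R² of train-mean predictor on test: {r2_mean:.5f}")
print(f"closed-form value:                  {r2_closed:.5f}")
print(f"AUC of random scores:               {auc_noise:.3f}")
```

The size of these deviations from 0 and 0.5 gives a rough sense of the noise band that a later model would need to clear before its improvement counts as real.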