# Breakout Artist Detection — Model Training & Evaluation (v1)

Goal: predict, as of month **t**, which **eligible** artists will “break out” in the next **H=2 months**.

- Unit of analysis: `(artist, month=t)` rows from the pre-built modeling table.
- Primary objective: **binary classification** (breakout vs not).
- Secondary (diagnostic only): **ranking sanity checks** (are true breakouts near the top?).

Best-practice guardrails:
- **Time-based split** (60/20/20 by month), no shuffling.
- Hyperparameters and threshold are chosen using **validation only**.
- The test split is evaluated **once** at the end.
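
As a minimal sketch of the time-based split guardrail, the snippet below orders a set of months and cuts them 60/20/20 without shuffling. The helper name `split_months` and the 50 synthetic months are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def split_months(months, train_frac=0.60, val_frac=0.20):
    """Return (train, val, test) month indexes in chronological order."""
    ordered = pd.Index(months).unique().sort_values()
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # remainder absorbs any rounding
    return train, val, test


# Hypothetical calendar: 50 consecutive month starts.
months = pd.date_range("2020-01-01", periods=50, freq="MS")
train_m, val_m, test_m = split_months(months)

# Every training month precedes every validation month, and so on.
assert train_m.max() < val_m.min()
assert val_m.max() < test_m.min()
print(len(train_m), len(val_m), len(test_m))
```

Splitting on distinct months rather than on rows keeps all artists from the same month in the same split, which is what prevents look-ahead leakage.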


In [1]:
from pathlib import Path
import sys

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score, precision_recall_curve
from sklearn.impute import SimpleImputer


import xgboost as xgb


# Resolve the repo root (assumed to be two levels above the notebook's working
# directory) and put it on sys.path so the `common` package is importable.
repo_root = Path.cwd().parent.parent
# os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from common.config_manager import ConfigManager
from common.io import read_csv


In [2]:
cfg = ConfigManager(repo_root)
project_cfg = cfg.project()
breakout_cfg = cfg.breakout()

paths = project_cfg["paths"]
features_dir = repo_root / paths["features"]
modeling_path = features_dir / project_cfg["breakout"]["modeling_filename"]

df = pd.read_csv(modeling_path)
df["month"] = pd.to_datetime(df["month"], errors="raise")

ID_COLS = ["artist_name", "month"]
DATE_COL = "month"
TARGET_COL = "y"
SPLIT_COL = "split"
CAT_COLS = ["genre_bucket"] if "genre_bucket" in df.columns else []

months = np.array(sorted(df[DATE_COL].unique()))
n_months = len(months)

n_train = int(n_months * 0.60)
n_val = int(n_months * 0.20)
n_test = n_months - n_train - n_val

train_months = set(months[:n_train])
val_months = set(months[n_train:n_train + n_val])
test_months = set(months[n_train + n_val:])

df[SPLIT_COL] = np.where(
    df[DATE_COL].isin(train_months),
    "train",
    np.where(df[DATE_COL].isin(val_months), "val", "test"),
)

train_df = df.loc[df[SPLIT_COL] == "train"].copy()
val_df = df.loc[df[SPLIT_COL] == "val"].copy()
test_df = df.loc[df[SPLIT_COL] == "test"].copy()

## Data

This notebook loads a pre-built modeling table (`breakout_modeling.csv`) that already includes:
- eligibility filtering (Option A)
- label creation for horizon **H=2**
- censoring of the last H months
- cold-start buffer

This keeps the notebook focused on **modeling + evaluation**, similar to the SOTW workflow.


In [3]:
BETA = 0.5
ALERTS_MEDIAN_TARGET = 3
PRECISION_TARGET = 0.15
MIN_TP_ON_VAL = 1


In [4]:
def evaluate_probs(name: str, y_true: np.ndarray, y_proba: np.ndarray) -> dict:
    out = {
        "pr_auc": float(average_precision_score(y_true, y_proba)),
        "roc_auc": float(roc_auc_score(y_true, y_proba)),
        "pos_rate": float(np.mean(y_true)) if len(y_true) else 0.0,
        "n": int(len(y_true)),
        "n_pos": int(np.sum(y_true)),
    }
    print(
        f"{name:<5} | n={out['n']:,} | pos={out['n_pos']:,} ({out['pos_rate']:.3%}) "
        f"| PR-AUC={out['pr_auc']:.4f} | ROC-AUC={out['roc_auc']:.4f}"
    )
    return out


## Model 1 — Logistic Regression (baseline)

We use Logistic Regression as a strong baseline for sparse, tabular problems:
- numeric features are imputed (median) and standardized
- categorical features are imputed (most frequent) and one-hot encoded
- class imbalance is handled with `class_weight="balanced"`

Primary metric: **PR-AUC** (more informative than ROC-AUC under heavy imbalance).
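
To make the `class_weight="balanced"` bullet concrete, here is a small sketch of what those weights work out to for the training split's class counts (1,397 negatives and 16 positives, as printed in the XGBoost section). The label array is synthetic; only the counts come from this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the training split's class counts.
y = np.array([0] * 1397 + [1] * 16)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(4))))
assert np.allclose(weights, manual)
```

The ratio of the positive weight to the negative weight equals `n_neg / n_pos`, the same quantity used for `scale_pos_weight` in the XGBoost model, so both models reweight the classes by the same relative amount.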

In [5]:
excluded = set(ID_COLS + [TARGET_COL, SPLIT_COL] + CAT_COLS)
NUM_COLS = [c for c in df.columns if c not in excluded]

num_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

cat_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor_lr = ColumnTransformer(
    transformers=[
        ("num", num_transformer, NUM_COLS),
        ("cat", cat_transformer, CAT_COLS),
    ],
    remainder="drop",
)

lr_pipe = Pipeline(
    steps=[
        ("prep", preprocessor_lr),
        ("model", LogisticRegression(class_weight="balanced", max_iter=2000)),
    ]
)

In [6]:
X_train = train_df[NUM_COLS + CAT_COLS]
y_train = train_df[TARGET_COL].astype(int).to_numpy()

X_val = val_df[NUM_COLS + CAT_COLS]
y_val = val_df[TARGET_COL].astype(int).to_numpy()

lr_pipe.fit(X_train, y_train)

p_train = lr_pipe.predict_proba(X_train)[:, 1]
p_val = lr_pipe.predict_proba(X_val)[:, 1]

lr_train_metrics = evaluate_probs("train", y_train, p_train)
lr_val_metrics = evaluate_probs("val", y_val, p_val)


train | n=1,413 | pos=16 (1.132%) | PR-AUC=0.1544 | ROC-AUC=0.8849
val   | n=544 | pos=4 (0.735%) | PR-AUC=0.0313 | ROC-AUC=0.8222


In [7]:
C_GRID = [0.01, 0.1, 1.0, 10.0]

lr_tuning_rows = []
for C in C_GRID:
    lr_pipe.set_params(model__C=C)
    lr_pipe.fit(X_train, y_train)

    p_val = lr_pipe.predict_proba(X_val)[:, 1]
    row = {
        "C": C,
        "pr_auc_val": float(average_precision_score(y_val, p_val)),
        "roc_auc_val": float(roc_auc_score(y_val, p_val)),
    }
    lr_tuning_rows.append(row)

lr_tuning = pd.DataFrame(lr_tuning_rows).sort_values("pr_auc_val", ascending=False).reset_index(drop=True)
lr_tuning


| | C | pr_auc_val | roc_auc_val |
|---|---|---|---|
| 0 | 10.0 | 0.032528 | 0.830093 |
| 1 | 1.0 | 0.031326 | 0.822222 |
| 2 | 0.1 | 0.02861 | 0.806019 |
| 3 | 0.01 | 0.027006 | 0.773148 |


In [8]:
best_C = float(lr_tuning.loc[0, "C"])
print(f"best_C (by val PR-AUC): {best_C}")

lr_best_C = best_C

lr_pipe.set_params(model__C=lr_best_C)
lr_pipe.fit(X_train, y_train)

p_val_lr = lr_pipe.predict_proba(X_val)[:, 1]
lr_val_metrics = evaluate_probs("val", y_val, p_val_lr)


best_C (by val PR-AUC): 10.0
val   | n=544 | pos=4 (0.735%) | PR-AUC=0.0325 | ROC-AUC=0.8301


## Threshold selection (validation-only)

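In the product, predictions become “alerts”, so we choose the threshold using an **operational policy** evaluated on validation only:
- keep the **median alerts per month** at or below a target budget (`ALERTS_MEDIAN_TARGET`),
- while still catching at least `MIN_TP_ON_VAL` positives.

Among the thresholds that satisfy both constraints, we pick the one with the best F-beta (`BETA = 0.5`), breaking ties toward fewer alerts. If no threshold is feasible, we fall back to the best F-beta among thresholds with at least `MIN_TP_ON_VAL` true positives. Constraining first avoids the unstable behavior of unconstrained “maximize F-score” selection when validation has very few positives.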
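
The alert-budget constraint can be illustrated on a toy example. This is a sketch under assumptions: the three months and nine scores are made up, and `median_alerts` is a hypothetical helper, not the notebook's `pick_threshold_alert_budget`.

```python
import pandas as pd

# Made-up scores for three months, three artists each.
toy = pd.DataFrame({
    "month": ["2024-01"] * 3 + ["2024-02"] * 3 + ["2024-03"] * 3,
    "p": [0.9, 0.6, 0.1, 0.7, 0.2, 0.1, 0.3, 0.2, 0.1],
})


def median_alerts(frame: pd.DataFrame, threshold: float) -> float:
    """Median number of rows per month scoring at or above the threshold."""
    flagged = frame["p"] >= threshold
    # Months with no alerts still contribute a zero to the median.
    return float(flagged.groupby(frame["month"]).sum().median())


high = median_alerts(toy, 0.5)   # per-month alerts: 2, 1, 0
low = median_alerts(toy, 0.15)   # per-month alerts: 2, 2, 2
print(high, low)
```

Lowering the threshold can only keep or raise the number of alerts, so the budget effectively sets a floor on how low the threshold is allowed to go.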


In [9]:
def metrics_at_threshold(y_true: np.ndarray, y_proba: np.ndarray, t: float, beta: float = 0.5) -> dict:
    y_pred = (y_proba >= t).astype(int)

    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0
    fbeta = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall + 1e-12)

    out = {
        "t": float(t),
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        f"f{beta}": float(fbeta),
    }
    return out

def precision_at_top_pct_by_month(df_split: pd.DataFrame, proba: np.ndarray, pct: float = 0.05) -> pd.DataFrame:
    tmp = df_split[[DATE_COL, TARGET_COL]].copy()
    tmp["p"] = proba

    rows = []
    for m, g in tmp.groupby(DATE_COL):
        g = g.sort_values("p", ascending=False)
        n = len(g)
        k = int(np.ceil(pct * n))
        k = max(k, 1)

        topk = g.head(k)
        n_pos = int(g[TARGET_COL].sum())
        if n_pos == 0:
            continue

        prec = float(topk[TARGET_COL].mean())
        rows.append({"month": m, "n": n, "k": k, "pos_in_month": n_pos, "precision_top_pct": prec})

    return pd.DataFrame(rows).sort_values("month").reset_index(drop=True)

In [10]:
def pick_threshold_alert_budget(
    df_split: pd.DataFrame,
    y_true: np.ndarray,
    y_proba: np.ndarray,
    median_alerts_target: int,
    beta: float = 0.5,
    min_tp: int = 1,
) -> tuple[float, pd.DataFrame]:
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    thresholds = np.concatenate([thresholds, [1.0]])

    rows = []
    for t in thresholds:
        m = metrics_at_threshold(y_true, y_proba, float(t), beta=beta)

        alerts = (
            df_split.assign(pred=(y_proba >= t).astype(int))
                    .groupby(DATE_COL)["pred"]
                    .sum()
        )

        rows.append({
            "t": float(t),
            "precision": m["precision"],
            "recall": m["recall"],
            f"f{beta}": m[f"f{beta}"],
            "tp": m["tp"],
            "median_alerts": float(alerts.median()) if len(alerts) else 0.0,
            "max_alerts": float(alerts.max()) if len(alerts) else 0.0,
        })

    grid = pd.DataFrame(rows)

    feasible = grid[(grid["median_alerts"] <= median_alerts_target) & (grid["tp"] >= min_tp)]
    if len(feasible) == 0:
        # fallback: best F-beta among thresholds that get at least one TP (or overall best F-beta if none)
        fallback = grid[grid["tp"] >= min_tp]
        best = (fallback if len(fallback) else grid).sort_values(f"f{beta}", ascending=False).iloc[0]
        return float(best["t"]), grid.sort_values("t")

    # choose the threshold with best F-beta among feasible; break ties toward fewer alerts
    best = feasible.sort_values([f"f{beta}", "median_alerts"], ascending=[False, True]).iloc[0]
    return float(best["t"]), grid.sort_values("t")


In [11]:
lr_T, lr_threshold_grid = pick_threshold_alert_budget(
    df_split=val_df,
    y_true=y_val,
    y_proba=p_val_lr,
    median_alerts_target=ALERTS_MEDIAN_TARGET,
    beta=BETA,
    min_tp=MIN_TP_ON_VAL,
)

print(f"LR threshold (val-only) | policy=median_alerts<={ALERTS_MEDIAN_TARGET} | T={lr_T:.6f}")

lr_val_at_T = metrics_at_threshold(y_val, p_val_lr, lr_T, beta=BETA)
print(lr_val_at_T)

val_alerts = (
    val_df.assign(p=p_val_lr, pred=(p_val_lr >= lr_T).astype(int))
          .groupby(DATE_COL)["pred"]
          .sum()
)

print(
    "Val alerts/month (min/median/max) [LR]: "
    f"{int(val_alerts.min()):,} / {int(val_alerts.median()):,} / {int(val_alerts.max()):,}"
)

lr_threshold_grid.sort_values(["median_alerts", f"f{BETA}"], ascending=[True, False]).head(12)


LR threshold (val-only) | policy=median_alerts<=3 | T=0.569427
{'t': 0.5694266158070337, 'tp': 1, 'fp': 30, 'fn': 3, 'precision': 0.03225806451612903, 'recall': 0.25, 'f1': 0.05714285714285715, 'f0.5': 0.039062499999848635}
Val alerts/month (min/median/max) [LR]: 0 / 2 / 5


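| | t | precision | recall | f0.5 | tp | median_alerts | max_alerts |
|---|---|---|---|---|---|---|---|
| 430 | 0.946281 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 2.0 |
| 431 | 0.957968 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 2.0 |
| 432 | 0.961058 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 2.0 |
| 433 | 0.966195 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 1.0 |
| 434 | 0.973109 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 1.0 |
| 435 | 0.984156 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 1.0 |
| 436 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 427 | 0.787562 | 0.0 | 0.0 | 0.0 | 0 | 0.5 | 3.0 |
| 428 | 0.80577 | 0.0 | 0.0 | 0.0 | 0 | 0.5 | 2.0 |
| 429 | 0.906955 | 0.0 | 0.0 | 0.0 | 0 | 0.5 | 2.0 |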


In [12]:
rank_val = precision_at_top_pct_by_month(val_df, p_val_lr, pct=0.05)
print(f"Val months included (>=1 positive) [LR]: {len(rank_val):,} / {val_df[DATE_COL].nunique():,}")
rank_val.head(12)


Val months included (>=1 positive) [LR]: 4 / 12


| | month | n | k | pos_in_month | precision_top_pct |
|---|---|---|---|---|---|
| 0 | 2023-10-01 00:00:00+00:00 | 45 | 3 | 1 | 0.0 |
| 1 | 2023-11-01 00:00:00+00:00 | 46 | 3 | 1 | 0.333333 |
| 2 | 2024-05-01 00:00:00+00:00 | 44 | 3 | 1 | 0.0 |
| 3 | 2024-06-01 00:00:00+00:00 | 46 | 3 | 1 | 0.0 |


## Model 2 — XGBoost (light tuning)

We train a lightweight XGBoost model with:
- small hyperparameter grid
- early stopping using validation
- `scale_pos_weight = n_neg / n_pos` computed from the training split

We evaluate and threshold it using the **same validation-only policy** as Logistic Regression.


In [13]:
preprocessor_xgb = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
        ("num", "passthrough", NUM_COLS),
    ],
    remainder="drop",
)

X_train_xgb = preprocessor_xgb.fit_transform(X_train)
X_val_xgb = preprocessor_xgb.transform(X_val)

n_pos = int(y_train.sum())
n_neg = int((y_train == 0).sum())
scale_pos_weight = (n_neg / max(n_pos, 1))
print(f"scale_pos_weight (train): {scale_pos_weight:.3f} | n_pos={n_pos:,} | n_neg={n_neg:,}")


scale_pos_weight (train): 87.312 | n_pos=16 | n_neg=1,397


In [14]:
xgb_base_params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "seed": 42,
    "nthread": -1,
    "scale_pos_weight": scale_pos_weight,
}

param_grid = [
    {"max_depth": 3, "eta": 0.05, "subsample": 0.8, "colsample_bytree": 0.8},
    {"max_depth": 3, "eta": 0.10, "subsample": 0.8, "colsample_bytree": 0.8},
    {"max_depth": 4, "eta": 0.05, "subsample": 0.8, "colsample_bytree": 0.8},
    {"max_depth": 4, "eta": 0.10, "subsample": 0.8, "colsample_bytree": 0.8},
]

dtrain = xgb.DMatrix(X_train_xgb, label=y_train)
dval = xgb.DMatrix(X_val_xgb, label=y_val)

xgb_rows = []
for p in param_grid:
    params = {**xgb_base_params, **p}
    booster = xgb.train(
        params=params,
        dtrain=dtrain,
        num_boost_round=2000,
        evals=[(dtrain, "train"), (dval, "val")],
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    best_iter = int(booster.best_iteration)
    p_val_xgb = booster.predict(dval, iteration_range=(0, best_iter + 1))
    xgb_rows.append(
        {
            **p,
            "best_iter": best_iter,
            "pr_auc_val": float(average_precision_score(y_val, p_val_xgb)),
            "roc_auc_val": float(roc_auc_score(y_val, p_val_xgb)),
        }
    )

xgb_tuning = pd.DataFrame(xgb_rows).sort_values("pr_auc_val", ascending=False).reset_index(drop=True)
xgb_tuning


| | max_depth | eta | subsample | colsample_bytree | best_iter | pr_auc_val | roc_auc_val |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0.05 | 0.8 | 0.8 | 12 | 0.055706 | 0.909722 |
| 1 | 3 | 0.1 | 0.8 | 0.8 | 13 | 0.051384 | 0.898148 |
| 2 | 4 | 0.05 | 0.8 | 0.8 | 5 | 0.045967 | 0.884259 |
| 3 | 4 | 0.1 | 0.8 | 0.8 | 3 | 0.042764 | 0.883102 |


In [15]:
xgb_best = xgb_tuning.loc[0].to_dict()

xgb_best_params = {
    "max_depth": int(xgb_best["max_depth"]),
    "eta": float(xgb_best["eta"]),
    "subsample": float(xgb_best["subsample"]),
    "colsample_bytree": float(xgb_best["colsample_bytree"]),
}

xgb_best_iter = int(xgb_best["best_iter"])

best_params = {**xgb_base_params, **xgb_best_params}
xgb_model = xgb.train(
    params=best_params,
    dtrain=dtrain,
    num_boost_round=xgb_best_iter + 1,
    evals=[(dtrain, "train"), (dval, "val")],
    verbose_eval=False,
)

p_val_xgb = xgb_model.predict(dval)
evaluate_probs("val", y_val, p_val_xgb)

xgb_T, xgb_threshold_grid = pick_threshold_alert_budget(
    df_split=val_df,
    y_true=y_val,
    y_proba=p_val_xgb,
    median_alerts_target=ALERTS_MEDIAN_TARGET,
    beta=BETA,
    min_tp=MIN_TP_ON_VAL,
)

print(f"XGB threshold (val-only) | policy=median_alerts<={ALERTS_MEDIAN_TARGET} | T={xgb_T:.6f}")
xgb_threshold_grid.sort_values(["median_alerts", f"f{BETA}"], ascending=[True, False]).head(12)


val   | n=544 | pos=4 (0.735%) | PR-AUC=0.0557 | ROC-AUC=0.9097
XGB threshold (val-only) | policy=median_alerts<=3 | T=0.555525


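| | t | precision | recall | f0.5 | tp | median_alerts | max_alerts |
|---|---|---|---|---|---|---|---|
| 55 | 0.555525 | 0.071429 | 0.25 | 0.083333 | 1 | 0.0 | 4.0 |
| 54 | 0.521418 | 0.066667 | 0.25 | 0.078125 | 1 | 0.0 | 4.0 |
| 53 | 0.519861 | 0.0625 | 0.25 | 0.073529 | 1 | 0.0 | 4.0 |
| 56 | 0.558386 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 |
| 57 | 0.565084 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 |
| 58 | 0.578836 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 |
| 59 | 0.588085 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 3.0 |
| 60 | 0.597303 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 3.0 |
| 61 | 0.613592 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 3.0 |
| 62 | 0.616255 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 2.0 |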


In [16]:
xgb_val_at_T = metrics_at_threshold(y_val, p_val_xgb, xgb_T, beta=BETA)  # same BETA as the LR cell
print(xgb_val_at_T)

val_alerts_xgb = (
    val_df.assign(p=p_val_xgb, pred=(p_val_xgb >= xgb_T).astype(int))
          .groupby(DATE_COL)["pred"]
          .sum()
)

print(
    "Val alerts/month (min/median/max) [XGB]: "
    f"{int(val_alerts_xgb.min()):,} / {int(val_alerts_xgb.median()):,} / {int(val_alerts_xgb.max()):,}"
)


{'t': 0.5555253028869629, 'tp': 1, 'fp': 13, 'fn': 3, 'precision': 0.07142857142857142, 'recall': 0.25, 'f1': 0.11111111111111112, 'f0.5': 0.08333333333302222}
Val alerts/month (min/median/max) [XGB]: 0 / 0 / 4


In [17]:
rank_val_xgb = precision_at_top_pct_by_month(val_df, p_val_xgb, pct=0.05)
print(f"Val months included (>=1 positive) [XGB]: {len(rank_val_xgb):,} / {val_df[DATE_COL].nunique():,}")
rank_val_xgb.head(12)


Val months included (>=1 positive) [XGB]: 4 / 12


| | month | n | k | pos_in_month | precision_top_pct |
|---|---|---|---|---|---|
| 0 | 2023-10-01 00:00:00+00:00 | 45 | 3 | 1 | 0.0 |
| 1 | 2023-11-01 00:00:00+00:00 | 46 | 3 | 1 | 0.333333 |
| 2 | 2024-05-01 00:00:00+00:00 | 44 | 3 | 1 | 0.0 |
| 3 | 2024-06-01 00:00:00+00:00 | 46 | 3 | 1 | 0.0 |


## Model selection (validation-only)

We select the final model using **validation PR-AUC** as the primary metric.
Threshold-based metrics (precision/recall/alerts) are treated as **operational diagnostics**, not as the selection target.


In [18]:
lr_val_pr_auc = float(lr_val_metrics["pr_auc"])
xgb_val_pr_auc = float(average_precision_score(y_val, p_val_xgb))

print(f"LR  val PR-AUC: {lr_val_pr_auc:.6f} | C={lr_best_C} | T={lr_T:.6f} | policy=median_alerts<={ALERTS_MEDIAN_TARGET}")
print(f"XGB val PR-AUC: {xgb_val_pr_auc:.6f} | best_iter={xgb_best_iter} | T={xgb_T:.6f} | policy=median_alerts<={ALERTS_MEDIAN_TARGET}")

winner = "xgb" if xgb_val_pr_auc > lr_val_pr_auc else "lr"
print(f"Winner (by val PR-AUC): {winner}")


LR  val PR-AUC: 0.032528 | C=10.0 | T=0.569427 | policy=median_alerts<=3
XGB val PR-AUC: 0.055706 | best_iter=12 | T=0.555525 | policy=median_alerts<=3
Winner (by val PR-AUC): xgb


## Final evaluation (test set — touched once)

The selected model is refit on train + validation with the tuned hyperparameters, and the validation-chosen threshold is kept frozen. The test set is then scored **only once** to report final performance.


In [19]:
X_test = test_df[NUM_COLS + CAT_COLS]
y_test = test_df[TARGET_COL].astype(int).to_numpy()

trainval_df = df.loc[df[SPLIT_COL].isin(["train", "val"])].copy()
X_trainval = trainval_df[NUM_COLS + CAT_COLS]
y_trainval = trainval_df[TARGET_COL].astype(int).to_numpy()

if winner == "lr":
    lr_pipe.set_params(model__C=lr_best_C)
    lr_pipe.fit(X_trainval, y_trainval)

    p_test = lr_pipe.predict_proba(X_test)[:, 1]
    winner_T = lr_T

else:
    preprocessor_xgb_tv = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
            ("num", "passthrough", NUM_COLS),
        ],
        remainder="drop",
    )

    X_trainval_xgb = preprocessor_xgb_tv.fit_transform(X_trainval)
    X_test_xgb = preprocessor_xgb_tv.transform(X_test)

    dtrain_tv = xgb.DMatrix(X_trainval_xgb, label=y_trainval)
    dtest = xgb.DMatrix(X_test_xgb, label=y_test)

    n_pos_tv = int(y_trainval.sum())
    n_neg_tv = int((y_trainval == 0).sum())
    scale_pos_weight_tv = (n_neg_tv / max(n_pos_tv, 1))

    xgb_params_tv = {
        **xgb_base_params,
        **xgb_best_params,
        "scale_pos_weight": scale_pos_weight_tv,
    }

    xgb_model_tv = xgb.train(
        params=xgb_params_tv,
        dtrain=dtrain_tv,
        num_boost_round=xgb_best_iter + 1,
        evals=[(dtrain_tv, "trainval")],
        verbose_eval=False,
    )

    p_test = xgb_model_tv.predict(dtest)
    winner_T = xgb_T

evaluate_probs("test", y_test, p_test)


test  | n=678 | pos=10 (1.475%) | PR-AUC=0.0736 | ROC-AUC=0.8812


{'pr_auc': 0.07364444679618214,
 'roc_auc': 0.8812125748502995,
 'pos_rate': 0.014749262536873156,
 'n': 678,
 'n_pos': 10}

In [20]:
test_at_T = metrics_at_threshold(y_test, p_test, winner_T, beta=BETA)
print(test_at_T)

test_alerts = (
    test_df.assign(p=p_test, pred=(p_test >= winner_T).astype(int))
           .groupby(DATE_COL)["pred"]
           .sum()
)

print(
    f"Test alerts/month (min/median/max) [{winner.upper()}]: "
    f"{int(test_alerts.min()):,} / {int(test_alerts.median()):,} / {int(test_alerts.max()):,}"
)


{'t': 0.5555253028869629, 'tp': 4, 'fp': 48, 'fn': 6, 'precision': 0.07692307692307693, 'recall': 0.4, 'f1': 0.12903225806451613, 'f0.5': 0.09174311926583621}
Test alerts/month (min/median/max) [XGB]: 0 / 4 / 9


In [21]:
rank_test = precision_at_top_pct_by_month(test_df, p_test, pct=0.05)
print(f"Test months included (>=1 positive) [{winner.upper()}]: {len(rank_test):,} / {test_df[DATE_COL].nunique():,}")
rank_test.head(12)


Test months included (>=1 positive) [XGB]: 8 / 13


| | month | n | k | pos_in_month | precision_top_pct |
|---|---|---|---|---|---|
| 0 | 2024-10-01 00:00:00+00:00 | 52 | 3 | 1 | 0.0 |
| 1 | 2024-11-01 00:00:00+00:00 | 49 | 3 | 1 | 0.333333 |
| 2 | 2024-12-01 00:00:00+00:00 | 49 | 3 | 1 | 0.0 |
| 3 | 2025-02-01 00:00:00+00:00 | 53 | 3 | 1 | 0.0 |
| 4 | 2025-05-01 00:00:00+00:00 | 47 | 3 | 2 | 0.0 |
| 5 | 2025-06-01 00:00:00+00:00 | 49 | 3 | 1 | 0.0 |
| 6 | 2025-07-01 00:00:00+00:00 | 55 | 3 | 1 | 0.333333 |
| 7 | 2025-08-01 00:00:00+00:00 | 47 | 3 | 2 | 0.333333 |


## Results (v1) — What these numbers mean

### Model selection (validation only)

- **LR (val):** PR-AUC = **0.0325** (C=10.0), threshold **T=0.5694** (median_alerts<=3)
- **XGB (val):** PR-AUC = **0.0557** (best_iter=12), threshold **T=0.5555** (median_alerts<=3)
- **Winner:** **XGBoost**, because it ranks true breakouts higher on validation (higher PR-AUC).

**Interpretation:** for a very low-prevalence problem, the main question is “does the model push true breakouts up the ranking?”. PR-AUC is the headline metric for that question, so XGB is the v1 choice. Validation contains only 4 positives, so the margin (0.0557 vs 0.0325) is a noisy estimate.

---

### Final test evaluation (touched once)

#### Threshold-free performance (ranking quality)
- **Test prevalence:** 10 positives out of 678 rows (**1.48%**).
- **Test PR-AUC:** **0.0736**

A random model would have PR-AUC approximately equal to prevalence (~**0.0147**).  
So **0.0736 is ~5× higher than random**, which suggests the model learned real signal and tends to rank future breakouts above non-breakouts. With only 10 test positives the estimate is noisy, so treat the ratio as indicative rather than precise.

- **Test ROC-AUC:** **0.8812**

ROC-AUC is less informative under heavy class imbalance, but a value this high is still consistent with strong separation in rankings.
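
The random baseline quoted above can be checked empirically. The sketch below scores synthetic labels with uninformative random scores; the sample size, seed, and data are assumptions chosen for illustration and are unrelated to the modeling table.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels at roughly the test prevalence (about 1.5% positives).
n = 200_000
y = (rng.random(n) < 0.0147).astype(int)

# Scores that carry no information about the label.
scores = rng.random(n)

prevalence = y.mean()
ap = average_precision_score(y, scores)
auc = roc_auc_score(y, scores)
print(f"prevalence={prevalence:.4f}  PR-AUC={ap:.4f}  ROC-AUC={auc:.4f}")
```

For uninformative scores, PR-AUC should land near the prevalence while ROC-AUC lands near 0.5, which is why the PR-AUC baseline moves with the class balance and the ROC-AUC baseline does not.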

#### Operational performance at the frozen threshold
Using the validation-chosen threshold **T = 0.5555** (policy: **median alerts/month <= 3** on validation):

- **TP=4, FP=48, FN=6** on test
- **Precision:** 7.69%  (≈ 1 true breakout per ~13 alerts)
- **Recall:** 40% (catches 4 out of 10 true breakouts)
- **F0.5:** 0.0917 (precision-weighted)

**Interpretation:** with a rare-event label, precision will naturally be low. The key question is whether the alert volume is acceptable and whether the model can catch a meaningful fraction of breakouts.
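
For reference, the F0.5 value above follows the standard F-beta definition, which `metrics_at_threshold` implements up to a small `1e-12` stabilizer in the denominator. Plugging in the test values (P = 4/52, R = 4/10, β = 0.5):

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25 \cdot \tfrac{4}{52} \cdot \tfrac{4}{10}}
               {0.25 \cdot \tfrac{4}{52} + \tfrac{4}{10}}
        \approx \frac{0.03846}{0.41923}
        \approx 0.0917
```

With β < 1, precision counts more than recall, which is why F0.5 (0.0917) sits much closer to precision (0.0769) than to recall (0.40).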

#### Alert volume (product signal)
- **Test alerts/month (min / median / max):** **0 / 4 / 9**

This means that in a typical month the model would recommend about **4 breakout candidates**, but it can range from **0** (quiet months) to **9** (busier months).

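The test median (4) is slightly above the validation budget (≤ 3) and well above the validation median at this threshold (0). Two factors can contribute: the final model was refit on train + validation, which can shift its score distribution relative to the frozen threshold, and the test split has a higher base rate than validation (1.48% vs 0.74%).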

---

### Ranking diagnostic (sanity check only)
- Months in test: **13**
- Months with ≥1 positive (where ranking is meaningful): **8**

We compute Precision@top-5% only on those 8 months to avoid misleading “0-positive month” metrics.

**Interpretation:** the ranking diagnostic is limited by data sparsity (many months have 0 true breakouts), so it should be treated as a sanity check rather than a headline number.
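
One more reason to read this diagnostic loosely is how coarse the top-5% slice is at these cohort sizes. The sketch below mirrors the `k` computation in `precision_at_top_pct_by_month`; the helper name `top_k_size` is an illustrative assumption.

```python
import math


def top_k_size(n_rows: int, pct: float = 0.05) -> int:
    """Number of rows in the top-pct slice, never fewer than one."""
    return max(math.ceil(pct * n_rows), 1)


# Monthly cohort sizes in the ranking tables range from 44 to 55.
for n_rows in (44, 49, 55):
    k = top_k_size(n_rows)
    # With a single positive in the month, precision@k is either 0 or 1/k.
    print(n_rows, k, round(1 / k, 4))
```

At these sizes the slice is always three artists, so a month with one true breakout can only score 0 or 0.333, which matches the values in the tables.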

---

## Bottom line (v1)

- **XGB is the better v1 model** by the primary metric (validation PR-AUC).
- On test, the model provides **meaningful ranking signal** (PR-AUC ~5× above random baseline).
- At the frozen threshold, alert volume is **close to but above the budget** (median 4/month on test vs a target of 3), and **low precision** (7.7%) is expected given prevalence.

The next improvement lever is **threshold policy calibration** (alerts budget vs precision target), not model complexity.
