<a href="https://colab.research.google.com/github/CalculatedContent/xgboost2ww/blob/main/notebooks/XGBWW_Catalog_Random100_XGBoost_Accuracy_WithOverfitCatalog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What this notebook does

This notebook benchmarks **XGBoost classification accuracy** across a random sample of catalog datasets, then intentionally trains **overfit variants per completed dataset** to make failure modes visible.

## End-to-end workflow
1. Load `dataset_catalog.csv` from Drive and keep classification tasks.
2. Sample datasets with a fixed random seed and persist the selected list.
3. Train one baseline (“good”) model per sampled dataset.
4. For each completed baseline dataset, train one model per selected overfit mode (`OVERFIT_MODES[:MAX_OVERFIT_CASES]`).
5. Run `xgboost2ww` conversion + WeightWatcher metrics for both baseline and overfit runs.
6. Save checkpoint files after every dataset and every overfit case so interrupted runs can resume.
7. Write an aggregated output table containing both `case_type="good"` and `case_type="overfit"` rows.

## Overfit behavior in this notebook
- Overfit modes are configured by `OVERFIT_MODES` and capped by `MAX_OVERFIT_CASES` (default 6).
- Modes are applied **per completed dataset** (not globally across only a few datasets).
- Default modes:
  1. `deep_trees`
  2. `too_many_rounds`
  3. `no_regularization`
  4. `no_subsampling`
  5. `tiny_trainset`
  6. `leakage`
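
To make the cap concrete, here is a minimal sketch of how `OVERFIT_MODES[:MAX_OVERFIT_CASES]` selects modes and how that selection is crossed with every completed dataset. The `demo:*` UIDs and the reduced cap of 4 are assumptions chosen for illustration; they are not values from the notebook.

```python
OVERFIT_MODES = [
    "deep_trees",
    "too_many_rounds",
    "no_regularization",
    "no_subsampling",
    "tiny_trainset",
    "leakage",
]
MAX_OVERFIT_CASES = 4  # illustrative; the notebook default is 6

# Hypothetical UIDs standing in for datasets whose baseline run completed.
completed_uids = ["demo:a", "demo:b", "demo:c"]

# The slice keeps the first MAX_OVERFIT_CASES modes in list order.
selected_modes = OVERFIT_MODES[:MAX_OVERFIT_CASES]

# Every completed dataset gets every selected mode.
runs = [(uid, mode) for uid in completed_uids for mode in selected_modes]

print(selected_modes)
print(len(runs), "overfit runs planned")
```

Because the slice follows list order, reordering `OVERFIT_MODES` changes which modes survive a smaller cap.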

## Runtime logging you will see
- Baseline pass: dataset index progress (`[i/N]`), dataset UID, and final `train_accuracy` / `test_accuracy`.
- Overfit pass: dataset index progress (`[dataset i/N]`), overfit mode index (`[overfit j/M]`), overfit mode name, and final `train_accuracy` / `test_accuracy`.
- Skip/failure messages for datasets or modes that cannot be trained.

## Expected outputs
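- `selected_datasets.csv`: the sampled dataset list, reused on restart.
- `checkpoint_results.csv`: running status and per-run metrics.
- `errors.csv`: rows whose status is not `completed` (`failed`, `model_failed`, or still `pending`).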
- `results_per_dataset.csv`: completed baseline rows.
- `results_summary_by_source.csv`: baseline source-level summary.
- `checkpoint_results_good_plus_overfit.csv`: aggregated baseline + overfit rows.
- `experiment_config.json`: experiment settings and counts.
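
As a rough sketch of how the aggregated table might be consumed, the snippet below computes the mean train/test accuracy gap per `case_type` using only the standard library. The rows are invented to show the column layout and are not results from this notebook.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: these numbers are invented to show the file layout.
sample_csv = """dataset_uid,case_type,overfit_mode,train_accuracy,test_accuracy
demo:a,good,none,0.95,0.93
demo:a,overfit,deep_trees,1.00,0.90
demo:b,good,none,0.90,0.88
demo:b,overfit,deep_trees,1.00,0.80
"""

# Collect the train-minus-test gap for each case type.
gaps = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    gap = float(row["train_accuracy"]) - float(row["test_accuracy"])
    gaps[row["case_type"]].append(gap)

mean_gap = {case: sum(values) / len(values) for case, values in gaps.items()}
for case, value in sorted(mean_gap.items()):
    print(f"{case}: mean accuracy gap = {value:.3f}")
```

With the real `checkpoint_results_good_plus_overfit.csv`, rows for failed overfit cases carry missing accuracies, so filtering on `status == "completed"` before averaging is probably necessary.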

## Why include overfit cases?
Overfit runs give the diagnostic plots and sanity checks a known-bad reference: each one can be compared directly against the baseline model trained on the same dataset.



# XGBWW catalog-driven random-per-source XGBoost benchmark + targeted overfit cases

This notebook keeps the original catalog benchmark workflow, then trains intentionally overfit models for each completed dataset (one per mode in `OVERFIT_MODES[:MAX_OVERFIT_CASES]`, six with the defaults) and writes an aggregated checkpoint containing both good and overfit results.


## 1) Mount Google Drive and configure paths


In [None]:
from google.colab import drive
from pathlib import Path
import json
import pandas as pd

# ===== USER CONFIG =====
CATALOG_CSV = Path("/content/drive/MyDrive/xgbwwdata/catalog_checkpoint/dataset_catalog.csv")
RANDOM_SEED = 42
RANDOM_SAMPLE_SIZE = 100
TEST_SIZE = 0.20
EXPERIMENT_ROOT = Path("/content/drive/MyDrive/xgbwwdata/experiment_checkpoints")
DEFAULT_EXPERIMENT_BASENAME = "random100_xgboost_accuracy_plus_overfit"

# Targeted overfit cases
OVERFIT_MODES = [
    "deep_trees",
    "too_many_rounds",
    "no_regularization",
    "no_subsampling",
    "tiny_trainset",
    "leakage",
]
MAX_OVERFIT_CASES = 6
TINY_TRAIN_FRAC = 0.05

# Restart control
RESTART_EXPERIMENT = True
RETRY_FAILED_DATASETS = False  # Default: do not retry failed/model_failed datasets on restart.
EXPERIMENT_NAME = "random100_xgboost_accuracy_plus_overfit_run03" # Required for restart.
AUTO_INCREMENT_IF_NAME_MISSING = True
# =======================


def next_experiment_name(root: Path, base_name: str) -> str:
    existing = [d.name for d in root.glob(f"{base_name}_run*") if d.is_dir()]
    nums = []
    for name in existing:
        suffix = name.replace(f"{base_name}_run", "")
        if suffix.isdigit():
            nums.append(int(suffix))
    n = (max(nums) + 1) if nums else 1
    return f"{base_name}_run{n:02d}"


drive.mount("/content/drive")
EXPERIMENT_ROOT.mkdir(parents=True, exist_ok=True)

if RESTART_EXPERIMENT:
    if not EXPERIMENT_NAME:
        raise ValueError("Set EXPERIMENT_NAME when RESTART_EXPERIMENT=True.")
    EXPERIMENT_ID = EXPERIMENT_NAME
else:
    if EXPERIMENT_NAME:
        EXPERIMENT_ID = EXPERIMENT_NAME
    elif AUTO_INCREMENT_IF_NAME_MISSING:
        EXPERIMENT_ID = next_experiment_name(EXPERIMENT_ROOT, DEFAULT_EXPERIMENT_BASENAME)
    else:
        EXPERIMENT_ID = DEFAULT_EXPERIMENT_BASENAME

CHECKPOINT_DIR = EXPERIMENT_ROOT / EXPERIMENT_ID
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

CHECKPOINT_RESULTS_CSV = CHECKPOINT_DIR / "checkpoint_results.csv"
CHECKPOINT_ERRORS_CSV = CHECKPOINT_DIR / "errors.csv"
CHECKPOINT_AGGREGATED_CSV = CHECKPOINT_DIR / "checkpoint_results_good_plus_overfit.csv"
OVERFIT_RESULTS_CSV = CHECKPOINT_DIR / "overfit_results.csv"
SELECTED_DATASETS_CSV = CHECKPOINT_DIR / "selected_datasets.csv"
EXPERIMENT_CONFIG_JSON = CHECKPOINT_DIR / "experiment_config.json"

print("Catalog path:", CATALOG_CSV)
print("Experiment checkpoint:", CHECKPOINT_DIR)
print("Restart mode:", RESTART_EXPERIMENT)
print("Overfit results CSV:", OVERFIT_RESULTS_CSV)




Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Catalog path: /content/drive/MyDrive/xgbwwdata/catalog_checkpoint/dataset_catalog.csv
Experiment checkpoint: /content/drive/MyDrive/xgbwwdata/experiment_checkpoints/random100_xgboost_accuracy_plus_overfit_run01
Restart mode: True
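
The auto-increment naming used when `EXPERIMENT_NAME` is empty can be tried in isolation. The sketch below copies `next_experiment_name` from the cell above and runs it against a temporary directory; the `demo` base name and the directory names are assumptions for the demonstration.

```python
import tempfile
from pathlib import Path


def next_experiment_name(root: Path, base_name: str) -> str:
    # Same logic as the configuration cell: find the highest numeric run suffix and add one.
    existing = [d.name for d in root.glob(f"{base_name}_run*") if d.is_dir()]
    nums = []
    for name in existing:
        suffix = name.replace(f"{base_name}_run", "")
        if suffix.isdigit():
            nums.append(int(suffix))
    n = (max(nums) + 1) if nums else 1
    return f"{base_name}_run{n:02d}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = next_experiment_name(root, "demo")  # no runs exist yet
    (root / "demo_run01").mkdir()
    (root / "demo_run02").mkdir()
    (root / "demo_run_notes").mkdir()  # non-numeric suffix, ignored by the counter
    third = next_experiment_name(root, "demo")

print(first, third)
```

Directories whose suffix is not purely numeric are skipped, so side folders that share the prefix do not disturb the numbering.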


## How far did the previous run get?


In [None]:
# Progress snapshot from Google Drive checkpoint.
# Restart behavior:
# - status == "completed" is always skipped
# - failed/model_failed are retried only when RETRY_FAILED_DATASETS=True
# - pending/missing are always run
selected_df = pd.read_csv(SELECTED_DATASETS_CSV) if SELECTED_DATASETS_CSV.exists() else None
checkpoint_df = pd.read_csv(CHECKPOINT_RESULTS_CSV) if CHECKPOINT_RESULTS_CSV.exists() else pd.DataFrame()

if selected_df is not None and "dataset_uid" in selected_df.columns:
    target_uids = selected_df["dataset_uid"].astype(str).tolist()
else:
    target_uids = checkpoint_df.get("dataset_uid", pd.Series(dtype=str)).astype(str).tolist()
    if not target_uids:
        target_uids = [None] * RANDOM_SAMPLE_SIZE

status_by_uid = {}
if not checkpoint_df.empty and "dataset_uid" in checkpoint_df.columns:
    checkpoint_df = checkpoint_df.drop_duplicates(subset=["dataset_uid"], keep="last")
    if "status" in checkpoint_df.columns:
        status_by_uid = dict(zip(checkpoint_df["dataset_uid"].astype(str), checkpoint_df["status"].astype(str)))

completed_models = 0
remaining_models = 0
next_uid = None
next_index = None

for idx, uid in enumerate(target_uids, start=1):
    uid_key = str(uid) if uid is not None else None
    status = status_by_uid.get(uid_key, "missing")

    if status == "completed":
        completed_models += 1
        should_run = False
    elif status in {"failed", "model_failed"}:
        should_run = RETRY_FAILED_DATASETS
    else:
        should_run = True

    if should_run:
        remaining_models += 1
        if next_uid is None:
            next_uid = uid_key
            next_index = idx

failed_statuses = {"failed", "model_failed"}
failed_models = sum(1 for uid in target_uids if status_by_uid.get(str(uid), "missing") in failed_statuses)

print(f"Total selected models: {len(target_uids)}")
print(f"Completed models: {completed_models}")
print(f"Failed models currently on checkpoint: {failed_models}")
print(f"Retry failed datasets on restart: {RETRY_FAILED_DATASETS}")
print(f"Remaining models to run on restart: {remaining_models}")
if next_uid is not None:
    print(f"Restart will resume at dataset #{next_index}: {next_uid}")
else:
    print("Checkpoint is fully completed. No more models remain.")

Total selected models: 500
Completed models: 0
Failed models currently on checkpoint: 0
Retry failed datasets on restart: False
Remaining models to run on restart: 500
Restart will resume at dataset #1: openml:75
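
The restart rule applied in the cell above reduces to a small decision function. This is a sketch only: the helper name `should_run` is hypothetical (the notebook computes the same flag inline).

```python
def should_run(status: str, retry_failed: bool) -> bool:
    """Return True when a dataset with this checkpoint status is run on restart."""
    if status == "completed":
        return False  # finished datasets are always skipped
    if status in {"failed", "model_failed"}:
        return retry_failed  # retried only when RETRY_FAILED_DATASETS is True
    return True  # "pending", "missing", or any other status


statuses = ["completed", "failed", "model_failed", "pending", "missing"]
for retry in (False, True):
    todo = [s for s in statuses if should_run(s, retry)]
    print(f"RETRY_FAILED_DATASETS={retry}: run {todo}")
```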


## 2) Install dependencies

Use the same repository-install flow as the other Colab notebooks (no `pip install xgbwwdata`).


In [None]:
# Install xgbwwdata from a fresh clone using the repository installer script
!rm -rf /content/repo_xgbwwdata
!git clone https://github.com/CalculatedContent/xgbwwdata.git /content/repo_xgbwwdata
%run /content/repo_xgbwwdata/scripts/colab_install.py --repo /content/repo_xgbwwdata

# Notebook-specific dependencies
%pip install -q openml pmlb keel-ds xgboost scikit-learn xgboost2ww weightwatcher


Cloning into '/content/repo_xgbwwdata'...
remote: Enumerating objects: 142, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 142 (delta 11), reused 0 (delta 0), pack-reused 100 (from 2)
Receiving objects: 100% (142/142), 233.90 KiB | 1.96 MiB/s, done.
Resolving deltas: 100% (48/48), done.
+ /usr/bin/python3 -m pip install -U pip setuptools wheel
+ /usr/bin/python3 -m pip install -r /content/repo_xgbwwdata/requirements.txt


## 3) Imports


In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import xgboost as xgb
import weightwatcher as ww

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from xgbwwdata import Filters, load_dataset
from xgboost2ww import convert


## 4) Load catalog and pick 100 random dataset UIDs (checkpoint-aware)


In [None]:
if not CATALOG_CSV.exists():
    raise FileNotFoundError(f"Catalog not found: {CATALOG_CSV}. Run XGBWW_Dataset_Catalog_Checkpoint.ipynb first.")

df_catalog = pd.read_csv(CATALOG_CSV)
print("Catalog shape:", df_catalog.shape)

required_cols = {"dataset_uid", "source", "task_type"}
missing = required_cols - set(df_catalog.columns)
if missing:
    raise ValueError(f"Catalog is missing required columns: {missing}")

# Accuracy is for classification; keep classification-like tasks
df_cls = df_catalog[df_catalog["task_type"].astype(str).str.contains("classification", case=False, na=False)].copy()
if df_cls.empty:
    raise ValueError("No classification datasets found in catalog.")

# Select up to RANDOM_SAMPLE_SIZE random dataset UIDs up front for this experiment.
n_select = min(RANDOM_SAMPLE_SIZE, len(df_cls))
df_pick = df_cls.sample(n=n_select, random_state=RANDOM_SEED).reset_index(drop=True)

print("Selected datasets:", len(df_pick))
display(df_pick[["source", "dataset_uid", "name", "task_type"]].sort_values(["source", "dataset_uid"]))

# Initialize or reload checkpoint table
if RESTART_EXPERIMENT and CHECKPOINT_RESULTS_CSV.exists():
    checkpoint_df = pd.read_csv(CHECKPOINT_RESULTS_CSV)
    print(f"Loaded existing checkpoint rows: {len(checkpoint_df)}")
else:
    checkpoint_df = df_pick.copy()
    checkpoint_df["status"] = "pending"
    checkpoint_df["error_message"] = ""

    # model / hyperparameter fields (blank at initialization)
    checkpoint_df["xgboost_params"] = ""
    checkpoint_df["rounds"] = np.nan
    checkpoint_df["n_classes"] = np.nan
    checkpoint_df["train_size"] = np.nan
    checkpoint_df["test_size_rows"] = np.nan

    # train/test accuracy fields
    checkpoint_df["train_accuracy"] = np.nan
    checkpoint_df["test_accuracy"] = np.nan

    # weightwatcher metrics
    checkpoint_df["alpha"] = np.nan
    checkpoint_df["ERG_gap"] = np.nan
    checkpoint_df["num_traps"] = np.nan

    # metadata from actual loaded data
    checkpoint_df["dataset_name"] = checkpoint_df.get("name", "")
    checkpoint_df["dataset_openml_id"] = checkpoint_df.get("openml_data_id", np.nan)
    checkpoint_df["n_rows"] = np.nan
    checkpoint_df["n_features"] = np.nan
    checkpoint_df["test_size"] = TEST_SIZE

# Ensure the selected datasets are saved for restart reproducibility
if not SELECTED_DATASETS_CSV.exists():
    df_pick.to_csv(SELECTED_DATASETS_CSV, index=False)

# If restarting, trust existing selected datasets file when present
if RESTART_EXPERIMENT and SELECTED_DATASETS_CSV.exists():
    df_pick = pd.read_csv(SELECTED_DATASETS_CSV)

# Save initial checkpoint immediately
checkpoint_df.to_csv(CHECKPOINT_RESULTS_CSV, index=False)
print("Checkpoint file:", CHECKPOINT_RESULTS_CSV)
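
Why the cell above both fixes a seed and persists `selected_datasets.csv` can be illustrated with a small stand-in. The sketch uses `random.Random(seed).sample` in place of `DataFrame.sample(random_state=...)`, and the `demo:*` UIDs and the `pick` helper are hypothetical.

```python
import random

RANDOM_SEED = 42
RANDOM_SAMPLE_SIZE = 3  # small value for the demo; the notebook uses 100

# Hypothetical catalog UIDs; the real ones come from dataset_catalog.csv.
catalog_uids = [f"demo:{i}" for i in range(10)]


def pick(uids, n, seed):
    # A fresh generator per call means the seed fully determines the sample.
    return random.Random(seed).sample(uids, min(n, len(uids)))


first = pick(catalog_uids, RANDOM_SAMPLE_SIZE, RANDOM_SEED)
again = pick(catalog_uids, RANDOM_SAMPLE_SIZE, RANDOM_SEED)

# Same seed and same catalog give the same selection.
print(first == again)
```

A fixed seed only reproduces the selection while the catalog itself is unchanged, which is presumably why the notebook also writes the chosen list to disk and reloads it on restart.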


## 5) Train one XGBoost model per sampled dataset, run xgboost2ww + WeightWatcher, and report metrics


In [None]:
filters = Filters(
    min_rows=200,
    max_rows=60000,
    max_features=50000,
    max_dense_elements=int(2e8),
)


class ModelTrainingFailed(RuntimeError):
    """Raised when XGBoost model fitting fails for a dataset."""


def detect_xgb_compute_params():
    """Prefer CUDA in Colab when available; gracefully fall back to CPU."""
    gpu_params = {"tree_method": "hist", "device": "cuda"}
    cpu_params = {"tree_method": "hist"}

    X_probe = np.array([[0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
    y_probe = np.array([0, 0, 1, 1], dtype=np.float32)
    dprobe = xgb.DMatrix(X_probe, label=y_probe)

    try:
        xgb.train(
            params={
                "objective": "binary:logistic",
                "eval_metric": "logloss",
                **gpu_params,
            },
            dtrain=dprobe,
            num_boost_round=1,
            verbose_eval=False,
        )
        return gpu_params, "gpu"
    except Exception:
        return cpu_params, "cpu"


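def apply_overfit_mode(params: dict, mode: str, seed: int):
    """Return a copy of params with the overrides for one overfit mode applied."""
    p = dict(params)
    # Draw hyperparameters from a generator seeded separately from the data split.
    rng = np.random.default_rng(seed + 999)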

    if mode == "deep_trees":
        p["max_depth"] = int(rng.integers(12, 19))
        p["learning_rate"] = float(rng.uniform(0.2, 0.45))
        p["min_child_weight"] = float(rng.uniform(1.0, 4.0))
        p["reg_lambda"] = 0.0
        p["reg_alpha"] = 0.0
    elif mode == "too_many_rounds":
        p["learning_rate"] = float(rng.uniform(0.25, 0.5))
        p["max_depth"] = int(rng.integers(6, 12))
        p["subsample"] = 1.0
        p["colsample_bytree"] = 1.0
    elif mode == "no_regularization":
        p["reg_lambda"] = 0.0
        p["reg_alpha"] = 0.0
        p["gamma"] = 0.0
        p["max_depth"] = int(rng.integers(7, 13))
        p["learning_rate"] = float(rng.uniform(0.2, 0.45))
    elif mode == "no_subsampling":
        p["subsample"] = 1.0
        p["colsample_bytree"] = 1.0
        p["max_depth"] = int(rng.integers(6, 11))
        p["learning_rate"] = float(rng.uniform(0.2, 0.45))
    elif mode == "tiny_trainset":
        p["max_depth"] = int(rng.integers(9, 15))
        p["learning_rate"] = float(rng.uniform(0.25, 0.5))
        p["subsample"] = 1.0
        p["colsample_bytree"] = 1.0
    elif mode == "leakage":
        p["max_depth"] = int(rng.integers(5, 10))
        p["learning_rate"] = float(rng.uniform(0.2, 0.45))
        p["subsample"] = 1.0
        p["colsample_bytree"] = 1.0
    else:
        raise ValueError(f"Unknown overfit mode: {mode}")
    return p


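def training_schedule(case_type: str, overfit_mode: str | None):
    """Return (num_boost_round, early_stop) for a run.

    early_stop is always None here, so fit_and_score picks the round count with xgb.cv.
    """
    if case_type == "overfit" and overfit_mode == "too_many_rounds":
        return 7000, None
    if case_type == "overfit":
        return 3000, None
    return 1200, None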


XGB_COMPUTE_PARAMS, XGB_COMPUTE_BACKEND = detect_xgb_compute_params()
print(f"XGBoost compute backend detected: {XGB_COMPUTE_BACKEND} | params={XGB_COMPUTE_PARAMS}")


def fit_and_score(row_data: dict, case_type: str = "good", overfit_mode: str | None = None, seed_offset: int = 0):
    dataset_uid = row_data["dataset_uid"]
    source = row_data["source"]
    local_seed = RANDOM_SEED + seed_offset

    X, y, meta = load_dataset(dataset_uid, filters=filters)

    y = np.asarray(y)
    classes, y_enc = np.unique(y, return_inverse=True)
    n_classes = len(classes)
    if n_classes < 2:
        raise ValueError(f"Dataset {dataset_uid} has <2 classes after loading.")

    stratify = y_enc if n_classes > 1 else None
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_enc, test_size=TEST_SIZE, random_state=local_seed, stratify=stratify
    )

    if case_type == "overfit" and overfit_mode == "tiny_trainset":
        sub_idx, _ = train_test_split(
            np.arange(len(y_train)),
            test_size=(1.0 - TINY_TRAIN_FRAC),
            random_state=local_seed,
            stratify=y_train,
        )
        X_train, y_train = X_train[sub_idx], y_train[sub_idx]

    if case_type == "overfit" and overfit_mode == "leakage":
        rng = np.random.default_rng(local_seed)
        Xtr_dense = X_train.toarray() if hasattr(X_train, "toarray") else np.asarray(X_train)
        Xte_dense = X_test.toarray() if hasattr(X_test, "toarray") else np.asarray(X_test)
        leak_tr = (y_train + 0.05 * rng.standard_normal(len(y_train))).astype(np.float32).reshape(-1, 1)
        leak_te = (y_test + 0.05 * rng.standard_normal(len(y_test))).astype(np.float32).reshape(-1, 1)
        X_train = np.hstack([Xtr_dense.astype(np.float32), leak_tr])
        X_test = np.hstack([Xte_dense.astype(np.float32), leak_te])

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    try:
        if n_classes == 2:
            params = {
                "objective": "binary:logistic",
                "eval_metric": "logloss",
                **XGB_COMPUTE_PARAMS,
                "learning_rate": 0.05,
                "max_depth": 6,
                "subsample": 0.85,
                "colsample_bytree": 0.85,
                "min_child_weight": 2.0,
                "reg_lambda": 2.0,
                "reg_alpha": 0.2,
                "gamma": 0.1,
                "seed": local_seed,
            }
            if case_type == "overfit":
                params = apply_overfit_mode(params, overfit_mode, local_seed)
            num_boost_round, early_stop = training_schedule(case_type, overfit_mode)
            if early_stop is None:
                cv = xgb.cv(
                    params=params,
                    dtrain=dtrain,
                    num_boost_round=num_boost_round,
                    nfold=5,
                    stratified=True,
                    early_stopping_rounds=50,
                    seed=local_seed,
                    verbose_eval=False,
                )
                # With early stopping, xgb.cv returns history only up to the best iteration.
                rounds = len(cv)
                model = xgb.train(params=params, dtrain=dtrain, num_boost_round=rounds, verbose_eval=False)
            else:
                model = xgb.train(
                    params=params,
                    dtrain=dtrain,
                    num_boost_round=num_boost_round,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    early_stopping_rounds=early_stop,
                    verbose_eval=False,
                )
                rounds = int(model.best_iteration + 1)

            yhat_tr = (model.predict(dtrain) >= 0.5).astype(int)
            yhat_te = (model.predict(dtest) >= 0.5).astype(int)
        else:
            params = {
                "objective": "multi:softprob",
                "num_class": n_classes,
                "eval_metric": "mlogloss",
                **XGB_COMPUTE_PARAMS,
                "learning_rate": 0.05,
                "max_depth": 7,
                "subsample": 0.9,
                "colsample_bytree": 0.9,
                "min_child_weight": 1.0,
                "reg_lambda": 1.0,
                "reg_alpha": 0.1,
                "gamma": 0.0,
                "seed": local_seed,
            }
            if case_type == "overfit":
                params = apply_overfit_mode(params, overfit_mode, local_seed)
            num_boost_round, early_stop = training_schedule(case_type, overfit_mode)
            if early_stop is None:
                cv = xgb.cv(
                    params=params,
                    dtrain=dtrain,
                    num_boost_round=num_boost_round,
                    nfold=5,
                    stratified=True,
                    early_stopping_rounds=60,
                    seed=local_seed,
                    verbose_eval=False,
                )
                # With early stopping, xgb.cv returns history only up to the best iteration.
                rounds = len(cv)
                model = xgb.train(params=params, dtrain=dtrain, num_boost_round=rounds, verbose_eval=False)
            else:
                model = xgb.train(
                    params=params,
                    dtrain=dtrain,
                    num_boost_round=num_boost_round,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    early_stopping_rounds=early_stop,
                    verbose_eval=False,
                )
                rounds = int(model.best_iteration + 1)

            yhat_tr = np.argmax(model.predict(dtrain), axis=1)
            yhat_te = np.argmax(model.predict(dtest), axis=1)
    except Exception as e:
        raise ModelTrainingFailed(f"Model training failed for dataset {dataset_uid}: {e}") from e

    ww_layer = convert(
        model,
        X_train,
        y_train,
        W="W7",
        nfolds=5,
        t_points=160,
        random_state=local_seed,
        train_params=params,
        num_boost_round=rounds,
        multiclass="avg" if n_classes > 2 else "error",
        return_type="torch",
    )
    watcher = ww.WeightWatcher(model=ww_layer)
    details_df = watcher.analyze(ERG=True, randomize=True, plot=False)

    alpha = float(details_df["alpha"].iloc[0]) if "alpha" in details_df else np.nan
    erg_gap = float(details_df["ERG_gap"].iloc[0]) if "ERG_gap" in details_df else np.nan
    num_traps = float(details_df["num_traps"].iloc[0]) if "num_traps" in details_df else np.nan

    result = {
        "source": source,
        "dataset_uid": dataset_uid,
        "dataset_name": row_data.get("name", meta.get("name")),
        "task_type": row_data.get("task_type"),
        "dataset_openml_id": row_data.get("openml_data_id"),
        "n_rows": int(meta.get("n_rows", len(y))),
        "n_features": int(meta.get("n_features", X.shape[1] if hasattr(X, "shape") else -1)),
        "n_classes": int(n_classes),
        "test_size": float(TEST_SIZE),
        "train_size": int(len(y_train)),
        "test_size_rows": int(len(y_test)),
        "rounds": int(rounds),
        "train_accuracy": float(accuracy_score(y_train, yhat_tr)),
        "test_accuracy": float(accuracy_score(y_test, yhat_te)),
        "accuracy_gap": float(accuracy_score(y_train, yhat_tr) - accuracy_score(y_test, yhat_te)),
        "alpha": alpha,
        "ERG_gap": erg_gap,
        "num_traps": num_traps,
        "case_type": case_type,
        "overfit_mode": overfit_mode if overfit_mode else "none",
        "seed_used": int(local_seed),
        "xgboost_params": json.dumps(params, sort_keys=True),
        "status": "completed",
        "error_message": "",
    }
    return result


def update_checkpoint_row(uid: str, updates: dict):
    global checkpoint_df
    mask = checkpoint_df["dataset_uid"] == uid
    if not mask.any():
        return
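    for k, v in updates.items():
        if k not in checkpoint_df.columns:
            checkpoint_df[k] = np.nan
        # Cast to object before storing text so pandas need not upcast a float column in place.
        if isinstance(v, str) and checkpoint_df[k].dtype != object:
            checkpoint_df[k] = checkpoint_df[k].astype(object)
        checkpoint_df.loc[mask, k] = v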
    checkpoint_df.to_csv(CHECKPOINT_RESULTS_CSV, index=False)


total_models = len(df_pick)

for model_idx, row in enumerate(df_pick.itertuples(index=False), start=1):
    row_data = row._asdict()
    uid = row_data["dataset_uid"]
    progress = f"[{model_idx}/{total_models}]"

    existing = checkpoint_df.loc[checkpoint_df["dataset_uid"] == uid, "status"]
    if not existing.empty:
        existing_status = str(existing.iloc[0])
        if existing_status == "completed":
            print(f"{progress} Skipping completed dataset: {uid}")
            continue
        if existing_status in {"failed", "model_failed"} and not RETRY_FAILED_DATASETS:
            print(f"{progress} Skipping failed dataset (retry disabled): {uid}")
            continue

    print(f"{progress} Training + WW analysis: {uid} | case_type=good")
    try:
        result = fit_and_score(row_data, case_type="good", overfit_mode=None, seed_offset=0)
        update_checkpoint_row(uid, result)
        print(
            f"{progress} Completed: {uid} | train_accuracy={result['train_accuracy']:.4f} "
            f"| test_accuracy={result['test_accuracy']:.4f}"
        )
    except ModelTrainingFailed as e:
        err_msg = str(e)
        update_checkpoint_row(uid, {"status": "model_failed", "error_message": err_msg})
        print(f"{progress} Marked model_failed for {uid}: {err_msg}")
    except Exception as e:
        err_msg = str(e)
        update_checkpoint_row(uid, {"status": "failed", "error_message": err_msg})
        print(f"{progress} Skipped {uid}: {err_msg}")

results_df = checkpoint_df[checkpoint_df["status"] == "completed"].copy()
errors_df = checkpoint_df[checkpoint_df["status"] != "completed"][["source", "dataset_uid", "status", "error_message"]].rename(columns={"error_message": "error"}).copy()

if not errors_df.empty:
    errors_df.to_csv(CHECKPOINT_ERRORS_CSV, index=False)

print("Completed:", len(results_df), "datasets")
print("Failed:", len(errors_df), "datasets")

display(results_df.sort_values(["source", "test_accuracy"], ascending=[True, False]))
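
The `leakage` mode in `fit_and_score` appends one column equal to the label plus Gaussian noise with standard deviation 0.05. The sketch below reproduces that construction with the standard library (instead of NumPy) on synthetic binary labels, to show how much of the label such a column gives away.

```python
import random

rng = random.Random(0)

# Synthetic binary labels; the notebook uses the encoded labels of each dataset.
labels = [rng.randint(0, 1) for _ in range(1000)]

# Same construction as the leakage mode: label plus Gaussian noise with sd 0.05.
leak_feature = [y + 0.05 * rng.gauss(0.0, 1.0) for y in labels]

# A single threshold on the leaked column is enough to recover the label.
recovered = [1 if value >= 0.5 else 0 for value in leak_feature]
agreement = sum(r == y for r, y in zip(recovered, labels)) / len(labels)

print(f"agreement with true labels: {agreement:.3f}")
```

Because the notebook adds the leaked column to both the training and the test split, this mode would likely raise test accuracy as well rather than widen the train/test gap, which is worth keeping in mind when reading its rows.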


## 6) Summary tables (accuracy + WeightWatcher metrics)


In [None]:
if results_df.empty:
    print("No successful trainings.")
else:
    metric_cols = ["train_accuracy", "test_accuracy", "accuracy_gap", "alpha", "ERG_gap", "num_traps"]
    summary = (
        results_df.groupby("source", as_index=False)[metric_cols]
        .agg(["mean", "std", "min", "max"])
    )
    summary.columns = ["source"] + [f"{a}_{b}" for a, b in summary.columns.tolist()[1:]]

    print("Per-dataset results (good models):")
    display(results_df.sort_values(["source", "test_accuracy"], ascending=[True, False]))

    print("Per-source summary (good models):")
    display(summary.sort_values("test_accuracy_mean", ascending=False))

    experiment_config = {
        "experiment_id": EXPERIMENT_ID,
        "experiment_name": EXPERIMENT_ID,
        "restart_experiment": RESTART_EXPERIMENT,
        "catalog_csv": str(CATALOG_CSV),
        "random_seed": RANDOM_SEED,
        "random_sample_size": RANDOM_SAMPLE_SIZE,
        "test_size": TEST_SIZE,
        "overfit_modes": OVERFIT_MODES,
        "max_overfit_cases": MAX_OVERFIT_CASES,
        "selected_dataset_count": int(len(df_pick)),
        "successful_dataset_count": int(len(results_df)),
        "failed_dataset_count": int(len(errors_df)),
    }

    results_path = CHECKPOINT_DIR / "results_per_dataset.csv"
    summary_path = CHECKPOINT_DIR / "results_summary_by_source.csv"

    checkpoint_df.to_csv(CHECKPOINT_RESULTS_CSV, index=False)
    results_df.to_csv(results_path, index=False)
    summary.to_csv(summary_path, index=False)
    errors_df.to_csv(CHECKPOINT_ERRORS_CSV, index=False)
    with open(EXPERIMENT_CONFIG_JSON, "w") as f:
        json.dump(experiment_config, f, indent=2)

    print("Saved checkpoint files:")
    print(" -", CHECKPOINT_RESULTS_CSV)
    print(" -", results_path)
    print(" -", summary_path)
    print(" -", CHECKPOINT_ERRORS_CSV)
    print(" -", EXPERIMENT_CONFIG_JSON)
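
The per-source summary flattens each metric and aggregation into a `metric_aggregation` column name, which is why the cell above can sort by `test_accuracy_mean`. A standard-library sketch of the same idea follows; the rows are invented for illustration and are not results from this notebook.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented rows for illustration only.
rows = [
    {"source": "openml", "test_accuracy": 0.90},
    {"source": "openml", "test_accuracy": 0.80},
    {"source": "pmlb", "test_accuracy": 0.70},
    {"source": "pmlb", "test_accuracy": 0.90},
]

# Group the metric values by source.
by_source = defaultdict(list)
for row in rows:
    by_source[row["source"]].append(row["test_accuracy"])

# One flattened column per aggregation, named "<metric>_<aggregation>".
aggregations = {"mean": mean, "std": stdev, "min": min, "max": max}
summary = [
    {"source": source, **{f"test_accuracy_{name}": fn(values) for name, fn in aggregations.items()}}
    for source, values in by_source.items()
]
summary.sort(key=lambda r: r["test_accuracy_mean"], reverse=True)

for record in summary:
    print(record["source"], round(record["test_accuracy_mean"], 3))
```

One difference from pandas: `statistics.stdev` raises an error for a group with a single value, whereas the pandas `std` aggregation returns NaN there.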


## 7) Accuracy comparison plots vs WeightWatcher metrics


In [None]:
if results_df.empty:
    print("No successful trainings to plot.")
else:
    import matplotlib.pyplot as plt

    plot_df = results_df.sort_values(["source", "dataset_uid"]).copy()

    metrics = ["alpha", "ERG_gap", "num_traps"]
    fig, axes = plt.subplots(1, len(metrics), figsize=(6 * len(metrics), 5), squeeze=False)

    for ax, metric in zip(axes[0], metrics):
        ax.scatter(plot_df[metric], plot_df["train_accuracy"], label="Train accuracy", alpha=0.8)
        ax.scatter(plot_df[metric], plot_df["test_accuracy"], label="Test accuracy", alpha=0.8)
        ax.set_xlabel(metric)
        ax.set_ylabel("Accuracy")
        ax.set_ylim(0.4, 1.05)
        if metric == "alpha":
            ax.set_xlim(1.5, 6)
        ax.set_title(f"Good models: Accuracy vs {metric}")
        ax.grid(alpha=0.2)

    axes[0, 0].legend()
    fig.tight_layout()
    plt.show()


## 8) Train targeted overfit cases for each completed dataset and build aggregated checkpoint
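
The cell in this section skips `(dataset_uid, overfit_mode)` pairs that already completed and derives a distinct seed for each remaining pair from `RANDOM_SEED + 1000 + ds_idx * 100 + mode_idx`. A minimal sketch of that bookkeeping, with hypothetical `demo:*` UIDs and an invented set of previously completed pairs:

```python
RANDOM_SEED = 42
modes = ["deep_trees", "too_many_rounds", "no_regularization"]
completed_uids = ["demo:a", "demo:b"]  # hypothetical dataset UIDs

# Pairs already marked completed in a previous session (illustrative).
completed_pairs = {("demo:a", "deep_trees"), ("demo:a", "too_many_rounds")}

planned = []
for ds_idx, uid in enumerate(completed_uids, start=1):
    for mode_idx, mode in enumerate(modes, start=1):
        if (uid, mode) in completed_pairs:
            continue  # finished earlier, do not retrain
        # Same offset formula as the notebook: unique per pair while mode_idx < 100.
        seed = RANDOM_SEED + 1000 + ds_idx * 100 + mode_idx
        planned.append((uid, mode, seed))

for item in planned:
    print(item)
```

Since the seed depends on the dataset's position among completed baselines, a change in that set between sessions (for example after retrying failed datasets) would shift the seeds of pairs that have not run yet.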


In [None]:
import matplotlib.pyplot as plt

if results_df.empty:
    print("No completed good models. Cannot build overfit comparison set.")
    combined_df = results_df.copy()
    overfit_df = pd.DataFrame()
else:
    good_df = results_df.copy()
    good_df["case_type"] = good_df.get("case_type", "good")
    good_df["overfit_mode"] = good_df.get("overfit_mode", "none")
    if "accuracy_gap" not in good_df.columns:
        good_df["accuracy_gap"] = good_df["train_accuracy"] - good_df["test_accuracy"]

    # Restart-aware overfit checkpointing:
    # - load previously saved overfit rows from CHECKPOINT_AGGREGATED_CSV
    # - skip already completed (dataset_uid, overfit_mode) pairs
    # - persist after every overfit case so long runs can resume exactly
    overfit_cols = [
        "case_type", "dataset_uid", "overfit_mode", "status", "error_message",
        "source", "train_accuracy", "test_accuracy", "accuracy_gap", "alpha", "ERG_gap", "num_traps"
    ]
    existing_overfit_df = pd.DataFrame(columns=overfit_cols)
    if CHECKPOINT_AGGREGATED_CSV.exists():
        existing_combined_df = pd.read_csv(CHECKPOINT_AGGREGATED_CSV)
        if "case_type" in existing_combined_df.columns:
            existing_overfit_df = existing_combined_df[
                existing_combined_df["case_type"].astype(str) == "overfit"
            ].copy()
            if "status" not in existing_overfit_df.columns:
                existing_overfit_df["status"] = "completed"
            if "error_message" not in existing_overfit_df.columns:
                existing_overfit_df["error_message"] = ""

    completed_pairs = set()
    failed_pairs = set()
    if not existing_overfit_df.empty:
        existing_overfit_df["dataset_uid"] = existing_overfit_df["dataset_uid"].astype(str)
        existing_overfit_df["overfit_mode"] = existing_overfit_df["overfit_mode"].astype(str)
        completed_pairs = {
            (uid, mode)
            for uid, mode, status in zip(
                existing_overfit_df["dataset_uid"],
                existing_overfit_df["overfit_mode"],
                existing_overfit_df["status"].astype(str),
            )
            if status == "completed"
        }
        failed_pairs = {
            (uid, mode)
            for uid, mode, status in zip(
                existing_overfit_df["dataset_uid"],
                existing_overfit_df["overfit_mode"],
                existing_overfit_df["status"].astype(str),
            )
            if status == "failed"
        }

    per_dataset_modes = OVERFIT_MODES[:MAX_OVERFIT_CASES]
    expected_total = len(good_df) * len(per_dataset_modes)
    print(
        f"Creating up to {len(per_dataset_modes)} overfit cases per completed dataset "
        f"using modes: {per_dataset_modes}"
    )
    print(
        f"Existing overfit checkpoint rows: {len(existing_overfit_df)} "
        f"| completed pairs: {len(completed_pairs)} | failed pairs: {len(failed_pairs)}"
    )

    # Deduplicate by pair, preferring the latest row if present in existing checkpoint.
    overfit_records = {}
    if not existing_overfit_df.empty:
        for _, r in existing_overfit_df.iterrows():
            key = (str(r.get("dataset_uid")), str(r.get("overfit_mode")))
            overfit_records[key] = r.to_dict()

    total_good = len(good_df)
    for ds_idx, (_, row) in enumerate(good_df.iterrows(), start=1):
        row_data = row.to_dict()
        uid = str(row_data["dataset_uid"])
        print(f"[dataset {ds_idx}/{total_good}] dataset={uid} | case_type=overfit")

        for mode_idx, mode in enumerate(per_dataset_modes, start=1):
            pair = (uid, str(mode))
            print(f"  [overfit {mode_idx}/{len(per_dataset_modes)}] mode={mode}")

            if pair in completed_pairs:
                print(f"    SKIP completed overfit checkpoint pair={pair}")
                continue

            try:
                overfit_result = fit_and_score(
                    row_data,
                    case_type="overfit",
                    overfit_mode=mode,
                    seed_offset=1000 + (ds_idx * 100) + mode_idx,
                )
                overfit_result["status"] = "completed"
                overfit_result["error_message"] = ""
                print(
                    f"    [dataset {ds_idx}/{total_good}] overfit_mode={mode} "
                    f"| train_accuracy={overfit_result['train_accuracy']:.4f} "
                    f"| test_accuracy={overfit_result['test_accuracy']:.4f}"
                )
                overfit_records[pair] = overfit_result
                completed_pairs.add(pair)
                failed_pairs.discard(pair)
            except Exception as e:
                err = str(e)
                print(f"    FAILED overfit case dataset={uid} mode={mode}: {err}")
                fail_row = {
                    "dataset_uid": uid,
                    "source": row_data.get("source", "unknown"),
                    "case_type": "overfit",
                    "overfit_mode": mode,
                    "status": "failed",
                    "error_message": err,
                    "train_accuracy": np.nan,
                    "test_accuracy": np.nan,
                    "accuracy_gap": np.nan,
                    "alpha": np.nan,
                    "ERG_gap": np.nan,
                    "num_traps": np.nan,
                }
                overfit_records[pair] = fail_row
                failed_pairs.add(pair)

            # Persist after every pair update for robust restart behavior.
            overfit_df = pd.DataFrame(list(overfit_records.values()))
            combined_df = pd.concat([good_df, overfit_df], ignore_index=True, sort=False)
            combined_df.to_csv(CHECKPOINT_AGGREGATED_CSV, index=False)

    overfit_df = pd.DataFrame(list(overfit_records.values()))
    for acc_col in ["train_accuracy", "test_accuracy", "accuracy_gap"]:
        if acc_col not in overfit_df.columns:
            overfit_df[acc_col] = np.nan
        overfit_df[acc_col] = pd.to_numeric(overfit_df[acc_col], errors="coerce")
    overfit_df.to_csv(OVERFIT_RESULTS_CSV, index=False)

    combined_df = pd.concat([good_df, overfit_df], ignore_index=True, sort=False)
    combined_df.to_csv(CHECKPOINT_AGGREGATED_CSV, index=False)

    print("Saved overfit-only checkpoint:")
    print(" -", OVERFIT_RESULTS_CSV)
    print("Saved aggregated checkpoint:")
    print(" -", CHECKPOINT_AGGREGATED_CSV)
    print(
        "Good rows:",
        len(good_df),
        "| Overfit rows:",
        len(overfit_df),
        "| Expected overfit rows:",
        expected_total,
    )
    if "status" in overfit_df.columns:
        print("Overfit status counts:")
        display(overfit_df["status"].value_counts(dropna=False))

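    if not combined_df.empty:
        # Show only the summary columns that are present, so a missing column cannot raise a KeyError.
        summary_cols = [
            "case_type", "overfit_mode", "source", "dataset_uid", "train_accuracy", "test_accuracy",
            "accuracy_gap", "alpha", "ERG_gap", "num_traps"
        ]
        summary_cols = [c for c in summary_cols if c in combined_df.columns]
        sort_cols = [c for c in ["case_type", "source", "dataset_uid", "overfit_mode"] if c in summary_cols]
        summary_table = combined_df[summary_cols]
        if sort_cols:
            summary_table = summary_table.sort_values(sort_cols)
        display(summary_table)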

        # Scatter plot: train vs test accuracy
        plt.figure(figsize=(8, 6))
        plot_df = combined_df.copy()
        for case, marker in [("good", "o"), ("overfit", "x")]:
            sub = plot_df[plot_df["case_type"] == case]
            if sub.empty:
                continue
            plt.scatter(
                sub["test_accuracy"],
                sub["train_accuracy"],
                alpha=0.7,
                label=case,
                marker=marker,
            )
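        # Diagonal reference line (train == test). Points above it have higher train than test accuracy.
        min_acc = np.nanmin([plot_df["test_accuracy"].min(), plot_df["train_accuracy"].min()])
        max_acc = np.nanmax([plot_df["test_accuracy"].max(), plot_df["train_accuracy"].max()])
        plt.plot([min_acc, max_acc], [min_acc, max_acc], linestyle="--", label="train = test")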
        plt.xlabel("Test accuracy")
        plt.ylabel("Train accuracy")
        plt.title("Train vs Test accuracy: good vs overfit cases")
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()
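
The skip-completed / retry-failed behavior of the cell above can be sketched in isolation. This is a minimal illustration on toy data, not the notebook's implementation: `run_pairs`, `fake_fit`, and the temporary checkpoint path are hypothetical names introduced here, and `fake_fit` stands in for `fit_and_score`.

```python
import os
import tempfile

import pandas as pd


def run_pairs(pairs, fit_fn, checkpoint_csv):
    """Run fit_fn for each (uid, mode) pair, skipping pairs already marked completed."""
    records = {}
    if os.path.exists(checkpoint_csv):
        # Read keys back as strings so they match the keys built below.
        prev = pd.read_csv(checkpoint_csv, dtype={"dataset_uid": str, "overfit_mode": str})
        for rec in prev.to_dict("records"):
            records[(rec["dataset_uid"], rec["overfit_mode"])] = rec

    for uid, mode in pairs:
        key = (str(uid), str(mode))
        if records.get(key, {}).get("status") == "completed":
            continue  # already done in an earlier run
        try:
            metrics = fit_fn(uid, mode)
            row = {"dataset_uid": key[0], "overfit_mode": key[1],
                   "status": "completed", "error_message": "", **metrics}
        except Exception as exc:
            row = {"dataset_uid": key[0], "overfit_mode": key[1],
                   "status": "failed", "error_message": str(exc)}
        records[key] = row  # latest row wins for this pair
        # Persist after every pair so an interrupted run can resume.
        pd.DataFrame(list(records.values())).to_csv(checkpoint_csv, index=False)

    return pd.DataFrame(list(records.values()))


attempts = []


def fake_fit(uid, mode):
    """Hypothetical stand-in for fit_and_score: fails on the first 'leakage' attempt only."""
    attempts.append((uid, mode))
    if mode == "leakage" and attempts.count((uid, mode)) == 1:
        raise ValueError("simulated failure")
    return {"train_accuracy": 1.0, "test_accuracy": 0.5}


pairs = [("1", "deep_trees"), ("1", "leakage")]
ckpt = os.path.join(tempfile.mkdtemp(), "overfit_checkpoint.csv")
first = run_pairs(pairs, fake_fit, ckpt)   # leakage fails on this pass
second = run_pairs(pairs, fake_fit, ckpt)  # deep_trees is skipped, leakage is retried
print(first["status"].tolist(), second["status"].tolist())
```

Keying the records by `(dataset_uid, overfit_mode)` means a retried pair overwrites its earlier failed row instead of adding a duplicate.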



In [None]:
!head -200 $CHECKPOINT_AGGREGATED_CSV


In [None]:
# 9) Per-overfit-mode accuracy visualizations

import matplotlib.pyplot as plt

if 'combined_df' not in globals() or combined_df.empty:
    print('No combined results available. Run the training/aggregation cells first.')
else:
    overfit_plot_df = combined_df[combined_df['case_type'] == 'overfit'].copy()
    if overfit_plot_df.empty:
        print('No overfit rows available for accuracy plots.')
    else:
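        # Keep completed runs only. If there is no status column, treat every row as completed.
        if 'status' in overfit_plot_df.columns:
            overfit_plot_df = overfit_plot_df[overfit_plot_df['status'].astype(str) == 'completed'].copy()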
        for acc_col in ['train_accuracy', 'test_accuracy']:
            overfit_plot_df[acc_col] = pd.to_numeric(overfit_plot_df[acc_col], errors='coerce')
        overfit_plot_df = overfit_plot_df.dropna(subset=['train_accuracy', 'test_accuracy'])
        overfit_plot_df['generalization_gap'] = overfit_plot_df['train_accuracy'] - overfit_plot_df['test_accuracy']

        if overfit_plot_df.empty:
            print('No completed overfit rows with train/test accuracies available.')
        else:
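            # Build the set of observed modes once, then keep the configured order.
            available_modes = set(overfit_plot_df['overfit_mode'].astype(str))
            modes = [m for m in OVERFIT_MODES[:MAX_OVERFIT_CASES] if m in available_modes]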
            print(
                f'Building overfit-mode accuracy plots for {len(modes)} modes '
                f'with threshold RANDOM_SAMPLE_SIZE={RANDOM_SAMPLE_SIZE}.'
            )

            for mode in modes:
                mode_df = overfit_plot_df.loc[overfit_plot_df['overfit_mode'].astype(str) == mode].copy()
                n_models = len(mode_df)
                print(f'overfit_mode={mode} | completed models={n_models}')

                if n_models <= RANDOM_SAMPLE_SIZE:
                    print(
                        f'Generating per-model train/test bar charts for mode={mode} '
                        f'(n={n_models} <= {RANDOM_SAMPLE_SIZE}).'
                    )
                    mode_df = mode_df.sort_values(['source', 'dataset_uid']).reset_index(drop=True)
                    for idx, row in mode_df.iterrows():
                        fig, ax = plt.subplots(figsize=(6, 4))
                        bars = ax.bar(
                            ['Train accuracy', 'Test accuracy'],
                            [row['train_accuracy'], row['test_accuracy']],
                            color=['tab:blue', 'tab:orange'],
                        )
                        ax.set_ylim(0, 1)
                        ax.set_ylabel('Accuracy')
                        ax.set_title(
                            f"{mode} | dataset={row['dataset_uid']} | "
                            f"model {idx + 1}/{n_models}"
                        )
                        ax.grid(axis='y', alpha=0.2)
                        for bar in bars:
                            height = bar.get_height()
                            ax.text(
                                bar.get_x() + bar.get_width() / 2,
                                min(height + 0.02, 1.0),
                                f'{height:.3f}',
                                ha='center',
                                va='bottom',
                                fontsize=9,
                            )
                        fig.tight_layout()
                        plt.show()
                else:
                    print(
                        f'Generating generalization-gap histogram for mode={mode} '
                        f'(n={n_models} > {RANDOM_SAMPLE_SIZE}).'
                    )
                    fig, ax = plt.subplots(figsize=(7, 4))
                    ax.hist(mode_df['generalization_gap'].values, bins=25, alpha=0.85, edgecolor='black')
                    ax.set_title(
                        f'Generalization gap histogram | mode={mode} | '
                        f'n={n_models}'
                    )
                    ax.set_xlabel('Generalization gap (train_accuracy - test_accuracy)')
                    ax.set_ylabel('Count')
                    ax.grid(alpha=0.2)
                    fig.tight_layout()
                    plt.show()
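
As a numeric complement to the plots above, the generalization gap can also be summarized per overfit mode with a `groupby`. The sketch below runs on a small made-up table (the accuracy values are illustrative only, not results from this notebook); the column names follow `combined_df`, and `toy`, `done`, and `summary` are hypothetical variable names.

```python
import pandas as pd

# Toy rows shaped like the overfit part of combined_df (values are made up).
toy = pd.DataFrame({
    "overfit_mode": ["deep_trees", "deep_trees", "leakage", "leakage"],
    "status": ["completed", "completed", "completed", "failed"],
    "train_accuracy": [1.0, 0.9, 1.0, None],
    "test_accuracy": [0.8, 0.8, 0.5, None],
})

# Keep completed runs and compute the same gap used for the histograms.
done = toy[toy["status"] == "completed"].copy()
done["generalization_gap"] = done["train_accuracy"] - done["test_accuracy"]

# One row per mode: number of models, mean gap, and worst-case gap.
summary = done.groupby("overfit_mode")["generalization_gap"].agg(["count", "mean", "max"])
print(summary)
```

Applied to the real results, the same three lines after the toy table would take `overfit_plot_df` in place of `done`.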
