# Convert training log text → machine-readable CSV + per-seed leaderboards

This notebook parses a training log text file and produces:

1. A **tidy CSV** (`BaseTrain_parsed.csv`, written under `CSV_REL_DIR`) with one row per `(seed, run_idx, epoch)`.
2. An **Excel workbook** (`BaseTrain_parsed.xlsx`, same directory) with:
   - `parsed_epochs` (full per-epoch table)
   - one leaderboard sheet per seed: `leaderboard_<seed>`

The original log does **not** explicitly print the seed, but the runs were executed over 5 seeds in this known order:
`[38042, 217401, 637451, 207796, 45921]`.


In [1]:
from pathlib import Path

# =============================================================================
# CONFIGURATION
# =============================================================================

# Global variable: relative directory where outputs will be written.
CSV_REL_DIR = "../Structured Outputs/Base Gridsearch/"

# The log was run over 5 seeds in this order (not printed in the log itself).
SEEDS = [38042, 217401, 637451, 207796, 45921]

# Input log file path (relative to the notebook's working directory)
INPUT_TXT_PATH = Path("../Raw Outputs/Base Gridsearch/BaseTrain.txt")

# Output filenames (written inside CSV_REL_DIR)
OUTPUT_CSV_NAME = "BaseTrain_parsed.csv"
OUTPUT_XLSX_NAME = "BaseTrain_parsed.xlsx"

# Derived paths
OUTPUT_DIR = Path(CSV_REL_DIR)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_CSV_PATH = OUTPUT_DIR / OUTPUT_CSV_NAME
OUTPUT_XLSX_PATH = OUTPUT_DIR / OUTPUT_XLSX_NAME

print("Will read:", INPUT_TXT_PATH)
print("Will write CSV:", OUTPUT_CSV_PATH)
print("Will write XLSX:", OUTPUT_XLSX_PATH)
print("Seeds:", SEEDS)


Will read: ../Raw Outputs/Base Gridsearch/BaseTrain.txt
Will write CSV: ../Structured Outputs/Base Gridsearch/BaseTrain_parsed.csv
Will write XLSX: ../Structured Outputs/Base Gridsearch/BaseTrain_parsed.xlsx
Seeds: [38042, 217401, 637451, 207796, 45921]


## Parsing helpers

We recognize two line types:

1. **Run header**  
   `run i/N | lr=... wd=... bs=... drop_path_rate=...`

2. **Epoch metrics**  
   `epoch e/E | loss ... | train_acc ...% | test_acc ...% | dl_time ...s | train_time ...s | eval_time ...s`

Normalization:
- values ending with `%` become floats in `*_pct` columns  
- values ending with `s` become floats in `*_s` columns
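
To make the normalization rules concrete, here is a minimal, self-contained sketch applied to a hand-built epoch line in the format described above. The helper name `normalize_metrics` and the sample string are illustrative assumptions, not the notebook's actual parser (that is defined in the next cell), and unlike that parser the sketch raises on non-numeric values.

```python
def normalize_metrics(rest: str) -> dict:
    """Split 'key value | key value' pairs and apply the suffix rules."""
    out = {}
    for part in rest.split("|"):
        key, _, val = part.strip().partition(" ")
        val = val.strip()
        if not key or not val:
            continue  # skip empty or malformed segments
        if val.endswith("%"):
            out[f"{key}_pct"] = float(val[:-1])   # percentage -> *_pct
        elif val.endswith("s"):
            out[f"{key}_s"] = float(val[:-1])     # seconds -> *_s
        else:
            out[key] = float(val)                 # plain number, key unchanged
    return out


# Hypothetical sample: the part of an epoch line after 'epoch e/E |'.
sample = "loss 2.2998 | train_acc 11.04% | test_acc 12.99% | dl_time 0.81s"
print(normalize_metrics(sample))
```

The unit suffix moves from the value into the column name, so every metric column ends up purely numeric.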


In [2]:
import re

RUN_HEADER_RE = re.compile(
    r"^\s*run\s+(?P<run_idx>\d+)\s*/\s*(?P<run_total>\d+)\s*\|\s*(?P<hparams>.+?)\s*$",
    re.IGNORECASE,
)
EPOCH_RE = re.compile(
    r"^\s*epoch\s+(?P<epoch>\d+)\s*/\s*(?P<epoch_total>\d+)\s*\|\s*(?P<rest>.+?)\s*$",
    re.IGNORECASE,
)

def parse_hparams(hparams_str: str) -> dict:
    """
    Parse a string like:
        'lr=0.0001 wd=0.02 bs=256 drop_path_rate=0.1'
    into a dict with numeric types where possible.
    """
    out = {}
    for token in hparams_str.strip().split():
        if "=" not in token:
            continue
        k, v = token.split("=", 1)
        if re.fullmatch(r"-?\d+", v):
            out[k] = int(v)
        else:
            try:
                out[k] = float(v)
            except ValueError:
                out[k] = v
    return out

def parse_epoch_kv(rest: str) -> dict:
    """
    Parse the right side of an epoch line: 'key value | key value | ...'
    and normalize % -> *_pct, seconds -> *_s.
    """
    kv = {}
    parts = [p.strip() for p in rest.split("|")]
    for p in parts:
        if not p or " " not in p:
            continue
        key, val = p.split(" ", 1)
        val = val.strip()

        if val.endswith("%"):
            try:
                kv[f"{key}_pct"] = float(val[:-1])
            except ValueError:
                kv[f"{key}_pct"] = None
        elif val.endswith("s"):
            try:
                kv[f"{key}_s"] = float(val[:-1])
            except ValueError:
                kv[f"{key}_s"] = None
        else:
            try:
                kv[key] = float(val)
            except ValueError:
                kv[key] = val
    return kv


## Parse the file → DataFrame → CSV

### Seed assignment
The log does not print the seed, but it concatenates multiple sweeps; each sweep restarts at `run 1/...`.
We detect the start of a new seed block when the run index resets (e.g., `run 27/...` followed by `run 1/...`),
and then advance through the provided `SEEDS` list.
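
The reset rule can be sketched in isolation on a plain list of run indices. This is a simplified illustration under the same assumption (each sweep restarts at run 1); `assign_seeds` and the toy inputs are hypothetical names and values, not part of the notebook.

```python
def assign_seeds(run_indices, seeds):
    """Map each run-header index to a seed, advancing when the counter resets to 1."""
    assigned = []
    ptr = 0
    last = None
    for idx in run_indices:
        # A reset (e.g. 3 -> 1) marks the start of the next sweep. The pointer
        # never moves past the last seed, so extra sweeps stay on that seed.
        if last is not None and idx == 1 and last > idx and ptr < len(seeds) - 1:
            ptr += 1
        assigned.append(seeds[ptr])
        last = idx
    return assigned


# Toy example: two sweeps of three runs each (the real log has 27 runs per seed).
print(assign_seeds([1, 2, 3, 1, 2, 3], [38042, 217401]))
```

Because the seed is inferred purely from position, a truncated or reordered log would silently mislabel runs, so it is worth checking the per-seed run counts after parsing.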

### Hyperparameter fill
A few `epoch ...` lines can appear **before** the first `run 1/...` header in the file.
Those epoch rows still belong to `run_idx=1`, but they have no `lr/wd/bs/drop_path_rate` recorded yet.

This fill relies on one assumption: **the hyperparameter config for `run_idx=1` is the same across all 5 seeds.**
To ensure `seed=38042, run_idx=1` is fully populated, we fill missing hyperparameters as follows:

1. Fill within each `(seed, run_idx)` group using the first non-null value in that group.
2. If anything is still missing (rare), fill by `run_idx` across all seeds (safe only under the assumption above).
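
The two fill steps can be sketched on a toy frame as follows. This version uses `groupby(...).transform("first")`, which takes the first non-null value per group, as a compact stand-in for the lambda-based fill in the cell below. The frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame: seed 1 / run 1 never saw a header, so its lr is entirely missing;
# seed 1 / run 2 has one epoch row without lr. All values are invented.
toy = pd.DataFrame({
    "seed":    [1, 1, 1, 1, 2, 2],
    "run_idx": [1, 1, 2, 2, 1, 1],
    "lr":      [None, None, None, 0.001, 0.0001, 0.0001],
})

# Step 1: fill from the first non-null value within each (seed, run_idx) group.
toy["lr"] = toy["lr"].fillna(
    toy.groupby(["seed", "run_idx"])["lr"].transform("first")
)

# Step 2: fall back to the first non-null value for the same run_idx in any seed.
# This is only valid if a given run_idx uses the same config in every seed.
toy["lr"] = toy["lr"].fillna(toy.groupby("run_idx")["lr"].transform("first"))

print(toy)
```

Step 1 alone cannot repair seed 1 / run 1 here because the whole group is empty; only the cross-seed fallback in step 2 fills it.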


In [3]:
import pandas as pd

# ----------------------------
# Parse with robust seed mapping
# ----------------------------
seed_ptr = 0
current_seed = SEEDS[seed_ptr] if len(SEEDS) else None

seen_any_run_header = False
last_run_idx = None

current_run = {
    "seed": current_seed,
    "run_idx": 1,
    "run_total": None,
    "lr": None,
    "wd": None,
    "bs": None,
    "drop_path_rate": None,
}

records = []

with INPUT_TXT_PATH.open("r", encoding="utf-8", errors="replace") as f:
    for line in f:
        line = line.rstrip("\n")

        m = RUN_HEADER_RE.match(line)
        if m:
            run_idx = int(m.group("run_idx"))
            run_total = int(m.group("run_total"))

            # Detect seed boundary when run counter resets (e.g., 27 -> 1)
            if seen_any_run_header and run_idx == 1 and seed_ptr < len(SEEDS) - 1:
                if last_run_idx is not None and last_run_idx > run_idx:
                    seed_ptr += 1
                    current_seed = SEEDS[seed_ptr]

            seen_any_run_header = True
            last_run_idx = run_idx

            current_run = {
                "seed": current_seed,
                "run_idx": run_idx,
                "run_total": run_total,
            }
            current_run.update(parse_hparams(m.group("hparams")))
            continue

        m = EPOCH_RE.match(line)
        if m:
            rec = dict(current_run)
            rec["seed"] = current_seed
            rec["epoch"] = int(m.group("epoch"))
            rec["epoch_total"] = int(m.group("epoch_total"))
            rec.update(parse_epoch_kv(m.group("rest")))
            records.append(rec)

df = pd.DataFrame.from_records(records)

# -----------------------------------------
# Fill missing hyperparameters
# -----------------------------------------
HYPERPARAM_COLS = ["lr", "wd", "bs", "drop_path_rate"]

# (1) Fill within each (seed, run_idx)
for col in HYPERPARAM_COLS:
    if col in df.columns:
        df[col] = df.groupby(["seed", "run_idx"])[col].transform(
            lambda s: s.fillna(s.dropna().iloc[0] if s.notna().any() else pd.NA)
        )

# (2) Fill by run_idx across seeds (safe under "run 1 config same across seeds")
for col in HYPERPARAM_COLS:
    if col in df.columns:
        df[col] = df.groupby(["run_idx"])[col].transform(
            lambda s: s.fillna(s.dropna().iloc[0] if s.notna().any() else pd.NA)
        )

# ----------------------------
# Order columns + write CSV
# ----------------------------
preferred_cols = [
    "seed",
    "run_idx", "run_total", "lr", "wd", "bs", "drop_path_rate",
    "epoch", "epoch_total",
    "loss", "train_acc_pct", "test_acc_pct",
    "dl_time_s", "train_time_s", "eval_time_s",
]
cols = preferred_cols + [c for c in df.columns if c not in preferred_cols]
df = df[cols].sort_values(["seed", "run_idx", "epoch"]).reset_index(drop=True)

df.to_csv(OUTPUT_CSV_PATH, index=False)

print("Rows:", len(df))
print("Seeds (observed):", [s for s in SEEDS if s in set(df['seed'].dropna().unique().tolist())])
print("Runs per seed:", df.groupby("seed")["run_idx"].nunique().to_dict() if len(df) else {})
print("Wrote:", OUTPUT_CSV_PATH.resolve())
df.head()


Rows: 9450
Seeds (observed): [38042, 217401, 637451, 207796, 45921]
Runs per seed: {38042: 27, 45921: 27, 207796: 27, 217401: 27, 637451: 27}
Wrote: /Users/etaashpatel/Documents/Final Project/Structured Outputs/Base Gridsearch/BaseTrain_parsed.csv


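| | seed | run_idx | run_total | lr | wd | bs | drop_path_rate | epoch | epoch_total | loss | train_acc_pct | test_acc_pct | dl_time_s | train_time_s | eval_time_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38042 | 1 | | 0.0001 | 0.02 | 256.0 | 0.0 | 1 | 70 | 2.2998 | 11.04 | 12.99 | 0.81 | 2.98 | 0.47 |
| 1 | 38042 | 1 | | 0.0001 | 0.02 | 256.0 | 0.0 | 2 | 70 | 1.9448 | 27.78 | 29.84 | 0.81 | 2.85 | 0.44 |
| 2 | 38042 | 1 | | 0.0001 | 0.02 | 256.0 | 0.0 | 3 | 70 | 1.7826 | 34.16 | 37.15 | 0.81 | 2.91 | 0.44 |
| 3 | 38042 | 1 | | 0.0001 | 0.02 | 256.0 | 0.0 | 4 | 70 | 1.6707 | 38.52 | 40.17 | 0.81 | 2.9 | 0.44 |
| 4 | 38042 | 1 | | 0.0001 | 0.02 | 256.0 | 0.0 | 5 | 70 | 1.5374 | 44.25 | 46.73 | 0.81 | 2.89 | 0.45 |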


## Export Excel workbook with per-seed leaderboards

Each seed gets its own leaderboard sheet with one row per run, ranked by that run's **best** `test_acc_pct` across epochs (ties are broken by the lower loss at the best epoch).
We preserve the provided `SEEDS` order when writing sheets.
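
As a rough sketch of the ranking logic on toy data, the best epoch per run can also be located with `groupby(...).idxmax()`. This is an alternative to the sort-then-`head(1)` approach used in the cell below, not the notebook's own code, and the accuracies are invented.

```python
import pandas as pd

# Toy per-epoch table for a single seed; accuracies are invented.
toy = pd.DataFrame({
    "run_idx":      [1, 1, 1, 2, 2, 2],
    "epoch":        [1, 2, 3, 1, 2, 3],
    "test_acc_pct": [50.0, 62.0, 61.0, 55.0, 70.0, 70.0],
})

# idxmax returns the label of the first maximum per run, so with rows in epoch
# order a tie resolves to the earliest epoch.
best = toy.loc[toy.groupby("run_idx")["test_acc_pct"].idxmax()]

# Last epoch of each run.
final = toy.sort_values("epoch").groupby("run_idx").tail(1)

board = (
    best.rename(columns={"test_acc_pct": "best_test_acc_pct", "epoch": "best_epoch"})
        .merge(
            final[["run_idx", "test_acc_pct"]]
                .rename(columns={"test_acc_pct": "final_test_acc_pct"}),
            on="run_idx",
        )
        .sort_values("best_test_acc_pct", ascending=False)
        .reset_index(drop=True)
)
board.insert(0, "rank", board.index + 1)
print(board)
```

Keeping both the best and the final accuracy per run makes it easy to spot configurations that peak early and then degrade.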


In [4]:
import pandas as pd

# Ensure numeric for ranking/aggregation
for c in ["test_acc_pct", "train_acc_pct", "loss"]:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

def build_leaderboard(df_seed: pd.DataFrame) -> pd.DataFrame:
    """
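    One row per run_idx for a single seed, with:
      - best_test_acc_pct, the epoch it occurs, and the loss at that epoch (best_loss)
      - final_test_acc_pct and final_loss (last epoch)
    If several epochs tie for the best test accuracy, the earliest one is used.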
    """
    best_rows = (
        df_seed.dropna(subset=["test_acc_pct"])
               .sort_values(["run_idx", "test_acc_pct", "epoch"], ascending=[True, False, True])
               .groupby("run_idx", as_index=False)
               .head(1)
               .set_index("run_idx")
    )

    final_rows = (
        df_seed.sort_values(["run_idx", "epoch"])
               .groupby("run_idx", as_index=False)
               .tail(1)
               .set_index("run_idx")
    )

    leaderboard = pd.DataFrame({
        "run_idx": best_rows.index,
        "seed": best_rows["seed"],
        "lr": best_rows.get("lr"),
        "wd": best_rows.get("wd"),
        "bs": best_rows.get("bs"),
        "drop_path_rate": best_rows.get("drop_path_rate"),
        "best_test_acc_pct": best_rows["test_acc_pct"],
        "best_epoch": best_rows["epoch"],
        "best_loss": best_rows.get("loss"),
        "final_test_acc_pct": final_rows.get("test_acc_pct"),
        "final_loss": final_rows.get("loss"),
    }).reset_index(drop=True)

    sort_cols = ["best_test_acc_pct"]
    asc = [False]
    if "best_loss" in leaderboard.columns:
        sort_cols.append("best_loss")
        asc.append(True)

    leaderboard = leaderboard.sort_values(sort_cols, ascending=asc).reset_index(drop=True)
    leaderboard.insert(0, "rank", leaderboard.index + 1)
    return leaderboard

observed_seeds = set(df["seed"].dropna().unique().tolist())
seed_values = [s for s in SEEDS if s in observed_seeds]

with pd.ExcelWriter(OUTPUT_XLSX_PATH, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="parsed_epochs", index=False)

    for seed in seed_values:
        lb = build_leaderboard(df[df["seed"] == seed].copy())
        lb.to_excel(writer, sheet_name=f"leaderboard_{seed}"[:31], index=False)

print("Wrote:", OUTPUT_XLSX_PATH.resolve())
print("Sheets:", ["parsed_epochs"] + [f"leaderboard_{s}"[:31] for s in seed_values])

# Preview top 10 of first seed
if seed_values:
    display(build_leaderboard(df[df["seed"] == seed_values[0]].copy()).head(10))


Wrote: /Users/etaashpatel/Documents/Final Project/Structured Outputs/Base Gridsearch/BaseTrain_parsed.xlsx
Sheets: ['parsed_epochs', 'leaderboard_38042', 'leaderboard_217401', 'leaderboard_637451', 'leaderboard_207796', 'leaderboard_45921']


| | rank | run_idx | seed | lr | wd | bs | drop_path_rate | best_test_acc_pct | best_epoch | best_loss | final_test_acc_pct | final_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 24 | 38042 | 0.001 | 0.07 | 256.0 | 0.2 | 84.68 | 63 | 0.1104 | 84.6 | 0.0939 |
| 1 | 2 | 27 | 38042 | 0.001 | 0.15 | 256.0 | 0.2 | 84.66 | 69 | 0.1608 | 84.58 | 0.1608 |
| 2 | 3 | 18 | 38042 | 0.00033 | 0.15 | 256.0 | 0.2 | 84.44 | 68 | 0.2328 | 84.26 | 0.228 |
| 3 | 4 | 11 | 38042 | 0.00033 | 0.02 | 256.0 | 0.1 | 84.07 | 69 | 0.1071 | 83.83 | 0.1048 |
| 4 | 5 | 21 | 38042 | 0.001 | 0.02 | 256.0 | 0.2 | 83.99 | 63 | 0.0896 | 83.98 | 0.0796 |
| 5 | 6 | 12 | 38042 | 0.00033 | 0.02 | 256.0 | 0.2 | 83.98 | 65 | 0.1694 | 83.53 | 0.1577 |
| 6 | 7 | 23 | 38042 | 0.001 | 0.07 | 256.0 | 0.1 | 83.96 | 64 | 0.0726 | 83.84 | 0.0633 |
| 7 | 8 | 26 | 38042 | 0.001 | 0.15 | 256.0 | 0.1 | 83.92 | 66 | 0.1388 | 83.91 | 0.1294 |
| 8 | 9 | 20 | 38042 | 0.001 | 0.02 | 256.0 | 0.1 | 83.87 | 65 | 0.0574 | 83.74 | 0.049 |
| 9 | 10 | 15 | 38042 | 0.00033 | 0.07 | 256.0 | 0.2 | 83.75 | 63 | 0.2131 | 83.4 | 0.1885 |


## Quick check: show run 1 hyperparameters for the first seed

After the fill step, the hyperparameters for this run should no longer be missing.


In [5]:
df[(df['seed']==SEEDS[0]) & (df['run_idx']==1)][['seed','run_idx','lr','wd','bs','drop_path_rate']].drop_duplicates().head(10)


| | seed | run_idx | lr | wd | bs | drop_path_rate |
|---|---|---|---|---|---|---|
| 0 | 38042 | 1 | 0.0001 | 0.02 | 256.0 | 0.0 |
