# Ataxx Zero Transformer - Kaggle Notebook (Pro)

Kaggle-first workflow for training and resuming **Ataxx Zero** with:
- `uv` environment management,
- optional Hugging Face checkpoint uploads,
- compact logging for notebook UI,
- restart-safe training flow.


## 0) Kaggle Setup

1. In the notebook sidebar: **Settings -> Accelerator -> GPU**.
2. Optionally add a Kaggle Secret named `HF_TOKEN` (only needed when `ENABLE_HF` is `True`).
3. Run all cells top-to-bottom.


In [None]:
#@title 1) Parameters
GITHUB_REPO = "https://github.com/YOUR_USERNAME/ataxx-zero.git"  #@param {type:"string"}
GIT_BRANCH = "main"  #@param {type:"string"}
WORKDIR = "/kaggle/working/ataxx-zero"  # Kaggle writable path

PROFILE = "balanced"  #@param ["debug", "balanced", "strong"]

ENABLE_HF = False  #@param {type:"boolean"}
HF_REPO_ID = "YOUR_USERNAME/ataxx-zero-transformer"  #@param {type:"string"}
HF_TOKEN_FALLBACK = ""  #@param {type:"string"}

MAX_ITERS_OVERRIDE = 0  #@param {type:"integer"}


In [None]:
#@title 2) Clone / Update Repository
from pathlib import Path

repo_path = Path(WORKDIR)

if not repo_path.exists():
    !git clone {GITHUB_REPO} {WORKDIR}
else:
    print(f"Repo already exists at {WORKDIR}")

%cd {WORKDIR}
!git fetch --all --prune
!git checkout {GIT_BRANCH}
!git pull --ff-only
!git status --short


In [None]:
#@title 3) Install uv + Sync Environment
!python -m pip -q install uv
!uv sync

!uv --version
!uv run python --version
!uv run python -c "import torch; print('torch', torch.__version__, 'cuda', torch.cuda.is_available())"


In [None]:
#@title 4) Lint / Type Health Check
!uv run ruff check src train_improved.py scripts/play_pygame.py
!uv run pyrefly check src


In [None]:
#@title 5) Resolve HF Token (Kaggle Secret or fallback)
import os

hf_token = ""

if ENABLE_HF:
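    # Try Kaggle Secrets first; fall back to HF_TOKEN_FALLBACK if the secret
    # is missing or empty, or if kaggle_secrets cannot be imported.
    try:
        from kaggle_secrets import UserSecretsClient
        hf_token = (UserSecretsClient().get_secret("HF_TOKEN") or "").strip()
    except Exception:
        hf_token = ""

    if hf_token:
        print("Loaded HF token from Kaggle Secrets.")
    else:
        hf_token = HF_TOKEN_FALLBACK.strip()
        if hf_token:
            print("Loaded HF token from HF_TOKEN_FALLBACK parameter.")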

    if not hf_token:
        raise RuntimeError(
            "ENABLE_HF=True but no HF token found. Add Kaggle Secret HF_TOKEN or set HF_TOKEN_FALLBACK."
        )

    os.environ["HF_TOKEN"] = hf_token


## 6) Configure Notebook Training Profile

This updates `train_improved.CONFIG` in memory for the current session only. The file on disk is not modified, so re-run the cell below after a kernel restart.
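
The merge performed in the next cell can be sketched in isolation. This is a minimal illustration, not code from `train_improved`: the helper name `build_session_config` and the sample `base` values are hypothetical. The divisibility check assumes the model uses standard multi-head attention, where `d_model` is split evenly across `nhead` heads.

```python
# Hypothetical helper for illustration; not part of train_improved.
def build_session_config(base, profile, max_iters_override=0):
    """Return a new dict: base defaults overlaid with the profile's values."""
    cfg = {**base, **profile}  # profile keys win; base is left untouched
    if max_iters_override > 0:
        cfg["iterations"] = int(max_iters_override)
    # Assumption: standard multi-head attention, which needs an even split.
    if cfg["d_model"] % cfg["nhead"] != 0:
        raise ValueError("d_model must be divisible by nhead")
    return cfg


# Sample values only; the real defaults live in train_improved.CONFIG.
base = {"iterations": 1, "d_model": 64, "nhead": 8, "save_every": 2}
debug_profile = {"iterations": 3, "d_model": 96, "nhead": 8}

session_cfg = build_session_config(base, debug_profile, max_iters_override=5)
print(session_cfg)
```

Building a fresh dict keeps the defaults intact, so switching `PROFILE` and re-running the cell does not carry over values from an earlier profile.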


In [None]:
#@title 6) Build CONFIG for Kaggle Session
import torch
import train_improved as t

profiles = {
    "debug": {
        "iterations": 3,
        "episodes_per_iter": 10,
        "mcts_sims": 48,
        "epochs": 2,
        "batch_size": 64,
        "d_model": 96,
        "nhead": 8,
        "num_layers": 3,
        "dim_feedforward": 256,
    },
    "balanced": {
        "iterations": 10,
        "episodes_per_iter": 32,
        "mcts_sims": 128,
        "epochs": 4,
        "batch_size": 96,
        "d_model": 128,
        "nhead": 8,
        "num_layers": 4,
        "dim_feedforward": 384,
    },
    "strong": {
        "iterations": 16,
        "episodes_per_iter": 56,
        "mcts_sims": 224,
        "epochs": 6,
        "batch_size": 128,
        "d_model": 192,
        "nhead": 8,
        "num_layers": 6,
        "dim_feedforward": 512,
    },
}

cfg = profiles[PROFILE].copy()
cfg.update({
    "verbose_logs": False,
    "episode_log_every": 60,
    "save_every": 2,
    "val_split": 0.1,
    "seed": 42,
    "checkpoint_dir": "checkpoints",
    "log_dir": "logs",
    "onnx_path": "ataxx_model.onnx",
})

if MAX_ITERS_OVERRIDE > 0:
    cfg["iterations"] = int(MAX_ITERS_OVERRIDE)

if ENABLE_HF:
    cfg["hf_enabled"] = True
    cfg["hf_repo_id"] = HF_REPO_ID
    cfg["hf_token_env"] = "HF_TOKEN"
    cfg["hf_local_dir"] = "hf_checkpoints"
    cfg["keep_last_n_hf_checkpoints"] = 3
else:
    cfg["hf_enabled"] = False

for k, v in cfg.items():
    t.CONFIG[k] = v

print("Active config:")
for key in [
    "iterations", "episodes_per_iter", "mcts_sims", "epochs", "batch_size",
    "d_model", "nhead", "num_layers", "dim_feedforward", "save_every",
    "hf_enabled", "hf_repo_id",
]:
    print(f"  {key}: {t.CONFIG.get(key)}")

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))


In [None]:
#@title 7) Start / Resume Training
# Auto-resume from latest HF checkpoint if hf_enabled=True and repo has checkpoints.
import train_improved as t

t.main()


In [None]:
#@title 8) Check Artifacts
from pathlib import Path

for p in [Path("checkpoints"), Path("logs"), Path("hf_checkpoints")]:
    if p.exists():
        print(f"\n{p}/")
        for f in sorted(p.glob("*"))[:40]:
            print("  -", f.name)


In [None]:
#@title 9) Optional TensorBoard
# Kaggle renders TensorBoard in the notebook output panel; "logs" matches the log_dir set in step 6.
%load_ext tensorboard
%tensorboard --logdir logs


## 10) Save Kaggle Working Directory (Optional)

If HF is disabled, you can still persist artifacts by manually creating a Kaggle dataset from `/kaggle/working/ataxx-zero/checkpoints` after the run.
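
Bundling the checkpoints into a single archive first can make the manual step easier. The sketch below uses only the standard library; the helper name `archive_checkpoints` and the output name `ataxx_checkpoints` are illustrative assumptions, not part of the repository.

```python
import shutil
from pathlib import Path


def archive_checkpoints(src_dir, out_base):
    """Zip the contents of src_dir into <out_base>.zip and return its path."""
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"No checkpoint directory at {src}")
    # make_archive appends the ".zip" suffix and returns the archive path.
    return Path(shutil.make_archive(str(out_base), "zip", root_dir=str(src)))


# Path taken from this notebook; the guard keeps the cell safe to run early.
CHECKPOINT_DIR = "/kaggle/working/ataxx-zero/checkpoints"
if Path(CHECKPOINT_DIR).is_dir():
    print(archive_checkpoints(CHECKPOINT_DIR, "/kaggle/working/ataxx_checkpoints"))
```

Keep the archive outside the directory being zipped, as above, so that it does not end up inside itself.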


## 11) Troubleshooting

- **OOM**: lower `batch_size`, `d_model`, `num_layers`, or `mcts_sims`.
- **Slow runs**: set `PROFILE = "debug"` first to validate the pipeline before a longer run.
- **HF errors**: verify `HF_REPO_ID` and the `HF_TOKEN` secret.
- **No resume**: ensure the HF repo contains `model_iter_XXX.pt` and `buffer_iter_XXX.npz`.
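
For the **No resume** case, a quick check over a list of file names can show which iteration is resumable. This is a sketch under two assumptions: that `XXX` is a decimal iteration number, and that a resume needs both files for the same iteration. The helper `latest_resumable_iter` is hypothetical and does not reflect how `train_improved` resolves checkpoints.

```python
import re
from pathlib import PurePosixPath

MODEL_RE = re.compile(r"^model_iter_(\d+)\.pt$")
BUFFER_RE = re.compile(r"^buffer_iter_(\d+)\.npz$")


def latest_resumable_iter(filenames):
    """Return the highest iteration with both a model and a buffer file, or None."""
    models, buffers = set(), set()
    for name in filenames:
        base = PurePosixPath(name).name  # ignore any folder prefix
        if (m := MODEL_RE.match(base)):
            models.add(int(m.group(1)))
        elif (b := BUFFER_RE.match(base)):
            buffers.add(int(b.group(1)))
    paired = models & buffers
    return max(paired) if paired else None


names = ["model_iter_002.pt", "buffer_iter_002.npz", "model_iter_004.pt"]
print(latest_resumable_iter(names))  # 2: iteration 4 has no buffer file
```

The file names can come from a local folder listing or from the HF repo, for example via `huggingface_hub.list_repo_files(HF_REPO_ID, token=...)`.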
