# Swin Depth Controller - Sudoku Training (Colab A100)

Train the new Swin-style hierarchical depth controller on Sudoku-Extreme.

**Features:**
- Local window attention (Swin Transformer style)
- Hierarchical spatial-temporal depth tracking (preserves spatial info!)
- Depth skip connections for better gradient flow
- W&B logging with **automatic best model artifact upload**
- Auto-download dataset from HuggingFace

**Runtime:** A100 GPU recommended for batch_size=768


In [None]:
# Install dependencies
!pip install wandb huggingface_hub -q


In [None]:
# Clone the repo and checkout experiments branch
!rm -rf /content/PoT
!git clone https://github.com/Eran-BA/PoT.git /content/PoT
%cd /content/PoT
!git checkout feature/experiments


In [None]:
# Verify GPU
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


In [None]:
# W&B Login
import wandb
wandb.login()


In [None]:
# Download Sudoku-Extreme from HuggingFace
# Option A: Manual download (run this cell)
import sys
sys.path.insert(0, '/content/PoT')

from src.data import download_sudoku_dataset

download_sudoku_dataset(
    output_dir='/content/PoT/data/sudoku-extreme-10k-aug-100',
    subsample_size=10000,
)
print("✓ Dataset ready!")

# Option B: Use --download flag in training command (skip this cell)


## Train with Swin Controller

This uses the optimized configuration with:
- **Swin controller** with window_size=3 (matches the 3x3 boxes of the 9x9 Sudoku grid)
- **Hierarchical depth tracking** (preserves spatial info across refinement iterations!)
- **Depth skip connections** enabled for better gradient flow
- **W&B artifact logging**: Best model automatically uploaded to W&B Artifacts
- **`--hrm-grad-style`** (optional): only the last L+H calls receive gradients (HRM-style, saves memory)
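
To see why `window_size=3` suits a 9x9 board, the sketch below splits the 81 cell indices into non-overlapping 3x3 windows; each window then covers exactly one Sudoku box. It is an illustration only: `window_partition` is a hypothetical helper, and the assumption that the controller uses this row-major, non-overlapping layout is not confirmed by this notebook.

```python
# Illustrative sketch, not code from the PoT repo.
# Assumption: cells are flattened row-major (index = row * 9 + col)
# and windows are non-overlapping and aligned to the grid origin.

def window_partition(grid_size, window_size):
    """Group flat cell indices by the square window that contains them."""
    if grid_size % window_size != 0:
        raise ValueError("grid_size must be divisible by window_size")
    per_side = grid_size // window_size
    windows = [[] for _ in range(per_side * per_side)]
    for row in range(grid_size):
        for col in range(grid_size):
            # Window id, counted left to right, then top to bottom
            w = (row // window_size) * per_side + (col // window_size)
            windows[w].append(row * grid_size + col)
    return windows


windows = window_partition(grid_size=9, window_size=3)
print(len(windows))  # 9 windows, one per Sudoku box
print(windows[0])    # [0, 1, 2, 9, 10, 11, 18, 19, 20]
```

Under these assumptions, attention restricted to one window only mixes the nine cells of a single box; row and column constraints have to be combined by some other part of the model.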


In [None]:
# Full training run (1000 epochs)
# --hrm-grad-style is set below: only the last L+H calls get gradients (HRM-style, saves memory).
# Drop the flag to turn this off.
!python scripts/train_sudoku_swin.py \
    --data-dir data/sudoku-extreme-10k-aug-100 \
    --epochs 1000 \
    --batch-size 768 \
    --lr 3.7e-4 \
    --weight-decay 0.108 \
    --beta2 0.968 \
    --dropout 0.039 \
    --warmup-steps 2000 \
    --d-model 512 \
    --d-ff 2048 \
    --n-heads 8 \
    --h-layers 2 \
    --l-layers 2 \
    --h-cycles 2 \
    --l-cycles 6 \
    --halt-max-steps 2 \
    --d-ctrl 256 \
    --window-size 3 \
    --n-stages 2 \
    --max-depth 32 \
    --T 4 \
    --num-workers 2 \
    --eval-interval 10 \
    --save-every 50 \
    --hrm-grad-style \
    --wandb \
    --project sudoku-swin \
    --device cuda


## Quick Test (Optional)

Run the quick-test cell (10 epochs, located after the resume cells below) first to verify that everything works before the full training run.


## Resume Training

Continue training from a checkpoint (e.g., after Colab disconnect).

**Two options:**
1. Local file: `--resume checkpoints/swin/best_model.pt`
2. W&B artifact: `--resume wandb:YOUR_ENTITY/sudoku-swin/sudoku-swin-best:best`
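
The resume cells below repeat the full hyperparameter list from the training cell and differ only in `--epochs` and `--resume`. One way to keep them in sync is to hold the flags in a dict and render the command from it. This is a minimal sketch: `HPARAMS`, `SWITCHES`, and `build_command` are hypothetical names, the flag values are copied from this notebook, and it assumes the script accepts flags in any order (as `argparse` does).

```python
import shlex

# Flag values copied from the training cell in this notebook.
HPARAMS = {
    "data-dir": "data/sudoku-extreme-10k-aug-100",
    "batch-size": 768, "lr": "3.7e-4", "weight-decay": 0.108,
    "beta2": 0.968, "dropout": 0.039, "warmup-steps": 2000,
    "d-model": 512, "d-ff": 2048, "n-heads": 8,
    "h-layers": 2, "l-layers": 2, "h-cycles": 2, "l-cycles": 6,
    "halt-max-steps": 2, "d-ctrl": 256, "window-size": 3,
    "n-stages": 2, "max-depth": 32, "T": 4, "num-workers": 2,
    "eval-interval": 10, "save-every": 50,
    "project": "sudoku-swin", "device": "cuda",
}
SWITCHES = ["hrm-grad-style", "wandb"]  # flags that take no value


def build_command(epochs, resume=None):
    """Render the training command; add --resume only when given."""
    args = ["python", "scripts/train_sudoku_swin.py", "--epochs", str(epochs)]
    for name, value in HPARAMS.items():
        args += [f"--{name}", str(value)]
    args += [f"--{switch}" for switch in SWITCHES]
    if resume is not None:
        args += ["--resume", resume]
    return shlex.join(args)


# Option 1: local checkpoint file
local_cmd = build_command(2000, resume="checkpoints/swin/best_model.pt")
# Option 2: W&B artifact (replace YOUR_ENTITY)
wandb_cmd = build_command(
    2000, resume="wandb:YOUR_ENTITY/sudoku-swin/sudoku-swin-best:best"
)
print(local_cmd)
```

In a Colab cell the rendered string can be executed with `!{local_cmd}` or `!{wandb_cmd}`.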


In [None]:
# Resume training from checkpoint (change epochs to your target)
!python scripts/train_sudoku_swin.py \
    --data-dir data/sudoku-extreme-10k-aug-100 \
    --epochs 2000 \
    --batch-size 768 \
    --lr 3.7e-4 \
    --weight-decay 0.108 \
    --beta2 0.968 \
    --dropout 0.039 \
    --warmup-steps 2000 \
    --d-model 512 \
    --d-ff 2048 \
    --n-heads 8 \
    --h-layers 2 \
    --l-layers 2 \
    --h-cycles 2 \
    --l-cycles 6 \
    --halt-max-steps 2 \
    --d-ctrl 256 \
    --window-size 3 \
    --n-stages 2 \
    --max-depth 32 \
    --T 4 \
    --num-workers 2 \
    --eval-interval 10 \
    --save-every 50 \
    --hrm-grad-style \
    --resume checkpoints/swin/best_model.pt \
    --wandb \
    --project sudoku-swin \
    --device cuda


In [None]:
# Resume from W&B artifact (survives Colab disconnects!)
# Replace YOUR_ENTITY with your W&B username or team name
!python scripts/train_sudoku_swin.py \
    --data-dir data/sudoku-extreme-10k-aug-100 \
    --epochs 2000 \
    --batch-size 768 \
    --lr 3.7e-4 \
    --weight-decay 0.108 \
    --beta2 0.968 \
    --dropout 0.039 \
    --warmup-steps 2000 \
    --d-model 512 \
    --d-ff 2048 \
    --n-heads 8 \
    --h-layers 2 \
    --l-layers 2 \
    --h-cycles 2 \
    --l-cycles 6 \
    --halt-max-steps 2 \
    --d-ctrl 256 \
    --window-size 3 \
    --n-stages 2 \
    --max-depth 32 \
    --T 4 \
    --num-workers 2 \
    --eval-interval 10 \
    --save-every 50 \
    --hrm-grad-style \
    --resume wandb:YOUR_ENTITY/sudoku-swin/sudoku-swin-best:best \
    --wandb \
    --project sudoku-swin \
    --device cuda


In [None]:
# Quick test (10 epochs)
!python scripts/train_sudoku_swin.py \
    --data-dir data/sudoku-extreme-10k-aug-100 \
    --epochs 10 \
    --batch-size 768 \
    --eval-interval 5 \
    --window-size 3 \
    --n-stages 2 \
    --device cuda


## Download Best Model

After training completes, download the best checkpoint.

**Note:** If you used `--wandb`, the best model is also saved as a W&B Artifact (`sudoku-swin-best`) which you can download anytime from your W&B dashboard.


In [None]:
# Download best model to your computer
from google.colab import files
files.download('checkpoints/swin/best_model.pt')


In [None]:
# Alternative: Download from W&B Artifacts (works anywhere, not just Colab)
# Run this after training or later from any machine with wandb installed

import wandb

# Download the best model artifact (opens a short W&B run to track the download)
run = wandb.init(project="sudoku-swin")  # or specify your project name
artifact = run.use_artifact("sudoku-swin-best:best")
artifact_dir = artifact.download()
print(f"Model downloaded to: {artifact_dir}")
run.finish()  # close the run opened above
