# Sokoban PoT (Pointer-Over-Heads Transformer) PPO Benchmark

This notebook benchmarks PoT models trained with pure PPO on Sokoban puzzle solving.

**Features:**
- Pure PPO training (no domain heuristics)
- SimplePoT and HybridPoT architectures
- **Curriculum learning** (start with easier puzzles)
- Iterative refinement for action prediction


## Setup


In [None]:
# Clone and install
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT
!pip install -q torch numpy tqdm datasets wandb


In [None]:
# W&B Login
import wandb
wandb.login()


In [None]:
# Check GPU
import torch
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")


## HybridPoT PPO with Curriculum Learning (Recommended)

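Curriculum learning starts with the easiest puzzles and progressively adds harder ones. With 4 stages, each stage covers a quarter of training and draws from a growing share of the puzzle set:
- Stage 1 (0-25% of training): easiest 25% of puzzles
- Stage 2 (25-50% of training): easiest 50% of puzzles
- Stage 3 (50-75% of training): easiest 75% of puzzles
- Stage 4 (75-100% of training): all puzzles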
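
The stage schedule above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark script's actual implementation: `curriculum_fraction` and `curriculum_subset` are hypothetical helpers, the puzzle list is assumed to be pre-sorted from easiest to hardest (how difficulty is measured is not specified here), and a step that falls exactly on a stage boundary is assumed to belong to the later stage.

```python
def curriculum_fraction(step: int, total_steps: int, n_stages: int = 4) -> float:
    """Return the share of easiest puzzles available at `step` (hypothetical helper)."""
    # Training progress clamped to [0, 1]; total_steps is assumed to be positive.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Stage index is 1-based; the final stage also covers progress == 1.0.
    stage = min(int(progress * n_stages) + 1, n_stages)
    return stage / n_stages


def curriculum_subset(puzzles_sorted, step, total_steps, n_stages=4):
    """Slice the easiest puzzles for the current stage (assumes easiest-first order)."""
    frac = curriculum_fraction(step, total_steps, n_stages)
    # Keep at least one puzzle so early stages are never empty.
    n_available = max(1, int(len(puzzles_sorted) * frac))
    return puzzles_sorted[:n_available]


# Illustrative check with the 500,000-timestep budget used in this notebook.
for step in (0, 125_000, 250_000, 375_000, 499_999):
    print(step, curriculum_fraction(step, 500_000))
```

With `n_stages=4` this matches the four stages listed above; other values of `n_stages` split training into equal-length stages in the same way.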


In [None]:
# HybridPoT PPO with Curriculum Learning
!python experiments/sokoban_pot_benchmark.py \
    --download \
    --mode ppo \
    --model-type hybrid \
    --controller-type transformer \
    --d-ctrl 128 \
    --max-depth 128 \
    --d-model 256 \
    --n-heads 8 \
    --H-cycles 2 \
    --L-cycles 6 \
    --H-layers 2 \
    --L-layers 2 \
    --halt-max-steps 2 \
    --hrm-grad-style \
    --injection-mode broadcast \
    --ppo-timesteps 500000 \
    --batch-size 64 \
    --curriculum \
    --curriculum-stages 4 \
    --warmup-steps 1000 \
    --eval-interval 5 \
    --wandb \
    --project sokoban-ppo \
    --run-name hybrid-pot-curriculum \
    --output-dir experiments/results/sokoban_hybrid_curriculum


## SimplePoT PPO with Curriculum Learning


In [None]:
# SimplePoT PPO with Curriculum Learning
!python experiments/sokoban_pot_benchmark.py \
    --download \
    --mode ppo \
    --model-type pot \
    --R 4 \
    --d-model 256 \
    --n-heads 8 \
    --ppo-timesteps 500000 \
    --batch-size 64 \
    --curriculum \
    --curriculum-stages 4 \
    --eval-interval 5 \
    --wandb \
    --project sokoban-ppo \
    --run-name simple-pot-curriculum \
    --output-dir experiments/results/sokoban_simple_curriculum


## HybridPoT PPO WITHOUT Curriculum (Baseline)

For comparison, this run trains on all puzzles from the start, with the same model and PPO settings as the HybridPoT curriculum run.


In [None]:
# HybridPoT PPO without Curriculum (baseline)
!python experiments/sokoban_pot_benchmark.py \
    --download \
    --mode ppo \
    --model-type hybrid \
    --controller-type transformer \
    --d-ctrl 128 \
    --max-depth 128 \
    --d-model 256 \
    --n-heads 8 \
    --H-cycles 2 \
    --L-cycles 6 \
    --H-layers 2 \
    --L-layers 2 \
    --halt-max-steps 2 \
    --hrm-grad-style \
    --injection-mode broadcast \
    --ppo-timesteps 500000 \
    --batch-size 64 \
    --warmup-steps 1000 \
    --eval-interval 5 \
    --wandb \
    --project sokoban-ppo \
    --run-name hybrid-pot-no-curriculum \
    --output-dir experiments/results/sokoban_hybrid_no_curriculum


## Display Results


In [None]:
# Print evaluation metrics for each run that has written a results file
import json
import os

result_dirs = [
    'experiments/results/sokoban_hybrid_curriculum',
    'experiments/results/sokoban_simple_curriculum',
    'experiments/results/sokoban_hybrid_no_curriculum',
]

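for result_dir in result_dirs:
    results_path = os.path.join(result_dir, 'results.json')
    if not os.path.exists(results_path):
        # Skip runs with no results file yet
        continue
    # Use a context manager so the file handle is closed
    with open(results_path) as fh:
        results = json.load(fh)
    print(f"\n{'='*50}\n{result_dir}\n{'='*50}")
    if 'evaluation' in results:
        metrics = results['evaluation']
        # Metrics missing from the file are reported as 0
        print(f"Solve Rate @50: {metrics.get('solve_rate@50', 0):.2%}")
        print(f"Solve Rate @100: {metrics.get('solve_rate@100', 0):.2%}")
        print(f"Solve Rate @200: {metrics.get('solve_rate@200', 0):.2%}")
        print(f"Deadlock Rate: {metrics.get('deadlock_rate', 0):.2%}")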
