# Sokoban PoT (Pointer-Over-Heads Transformer) PPO Benchmark

This notebook runs the Sokoban benchmark with PPO training.

**Features:**
- Pure PPO training (no domain heuristics)
- SimplePoT and HybridPoT architectures
- Iterative refinement for action prediction
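
For readers new to PPO, the clipped surrogate objective behind "pure PPO training" can be sketched in a few lines. This is the generic textbook form, not code taken from the repository; the function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions made for illustration.

```python
import numpy as np

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Generic PPO clipped policy loss (illustrative; not the repo's code)."""
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Probability ratio between the updated policy and the rollout policy.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, then negate to get a loss.
    return -np.mean(np.minimum(unclipped, clipped))
```

The clipping keeps each policy update close to the policy that collected the rollouts, which is the property that makes PPO usable without a task-specific heuristic signal.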


## Setup


In [None]:
# Clone the repository
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT

# Install dependencies
!pip install -q torch numpy tqdm datasets wandb


In [None]:
# W&B Login (optional but recommended)
import wandb
wandb.login()


In [None]:
# Check GPU
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")


## Download Dataset


In [None]:
# Download Boxoban dataset
!python -c "import sys; sys.path.insert(0, '.'); from src.data.sokoban import download_boxoban_dataset; download_boxoban_dataset('data/sokoban'); print('Download complete!')"


## SimplePoT PPO Training

Basic PoT model that applies `R` refinement steps per action prediction (`--R 4` in the command below).
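
As a rough mental model of what "R refinement steps" means, the toy sketch below applies one shared residual update `R` times before a prediction would be read off. It is an assumption-laden stand-in: the real SimplePoT block is a Transformer with pointer-over-heads routing, and the names `refine` and `weight` are hypothetical.

```python
import numpy as np

def refine(x, weight, R=4):
    """Toy iterative refinement: reuse the same update R times.

    Hypothetical stand-in for a weight-shared refinement block; it only
    illustrates the control flow, not the repository's architecture.
    """
    h = np.zeros_like(x, dtype=float)
    for _ in range(R):
        # Each pass sees the current state and the (re-injected) input.
        h = h + np.tanh(h @ weight + x)
    return h
```

Because the same weights are reused on every pass, raising `R` adds compute per prediction without adding parameters.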


In [None]:
# SimplePoT PPO
!python experiments/sokoban_pot_benchmark.py \
    --download \
    --mode ppo \
    --model-type pot \
    --R 4 \
    --d-model 256 \
    --n-heads 8 \
    --n-layers 6 \
    --ppo-timesteps 100000 \
    --batch-size 64 \
    --learning-rate 1e-4 \
    --eval-interval 5 \
    --eval-episodes 50 \
    --wandb \
    --project sokoban-ppo \
    --run-name simple-pot-ppo \
    --output-dir experiments/results/sokoban_simple_ppo


## HybridPoT PPO Training (Aligned with Sudoku)

Full HybridPoT model with two-timescale reasoning (`H_cycles` × `L_cycles`), ACT halting (`--halt-max-steps`), and input injection (`--injection-mode`).
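
The two-timescale structure can be pictured as a nested loop: a fast state updates `L_cycles` times for every update of a slow state, repeated `H_cycles` times. The sketch below is a scalar toy under that assumption; the function name `two_timescale` and the averaging updates are hypothetical and do not reflect the repository's actual layers.

```python
def two_timescale(x, H_cycles=2, L_cycles=6):
    """Toy nested-loop schedule for two-timescale reasoning (illustrative only)."""
    z_H, z_L = 0.0, 0.0
    inner_steps = 0
    for _ in range(H_cycles):
        for _ in range(L_cycles):
            # Fast state: conditioned on the slow state and the input.
            z_L = 0.5 * (z_L + z_H + x)
            inner_steps += 1
        # Slow state: updated once per block of fast steps.
        z_H = 0.5 * (z_H + z_L)
    return z_H, inner_steps
```

With the values used in the command below (`--H-cycles 2 --L-cycles 6`), this schedule performs 2 × 6 = 12 fast updates per forward pass.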


In [None]:
# HybridPoT PPO (aligned with Sudoku architecture)
!python experiments/sokoban_pot_benchmark.py \
    --download \
    --mode ppo \
    --model-type hybrid \
    --controller-type transformer \
    --d-ctrl 128 \
    --max-depth 128 \
    --d-model 256 \
    --n-heads 8 \
    --H-cycles 2 \
    --L-cycles 6 \
    --H-layers 2 \
    --L-layers 2 \
    --halt-max-steps 2 \
    --hrm-grad-style \
    --injection-mode broadcast \
    --ppo-timesteps 200000 \
    --batch-size 64 \
    --learning-rate 1e-4 \
    --warmup-steps 1000 \
    --eval-interval 5 \
    --eval-episodes 50 \
    --wandb \
    --project sokoban-ppo \
    --run-name hybrid-pot-ppo \
    --output-dir experiments/results/sokoban_hybrid_ppo


## Display Results


In [None]:
import json
import os

results_dirs = [
    'experiments/results/sokoban_simple_ppo',
    'experiments/results/sokoban_hybrid_ppo',
]

for result_dir in results_dirs:
    result_file = os.path.join(result_dir, 'results.json')
    if os.path.exists(result_file):
        with open(result_file, 'r') as f:
            results = json.load(f)
        print(f"\n{'='*60}")
        print(f"Results: {result_dir}")
        print(f"{'='*60}")
        if 'evaluation' in results:
            eval_results = results['evaluation']
            print(f"Solve Rate @50:  {eval_results.get('solve_rate@50', 0):.2%}")
            print(f"Solve Rate @100: {eval_results.get('solve_rate@100', 0):.2%}")
            print(f"Solve Rate @200: {eval_results.get('solve_rate@200', 0):.2%}")
            print(f"Deadlock Rate:   {eval_results.get('deadlock_rate@200', 0):.2%}")
        if 'ppo' in results:
            print(f"Best PPO Reward: {results['ppo'].get('best_reward', 0):.3f}")
    else:
        print(f"\nNo results found at {result_file}")
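
The keys `solve_rate@50`, `solve_rate@100`, and `solve_rate@200` presumably report the fraction of evaluation episodes solved within that many steps. Under that assumption (the benchmark script's exact definition may differ), the metric could be computed as follows; `solve_rate_at` and its input format are hypothetical.

```python
def solve_rate_at(steps_to_solve, k):
    """Fraction of episodes solved within k steps.

    steps_to_solve: one entry per episode, holding the step count at which
    the puzzle was solved, or None if it was never solved (assumed format).
    """
    if not steps_to_solve:
        return 0.0
    solved = sum(1 for s in steps_to_solve if s is not None and s <= k)
    return solved / len(steps_to_solve)
```

For example, with `episodes = [12, 80, None, 150, 40]`, calling `solve_rate_at(episodes, 100)` counts the three episodes solved in at most 100 steps out of five.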


# Sokoban PoT Benchmark - Pointer-Over-Heads Transformer

This notebook runs the Sokoban benchmark with PoT iterative refinement.

**Models:**
- `pot`: Simple PoT with R refinement iterations
- `hybrid`: HybridPoT with two-timescale reasoning (H_cycles × L_cycles)
- `baseline`: CNN baseline

**Training modes:**
- `heuristic`: Pretrain with heuristic pseudo-labels
- `ppo`: Pure PPO training
- `combined`: Pretrain + PPO fine-tuning

**Augmentations:** Geometric symmetries (flip, rotate)
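
One detail worth keeping in mind with geometric augmentations is that the action label has to be transformed together with the grid: a horizontal flip turns "left" into "right", and so on. The sketch below shows this with NumPy. The action encoding (`UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3`) and the helper names are assumptions; check the repository for its actual mapping.

```python
import numpy as np

# Assumed action encoding (hypothetical; the repository may use another order).
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
DELTAS = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1)}

# How each action label changes under each grid transform.
FLIP_LR = {UP: UP, DOWN: DOWN, LEFT: RIGHT, RIGHT: LEFT}
FLIP_UD = {UP: DOWN, DOWN: UP, LEFT: LEFT, RIGHT: RIGHT}
ROT90_CCW = {UP: LEFT, LEFT: DOWN, DOWN: RIGHT, RIGHT: UP}

def augment(grid, action, transform):
    """Apply a symmetry to the grid and remap the action label to match."""
    if transform == "flip_lr":
        return np.fliplr(grid), FLIP_LR[action]
    if transform == "flip_ud":
        return np.flipud(grid), FLIP_UD[action]
    if transform == "rot90":
        # np.rot90 rotates counter-clockwise by default.
        return np.rot90(grid), ROT90_CCW[action]
    raise ValueError(f"unknown transform: {transform}")

def move_marker(grid, action):
    """Move the single non-zero cell one step (used only to check consistency)."""
    (r,), (c,) = np.nonzero(grid)
    dr, dc = DELTAS[action]
    out = np.zeros_like(grid)
    out[r + dr, c + dc] = grid[r, c]
    return out
```

A quick sanity check for any such mapping is that "move, then transform" and "transform, then move with the remapped action" produce the same grid.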


In [None]:
# Clone repository and install dependencies
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT
!pip install -q torch numpy tqdm wandb

# Login to W&B (optional - for experiment tracking)
import wandb
wandb.login()


In [None]:
# Check GPU
import torch
device_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU (no GPU detected)'
print(f"Device: {device_name}")


## 1. SimplePoT Heuristic Training WITH Augmentations


In [None]:
# SimplePoT Heuristic training WITH augmentations
!python experiments/sokoban_pot_benchmark.py \
    --mode heuristic \
    --download \
    --model-type pot \
    --R 4 \
    --d-model 256 \
    --n-heads 4 \
    --n-layers 2 \
    --controller-type transformer \
    --max-depth 32 \
    --heuristic-epochs 10 \
    --batch-size 64 \
    --learning-rate 1e-4 \
    --warmup-steps 100 \
    --eval-interval 2 \
    --wandb \
    --project sokoban-pot \
    --run-name simple-heuristic-with-aug \
    --output-dir experiments/results/sokoban_simple_heuristic_aug


## 2. HybridPoT Heuristic Training WITH Augmentations (Aligned with Sudoku)


In [None]:
# HybridPoT Heuristic training WITH augmentations (aligned with Sudoku)
!python experiments/sokoban_pot_benchmark.py \
    --mode heuristic \
    --download \
    --model-type hybrid \
    --d-model 256 \
    --n-heads 8 \
    --H-cycles 2 \
    --L-cycles 6 \
    --H-layers 2 \
    --L-layers 2 \
    --T 4 \
    --halt-max-steps 2 \
    --controller-type transformer \
    --d-ctrl 128 \
    --max-depth 128 \
    --hrm-grad-style \
    --halt-exploration-prob 0.1 \
    --injection-mode broadcast \
    --heuristic-epochs 10 \
    --batch-size 64 \
    --learning-rate 1e-4 \
    --warmup-steps 100 \
    --eval-interval 2 \
    --wandb \
    --project sokoban-pot \
    --run-name hybrid-heuristic-with-aug \
    --output-dir experiments/results/sokoban_hybrid_heuristic_aug


## 3. HybridPoT PPO Training WITH Augmentations


In [None]:
# HybridPoT PPO training WITH augmentations (aligned with Sudoku)
!python experiments/sokoban_pot_benchmark.py \
    --mode ppo \
    --model-type hybrid \
    --d-model 256 \
    --n-heads 8 \
    --H-cycles 2 \
    --L-cycles 6 \
    --H-layers 2 \
    --L-layers 2 \
    --T 4 \
    --halt-max-steps 2 \
    --controller-type transformer \
    --d-ctrl 128 \
    --max-depth 128 \
    --hrm-grad-style \
    --halt-exploration-prob 0.1 \
    --injection-mode broadcast \
    --ppo-timesteps 100000 \
    --ppo-n-envs 8 \
    --batch-size 64 \
    --learning-rate 3e-4 \
    --wandb \
    --project sokoban-pot \
    --run-name hybrid-ppo-with-aug \
    --output-dir experiments/results/sokoban_hybrid_ppo_aug


## 4. Display Results


In [None]:
import json
from pathlib import Path

result_dirs = [
    ('Simple Heuristic + Aug', 'experiments/results/sokoban_simple_heuristic_aug'),
    ('Hybrid Heuristic + Aug', 'experiments/results/sokoban_hybrid_heuristic_aug'),
    ('Hybrid PPO + Aug', 'experiments/results/sokoban_hybrid_ppo_aug'),
]

print("=" * 60)
print("SOKOBAN BENCHMARK RESULTS")
print("=" * 60)

for name, d in result_dirs:
    results_file = Path(d) / 'results.json'
    if results_file.exists():
        with open(results_file) as f:
            results = json.load(f)
        print(f"\n=== {name} ===")
        if 'evaluation' in results:
            e = results['evaluation']
            print(f"  Solve Rate @50:  {e.get('solve_rate@50', 0):.2%}")
            print(f"  Solve Rate @100: {e.get('solve_rate@100', 0):.2%}")
            print(f"  Solve Rate @200: {e.get('solve_rate@200', 0):.2%}")
            print(f"  Median Steps:    {e.get('median_steps', 0):.1f}")
        elif 'test' in results:
            t = results['test']
            print(f"  Test Accuracy: {t.get('accuracy', 0):.2%}")
    else:
        print(f"\n=== {name} ===")
        print("  (results not found)")
