# Blocksworld PPO Benchmark - Colab

This notebook trains a PoT (Pointer-Over-Heads Transformer) Blocksworld solver with PPO and prints its validation and test accuracy.

**Features:**
- Sub-trajectory augmentation (C(n+1, 2) = n(n+1)/2 sub-trajectories per trajectory)
- Contrastive learning on good vs. bad trajectories (`--good-bad-ratio`)
- PoT iterative refinement (R cycles, set with `--R`)
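
The C(n+1, 2) count in the first feature can be checked with a small sketch. It assumes that n is the number of moves in a trajectory (so there are n + 1 states) and that a sub-trajectory is any contiguous slice with at least one move. The function name and state labels are hypothetical and are not taken from the repository.

```python
from itertools import combinations

def sub_trajectories(states):
    """Return every contiguous sub-trajectory containing at least one move.

    A trajectory with n moves has n + 1 states, so there are C(n + 1, 2)
    ways to pick a (start, end) pair of state indices with start < end.
    """
    return [states[i:j + 1] for i, j in combinations(range(len(states)), 2)]

# Hypothetical 3-move trajectory (4 states); the labels are placeholders.
trajectory = ["s0", "s1", "s2", "s3"]
subs = sub_trajectories(trajectory)
print(len(subs))  # C(4, 2) = 6
```

Under this reading the augmented set grows quadratically with trajectory length, which is worth keeping in mind when raising `--max-blocks`.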


In [None]:
# Clone repository
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT

# Install dependencies
!pip install -q torch numpy tqdm datasets


In [None]:
# Download Blocksworld dataset from HuggingFace
from src.data.blocksworld import download_blocksworld_dataset

download_blocksworld_dataset(
    data_dir='data/blocksworld',
    max_blocks=6,
    generate_trajectories=False,
)


In [None]:
# Run the PPO benchmark (quick 2-epoch test)
# Model options:
#   --model-type simple: SimplePoT with R refinement iterations (default)
#   --model-type hybrid: HybridPoT with H_cycles, L_cycles, T

!python experiments/blocksworld_ppo_benchmark.py \
    --mode ppo \
    --epochs 2 \
    --batch-size 32 \
    --max-blocks 6 \
    --model-type simple \
    --R 4 \
    --d-model 128 \
    --n-heads 4 \
    --n-layers 2 \
    --d-ff 512 \
    --dropout 0.1 \
    --controller-type transformer \
    --good-bad-ratio 1.0 \
    --clip-epsilon 0.2 \
    --entropy-coef 0.01 \
    --value-coef 0.5 \
    --ppo-epochs 4 \
    --eval-interval 1 \
    --output-dir experiments/results/blocksworld_ppo_colab
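
The `--clip-epsilon`, `--value-coef`, and `--entropy-coef` flags correspond to terms in the standard PPO objective. The sketch below computes that objective for a single sample in plain Python. It illustrates the textbook formulation only; whether the benchmark script combines the terms in exactly this way is an assumption, and `ppo_loss` is a hypothetical name.

```python
import math

def ppo_loss(new_logp, old_logp, advantage, value, ret, entropy,
             clip_epsilon=0.2, value_coef=0.5, entropy_coef=0.01):
    """Single-sample PPO loss: clipped policy term + value term - entropy bonus."""
    # Probability ratio between the updated policy and the policy that
    # collected the data.
    ratio = math.exp(new_logp - old_logp)
    # Clip the ratio to [1 - eps, 1 + eps] to limit the size of the update.
    clipped = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    # Pessimistic (minimum) surrogate, negated because we minimize the loss.
    policy_loss = -min(ratio * advantage, clipped * advantage)
    # Squared error between the value estimate and the observed return.
    value_loss = (value - ret) ** 2
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# Ratio 1.5 with a positive advantage is clipped to 1.2 when epsilon is 0.2.
loss = ppo_loss(new_logp=math.log(1.5), old_logp=0.0, advantage=1.0,
                value=0.5, ret=1.0, entropy=1.0)
print(round(loss, 4))  # -1.2 + 0.5 * 0.25 - 0.01 * 1.0 = -1.085
```

The defaults in the signature mirror the values passed on the command line above.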


In [None]:
# View results
import json
with open('experiments/results/blocksworld_ppo_colab/results.json') as f:
    results = json.load(f)

print('=== PPO Blocksworld Results ===')
print(f"Best Val Accuracy: {results['best_val_acc']:.2%}")
print(f"Test Slot Accuracy: {results['test_metrics']['slot_accuracy']:.2%}")
print(f"Test Exact Match: {results['test_metrics']['exact_match']:.2%}")


## Hybrid Model (H_cycles, L_cycles, T)

The HybridPoT model uses the full HRM (Hierarchical Reasoning Module) architecture, configured by three settings:
- **H_cycles** (`--H-cycles`): Slow outer cycles (H_level)
- **L_cycles** (`--L-cycles`): Fast inner cycles per H_cycle (L_level)
- **T** (`--T`): HRM period for the pointer controller
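
To make the relationship between the outer and inner cycles concrete, the sketch below enumerates the steps of a generic two-level loop. It is a schematic under the assumption that every H cycle runs all of its L cycles before the slow update; it does not reproduce HybridPoT's forward pass, and `hybrid_schedule` is a hypothetical helper.

```python
def hybrid_schedule(h_cycles, l_cycles):
    """List (h, l) index pairs in the order a two-level loop visits them."""
    steps = []
    for h in range(h_cycles):
        for l in range(l_cycles):
            # Fast inner (L-level) update.
            steps.append((h, l))
        # A slow outer (H-level) update would run here, once per H cycle.
    return steps

# Values used by the hybrid run in this notebook: --H-cycles 2, --L-cycles 8.
schedule = hybrid_schedule(h_cycles=2, l_cycles=8)
print(len(schedule))  # 2 * 8 = 16 inner updates
```

Under this assumption the number of inner updates scales as H_cycles × L_cycles, so doubling either setting roughly doubles the per-sample compute.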


In [None]:
# Run PPO with Hybrid model (H/L cycles)
!python experiments/blocksworld_ppo_benchmark.py \
    --mode ppo \
    --epochs 2 \
    --batch-size 32 \
    --max-blocks 6 \
    --model-type hybrid \
    --H-cycles 2 \
    --L-cycles 8 \
    --T 4 \
    --halt-max-steps 1 \
    --d-model 128 \
    --n-heads 4 \
    --n-layers 2 \
    --d-ff 512 \
    --dropout 0.1 \
    --controller-type transformer \
    --good-bad-ratio 1.0 \
    --eval-interval 1 \
    --output-dir experiments/results/blocksworld_ppo_hybrid
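
Once both runs have finished, their metrics can be printed side by side. This sketch assumes the hybrid run writes a `results.json` with the same keys as the one read in the "View results" cell, which the notebook does not confirm. `load_summary` is a hypothetical helper that returns `None` when the file is missing.

```python
import json
from pathlib import Path

def load_summary(output_dir):
    """Read results.json from output_dir and return the headline metrics."""
    path = Path(output_dir) / "results.json"
    if not path.exists():
        return None  # run not finished, or it wrote to a different location
    with path.open() as f:
        results = json.load(f)
    test = results.get("test_metrics", {})
    return {
        "best_val_acc": results.get("best_val_acc"),
        "slot_accuracy": test.get("slot_accuracy"),
        "exact_match": test.get("exact_match"),
    }

# Output directories passed to the two benchmark runs in this notebook.
for run in ("blocksworld_ppo_colab", "blocksworld_ppo_hybrid"):
    print(run, load_summary(f"experiments/results/{run}"))
```

Using `.get` keeps the cell from raising if a key is absent, at the cost of printing `None` for that metric.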
