# 🧩 Maze Scaling Benchmark

**Following HRM Paper Protocol (arXiv 2506.21734)**

This notebook benchmarks model performance across **increasing maze sizes (8×8 to 30×30)** to test hierarchical reasoning at scale.

## Models Compared:
- **Baseline**: Standard Transformer encoder-decoder
- **BERT**: Pre-trained BERT architecture (parameter-matched)
- **PoH-HRM**: Pointer-over-Heads with Hierarchical Reasoning Module

## Benchmark Protocol:
- **Maze Sizes**: 8×8, 12×12, 16×16, 20×20, 24×24, 30×30
- **Training Data**: 1,000 mazes per size (following HRM paper)
- **Test Data**: 200 mazes per size
- **Task**: Find shortest path from start to goal
- **Metrics**: Path finding accuracy & path optimality
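
To make the task and the two metrics concrete, here is a minimal sketch; it is not the benchmark's own implementation. It assumes a hypothetical grid encoding (`0` = free cell, `1` = wall, 4-connected moves) and defines optimality as the ratio of the shortest path length to the predicted path length. The names `shortest_path_length`, `is_valid_path`, and `path_optimality` are illustrative, and the repository may define these metrics differently.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Return the number of moves on a shortest path, or None if unreachable.

    Assumed encoding (hypothetical): grid[r][c] == 1 is a wall, 0 is free.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def is_valid_path(grid, path, start, goal):
    """Check endpoints, free cells, and that each move is one step."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] != 0:
            return False
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


def path_optimality(grid, path, start, goal):
    """Return optimal_moves / predicted_moves, or 0.0 for an invalid path."""
    if not is_valid_path(grid, path, start, goal):
        return 0.0
    moves = len(path) - 1
    if moves == 0:  # start == goal, nothing to optimize
        return 1.0
    return shortest_path_length(grid, start, goal) / moves


# Tiny example: the wall row forces a detour around the right side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(shortest_path_length(grid, (0, 0), (2, 0)))  # 6
```

Under this definition a prediction counts as accurate when `is_valid_path` holds, and as optimal when `path_optimality` equals 1.0.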

## Expected Results:
HRM's hierarchical reasoning, which couples a fast low-level module (f_L) with a slower high-level module (f_H), is expected to help most on **larger mazes** (20×20 and up), where multi-step planning is critical.


## Setup


In [None]:
# Clone repository
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT


In [None]:
# Install dependencies
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install transformers datasets scipy numpy tqdm matplotlib seaborn


In [None]:
# Verify GPU
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
print(f"GPU Name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB" if torch.cuda.is_available() else "GPU Memory: N/A")


## Run Full Scaling Benchmark

**Default Configuration** (HRM Paper Protocol):
- Maze sizes: 8, 12, 16, 20, 24, 30
- 1,000 training mazes per size
- 200 test mazes per size
- 50 epochs per size
- PoH: R=4, T=4

**Expected Runtime on A100**: ~3-4 hours for full benchmark


In [None]:
!python experiments/maze_scaling_benchmark.py \
  --maze-sizes 8 12 16 20 24 30 \
  --train 1000 \
  --test 200 \
  --R 4 \
  --T 4 \
  --heads 4 \
  --epochs 50 \
  --seed 42 \
  --output experiments/results/maze_scaling_full


## Quick Scaling Test (Faster)

**Reduced Configuration** for faster iteration:
- Maze sizes: 8, 12, 16, 20
- 500 training mazes per size
- 100 test mazes per size
- 30 epochs per size

**Expected Runtime on A100**: ~45-60 minutes


In [None]:
!python experiments/maze_scaling_benchmark.py \
  --maze-sizes 8 12 16 20 \
  --train 500 \
  --test 100 \
  --R 4 \
  --T 4 \
  --heads 4 \
  --epochs 30 \
  --seed 42 \
  --output experiments/results/maze_scaling_quick


## Large-Scale Test (30×30 Only)

**Focus on the hardest maze size** (30×30), as in the HRM paper:
- 1,000 training mazes
- 200 test mazes
- 100 epochs

**Expected Runtime on A100**: ~60-90 minutes


In [None]:
!python experiments/maze_scaling_benchmark.py \
  --maze-sizes 30 \
  --train 1000 \
  --test 200 \
  --R 4 \
  --T 4 \
  --heads 4 \
  --epochs 100 \
  --seed 42 \
  --output experiments/results/maze_30x30_benchmark


## Visualize Results

The benchmark automatically generates:
1. **JSON results** with all metrics
2. **Scaling plot** showing accuracy/optimality vs maze size
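
Before loading anything, it can help to confirm that a run actually produced both files. The sketch below assumes the outputs are written as `<output>.json` and `<output>.png`, which is what the paths in the cells that follow suggest; `missing_outputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_outputs(output_prefix, suffixes=(".json", ".png")):
    """Return the expected result files that do not exist yet."""
    candidates = [Path(f"{output_prefix}{suffix}") for suffix in suffixes]
    return [path for path in candidates if not path.exists()]


# Change the prefix to match the --output value of the run you launched.
missing = missing_outputs("experiments/results/maze_scaling_full")
if missing:
    print("Missing result files:", ", ".join(str(p) for p in missing))
else:
    print("All result files found.")
```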


In [None]:
# Display the scaling plot
from IPython.display import Image, display

# Adjust path based on which benchmark you ran
display(Image('experiments/results/maze_scaling_full.png'))


In [None]:
# Load and display JSON results
import json

with open('experiments/results/maze_scaling_full.json', 'r') as f:
    results = json.load(f)

print("Maze Scaling Results:")
print(json.dumps(results, indent=2))


## Analysis

**Key Questions:**
1. Does PoH-HRM outperform baselines on larger mazes (20×20+)?
2. How does performance degrade with maze size for each model?
3. Does HRM's hierarchical reasoning provide better path optimality?
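
One way to put a number on the second question is to summarize each model's accuracy curve with a drop and a slope. This is a minimal sketch: the input is a hypothetical dict mapping maze side length to accuracy, which you would have to build yourself from the JSON results, since their schema is not documented here. `accuracy_drop` and `degradation_slope` are illustrative names.

```python
def accuracy_drop(acc_by_size):
    """Accuracy at the smallest maze size minus accuracy at the largest."""
    sizes = sorted(acc_by_size)
    return acc_by_size[sizes[0]] - acc_by_size[sizes[-1]]


def degradation_slope(acc_by_size):
    """Least-squares slope of accuracy versus maze side length.

    A more negative slope means accuracy falls faster as mazes grow.
    """
    sizes = sorted(acc_by_size)
    n = len(sizes)
    if n < 2:
        raise ValueError("need at least two maze sizes to fit a slope")
    mean_x = sum(sizes) / n
    mean_y = sum(acc_by_size[s] for s in sizes) / n
    cov = sum((s - mean_x) * (acc_by_size[s] - mean_y) for s in sizes)
    var = sum((s - mean_x) ** 2 for s in sizes)
    return cov / var
```

Comparing the slopes of the three models on the same set of sizes gives a single figure per model for how quickly it degrades.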

**Expected Findings:**
- All models perform well on small mazes (8×8, 12×12)
- PoH-HRM should maintain high accuracy on 20×20+ mazes
- Baseline/BERT may struggle with longer planning horizons
- HRM's temporal abstraction (T=4) is expected to help with multi-step reasoning
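
The two-timescale idea behind T can be pictured with a toy loop: a low-level update `f_L` runs at every step, while a high-level update `f_H` runs only once every T steps. This is a schematic sketch of that schedule as described for HRM, with placeholder update functions and scalar states; it is an assumption that PoH-HRM in this repository follows the same schedule, and `two_timescale_trace` is a hypothetical name.

```python
def two_timescale_trace(num_steps, T, f_L, f_H, z_L, z_H, x):
    """Run num_steps low-level updates and update z_H every T steps.

    Returns the final states and the steps at which z_H was updated.
    """
    high_update_steps = []
    for step in range(1, num_steps + 1):
        z_L = f_L(z_L, z_H, x)       # fast module: updates every step
        if step % T == 0:
            z_H = f_H(z_H, z_L)      # slow module: updates once per T steps
            high_update_steps.append(step)
    return z_L, z_H, high_update_steps


# Placeholder updates that only count how often each module runs.
count_low = lambda z_L, z_H, x: z_L + 1
count_high = lambda z_H, z_L: z_H + 1

z_L, z_H, steps = two_timescale_trace(16, 4, count_low, count_high, 0, 0, x=None)
print(z_L, z_H, steps)  # 16 4 [4, 8, 12, 16]
```

With T=4, the high-level state changes a quarter as often as the low-level state, which is the sense in which it operates at a coarser timescale.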


## Export Results

Download the results for inclusion in a paper or documentation.


In [None]:
# Download results (google.colab is only available when running inside Google Colab)
from google.colab import files

files.download('experiments/results/maze_scaling_full.json')
files.download('experiments/results/maze_scaling_full.png')
