# Parameter Scaling Benchmark - PoH-HRM vs Baseline Transformer (Enhanced)

This notebook runs a parameter scaling experiment comparing:
- **Baseline Transformer** (standard multi-head attention)
- **PoH-HRM** (Pointer-over-Heads with Hierarchical Reasoning Module)

We now focus on the two largest scales for clear differentiation:
- **Large** (~30M params)
- **XL** (~100M params)

An optional final section benchmarks a **HUGE** (~500M params) model.

## 🚀 Enhanced Training Features (MLM-U Inspired)

This notebook now includes advanced training techniques:
- ✅ **Label Smoothing** (0.1) - Prevents overconfidence
- ✅ **Cosine LR Warmup** (2000 steps) - Smooth learning rate schedule
- ✅ **Multi-Horizon Supervision** (3-step) - Predict multiple steps ahead
- ✅ **Validity-Aware Loss** - Only predict valid moves
- ✅ **Routing Entropy Regularization** (5e-4, annealed) - Sharper PoH routing
- ✅ **CNN Maze Encoder** - Global maze conditioning
- ✅ **Depth-First Parameter Parity** - Keep PoH depth, reduce width
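
To make the interaction between label smoothing and validity masking concrete, here is a minimal pure-Python sketch for a single prediction. It assumes the smoothing mass is spread uniformly over the valid moves only and that invalid moves are excluded from the softmax; the function name `masked_smoothed_ce` is illustrative, and the repository's loss may be implemented differently.

```python
import math

def masked_smoothed_ce(logits, target, valid, smoothing=0.1):
    """Cross-entropy over valid moves only, with label smoothing.

    logits: list of floats, one score per move
    target: index of the correct move (must be a valid move)
    valid:  list of bools, True where the move is legal
    """
    if not valid[target]:
        raise ValueError("target move must be valid")
    idx = [i for i, ok in enumerate(valid) if ok]

    # Log-normalizer of a softmax restricted to valid moves
    # (invalid moves receive zero probability).
    m = max(logits[i] for i in idx)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in idx))

    # Smoothed target: (1 - smoothing) on the true move, plus
    # smoothing spread uniformly across the valid moves.
    k = len(idx)
    loss = 0.0
    for i in idx:
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * (logits[i] - log_z)
    return loss
```

Because invalid moves never enter the normalizer, a large logit on an illegal move does not affect the loss.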

## Key Features

**Parameter Parity:** PoH-HRM keeps its full depth and reduces its width (`d_model`) until its parameter count is within 10% of the baseline's.
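
The width search can be sketched as follows. This is only an illustration: `approx_params` uses the common rough estimate of about 12·d² weights per Transformer layer, and the `overhead` factor standing in for PoH-HRM's extra parameters is a hypothetical number, not one taken from the repository.

```python
def approx_params(d_model, depth, overhead=0.0):
    """Rough weight count: ~12 * d_model^2 per layer (attention + 4x FFN),
    scaled by a hypothetical fractional overhead for extra modules."""
    return int(depth * 12 * d_model * d_model * (1.0 + overhead))

def match_width(target_params, depth, n_heads, overhead, tol=0.10, d_max=4096):
    """Keep depth fixed and pick the width (a multiple of n_heads) whose
    estimated parameter count is closest to target_params."""
    best_d, best_gap = None, float("inf")
    for d in range(n_heads, d_max + 1, n_heads):
        gap = abs(approx_params(d, depth, overhead) - target_params) / target_params
        if gap < best_gap:
            best_d, best_gap = d, gap
    if best_gap > tol:
        raise ValueError(f"no width within {tol:.0%} of the target")
    return best_d, best_gap

# Baseline: depth 6, d_model 512. PoH stand-in: same depth, 25% overhead (made up).
baseline = approx_params(512, depth=6)
width, gap = match_width(baseline, depth=6, n_heads=8, overhead=0.25)
print(f"matched width: {width}, relative gap: {gap:.2%}")
```

Stepping the width in multiples of `n_heads` keeps the per-head dimension an integer.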

**Key Questions:**
1. Does PoH-HRM maintain its advantage at large scales with enhanced training?
2. How does the advantage change from Large to XL (and optionally to HUGE)?
3. Does performance saturate or continue improving?

**Runtime:** ~1–2 hours on an A100 GPU (Large + XL). HUGE is heavier; start with fewer epochs.


## Setup


In [None]:
# Clone repository
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT

# Switch to scaling branch
!git checkout scaling_parameter_size

# Install dependencies
!pip install -q torch transformers numpy matplotlib tqdm
!pip install -q maze-dataset

# Verify GPU (CUDA first, then Apple MPS, otherwise fall back to CPU)
import torch
if torch.cuda.is_available():
    print(f"CUDA GPU: {torch.cuda.get_device_name(0)}")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    print("Apple MPS backend available")
else:
    print("No GPU detected; training will run on CPU and be much slower")


In [None]:
# Check GPU
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")


## Configuration

Adjust these parameters as needed:


In [None]:
# Benchmark configuration
MAZE_SIZE = 16        # Maze grid size (16x16)
N_TRAIN = 1000        # Training samples per model
N_TEST = 100          # Test samples
EPOCHS = 50           # Training epochs per model
R = 4                 # PoH refinement iterations
T = 4                 # HRM outer loop period
SEED = 42             # Random seed

# Enhanced training options (NEW!)
LR = 1e-3             # Learning rate
LABEL_SMOOTH = 0.1    # Label smoothing factor
WARMUP_STEPS = 2000   # LR warmup steps
MULTI_HORIZON = 3     # k-step supervision horizon
VALIDITY_MASK = True  # Enable validity masking
ROUTE_ENT = 5e-4      # PoH routing entropy weight
ENT_ANNEAL = True     # Anneal entropy weight

# For faster testing (recommended for first run):
# MAZE_SIZE = 12
# N_TRAIN = 500
# N_TEST = 50
# EPOCHS = 30
# WARMUP_STEPS = 1000


## Run Benchmark with Enhanced Training

This will test the two largest model sizes (Large and XL) with:
- **Parameter parity** via depth-first approach (keep PoH depth, reduce width)
- **All enhanced features** (label smoothing, cosine warmup, multi-horizon, validity masking, routing entropy, CNN encoder)
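
The learning-rate schedule can be pictured with the sketch below. It reads "cosine warmup" as a linear warmup followed by cosine decay, which is an assumption about the benchmark script rather than something this notebook states; `lr_at` and `total_steps` are illustrative names.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=2000, total_steps=20000, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 999, 1999, 11000, 20000):
    print(s, f"{lr_at(s):.6f}")
```

The total number of optimizer steps depends on the batch size, which this notebook does not set. If that total is close to or below `WARMUP_STEPS`, most of the run is spent in warmup, so it may be worth checking before a long run.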

**Progress:**
1. Generate training/test data (once)
2. For each size (Large → XL):
   - Train Baseline Transformer with enhancements
   - Evaluate Baseline
   - Train PoH-HRM with enhancements + routing entropy (width auto-adjusts for param parity)
   - Evaluate PoH-HRM
3. Save results
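
The visualization and summary cells further down read specific keys from the saved JSON. Below is a minimal sanity check for that file, assuming only the keys those cells access; `check_results` and the parity threshold it applies are illustrative and not part of the repository.

```python
RESULT_KEYS = {
    "size", "baseline_params", "poh_params",
    "baseline_acc", "poh_acc", "baseline_opt", "poh_opt",
    "poh_advantage_acc", "poh_advantage_opt",
}
CONFIG_KEYS = {"maze_size", "n_train", "n_test", "epochs", "R", "T"}

def check_results(data, parity_tol=0.10):
    """Return a list of problems found; an empty list means the file looks usable."""
    problems = []
    missing = CONFIG_KEYS - set(data.get("config", {}))
    if missing:
        problems.append(f"config missing: {sorted(missing)}")
    for i, row in enumerate(data.get("results", [])):
        missing = RESULT_KEYS - set(row)
        if missing:
            problems.append(f"results[{i}] missing: {sorted(missing)}")
            continue
        # Parameter parity: PoH-HRM should be within parity_tol of the baseline.
        gap = abs(row["poh_params"] - row["baseline_params"]) / row["baseline_params"]
        if gap > parity_tol:
            problems.append(f"{row['size']}: parameter gap {gap:.1%} exceeds {parity_tol:.0%}")
    return problems
```

After the benchmark finishes, load the file with `json.load` and pass the resulting dict to `check_results` before plotting.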

**Training Enhancements Active:**
- ✅ Label smoothing (0.1)
- ✅ Cosine LR warmup (2000 steps)
- ✅ Multi-horizon supervision (3-step ahead)
- ✅ Validity-aware loss (mask invalid moves)
- ✅ CNN maze encoder (global conditioning)
- ✅ PoH routing entropy regularization (5e-4, annealed)
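
The routing entropy term can be sketched as below. It assumes the regularizer adds the Shannon entropy of the per-head routing distribution to the task loss, so that minimizing it sharpens routing, and that annealing decays the weight linearly to zero; both are assumptions, and the benchmark script may use a different schedule.

```python
import math

def routing_entropy(probs):
    """Shannon entropy (in nats) of a routing distribution over heads."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weight(step, total_steps, base_weight=5e-4, anneal=True):
    """Regularization weight; decays linearly to zero when annealing is on."""
    if not anneal:
        return base_weight
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_weight * (1.0 - frac)

def regularized_loss(task_loss, probs, step, total_steps):
    """Task loss plus the weighted routing entropy penalty."""
    return task_loss + entropy_weight(step, total_steps) * routing_entropy(probs)
```

A uniform distribution over four heads has entropy ln 4, while a one-hot distribution has entropy 0 and incurs no penalty.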

**Note:** Large and XL may take 30–60 minutes each. Use fewer epochs for a quick run.


In [None]:
# Run Large and XL with all enhanced training features
validity_flag = "--validity-mask" if VALIDITY_MASK else ""
anneal_flag = "--ent-anneal" if ENT_ANNEAL else ""

!python experiments/parameter_scaling_benchmark.py \
    --maze-size {MAZE_SIZE} \
    --train {N_TRAIN} \
    --test {N_TEST} \
    --epochs {EPOCHS} \
    --R {R} \
    --T {T} \
    --seed {SEED} \
    --lr {LR} \
    --label-smoothing {LABEL_SMOOTH} \
    --warmup-steps {WARMUP_STEPS} \
    --multi-horizon {MULTI_HORIZON} \
    {validity_flag} \
    --route-ent-weight {ROUTE_ENT} \
    {anneal_flag} \
    --output experiments/results/parameter_scaling_colab


## Visualize Results


In [None]:
import json
import matplotlib.pyplot as plt
import numpy as np

# Load results
with open(f'experiments/results/parameter_scaling_colab/scaling_results_maze{MAZE_SIZE}.json', 'r') as f:
    data = json.load(f)

results = data['results']
config = data['config']

# Extract data
sizes = [r['size'] for r in results]
baseline_params = [r['baseline_params'] / 1e6 for r in results]
poh_params = [r['poh_params'] / 1e6 for r in results]

baseline_acc = [r['baseline_acc'] for r in results]
poh_acc = [r['poh_acc'] for r in results]

baseline_opt = [r['baseline_opt'] for r in results]
poh_opt = [r['poh_opt'] for r in results]

poh_adv_acc = [r['poh_advantage_acc'] for r in results]
poh_adv_opt = [r['poh_advantage_opt'] for r in results]

# Create plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Accuracy vs Parameters
ax = axes[0, 0]
ax.plot(baseline_params, baseline_acc, 'o-', label='Baseline', linewidth=2, markersize=8)
ax.plot(poh_params, poh_acc, 's-', label='PoH-HRM', linewidth=2, markersize=8)
ax.set_xlabel('Parameters (M)', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title(f'Accuracy vs. Model Size\n(Maze {config["maze_size"]}×{config["maze_size"]})', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_xscale('log')

# Plot 2: Optimality vs Parameters
ax = axes[0, 1]
ax.plot(baseline_params, baseline_opt, 'o-', label='Baseline', linewidth=2, markersize=8)
ax.plot(poh_params, poh_opt, 's-', label='PoH-HRM', linewidth=2, markersize=8)
ax.set_xlabel('Parameters (M)', fontsize=12)
ax.set_ylabel('Optimality (%)', fontsize=12)
ax.set_title(f'Optimality vs. Model Size\n(Maze {config["maze_size"]}×{config["maze_size"]})', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_xscale('log')

# Plot 3: PoH Advantage in Accuracy
ax = axes[1, 0]
colors = ['green' if x > 0 else 'red' for x in poh_adv_acc]
ax.bar(sizes, poh_adv_acc, color=colors, alpha=0.7)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Model Size', fontsize=12)
ax.set_ylabel('PoH Advantage (%)', fontsize=12)
ax.set_title('PoH-HRM Accuracy Advantage\n(PoH - Baseline)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Plot 4: PoH Advantage in Optimality
ax = axes[1, 1]
colors = ['green' if x > 0 else 'red' for x in poh_adv_opt]
ax.bar(sizes, poh_adv_opt, color=colors, alpha=0.7)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Model Size', fontsize=12)
ax.set_ylabel('PoH Advantage (%)', fontsize=12)
ax.set_title('PoH-HRM Optimality Advantage\n(PoH - Baseline)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(f'scaling_plot_maze{config["maze_size"]}.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✓ Plot saved to: scaling_plot_maze{config['maze_size']}.png")


## Results Summary


In [None]:
print("="*100)
print(f"PARAMETER SCALING RESULTS - Maze {config['maze_size']}×{config['maze_size']}")
print("="*100)
print(f"Training: {config['n_train']} samples, {config['epochs']} epochs")
print(f"Testing: {config['n_test']} samples")
print(f"PoH Config: R={config['R']}, T={config['T']}")
print("="*100)
print()

print(f"{'Size':<10} {'Params (M)':<20} {'Accuracy (%)':<25} {'Optimality (%)':<25}")
print(f"{'':<10} {'Baseline / PoH':<20} {'Baseline / PoH / Δ':<25} {'Baseline / PoH / Δ':<25}")
print("-"*100)

for r in results:
    # Build each cell as a string first so it pads to the same width as its header
    params = f"{r['baseline_params']/1e6:.1f} / {r['poh_params']/1e6:.1f}"
    acc = f"{r['baseline_acc']:.1f} / {r['poh_acc']:.1f} / {r['poh_advantage_acc']:+.1f}"
    opt = f"{r['baseline_opt']:.1f} / {r['poh_opt']:.1f} / {r['poh_advantage_opt']:+.1f}"
    print(f"{r['size']:<10} {params:<20} {acc:<25} {opt:<25}")

print("="*100)
print("\nKey Findings:")
print("-"*100)

# Calculate average advantage
avg_adv_acc = np.mean([r['poh_advantage_acc'] for r in results])
avg_adv_opt = np.mean([r['poh_advantage_opt'] for r in results])

print(f"Average PoH Advantage (Accuracy): {avg_adv_acc:+.2f}%")
print(f"Average PoH Advantage (Optimality): {avg_adv_opt:+.2f}%")
print()

# Find best size for PoH
best_acc_idx = np.argmax([r['poh_acc'] for r in results])
best_opt_idx = np.argmax([r['poh_opt'] for r in results])

print(f"Best PoH Accuracy: {results[best_acc_idx]['size']} "
      f"({results[best_acc_idx]['poh_acc']:.1f}% @ {results[best_acc_idx]['poh_params']/1e6:.1f}M params)")
print(f"Best PoH Optimality: {results[best_opt_idx]['size']} "
      f"({results[best_opt_idx]['poh_opt']:.1f}% @ {results[best_opt_idx]['poh_params']/1e6:.1f}M params)")


## Optional: HUGE (~500M) Benchmark

Run a much larger model using `experiments/huge_500m_benchmark.py`.
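Unlike the Large/XL runs above, which keep PoH depth and reduce width, this script enforces parameter parity by reducing PoH depth until its parameter overhead over the baseline is at most 10%.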

Notes:
- Start with fewer epochs (e.g., 5–10)
- Lower batch size if you hit OOM
- Runtime is significantly higher than Large/XL
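
A rough lower bound on training memory helps when choosing a batch size. The sketch below assumes fp32 weights and an Adam-style optimizer with two moment buffers per parameter; this notebook states neither the precision nor the optimizer, so treat both as assumptions. Activation memory, which grows with batch size, is not included.

```python
def training_memory_gb(n_params, bytes_per_param=4, optimizer_states=2):
    """Weights + gradients + optimizer states in GB (1e9 bytes); activations excluded."""
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer buffers
    return n_params * bytes_per_param * copies / 1e9

print(f"{training_memory_gb(500_000_000):.1f} GB before activations")
```

Whatever remains of the GPU's memory after this fixed cost is what the batch's activations have to fit into, which is why lowering `--batch-size` is the first thing to try on OOM.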


In [None]:
# Example HUGE run (adjust epochs/batch-size to your GPU)
!python experiments/huge_500m_benchmark.py \
  --maze-size {MAZE_SIZE} \
  --train 2000 \
  --test 200 \
  --epochs 10 \
  --batch-size 8 \
  --R {R} --T {T} \
  --output experiments/results/huge_500m
