# Pointer-over-Heads Transformer (PoT)

**Dynamic Multi-Head Attention with Adaptive Routing for Dependency Parsing**

**Author:** Eran Ben Artzy  
**Year:** 2025  
**License:** Apache 2.0  
**Repository:** https://github.com/Eran-BA/PoT

---

## 🚀 Quick Start (A100 GPU Optimized)

This notebook runs the complete experimental pipeline:

1. **Setup** (2 min): Clone the repo and install dependencies
2. **Smoke Test** (1 min): Verify the installation with dummy data
3. **Data Download**: Fetch the UD English EWT train, dev, and test files
4. **A/B Comparison** (15-20 min): Baseline vs PoH on UD English EWT
5. **Ablation Studies** (30-45 min): Vary inner iterations, routing, and halting
6. **Multi-Seed Runs** (20-30 min): Best config with 3 seeds
7. **Visualization** (1 min): Generate publication-ready plots

**Total Runtime:** ~1-2 hours on A100  
**Batch Size:** 32 (optimized for A100)

---


## 📋 Step 1: Setup & Installation


In [None]:
# Check GPU
!nvidia-smi -L
!nvidia-smi --query-gpu=memory.total --format=csv,noheader


In [None]:
# Clone repository
!git clone https://github.com/Eran-BA/PoT.git
%cd PoT


In [None]:
# Install dependencies
!pip install -q -r requirements.txt


## ✅ Step 2: Smoke Test (Verify Installation)


In [None]:
# Quick sanity check with dummy data (~30 seconds)
!python ab_ud_pointer_vs_baseline.py \
  --data_source dummy \
  --epochs 2 \
  --batch_size 16 \
  --log_csv smoke_test.csv


In [None]:
# View results
import pandas as pd

df = pd.read_csv('smoke_test.csv')
print("\n📊 Smoke Test Results:")
print(df[['model', 'epoch', 'train_uas', 'dev_uas', 'mean_inner_iters', 'params']])

print("\n✅ Installation verified!")
base_params = df.loc[df['model'] == 'Baseline', 'params'].iloc[0]
poh_params = df.loc[df['model'] == 'PoH', 'params'].iloc[0]
print(f"Baseline params: {base_params:,}")
print(f"PoH params:      {poh_params:,}")
print(f"Overhead:        {poh_params - base_params:+,}")


## 📥 Step 3: Download Real Data (UD English EWT)

Download the train, dev, and test splits of the Universal Dependencies English EWT treebank from GitHub:


In [None]:
# Download UD English EWT from GitHub
print("📥 Downloading UD English EWT dataset...")
!wget -q https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu
!wget -q https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu
!wget -q https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu
!mkdir -p ud_data
!mv en_ewt-ud-*.conllu ud_data/

print("✅ Downloaded UD English EWT to ud_data/")
print("   • Train: ud_data/en_ewt-ud-train.conllu")
print("   • Dev:   ud_data/en_ewt-ud-dev.conllu")
print("   • Test:  ud_data/en_ewt-ud-test.conllu")
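
Before training, it can be worth confirming that the downloaded files actually parse as CoNLL-U. The sketch below counts sentences and syntactic words in CoNLL-U text. `conllu_stats` is a hypothetical helper written for this notebook, not part of the PoT repository. It relies only on the standard CoNLL-U layout: blank lines separate sentences, `#` lines are comments, and IDs such as `1-2` (multiword tokens) or `1.1` (empty nodes) are not syntactic words.

```python
def conllu_stats(text):
    """Count (sentences, syntactic words) in CoNLL-U text. Hypothetical helper."""
    sentences = 0
    words = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if one is open
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith('#'):
            continue  # sentence-level comment or metadata
        token_id = line.split('\t')[0]
        if '-' in token_id or '.' in token_id:
            continue  # multiword-token range (1-2) or empty node (1.1)
        words += 1
        in_sentence = True
    if in_sentence:
        sentences += 1  # text did not end with a blank line
    return sentences, words


# Tiny hand-written sample, not taken from the treebank
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
print(conllu_stats(sample))
```

To check a real split, pass the file contents instead, for example `conllu_stats(open('ud_data/en_ewt-ud-dev.conllu', encoding='utf-8').read())`. A count of zero sentences usually means the download failed silently.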


## 🔬 Step 4: Real Data A/B Comparison

Run a parameter-matched comparison on **UD English EWT**, using the local files downloaded in Step 3:


In [None]:
# Parameter-matched A/B comparison (~15-20 min on A100)
!python ab_ud_pointer_vs_baseline.py \
  --data_source conllu --conllu_dir ud_data \
  --epochs 5 \
  --batch_size 32 \
  --lr 3e-5 \
  --param_match baseline \
  --ignore_punct \
  --emit_conllu \
  --log_csv ab_comparison.csv
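
The run above reports UAS and LAS with `--ignore_punct`. As a reminder of what those metrics measure, here is a minimal sketch of both scores. It is not the script's own evaluation code: `attachment_scores` is a hypothetical name, and treating tokens with UPOS `PUNCT` as punctuation is an assumption about how `--ignore_punct` behaves.

```python
def attachment_scores(gold, pred, ignore_punct=True):
    """Return (UAS, LAS) for token-aligned gold and predicted parses.

    gold: list of (upos, head, deprel) tuples
    pred: list of (head, deprel) tuples
    Assumption: punctuation is identified by UPOS == 'PUNCT'.
    """
    total = correct_heads = correct_labeled = 0
    for (upos, g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if ignore_punct and upos == 'PUNCT':
            continue
        total += 1
        if g_head == p_head:
            correct_heads += 1          # counts toward UAS
            if g_rel == p_rel:
                correct_labeled += 1    # head and label both right: counts toward LAS
    if total == 0:
        return 0.0, 0.0
    return correct_heads / total, correct_labeled / total


# Toy sentence: the third token has the right head but the wrong label
gold = [('NOUN', 2, 'nsubj'), ('VERB', 0, 'root'), ('ADV', 2, 'advmod'), ('PUNCT', 2, 'punct')]
pred = [(2, 'nsubj'), (0, 'root'), (2, 'obj'), (1, 'punct')]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

LAS can never exceed UAS, because a labeled attachment only counts when the head is also correct.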


In [None]:
# View A/B results
import pandas as pd

df_ab = pd.read_csv('ab_comparison.csv')

print("\n📊 A/B Comparison Results (Final Epoch):")
final = df_ab[df_ab['epoch'] == df_ab['epoch'].max()]
print(final[['model', 'dev_uas', 'dev_las', 'mean_inner_iters', 'params']].to_string(index=False))

# Relative improvement of PoH over the baseline, in percent
baseline_uas = float(final[final['model']=='Baseline']['dev_uas'].values[0])
poh_uas = float(final[final['model']=='PoH']['dev_uas'].values[0])
improvement = ((poh_uas - baseline_uas) / baseline_uas) * 100

# Use a signed format so a regression is not printed as "+-x.xx%"
print(f"\n🎯 PoH relative improvement: {improvement:+.2f}% UAS")
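
The percentage printed by this cell is a relative change, which is easy to confuse with a difference in accuracy points. The short sketch below computes both for comparison. The helper name `uas_gain` and the input values are illustrative only, and it assumes scores are stored as fractions in [0, 1].

```python
def uas_gain(baseline, candidate):
    """Return (absolute gain in points, relative gain in percent).

    Assumes both scores are fractions in [0, 1].
    """
    absolute_points = (candidate - baseline) * 100
    relative_percent = (candidate - baseline) / baseline * 100
    return absolute_points, relative_percent


# Illustrative values, not experimental results
points, percent = uas_gain(0.80, 0.82)
print(f"{points:+.2f} points absolute, {percent:+.2f}% relative")
```

When reporting results, stating which of the two numbers is meant avoids overstating or understating the gap.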


## 🧪 Step 5: Ablation Studies

Test what matters: **iterations**, **routing**, **halting**, **combination**


In [None]:
# A. Iterations (static gating vs refinement)
print("🔬 Running Iterations Ablation...")
for iters in [1, 2, 3]:
    print(f"  → Testing {iters} inner iterations...")
    !python ab_ud_pointer_vs_baseline.py \
      --data_source conllu --conllu_dir ud_data \
      --epochs 3 --batch_size 32 --lr 3e-5 \
      --halting_mode fixed --max_inner_iters {iters} \
      --routing_topk 0 --log_csv ablations.csv --ignore_punct

print("✅ Iterations ablation complete!")


In [None]:
# B. Routing (soft mixture vs hard top-k)
print("🔬 Running Routing Ablation...")
for topk in [0, 2]:
    mode = "soft" if topk == 0 else f"top-{topk}"
    print(f"  → Testing {mode} routing...")
    !python ab_ud_pointer_vs_baseline.py \
      --data_source conllu --conllu_dir ud_data \
      --epochs 3 --batch_size 32 --lr 3e-5 \
      --halting_mode fixed --max_inner_iters 2 \
      --routing_topk {topk} --log_csv ablations.csv --ignore_punct

print("✅ Routing ablation complete!")
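
For readers unfamiliar with the two routing modes compared here, the sketch below shows one common way to turn per-head routing scores into mixing weights: a softmax over all heads for the soft mixture, or a softmax restricted to the k best heads for top-k. This is an illustration under stated assumptions, not the repository's implementation. `route_heads` and the example logits are hypothetical. Only the convention that `topk == 0` means soft routing comes from this notebook.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def route_heads(scores, topk=0):
    """Return mixing weights over attention heads (hypothetical helper).

    topk == 0: soft mixture, every head gets a nonzero weight.
    topk > 0:  keep the k highest-scoring heads, renormalize, zero the rest.
    """
    if topk <= 0 or topk >= len(scores):
        return softmax(scores)
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topk]
    kept_weights = softmax([scores[i] for i in keep])
    weights = [0.0] * len(scores)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights


scores = [2.0, 0.5, 1.0, -1.0]  # hypothetical routing logits for 4 heads
soft = route_heads(scores, topk=0)
hard = route_heads(scores, topk=2)
print([round(w, 3) for w in soft])
print([round(w, 3) for w in hard])
```

Soft routing keeps every head in play, while top-k routing produces sparse weights, which is the trade-off this ablation probes.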


In [None]:
# C. Halting (fixed vs entropy vs ACT-style)
print("🔬 Running Halting Ablation...")
for halt in ['fixed', 'entropy', 'halting']:
    print(f"  → Testing {halt} halting...")
    !python ab_ud_pointer_vs_baseline.py \
      --data_source conllu --conllu_dir ud_data \
      --epochs 3 --batch_size 32 --lr 3e-5 \
      --halting_mode {halt} --max_inner_iters 3 \
      --routing_topk 2 --log_csv ablations.csv --ignore_punct

print("✅ Halting ablation complete!")
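
One plausible reading of entropy-based halting is: keep refining until the routing distribution is confident enough, or until `max_inner_iters` is reached. The sketch below illustrates that idea only. The stopping rule, the `threshold` parameter, and the function names are assumptions, and the script's actual `entropy` and `halting` modes may work differently.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def run_inner_loop(step_fn, max_inner_iters=3, threshold=1.0):
    """Run refinement steps until the routing distribution is confident.

    step_fn(i) returns the routing distribution after iteration i.
    threshold is a hypothetical entropy cutoff in nats.
    Returns the number of iterations actually used.
    """
    iters_used = 0
    for i in range(max_inner_iters):
        probs = step_fn(i)
        iters_used = i + 1
        if entropy(probs) < threshold:
            break  # confident enough, stop early
    return iters_used


# Toy distributions that sharpen with each refinement step
toy = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
used = run_inner_loop(lambda i: toy[i])
print(f"iterations used: {used}")
```

Under a rule like this, the logged `mean_inner_iters` would fall below `max_inner_iters` whenever many inputs halt early, whereas `fixed` mode would always spend the full budget.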


In [None]:
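# View ablation results summary
df_abl = pd.read_csv('ablations.csv')
poh_results = df_abl[df_abl['model'] == 'PoH']

# All three sweeps log to the same CSV, so each summary must filter on the
# full configuration of its sweep. Coerce config columns to numbers so the
# comparison works whether pandas parsed them as int, float, or str.
iters_col = pd.to_numeric(poh_results['max_inner_iters'], errors='coerce')
topk_col = pd.to_numeric(poh_results['routing_topk'], errors='coerce')
mode_col = poh_results['halting_mode']

def last_uas(mask):
    """Return the last logged dev UAS among rows matching mask, or None."""
    subset = poh_results[mask]
    return None if subset.empty else float(subset['dev_uas'].values[-1])

print("\n📊 Ablation Study Results:")
print("\n🔢 Iterations:")
for i in [1, 2, 3]:
    # Iterations sweep: fixed halting, soft routing
    uas = last_uas((iters_col == i) & (topk_col == 0) & (mode_col == 'fixed'))
    if uas is not None:
        print(f"  {i} iterations: {uas:.4f} UAS")

print("\n🎯 Routing:")
for tk in [0, 2]:
    # Routing sweep: fixed halting, 2 inner iterations
    uas = last_uas((topk_col == tk) & (iters_col == 2) & (mode_col == 'fixed'))
    if uas is not None:
        mode = "soft" if tk == 0 else f"top-{tk}"
        print(f"  {mode}: {uas:.4f} UAS")

print("\n⏱️  Halting:")
for halt in ['fixed', 'entropy', 'halting']:
    # Halting sweep: 3 max inner iterations, top-2 routing
    uas = last_uas((mode_col == halt) & (iters_col == 3) & (topk_col == 2))
    if uas is not None:
        print(f"  {halt}: {uas:.4f} UAS")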


## 🎲 Step 6: Multi-Seed Robustness

Run the best configuration with **3 random seeds** to report mean ± std.


In [None]:
# Run best config with multiple seeds (~20-30 min on A100)
print("🎲 Running Multi-Seed Evaluation...")
print("Best config: entropy halting, 2 iters, top-2 routing\n")

for seed in [42, 123, 456]:
    print(f"  → Seed {seed}...")
    !python ab_ud_pointer_vs_baseline.py \
      --data_source conllu --conllu_dir ud_data --epochs 5 --batch_size 32 --lr 3e-5 \
      --halting_mode entropy --max_inner_iters 2 --routing_topk 2 \
      --param_match baseline --ignore_punct --emit_conllu \
      --seed {seed} --log_csv multiseed.csv

print("\n✅ Multi-seed evaluation complete!")


In [None]:
# Compute statistics across seeds
import numpy as np

df_ms = pd.read_csv('multiseed.csv')
final_ms = df_ms[df_ms['epoch'] == df_ms['epoch'].max()]

baseline_stats = final_ms[final_ms['model'] == 'Baseline'].groupby('seed').last()
poh_stats = final_ms[final_ms['model'] == 'PoH'].groupby('seed').last()

print("\n📊 Multi-Seed Results (Mean ± Std):")
print("="*60)

for name, stats in [('Baseline', baseline_stats), ('PoH', poh_stats)]:
    uas_mean = stats['dev_uas'].astype(float).mean()
    uas_std = stats['dev_uas'].astype(float).std()
    las_mean = stats['dev_las'].astype(float).mean()
    las_std = stats['dev_las'].astype(float).std()
    
    print(f"\n{name}:")
    print(f"  Dev UAS: {uas_mean:.4f} ± {uas_std:.4f}")
    print(f"  Dev LAS: {las_mean:.4f} ± {las_std:.4f}")
    
    if name == 'PoH':
        iters_mean = stats['mean_inner_iters'].astype(float).mean()
        iters_std = stats['mean_inner_iters'].astype(float).std()
        print(f"  Mean Iters: {iters_mean:.2f} ± {iters_std:.2f}")

# Relative improvement of the PoH mean over the baseline mean, in percent
baseline_uas_mean = baseline_stats['dev_uas'].astype(float).mean()
poh_uas_mean = poh_stats['dev_uas'].astype(float).mean()
improvement_final = ((poh_uas_mean - baseline_uas_mean) / baseline_uas_mean) * 100

print("\n" + "="*60)
print(f"🎯 PoH relative improvement: {improvement_final:+.2f}% UAS")
print("="*60)
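
With only three seeds, the choice of standard-deviation estimator changes the reported spread noticeably. pandas `.std()` defaults to the sample estimator (divides by n - 1), whereas `numpy.std` defaults to the population estimator (divides by n). The sketch below shows the gap using the standard library. The scores are illustrative values, not measured results.

```python
import statistics

scores = [0.90, 0.91, 0.92]  # illustrative per-seed scores, not measured results

mean = statistics.mean(scores)
sample_std = statistics.stdev(scores)        # divides by n - 1, like pandas .std()
population_std = statistics.pstdev(scores)   # divides by n, like numpy.std's default

print(f"mean={mean:.4f}")
print(f"sample std={sample_std:.4f}  population std={population_std:.4f}")
```

The cell above uses pandas, so its ± values are sample standard deviations. Stating that alongside the number of seeds makes the results easier to compare with other work.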


## 📊 Step 7: Visualization & Download


In [None]:
# Generate comprehensive visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. A/B Comparison (Learning Curves)
ax = axes[0, 0]
for model in ['Baseline', 'PoH']:
    data = df_ab[df_ab['model'] == model]
    ax.plot(data['epoch'], data['dev_uas'].astype(float), 
            marker='o', label=model, linewidth=2)
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Dev UAS', fontsize=12)
ax.set_title('A/B Comparison: Learning Curves', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

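# 2. Iterations Ablation
ax = axes[0, 1]
poh_abl = df_abl[df_abl['model'] == 'PoH']
# Numeric views of the config columns (robust to int/float/str parsing)
abl_iters = pd.to_numeric(poh_abl['max_inner_iters'], errors='coerce')
abl_topk = pd.to_numeric(poh_abl['routing_topk'], errors='coerce')
abl_fixed = poh_abl['halting_mode'] == 'fixed'
iters_data = []
for i in [1, 2, 3]:
    # Iterations sweep only: fixed halting, soft routing
    subset = poh_abl[(abl_iters == i) & (abl_topk == 0) & abl_fixed]
    if not subset.empty:
        iters_data.append((i, float(subset['dev_uas'].values[-1])))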
if iters_data:
    iters, uas = zip(*iters_data)
    ax.bar(iters, uas, color='steelblue', alpha=0.7)
    ax.set_xlabel('Inner Iterations', fontsize=12)
    ax.set_ylabel('Dev UAS', fontsize=12)
    ax.set_title('Ablation: Number of Iterations', fontsize=14, fontweight='bold')
    ax.set_xticks([1, 2, 3])
    ax.grid(True, alpha=0.3, axis='y')

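# 3. Routing Ablation
ax = axes[1, 0]
routing_data = []
for tk, label in [(0, 'Soft'), (2, 'Top-2')]:
    # Routing sweep only: fixed halting, 2 inner iterations
    subset = poh_abl[
        (pd.to_numeric(poh_abl['routing_topk'], errors='coerce') == tk)
        & (pd.to_numeric(poh_abl['max_inner_iters'], errors='coerce') == 2)
        & (poh_abl['halting_mode'] == 'fixed')
    ]
    if not subset.empty:
        routing_data.append((label, float(subset['dev_uas'].values[-1])))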
if routing_data:
    labels, uas = zip(*routing_data)
    ax.bar(labels, uas, color='coral', alpha=0.7)
    ax.set_xlabel('Routing Mode', fontsize=12)
    ax.set_ylabel('Dev UAS', fontsize=12)
    ax.set_title('Ablation: Routing Strategy', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')

# 4. Multi-Seed Results
ax = axes[1, 1]
models = ['Baseline', 'PoH']
means = [baseline_uas_mean, poh_uas_mean]
stds = [baseline_stats['dev_uas'].astype(float).std(), 
        poh_stats['dev_uas'].astype(float).std()]
x_pos = range(len(models))
bars = ax.bar(x_pos, means, yerr=stds, capsize=5, 
              color=['lightblue', 'lightcoral'], alpha=0.7)
ax.set_xticks(x_pos)
ax.set_xticklabels(models)
ax.set_ylabel('Dev UAS', fontsize=12)
ax.set_title('Multi-Seed Results (Mean ± Std)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
ax.annotate(f'{improvement_final:+.2f}%', 
            xy=(1, poh_uas_mean), xytext=(1.2, poh_uas_mean),
            fontsize=12, fontweight='bold', color='green',
            arrowprops=dict(arrowstyle='->', color='green', lw=2))

plt.tight_layout()
plt.savefig('comprehensive_results.png', dpi=300, bbox_inches='tight')
print("\n✅ Saved comprehensive_results.png")
plt.show()


In [None]:
# Package all results for download
!zip -r pot_results.zip *.csv *.png *.conllu 2>/dev/null

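# files.download is only available inside Google Colab; elsewhere the zip stays on disk
try:
    from google.colab import files
    files.download('pot_results.zip')
except ImportError:
    print("Not running in Colab; pot_results.zip is in the current directory.")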

print("\n✅ Results packaged and ready for download!")
print("\nPackage includes:")
print("  • CSV logs (all experiments)")
print("  • Plots (ablations, learning curves)")
print("  • CoNLL-U predictions")


## 📋 Final Summary & Paper Claims


In [None]:
# Generate paper-ready summary
print("="*70)
print("POINTER-OVER-HEADS TRANSFORMER: EXPERIMENTAL RESULTS")
print("="*70)
print(f"\n📊 Dataset: Universal Dependencies English EWT")
print(f"🖥️  Hardware: A100 GPU")
print(f"⚙️  Encoder: DistilBERT-base-uncased")

print("\n" + "="*70)
print("MAIN RESULTS (Multi-Seed, Parameter-Matched)")
print("="*70)

baseline_params = baseline_stats['params'].iloc[0]
poh_params = poh_stats['params'].iloc[0]
param_overhead = ((poh_params - baseline_params) / baseline_params) * 100

print(f"\n{'Model':<12} {'Params':<12} {'Dev UAS':<15} {'Dev LAS':<15} {'Mean Iters':<12}")
print("-"*70)
print(f"{'Baseline':<12} {baseline_params:<12,} {baseline_uas_mean:.4f}±{baseline_stats['dev_uas'].astype(float).std():.4f}    {baseline_stats['dev_las'].astype(float).mean():.4f}±{baseline_stats['dev_las'].astype(float).std():.4f}    {'—':<12}")
print(f"{'PoH':<12} {poh_params:<12,} {poh_uas_mean:.4f}±{poh_stats['dev_uas'].astype(float).std():.4f}    {poh_stats['dev_las'].astype(float).mean():.4f}±{poh_stats['dev_las'].astype(float).std():.4f}    {poh_stats['mean_inner_iters'].astype(float).mean():.2f}±{poh_stats['mean_inner_iters'].astype(float).std():.2f}")
# Signed numeric formats: the ',' grouping option is invalid on a str, and a hard-coded '+' mislabels regressions
print(f"{'Δ':<12} {poh_params - baseline_params:<+12,} {improvement_final:+.2f}%       {((poh_stats['dev_las'].astype(float).mean() - baseline_stats['dev_las'].astype(float).mean()) / baseline_stats['dev_las'].astype(float).mean()) * 100:+.2f}%")

print("\n" + "="*70)
print("KEY FINDINGS")
print("="*70)
print(f"\n1. PoH changes dev UAS by {improvement_final:+.2f}% (relative) with {param_overhead:+.2f}% parameters")
print(f"2. Entropy halting uses ~{poh_stats['mean_inner_iters'].astype(float).mean():.1f} inner iterations on average")
print("3. Best config: entropy halting + 2 max iters + top-2 routing")
print(f"4. Variation across seeds: UAS std {poh_stats['dev_uas'].astype(float).std():.4f}")

print("\n" + "="*70)
print("✅ Complete! All experiments finished successfully.")
print("\n📦 Download pot_results.zip for all data, plots, and predictions.")
print("\n🔗 Repository: https://github.com/Eran-BA/PoT")
print("👤 Author: Eran Ben Artzy")
print("="*70)
