# Grand Synthesis: Walk Resolution Limit Hypothesis Adjudication

**Definitive meta-analytic synthesis** across all 10 experiment artifacts (5 iterations), formally adjudicating each of the 6 hypothesis success/disconfirmation criteria with quantified evidence.

This notebook:
- Computes **Fisher z random-effects meta-analysis** of 26 SRI-gap correlation studies
- Formally **adjudicates 6 criteria** (C1–C3 confirmation, D1–D3 disconfirmation)
- Builds **SRWE win/loss/tie scorecard** across encoding comparisons
- Tests **moderator effects** (architecture, domain, metric type)
- Produces **scope-of-validity map** and practical encoding selection guidelines
- Generates **publication-quality figures** (forest plot, verdict dashboard, decision tree)

In [1]:
import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', *a])

# Core packages (pre-installed on Colab, install locally to match Colab env)
if 'google.colab' not in sys.modules:
    _pip('numpy==2.0.2', 'scipy==1.16.3', 'matplotlib==3.10.0')


## Imports

In [2]:
from __future__ import annotations
import json
import math
import os
from typing import Any

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from IPython.display import display, Image

## Data Loading

In [3]:
GITHUB_DATA_URL = "https://raw.githubusercontent.com/AMGrobelnik/ai-invention-ace67e-the-walk-resolution-limit-a-super-resolu/main/evaluation_iter6_grand_synthesis/demo/mini_demo_data.json"

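def load_data():
    """Load the demo data from GitHub, falling back to a local copy."""
    try:
        import urllib.request
        with urllib.request.urlopen(GITHUB_DATA_URL) as response:
            return json.loads(response.read().decode())
    except Exception as exc:
        # Remote fetch failed (offline, 404, bad JSON): fall back to the local file.
        print(f"Remote load failed ({type(exc).__name__}); trying local copy")
    if os.path.exists("mini_demo_data.json"):
        with open("mini_demo_data.json") as f:
            return json.load(f)
    raise FileNotFoundError("Could not load mini_demo_data.json")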

In [4]:
data = load_data()
studies = data["studies"]
scorecard = data["scorecard"]
gap_reductions = data["gap_reductions"]
srwe_results = data["srwe_results"]
c2_evidence = data["c2_evidence"]
within_corrs = data["within_corrs"]
weights = data["weights"]

print(f"Loaded {len(studies)} SRI-gap correlation studies")
print(f"Loaded {len(scorecard)} scorecard entries")
print(f"Loaded {len(gap_reductions)} gap reduction conditions")
print(f"Loaded {len(srwe_results)} SRWE performance results")

Loaded 26 SRI-gap correlation studies
Loaded 18 scorecard entries
Loaded 11 gap reduction conditions
Loaded 9 SRWE performance results


## Configuration

Hypothesis criteria weights and thresholds for adjudication.

In [5]:
# ── Hypothesis criteria weights ──
C1_WEIGHT = weights["C1_WEIGHT"]  # 0.35 — SRI-gap correlation ρ > 0.5
C2_WEIGHT = weights["C2_WEIGHT"]  # 0.20 — Aliased pairs distinguishability
C3_WEIGHT = weights["C3_WEIGHT"]  # 0.25 — SRWE ≥50% gap reduction
D1_WEIGHT = weights["D1_WEIGHT"]  # 0.10 — ρ < 0.2 (disconfirmation)
D2_WEIGHT = weights["D2_WEIGHT"]  # 0.05 — Resolution mismatch
D3_WEIGHT = weights["D3_WEIGHT"]  # 0.05 — No SRWE improvement

# ── Thresholds ──
C1_RHO_THRESHOLD = 0.5    # Confirmation: ρ must exceed this
D1_RHO_THRESHOLD = 0.2    # Disconfirmation: ρ below this
TIE_THRESHOLD = 0.02      # Relative threshold for win/loss/tie
BONFERRONI_ALPHA = 0.0083  # Bonferroni-corrected significance level

# ── Figure settings ──
FIGURE_DPI = 150
MAX_SCORECARD_ROWS = 20  # Max rows in scorecard heatmap
MAX_BAR_CHART_DATASETS = 3  # Max datasets in encoding bar chart

print(f"Weights: C1={C1_WEIGHT}, C2={C2_WEIGHT}, C3={C3_WEIGHT}, D1={D1_WEIGHT}, D2={D2_WEIGHT}, D3={D3_WEIGHT}")
print(f"Thresholds: C1={C1_RHO_THRESHOLD}, D1={D1_RHO_THRESHOLD}")

Weights: C1=0.35, C2=0.2, C3=0.25, D1=0.1, D2=0.05, D3=0.05
Thresholds: C1=0.5, D1=0.2
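
The two less obvious constants can be read as follows. This is a minimal sketch, assuming that `BONFERRONI_ALPHA` comes from dividing a family-wise alpha of 0.05 across the 6 criteria, and that `TIE_THRESHOLD` is applied to the relative difference between two scores. Neither derivation is stated in the notebook, and the helper names (`bonferroni_alpha`, `classify_outcome`) and the `lower_is_better` flag are hypothetical.

```python
def bonferroni_alpha(family_alpha: float = 0.05, n_tests: int = 6) -> float:
    """Per-test significance level under a Bonferroni correction (assumed 0.05 / 6)."""
    return family_alpha / n_tests


def classify_outcome(score_a: float, score_b: float,
                     tie_threshold: float = 0.02,
                     lower_is_better: bool = True) -> str:
    """Classify A vs. B as 'win', 'loss', or 'tie' using a relative threshold.

    Hypothetical helper: the notebook loads a precomputed scorecard and does
    not show how wins and ties were originally assigned.
    """
    denom = max(abs(score_a), abs(score_b), 1e-12)  # guard against zero scores
    rel_diff = (score_a - score_b) / denom
    if abs(rel_diff) <= tie_threshold:
        return "tie"
    a_is_better = rel_diff < 0 if lower_is_better else rel_diff > 0
    return "win" if a_is_better else "loss"


print(round(bonferroni_alpha(), 4))
print(classify_outcome(0.100, 0.101))  # relative difference under 2%
print(classify_outcome(0.080, 0.100))  # A has the lower (better) error
```

Under these assumptions, 0.05 / 6 rounds to the configured 0.0083, and two scores within 2% of each other count as a tie rather than a win or a loss.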


## 1. Fisher z Random-Effects Meta-Analysis

Pools SRI-gap correlations across all experiments using Fisher z transformation with DerSimonian-Laird random-effects weighting.

In [6]:
def fisher_z_meta_analysis(studies: list[dict]) -> dict:
    """Compute random-effects meta-analysis via Fisher z transformation."""
    if not studies:
        # Same keys as the normal return so callers can index safely.
        return {"rho_pooled": 0.0, "ci_low": 0.0, "ci_high": 0.0, "I2": 0.0,
                "Q": 0.0, "Q_pvalue": 1.0, "tau2": 0.0, "k": 0}

    rhos = np.array([s["rho"] for s in studies])
    ns = np.array([s["n"] for s in studies])

    # Fisher z transformation
    # Clip rhos to avoid arctanh(±1)
    rhos_clipped = np.clip(rhos, -0.999, 0.999)
    zs = np.arctanh(rhos_clipped)
    vs = 1.0 / (ns - 3.0)  # Variance of Fisher z
    ws = 1.0 / vs  # Fixed-effect weights

    k = len(studies)

    # Fixed-effect pooled estimate
    z_fe = np.sum(ws * zs) / np.sum(ws)

    # Cochran's Q
    Q = np.sum(ws * (zs - z_fe) ** 2)
    df = k - 1
    # Survival function avoids the precision loss of 1 - cdf for very small p-values
    Q_pvalue = stats.chi2.sf(Q, df) if df > 0 else 1.0

    # DerSimonian-Laird tau^2
    C = np.sum(ws) - np.sum(ws**2) / np.sum(ws)
    tau2 = max(0, (Q - df) / C) if C > 0 else 0

    # Random-effects weights
    ws_re = 1.0 / (vs + tau2)
    z_re = np.sum(ws_re * zs) / np.sum(ws_re)
    se_re = 1.0 / np.sqrt(np.sum(ws_re))

    # Back-transform
    rho_pooled = np.tanh(z_re)
    ci_low = np.tanh(z_re - 1.96 * se_re)
    ci_high = np.tanh(z_re + 1.96 * se_re)

    # I² heterogeneity
    I2 = max(0, (Q - df) / Q * 100) if Q > 0 else 0

    return {
        "rho_pooled": float(rho_pooled),
        "ci_low": float(ci_low),
        "ci_high": float(ci_high),
        "I2": float(I2),
        "Q": float(Q),
        "Q_pvalue": float(Q_pvalue),
        "tau2": float(tau2),
        "k": k,
    }

# Run meta-analysis
pooled = fisher_z_meta_analysis(studies)
print(f"Pooled ρ = {pooled['rho_pooled']:.4f} [{pooled['ci_low']:.4f}, {pooled['ci_high']:.4f}]")
print(f"I² = {pooled['I2']:.1f}%, Q p-value = {pooled['Q_pvalue']:.4f}")
print(f"Number of studies: {pooled['k']}")

Pooled ρ = 0.1526 [0.0204, 0.2796]
I² = 99.0%, Q p-value = 0.0000
Number of studies: 26
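
To make the transformation concrete, the sketch below works through a single study. The values `rho = 0.30` and `n = 103` are made up for illustration and are not taken from the loaded data. The formulas z = arctanh(ρ) and Var(z) = 1/(n − 3) are the ones used in `fisher_z_meta_analysis` above; the helper name `fisher_ci` is hypothetical.

```python
import numpy as np


def fisher_ci(rho: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI for one correlation via the Fisher z transformation."""
    z = np.arctanh(rho)            # map rho onto the z scale
    se = 1.0 / np.sqrt(n - 3)      # standard error of z
    return float(np.tanh(z - z_crit * se)), float(np.tanh(z + z_crit * se))


lo, hi = fisher_ci(rho=0.30, n=103)  # illustrative values only
print(f"rho = 0.30, n = 103 -> 95% CI [{lo:.3f}, {hi:.3f}]")
```

In the pooled analysis, each study's variance 1/(n − 3) is additionally inflated by τ² before weighting, so when heterogeneity is large the random-effects weights move toward equality across studies.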


## 2. Subgroup Meta-Analysis

Breaks down the pooled correlation by dataset domain, architecture type, and metric type.

In [7]:
def subgroup_meta_analysis(studies: list[dict]) -> dict:
    """Compute subgroup meta-analyses by dataset domain, architecture, metric type."""
    results = {}

    # By dataset domain
    domain_map = {
        "ZINC-subset": "molecular", "zinc": "molecular",
        "Peptides-func": "protein", "peptides_func": "protein",
        "Peptides-struct": "protein", "peptides_struct": "protein",
        "Synthetic-aliased-pairs": "synthetic", "synthetic_fixed_n30": "synthetic",
        "Multi-dataset": "mixed",
    }

    by_domain = {}
    for s in studies:
        domain = domain_map.get(s["dataset"], "other")
        by_domain.setdefault(domain, []).append(s)

    results["by_domain"] = {}
    for domain, domain_studies in by_domain.items():
        if len(domain_studies) >= 2:
            results["by_domain"][domain] = fisher_z_meta_analysis(domain_studies)

    # By architecture
    by_arch = {}
    for s in studies:
        arch = s["architecture"].split("_")[0]  # Normalize
        by_arch.setdefault(arch, []).append(s)

    results["by_architecture"] = {}
    for arch, arch_studies in by_arch.items():
        if len(arch_studies) >= 2:
            results["by_architecture"][arch] = fisher_z_meta_analysis(arch_studies)

    # By metric type
    by_metric = {}
    for s in studies:
        by_metric.setdefault(s["metric_type"], []).append(s)

    results["by_metric_type"] = {}
    for mt, mt_studies in by_metric.items():
        if len(mt_studies) >= 2:
            results["by_metric_type"][mt] = fisher_z_meta_analysis(mt_studies)

    return results

subgroup = subgroup_meta_analysis(studies)
for category, sub_results in subgroup.items():
    print(f"\n{category}:")
    for name, res in sub_results.items():
        print(f"  {name}: ρ={res['rho_pooled']:.4f} [{res['ci_low']:.4f}, {res['ci_high']:.4f}], k={res['k']}")


by_domain:
  molecular: ρ=0.0826 [0.0044, 0.1598], k=9
  protein: ρ=0.1742 [-0.0909, 0.4164], k=13
  synthetic: ρ=0.2734 [-0.3022, 0.7029], k=3

by_architecture:
  model: ρ=0.5600 [0.2075, 0.7838], k=5
  MLP: ρ=0.0207 [-0.0774, 0.1183], k=4
  GPS: ρ=-0.0302 [-0.1212, 0.0614], k=7
  GCN: ρ=0.1003 [0.0710, 0.1295], k=10

by_metric_type:
  node_distinguishability: ρ=0.5600 [0.2075, 0.7838], k=5
  task_performance: ρ=0.0480 [0.0106, 0.0852], k=21


## 3. Moderator Analysis

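Tests whether dataset domain, architecture type, or metric type significantly moderates the SRI-gap correlation. Domain and architecture are compared with the Kruskal-Wallis test, and metric type (two groups) with the Mann-Whitney U test. Each test runs on the raw per-study ρ values, skips groups with fewer than 2 studies, and is judged against `BONFERRONI_ALPHA`.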

In [8]:
def compute_moderator_effects(studies: list[dict]) -> dict:
    """Analyze moderator effects on SRI-gap correlation."""
    moderators = {}

    # 1. Dataset domain
    domain_map = {
        "ZINC-subset": "molecular", "zinc": "molecular",
        "Peptides-func": "protein", "peptides_func": "protein",
        "Peptides-struct": "protein", "peptides_struct": "protein",
        "Synthetic-aliased-pairs": "synthetic", "synthetic_fixed_n30": "synthetic",
    }

    domains = [domain_map.get(s["dataset"], "other") for s in studies]
    rhos = [s["rho"] for s in studies]

    if len(set(domains)) >= 2:
        domain_groups = {}
        for d, r in zip(domains, rhos):
            domain_groups.setdefault(d, []).append(r)

        groups = [np.array(v) for v in domain_groups.values() if len(v) >= 2]
        if len(groups) >= 2:
            try:
                h_stat, kw_p = stats.kruskal(*groups)
                n_total = sum(len(g) for g in groups)
                eta2 = (h_stat - len(groups) + 1) / (n_total - len(groups)) if n_total > len(groups) else 0
                moderators["dataset_domain"] = {
                    "H_statistic": float(h_stat),
                    "p_value": float(kw_p),
                    "eta_squared": float(max(0, eta2)),
                    "n_groups": len(groups),
                    "significant": kw_p < BONFERRONI_ALPHA,
                }
            except Exception:
                pass

    # 2. Architecture type
    archs = [s["architecture"].split("_")[0] for s in studies]
    if len(set(archs)) >= 2:
        arch_groups = {}
        for a, r in zip(archs, rhos):
            arch_groups.setdefault(a, []).append(r)
        groups = [np.array(v) for v in arch_groups.values() if len(v) >= 2]
        if len(groups) >= 2:
            try:
                h_stat, kw_p = stats.kruskal(*groups)
                n_total = sum(len(g) for g in groups)
                eta2 = (h_stat - len(groups) + 1) / (n_total - len(groups)) if n_total > len(groups) else 0
                moderators["architecture"] = {
                    "H_statistic": float(h_stat),
                    "p_value": float(kw_p),
                    "eta_squared": float(max(0, eta2)),
                    "n_groups": len(groups),
                    "significant": kw_p < BONFERRONI_ALPHA,
                }
            except Exception:
                pass

    # 3. Metric type (node distinguishability vs task performance)
    metric_groups = {}
    for s in studies:
        metric_groups.setdefault(s["metric_type"], []).append(s["rho"])
    if len(metric_groups) >= 2:
        groups = [np.array(v) for v in metric_groups.values() if len(v) >= 2]
        if len(groups) >= 2:
            try:
                u_stat, mw_p = stats.mannwhitneyu(groups[0], groups[1], alternative="two-sided")
                moderators["metric_type"] = {
                    "U_statistic": float(u_stat),
                    "p_value": float(mw_p),
                    "significant": mw_p < BONFERRONI_ALPHA,
                }
            except Exception:
                pass

    return moderators

moderators = compute_moderator_effects(studies)
for mod_name, mod_data in moderators.items():
    sig_str = "SIGNIFICANT" if mod_data.get("significant") else "not significant"
    print(f"  {mod_name}: p={mod_data['p_value']:.4f} ({sig_str})")

  dataset_domain: p=0.8189 (not significant)
  architecture: p=0.0010 (SIGNIFICANT)
  metric_type: p=0.0009 (SIGNIFICANT)
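
The effect size stored as `eta_squared` follows η² = (H − k + 1) / (n − k), where H is the Kruskal-Wallis statistic, k the number of groups, and n the total number of studies. A minimal sketch on made-up correlations (the three groups and the helper name `kruskal_eta_squared` are hypothetical, not drawn from the loaded data):

```python
import numpy as np
from scipy import stats


def kruskal_eta_squared(*groups: np.ndarray) -> tuple[float, float, float]:
    """Return (H, p, eta^2) with eta^2 = (H - k + 1) / (n - k), floored at 0."""
    h_stat, p_value = stats.kruskal(*groups)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    eta2 = (h_stat - k + 1) / (n_total - k) if n_total > k else 0.0
    return float(h_stat), float(p_value), float(max(0.0, eta2))


# Made-up per-study correlations for three hypothetical groups.
group_a = np.array([0.55, 0.60, 0.48, 0.62])
group_b = np.array([0.02, -0.05, 0.10, 0.04])
group_c = np.array([0.08, 0.12, 0.09, 0.15])

h, p, eta2 = kruskal_eta_squared(group_a, group_b, group_c)
print(f"H = {h:.2f}, p = {p:.4f}, eta^2 = {eta2:.2f}")
```

Because the test is rank-based, it compares where each group's ρ values fall in the overall ordering and ignores each study's sample size, unlike the inverse-variance weighting in the subgroup meta-analysis.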


## 4. Criterion Adjudication

Formally adjudicates all 6 hypothesis criteria (C1–C3 confirmation, D1–D3 disconfirmation) and computes a weighted overall verdict.

In [9]:
def adjudicate_criteria(pooled: dict, studies: list[dict],
                        gap_reductions: dict, scorecard: dict,
                        c2_evidence: list, within_corrs: list) -> dict:
    """Formally adjudicate all 6 hypothesis criteria."""
    verdicts = {}

    # ── C1: ρ > 0.5 (Confirmation Criterion 1) ──
    rho = pooled["rho_pooled"]
    ci_low = pooled["ci_low"]
    if rho > C1_RHO_THRESHOLD:
        c1_verdict = "confirmed"
        c1_confidence = min(1.0, (rho - 0.5) / 0.3 + 0.5)
    elif rho > D1_RHO_THRESHOLD:
        # Covers the whole (0.2, 0.5] band; confidence scales linearly with rho.
        c1_verdict = "partially_confirmed"
        c1_confidence = (rho - 0.2) / 0.3
    else:
        c1_verdict = "not_confirmed"
        c1_confidence = max(0, rho / 0.2)

    if ci_low > 0:
        c1_confidence = min(1.0, c1_confidence + 0.1)
    if ci_low < 0:
        c1_confidence = max(0, c1_confidence - 0.2)

    verdicts["C1_sri_gap_correlation"] = {
        "criterion": "Spearman rho(SRI, RWSE-LapPE gap) > 0.5",
        "verdict": c1_verdict,
        "confidence": round(float(c1_confidence), 3),
        "evidence": f"Pooled rho = {rho:.3f}, 95% CI [{ci_low:.3f}, {pooled['ci_high']:.3f}], I2 = {pooled['I2']:.1f}%",
    }

    # ── C2: Aliased pairs distinguishability ──
    if c2_evidence:
        mean_distinguish = np.mean(c2_evidence)
        if mean_distinguish > 75:
            c2_verdict = "confirmed"
            c2_confidence = min(1.0, (mean_distinguish - 50) / 50)
        elif mean_distinguish > 50:
            c2_verdict = "partially_confirmed"
            c2_confidence = (mean_distinguish - 50) / 50
        else:
            c2_verdict = "not_confirmed"
            c2_confidence = mean_distinguish / 100
    else:
        c2_verdict = "partially_confirmed"
        c2_confidence = 0.5
        mean_distinguish = 0

    verdicts["C2_aliased_pairs_distinguishability"] = {
        "criterion": "SRWE better than RWSE at distinguishing aliased graph pairs",
        "verdict": c2_verdict,
        "confidence": round(float(c2_confidence), 3),
        "evidence": f"Mean distinguishability: {mean_distinguish:.1f}% across {len(c2_evidence)} categories",
    }

    # ── C3: SRWE >= 50% gap reduction ──
    gap_red_values = []
    for k, v in gap_reductions.items():
        gr = v.get("gap_reduction_fraction", None)
        if gr is not None and not math.isnan(gr) and -5 < gr < 5:
            gap_red_values.append(gr)

    if gap_red_values:
        mean_gap_red = np.mean(gap_red_values)
        n_above_50 = sum(1 for g in gap_red_values if g >= 0.5)
        frac_above_50 = n_above_50 / len(gap_red_values)
        if frac_above_50 >= 0.5 and mean_gap_red >= 0.5:
            c3_verdict = "confirmed"
            c3_confidence = min(1.0, mean_gap_red)
        elif frac_above_50 >= 0.25 or mean_gap_red >= 0.3:
            c3_verdict = "partially_confirmed"
            c3_confidence = max(frac_above_50, mean_gap_red)
        else:
            c3_verdict = "not_confirmed"
            c3_confidence = max(0, mean_gap_red)
    else:
        c3_verdict = "not_confirmed"
        c3_confidence = 0.0
        mean_gap_red = 0
        frac_above_50 = 0

    verdicts["C3_srwe_gap_reduction"] = {
        "criterion": "SRWE achieves >=50% gap reduction on low-SRI graphs",
        "verdict": c3_verdict,
        "confidence": round(float(c3_confidence), 3),
        "evidence": f"Mean gap reduction: {mean_gap_red:.3f}, {frac_above_50*100:.0f}% conditions above 50%",
    }

    # ── D1: ρ < 0.2 (Disconfirmation) ──
    if rho < D1_RHO_THRESHOLD:
        d1_verdict = "disconfirmed"
        d1_confidence = 1.0 - rho / 0.2
    elif rho < 0.35:
        d1_verdict = "partially_confirmed"
        d1_confidence = 0.5
    else:
        d1_verdict = "not_confirmed"
        d1_confidence = max(0, 1.0 - (rho - 0.2) / 0.3)

    verdicts["D1_weak_correlation"] = {
        "criterion": "Spearman rho(SRI, gap) < 0.2 -> theory disconfirmed",
        "verdict": d1_verdict,
        "confidence": round(float(d1_confidence), 3),
        "evidence": f"Pooled rho = {rho:.3f}",
    }

    # ── D2: Resolution mismatch ──
    if within_corrs:
        mean_within = np.mean(np.abs(within_corrs))
        d2_verdict = "not_confirmed" if mean_within > 0.05 else "disconfirmed"
        d2_confidence = 0.5
    else:
        d2_verdict = "not_confirmed"
        d2_confidence = 0.3
        mean_within = 0

    verdicts["D2_resolution_mismatch"] = {
        "criterion": "SRI correlation disappears when controlling for graph size",
        "verdict": d2_verdict,
        "confidence": round(float(d2_confidence), 3),
        "evidence": f"Mean within-size-bin |rho| = {mean_within:.3f}",
    }

    # ── D3: No SRWE improvement ──
    total_wins = 0
    total_losses = 0
    for key, sc in scorecard.items():
        if "srwe" in key.lower() and "rwse" in key.lower():
            total_wins += sc["wins"]
            total_losses += sc["losses"]

    if total_wins + total_losses > 0:
        win_rate = total_wins / (total_wins + total_losses)
        if win_rate < 0.3:
            d3_verdict = "disconfirmed"
            d3_confidence = 1.0 - win_rate
        elif win_rate < 0.5:
            d3_verdict = "partially_confirmed"
            d3_confidence = 0.5
        else:
            d3_verdict = "not_confirmed"
            d3_confidence = win_rate
    else:
        d3_verdict = "not_confirmed"
        d3_confidence = 0.3
        win_rate = 0.5

    verdicts["D3_no_srwe_improvement"] = {
        "criterion": "SRWE shows no systematic improvement over RWSE",
        "verdict": d3_verdict,
        "confidence": round(float(d3_confidence), 3),
        "evidence": f"SRWE vs RWSE win rate: {win_rate:.1%} ({total_wins}W/{total_losses}L)",
    }

    return verdicts


def compute_overall_verdict(verdicts: dict) -> dict:
    """Compute weighted overall hypothesis verdict."""
    score_map = {
        "confirmed": 1.0,
        "partially_confirmed": 0.5,
        "not_confirmed": 0.0,
        "disconfirmed": -0.5,
    }

    c1_score = score_map.get(verdicts["C1_sri_gap_correlation"]["verdict"], 0) * verdicts["C1_sri_gap_correlation"]["confidence"]
    c2_score = score_map.get(verdicts["C2_aliased_pairs_distinguishability"]["verdict"], 0) * verdicts["C2_aliased_pairs_distinguishability"]["confidence"]
    c3_score = score_map.get(verdicts["C3_srwe_gap_reduction"]["verdict"], 0) * verdicts["C3_srwe_gap_reduction"]["confidence"]

    d1_score = score_map.get(verdicts["D1_weak_correlation"]["verdict"], 0) * verdicts["D1_weak_correlation"]["confidence"]
    d2_score = score_map.get(verdicts["D2_resolution_mismatch"]["verdict"], 0) * verdicts["D2_resolution_mismatch"]["confidence"]
    d3_score = score_map.get(verdicts["D3_no_srwe_improvement"]["verdict"], 0) * verdicts["D3_no_srwe_improvement"]["confidence"]

    confirmation_score = (C1_WEIGHT * c1_score + C2_WEIGHT * c2_score + C3_WEIGHT * c3_score) / (C1_WEIGHT + C2_WEIGHT + C3_WEIGHT)
    disconfirmation_penalty = (D1_WEIGHT * d1_score + D2_WEIGHT * d2_score + D3_WEIGHT * d3_score) / (D1_WEIGHT + D2_WEIGHT + D3_WEIGHT)

    overall_score = confirmation_score * 0.7 + (1.0 + disconfirmation_penalty) * 0.3

    if overall_score >= 0.6:
        overall_verdict = "confirmed"
    elif overall_score >= 0.4:
        overall_verdict = "partially_confirmed"
    elif overall_score >= 0.2:
        overall_verdict = "weakly_supported"
    else:
        overall_verdict = "not_confirmed"

    overall_confidence = min(1.0, max(0.0, overall_score))

    return {
        "overall_verdict": overall_verdict,
        "overall_confidence": round(float(overall_confidence), 3),
        "overall_score": round(float(overall_score), 3),
        "confirmation_score": round(float(confirmation_score), 3),
        "disconfirmation_penalty": round(float(disconfirmation_penalty), 3),
    }

# Run adjudication
verdicts = adjudicate_criteria(pooled, studies, gap_reductions, scorecard, c2_evidence, within_corrs)
overall = compute_overall_verdict(verdicts)

print(f"Overall verdict: {overall['overall_verdict']} (score={overall['overall_score']:.3f})")
print(f"Confirmation score: {overall['confirmation_score']:.3f}")
print(f"Disconfirmation penalty: {overall['disconfirmation_penalty']:.3f}")
print()
for key, v in verdicts.items():
    print(f"  {key}: {v['verdict']} (conf={v['confidence']:.2f})")

Overall verdict: partially_confirmed (score=0.458)
Confirmation score: 0.224
Disconfirmation penalty: 0.003

  C1_sri_gap_correlation: not_confirmed (conf=0.86)
  C2_aliased_pairs_distinguishability: confirmed (conf=0.67)
  C3_srwe_gap_reduction: partially_confirmed (conf=0.36)
  D1_weak_correlation: disconfirmed (conf=0.24)
  D2_resolution_mismatch: not_confirmed (conf=0.50)
  D3_no_srwe_improvement: partially_confirmed (conf=0.50)
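
The overall score can be re-derived by hand from the printed verdicts. The sketch below hard-codes the verdicts and two-decimal confidences from the output above, together with the configured weights. Because those confidences are rounded, the result agrees with the reported 0.458 only to about two decimal places. The variable and function names are illustrative.

```python
SCORE = {"confirmed": 1.0, "partially_confirmed": 0.5,
         "not_confirmed": 0.0, "disconfirmed": -0.5}

# (weight, verdict, confidence as printed above, rounded to 2 decimals)
confirmation = [(0.35, "not_confirmed", 0.86),
                (0.20, "confirmed", 0.67),
                (0.25, "partially_confirmed", 0.36)]
disconfirmation = [(0.10, "disconfirmed", 0.24),
                   (0.05, "not_confirmed", 0.50),
                   (0.05, "partially_confirmed", 0.50)]


def weighted_mean(rows):
    """Weight-normalized mean of (verdict score x confidence)."""
    total_w = sum(w for w, _, _ in rows)
    return sum(w * SCORE[v] * c for w, v, c in rows) / total_w


conf_score = weighted_mean(confirmation)
penalty = weighted_mean(disconfirmation)
overall = 0.7 * conf_score + 0.3 * (1.0 + penalty)
print(f"confirmation={conf_score:.3f}, penalty={penalty:.3f}, overall={overall:.3f}")
```

The 0.3 × (1 + penalty) term contributes roughly 0.30 whenever the penalty is near zero, so a confirmation score of about 0.224 still lands in the `partially_confirmed` band (0.4 to 0.6).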


## 5. Scope-of-Validity Analysis

Classifies each study as a condition where the Walk Resolution Limit (WRL) theory works (|ρ| > 0.2 and p < 0.05) or fails, then lists the dataset domains that appear on each side.

In [10]:
def compute_scope_of_validity(studies: list[dict]) -> dict:
    """Build scope-of-validity assessment."""
    theory_works = []
    theory_fails = []

    for s in studies:
        works = abs(s["rho"]) > D1_RHO_THRESHOLD and s["p_value"] < 0.05
        entry = {
            "dataset": s["dataset"],
            "architecture": s["architecture"],
            "rho": s["rho"],
            "p_value": s["p_value"],
        }
        if works:
            theory_works.append(entry)
        else:
            theory_fails.append(entry)

    total = len(theory_works) + len(theory_fails)
    accuracy = len(theory_works) / total if total > 0 else 0

    domain_map = {
        "ZINC-subset": "molecular", "zinc": "molecular",
        "Peptides-func": "protein", "peptides_func": "protein",
        "Peptides-struct": "protein", "peptides_struct": "protein",
        "Synthetic-aliased-pairs": "synthetic", "synthetic_fixed_n30": "synthetic",
    }

    works_domains = set()
    fails_domains = set()
    for w in theory_works:
        works_domains.add(domain_map.get(w["dataset"], "other"))
    for f in theory_fails:
        fails_domains.add(domain_map.get(f["dataset"], "other"))

    return {
        "n_conditions_works": len(theory_works),
        "n_conditions_fails": len(theory_fails),
        "directional_accuracy": round(float(accuracy), 3),
        "domains_where_works": sorted(works_domains),
        "domains_where_fails": sorted(fails_domains),
        "works_conditions": theory_works[:5],
        "fails_conditions": theory_fails[:5],
    }

scope = compute_scope_of_validity(studies)
print(f"Theory directional accuracy: {scope['directional_accuracy']:.1%}")
print(f"Conditions where theory works: {scope['n_conditions_works']}")
print(f"Conditions where theory fails: {scope['n_conditions_fails']}")
print(f"Domains where works: {scope['domains_where_works']}")
print(f"Domains where fails: {scope['domains_where_fails']}")

Theory directional accuracy: 19.2%
Conditions where theory works: 5
Conditions where theory fails: 21
Domains where works: ['molecular', 'protein', 'synthetic']
Domains where fails: ['molecular', 'other', 'protein', 'synthetic']
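
The classification rule is simple enough to check on a few rows. This sketch mirrors the condition in `compute_scope_of_validity` on hypothetical studies (the `toy` list and the helper name `theory_works` are illustrative, not part of the loaded data):

```python
D1_RHO_THRESHOLD = 0.2  # same value as in the Configuration cell


def theory_works(rho: float, p_value: float, alpha: float = 0.05) -> bool:
    """Mirror of the rule in compute_scope_of_validity."""
    return abs(rho) > D1_RHO_THRESHOLD and p_value < alpha


# Hypothetical studies, not taken from the loaded data.
toy = [
    {"rho": 0.45, "p_value": 0.001},   # strong and significant
    {"rho": 0.10, "p_value": 0.001},   # significant but too weak
    {"rho": 0.30, "p_value": 0.200},   # large but not significant
    {"rho": -0.40, "p_value": 0.010},  # negative, yet |rho| passes
]
flags = [theory_works(s["rho"], s["p_value"]) for s in toy]
print(flags, f"accuracy = {sum(flags) / len(flags):.1%}")
```

Since the rule uses the absolute value of ρ, a significant negative correlation is also counted as a condition where the theory "works", which is worth keeping in mind when reading the 19.2% (5 of 26) figure as a directional accuracy.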


## 6. Figures

Publication-quality visualizations: forest plot, verdict dashboard, SRWE scorecard, moderator ranking, scope-of-validity scatter, encoding selection decision tree, and encoding performance bars.

In [11]:
# ── Figure 1: Forest Plot ──
def plot_forest(studies, pooled):
    fig, ax = plt.subplots(figsize=(12, max(8, len(studies) * 0.35 + 2)))
    y_positions = list(range(len(studies)))
    labels, rhos, ci_lows, ci_highs = [], [], [], []

    for s in studies:
        n = s["n"]
        rho = s["rho"]
        se = 1.0 / max(np.sqrt(n - 3), 1)
        z = np.arctanh(np.clip(rho, -0.999, 0.999))
        ci_l = np.tanh(z - 1.96 * se)
        ci_h = np.tanh(z + 1.96 * se)
        rhos.append(rho)
        ci_lows.append(ci_l)
        ci_highs.append(ci_h)
        labels.append(f"{s['experiment']}|{s['dataset'][:15]}|{s['architecture'][:10]}")

    for y, rho, cl, ch in zip(y_positions, rhos, ci_lows, ci_highs):
        ax.plot([cl, ch], [y, y], "b-", linewidth=1, alpha=0.7)
        ax.plot(rho, y, "bs", markersize=5)

    y_pooled = len(studies)
    diamond_x = [pooled["ci_low"], pooled["rho_pooled"], pooled["ci_high"], pooled["rho_pooled"]]
    diamond_y = [y_pooled, y_pooled + 0.3, y_pooled, y_pooled - 0.3]
    ax.fill(diamond_x, diamond_y, color="red", alpha=0.6)
    labels.append(f"POOLED (k={pooled['k']})")

    ax.set_yticks(list(range(len(labels))))
    ax.set_yticklabels(labels, fontsize=7)
    ax.axvline(0, color="gray", linestyle="--", alpha=0.5)
    ax.axvline(C1_RHO_THRESHOLD, color="green", linestyle=":", alpha=0.5, label=f"C1 threshold (rho={C1_RHO_THRESHOLD})")
    ax.axvline(D1_RHO_THRESHOLD, color="orange", linestyle=":", alpha=0.5, label=f"D1 threshold (rho={D1_RHO_THRESHOLD})")
    ax.set_xlabel("Spearman rho (SRI vs RWSE-LapPE gap)")
    ax.set_title("Forest Plot: SRI-Gap Correlations Across All Experiments")
    ax.legend(fontsize=8)
    ax.set_xlim(-0.8, 1.0)
    plt.tight_layout()
    fig.savefig("forest_plot.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_forest(studies, pooled)

In [12]:
# ── Figure 2: Verdict Dashboard ──
def plot_verdict_dashboard(verdicts, overall):
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))
    criteria = [
        ("C1_sri_gap_correlation", "C1: SRI-Gap rho > 0.5"),
        ("C2_aliased_pairs_distinguishability", "C2: Aliased Distinguishability"),
        ("C3_srwe_gap_reduction", "C3: SRWE Gap Reduction >=50%"),
        ("D1_weak_correlation", "D1: Weak Correlation (rho < 0.2)"),
        ("D2_resolution_mismatch", "D2: Resolution Mismatch"),
        ("D3_no_srwe_improvement", "D3: No SRWE Improvement"),
    ]
    color_map = {
        "confirmed": "#2ecc71", "partially_confirmed": "#f39c12",
        "not_confirmed": "#e74c3c", "disconfirmed": "#8e44ad",
    }
    for idx, (key, title) in enumerate(criteria):
        ax = axes[idx // 3][idx % 3]
        v = verdicts.get(key, {})
        verdict = v.get("verdict", "not_confirmed")
        confidence = v.get("confidence", 0)
        color = color_map.get(verdict, "#95a5a6")
        ax.barh([0], [confidence], color=color, height=0.5, alpha=0.8)
        ax.set_xlim(0, 1.1)
        ax.set_title(title, fontsize=9, fontweight="bold")
        ax.set_yticks([])
        ax.text(confidence + 0.02, 0, f"{verdict}\n({confidence:.2f})", va="center", fontsize=8)

    fig.suptitle(
        f"Walk Resolution Limit Hypothesis: {overall['overall_verdict'].upper()} "
        f"(score={overall['overall_score']:.2f})",
        fontsize=13, fontweight="bold",
    )
    plt.tight_layout()
    fig.savefig("verdict_dashboard.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_verdict_dashboard(verdicts, overall)

In [13]:
# ── Figure 3: SRWE Scorecard ──
def plot_srwe_scorecard_heatmap(scorecard):
    if not scorecard:
        print("No scorecard data")
        return
    keys = sorted(scorecard.keys())[:MAX_SCORECARD_ROWS]
    data_arr = []
    labels = []
    for k in keys:
        sc = scorecard[k]
        win_rate = sc["wins"] / max(sc["total"], 1)
        data_arr.append([sc["wins"], sc["losses"], sc["ties"], win_rate])
        labels.append(k.replace("_vs_", "\nvs "))

    data_arr = np.array(data_arr)
    fig, ax = plt.subplots(figsize=(10, max(6, len(labels) * 0.5)))
    bar_height = 0.6
    y_pos = np.arange(len(labels))
    ax.barh(y_pos, data_arr[:, 0], bar_height, color="#2ecc71", label="Wins")
    ax.barh(y_pos, data_arr[:, 1], bar_height, left=data_arr[:, 0], color="#e74c3c", label="Losses")
    ax.barh(y_pos, data_arr[:, 2], bar_height, left=data_arr[:, 0] + data_arr[:, 1], color="#95a5a6", label="Ties")
    ax.set_yticks(y_pos)
    ax.set_yticklabels(labels, fontsize=7)
    ax.set_xlabel("Count")
    ax.set_title("SRWE Win/Loss/Tie Scorecard")
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig("srwe_scorecard.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_srwe_scorecard_heatmap(scorecard)

In [14]:
# ── Figure 4: Moderator Ranking ──
def plot_moderator_ranking(moderators):
    fig, ax = plt.subplots(figsize=(8, 5))
    names, p_values = [], []
    for mod_name, mod_data in moderators.items():
        if isinstance(mod_data, dict) and "p_value" in mod_data:
            names.append(mod_name)
            p_values.append(mod_data["p_value"])

    if not names:
        ax.text(0.5, 0.5, "No moderator data", ha="center", va="center")
        plt.show()
        plt.close(fig)
        return

    neg_log_p = [-np.log10(max(p, 1e-300)) for p in p_values]
    sorted_idx = np.argsort(neg_log_p)[::-1]
    y_pos = np.arange(len(names))
    colors = ["#e74c3c" if p < BONFERRONI_ALPHA else "#3498db" for p in p_values]

    ax.barh(y_pos, [neg_log_p[i] for i in sorted_idx], color=[colors[i] for i in sorted_idx])
    ax.set_yticks(y_pos)
    ax.set_yticklabels([names[i] for i in sorted_idx])
    ax.axvline(-np.log10(BONFERRONI_ALPHA), color="red", linestyle="--", label=f"Bonferroni alpha={BONFERRONI_ALPHA}")
    ax.set_xlabel("-log10(p-value)")
    ax.set_title("Moderator Importance for SRI-Gap Correlation")
    ax.legend(fontsize=8)
    plt.tight_layout()
    fig.savefig("moderator_ranking.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_moderator_ranking(moderators)

In [15]:
# ── Figure 5: Scope-of-Validity Scatter ──
def plot_scope_validity(studies):
    fig, ax = plt.subplots(figsize=(10, 7))
    domain_map = {
        "ZINC-subset": "molecular", "zinc": "molecular",
        "Peptides-func": "protein", "peptides_func": "protein",
        "Peptides-struct": "protein", "peptides_struct": "protein",
        "Synthetic-aliased-pairs": "synthetic", "synthetic_fixed_n30": "synthetic",
    }
    color_map = {"molecular": "#3498db", "protein": "#2ecc71", "synthetic": "#e74c3c", "other": "#95a5a6", "mixed": "#9b59b6"}
    marker_map = {"model": "o", "MLP": "s", "GPS": "D", "GCN": "^"}

    for s in studies:
        domain = domain_map.get(s["dataset"], "other")
        arch = s["architecture"].split("_")[0]
        color = color_map.get(domain, "#95a5a6")
        marker = marker_map.get(arch, "o")
        alpha = 0.8 if s["p_value"] < 0.05 else 0.3
        size = max(20, min(200, s["n"] / 5))
        ax.scatter(s["n"], s["rho"], c=color, marker=marker, s=size, alpha=alpha,
                   edgecolors="black" if s["p_value"] < 0.05 else "none", linewidth=0.5)

    ax.axhline(C1_RHO_THRESHOLD, color="green", linestyle="--", alpha=0.5, label=f"C1 threshold (rho={C1_RHO_THRESHOLD})")
    ax.axhline(D1_RHO_THRESHOLD, color="orange", linestyle="--", alpha=0.5, label=f"D1 threshold (rho={D1_RHO_THRESHOLD})")
    ax.axhline(0, color="gray", linestyle="-", alpha=0.3)

    domain_patches = [mpatches.Patch(color=c, label=d) for d, c in color_map.items() if d != "other"]
    ax.legend(handles=domain_patches, loc="upper left", fontsize=8)
    ax.set_xlabel("Sample Size (n)", fontsize=10)
    ax.set_ylabel("Spearman rho (SRI vs gap)", fontsize=10)
    ax.set_title("Scope-of-Validity: When Does the WRL Theory Work?")
    plt.tight_layout()
    fig.savefig("scope_validity.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_scope_validity(studies)

In [16]:
# ── Figure 6: Decision Tree ──
def plot_decision_tree():
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.axis("off")

    ax.text(5, 9.5, "Positional Encoding Selection", ha="center", fontsize=14, fontweight="bold",
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#3498db", alpha=0.3))
    ax.text(5, 8, "Compute SRI = K x delta_min\nfor your graph", ha="center", fontsize=10,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#f1c40f", alpha=0.3))
    ax.annotate("", xy=(5, 8.5), xytext=(5, 9.1), arrowprops=dict(arrowstyle="->"))

    ax.text(2.5, 6, "SRI < 1\n(Low resolution)", ha="center", fontsize=9,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#e74c3c", alpha=0.3))
    ax.text(7.5, 6, "SRI >= 1\n(Adequate resolution)", ha="center", fontsize=9,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#2ecc71", alpha=0.3))
    ax.annotate("", xy=(3.5, 6.5), xytext=(4.5, 7.6), arrowprops=dict(arrowstyle="->"))
    ax.annotate("", xy=(6.5, 6.5), xytext=(5.5, 7.6), arrowprops=dict(arrowstyle="->"))

    ax.text(1, 4, "Protein graphs:\nUse LapPE\n(best overall)", ha="center", fontsize=8,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#9b59b6", alpha=0.3))
    ax.text(4, 4, "Molecular graphs:\nUse RWSE\n(robust baseline)", ha="center", fontsize=8,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#1abc9c", alpha=0.3))
    ax.text(7.5, 4, "Use RWSE\n(RWSE excels at\nhigh resolution)", ha="center", fontsize=8,
            bbox=dict(boxstyle="round,pad=0.3", facecolor="#2ecc71", alpha=0.3))

    ax.annotate("", xy=(1.5, 4.5), xytext=(2, 5.5), arrowprops=dict(arrowstyle="->"))
    ax.annotate("", xy=(3.5, 4.5), xytext=(3, 5.5), arrowprops=dict(arrowstyle="->"))
    ax.annotate("", xy=(7.5, 4.5), xytext=(7.5, 5.5), arrowprops=dict(arrowstyle="->"))

    ax.text(5, 2, "Key Finding: SRWE helps mainly on Peptides-struct (57.7% gap reduction)\n"
                   "but not consistently across all datasets.\n"
                   "RWSE remains the safest default for molecular graphs (ZINC).",
            ha="center", fontsize=9, style="italic",
            bbox=dict(boxstyle="round,pad=0.5", facecolor="#ecf0f1", alpha=0.5))
    ax.set_title("Encoding Selection Decision Tree", fontsize=13, fontweight="bold")
    fig.savefig("decision_tree.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_decision_tree()
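
The branching drawn in the decision tree can also be written as a small lookup. This is only a sketch: the function name `recommend_encoding` and the `domain` labels are hypothetical, and returning `None` for low-SRI graphs that are neither protein nor molecular is an assumption, since the figure does not cover that case.

```python
from typing import Optional


def recommend_encoding(sri: float, domain: str) -> Optional[str]:
    """Mirror the decision-tree figure (hypothetical helper, not part of the notebook).

    sri    : Spectral Resolution Index, SRI = K x delta_min, computed elsewhere.
    domain : "protein" or "molecular"; other values are not covered by the figure.
    """
    if sri >= 1.0:
        # Adequate resolution branch: the figure recommends RWSE.
        return "RWSE"
    # Low resolution branch (SRI < 1): the recommendation depends on the domain.
    if domain == "protein":
        return "LapPE"
    if domain == "molecular":
        return "RWSE"
    # Assumption: no recommendation for domains the figure does not show.
    return None
```

For example, `recommend_encoding(0.4, "protein")` follows the left branch and returns `"LapPE"`, while any graph with SRI at or above 1 gets `"RWSE"` regardless of domain.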

In [17]:
# ── Figure 7: Encoding Performance Bars ──
def plot_encoding_performance_bars(srwe_results):
    fig, axes = plt.subplots(1, 3, figsize=(16, 6))

    dataset_results = {}
    for cond_name, cond_data in srwe_results.items():
        ds = cond_data.get("dataset", "unknown")
        ds_encs = dataset_results.setdefault(ds, {})
        for enc in ["rwse", "lappe", "srwe_mpm", "srwe_tikhonov", "srwe", "none",
                     "histogram", "moment_correction", "spectral_summary", "raw_weights"]:
            # Collect every non-missing score for this encoding under its dataset.
            if cond_data.get(enc) is not None:
                ds_encs.setdefault(enc, []).append(cond_data[enc])

    plot_datasets = list(dataset_results.keys())[:MAX_BAR_CHART_DATASETS]
    for ax_idx, ds in enumerate(plot_datasets):
        if ax_idx >= len(axes):
            break
        ax = axes[ax_idx]
        ds_data = dataset_results[ds]
        enc_names = sorted(ds_data.keys())[:8]
        means = [np.mean(ds_data[e]) for e in enc_names]
        stds = [np.std(ds_data[e]) if len(ds_data[e]) > 1 else 0 for e in enc_names]

        x = np.arange(len(enc_names))
        ax.bar(x, means, yerr=stds, capsize=3, alpha=0.8, color=plt.cm.tab10(np.linspace(0, 1, len(enc_names))))
        ax.set_xticks(x)
        ax.set_xticklabels(enc_names, rotation=45, ha="right", fontsize=7)
        ax.set_title(f"{ds}", fontsize=10, fontweight="bold")
        ax.set_ylabel("Performance")

    # Hide unused axes
    for ax_idx in range(len(plot_datasets), len(axes)):
        axes[ax_idx].axis("off")

    fig.suptitle("Encoding Performance Comparison Across Datasets", fontsize=12, fontweight="bold")
    plt.tight_layout()
    fig.savefig("encoding_performance.png", dpi=FIGURE_DPI, bbox_inches="tight")
    plt.show()
    plt.close(fig)

plot_encoding_performance_bars(srwe_results)

## Summary

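The cell below prints the key results of the Grand Synthesis evaluation: the overall verdict and scores, the pooled SRI-gap correlation with its confidence interval and heterogeneity, the per-criterion verdicts, the SRWE win/loss/tie totals, the moderator p-values, and the scope of validity.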
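
To help interpret the pooled rho, its interval, and the I2 value reported in this summary, here is a minimal sketch of Fisher z random-effects pooling. It is not the notebook's own implementation: the function name `pool_correlations` is hypothetical, the DerSimonian-Laird estimator for the between-study variance is an assumption, and the sampling variance `1 / (n - 3)` is the usual Pearson approximation (a slightly larger constant is sometimes used for Spearman correlations). Only the output key names follow the `pooled` dictionary used in the summary cell.

```python
import math


def pool_correlations(rhos, ns):
    """Random-effects pooling of correlations on the Fisher z scale (sketch).

    rhos : per-study correlation coefficients, each strictly inside (-1, 1).
    ns   : per-study sample sizes, each greater than 3.
    """
    z = [math.atanh(r) for r in rhos]      # Fisher z transform
    v = [1.0 / (n - 3) for n in ns]        # approximate sampling variance of z
    w = [1.0 / vi for vi in v]             # fixed-effect weights
    sw = sum(w)
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sw

    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights, pooled estimate, and its standard error.
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    # I2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "rho_pooled": math.tanh(z_re),             # back-transform to the rho scale
        "ci_low": math.tanh(z_re - 1.96 * se),     # 95% normal-approximation interval
        "ci_high": math.tanh(z_re + 1.96 * se),
        "I2": i2,
        "k": len(z),
    }
```

An I2 close to 100%, as in the output that follows, means nearly all of the observed spread in study correlations exceeds what sampling error alone would produce, so the pooled rho is an average over studies that disagree substantially.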

In [18]:
# ── Print Summary Table ──
print("=" * 70)
print("GRAND SYNTHESIS: WALK RESOLUTION LIMIT HYPOTHESIS ADJUDICATION")
print("=" * 70)
print()
print(f"  Overall Verdict:       {overall['overall_verdict'].upper()}")
print(f"  Overall Score:         {overall['overall_score']:.3f}")
print(f"  Confirmation Score:    {overall['confirmation_score']:.3f}")
print(f"  Disconf. Penalty:      {overall['disconfirmation_penalty']:.3f}")
print()
print(f"  Pooled rho:            {pooled['rho_pooled']:.4f} [{pooled['ci_low']:.4f}, {pooled['ci_high']:.4f}]")
print(f"  I2 Heterogeneity:      {pooled['I2']:.1f}%")
print(f"  Number of studies:     {pooled['k']}")
print()
print("  CRITERION VERDICTS:")
print("  " + "-" * 66)
for key, v in verdicts.items():
    print(f"  {key:45s} {v['verdict']:25s} (conf={v['confidence']:.2f})")
print()
print("  SRWE SCORECARD SUMMARY:")
print("  " + "-" * 66)
total_w = sum(sc["wins"] for sc in scorecard.values())
total_l = sum(sc["losses"] for sc in scorecard.values())
total_t = sum(sc["ties"] for sc in scorecard.values())
print(f"  Total comparisons: {total_w + total_l + total_t}")
print(f"  Wins: {total_w}, Losses: {total_l}, Ties: {total_t}")
print()
print("  MODERATOR EFFECTS:")
print("  " + "-" * 66)
for mod_name, mod_data in moderators.items():
    sig = "***" if mod_data.get("significant") else "   "
    print(f"  {mod_name:25s} p={mod_data['p_value']:.4f} {sig}")
print()
print("  SCOPE OF VALIDITY:")
print("  " + "-" * 66)
print(f"  Directional accuracy:  {scope['directional_accuracy']:.1%}")
print(f"  Works in:              {', '.join(scope['domains_where_works'])}")
print(f"  Fails in:              {', '.join(scope['domains_where_fails'])}")
print("=" * 70)

GRAND SYNTHESIS: WALK RESOLUTION LIMIT HYPOTHESIS ADJUDICATION

  Overall Verdict:       PARTIALLY_CONFIRMED
  Overall Score:         0.458
  Confirmation Score:    0.224
  Disconf. Penalty:      0.003

  Pooled rho:            0.1526 [0.0204, 0.2796]
  I2 Heterogeneity:      99.0%
  Number of studies:     26

  CRITERION VERDICTS:
  ------------------------------------------------------------------
  C1_sri_gap_correlation                        not_confirmed             (conf=0.86)
  C2_aliased_pairs_distinguishability           confirmed                 (conf=0.67)
  C3_srwe_gap_reduction                         partially_confirmed       (conf=0.36)
  D1_weak_correlation                           disconfirmed              (conf=0.24)
  D2_resolution_mismatch                        not_confirmed             (conf=0.50)
  D3_no_srwe_improvement                        partially_confirmed       (conf=0.50)

  SRWE SCORECARD SUMMARY:
  ----------------------------------------------------