# Deep-Dive Cross-Experiment Mechanistic Analysis of Walk Resolution Limit

This notebook demonstrates the **cross-experiment evaluation** that synthesizes results from three iteration-5 experiments into a unified mechanistic analysis of the walk resolution limit. It covers the three experiments plus a consistency check across them:

1. **Depth Compensation** (Exp1): how the RWSE-LapPE gap narrows with increasing GNN depth
2. **SRWE Classification Diagnosis** (Exp2): why SRWE closes 110% of the gap on regression (Peptides-struct) but -9% on classification (Peptides-func)
3. **Adaptive Selection Value** (Exp3): oracle headroom analysis showing 44% on ZINC and 24% on Peptides-struct
4. **Cross-Experiment Consistency**: hypothesis support score of 0.49

The evaluation produces 37 aggregate metrics across 4 analyses, with per-graph evaluation data across 5 datasets.

In [1]:
import subprocess, sys
def _pip(*a): subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', *a])

# scipy is not pre-installed at the required version on every system; it is needed for stats.linregress.
# No packages beyond the Colab defaults are required for this evaluation.

# Core packages (pre-installed on Colab, install locally to match Colab env)
if 'google.colab' not in sys.modules:
    _pip('numpy==2.0.2', 'scipy==1.16.3', 'matplotlib==3.10.0')



In [2]:
import json
import math
import os
import sys
from typing import Any

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

## Data Loading

Load the pre-computed evaluation results from GitHub (or local fallback).

In [3]:
GITHUB_DATA_URL = "https://raw.githubusercontent.com/AMGrobelnik/ai-invention-ace67e-the-walk-resolution-limit-a-super-resolu/main/evaluation_iter6_deep_dive_cross/demo/mini_demo_data.json"
import json
import os
import urllib.request

def load_data():
    """Load the demo data from GitHub, falling back to a local copy."""
    try:
        with urllib.request.urlopen(GITHUB_DATA_URL) as response:
            return json.loads(response.read().decode())
    except Exception:
        pass  # Network unavailable or URL unreachable: try the local file next
    if os.path.exists("mini_demo_data.json"):
        with open("mini_demo_data.json") as f:
            return json.load(f)
    raise FileNotFoundError("Could not load mini_demo_data.json")

In [4]:
data = load_data()
print(f"Loaded evaluation data: {len(data['datasets'])} datasets, {len(data['metrics_agg'])} aggregate metrics")

Loaded evaluation data: 5 datasets, 37 aggregate metrics
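
The cells that follow rely on a particular layout of the loaded JSON: a `metrics_agg` mapping and a `datasets` list whose entries carry `dataset` and `examples` keys. A minimal sketch of a sanity check for that layout is shown below. The function name `check_layout` and the toy payload are illustrative assumptions, not part of the evaluation script.

```python
from typing import Any


def check_layout(payload: dict[str, Any]) -> list[str]:
    """Return a list of layout problems (empty if the payload looks usable)."""
    problems = []
    if not isinstance(payload.get("metrics_agg"), dict):
        problems.append("metrics_agg should be a dict of aggregate metrics")
    datasets = payload.get("datasets")
    if not isinstance(datasets, list):
        problems.append("datasets should be a list")
        return problems
    for i, ds in enumerate(datasets):
        if not isinstance(ds, dict):
            problems.append(f"datasets[{i}] should be a dict")
            continue
        # These two keys are the ones the notebook reads from every dataset entry
        for key in ("dataset", "examples"):
            if key not in ds:
                problems.append(f"datasets[{i}] is missing '{key}'")
    return problems


# Toy payload (hypothetical values) mirroring the keys used in this notebook
toy = {
    "metrics_agg": {"hypothesis_support_score": 0.5},
    "datasets": [{"dataset": "demo", "examples": [{"eval_depth2_gap": 0.1}]}],
}
print(check_layout(toy))
```

Running a check like this right after loading makes a malformed local fallback file fail early, rather than surfacing later as a `KeyError` in one of the analysis cells.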


## Configuration

Tunable parameters for the analysis. `MAX_EXAMPLES` controls how many per-graph examples are processed (0 = all available).

In [5]:
# --- Config ---
# Maximum examples per dataset to process (0 = all available)
# Original: 0  (process all)
MAX_EXAMPLES = 0

# GNN depths analyzed in the depth compensation experiment
DEPTHS = [2, 3, 4, 6, 8]

# Encoding types tested in the SRWE diagnosis experiment
ENCODING_TYPES = ["none", "rwse", "lappe", "histogram", "raw_weights",
                  "eigenvalue_pairs", "moment_correction", "spectral_summary"]
ENCODING_LABELS = ["None", "RWSE", "LapPE", "Hist", "RawW", "EigP", "MomC", "SpecS"]

# Strategies tested in adaptive selection experiment
STRATEGIES = ["FIXED-RWSE", "FIXED-LapPE", "FIXED-SRWE", "SRI-THRESHOLD",
              "CONCAT-RWSE-SRWE", "ORACLE"]
STRATEGY_LABELS = ["RWSE", "LapPE", "SRWE", "SRI-Thresh", "Concat", "Oracle"]

# Hypothesis support score weights
W_SRI = 0.4
W_SRWE = 0.3
W_DEPTH = 0.3

## Helper Functions

Core computation utilities from the evaluation script: parsing predictions, computing MAE and AP.

In [6]:
def parse_prediction(pred_str: str) -> np.ndarray:
    """Parse a prediction string like '0.123' or '[1.0, 2.0, 3.0]' to numpy array."""
    pred_str = pred_str.strip()
    if pred_str.startswith("["):
        vals = json.loads(pred_str)
        return np.array(vals, dtype=np.float64)
    else:
        return np.array([float(pred_str)], dtype=np.float64)


def parse_output(out_str: str) -> np.ndarray:
    """Parse output string to numpy array."""
    out_str = out_str.strip()
    if out_str.startswith("["):
        vals = json.loads(out_str)
        return np.array(vals, dtype=np.float64)
    else:
        return np.array([float(out_str)], dtype=np.float64)


def compute_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Compute mean absolute error."""
    return float(np.mean(np.abs(pred - target)))


def compute_ap_multilabel(pred_probs: np.ndarray, targets: np.ndarray) -> float:
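    """Compute a simple AP proxy: the mean predicted probability over positive labels.

    Note: this is not ranking-based average precision; it returns 0.0 when
    there are no positive labels.
    """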
    pos_mask = targets > 0.5
    if pos_mask.sum() == 0:
        return 0.0
    return float(np.mean(pred_probs[pos_mask]))
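
For comparison with `compute_ap_multilabel`, which averages the predicted probabilities of the positive labels, the sketch below computes ranking-based average precision for a single label vector using NumPy only. It is an illustrative alternative, not the method used by the evaluation script. The function name `ranking_average_precision` is hypothetical, and ties in the scores are broken by input order, which may differ from library implementations.

```python
import numpy as np


def ranking_average_precision(scores, labels) -> float:
    """Mean of precision@k taken at the rank k of every positive label."""
    scores = np.asarray(scores, dtype=np.float64)
    is_pos = np.asarray(labels, dtype=np.float64) > 0.5
    if is_pos.sum() == 0:
        return float("nan")  # undefined without positives
    # Sort items by descending score (stable, so ties keep input order)
    order = np.argsort(-scores, kind="stable")
    hits = is_pos[order]
    # precision@k = (number of positives in the top k) / k
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


# Toy example (hypothetical scores): positives sit at ranks 1 and 3
print(ranking_average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the toy example the precisions at the two positive ranks are 1 and 2/3, so the result is their mean, 5/6.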

## Analysis 1: Display Aggregate Metrics

The evaluation computes 37 aggregate metrics across 4 analyses. Let's examine the key findings.

In [7]:
metrics = data["metrics_agg"]

print("=" * 70)
print("AGGREGATE METRICS SUMMARY")
print("=" * 70)

# Analysis 1: Depth Compensation
print("\n--- Analysis 1: Depth Compensation ---")
print(f"  ZINC gap at depth 2:    {metrics['exp1_zinc_gap_depth2']:.4f}")
print(f"  ZINC gap at depth 8:    {metrics['exp1_zinc_gap_depth8']:.4f}")
print(f"  ZINC gap slope:         {metrics['exp1_zinc_gap_slope']:.6f} (p={metrics['exp1_zinc_gap_slope_pvalue']:.4f})")
print(f"  Pep-struct gap depth 2: {metrics['exp1_pep_gap_depth2']:.4f}")
print(f"  Pep-struct gap depth 8: {metrics['exp1_pep_gap_depth8']:.4f}")
print(f"  Pep-struct gap slope:   {metrics['exp1_pep_gap_slope']:.6f} (p={metrics['exp1_pep_gap_slope_pvalue']:.4f})")
print(f"  Pep-struct D* (compensation depth): {metrics['exp1_pep_compensation_depth_Dstar']:.2f}")

# Analysis 2: SRWE Classification Diagnosis
print("\n--- Analysis 2: SRWE Classification Diagnosis ---")
print(f"  Pep-func RWSE AP:          {metrics['exp2_func_rwse_ap']:.4f}")
print(f"  Pep-func best SRWE AP:     {metrics['exp2_func_best_srwe_ap']:.4f}")
print(f"  Pep-func gap closed:       {metrics['exp2_func_gap_closed_pct']:.2f}%")
print(f"  Pep-struct RWSE MAE:       {metrics['exp2_struct_rwse_mae']:.4f}")
print(f"  Pep-struct best SRWE MAE:  {metrics['exp2_struct_best_srwe_mae']:.4f}")
print(f"  Pep-struct gap closed:     {metrics['exp2_struct_gap_closed_pct']:.2f}%")

# Analysis 3: Adaptive Selection
print("\n--- Analysis 3: Adaptive Selection ---")
print(f"  ZINC oracle headroom:      {metrics['exp3_zinc_oracle_headroom_pct']:.2f}%")
print(f"  Pep-struct oracle headroom:{metrics['exp3_struct_oracle_headroom_pct']:.2f}%")

# Analysis 4: Cross-Experiment Consistency
print("\n--- Analysis 4: Cross-Experiment Consistency ---")
print(f"  SRI-rho consistency ZINC:      {metrics['sri_rho_consistency_zinc']:.4f}")
print(f"  SRI-rho consistency Pep-struct: {metrics['sri_rho_consistency_pepstruct']:.4f}")
print(f"  Hypothesis support score:      {metrics['hypothesis_support_score']:.4f}")

print(f"\nTotal aggregate metrics: {len(metrics)}")

AGGREGATE METRICS SUMMARY

--- Analysis 1: Depth Compensation ---
  ZINC gap at depth 2:    0.2351
  ZINC gap at depth 8:    0.2136
  ZINC gap slope:         -0.031473 (p=0.2248)
  Pep-struct gap depth 2: 1.5308
  Pep-struct gap depth 8: 0.0569
  Pep-struct gap slope:   -0.788682 (p=0.5458)
  Pep-struct D* (compensation depth): 6.17

--- Analysis 2: SRWE Classification Diagnosis ---
  Pep-func RWSE AP:          0.4182
  Pep-func best SRWE AP:     0.4224
  Pep-func gap closed:       -9.06%
  Pep-struct RWSE MAE:       36.3738
  Pep-struct best SRWE MAE:  20.5106
  Pep-struct gap closed:     109.71%

--- Analysis 3: Adaptive Selection ---
  ZINC oracle headroom:      44.26%
  Pep-struct oracle headroom:24.47%

--- Analysis 4: Cross-Experiment Consistency ---
  SRI-rho consistency ZINC:      0.0519
  SRI-rho consistency Pep-struct: 0.3012
  Hypothesis support score:      0.4925

Total aggregate metrics: 37
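
The "gap closed" percentages printed above are read from the pre-computed metrics, and this notebook does not show how they are defined. One common definition, given here purely as an assumption, is the fraction of the baseline-to-reference gap (for example RWSE to LapPE) that a candidate encoding such as SRWE covers. The function name `gap_closed_pct` and the numbers in the usage lines are hypothetical.

```python
def gap_closed_pct(baseline: float, reference: float, candidate: float) -> float:
    """Percent of the baseline-to-reference gap that the candidate closes.

    The same expression works for error metrics (lower is better) and for
    score metrics (higher is better), because the sign of the gap cancels.
    """
    gap = reference - baseline
    if gap == 0:
        raise ValueError("baseline and reference are equal; the gap is undefined")
    return 100.0 * (candidate - baseline) / gap


# Hypothetical MAE values: the candidate ends up below the reference error
print(gap_closed_pct(baseline=40.0, reference=20.0, candidate=18.0))
# Hypothetical AP values: the candidate moves away from the reference
print(gap_closed_pct(baseline=0.40, reference=0.50, candidate=0.39))
```

Under this definition a value above 100% means the candidate overshoots the reference, and a negative value means it moved in the opposite direction from the reference relative to the baseline.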


## Analysis 2: Per-Graph Depth Compensation Analysis

Examine per-graph RWSE-LapPE gaps (LapPE MAE minus RWSE MAE) across GNN depths, and regress the average gap on log depth. Under the depth-compensation hypothesis, the gap should narrow with increasing depth for Peptides-struct but remain persistent for ZINC.

In [8]:
# Extract per-graph data from depth compensation datasets
datasets_by_name = {ds["dataset"]: ds["examples"] for ds in data["datasets"]}

for ds_name in ["depth_compensation_zinc", "depth_compensation_peptides_struct"]:
    examples = datasets_by_name.get(ds_name, [])
    if MAX_EXAMPLES > 0:
        examples = examples[:MAX_EXAMPLES]

    print(f"\n{'=' * 60}")
    print(f"Dataset: {ds_name} ({len(examples)} examples)")
    print(f"{'=' * 60}")

    # Compute average gap across examples at each depth
    avg_gaps = {}
    for d in DEPTHS:
        gap_key = f"eval_depth{d}_gap"
        gap_vals = [ex[gap_key] for ex in examples if gap_key in ex]
        if gap_vals:
            avg_gaps[d] = np.mean(gap_vals)
            print(f"  Depth {d}: avg gap = {avg_gaps[d]:.4f} (n={len(gap_vals)})")

    if len(avg_gaps) >= 2:
        log_d = np.array([np.log(d) for d in sorted(avg_gaps.keys())])
        gap_vals_arr = np.array([avg_gaps[d] for d in sorted(avg_gaps.keys())])
        slope_res = stats.linregress(log_d, gap_vals_arr)
        print(f"  Gap slope: {slope_res.slope:.6f} (p={slope_res.pvalue:.4f})")


Dataset: depth_compensation_zinc (3 examples)
  Depth 2: avg gap = 0.1747 (n=3)
  Depth 3: avg gap = -0.0755 (n=3)
  Depth 4: avg gap = 0.0966 (n=3)
  Depth 6: avg gap = 0.0543 (n=3)
  Depth 8: avg gap = 0.1104 (n=3)
  Gap slope: -0.007718 (p=0.9421)

Dataset: depth_compensation_peptides_struct (3 examples)
  Depth 2: avg gap = 0.6363 (n=3)
  Depth 3: avg gap = -1.3094 (n=3)
  Depth 4: avg gap = 0.1585 (n=3)
  Depth 6: avg gap = -2.9934 (n=3)
  Depth 8: avg gap = -2.1703 (n=3)
  Gap slope: -2.197460 (p=0.1132)
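
The slope reported above comes from regressing the average gap on the natural log of depth. The sketch below repeats that fit on synthetic gaps and then reads off the depth at which the fitted line reaches zero. The aggregate metrics include a compensation depth D*, but its definition is not shown in this notebook, so the zero-crossing rule here is only an assumption and is not claimed to reproduce the reported value. The helper name `zero_crossing_depth` is hypothetical.

```python
import numpy as np
from scipy import stats

depths = np.array([2, 3, 4, 6, 8], dtype=np.float64)
# Synthetic gaps that follow gap = 1.0 - 0.5 * ln(depth) exactly (not real data)
gaps = 1.0 - 0.5 * np.log(depths)

# Same regression as in the cell above: gap against log depth
fit = stats.linregress(np.log(depths), gaps)


def zero_crossing_depth(slope: float, intercept: float) -> float:
    """Depth where the fitted line slope * ln(d) + intercept reaches zero."""
    if slope >= 0:
        return float("inf")  # a non-decreasing fit gives no compensation depth
    return float(np.exp(-intercept / slope))


d_star = zero_crossing_depth(fit.slope, fit.intercept)
print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, zero crossing at depth {d_star:.2f}")
```

With only five depths, and only three graphs per dataset in the demo data, the p-values of such fits are large, so a zero crossing computed this way should be read as a rough indication at most.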


## Analysis 3: SRWE Classification vs Regression Diagnosis

Per-graph analysis of SRWE encoding performance on classification (Peptides-func) vs regression tasks.

In [9]:
# SRWE diagnosis: per-graph AP values for Peptides-func
srwe_func_examples = datasets_by_name.get("srwe_diagnosis_peptides_func", [])
if MAX_EXAMPLES > 0:
    srwe_func_examples = srwe_func_examples[:MAX_EXAMPLES]

print(f"SRWE Diagnosis — Peptides-func ({len(srwe_func_examples)} examples)")
print("-" * 60)

for enc, label in zip(ENCODING_TYPES, ENCODING_LABELS):
    ap_key = f"eval_{enc}_ap"
    ap_vals = [ex[ap_key] for ex in srwe_func_examples if ap_key in ex]
    if ap_vals:
        print(f"  {label:8s}: mean AP = {np.mean(ap_vals):.4f} (std={np.std(ap_vals):.4f})")

# Show SRWE advantage per example
print("\nPer-example SRWE advantage (moment_correction AP - RWSE AP):")
for i, ex in enumerate(srwe_func_examples):
    advantage = ex.get("eval_srwe_advantage", 0)
    print(f"  Example {i}: {advantage:+.4f}")

SRWE Diagnosis — Peptides-func (3 examples)
------------------------------------------------------------
  None    : mean AP = 0.2266 (std=0.1834)
  RWSE    : mean AP = 0.2315 (std=0.1641)
  LapPE   : mean AP = 0.2457 (std=0.2705)
  Hist    : mean AP = 0.2367 (std=0.2040)
  RawW    : mean AP = 0.2194 (std=0.2485)
  EigP    : mean AP = 0.2519 (std=0.2605)
  MomC    : mean AP = 0.2229 (std=0.1924)
  SpecS   : mean AP = 0.2306 (std=0.2082)

Per-example SRWE advantage (moment_correction AP - RWSE AP):
  Example 0: +0.0345
  Example 1: -0.0686
  Example 2: +0.0084


## Analysis 4: Adaptive Selection — Oracle Headroom

Compute per-graph oracle advantage over the best fixed encoding strategy.

In [10]:
for ds_name in ["adaptive_selection_zinc", "adaptive_selection_peptides_struct"]:
    examples = datasets_by_name.get(ds_name, [])
    if MAX_EXAMPLES > 0:
        examples = examples[:MAX_EXAMPLES]

    metric_suffix = "mae"  # Both ZINC and Pep-struct use MAE in per-graph data

    print(f"\n{'=' * 60}")
    print(f"Adaptive Selection: {ds_name} ({len(examples)} examples)")
    print(f"{'=' * 60}")

    # Compute mean metric per strategy
    strat_keys = ["FIXED_RWSE", "FIXED_LapPE", "FIXED_SRWE", "SRI_THRESHOLD",
                  "CONCAT_RWSE_SRWE", "ORACLE"]
    strat_labels_local = ["RWSE", "LapPE", "SRWE", "SRI-Thresh", "Concat", "Oracle"]

    for sk, sl in zip(strat_keys, strat_labels_local):
        key = f"eval_{sk}_{metric_suffix}"
        vals = [ex[key] for ex in examples if key in ex]
        if vals:
            print(f"  {sl:12s}: mean MAE = {np.mean(vals):.4f}")

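    # Oracle advantage per example
    # (a missing "eval_oracle_advantage" key defaults to 0, so a mean of exactly 0
    # may simply indicate that the field is absent from the demo data)
    oracle_advs = [ex.get("eval_oracle_advantage", 0) for ex in examples]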
    print(f"  Mean oracle advantage: {np.mean(oracle_advs):.4f}")


Adaptive Selection: adaptive_selection_zinc (3 examples)
  RWSE        : mean MAE = 0.0740
  LapPE       : mean MAE = 0.1137
  SRWE        : mean MAE = 0.0580
  SRI-Thresh  : mean MAE = 0.2682
  Concat      : mean MAE = 0.0824
  Oracle      : mean MAE = 0.0333
  Mean oracle advantage: 0.0000

Adaptive Selection: adaptive_selection_peptides_struct (3 examples)
  RWSE        : mean MAE = 41.3430
  LapPE       : mean MAE = 37.5040
  SRWE        : mean MAE = 29.7914
  SRI-Thresh  : mean MAE = 40.0177
  Concat      : mean MAE = 12.3151
  Oracle      : mean MAE = 29.5376
  Mean oracle advantage: 0.0000
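
Oracle headroom can be understood as the relative improvement of a per-graph oracle, which picks the best encoding for each graph, over the best single fixed encoding. The sketch below computes it from a matrix of per-graph errors. The exact formula behind the reported headroom percentages is not shown in this notebook, so this definition, the function name `oracle_headroom_pct`, and the toy numbers are all assumptions.

```python
import numpy as np


def oracle_headroom_pct(errors) -> float:
    """Headroom of a per-graph oracle over the best fixed strategy, in percent.

    errors has shape (n_graphs, n_strategies); lower values are better.
    """
    errors = np.asarray(errors, dtype=np.float64)
    best_fixed = errors.mean(axis=0).min()  # best single strategy on average
    oracle = errors.min(axis=1).mean()      # best strategy chosen per graph
    return float(100.0 * (best_fixed - oracle) / best_fixed)


# Toy errors (hypothetical): each strategy wins on a different graph
toy_errors = np.array([[1.0, 3.0],
                       [3.0, 1.0]])
print(oracle_headroom_pct(toy_errors))
```

In the toy case both fixed strategies average an error of 2 while the oracle averages 1, so the headroom is 50%. Headroom is zero when one strategy is best on every graph.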


## Analysis 5: Hypothesis Support Score Computation

Recompute the hypothesis support score from the aggregate metrics, showing the three component contributions.

In [11]:
# Component 1: SRI-gap correlation strength
# Collect all available SRI rho measurements
all_rhos = []
for prefix in ["exp1_zinc", "exp1_pep"]:
    for depth in [2, 8]:
        key = f"{prefix}_sri_rho_depth{depth}"
        if key in metrics:
            all_rhos.append(abs(metrics[key]))

avg_abs_rho = np.mean(all_rhos) if all_rhos else 0.0
sri_component = min(1.0, avg_abs_rho / 0.5)

# Component 2: SRWE gap-closing on regression (0-1 scale)
struct_gap_closed = metrics.get("exp2_struct_gap_closed_pct", 0.0)
srwe_component = min(1.0, max(0.0, struct_gap_closed / 100.0))

# Component 3: Depth compensation evidence
pep_slope = metrics.get("exp1_pep_gap_slope", 0.0)
pep_pval = metrics.get("exp1_pep_gap_slope_pvalue", 1.0)
if pep_slope < 0 and pep_pval < 0.05:
    depth_component = min(1.0, abs(pep_slope) / 2.0)
elif pep_slope < 0:
    depth_component = min(0.5, abs(pep_slope) / 2.0)
else:
    depth_component = 0.0

hypothesis_score = W_SRI * sri_component + W_SRWE * srwe_component + W_DEPTH * depth_component

print("Hypothesis Support Score Breakdown:")
print(f"  SRI-gap correlation (weight {W_SRI}):  {sri_component:.4f} (avg |rho|={avg_abs_rho:.4f})")
print(f"  SRWE gap-closing   (weight {W_SRWE}):  {srwe_component:.4f} (gap closed={struct_gap_closed:.2f}%)")
print(f"  Depth compensation (weight {W_DEPTH}):  {depth_component:.4f} (slope={pep_slope:.4f}, p={pep_pval:.4f})")
print(f"  => Hypothesis support score: {hypothesis_score:.4f}")
print(f"  (Reference from full data:   {metrics['hypothesis_support_score']:.4f})")

Hypothesis Support Score Breakdown:
  SRI-gap correlation (weight 0.4):  0.1904 (avg |rho|=0.0952)
  SRWE gap-closing   (weight 0.3):  1.0000 (gap closed=109.71%)
  Depth compensation (weight 0.3):  0.3943 (slope=-0.7887, p=0.5458)
  => Hypothesis support score: 0.4944
  (Reference from full data:   0.4925)
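
The breakdown above can be checked by hand: with weights 0.4, 0.3 and 0.3, the score is 0.4 × 0.1904 + 0.3 × 1.0 + 0.3 × 0.3943 ≈ 0.494. The sketch below restates the logic of the cell as a standalone function. The name `support_score` is a hypothetical wrapper, and the inputs are the rounded values printed above, so the result agrees with the notebook only to about three decimals.

```python
def support_score(avg_abs_rho: float, gap_closed_pct: float,
                  slope: float, pvalue: float,
                  w_sri: float = 0.4, w_srwe: float = 0.3,
                  w_depth: float = 0.3) -> float:
    """Weighted hypothesis support score, mirroring the cell above."""
    # SRI-gap correlation: |rho| of 0.5 or more saturates the component
    sri = min(1.0, avg_abs_rho / 0.5)
    # SRWE gap-closing on regression, clipped to the 0-1 range
    srwe = min(1.0, max(0.0, gap_closed_pct / 100.0))
    # Depth compensation: a non-significant negative slope is capped at 0.5
    if slope < 0 and pvalue < 0.05:
        depth = min(1.0, abs(slope) / 2.0)
    elif slope < 0:
        depth = min(0.5, abs(slope) / 2.0)
    else:
        depth = 0.0
    return w_sri * sri + w_srwe * srwe + w_depth * depth


# Rounded inputs taken from the printed breakdown
score = support_score(0.0952, 109.71, -0.7887, 0.5458)
print(f"{score:.3f}")
```

Because the SRWE component is clipped at 1.0, closing more than 100% of the gap earns no additional credit. The depth component is capped at 0.5 here because the slope is not significant at the 0.05 level.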


## Visualization: Cross-Experiment Synthesis

A publication-quality four-panel figure summarizing the key findings across all four analyses.

In [12]:
plt.rcParams.update({
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.labelsize": 11,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "legend.fontsize": 9,
    "figure.dpi": 150,
})

fig, axes = plt.subplots(2, 2, figsize=(12, 9))

# --- Panel A: Depth Compensation — Gap vs Depth ---
ax = axes[0, 0]
for ds_name, color, marker, label in [
    ("depth_compensation_zinc", "tab:blue", "o", "ZINC-subset"),
    ("depth_compensation_peptides_struct", "tab:orange", "s", "Peptides-struct"),
]:
    examples = datasets_by_name.get(ds_name, [])
    if MAX_EXAMPLES > 0:
        examples = examples[:MAX_EXAMPLES]
    avg_gaps = []
    for d in DEPTHS:
        gap_key = f"eval_depth{d}_gap"
        vals = [ex[gap_key] for ex in examples if gap_key in ex]
        avg_gaps.append(np.mean(vals) if vals else np.nan)

    ax.plot(DEPTHS, avg_gaps, f"{marker}-", color=color, linewidth=2, markersize=8, label=label)

    # Fit regression
    valid = ~np.isnan(avg_gaps)
    if np.sum(valid) >= 2:
        log_d = np.log(np.array(DEPTHS)[valid])
        g = np.array(avg_gaps)[valid]
        sr = stats.linregress(log_d, g)
        x_fit = np.linspace(min(DEPTHS), max(DEPTHS), 100)
        y_fit = sr.slope * np.log(x_fit) + sr.intercept
        ax.plot(x_fit, y_fit, "--", color=color, alpha=0.5)

ax.axhline(y=0, color="gray", linestyle="--", alpha=0.5)
ax.set_xlabel("GNN Depth (layers)")
ax.set_ylabel("LapPE MAE - RWSE MAE")
ax.set_title("A) Depth Compensation: Gap vs Depth")
ax.legend()
ax.grid(True, alpha=0.3)

# --- Panel B: SRWE Classification vs Regression ---
ax = axes[0, 1]
x = np.arange(2)
width = 0.35
rwse_mi = [metrics.get("exp2_func_rwse_mi", 0), metrics.get("exp2_struct_rwse_mi", 0)]
srwe_mi = [metrics.get("exp2_func_best_srwe_mi", 0), metrics.get("exp2_struct_best_srwe_mi", 0)]
ax.bar(x - width/2, rwse_mi, width, label="RWSE", color="tab:blue", edgecolor="black", linewidth=0.5)
ax.bar(x + width/2, srwe_mi, width, label="Best SRWE", color="tab:orange", edgecolor="black", linewidth=0.5)
ax.set_xticks(x)
ax.set_xticklabels(["Func (Classification)", "Struct (Regression)"])
ax.set_ylabel("Mutual Information")
ax.set_title("B) MI: RWSE vs Best SRWE")
ax.legend()
ax.grid(True, alpha=0.3, axis="y")

# --- Panel C: Oracle Headroom ---
ax = axes[1, 0]
headroom_labels = ["ZINC", "Pep-struct"]
headroom_values = [
    metrics.get("exp3_zinc_oracle_headroom_pct", 0),
    metrics.get("exp3_struct_oracle_headroom_pct", 0),
]
bars = ax.bar(range(len(headroom_labels)), headroom_values,
       color=["tab:blue", "tab:orange"], edgecolor="black", linewidth=0.5)
ax.set_xticks(range(len(headroom_labels)))
ax.set_xticklabels(headroom_labels)
ax.set_ylabel("Oracle Headroom (%)")
ax.set_title("C) Per-Graph Selection Potential")
ax.grid(True, alpha=0.3, axis="y")
for i, v in enumerate(headroom_values):
    ax.text(i, v + 0.5, f"{v:.1f}%", ha="center", fontsize=9)

# --- Panel D: Hypothesis Support Components ---
ax = axes[1, 1]
comp_labels = ["SRI-Gap\nCorrelation", "SRWE Gap\nClosing", "Depth\nCompensation", "Overall\nScore"]
comp_values = [sri_component, srwe_component, depth_component, hypothesis_score]
comp_colors = ["tab:blue", "tab:orange", "tab:green", "tab:red"]
ax.bar(range(len(comp_labels)), comp_values, color=comp_colors, edgecolor="black", linewidth=0.5)
ax.set_xticks(range(len(comp_labels)))
ax.set_xticklabels(comp_labels)
ax.set_ylabel("Score (0-1)")
ax.set_title("D) Hypothesis Support Components")
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3, axis="y")
for i, v in enumerate(comp_values):
    ax.text(i, v + 0.02, f"{v:.3f}", ha="center", fontsize=9)

fig.suptitle("Cross-Experiment Mechanistic Analysis of Walk Resolution Limit", fontsize=14, y=1.02)
fig.tight_layout()
plt.savefig("synthesis_figure.png", bbox_inches="tight", dpi=150)
plt.show()
print("Figure saved to synthesis_figure.png")

Figure saved to synthesis_figure.png
