# Spectral Diagnostics: Walk Resolution Limit Hypothesis

This notebook runs **spectral diagnostics** that test four aspects of the walk resolution limit hypothesis across multiple graph datasets:

1. **Spectral Sparsity Validation** -- Are local spectral measures sparse (few dominant eigenvalues per node)?
2. **SRI Distribution Analysis** -- How do Spectral Resolution Index distributions differ across datasets?
3. **Node-Level vs Graph-Level Resolution** -- Does eigenvector localization improve resolution?
4. **Vandermonde Conditioning** -- Does condition number predict reconstruction error?

A baseline eigenvalue clustering analysis (Analysis 5) is included for comparison.

In [1]:
import subprocess, sys

def _pip(*args):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *args])

# --- Pre-installed on Colab: elsewhere, install the same pinned versions ---
_COLAB_PINS = {
    "numpy": "numpy==1.26.4",
    "scipy": "scipy==1.13.1",
    "matplotlib": "matplotlib==3.7.1",
    "seaborn": "seaborn==0.13.2",
}

_on_colab = "google.colab" in sys.modules
if not _on_colab:
    # Colab already ships these packages, so install the pins only elsewhere.
    for _pin in _COLAB_PINS.values():
        _pip(_pin)

# --- Non-Colab packages (always install) ---
_pip("loguru")

print("All dependencies installed.")

All dependencies installed.


In [2]:
import json
import math
import os
import sys
import time
import warnings

import numpy as np
from scipy import stats

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore', category=RuntimeWarning)

# Plot defaults
plt.rcParams.update({
    'font.size': 12,
    'axes.labelsize': 14,
    'axes.titlesize': 14,
    'xtick.labelsize': 11,
    'ytick.labelsize': 11,
    'legend.fontsize': 11,
    'figure.dpi': 150,
    'savefig.dpi': 150,
    'savefig.bbox': 'tight',
})

print("Imports loaded successfully.")

Imports loaded successfully.


In [3]:
GITHUB_DATA_URL = "https://raw.githubusercontent.com/AMGrobelnik/ai-invention-ace67e-the-walk-resolution-limit-a-super-resolu/main/experiment_iter2_spectral_diagno/demo/mini_demo_data.json"

def load_data():
    try:
        import urllib.request
        with urllib.request.urlopen(GITHUB_DATA_URL) as response:
            return json.loads(response.read().decode())
    except Exception:
        pass
    if os.path.exists("mini_demo_data.json"):
        with open("mini_demo_data.json") as f:
            return json.load(f)
    raise FileNotFoundError("Could not load mini_demo_data.json")

print("Data loading helper defined.")

Data loading helper defined.


In [4]:
raw_data = load_data()
datasets = raw_data['datasets']

total_graphs = sum(len(gs) for gs in datasets.values())
print(f"Loaded {total_graphs} graphs across {len(datasets)} datasets:")
for ds_name, graphs in datasets.items():
    print(f"  {ds_name}: {len(graphs)} graphs")

Loaded 83 graphs across 4 datasets:
  Synthetic-aliased-pairs: 33 graphs
  ZINC-subset: 20 graphs
  Peptides-func: 15 graphs
  Peptides-struct: 15 graphs


## Configuration

Tunable parameters for all analyses. Adjust these to control runtime vs. detail.

In [5]:
# === CONFIG: All tunable parameters ===

# Walk lengths K to evaluate SRI and Vandermonde conditioning
K_VALUES = [2, 4, 8, 16, 20]
K_KEYS = [f'K={k}' for k in K_VALUES]

# Weight thresholds for node-level resolution analysis (Analysis 3)
WEIGHT_THRESHOLDS = [0.01, 0.05, 0.10]

# Vandermonde reconstruction experiment (Analysis 4)
N_SUBSAMPLE = 200  # Original: 200 (all graphs fit easily with demo data)
NOISE_LEVELS = [1e-8, 1e-6, 1e-4, 1e-2]  # Original: 4 noise levels
MAX_COND = 1e15
NODES_PER_GRAPH = 5  # Original: 5 nodes per graph for reconstruction

print(f"Config: K_VALUES={K_VALUES}, N_SUBSAMPLE={N_SUBSAMPLE}, "
      f"NOISE_LEVELS={NOISE_LEVELS}, NODES_PER_GRAPH={NODES_PER_GRAPH}")

Config: K_VALUES=[2, 4, 8, 16, 20], N_SUBSAMPLE=200, NOISE_LEVELS=[1e-08, 1e-06, 0.0001, 0.01], NODES_PER_GRAPH=5


## Analysis 1: Spectral Sparsity Validation

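For each node, we count how many eigenvalues carry more than 1% (or 5%) of the total weight in its local spectral measure, and also compute the participation ratio and spectral entropy. If few eigenvalues dominate (low effective rank relative to the number of eigenvalues), the spectral measure is **sparse**, which is a key assumption of the walk resolution limit hypothesis. A dataset is flagged as sparse when the ratio of its median effective rank (1% threshold) to its median eigenvalue count is below 0.3.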
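
To make the three sparsity measures concrete, here is a minimal sketch on a single toy weight vector. The weights are hypothetical illustration values, not taken from any dataset in this notebook; the formulas mirror the ones used in the analysis cell.

```python
import numpy as np

# Hypothetical local spectral weights for one node (illustrative values only).
weights = np.array([0.70, 0.20, 0.06, 0.03, 0.008, 0.002])
total = weights.sum()

# Effective rank: number of eigenvalues carrying more than 1% / 5% of the total weight.
eff_rank_1pct = int(np.sum(weights > 0.01 * total))
eff_rank_5pct = int(np.sum(weights > 0.05 * total))

# Participation ratio: (sum w)^2 / sum(w^2).
# It equals 1 when one eigenvalue holds all the weight and len(weights) when weights are uniform.
participation_ratio = float(total ** 2 / np.sum(weights ** 2))

# Shannon entropy (in bits) of the normalized weights.
p = weights / total
entropy_bits = float(-np.sum(p * np.log2(p)))

print(f"eff_rank_1pct={eff_rank_1pct}, eff_rank_5pct={eff_rank_5pct}, "
      f"participation_ratio={participation_ratio:.2f}, entropy_bits={entropy_bits:.2f}")
```

All three measures are small when a handful of eigenvalues dominate, so they should move together on genuinely sparse measures.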

In [6]:
def run_sparsity_analysis(datasets):
    """Analysis 1: Spectral Sparsity Validation."""
    print("=== Analysis 1: Spectral Sparsity Validation ===")
    t0 = time.time()
    sparsity_results = {}
    sparsity_cache = {}

    for ds_name, graphs in datasets.items():
        all_eff_rank_1pct = []
        all_eff_rank_5pct = []
        all_participation_ratios = []
        all_spectral_entropies = []
        all_n_eigenvalues = []

        for g in graphs:
            n_eigenvalues = len(g['eigenvalues'])
            all_n_eigenvalues.append(n_eigenvalues)
            for node_measures in g['local_spectral']:
                if not node_measures or len(node_measures) == 0:
                    continue
                weights = np.array([m[1] for m in node_measures], dtype=np.float64)
                total_w = weights.sum()
                if total_w < 1e-12:
                    continue

                eff_1 = int(np.sum(weights > 0.01 * total_w))
                all_eff_rank_1pct.append(eff_1)

                eff_5 = int(np.sum(weights > 0.05 * total_w))
                all_eff_rank_5pct.append(eff_5)

                pr = float(total_w ** 2 / np.sum(weights ** 2))
                all_participation_ratios.append(pr)

                p = weights / total_w
                p = p[p > 0]
                entropy = float(-np.sum(p * np.log2(p)))
                all_spectral_entropies.append(entropy)

        if len(all_eff_rank_1pct) == 0:
            print(f"  No valid nodes found for {ds_name}, skipping")
            continue

        arr_1 = np.array(all_eff_rank_1pct)
        arr_5 = np.array(all_eff_rank_5pct)
        arr_pr = np.array(all_participation_ratios)
        arr_n = np.array(all_n_eigenvalues)

        median_eff = float(np.median(arr_1))
        median_n = float(np.median(arr_n))
        ratio = median_eff / max(median_n, 1e-10)

        sparsity_cache[ds_name] = {
            'eff_rank_1pct': arr_1,
            'eff_rank_5pct': arr_5,
            'participation_ratios': arr_pr,
        }

        sparsity_results[ds_name] = {
            'median_eff_rank_1pct': median_eff,
            'median_graph_size': median_n,
            'sparsity_ratio': ratio,
            'sparsity_confirmed': bool(ratio < 0.3),
            'n_nodes_analyzed': len(all_eff_rank_1pct),
        }
        print(f"  {ds_name}: median_eff_rank={median_eff:.1f}, "
              f"median_n={median_n:.0f}, ratio={ratio:.4f}, "
              f"confirmed={ratio < 0.3}")

    print(f"  Sparsity analysis complete in {time.time()-t0:.1f}s")
    return sparsity_results, sparsity_cache

sparsity_results, sparsity_cache = run_sparsity_analysis(datasets)

=== Analysis 1: Spectral Sparsity Validation ===
  Synthetic-aliased-pairs: median_eff_rank=8.0, median_n=10, ratio=0.8000, confirmed=False
  ZINC-subset: median_eff_rank=10.0, median_n=22, ratio=0.4651, confirmed=False
  Peptides-func: median_eff_rank=10.0, median_n=134, ratio=0.0746, confirmed=True
  Peptides-struct: median_eff_rank=10.0, median_n=134, ratio=0.0746, confirmed=True
  Sparsity analysis complete in 0.0s


## Analysis 2: SRI Distribution Analysis

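The **Spectral Resolution Index (SRI)** measures how well random walks of length K can distinguish neighboring eigenvalues. We compare the K=20 SRI distributions across datasets using pairwise two-sample Kolmogorov-Smirnov tests, and report the share of graphs with SRI below 1 (the resolution limit) and above 5.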
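
The SRI values in the demo data are precomputed, so their exact definition is not visible in this notebook. The sketch below assumes SRI = K * delta_min, where delta_min is the smallest nonzero gap between eigenvalues; this matches how Analysis 3 scales its gaps by 20, but it remains an assumption. The helper names `min_gap` and `sri` and the toy spectrum are hypothetical.

```python
import numpy as np

def min_gap(eigenvalues, tol=1e-15):
    """Smallest nonzero gap between sorted eigenvalues (assumed meaning of delta_min)."""
    diffs = np.diff(np.sort(np.asarray(eigenvalues, dtype=np.float64)))
    nonzero = diffs[diffs > tol]
    return float(nonzero.min()) if nonzero.size else 0.0

def sri(eigenvalues, K):
    """Assumed definition: SRI = K * delta_min."""
    return K * min_gap(eigenvalues)

# Hypothetical spectrum whose closest pair (0.3, 0.4) is 0.1 apart.
spectrum = [0.0, 0.3, 0.4, 0.9]
for K in (2, 4, 8, 16, 20):
    value = sri(spectrum, K)
    status = "above limit" if value >= 1.0 else "below limit"
    print(f"K={K:>2}: SRI={value:.2f} ({status})")
```

Under this assumption, SRI grows linearly with the walk length, so a graph with a small minimum gap needs proportionally longer walks to reach SRI = 1.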

In [7]:
def run_sri_analysis(datasets):
    """Analysis 2: SRI Distribution Analysis with KS tests."""
    print("=== Analysis 2: SRI Distribution Analysis ===")
    t0 = time.time()
    sri_results = {}
    sri_by_dataset = {}

    for ds_name, graphs in datasets.items():
        delta_mins = np.array([g['delta_min'] for g in graphs], dtype=np.float64)
        sri_20 = np.array([g['sri']['K=20'] for g in graphs], dtype=np.float64)

        spectral_ranges = np.array([
            max(g['eigenvalues']) - min(g['eigenvalues']) for g in graphs
        ], dtype=np.float64)
        normalized_sri = delta_mins / np.maximum(spectral_ranges, 1e-10)

        sri_by_K = {}
        for k_key in K_KEYS:
            sri_by_K[k_key] = np.array([g['sri'][k_key] for g in graphs], dtype=np.float64)

        sri_by_dataset[ds_name] = {
            'delta_min': delta_mins,
            'sri_20': sri_20,
            'normalized_sri': normalized_sri,
            'spectral_range': spectral_ranges,
            'sri_by_K': sri_by_K,
        }

        sri_results[ds_name] = {
            'delta_min_mean': float(np.mean(delta_mins)),
            'sri_20_mean': float(np.mean(sri_20)),
            'sri_20_median': float(np.median(sri_20)),
            'pct_below_1': float(np.mean(sri_20 < 1.0) * 100),
            'pct_above_5': float(np.mean(sri_20 > 5.0) * 100),
        }
        print(f"  {ds_name}: SRI(K=20) mean={np.mean(sri_20):.4f}, "
              f"pct_below_1={np.mean(sri_20 < 1.0)*100:.1f}%")

    # Pairwise KS tests
    dataset_names = sorted(sri_by_dataset.keys())
    ks_results = {}
    for i, ds1 in enumerate(dataset_names):
        for ds2 in dataset_names[i + 1:]:
            stat_sri, p_sri = stats.ks_2samp(
                sri_by_dataset[ds1]['sri_20'],
                sri_by_dataset[ds2]['sri_20']
            )
            ks_results[f"{ds1}_vs_{ds2}"] = {
                'statistic': float(stat_sri),
                'p_value': float(p_sri),
            }
            print(f"  KS({ds1} vs {ds2}): stat={stat_sri:.4f}, p={p_sri:.2e}")

    sri_results['ks_tests'] = ks_results
    print(f"  SRI analysis complete in {time.time()-t0:.1f}s")
    return sri_results, sri_by_dataset

sri_results, sri_by_dataset = run_sri_analysis(datasets)

=== Analysis 2: SRI Distribution Analysis ===
  Synthetic-aliased-pairs: SRI(K=20) mean=18.9830, pct_below_1=0.0%
  ZINC-subset: SRI(K=20) mean=1.3766, pct_below_1=40.0%
  Peptides-func: SRI(K=20) mean=0.0348, pct_below_1=100.0%
  Peptides-struct: SRI(K=20) mean=0.0348, pct_below_1=100.0%
  KS(Peptides-func vs Peptides-struct): stat=0.0000, p=1.00e+00
  KS(Peptides-func vs Synthetic-aliased-pairs): stat=1.0000, p=1.83e-12
  KS(Peptides-func vs ZINC-subset): stat=1.0000, p=6.16e-10
  KS(Peptides-struct vs Synthetic-aliased-pairs): stat=1.0000, p=1.83e-12
  KS(Peptides-struct vs ZINC-subset): stat=1.0000, p=6.16e-10
  KS(Synthetic-aliased-pairs vs ZINC-subset): stat=0.8485, p=1.10e-09
  SRI analysis complete in 0.0s


## Analysis 3: Node-Level vs Graph-Level Resolution

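Eigenvector localization means each node "sees" only a subset of eigenvalues. For each node we keep the eigenvalues whose weight exceeds a threshold fraction of that node's largest weight and take the smallest nonzero gap among them. A node-to-graph gap ratio above 1 means the node's visible spectrum is easier to resolve than the graph-level `delta_min` suggests.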
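
The following sketch shows the effect on one hypothetical node. The spectrum and weights are made up for illustration, and `min_gap` is a hypothetical helper; the relative-to-maximum threshold rule is the one used in the analysis cell.

```python
import numpy as np

def min_gap(values, tol=1e-15):
    """Smallest nonzero gap between sorted values, or None if there is no such gap."""
    diffs = np.diff(np.sort(values))
    nonzero = diffs[diffs > tol]
    return float(nonzero.min()) if nonzero.size else None

# Hypothetical graph spectrum with one tight pair (0.50, 0.51).
eigenvalues = np.array([0.00, 0.50, 0.51, 0.90])
# Hypothetical local weights for one node: almost no mass on eigenvalue 0.51.
weights = np.array([0.45, 0.40, 0.001, 0.149])

thresh = 0.05
mask = weights > thresh * weights.max()   # keep eigenvalues the node actually "sees"

graph_delta = min_gap(eigenvalues)        # 0.01, set by the tight pair
node_delta = min_gap(eigenvalues[mask])   # 0.40, the tight pair is no longer both visible
ratio = node_delta / graph_delta

print(f"graph_delta={graph_delta:.3f}, node_delta={node_delta:.3f}, ratio={ratio:.1f}")
```

Because the node puts negligible weight on one member of the tight pair, its effective minimum gap is much larger than the graph-level one.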

In [8]:
def _compute_node_deltas_for_graph(g, thresh):
    """Compute node-level delta_min values for a single graph."""
    graph_delta_min = g['delta_min']
    if graph_delta_min < 1e-15:
        return [], graph_delta_min

    node_deltas = []
    for node_measures in g['local_spectral']:
        if not node_measures or len(node_measures) == 0:
            continue
        eigenvals = np.array([m[0] for m in node_measures], dtype=np.float64)
        weights = np.array([m[1] for m in node_measures], dtype=np.float64)
        if len(weights) == 0 or weights.max() < 1e-12:
            continue
        mask = weights > thresh * weights.max()
        sig_eigenvals = eigenvals[mask]
        if len(sig_eigenvals) < 2:
            continue
        sig_sorted = np.sort(sig_eigenvals)
        diffs = np.diff(sig_sorted)
        nonzero_diffs = diffs[diffs > 1e-15]
        if len(nonzero_diffs) > 0:
            node_deltas.append(float(np.min(nonzero_diffs)))

    return node_deltas, graph_delta_min


def run_node_vs_graph_analysis(datasets):
    """Analysis 3: Node-level vs Graph-level resolution comparison."""
    print("=== Analysis 3: Node-Level vs Graph-Level Resolution ===")
    t0 = time.time()
    node_vs_graph_results = {}
    scatter_cache = {}

    for ds_name, graphs in datasets.items():
        node_vs_graph_results[ds_name] = {}
        scatter_cache[ds_name] = {}

        for thresh in WEIGHT_THRESHOLDS:
            ratios = []
            node_sris = []
            graph_sris = []

            for g in graphs:
                node_deltas, graph_delta_min = _compute_node_deltas_for_graph(g, thresh)
                if not node_deltas or graph_delta_min < 1e-15:
                    continue
                median_node_delta = float(np.median(node_deltas))
                if median_node_delta > 0:
                    ratio = median_node_delta / graph_delta_min
                    ratios.append(ratio)
                    node_sris.append(median_node_delta * 20)
                    graph_sris.append(graph_delta_min * 20)

            scatter_cache[ds_name][f'{thresh}'] = {
                'node_sris': node_sris,
                'graph_sris': graph_sris,
                'ratios': ratios,
            }

            result = {'n_graphs_analyzed': len(ratios)}
            if ratios:
                ratios_arr = np.array(ratios)
                result.update({
                    'mean_ratio': float(np.mean(ratios_arr)),
                    'median_ratio': float(np.median(ratios_arr)),
                    'pct_node_better': float(np.mean(ratios_arr > 1.0) * 100),
                })
                if len(graph_sris) > 2:
                    corr = stats.spearmanr(graph_sris, node_sris)
                    result['spearman_corr'] = float(corr.statistic)
            else:
                result.update({
                    'mean_ratio': None,
                    'median_ratio': None,
                    'pct_node_better': None,
                })

            node_vs_graph_results[ds_name][f'threshold_{thresh}'] = result
            if ratios:
                print(f"  {ds_name} thresh={thresh}: "
                      f"median_ratio={np.median(ratios):.2f}, "
                      f"pct_better={np.mean(np.array(ratios) > 1)*100:.1f}%")

    print(f"  Node vs Graph analysis complete in {time.time()-t0:.1f}s")
    return node_vs_graph_results, scatter_cache

node_vs_graph_results, scatter_cache = run_node_vs_graph_analysis(datasets)

=== Analysis 3: Node-Level vs Graph-Level Resolution ===
  Synthetic-aliased-pairs thresh=0.01: median_ratio=1.00, pct_better=39.4%
  Synthetic-aliased-pairs thresh=0.05: median_ratio=1.00, pct_better=42.4%
  Synthetic-aliased-pairs thresh=0.1: median_ratio=1.00, pct_better=51.5%
  ZINC-subset thresh=0.01: median_ratio=1.94, pct_better=100.0%
  ZINC-subset thresh=0.05: median_ratio=1.94, pct_better=100.0%
  ZINC-subset thresh=0.1: median_ratio=2.16, pct_better=100.0%
  Peptides-func thresh=0.01: median_ratio=61.88, pct_better=100.0%
  Peptides-func thresh=0.05: median_ratio=61.88, pct_better=100.0%
  Peptides-func thresh=0.1: median_ratio=64.33, pct_better=100.0%
  Peptides-struct thresh=0.01: median_ratio=61.88, pct_better=100.0%
  Peptides-struct thresh=0.05: median_ratio=61.88, pct_better=100.0%
  Peptides-struct thresh=0.1: median_ratio=64.33, pct_better=100.0%
  Node vs Graph analysis complete in 0.0s


## Analysis 4: Vandermonde Condition Number Analysis

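The Vandermonde matrix condition number controls how noise in random walk statistics is amplified into reconstruction error. We measure how the condition number grows with K, then add Gaussian noise to the K=20 walk moments of sampled nodes, recover the weights by least squares, and test whether higher condition numbers (from closer eigenvalues) go with larger errors.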
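
As a small illustration of why close eigenvalues hurt, the sketch below builds the K x n matrix of eigenvalue powers for two hypothetical three-eigenvalue spectra and compares their condition numbers. The helper names, the spectra, and the weights are assumptions for illustration; the matrix layout (row k holds the (k+1)-th powers) follows the analysis cell, while the random generator API differs from the notebook's `np.random.seed`.

```python
import numpy as np

def walk_vandermonde(eigs, K):
    """Return the K x n matrix with entry (k, j) = eigs[j] ** (k + 1)."""
    eigs = np.asarray(eigs, dtype=np.float64)
    powers = np.arange(1, K + 1)[:, None]
    return eigs[None, :] ** powers

def relative_recovery_error(eigs, weights, K, eps, rng):
    """Add Gaussian noise of scale eps to the K moments, then solve by least squares."""
    V = walk_vandermonde(eigs, K)
    weights = np.asarray(weights, dtype=np.float64)
    noisy = V @ weights + eps * rng.standard_normal(K)
    w_hat, *_ = np.linalg.lstsq(V, noisy, rcond=None)
    return float(np.linalg.norm(w_hat - weights) / np.linalg.norm(weights))

K = 20
weights = [0.5, 0.3, 0.2]        # hypothetical true weights
separated = [0.2, 0.5, 0.9]      # well-separated eigenvalues
crowded = [0.5, 0.501, 0.9]      # two nearly coincident eigenvalues

cond_separated = float(np.linalg.cond(walk_vandermonde(separated, K)))
cond_crowded = float(np.linalg.cond(walk_vandermonde(crowded, K)))

rng = np.random.default_rng(0)
err_separated = relative_recovery_error(separated, weights, K, 1e-4, rng)
err_crowded = relative_recovery_error(crowded, weights, K, 1e-4, rng)

print(f"separated: cond={cond_separated:.1e}, rel_error={err_separated:.1e}")
print(f"crowded:   cond={cond_crowded:.1e}, rel_error={err_crowded:.1e}")
```

The two nearly equal eigenvalues produce nearly parallel columns, which drives the smallest singular value toward zero and the condition number up.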

In [9]:
def run_vandermonde_analysis(datasets):
    """Analysis 4: Vandermonde Condition Number Analysis."""
    print("=== Analysis 4: Vandermonde Conditioning Analysis ===")
    t0 = time.time()
    vander_results = {}
    reconstruction_results = {}
    vander_scatter_cache = {}

    for ds_name, graphs in datasets.items():
        print(f"  Processing {ds_name} ({len(graphs)} graphs)...")

        # (a) Growth rate analysis
        growth_rates = []
        sri_values_K20 = []
        all_log_conds = {k: [] for k in K_VALUES}

        for g in graphs:
            conds = [g['vandermonde_cond'][kk] for kk in K_KEYS]
            log_conds = np.log10(np.clip(np.array(conds, dtype=np.float64), 1.0, MAX_COND))
            slope, _, _, _, _ = stats.linregress(K_VALUES, log_conds)
            growth_rates.append(float(slope))
            sri_values_K20.append(g['sri']['K=20'])
            for ki, kv in enumerate(K_VALUES):
                all_log_conds[kv].append(float(log_conds[ki]))

        growth_rates_arr = np.array(growth_rates)
        sri_values_arr = np.array(sri_values_K20)

        # (b) Correlation: growth rate vs SRI
        corr_growth_sri = None
        if len(growth_rates_arr) > 2:
            try:
                corr = stats.spearmanr(growth_rates_arr, sri_values_arr)
                corr_growth_sri = {
                    'spearman_rho': float(corr.statistic),
                    'p_value': float(corr.pvalue),
                }
            except Exception:
                pass

        vander_results[ds_name] = {
            'growth_rate_mean': float(np.mean(growth_rates_arr)),
            'growth_rate_median': float(np.median(growth_rates_arr)),
            'corr_growth_rate_vs_sri': corr_growth_sri,
            'cond_number_stats': {},
            'sri_quartile_conds': {},
        }

        for kv in K_VALUES:
            log_conds_arr = np.array(all_log_conds[kv])
            vander_results[ds_name]['cond_number_stats'][str(kv)] = {
                'median_log10': float(np.median(log_conds_arr)),
            }

        # SRI quartile analysis
        if len(sri_values_arr) >= 4:
            quartiles = np.percentile(sri_values_arr, [25, 50, 75])
            bins = [-np.inf] + list(quartiles) + [np.inf]
            for kv in K_VALUES:
                lc = np.array(all_log_conds[kv])
                q_means = []
                for qi in range(4):
                    mask = (sri_values_arr >= bins[qi]) & (sri_values_arr < bins[qi + 1])
                    if mask.sum() > 0:
                        q_means.append(float(np.mean(lc[mask])))
                    else:
                        q_means.append(0.0)
                vander_results[ds_name]['sri_quartile_conds'][str(kv)] = q_means

        print(f"    Growth rate: mean={np.mean(growth_rates_arr):.4f}")

        # (c) RWSE Reconstruction Error Experiment
        np.random.seed(42)
        n_sample = min(N_SUBSAMPLE, len(graphs))
        sample_indices = np.random.choice(len(graphs), n_sample, replace=False)

        errors_by_noise = {eps: [] for eps in NOISE_LEVELS}
        cond_nums_recon = []
        recon_scatter_conds = []
        recon_scatter_errs = []

        for idx in sample_indices:
            g = graphs[idx]
            local_spectral = g['local_spectral']
            for node_idx in range(min(NODES_PER_GRAPH, len(local_spectral))):
                measure = local_spectral[node_idx]
                if not measure or len(measure) < 2:
                    continue

                eigs = np.array([m[0] for m in measure], dtype=np.float64)
                weights_true = np.array([m[1] for m in measure], dtype=np.float64)
                if len(eigs) < 2 or np.linalg.norm(weights_true) < 1e-12:
                    continue

                K = 20
                V = np.array([[eig ** (k + 1) for eig in eigs] for k in range(K)],
                             dtype=np.float64)

                try:
                    cond = float(np.linalg.cond(V))
                except Exception:
                    cond = MAX_COND
                if not np.isfinite(cond):
                    cond = MAX_COND
                cond = min(cond, MAX_COND)
                cond_nums_recon.append(cond)

                m_true = V @ weights_true

                for eps in NOISE_LEVELS:
                    noise = eps * np.random.randn(K)
                    m_noisy = m_true + noise
                    try:
                        w_hat, _, _, _ = np.linalg.lstsq(V, m_noisy, rcond=None)
                        rel_error = float(np.linalg.norm(w_hat - weights_true) /
                                          max(np.linalg.norm(weights_true), 1e-12))
                        if not np.isfinite(rel_error):
                            rel_error = 1e10
                        errors_by_noise[eps].append(min(rel_error, 1e10))
                    except Exception:
                        errors_by_noise[eps].append(1e10)

                    if eps == (NOISE_LEVELS[-2] if len(NOISE_LEVELS) >= 2 else NOISE_LEVELS[0]):
                        recon_scatter_conds.append(np.log10(max(cond, 1.0)))
                        recon_scatter_errs.append(np.log10(max(errors_by_noise[eps][-1], 1e-15)))

        vander_scatter_cache[ds_name] = {
            'conds': recon_scatter_conds,
            'errors': recon_scatter_errs,
        }

        reconstruction_results[ds_name] = {}
        for eps in NOISE_LEVELS:
            errs = np.array(errors_by_noise[eps])
            if len(errs) > 0:
                reconstruction_results[ds_name][str(eps)] = {
                    'mean_error': float(np.mean(errs)),
                    'median_error': float(np.median(errs)),
                    'n_samples': len(errs),
                }

        # (d) Correlation: cond vs error
        ref_eps = NOISE_LEVELS[-2] if len(NOISE_LEVELS) >= 2 else NOISE_LEVELS[0]
        if len(cond_nums_recon) > 2 and len(errors_by_noise[ref_eps]) > 2:
            min_len = min(len(cond_nums_recon), len(errors_by_noise[ref_eps]))
            log_conds_r = np.log10(np.maximum(cond_nums_recon[:min_len], 1.0))
            log_errors_r = np.log10(np.maximum(errors_by_noise[ref_eps][:min_len], 1e-15))
            valid = np.isfinite(log_conds_r) & np.isfinite(log_errors_r)
            if valid.sum() > 2:
                try:
                    corr = stats.spearmanr(log_conds_r[valid], log_errors_r[valid])
                    reconstruction_results[ds_name]['corr_cond_vs_error'] = {
                        'spearman_rho': float(corr.statistic),
                        'p_value': float(corr.pvalue),
                    }
                except Exception:
                    pass

        print(f"    Reconstruction: {len(cond_nums_recon)} samples")

    print(f"  Vandermonde analysis complete in {time.time()-t0:.1f}s")
    return vander_results, reconstruction_results, vander_scatter_cache

vander_results, reconstruction_results, vander_scatter_cache = run_vandermonde_analysis(datasets)

=== Analysis 4: Vandermonde Conditioning Analysis ===
  Processing Synthetic-aliased-pairs (33 graphs)...
    Growth rate: mean=0.6710
    Reconstruction: 163 samples
  Processing ZINC-subset (20 graphs)...
    Growth rate: mean=0.6491
    Reconstruction: 100 samples
  Processing Peptides-func (15 graphs)...
    Growth rate: mean=0.5359
    Reconstruction: 75 samples
  Processing Peptides-struct (15 graphs)...
    Growth rate: mean=0.5359
    Reconstruction: 75 samples
  Vandermonde analysis complete in 0.1s


## Analysis 5: Eigenvalue Clustering Baseline

As a simple baseline, we count distinct eigenvalue clusters (a gap of more than 0.01 between consecutive sorted eigenvalues starts a new cluster) and record spectral gaps, which gives context for the SRI-based diagnostics.
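
The cluster count can be sketched in vectorized form as follows. `count_clusters` is a hypothetical helper and the eigenvalues are toy values; the 0.01 tolerance is the one used in the baseline cell.

```python
import numpy as np

def count_clusters(eigenvalues, tol=0.01):
    """Count groups of sorted eigenvalues separated by gaps larger than tol."""
    eigs = np.sort(np.asarray(eigenvalues, dtype=np.float64))
    if eigs.size == 0:
        return 0
    # Every gap above tol starts a new cluster; the first eigenvalue opens cluster 1.
    return 1 + int(np.sum(np.diff(eigs) > tol))

# Toy spectrum: {0.0, 0.005}, {0.5, 0.505, 0.508}, {1.2}
toy = [0.0, 0.005, 0.5, 0.505, 0.508, 1.2]
n_clusters = count_clusters(toy)
print(n_clusters)
```

Because only consecutive gaps are compared, a long chain of closely spaced eigenvalues counts as one cluster even if its endpoints are far apart.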

In [10]:
def run_eigenvalue_clustering_baseline(datasets):
    """Baseline analysis: eigenvalue cluster counts and spectral gaps."""
    print("=== Analysis 5: Eigenvalue Clustering Baseline ===")
    t0 = time.time()
    baseline_results = {}

    for ds_name, graphs in datasets.items():
        n_clusters_list = []
        spectral_gaps = []
        max_gaps = []

        for g in graphs:
            eigs = np.sort(g['eigenvalues'])
            n = len(eigs)
            if n < 2:
                continue

            clusters = 1
            for i in range(1, n):
                if abs(eigs[i] - eigs[i - 1]) > 0.01:
                    clusters += 1
            n_clusters_list.append(clusters)

            diffs = np.diff(eigs)
            nonzero = diffs[diffs > 1e-10]
            if len(nonzero) > 0:
                spectral_gaps.append(float(nonzero[0]))
                max_gaps.append(float(np.max(nonzero)))
            else:
                spectral_gaps.append(0.0)
                max_gaps.append(0.0)

        baseline_results[ds_name] = {
            'median_clusters': float(np.median(n_clusters_list)) if n_clusters_list else None,
            'median_spectral_gap': float(np.median(spectral_gaps)) if spectral_gaps else None,
            'median_max_gap': float(np.median(max_gaps)) if max_gaps else None,
            'n_graphs': len(n_clusters_list),
        }
        print(f"  {ds_name}: median_clusters={baseline_results[ds_name]['median_clusters']}")

    print(f"  Baseline analysis complete in {time.time()-t0:.1f}s")
    return baseline_results

baseline_results = run_eigenvalue_clustering_baseline(datasets)

=== Analysis 5: Eigenvalue Clustering Baseline ===
  Synthetic-aliased-pairs: median_clusters=7.0
  ZINC-subset: median_clusters=21.0
  Peptides-func: median_clusters=107.0
  Peptides-struct: median_clusters=107.0
  Baseline analysis complete in 0.0s


## Results Summary and Visualization

The key findings are shown as a summary table, followed by plots of the SRI distributions, SRI versus walk length, effective rank, and Vandermonde conditioning.

In [11]:
# === Summary Table ===
print("=" * 80)
print("SPECTRAL DIAGNOSTICS SUMMARY")
print("=" * 80)

header = f"{'Dataset':<25} {'Sparsity':<12} {'SRI<1 %':<10} {'Node>Graph %':<14} {'Clusters':<10}"
print(header)
print("-" * 80)

for ds_name in sorted(datasets.keys()):
    sp = sparsity_results.get(ds_name, {})
    sr = sri_results.get(ds_name, {})
    ng = node_vs_graph_results.get(ds_name, {}).get('threshold_0.05', {})
    bl = baseline_results.get(ds_name, {})

    sparsity_str = "YES" if sp.get('sparsity_confirmed') else "NO"
    sri_pct = f"{sr.get('pct_below_1', 0):.1f}%"
    node_pct = f"{ng.get('pct_node_better', 0):.1f}%" if ng.get('pct_node_better') is not None else "N/A"
    clusters = f"{bl.get('median_clusters', 0):.0f}" if bl.get('median_clusters') is not None else "N/A"

    print(f"{ds_name:<25} {sparsity_str:<12} {sri_pct:<10} {node_pct:<14} {clusters:<10}")

print("=" * 80)

# === Visualization ===
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
ds_names = sorted(datasets.keys())
colors = sns.color_palette("husl", len(ds_names))

# Plot 1: SRI Distribution (log scale)
ax = axes[0, 0]
for i, ds in enumerate(ds_names):
    vals = sri_by_dataset[ds]['sri_20']
    vals_pos = vals[vals > 0]
    if len(vals_pos) > 0:
        ax.hist(np.log10(vals_pos), bins=20, alpha=0.5, label=ds, color=colors[i])
ax.axvline(x=0, color='red', linestyle='--', alpha=0.7, label='SRI=1')
ax.set_xlabel('log10(SRI at K=20)')
ax.set_ylabel('Count')
ax.set_title('SRI Distribution (K=20)')
ax.legend(fontsize=8)

# Plot 2: SRI vs K
ax = axes[0, 1]
for i, ds_name in enumerate(ds_names):
    graphs = datasets[ds_name]
    median_sri = []
    for k_key in K_KEYS:
        sri_vals = [g['sri'][k_key] for g in graphs]
        median_sri.append(float(np.median(sri_vals)))
    ax.plot(K_VALUES, median_sri, '-o', label=ds_name, color=colors[i])
ax.axhline(y=1.0, color='red', linestyle='--', label='SRI=1 (resolution limit)')
ax.set_xlabel('Walk Length K')
ax.set_ylabel('Median SRI')
ax.set_title('SRI vs Walk Length K')
ax.set_yscale('log')
ax.legend(fontsize=8)

# Plot 3: Sparsity - Effective Rank
ax = axes[1, 0]
data_1 = [sparsity_cache[ds]['eff_rank_1pct'] for ds in ds_names if ds in sparsity_cache]
labels = [ds for ds in ds_names if ds in sparsity_cache]
if data_1:
    vp = ax.violinplot(data_1, positions=range(len(labels)), showmeans=True, showmedians=True)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels([n.replace('-', '\n') for n in labels], fontsize=8)
    ax.axhline(y=10, color='red', linestyle='--', alpha=0.7, label='Top-10 ceiling')
    ax.set_ylabel('Effective Rank')
    ax.set_title('Effective Rank (1% threshold)')
    ax.legend(fontsize=8)

# Plot 4: Vandermonde Condition Number Heatmap
ax = axes[1, 1]
data_mat = []
ylabels = []
for ds in ds_names:
    row = []
    for kv in K_VALUES:
        cs = vander_results.get(ds, {}).get('cond_number_stats', {}).get(str(kv), {})
        row.append(cs.get('median_log10', 0))
    data_mat.append(row)
    ylabels.append(ds)
if data_mat:
    data_arr = np.array(data_mat)
    sns.heatmap(data_arr, annot=True, fmt='.1f',
                xticklabels=[f'K={k}' for k in K_VALUES],
                yticklabels=ylabels, cmap='YlOrRd', ax=ax)
    ax.set_title('Median Vandermonde Cond. Number (log10)')

fig.suptitle('Spectral Diagnostics Overview', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("\nVisualization complete.")

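================================================================================
SPECTRAL DIAGNOSTICS SUMMARY
================================================================================
Dataset                   Sparsity     SRI<1 %    Node>Graph %   Clusters
--------------------------------------------------------------------------------
Peptides-func             YES          100.0%     100.0%         107
Peptides-struct           YES          100.0%     100.0%         107
Synthetic-aliased-pairs   NO           0.0%       42.4%          7
ZINC-subset               NO           40.0%      100.0%         21
================================================================================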



Visualization complete.
