# QuartumSE Complete Benchmark Suite

**The canonical benchmark notebook** for classical shadows vs direct measurement.

This notebook consolidates ALL benchmarking functionality:
- All 8 Measurements Bible tasks
- All circuits from research workstreams (S, C, O, B, M)
- Enhanced statistical analysis (bootstrap, K-S tests, crossover)
- Optional noise sensitivity sweep (Task 7)
- Locality breakdown and cost-normalized metrics
- Cross-circuit consolidated comparison

## Research Workstreams

| Workstream | Focus | Circuits |
|------------|-------|----------|
| **S** | Shadows Core | GHZ, Bell pairs, Clifford, Ising |
| **C** | Chemistry | H2, LiH, BeH2 molecular ansatze |
| **O** | Optimization | QAOA MAX-CUT |
| **B** | Benchmarking | RB/XEB random circuits |
| **M** | Metrology | GHZ phase sensing |

In [1]:
# =============================================================================
# SETUP
# =============================================================================
import sys
sys.path.insert(0, '../src')

import numpy as np
from collections import Counter, defaultdict
from qiskit import QuantumCircuit
from scipy import stats
import json
from pathlib import Path
from datetime import datetime

from quartumse import (
    run_benchmark_suite,
    BenchmarkMode,
    BenchmarkSuiteConfig,
    Observable,
    ObservableSet,
)

# Timing model for hardware time estimates
from quartumse.analysis.quantum_time_model import (
    HardwareTimingProfile,
    IBM_HERON,
)

# Suite classes and builders
from quartumse.observables.suites import (
    ObservableSuite,
    ObjectiveType,
    SuiteType,
    # Circuit-specific suite builders
    make_ghz_suites,
    make_bell_suites,
    make_ising_suites,
    make_qaoa_ring_suites,
    make_phase_sensing_suites,
    make_chemistry_suites,
    # Generic builders
    make_stress_suite,
    make_posthoc_library,
    make_commuting_suite,
)

print("Setup complete!")
print("Suite types available:", [t.value for t in SuiteType])
print(f"Default HW profile: {IBM_HERON.profile_id} (2Q gate: {IBM_HERON.gate_2q_ns}ns)")


Setup complete!
Suite types available: ['workload', 'stress', 'posthoc', 'commuting', 'diagnostic']
Default HW profile: ibm_heron_r2 (2Q gate: 300.0ns)


---

## 1. Circuit and Observable Definitions

In [2]:
# =============================================================================
# CIRCUIT BUILDERS
# =============================================================================

def build_ghz(n_qubits: int) -> QuantumCircuit:
    """GHZ state: |00...0> + |11...1> / sqrt(2)"""
    qc = QuantumCircuit(n_qubits, name=f'GHZ_{n_qubits}q')
    qc.h(0)
    for i in range(1, n_qubits):
        qc.cx(i - 1, i)
    return qc

def build_bell_pairs(n_pairs: int) -> QuantumCircuit:
    """Parallel Bell pairs."""
    n_qubits = 2 * n_pairs
    qc = QuantumCircuit(n_qubits, name=f'Bell_{n_pairs}pairs')
    for i in range(n_pairs):
        qc.h(2 * i)
        qc.cx(2 * i, 2 * i + 1)
    return qc

def build_random_clifford(n_qubits: int, depth: int, seed: int = 42) -> QuantumCircuit:
    """Random Clifford circuit."""
    rng = np.random.default_rng(seed)
    qc = QuantumCircuit(n_qubits, name=f'Clifford_{n_qubits}q_d{depth}')
    clifford_gates = ['h', 's', 'sdg', 'x', 'y', 'z']
    for _ in range(depth):
        for q in range(n_qubits):
            gate = rng.choice(clifford_gates)
            getattr(qc, gate)(q)
        for q in range(0, n_qubits - 1, 2):
            if rng.random() > 0.3:
                qc.cx(q, q + 1)
    return qc

def build_ising_trotter(n_qubits: int, steps: int = 3, dt: float = 0.5) -> QuantumCircuit:
    """Trotterized transverse-field Ising model."""
    qc = QuantumCircuit(n_qubits, name=f'Ising_{n_qubits}q_t{steps}')
    J, h = 1.0, 0.5
    for q in range(n_qubits):
        qc.h(q)
    for _ in range(steps):
        for q in range(n_qubits - 1):
            qc.cx(q, q + 1)
            qc.rz(2 * J * dt, q + 1)
            qc.cx(q, q + 1)
        for q in range(n_qubits):
            qc.rx(2 * h * dt, q)
    return qc

def build_h2_ansatz(theta: float = 0.5) -> QuantumCircuit:
    """H2 molecule ansatz (4 qubits)."""
    qc = QuantumCircuit(4, name='H2_ansatz')
    qc.x(0); qc.x(1)
    qc.cx(1, 2); qc.ry(theta, 2); qc.cx(1, 2)
    qc.cx(0, 3); qc.ry(theta / 2, 3); qc.cx(0, 3)
    return qc

def build_lih_ansatz(theta: float = 0.5) -> QuantumCircuit:
    """LiH molecule ansatz (6 qubits)."""
    qc = QuantumCircuit(6, name='LiH_ansatz')
    for i in range(4): qc.x(i)
    qc.cx(3, 4); qc.ry(theta, 4); qc.cx(3, 4)
    qc.cx(2, 5); qc.ry(theta / 2, 5); qc.cx(2, 5)
    return qc

def build_beh2_ansatz(theta: float = 0.5) -> QuantumCircuit:
    """BeH2 molecule ansatz (8 qubits)."""
    qc = QuantumCircuit(8, name='BeH2_ansatz')
    for i in range(6): qc.x(i)
    qc.cx(5, 6); qc.ry(theta, 6); qc.cx(5, 6)
    qc.cx(4, 7); qc.ry(theta / 2, 7); qc.cx(4, 7)
    return qc

def build_qaoa_maxcut_ring(n_qubits: int, p: int = 1, gamma: float = 0.5, beta: float = 0.5) -> QuantumCircuit:
    """QAOA for MAX-CUT on ring graph."""
    qc = QuantumCircuit(n_qubits, name=f'QAOA_ring_{n_qubits}q_p{p}')
    for q in range(n_qubits): qc.h(q)
    for _ in range(p):
        for q in range(n_qubits):
            q_next = (q + 1) % n_qubits
            qc.cx(q, q_next); qc.rz(2 * gamma, q_next); qc.cx(q, q_next)
        for q in range(n_qubits): qc.rx(2 * beta, q)
    return qc

def build_ghz_phase_sensing(n_qubits: int, phi: float = 0.1) -> QuantumCircuit:
    """GHZ state with phase encoding."""
    qc = build_ghz(n_qubits)
    qc.name = f'GHZ_phase_{n_qubits}q'
    for q in range(n_qubits): qc.rz(phi, q)
    return qc

def build_xeb_circuit(n_qubits: int, depth: int, seed: int = 42) -> QuantumCircuit:
    """Cross-Entropy Benchmarking random circuit."""
    rng = np.random.default_rng(seed)
    qc = QuantumCircuit(n_qubits, name=f'XEB_{n_qubits}q_d{depth}')
    gates_1q = ['h', 'x', 'y', 'z', 's', 't', 'sdg', 'tdg']
    for d in range(depth):
        for q in range(n_qubits):
            getattr(qc, rng.choice(gates_1q))(q)
        for q in range(d % 2, n_qubits - 1, 2):
            qc.cx(q, q + 1)
    return qc

print("Circuit builders defined!")

Circuit builders defined!
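
As a sanity check on the builders above, the ideal expectation values of a few Pauli strings can be worked out by hand, without a simulator. The following is a minimal pure-Python sketch, not part of QuartumSE or Qiskit: the helpers `apply_pauli` and `expectation` are hypothetical names introduced here, and the Bell and GHZ states are written directly as amplitude dictionaries rather than produced by `build_bell_pairs` or `build_ghz`.

```python
import math

def apply_pauli(pauli: str, bits: tuple) -> tuple:
    """Apply a Pauli string to a computational basis state.

    Returns (phase, new_bits) such that P|bits> = phase * |new_bits>.
    Hypothetical helper for illustration only.
    """
    phase = 1 + 0j
    out = list(bits)
    for i, (p, b) in enumerate(zip(pauli, bits)):
        if p == "X":
            out[i] = 1 - b                      # X flips the bit
        elif p == "Y":
            out[i] = 1 - b                      # Y flips the bit ...
            phase *= 1j if b == 0 else -1j      # ... with Y|0> = i|1>, Y|1> = -i|0>
        elif p == "Z":
            phase *= -1 if b == 1 else 1        # Z adds a sign on |1>
        elif p != "I":
            raise ValueError(f"Unknown Pauli label: {p!r}")
    return phase, tuple(out)

def expectation(state: dict, pauli: str) -> float:
    """<state| P |state> for a state given as {bit tuple: amplitude}."""
    total = 0j
    for bits, amp in state.items():
        phase, new_bits = apply_pauli(pauli, bits)
        total += complex(state.get(new_bits, 0j)).conjugate() * phase * amp
    return total.real

a = 1 / math.sqrt(2)
bell = {(0, 0): a, (1, 1): a}          # one Bell pair: (|00> + |11>) / sqrt(2)
ghz3 = {(0, 0, 0): a, (1, 1, 1): a}    # 3-qubit GHZ state

for label in ["ZZ", "XX", "YY", "ZI"]:
    print(f"Bell  <{label}> = {expectation(bell, label):+.3f}")
for label in ["XXX", "YYX", "ZZI", "ZII"]:
    print(f"GHZ-3 <{label}> = {expectation(ghz3, label):+.3f}")
```

For the Bell pair, the two-qubit correlators ZZ and XX evaluate to +1 and YY to -1, while the single-qubit term ZI vanishes. This is why pair correlations, rather than single-qubit diagnostics, carry the signal for these states.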


In [3]:
# =============================================================================
# OBSERVABLE SUITES (from quartumse.observables.suites module)
# =============================================================================
# 
# The suite system provides task-aligned observable sets for each circuit family.
# Each circuit gets multiple suites:
#   - workload: What practitioners actually measure (energy, cost, fidelity)
#   - stress: Large sets (1000+) for testing protocol scaling  
#   - commuting: All-commuting baselines (where grouped measurement wins)
#   - posthoc: Libraries for "measure once, query later" tests
#
# Suite builders:
#   make_ghz_suites(n)           -> stabilizers, stress, commuting, posthoc
#   make_bell_suites(n_pairs)    -> pair correlations, diagnostics, stress
#   make_ising_suites(n)         -> energy (weighted), correlations, stress
#   make_qaoa_ring_suites(n)     -> cost (weighted, with wrap edge!), stress, posthoc
#   make_phase_sensing_suites(n) -> phase signal (X^n, Y^n), stabilizers, stress
#   make_chemistry_suites(n)     -> energy (weighted), stress

# Demo: Show what suites are generated for each circuit type
print("="*70)
print("AVAILABLE SUITES BY CIRCUIT TYPE")
print("="*70)

demo_configs = [
    ("GHZ-4", make_ghz_suites(4)),
    ("Bell-2pairs", make_bell_suites(2)),
    ("Ising-4", make_ising_suites(4)),
    ("QAOA-5-ring", make_qaoa_ring_suites(5)),
    ("Phase-3", make_phase_sensing_suites(3)),
]

for name, suites in demo_configs:
    print(f"\n{name}:")
    for suite_name, suite in suites.items():
        obj = "weighted" if suite.objective == ObjectiveType.WEIGHTED_SUM else "per-obs"
        comm = suite.commutation_analysis()
        comm_str = "FULLY COMMUTING" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"
        print(f"  {suite_name:30s} | {suite.n_observables:4d} obs | {obj:8s} | {comm_str}")

print("\n" + "="*70)
print("KEY INSIGHT: Commuting suites (e.g., QAOA cost) favor grouped measurement")
print("             Non-commuting suites (e.g., stress) may favor shadows")
print("="*70)

AVAILABLE SUITES BY CIRCUIT TYPE



GHZ-4:
  workload_stabilizers           |    7 obs | per-obs  | 2 groups
  stress_random_1000             |  255 obs | per-obs  | 81 groups
  commuting_z_only               |   11 obs | per-obs  | FULLY COMMUTING
  posthoc_library                |  255 obs | per-obs  | 81 groups

Bell-2pairs:
  workload_pair_correlations     |    6 obs | per-obs  | 3 groups
  diagnostics_single_qubit       |    4 obs | per-obs  | FULLY COMMUTING
  diagnostics_cross_pair         |    1 obs | per-obs  | FULLY COMMUTING
  stress_random_1000             |  255 obs | per-obs  | 81 groups

Ising-4:
  workload_energy                |    7 obs | weighted | 2 groups
  workload_correlations          |    6 obs | per-obs  | FULLY COMMUTING
  stress_random_1000             |  255 obs | per-obs  | 81 groups

QAOA-5-ring:
  workload_cost                  |    5 obs | weighted | FULLY COMMUTING
  commuting_cost                 |    5 obs | weighted | FULLY COMMUTING
  stress_random_1000             |  705 obs | per-obs  | 207 groups
  posthoc_library                | 1018 obs | per-obs  | 243 groups

Phase-3:
  workload_phase_signal          |    2 obs | per-obs  | 2 groups
  workload_stabilizers           |    4 obs | per-obs  | 3 groups
  stress_random_500              |   63 obs | per-obs  | 27 groups

KEY INSIGHT: Commuting suites (e.g., QAOA cost) favor grouped measurement
             Non-commuting suites (e.g., stress) may favor shadows
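
The counts in this output follow a pattern. The 255- and 63-observable suites match 4^n - 1, the number of non-identity Pauli strings on n = 4 and n = 3 qubits. The group counts 81, 27 and 243 match 3^n, which is what qubit-wise commuting (QWC) grouping would give. This notebook does not show how `commutation_analysis()` groups observables, so the following is only a sketch under the QWC assumption. The function names are hypothetical, and the grouping is a simple first-fit greedy pass, not the library's algorithm.

```python
from itertools import product

def qubitwise_commute(p: str, q: str) -> bool:
    """True if, on every qubit, the two labels are equal or one is identity."""
    return all(a == "I" or b == "I" or a == b for a, b in zip(p, q))

def greedy_qwc_groups(paulis: list) -> list:
    """First-fit grouping into qubit-wise commuting sets (illustrative only).

    Strings are visited heaviest first (fewest identities), so full-weight
    strings seed the groups and lighter strings slot into them.
    """
    groups = []
    for p in sorted(paulis, key=lambda s: s.count("I")):
        for group in groups:
            if all(qubitwise_commute(p, q) for q in group):
                group.append(p)
                break
        else:
            groups.append([p])   # no compatible group found: start a new one
    return groups

def all_nonidentity_paulis(n_qubits: int) -> list:
    """All Pauli strings on n_qubits except the all-identity string."""
    return ["".join(t) for t in product("IXYZ", repeat=n_qubits) if set(t) != {"I"}]

for n in (3, 4):
    paulis = all_nonidentity_paulis(n)
    groups = greedy_qwc_groups(paulis)
    print(f"{n} qubits: {len(paulis)} Pauli strings -> {len(groups)} QWC groups")

# A Z-only set collapses into a single group: one measurement basis covers it.
print(len(greedy_qwc_groups(["ZZII", "IZZI", "IIZZ"])), "group for a Z-only set")
```

QWC is stricter than general commutation: `XX` and `ZZ` commute as operators but are not qubit-wise compatible, so a grouping based on full commutation could yield fewer groups than this sketch does.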


---

## 2. Configuration

In [4]:
# =============================================================================
# CIRCUIT AND SUITE SELECTION
# =============================================================================

# Which circuits to benchmark
CIRCUITS_TO_RUN = {
    # WORKSTREAM S: SHADOWS CORE
    'S-GHZ-4':    False,    # 4-qubit GHZ          -- DONE
    'S-GHZ-5':    False,    # 5-qubit GHZ          -- DONE
    'S-BELL-2':   True,     # 2 Bell pairs (4 qubits) -- QUICK TEST
    'S-BELL-3':   False,    # 3 Bell pairs (6 qubits) -- DONE
    'S-ISING-4':  False,    # 4-qubit Ising
    'S-ISING-6':  False,    # 6-qubit Ising
    # WORKSTREAM C: CHEMISTRY
    'C-H2':       False,    # H2 molecule (4 qubits) -- DONE
    'C-LiH':      False,    # LiH molecule (6 qubits)
    # WORKSTREAM O: OPTIMIZATION
    'O-QAOA-5':   False,    # QAOA 5q ring
    'O-QAOA-7':   False,    # QAOA 7q ring
    # WORKSTREAM M: METROLOGY
    'M-PHASE-3':  False,    # 3-qubit phase sensing
    'M-PHASE-4':  False,    # 4-qubit phase sensing
}

SUITES_TO_RUN = {
    'merged': True,         # Merged (all unique observables)
    'workload': True,       # Task-aligned workload suites (energy, cost, etc.)
    'stress': True,         # Large random observable set
    'commuting': True,      # All-commuting baseline
    'posthoc': True,        # Post-hoc query library
    'diagnostics': True,    # System diagnostics
}

# Number of observables for stress/posthoc suites
N_STRESS_OBSERVABLES = 100
N_POSTHOC_OBSERVABLES = 200

# Count enabled
enabled_circuits = [k for k, v in CIRCUITS_TO_RUN.items() if v]
enabled_suites = [k for k, v in SUITES_TO_RUN.items() if v]

print(f"Circuits to run: {len(enabled_circuits)} / {len(CIRCUITS_TO_RUN)}")
for c in enabled_circuits:
    print(f"  + {c}")

print()
print(f"Suite types to run: {enabled_suites}")
if SUITES_TO_RUN.get('stress'):
    print(f"  Stress observables: {N_STRESS_OBSERVABLES}")
if SUITES_TO_RUN.get('posthoc'):
    print(f"  Posthoc observables: {N_POSTHOC_OBSERVABLES}")


Circuits to run: 1 / 12
  + S-BELL-2

Suite types to run: ['merged', 'workload', 'stress', 'commuting', 'posthoc', 'diagnostics']
  Stress observables: 100
  Posthoc observables: 200


In [5]:
# =============================================================================
# BENCHMARK CONFIGURATION
# =============================================================================

# Per-protocol timeout: stop any single protocol.run() that exceeds this limit.
# The protocol will finalize with partial data and flag timed_out=True.
# Set to None to disable (run to completion).
TIMEOUT_PER_PROTOCOL_S = 1  # 1 second per protocol run; None to disable

# Hardware timing profile for estimating real-device execution time.
# This does NOT affect simulation - it adds an est_quantum_hw_s column
# to results showing what the run WOULD cost on real hardware.
# Set to None to skip hardware time estimation.
HW_TIMING_PROFILE = IBM_HERON  # IBM Heron R2 defaults; None to disable

CONFIG = BenchmarkSuiteConfig(
    mode=BenchmarkMode.ANALYSIS,      # Full analysis with all features
    n_shots_grid=[10000],
    n_replicates=10,                  # Increase to 20+ for publication
    seed=42,
    epsilon=0.05,                     # Target precision
    delta=0.05,                       # Failure probability
    shadows_protocol_id="classical_shadows_v0",
    baseline_protocol_id="direct_grouped",
    output_base_dir="benchmark_results",
    timeout_per_protocol_s=TIMEOUT_PER_PROTOCOL_S,
    hw_timing_profile=HW_TIMING_PROFILE,
)

# Optional: noise sensitivity sweep (Task 7)
RUN_NOISE_SWEEP = True  # Re-run the benchmark once per noise profile below
NOISE_PROFILES = ['ideal', 'readout_1e-2', 'depol_low']  # Used only when RUN_NOISE_SWEEP is True

print(f"Mode: {CONFIG.mode.value}")
print(f"Shots: {CONFIG.n_shots_grid}")
print(f"Replicates: {CONFIG.n_replicates}")
print(f"Timeout per protocol: {TIMEOUT_PER_PROTOCOL_S}s" if TIMEOUT_PER_PROTOCOL_S else "Timeout: disabled")
print(f"HW timing profile: {HW_TIMING_PROFILE.profile_id}" if HW_TIMING_PROFILE else "HW timing: disabled")
print(f"Noise sweep: {RUN_NOISE_SWEEP}")

Mode: analysis
Shots: [10000]
Replicates: 10
Timeout per protocol: 1s
HW timing profile: ibm_heron_r2
Noise sweep: True
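
To give the `epsilon` and `delta` settings some scale, a textbook Hoeffding bound can be compared against the shot budget. The sketch below is a generic calculation and makes no claim about how QuartumSE uses these parameters internally. It assumes each shot yields a ±1 outcome for the observable, that every shot contributes to every observable, and it applies a union bound across observables. The function name `hoeffding_shots` is hypothetical.

```python
import math

def hoeffding_shots(epsilon: float, delta: float, n_observables: int = 1) -> int:
    """Shots sufficient for |estimate - mean| < epsilon on every observable.

    Uses Hoeffding's inequality for outcomes in [-1, 1]:
        P(|mean_hat - mean| >= epsilon) <= 2 * exp(-N * epsilon**2 / 2)
    and splits the failure probability delta evenly across observables
    (union bound). Illustrative bound only, not QuartumSE's estimator.
    """
    per_observable_delta = delta / n_observables
    return math.ceil(2.0 * math.log(2.0 / per_observable_delta) / epsilon**2)

single = hoeffding_shots(epsilon=0.05, delta=0.05)
joint = hoeffding_shots(epsilon=0.05, delta=0.05, n_observables=90)
print(f"One observable:  {single} shots")
print(f"90 observables:  {joint} shots (union bound)")
```

Both figures sit below the 10000 shots in `n_shots_grid`, but only under the assumption that each observable sees the full shot count. Grouped direct measurement divides shots among groups, so the per-observable count there is smaller.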


In [6]:
# =============================================================================
# BUILD CIRCUITS AND SUITES
# =============================================================================

from quartumse.observables.core import Observable, ObservableSet

def filter_suites(all_suites: dict, enabled_types: dict) -> dict:
    """Filter suites based on enabled suite types."""
    filtered = {}
    for name, suite in all_suites.items():
        suite_type = suite.suite_type.value
        # Check if this suite type is enabled
        if enabled_types.get(suite_type, False):
            filtered[name] = suite
        # Also match on the suite name (e.g., 'workload_energy' matches 'workload').
        # The 'diagnostics' key relies on this path: the SuiteType value is 'diagnostic'.
        elif any(enabled_types.get(t, False) and t in name for t in enabled_types):
            filtered[name] = suite
    return filtered

def merge_suites_for_circuit(suites: dict[str, ObservableSuite], circuit_id: str) -> tuple[ObservableSet, dict]:
    """Merge multiple suites into one ObservableSet, deduplicating by pauli_string.
    
    Returns:
        merged_set: ObservableSet with all unique observables, tagged with source suites
        suite_mapping: dict mapping observable_id -> list of source suite names
    """
    # Collect all observables, tracking which suites they came from
    pauli_to_obs = {}  # pauli_string -> (Observable, set of suite names)
    
    for suite_name, suite in suites.items():
        for obs in suite.observables:
            key = obs.pauli_string
            if key in pauli_to_obs:
                # Observable already exists - add this suite to its sources
                pauli_to_obs[key][1].add(suite_name)
            else:
                # New observable - create entry with this suite as source
                pauli_to_obs[key] = (obs, {suite_name})
    
    # Build merged observable list with suite tags in metadata
    merged_observables = []
    suite_mapping = {}
    
    for pauli_string, (obs, source_suites) in pauli_to_obs.items():
        # Create new observable with suite membership in metadata
        new_metadata = dict(obs.metadata) if obs.metadata else {}
        new_metadata['source_suites'] = sorted(source_suites)
        
        merged_obs = Observable(
            pauli_string=obs.pauli_string,
            coefficient=obs.coefficient,
            observable_id=obs.observable_id,
            group_id=obs.group_id,
            metadata=new_metadata,
        )
        merged_observables.append(merged_obs)
        suite_mapping[merged_obs.observable_id] = sorted(source_suites)
    
    # Create merged ObservableSet (n_qubits is derived from observables)
    merged_set = ObservableSet(
        observables=merged_observables,
        observable_set_id=f"{circuit_id}_merged",
        generator_id="suite_merger",
        generator_version="1.0.0",
        metadata={
            'merged_from': list(suites.keys()),
            'original_counts': {name: suite.n_observables for name, suite in suites.items()},
        },
    )
    
    return merged_set, suite_mapping

# Circuit definitions: circuit_id -> (circuit, suites dict); both are built eagerly here
CIRCUIT_DEFS = {
    # Workstream S: Shadows Core
    'S-GHZ-4':   (build_ghz(4), make_ghz_suites(4)),
    'S-GHZ-5':   (build_ghz(5), make_ghz_suites(5)),
    'S-BELL-2':  (build_bell_pairs(2), make_bell_suites(2)),
    'S-BELL-3':  (build_bell_pairs(3), make_bell_suites(3)),
    'S-ISING-4': (build_ising_trotter(4, 3), make_ising_suites(4)),
    'S-ISING-6': (build_ising_trotter(6, 3), make_ising_suites(6)),
    # Workstream C: Chemistry
    'C-H2':      (build_h2_ansatz(), make_chemistry_suites(4, molecule_name='H2')),
    'C-LiH':     (build_lih_ansatz(), make_chemistry_suites(6, molecule_name='LiH')),
    # Workstream O: Optimization
    'O-QAOA-5':  (build_qaoa_maxcut_ring(5, p=1), make_qaoa_ring_suites(5)),
    'O-QAOA-7':  (build_qaoa_maxcut_ring(7, p=1), make_qaoa_ring_suites(7)),
    # Workstream M: Metrology
    'M-PHASE-3': (build_ghz_phase_sensing(3, 0.1), make_phase_sensing_suites(3)),
    'M-PHASE-4': (build_ghz_phase_sensing(4, 0.1), make_phase_sensing_suites(4)),
}

# Build selected circuits with filtered suites
circuits = {}
for cid, run in CIRCUITS_TO_RUN.items():
    if run and cid in CIRCUIT_DEFS:
        circ, all_suites = CIRCUIT_DEFS[cid]
        filtered = filter_suites(all_suites, SUITES_TO_RUN)
        
        if filtered:
            circuits[cid] = {
                'circuit': circ,
                'suites': filtered,
                'n_qubits': circ.num_qubits,
            }

# Override stress suites with configurable observable count
if N_STRESS_OBSERVABLES != 1000:
    for cid, info in circuits.items():
        stress_keys = [k for k in info['suites'] if 'stress' in k]
        for key in stress_keys:
            info['suites'][key] = make_stress_suite(
                n_qubits=info['n_qubits'],
                n_observables=N_STRESS_OBSERVABLES,
                seed=42,  # Same seed as original
            )

# Override posthoc suites with configurable observable count
# Use different seed (1042) to avoid overlap with stress observables
if N_POSTHOC_OBSERVABLES != 2000:
    for cid, info in circuits.items():
        posthoc_keys = [k for k in info['suites'] if 'posthoc' in k]
        for key in posthoc_keys:
            info['suites'][key] = make_posthoc_library(
                n_qubits=info['n_qubits'],
                n_observables=N_POSTHOC_OBSERVABLES,
                seed=1042,  # Different seed to avoid overlap with stress
            )

# Check for posthoc/stress redundancy: skip posthoc if it has same or fewer observables than stress
posthoc_skipped = []
if SUITES_TO_RUN.get('stress') and SUITES_TO_RUN.get('posthoc'):
    for cid, info in circuits.items():
        stress_keys = [k for k in info['suites'] if 'stress' in k]
        posthoc_keys = [k for k in info['suites'] if 'posthoc' in k]
        
        if stress_keys and posthoc_keys:
            # Get actual observable counts
            stress_obs = max(info['suites'][k].n_observables for k in stress_keys)
            posthoc_obs = max(info['suites'][k].n_observables for k in posthoc_keys)
            
            # If posthoc has same or fewer observables than stress, it's redundant
            if posthoc_obs <= stress_obs:
                for key in posthoc_keys:
                    del info['suites'][key]
                    posthoc_skipped.append((cid, key, stress_obs, posthoc_obs))

# =============================================================================
# MERGE SUITES TO AVOID REDUNDANT SAMPLING
# =============================================================================
# When multiple suites are enabled for a circuit, merge them into one set
# to sample the circuit only once. Results are tagged by source suite.

merge_enabled = True  # Set to False to disable merging (run suites separately)

for cid, info in circuits.items():
    if merge_enabled and len(info['suites']) > 1:
        # Merge all suites for this circuit
        merged_set, suite_mapping = merge_suites_for_circuit(info['suites'], cid)
        
        # Store merge info
        info['merged'] = True
        info['merged_observable_set'] = merged_set
        info['suite_mapping'] = suite_mapping  # observable_id -> [suite_names]
        info['original_suite_obs_counts'] = {
            name: suite.n_observables for name, suite in info['suites'].items()
        }
        
        # Count overlapping observables
        total_original = sum(info['original_suite_obs_counts'].values())
        merged_count = len(merged_set.observables)
        info['overlap_count'] = total_original - merged_count
    else:
        info['merged'] = False

# Display what was built
print(f"\nBuilt {len(circuits)} circuits:")
print("="*80)

# Show redundancy warnings
if posthoc_skipped:
    print("\nâš  POSTHOC SUITES SKIPPED (redundant with stress - same or fewer observables):")
    for cid, key, stress_obs, posthoc_obs in posthoc_skipped:
        print(f"  â€¢ {cid}/{key}: posthoc has {posthoc_obs} obs <= stress has {stress_obs} obs")
    print("  To run posthoc, increase N_POSTHOC_OBSERVABLES or decrease N_STRESS_OBSERVABLES")
    print()

total_benchmarks = 0
total_original_obs = 0
total_merged_obs = 0

for cid, info in circuits.items():
    print(f"\n{cid} ({info['n_qubits']} qubits):")
    
    for suite_name, suite in info['suites'].items():
        comm = suite.commutation_analysis()
        comm_str = "COMMUTING" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"
        obj_str = "[weighted]" if suite.objective == ObjectiveType.WEIGHTED_SUM else ""
        print(f"  â€¢ {suite_name:30s} {suite.n_observables:4d} obs  {comm_str:15s} {obj_str}")
        total_original_obs += suite.n_observables
    
    if info.get('merged'):
        merged_count = len(info['merged_observable_set'].observables)
        overlap = info['overlap_count']
        total_merged_obs += merged_count
        print(f"  â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€")
        print(f"  âœ“ MERGED: {merged_count} unique obs (saved {overlap} redundant)")
        total_benchmarks += 1  # Only one benchmark run for merged
    else:
        total_benchmarks += len(info['suites'])
        total_merged_obs += sum(s.n_observables for s in info['suites'].values())

print(f"\n{'='*80}")
print(f"TOTAL: {total_benchmarks} benchmark runs")
if total_original_obs > total_merged_obs:
    saved = total_original_obs - total_merged_obs
    pct = 100 * saved / total_original_obs
    print(f"EFFICIENCY: {total_merged_obs} unique obs from {total_original_obs} total ({saved} redundant, {pct:.0f}% saved)")


Built 1 circuits:

S-BELL-2 (4 qubits):
  • workload_pair_correlations        6 obs  3 groups
  • diagnostics_single_qubit          4 obs  COMMUTING
  • diagnostics_cross_pair            1 obs  COMMUTING
  • stress_random_1000               87 obs  32 groups
  ─────────────────────────────────────────────────────
  ✓ MERGED: 90 unique obs (saved 8 redundant)

TOTAL: 1 benchmark runs
EFFICIENCY: 90 unique obs from 98 total (8 redundant, 8% saved)
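
The merge step above deduplicates by `pauli_string` and records which suites each observable came from. The toy sketch below isolates that bookkeeping. `Obs` is a hypothetical stand-in for QuartumSE's `Observable`, the suite contents are made up for illustration, and `merge_by_pauli` is not a library function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obs:
    """Hypothetical minimal stand-in for an observable (illustration only)."""
    pauli_string: str
    observable_id: str

def merge_by_pauli(suites: dict) -> tuple:
    """Deduplicate observables across suites, keyed by Pauli string.

    Returns (merged list, observable_id -> sorted suite names, overlap count).
    The first occurrence of each Pauli string supplies the kept observable.
    """
    sources = {}  # pauli_string -> (first Obs seen, set of suite names)
    for suite_name, observables in suites.items():
        for obs in observables:
            entry = sources.setdefault(obs.pauli_string, (obs, set()))
            entry[1].add(suite_name)
    merged = [obs for obs, _ in sources.values()]
    mapping = {obs.observable_id: sorted(names) for obs, names in sources.values()}
    overlap = sum(len(v) for v in suites.values()) - len(merged)
    return merged, mapping, overlap

# Made-up toy suites: two Pauli strings appear in more than one suite.
toy_suites = {
    "workload": [Obs("ZZII", "w0"), Obs("XXII", "w1")],
    "diagnostics": [Obs("ZIII", "d0")],
    "stress": [Obs("ZZII", "s0"), Obs("ZIII", "s1"), Obs("XYZI", "s2")],
}
merged, mapping, overlap = merge_by_pauli(toy_suites)
print(f"{len(merged)} unique observables, {overlap} redundant")
for obs_id, names in mapping.items():
    print(f"  {obs_id}: {names}")
```

Because the first occurrence wins, a duplicate Pauli string that carries a different `coefficient` or `observable_id` in a later suite loses those values in the merged set. Keep this in mind when weighted and unweighted suites share terms.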


---

## 3. Run Benchmarks

In [7]:
%%time
# =============================================================================
# RUN ALL BENCHMARKS (with suite merging optimization)
# =============================================================================
# When merge_enabled=True, we run ONE benchmark per circuit with all observables,
# then split results by suite tags. This avoids redundant circuit sampling.

all_results = {}  # Keyed by "{circuit_id}__{suite_name}", or "{circuit_id}__merged" for merged runs

run_count = 0
total_runs = sum(1 if info.get('merged') else len(info['suites']) for info in circuits.values())

for cid, info in circuits.items():
    
    if info.get('merged'):
        # === MERGED MODE: Run once with all observables ===
        run_count += 1
        merged_set = info['merged_observable_set']
        suite_names = list(info['suites'].keys())
        
        print(f"\n{'='*80}")
        print(f"BENCHMARK {run_count}/{total_runs}: {cid} (MERGED)")
        print(f"  Circuit: {cid} ({info['n_qubits']}q)")
        print(f"  Merged suites: {', '.join(suite_names)}")
        print(f"  Total unique observables: {len(merged_set.observables)}")
        print(f"  Overlap saved: {info['overlap_count']} redundant obs")
        print(f"{'='*80}")
        
        # Build locality map from merged observables
        loc_map = {obs.observable_id: obs.locality for obs in merged_set.observables}
        
        # Run benchmark ONCE with merged observable set
        result = run_benchmark_suite(
            circuit=info['circuit'],
            observable_set=merged_set,
            circuit_id=f"{cid}__merged",
            config=CONFIG,
            locality_map=loc_map,
        )
        
        # Store the merged result with suite metadata for later splitting
        all_results[f"{cid}__merged"] = {
            'result': result,
            'circuit_id': cid,
            'suite_name': '_merged_',
            'suites': info['suites'],  # Original suites for analysis
            'suite_mapping': info['suite_mapping'],  # obs_id -> [suite_names]
            'n_qubits': info['n_qubits'],
            'merged': True,
        }
        
    else:
        # === NON-MERGED MODE: Run each suite separately (single suite or merge disabled) ===
        for suite_name, suite in info['suites'].items():
            run_count += 1
            run_key = f"{cid}__{suite_name}"
            
            print(f"\n{'='*80}")
            print(f"BENCHMARK {run_count}/{total_runs}: {cid} / {suite_name}")
            print(f"  Circuit: {cid} ({info['n_qubits']}q)")
            print(f"  Suite: {suite_name} ({suite.n_observables} observables)")
            print(f"  Type: {suite.suite_type.value}, Objective: {suite.objective.value}")
            print(f"{'='*80}")
            
            # Build locality map from suite
            loc_map = {obs.observable_id: obs.locality for obs in suite.observables}
            
            # Run benchmark
            result = run_benchmark_suite(
                circuit=info['circuit'],
                observable_set=suite.observable_set,
                circuit_id=run_key,
                config=CONFIG,
                locality_map=loc_map,
            )
            
            # Store result with suite metadata
            all_results[run_key] = {
                'result': result,
                'circuit_id': cid,
                'suite_name': suite_name,
                'suite': suite,
                'n_qubits': info['n_qubits'],
                'merged': False,
            }

print(f"\n\n{'='*80}")
print(f"ALL BENCHMARKS COMPLETE: {len(all_results)} runs")
if any(r.get('merged') for r in all_results.values()):
    print("(Suite merging enabled - results will be split by suite in analysis)")
print(f"{'='*80}")


BENCHMARK 1/1: S-BELL-2 (MERGED)
  Circuit: S-BELL-2 (4q)
  Merged suites: workload_pair_correlations, diagnostics_single_qubit, diagnostics_cross_pair, stress_random_1000
  Total unique observables: 90
  Overlap saved: 8 redundant obs
BENCHMARK SUITE: ANALYSIS
Run ID: S-BELL-2__merged_20260227_143006_39bb9b0a
Output: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a
Mode: analysis

Step 1: Running base benchmark...
  Completed: 2700 rows

Step 2: Running all 8 tasks...
  Completed: 12 task evaluations

Step 3: Running comprehensive analysis...
  Comprehensive analysis complete

Step 4: Generating reports...
  Basic report: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a\basic_report.md
  Complete report: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a\complete_report.md
  Analysis report: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a\analysis_report.md
  Analysis JSON: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a\analysis.json

BENCHMARK COMPLETE
Output directory: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a
Reports generated: ['basic', 'complete', 'analysis', 'analysis_json', 'config', 'manifest']



ALL BENCHMARKS COMPLETE: 1 runs
(Suite merging enabled - results will be split by suite in analysis)
CPU times: total: 1min
Wall time: 1min 3s


In [8]:
%%time
# =============================================================================
# NOISE SWEEP (Task 7: Noise Sensitivity Analysis)
# =============================================================================
# Run the same benchmark with different noise profiles to analyze sensitivity

noise_results = {}  # Store results for each noise profile

if RUN_NOISE_SWEEP:
    print(f"Running noise sweep with profiles: {NOISE_PROFILES}")
    print()
    
    for noise_profile in NOISE_PROFILES:
        print()
        print("=" * 80)
        print(f"NOISE PROFILE: {noise_profile}")
        print("=" * 80)
        
        # Create config with noise profile (inherits timeout + hw profile)
        noise_config = BenchmarkSuiteConfig(
            mode=BenchmarkMode.ANALYSIS,
            n_shots_grid=CONFIG.n_shots_grid,
            n_replicates=CONFIG.n_replicates,
            seed=CONFIG.seed,
            epsilon=CONFIG.epsilon,
            delta=CONFIG.delta,
            shadows_protocol_id=CONFIG.shadows_protocol_id,
            baseline_protocol_id=CONFIG.baseline_protocol_id,
            output_base_dir=f"benchmark_results_noise/{noise_profile}",
            noise_profile=noise_profile if noise_profile != 'ideal' else None,
            timeout_per_protocol_s=TIMEOUT_PER_PROTOCOL_S,
            hw_timing_profile=HW_TIMING_PROFILE,
        )
        
        # Run for each circuit (only merged circuits are swept; non-merged ones are skipped)
        for cid, info in circuits.items():
            if info.get('merged'):
                merged_set = info['merged_observable_set']
                loc_map = {obs.observable_id: obs.locality for obs in merged_set.observables}
                
                print()
                print(f"  Running {cid} with {noise_profile}...")
                
                result = run_benchmark_suite(
                    circuit=info['circuit'],
                    observable_set=merged_set,
                    circuit_id=f"{cid}__{noise_profile}",
                    config=noise_config,
                    locality_map=loc_map,
                )
                
                noise_results[f"{cid}__{noise_profile}"] = {
                    'result': result,
                    'circuit_id': cid,
                    'noise_profile': noise_profile,
                    'n_qubits': info['n_qubits'],
                }
    
    print()
    print()
    print("=" * 80)
    print(f"NOISE SWEEP COMPLETE: {len(noise_results)} runs")
    print("=" * 80)
else:
    print("Noise sweep disabled (RUN_NOISE_SWEEP = False)")


Running noise sweep with profiles: ['ideal', 'readout_1e-2', 'depol_low']


NOISE PROFILE: ideal

  Running S-BELL-2 with ideal...
BENCHMARK SUITE: ANALYSIS
Run ID: S-BELL-2__ideal_20260227_143109_e731fe2c
Output: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c
Mode: analysis

Step 1: Running base benchmark...


  diff_b_a = subtract(b, a)


  Completed: 2700 rows

Step 2: Running all 8 tasks...


  x = asanyarray(arr - arrmean)
  diff_b_a = subtract(b, a)


  Completed: 12 task evaluations

Step 3: Running comprehensive analysis...


  X -= avg[:, None]
  b = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
  null_diffs[i] = statistic(perm_a) - statistic(perm_b)


  Comprehensive analysis complete

Step 4: Generating reports...
  Basic report: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c\basic_report.md
  Complete report: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c\complete_report.md
  Analysis report: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c\analysis_report.md
  Analysis JSON: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c\analysis.json

BENCHMARK COMPLETE
Output directory: benchmark_results_noise\ideal\S-BELL-2__ideal_20260227_143109_e731fe2c
Reports generated: ['basic', 'complete', 'analysis', 'analysis_json', 'config', 'manifest']


NOISE PROFILE: readout_1e-2

  Running S-BELL-2 with readout_1e-2...
BENCHMARK SUITE: ANALYSIS
Run ID: S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d
Output: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d
Mode: analysis

Step 1: Running base benchmark (noise: readout_1e

  diff_b_a = subtract(b, a)


  Completed: 2700 rows

Step 2: Running all 8 tasks...
  Completed: 12 task evaluations

Step 3: Running comprehensive analysis...


  x = asanyarray(arr - arrmean)
  diff_b_a = subtract(b, a)
  X -= avg[:, None]


  b = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
  null_diffs[i] = statistic(perm_a) - statistic(perm_b)


  Comprehensive analysis complete

Step 4: Generating reports...
  Basic report: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d\basic_report.md
  Complete report: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d\complete_report.md
  Analysis report: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d\analysis_report.md
  Analysis JSON: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d\analysis.json

BENCHMARK COMPLETE
Output directory: benchmark_results_noise\readout_1e-2\S-BELL-2__readout_1e-2_20260227_143225_9c7c9b9d
Reports generated: ['basic', 'complete', 'analysis', 'analysis_json', 'config', 'manifest']


NOISE PROFILE: depol_low

  Running S-BELL-2 with depol_low...
BENCHMARK SUITE: ANALYSIS
Run ID: S-BELL-2__depol_low_20260227_143315_10593e6a
Output: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a
Mode: anal

  diff_b_a = subtract(b, a)


  Completed: 2700 rows

Step 2: Running all 8 tasks...
  Completed: 12 task evaluations

Step 3: Running comprehensive analysis...


  x = asanyarray(arr - arrmean)
  diff_b_a = subtract(b, a)
  X -= avg[:, None]
  b = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)


  null_diffs[i] = statistic(perm_a) - statistic(perm_b)


  Comprehensive analysis complete

Step 4: Generating reports...
  Basic report: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a\basic_report.md
  Complete report: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a\complete_report.md
  Analysis report: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a\analysis_report.md
  Analysis JSON: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a\analysis.json

BENCHMARK COMPLETE
Output directory: benchmark_results_noise\depol_low\S-BELL-2__depol_low_20260227_143315_10593e6a
Reports generated: ['basic', 'complete', 'analysis', 'analysis_json', 'config', 'manifest']



NOISE SWEEP COMPLETE: 3 runs
CPU times: total: 2min 48s
Wall time: 2min 49s
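
Once the sweep finishes, `noise_results` holds one entry per (circuit, noise profile) pair. A minimal sketch of a sensitivity summary follows. It works on made-up SE values rather than real `result` objects, and `sensitivity_table` is a hypothetical helper, not part of quartumse. It expresses each profile's mean SE relative to the `ideal` run.

```python
from statistics import mean

# Hypothetical per-profile standard errors for one protocol at a fixed shot budget.
# Real values would come from the long-form rows of each run in `noise_results`.
se_by_profile = {
    "ideal": [0.010, 0.012, 0.011],
    "readout_1e-2": [0.011, 0.013, 0.012],
    "depol_low": [0.015, 0.018, 0.0165],
}

def sensitivity_table(se_by_profile, reference="ideal"):
    """Return mean SE per profile divided by the reference profile's mean SE."""
    ref = mean(se_by_profile[reference])
    return {name: mean(ses) / ref for name, ses in se_by_profile.items()}

table = sensitivity_table(se_by_profile)
for name, ratio in table.items():
    print(f"{name:<14} mean SE ratio vs ideal: {ratio:.2f}x")
```

A ratio close to 1 would suggest the protocol is insensitive to that noise profile at the chosen shot budget; computing the table per protocol allows a side-by-side comparison.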


---

## 4. Complete Results Analysis

This section displays every analysis produced by the enhanced benchmarking system, starting with a per-suite summary of the 8 Measurements Bible tasks.
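
Tasks 1 and 2 both report N*, the smallest shot budget on the grid at which a standard-error criterion meets the target epsilon: Task 1 uses the maximum SE over observables, Task 2 the mean. A minimal sketch with illustrative numbers follows; the `se_by_n` dictionary and the `find_n_star` helper are hypothetical and are not quartumse APIs.

```python
from statistics import mean

def find_n_star(se_by_n, eps, reduce=max):
    """Smallest N whose reduced SE (max or mean over observables) is <= eps, else None."""
    for n in sorted(se_by_n):
        ses = [se for se in se_by_n[n] if se is not None]
        if ses and reduce(ses) <= eps:
            return n
    return None

# Illustrative SEs for three observables at each shot budget (not measured data).
se_by_n = {
    100: [0.08, 0.10, 0.12],
    1000: [0.02, 0.03, 0.06],
    10000: [0.008, 0.010, 0.012],
}

worst_case = find_n_star(se_by_n, eps=0.05, reduce=max)   # Task 1 criterion
average = find_n_star(se_by_n, eps=0.05, reduce=mean)     # Task 2 criterion
print(f"worst-case N* = {worst_case}, average N* = {average}")
```

Because the maximum SE is never below the mean SE, the worst-case N* can never be smaller than the average N* on the same grid.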

In [9]:
# =============================================================================
# TASK SUMMARY FOR EACH (Circuit, Suite) PAIR (All 8 Measurements Bible Tasks)
# =============================================================================

def filter_results_by_suite(long_form_results, suite_mapping, target_suite):
    """Filter long-form results to only include observables belonging to a specific suite."""
    # Get observable IDs that belong to this suite
    suite_obs_ids = {
        obs_id for obs_id, suites in suite_mapping.items() 
        if target_suite in suites
    }
    return [r for r in long_form_results if r.observable_id in suite_obs_ids]

def compute_full_task_summary(long_form, truth, config, n_observables):
    """Compute all 8 task answers for a given set of results."""
    max_n = max(config.n_shots_grid)
    eps = config.epsilon
    
    # Group by protocol and N
    by_pn = defaultdict(lambda: defaultdict(list))
    for row in long_form:
        by_pn[row.protocol_id][row.N_total].append(row)
    protocols = sorted(by_pn.keys())
    
    tasks = {}
    
    # Task 1: Worst-case N*
    tasks['1'] = {}
    for p in protocols:
        n_star = None
        for n in sorted(by_pn[p].keys()):
            ses = [r.se for r in by_pn[p][n] if r.se is not None]
            if ses:
                max_se = max(ses)
                if max_se <= eps:
                    n_star = n
                    break
        tasks['1'][p] = f"N*={n_star}" if n_star else f"N*>{max_n}"
    
    # Task 2: Average N*
    tasks['2'] = {}
    for p in protocols:
        n_star = None
        for n in sorted(by_pn[p].keys()):
            ses = [r.se for r in by_pn[p][n] if r.se is not None]
            if ses:
                mean_se = np.mean(ses)
                if mean_se <= eps:
                    n_star = n
                    break
        tasks['2'][p] = f"N*={n_star}" if n_star else f"N*>{max_n}"
    
    # Task 3: SE distribution at max N
    tasks['3'] = {}
    for p in protocols:
        ses = [r.se for r in by_pn[p][max_n] if r.se is not None]
        if ses:
            tasks['3'][p] = {'mean': np.mean(ses), 'median': np.median(ses), 'max': np.max(ses)}
        else:
            tasks['3'][p] = {'mean': float('nan'), 'median': float('nan'), 'max': float('nan')}
    
    # Task 4: Dominance
    obs_best = {}
    for p in protocols:
        for r in by_pn[p][max_n]:
            if r.observable_id not in obs_best or (r.se is not None and r.se < obs_best[r.observable_id][1]):
                obs_best[r.observable_id] = (p, r.se if r.se is not None else float('inf'))
    wins = defaultdict(int)
    for oid, (p, _) in obs_best.items():
        wins[p] += 1
    total = len(obs_best) if obs_best else 1
    tasks['4'] = {p: f"{wins[p]}/{total} ({100*wins[p]/total:.0f}%)" for p in protocols}
    tasks['4']['winner'] = max(wins, key=wins.get) if wins else "N/A"
    
    # Task 5: Pilot selection (placeholder - requires analysis object)
    tasks['5'] = "N/A"
    
    # Task 6: Bias-variance
    tasks['6'] = {}
    if truth:
        for p in protocols:
            by_obs = defaultdict(list)
            for r in by_pn[p][max_n]:
                if r.observable_id in truth:
                    by_obs[r.observable_id].append(r.estimate)
            biases_sq, vars_ = [], []
            for oid, ests in by_obs.items():
                if ests:
                    biases_sq.append((np.mean(ests) - truth[oid])**2)
                    vars_.append(np.var(ests))
            if biases_sq:
                tasks['6'][p] = {'bias2': np.mean(biases_sq), 'var': np.mean(vars_),
                                 'mse': np.mean(biases_sq) + np.mean(vars_)}
    
    # Task 7: Noise sensitivity (placeholder)
    tasks['7'] = "Requires noise sweep" if not RUN_NOISE_SWEEP else "See noise analysis"
    
    # Task 8: Adaptive efficiency (placeholder)
    tasks['8'] = "See Task 5 pilot analysis"
    
    return tasks, protocols

def display_task_summary(run_key, tasks, protocols, suite_info):
    """Display formatted task summary."""
    col_w = 24
    hdr = f"{'Task':<6} {'Question':<40}"
    for p in protocols:
        short = p.replace('classical_shadows_v0', 'shadows').replace('direct_', '')
        hdr += f" {short:>{col_w}}"
    print(hdr)
    print("-" * len(hdr))
    
    # Task 1
    row = f"{'1':<6} {'Worst-case N* (max SE <= eps)?':<40}"
    for p in protocols: row += f" {tasks['1'][p]:>{col_w}}"
    print(row)
    
    # Task 2
    row = f"{'2':<6} {'Average N* (mean SE <= eps)?':<40}"
    for p in protocols: row += f" {tasks['2'][p]:>{col_w}}"
    print(row)
    
    # Task 3
    print(f"{'3':<6} {'SE distribution at max N?':<40}")
    for m in ['mean', 'median', 'max']:
        row = f"{'':.<6} {'  ' + m:<40}"
        for p in protocols: row += f" {tasks['3'][p][m]:>{col_w}.4f}"
        print(row)
    
    # Task 4
    row = f"{'4':<6} {'Dominance (wins)?':<40}"
    for p in protocols: row += f" {tasks['4'][p]:>{col_w}}"
    print(row)
    print(f"{'':.<6} {'  WINNER:':<40} {tasks['4']['winner']}")
    
    # Task 5
    print(f"{'5':<6} {'Optimal pilot fraction?':<40} {tasks['5']}")
    
    # Task 6
    if tasks['6']:
        print(f"{'6':<6} {'Bias-variance decomposition?':<40}")
        for m in ['bias2', 'var', 'mse']:
            row = f"{'':.<6} {'  ' + m:<40}"
            for p in protocols:
                if p in tasks['6']:
                    row += f" {tasks['6'][p][m]:>{col_w}.6f}"
                else:
                    row += f" {'N/A':>{col_w}}"
            print(row)
    else:
        print(f"{'6':<6} {'Bias-variance?':<40} (requires ground truth)")
    
    # Task 7 & 8
    print(f"{'7':<6} {'Noise sensitivity?':<40} {tasks['7']}")
    print(f"{'8':<6} {'Adaptive efficiency?':<40} {tasks['8']}")
    print("-" * len(hdr))

# Process results (handling both merged and non-merged)
for run_key, run_data in all_results.items():
    bench_result = run_data['result']
    cid = run_data['circuit_id']
    n_qubits = run_data['n_qubits']
    truth = bench_result.ground_truth.truth_values if bench_result.ground_truth else {}
    
    if run_data.get('merged'):
        # === MERGED RESULTS: Split by suite and display each ===
        suite_mapping = run_data['suite_mapping']
        suites = run_data['suites']
        
        print(f"\n{'='*100}")
        print(f"{cid} (MERGED RUN - splitting results by suite)")
        print(f"{'='*100}")
        
        for suite_name, suite in suites.items():
            # Filter results for this suite
            suite_results = filter_results_by_suite(
                bench_result.long_form_results, 
                suite_mapping, 
                suite_name
            )
            
            if not suite_results:
                print(f"\n--- {suite_name}: No results (0 obs) ---")
                continue
            
            # Get suite metadata
            comm = suite.commutation_analysis()
            comm_str = "FULLY COMMUTING" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"
            obj_str = f"[{suite.objective.value}]" if suite.objective != ObjectiveType.PER_OBSERVABLE else ""
            
            print(f"\n{'─'*100}")
            print(f"{cid}__{suite_name}")
            print(f"  {n_qubits}q, {suite.n_observables} obs, {suite.suite_type.value} {obj_str}")
            print(f"  Commutation: {comm_str}")
            print(f"{'─'*100}")
            
            # Filter ground truth for this suite's observables
            suite_obs_ids = {r.observable_id for r in suite_results}
            suite_truth = {k: v for k, v in truth.items() if k in suite_obs_ids}
            
            # Compute and display task summary
            tasks, protocols = compute_full_task_summary(
                suite_results, suite_truth, CONFIG, suite.n_observables
            )
            display_task_summary(f"{cid}__{suite_name}", tasks, protocols, suite)
            
    else:
        # === NON-MERGED: Display as before ===
        suite = run_data['suite']
        
        comm = suite.commutation_analysis()
        comm_str = "FULLY COMMUTING" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"
        obj_str = f"[{suite.objective.value}]" if suite.objective != ObjectiveType.PER_OBSERVABLE else ""
        
        print(f"\n{'='*100}")
        print(f"{run_key}")
        print(f"  {n_qubits}q, {suite.n_observables} obs, {suite.suite_type.value} {obj_str}")
        print(f"  Commutation: {comm_str}")
        print(f"{'='*100}")
        
        tasks, protocols = compute_full_task_summary(
            bench_result.long_form_results, truth, CONFIG, suite.n_observables
        )
        display_task_summary(run_key, tasks, protocols, suite)


S-BELL-2 (MERGED RUN - splitting results by suite)

────────────────────────────────────────────────────────────────────────────────────────────────────
S-BELL-2__workload_pair_correlations
  4q, 6 obs, workload 
  Commutation: 3 groups
────────────────────────────────────────────────────────────────────────────────────────────────────
Task   Question                                                  shadows                  grouped                optimized
--------------------------------------------------------------------------------------------------------------------------
1      Worst-ca

In [10]:
# =============================================================================
# ENHANCED ANALYSIS (Bootstrap, K-S Tests, Crossover, Locality)
# =============================================================================

def display_enhanced_analysis(bench_result, run_key, suite):
    """Display enhanced statistical analysis from result.analysis."""
    if not bench_result.analysis:
        print(f"  No enhanced analysis available")
        return

    analysis = bench_result.analysis

    # N* Interpolation
    if hasattr(analysis, 'n_star_interpolation') and analysis.n_star_interpolation:
        print(f"\n  N* INTERPOLATION (power-law fit):")
        for protocol, data in analysis.n_star_interpolation.items():
            if hasattr(data, 'n_star'):
                print(f"    {protocol}: N* = {data.n_star:.0f}")

    # Statistical Tests
    if hasattr(analysis, 'statistical_tests') and analysis.statistical_tests:
        print(f"\n  STATISTICAL TESTS:")
        st = analysis.statistical_tests
        if hasattr(st, 'ks_statistic'):
            print(f"    K-S statistic: {st.ks_statistic:.4f}")
            print(f"    K-S p-value: {st.ks_pvalue:.4f}")
            sig = "YES" if st.ks_pvalue < 0.05 else "NO"
            print(f"    Distributions significantly different: {sig}")
        if hasattr(st, 'ssf_estimate'):
            print(f"    SSF estimate: {st.ssf_estimate:.2f}x")
        if hasattr(st, 'ssf_ci_low') and hasattr(st, 'ssf_ci_high'):
            print(f"    SSF 95% CI: [{st.ssf_ci_low:.2f}, {st.ssf_ci_high:.2f}]")

    # Crossover Analysis
    if hasattr(analysis, 'crossover_analysis') and analysis.crossover_analysis:
        print(f"\n  CROSSOVER ANALYSIS:")
        ca = analysis.crossover_analysis
        if hasattr(ca, 'crossover_n') and ca.crossover_n:
            print(f"    Crossover N: {ca.crossover_n:.0f}")
        if hasattr(ca, 'shadows_wins_above'):
            print(f"    Shadows wins above crossover: {ca.shadows_wins_above}")

    # Locality Breakdown
    if hasattr(analysis, 'locality_analysis') and analysis.locality_analysis:
        print(f"\n  LOCALITY BREAKDOWN:")
        la = analysis.locality_analysis
        if hasattr(la, 'by_locality'):
            for k, data in sorted(la.by_locality.items()):
                if hasattr(data, 'shadows_mean_se') and hasattr(data, 'baseline_mean_se'):
                    ratio = data.shadows_mean_se / data.baseline_mean_se if data.baseline_mean_se > 0 else float('inf')
                    winner = "shadows" if ratio < 1 else "baseline"
                    print(f"    K={k}: ratio={ratio:.2f}x ({winner})")

    # Pilot Analysis
    if hasattr(analysis, 'pilot_analysis') and analysis.pilot_analysis:
        print(f"\n  PILOT ANALYSIS:")
        pa = analysis.pilot_analysis
        if pa.optimal_fraction is not None:
            print(f"    Optimal fraction: {pa.optimal_fraction*100:.0f}%")
        else:
            print(f"    Optimal fraction: N/A")
        if hasattr(pa, 'results') and pa.results:
            print(f"    Fractions tested: {list(pa.results.keys())}")

print("\n" + "="*100)
print("ENHANCED STATISTICAL ANALYSIS")
print("="*100)

for run_key, run_data in all_results.items():
    bench_result = run_data['result']
    cid = run_data['circuit_id']

    if run_data.get('merged'):
        # === MERGED: Show analysis for each suite ===
        suites = run_data['suites']
        print(f"\n{'='*80}")
        print(f"{cid} (MERGED RUN)")
        print(f"{'='*80}")

        for suite_name, suite in suites.items():
            comm = suite.commutation_analysis()
            comm_str = "commuting" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"
            print(f"\n--- {cid}__{suite_name} ({suite.n_observables} obs, {comm_str}) ---")

        # Display analysis (applies to merged set)
        display_enhanced_analysis(bench_result, f"{cid}__merged", list(suites.values())[0])
    else:
        # === NON-MERGED: Single suite ===
        suite = run_data['suite']
        comm = suite.commutation_analysis()
        comm_str = "commuting" if comm['fully_commuting'] else f"{comm['n_commuting_groups']} groups"

        print(f"\n--- {run_key} ({suite.n_observables} obs, {comm_str}) ---")
        display_enhanced_analysis(bench_result, run_key, suite)


ENHANCED STATISTICAL ANALYSIS

S-BELL-2 (MERGED RUN)

--- S-BELL-2__workload_pair_correlations (6 obs, 3 groups) ---

--- S-BELL-2__diagnostics_single_qubit (4 obs, commuting) ---

--- S-BELL-2__diagnostics_cross_pair (1 obs, commuting) ---

--- S-BELL-2__stress_random_1000 (87 obs, 32 groups) ---

  CROSSOVER ANALYSIS:

  LOCALITY BREAKDOWN:

  PILOT ANALYSIS:
    Optimal fraction: N/A
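
The statistical tests reported above compare the SE distributions of two protocols. A minimal sketch of two such quantities follows: the two-sample K-S statistic and a percentile bootstrap interval for a ratio of mean SEs. The helpers `ks_statistic` and `bootstrap_ratio_ci` are hypothetical, the numbers are illustrative, and the SE ratio is only a stand-in; how quartumse defines its SSF estimate is not shown in this notebook.

```python
import random
from bisect import bisect_right
from statistics import mean

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def bootstrap_ratio_ci(num, den, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(num) / mean(den), resampling each sample independently."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        num_star = rng.choices(num, k=len(num))
        den_star = rng.choices(den, k=len(den))
        ratios.append(mean(num_star) / mean(den_star))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative per-observable SEs (not measured data).
baseline_se = [0.020, 0.024, 0.022, 0.026, 0.021, 0.025]
shadows_se = [0.012, 0.015, 0.011, 0.014, 0.013, 0.016]

d = ks_statistic(baseline_se, shadows_se)
lo, hi = bootstrap_ratio_ci(baseline_se, shadows_se)
print(f"K-S statistic: {d:.2f}")
print(f"SE ratio (baseline / shadows) 95% CI: [{lo:.2f}, {hi:.2f}]")
```

For a p-value alongside the statistic, `scipy.stats.ks_2samp` (SciPy is already imported in the setup cell) returns both.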


---

## 4b. Timing Breakdown and Timeout Analysis

This section shows where wall-clock time is spent (pre-compute, Aer simulation, post-processing)
and reports the estimated quantum hardware time derived from circuit properties. It also lists any protocol runs that hit the timeout.
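
As a rough illustration of how a hardware-time estimate can be derived from circuit properties, the sketch below charges each shot a reset, the circuit's gate layers, and a readout, plus a fixed overhead per distinct measurement setting. The `TimingProfile` dataclass, its field names, and its numbers are placeholders chosen for illustration; they are not the fields or values of quartumse's `HardwareTimingProfile` or `IBM_HERON`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimingProfile:
    """Hypothetical per-operation durations in seconds (placeholder values)."""
    reset_s: float = 250e-6
    layer_s: float = 100e-9           # one layer of gates
    readout_s: float = 1.5e-6
    setting_overhead_s: float = 5e-3  # cost of loading a new circuit/setting

def estimate_hw_time(profile, depth, n_shots, n_settings):
    """Estimated time: per-shot cost times shots, plus per-setting overhead."""
    per_shot = profile.reset_s + depth * profile.layer_s + profile.readout_s
    return n_shots * per_shot + n_settings * profile.setting_overhead_s

profile = TimingProfile()
# Same shot budget, different number of distinct measurement settings.
few_settings = estimate_hw_time(profile, depth=3, n_shots=10_000, n_settings=3)
many_settings = estimate_hw_time(profile, depth=4, n_shots=10_000, n_settings=1_000)
print(f"3 settings:    {few_settings:.3f} s")
print(f"1000 settings: {many_settings:.3f} s")
```

Under these assumed numbers the per-setting overhead can dominate when a protocol switches measurement settings often, which is why a shot count alone may not reflect hardware cost.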

In [11]:
# =============================================================================
# TIMING BREAKDOWN AND TIMEOUT ANALYSIS
# =============================================================================
# New columns in LongFormRow:
#   time_total_s        - Total wall-clock time
#   time_pre_compute_s  - initialize() + next_plan() time
#   time_aer_simulate_s - Just backend.run() / simulation time
#   time_post_process_s - update() + finalize() time
#   est_quantum_hw_s    - Estimated real hardware time (from HardwareTimingProfile)
#   timed_out           - Whether the protocol run timed out
#   n_shots_completed   - Actual shots completed if timed out

print("=" * 100)
print("TIMING BREAKDOWN ANALYSIS")
print("=" * 100)

for run_key, run_data in all_results.items():
    bench_result = run_data['result']
    cid = run_data['circuit_id']
    long_form = bench_result.long_form_results

    if not long_form:
        continue

    print(f"\n{'='*80}")
    print(f"{run_key}")
    print(f"{'='*80}")

    # Check for timeouts
    timed_out_rows = [r for r in long_form if r.timed_out]
    if timed_out_rows:
        # Count unique (protocol, N, replicate) combinations that timed out
        timed_out_combos = {(r.protocol_id, r.N_total, r.replicate_id) for r in timed_out_rows}
        print(f"\n  TIMEOUTS: {len(timed_out_combos)} protocol runs timed out")
        for protocol_id, n_total, rep in sorted(timed_out_combos):
            # Find the n_shots_completed for this combo
            combo_rows = [r for r in timed_out_rows
                          if r.protocol_id == protocol_id and r.N_total == n_total and r.replicate_id == rep]
            # Use an explicit None check so that a run with 0 completed shots prints 0, not "?"
            shots = combo_rows[0].n_shots_completed if combo_rows and combo_rows[0].n_shots_completed is not None else "?"
            print(f"    {protocol_id} @ N={n_total}, rep={rep}: completed {shots}/{n_total} shots")
    else:
        print(f"\n  No timeouts (all runs completed within budget)")

    # Timing breakdown by protocol and N
    print(f"\n  TIMING BREAKDOWN (mean across replicates):")
    print(f"  {'Protocol':<25} {'N':>8} {'Total(s)':>10} {'PreComp':>10} {'AerSim':>10} {'PostProc':>10} {'HW Est(s)':>10}")
    print(f"  {'-'*93}")

    by_pn = defaultdict(list)
    for r in long_form:
        by_pn[(r.protocol_id, r.N_total)].append(r)

    for (protocol_id, n_total), rows in sorted(by_pn.items(), key=lambda x: (x[0][0], x[0][1])):
        # Take one row per replicate (timing is per-run, same for all observables in a run)
        seen_reps = set()
        unique_rows = []
        for r in rows:
            if r.replicate_id not in seen_reps:
                seen_reps.add(r.replicate_id)
                unique_rows.append(r)

        def safe_mean(vals):
            clean = [v for v in vals if v is not None]
            return np.mean(clean) if clean else None

        total = safe_mean([r.time_total_s for r in unique_rows])
        pre = safe_mean([r.time_pre_compute_s for r in unique_rows])
        aer = safe_mean([r.time_aer_simulate_s for r in unique_rows])
        post = safe_mean([r.time_post_process_s for r in unique_rows])
        hw = safe_mean([r.est_quantum_hw_s for r in unique_rows])

        def fmt(v):
            return f"{v:>10.3f}" if v is not None else f"{'N/A':>10}"

        short_p = protocol_id.replace('classical_shadows_v0', 'shadows').replace('direct_', '')
        print(f"  {short_p:<25} {n_total:>8} {fmt(total)} {fmt(pre)} {fmt(aer)} {fmt(post)} {fmt(hw)}")

    # Speedup: AER sim time vs estimated HW time
    hw_rows = [r for r in long_form if r.est_quantum_hw_s is not None and r.time_aer_simulate_s is not None]
    if hw_rows:
        print(f"\n  SIMULATOR vs HARDWARE TIME:")
        print(f"  {'Protocol':<25} {'N':>8} {'AER(s)':>10} {'HW Est(s)':>10} {'Speedup':>10}")
        print(f"  {'-'*73}")

        for (protocol_id, n_total), rows in sorted(by_pn.items(), key=lambda x: (x[0][0], x[0][1])):
            seen_reps = set()
            unique_rows = []
            for r in rows:
                if r.replicate_id not in seen_reps and r.est_quantum_hw_s is not None:
                    seen_reps.add(r.replicate_id)
                    unique_rows.append(r)

            if not unique_rows:
                continue

            aer = np.mean([r.time_aer_simulate_s for r in unique_rows if r.time_aer_simulate_s])
            hw = np.mean([r.est_quantum_hw_s for r in unique_rows if r.est_quantum_hw_s])
            speedup = aer / hw if hw > 0 else float('inf')

            short_p = protocol_id.replace('classical_shadows_v0', 'shadows').replace('direct_', '')
            print(f"  {short_p:<25} {n_total:>8} {aer:>10.3f} {hw:>10.4f} {speedup:>9.1f}x")

        print(f"\n  NOTE: Speedup > 1 means AER simulation is SLOWER than real HW would be.")
        print(f"        Real HW benefits from massive parallelism across shots.")

print(f"\n{'='*100}")
print("KEY INSIGHT: est_quantum_hw_s shows the true cost of each protocol on hardware.")
print("             Shadows may have fewer settings but deeper measurement circuits.")
print("=" * 100)


TIMING BREAKDOWN ANALYSIS

S-BELL-2__merged

  TIMEOUTS: 30 protocol runs timed out
    classical_shadows_v0 @ N=10000, rep=0: completed 200/10000 shots
    classical_shadows_v0 @ N=10000, rep=1: completed 400/10000 shots
    classical_shadows_v0 @ N=10000, rep=2: completed 600/10000 shots
    classical_shadows_v0 @ N=10000, rep=3: completed 700/10000 shots
    classical_shadows_v0 @ N=10000, rep=4: completed 500/10000 shots
    classical_shadows_v0 @ N=10000, rep=5: completed 400/10000 shots
    classical_shadows_v0 @ N=10000, rep=6: completed 400/10000 shots
    classical_shadows_v0 @ N=10000, rep=7: completed 400/10000 shots
    classical_shadows_v0 @ N=10000, rep=8: completed 300/10000 shots
    classical_shadows_v0 @ N=10000, rep=9: completed 300/10000 shots
    direct_grouped @ N=10000, rep=0: completed 303/10000 shots
    direct_grouped @ N=10000, rep=1: completed 606/10000 shots
    direct_grouped @ N=10000, rep=2: completed 606/10000 shots
    direct_grouped @ N=10000, rep=3: 

In [12]:
# =============================================================================
# OBJECTIVE-LEVEL ANALYSIS (Work Item 4: Task-Level Metrics)
# =============================================================================
# For suites with weighted objectives (QAOA cost, chemistry energy), compute
# the error in the OBJECTIVE, not individual observables.
#
# This is the actual metric practitioners care about:
#   - QAOA: C = Σ (1 - ⟨ZZ⟩) / 2   (MAX-CUT cost)
#   - Chemistry: E = Σ c_k ⟨P_k⟩   (ground state energy)

from quartumse.analysis.objective_metrics import (
    compute_objective_metrics,
    format_objective_analysis,
)

print("\n" + "="*100)
print("OBJECTIVE-LEVEL ANALYSIS (Weighted Suites Only)")
print("="*100)

# Find weighted suites (handling both merged and non-merged)
weighted_runs = []
for run_key, run_data in all_results.items():
    if run_data.get('merged'):
        # Check each suite in merged results
        for suite_name, suite in run_data['suites'].items():
            if suite.objective == ObjectiveType.WEIGHTED_SUM and suite.weights:
                weighted_runs.append((f"{run_data['circuit_id']}__{suite_name}", run_data, suite_name, suite))
    else:
        suite = run_data['suite']
        if suite.objective == ObjectiveType.WEIGHTED_SUM and suite.weights:
            weighted_runs.append((run_key, run_data, run_data['suite_name'], suite))

if not weighted_runs:
    print("\nNo weighted suites found. Enable QAOA workload or Chemistry suites to see objective metrics.")
else:
    print(f"\nFound {len(weighted_runs)} weighted suite(s):")

    for run_key, run_data, suite_name, suite in weighted_runs:
        bench_result = run_data['result']

        print(f"\n{'='*80}")
        print(f"{run_key}")
        print(f"  Suite type: {suite.suite_type.value}")
        print(f"  Weighted observables: {len(suite.weights)}")
        print(f"{'='*80}")

        # Determine objective type for computation
        obj_type = "qaoa_cost" if "qaoa" in run_key.lower() or "cost" in suite.name.lower() else "weighted_sum"

        # Filter results if merged
        if run_data.get('merged'):
            suite_mapping = run_data['suite_mapping']
            suite_obs_ids = {
                obs_id for obs_id, suites in suite_mapping.items()
                if suite_name in suites
            }
            long_form = [r for r in bench_result.long_form_results if r.observable_id in suite_obs_ids]
        else:
            long_form = bench_result.long_form_results

        # Compute objective metrics
        obj_analysis = compute_objective_metrics(
            long_form_results=long_form,
            weights=suite.weights,
            objective_type=obj_type,
            true_objective=None,  # Could add ground truth if available
            target_epsilon=CONFIG.epsilon,
            n_bootstrap=500,
            seed=CONFIG.seed,
        )

        # Display formatted results
        print(format_objective_analysis(obj_analysis))

        # Store in results for later
        if not run_data.get('merged'):
            run_data['objective_analysis'] = obj_analysis

print("\n" + "="*100)
print("KEY INSIGHT: For weighted objectives, what matters is the TOTAL error,")
print("             not individual observable errors. This may change which protocol wins!")
print("="*100)


OBJECTIVE-LEVEL ANALYSIS (Weighted Suites Only)

No weighted suites found. Enable QAOA workload or Chemistry suites to see objective metrics.

KEY INSIGHT: For weighted objectives, what matters is the TOTAL error,
             not individual observable errors. This may change which protocol wins!
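
For a weighted objective O = Σ_k w_k ⟨P_k⟩, the per-observable standard errors combine into a single objective-level error. A minimal sketch follows, assuming the per-observable estimates are statistically independent; that assumption fails when several observables are estimated from the same shots, so treat the result as a rough guide. The helper `weighted_objective` is hypothetical and is not `compute_objective_metrics`.

```python
import math

def weighted_objective(weights, estimates, std_errors):
    """Return (value, SE) of sum_k w_k * estimate_k, assuming independent estimates."""
    value = sum(weights[k] * estimates[k] for k in weights)
    variance = sum((weights[k] * std_errors[k]) ** 2 for k in weights)
    return value, math.sqrt(variance)

# MAX-CUT on a 4-node ring: C = sum over edges of (1 - <ZZ>) / 2,
# i.e. a constant 0.5 per edge plus weight -0.5 on each <Z_i Z_j>.
edges = ["Z0Z1", "Z1Z2", "Z2Z3", "Z3Z0"]
weights = {e: -0.5 for e in edges}
estimates = {e: -0.6 for e in edges}      # illustrative values, not measured data
std_errors = {e: 0.02 for e in edges}

weighted_part, se = weighted_objective(weights, estimates, std_errors)
cost = 0.5 * len(edges) + weighted_part
print(f"cost = {cost:.3f} +/- {se:.3f}")
```

The objective-level SE depends on the weights as well as the individual SEs, so a protocol with a larger worst-case observable error can still yield a smaller error on the objective.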


In [13]:
# =============================================================================
# POST-HOC QUERYING BENCHMARK (Work Item 3)
# =============================================================================
# This simulates the core advantage of classical shadows:
#   "Measure once, decide observables later"
#
# Cost accounting:
#   - Shadows: quantum cost = ONE acquisition; all new queries are FREE
#   - Direct: pay for each new basis not already measured
#
# This quantifies the "option value" of shadows.

from quartumse.analysis.posthoc_benchmark import (
    run_posthoc_benchmark_from_suite,
    format_posthoc_result,
)

print("\n" + "="*100)
print("POST-HOC QUERYING BENCHMARK")
print("="*100)

# Check if posthoc was skipped due to redundancy
if posthoc_skipped:
    print("\n⚠ POSTHOC ANALYSIS SKIPPED (redundant with stress - same or fewer observables)")
    print("  Posthoc adds value when it tests MORE observables than stress.")
    print("  To run posthoc, increase N_POSTHOC_OBSERVABLES or decrease N_STRESS_OBSERVABLES.")
    for cid, key, stress_obs, posthoc_obs in posthoc_skipped:
        print(f"\n  Skipped: {cid}/{key}")
        print(f"    Reason: posthoc ({posthoc_obs} obs) <= stress ({stress_obs} obs)")

# Find posthoc suites (checking both merged and non-merged results)
posthoc_runs = []
for run_key, run_data in all_results.items():
    if run_data.get('merged'):
        # Check if any merged suite is posthoc type
        for suite_name, suite in run_data['suites'].items():
            if suite.suite_type == SuiteType.POSTHOC:
                posthoc_runs.append((f"{run_data['circuit_id']}__{suite_name}", suite))
    else:
        suite = run_data['suite']
        if suite.suite_type == SuiteType.POSTHOC:
            posthoc_runs.append((run_key, suite))

# Also check for posthoc libraries in circuit definitions (even if not benchmarked)
posthoc_available = []
for cid, info in circuits.items():
    for suite_name, suite in info.get('suites', {}).items():
        if 'posthoc' in suite_name.lower() or suite.suite_type == SuiteType.POSTHOC:
            posthoc_available.append((cid, suite_name, suite))

if not posthoc_available and not posthoc_runs and not posthoc_skipped:
    print("\nNo post-hoc suites found.")
    print("Enable 'posthoc': True in SUITES_TO_RUN to run post-hoc benchmarks.")
    print("\nExample configuration:")
    print("  SUITES_TO_RUN = {")
    print("      'workload': True,")
    print("      'posthoc': True,  # <-- Enable this")
    print("  }")
elif posthoc_available:
    # Run post-hoc simulation on available posthoc suites
    print(f"\nFound {len(posthoc_available)} post-hoc suite(s):")

    for cid, suite_name, suite in posthoc_available:
        print(f"\n{'='*80}")
        print(f"POST-HOC SIMULATION: {cid}:{suite_name}")
        print(f"  Library size: {suite.n_observables} observables")
        print(f"{'='*80}")

        # Configure simulation
        n_rounds = 5
        obs_per_round = max(10, suite.n_observables // 10)  # ~10% per round

        # Run simulation
        result = run_posthoc_benchmark_from_suite(
            posthoc_suite=suite,
            n_rounds=n_rounds,
            observables_per_round=obs_per_round,
            shadows_shots=max(CONFIG.n_shots_grid),  # Use max shot budget
            direct_shots_per_basis=100,  # Shots per basis for direct
            seed=CONFIG.seed,
        )

        # Display results
        print(format_posthoc_result(result))
        print()

        # Cumulative cost curves
        print("\nCUMULATIVE COST CURVES:")
        print(f"{'Round':<8} {'Cum Shadows':>15} {'Cum Direct':>15} {'Cum Obs':>12} {'Savings':>10}")
        print("-" * 65)

        shadows = result.shadows_costs
        direct = result.direct_costs

        if shadows and direct:
            for i in range(result.n_rounds):
                savings = direct.cumulative_shots[i] / shadows.cumulative_shots[i] if shadows.cumulative_shots[i] > 0 else float('inf')
                print(
                    f"{i:<8} {shadows.cumulative_shots[i]:>15,} {direct.cumulative_shots[i]:>15,} "
                    f"{shadows.cumulative_observables_answered[i]:>12} {savings:>9.1f}x"
                )

print("\n" + "="*100)
print("KEY INSIGHT: Shadows' quantum cost is FIXED after acquisition.")
print("             Direct measurement cost GROWS with each new query round.")
print("             The more observables you query later, the more shadows saves.")
print("="*100)


POST-HOC QUERYING BENCHMARK

No post-hoc suites found.
Enable 'posthoc': True in SUITES_TO_RUN to run post-hoc benchmarks.

Example configuration:
  SUITES_TO_RUN = {
      'workload': True,
      'posthoc': True,  # <-- Enable this
  }

KEY INSIGHT: Shadows' quantum cost is FIXED after acquisition.
             Direct measurement cost GROWS with each new query round.
             The more observables you query later, the more shadows saves.
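
The fixed-versus-growing cost pattern described above can be illustrated with a toy model. This is only a sketch, not the QuartumSE cost accounting: the function name `cumulative_costs`, the assumption that each round requires a fixed number of new measurement bases for direct measurement, and all numbers are hypothetical.

```python
def cumulative_costs(shadows_shots, direct_shots_per_basis, new_bases_per_round):
    """Toy model: shadows pays once up front, direct pays for every new basis."""
    shadows, direct = [], []
    direct_total = 0
    for n_new in new_bases_per_round:
        # Assumption: each new basis needs its own batch of direct shots.
        direct_total += n_new * direct_shots_per_basis
        shadows.append(shadows_shots)  # acquisition cost is reused in every round
        direct.append(direct_total)
    return shadows, direct


shadows, direct = cumulative_costs(
    shadows_shots=5000,
    direct_shots_per_basis=100,
    new_bases_per_round=[10, 10, 10, 10, 10, 10],
)
for i, (s, d) in enumerate(zip(shadows, direct)):
    print(f"round {i}: shadows={s:,} direct={d:,} direct/shadows={d / s:.2f}x")
```

In this toy setting direct measurement is cheaper for the first rounds and only overtakes the fixed shadows cost once enough new bases have been requested, which is the crossover the benchmark is designed to locate.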


In [14]:
# =============================================================================
# CROSS-CIRCUIT CONSOLIDATED COMPARISON (Suite-Aware)
# =============================================================================

print("\n" + "="*100)
print("CROSS-CIRCUIT COMPARISON BY SUITE TYPE")
print("="*100)

def compute_suite_summary(long_form_results, suite_mapping, suite_name, n_qubits, suite):
    """Compute summary stats for a specific suite from merged results."""
    # Filter to this suite's observables
    suite_obs_ids = {
        obs_id for obs_id, suites in suite_mapping.items() 
        if suite_name in suites
    }
    suite_results = [r for r in long_form_results if r.observable_id in suite_obs_ids]
    
    if not suite_results:
        return None
    
    # Get max N
    max_n = max(r.N_total for r in suite_results)
    
    # Compute mean SE by protocol at max N
    by_protocol = defaultdict(list)
    for r in suite_results:
        if r.N_total == max_n and r.se is not None:
            by_protocol[r.protocol_id].append(r.se)
    
    shadows_se = np.mean(by_protocol.get('classical_shadows_v0', [float('inf')]))
    grouped_se = np.mean(by_protocol.get('direct_grouped', [float('inf')]))
    
    return {
        'shadows_se': shadows_se,
        'grouped_se': grouped_se,
        'n_qubits': n_qubits,
        'n_observables': suite.n_observables,
        'suite': suite,
    }

# Build per-suite summaries (handling both merged and non-merged)
suite_summaries = []  # List of (run_key, suite_name, summary_dict)

for run_key, run_data in all_results.items():
    bench_result = run_data['result']
    cid = run_data['circuit_id']
    n_qubits = run_data['n_qubits']
    
    if run_data.get('merged'):
        # Split merged results by suite
        suite_mapping = run_data['suite_mapping']
        suites = run_data['suites']
        
        for suite_name, suite in suites.items():
            summary = compute_suite_summary(
                bench_result.long_form_results,
                suite_mapping,
                suite_name,
                n_qubits,
                suite
            )
            if summary:
                suite_summaries.append((f"{cid}__{suite_name}", suite_name, summary))
    else:
        # Non-merged: use result summary directly
        suite = run_data['suite']
        summaries = bench_result.summary.get('protocol_summaries', {})
        
        shadows_se = summaries.get('classical_shadows_v0', {}).get('mean_se', float('inf'))
        grouped_se = summaries.get('direct_grouped', {}).get('mean_se', float('inf'))
        
        summary = {
            'shadows_se': shadows_se,
            'grouped_se': grouped_se,
            'n_qubits': n_qubits,
            'n_observables': suite.n_observables,
            'suite': suite,
        }
        suite_summaries.append((run_key, run_data['suite_name'], summary))

# Group by suite type for analysis
by_suite_type = defaultdict(list)
for run_key, suite_name, summary in suite_summaries:
    suite_type = summary['suite'].suite_type.value
    by_suite_type[suite_type].append((run_key, summary))

# Summary table for each suite type
for suite_type, runs in by_suite_type.items():
    print(f"\n{'='*80}")
    print(f"SUITE TYPE: {suite_type.upper()}")
    print(f"{'='*80}")
    
    print(f"{'Run Key':<35} {'Q':>3} {'Obs':>5} {'Comm?':>6} {'Shadows SE':>12} {'Grouped SE':>12} {'Ratio':>8} {'Winner':>10}")
    print("-" * 105)
    
    shadows_wins = 0
    total_runs = 0
    
    for run_key, summary in runs:
        suite = summary['suite']
        n_qubits = summary['n_qubits']
        
        # Check commutation
        comm = suite.commutation_analysis()
        comm_str = "YES" if comm['fully_commuting'] else "no"
        
        shadows_se = summary['shadows_se']
        grouped_se = summary['grouped_se']
        
        ratio = shadows_se / grouped_se if grouped_se > 0 else float('inf')
        winner = 'Shadows' if ratio < 1 else 'Grouped'
        
        if ratio < 1:
            shadows_wins += 1
        total_runs += 1
        
        print(f"{run_key:<35} {n_qubits:>3} {suite.n_observables:>5} {comm_str:>6} "
              f"{shadows_se:>12.4f} {grouped_se:>12.4f} {ratio:>8.2f}x {winner:>10}")
    
    print("-" * 105)
    print(f"  {suite_type.upper()}: Shadows wins {shadows_wins}/{total_runs} runs")

# Overall summary
print("\n" + "="*100)
print("OVERALL SUMMARY")
print("="*100)

total_wins_shadows = 0
total_runs = 0
commuting_shadows_wins = 0
commuting_total = 0
noncommuting_shadows_wins = 0
noncommuting_total = 0

for run_key, suite_name, summary in suite_summaries:
    suite = summary['suite']
    comm = suite.commutation_analysis()
    
    shadows_se = summary['shadows_se']
    grouped_se = summary['grouped_se']
    
    shadows_won = shadows_se < grouped_se
    
    total_runs += 1
    if shadows_won:
        total_wins_shadows += 1
    
    if comm['fully_commuting']:
        commuting_total += 1
        if shadows_won:
            commuting_shadows_wins += 1
    else:
        noncommuting_total += 1
        if shadows_won:
            noncommuting_shadows_wins += 1

print(f"\nTotal suite evaluations: {total_runs}")
if total_runs > 0:  # guard against ZeroDivisionError when no suites were evaluated
    print(f"Shadows wins overall: {total_wins_shadows}/{total_runs} ({100*total_wins_shadows/total_runs:.1f}%)")
if commuting_total > 0:
    print(f"Shadows wins on COMMUTING suites: {commuting_shadows_wins}/{commuting_total} ({100*commuting_shadows_wins/commuting_total:.1f}%)")
if noncommuting_total > 0:
    print(f"Shadows wins on NON-COMMUTING suites: {noncommuting_shadows_wins}/{noncommuting_total} ({100*noncommuting_shadows_wins/noncommuting_total:.1f}%)")

# Show merge efficiency if applicable
merged_runs = [r for r in all_results.values() if r.get('merged')]
if merged_runs:
    print("\n✓ MERGE OPTIMIZATION ACTIVE")
    for r in merged_runs:
        cid = r['circuit_id']
        n_suites = len(r['suites'])
        overlap = circuits[cid]['overlap_count']
        print(f"  {cid}: {n_suites} suites merged, {overlap} redundant obs eliminated")

print("\nKEY INSIGHT:")
print("  - For COMMUTING suites (e.g., QAOA cost), grouped measurement should dominate")
print("  - For NON-COMMUTING suites (e.g., stress), shadows may become competitive")
print("  - The crossover depends on K (number of observables) and locality distribution")


CROSS-CIRCUIT COMPARISON BY SUITE TYPE

SUITE TYPE: WORKLOAD
Run Key                               Q   Obs  Comm?   Shadows SE   Grouped SE    Ratio     Winner
---------------------------------------------------------------------------------------------------------
S-BELL-2__workload_pair_correlations   4     6     no       0.1432          inf     0.00x    Shadows
---------------------------------------------------------------------------------------------------------
  WORKLOAD: Shadows wins 1/1 runs

SUITE TYPE: DIAGNOSTIC
Run Key                               Q   Obs  Comm?   Shadows SE   Grouped SE    Ratio     Winner
---------------------------------------------------------------------------------------------------------
S-BELL-2__diagnostics_single_qubit    4     4    YES       0.0872          inf     0.00x    Shadows
S-BELL-2__diagnostics_cross_pair      4     1    YES       0.1508          inf     0.00x    Shadows
---------------------------------------------------------------
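
The `inf` entries in the Grouped SE column most likely come from the `float('inf')` fallback in `compute_suite_summary`, which applies when no `direct_grouped` rows exist at the maximum N. In that case the 0.00x ratio and the "Shadows" verdict reflect a missing baseline rather than a measured advantage. A minimal sketch of a guard that reports such rows as undecided is shown below; `compare_se` is a hypothetical helper, not part of QuartumSE.

```python
import math


def compare_se(shadows_se, grouped_se):
    """Return (ratio, winner); report 'n/a' when either SE is missing or unusable."""
    usable = (
        math.isfinite(shadows_se)
        and math.isfinite(grouped_se)
        and grouped_se > 0
    )
    if not usable:
        # A missing or non-finite baseline cannot produce a meaningful winner.
        return float("nan"), "n/a"
    ratio = shadows_se / grouped_se
    return ratio, "Shadows" if ratio < 1 else "Grouped"


print(compare_se(0.1432, float("inf")))  # missing baseline
print(compare_se(0.10, 0.20))            # both protocols present
```

Rows labelled `n/a` would then be excluded from the win counts instead of being credited to shadows.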

In [15]:
# =============================================================================
# SAVE CONSOLIDATED RESULTS (with Suite Metadata)
# =============================================================================

output_dir = Path(CONFIG.output_base_dir)
output_dir.mkdir(parents=True, exist_ok=True)

consolidated = {
    'timestamp': datetime.now().isoformat(),
    'n_runs': len(all_results),
    'config': {
        'mode': CONFIG.mode.value,
        'n_shots_grid': CONFIG.n_shots_grid,
        'n_replicates': CONFIG.n_replicates,
        'epsilon': CONFIG.epsilon,
    },
    'suites_enabled': {k: v for k, v in SUITES_TO_RUN.items() if v},
    'circuits_enabled': [k for k, v in CIRCUITS_TO_RUN.items() if v],
    'merge_enabled': merge_enabled,
    'runs': {},
}

for run_key, run_data in all_results.items():
    bench_result = run_data['result']
    cid = run_data['circuit_id']

    if run_data.get('merged'):
        # === MERGED: Store info about all suites in merge ===
        suites = run_data['suites']
        suite_metadata = {}
        for suite_name, suite in suites.items():
            comm = suite.commutation_analysis()
            suite_metadata[suite_name] = {
                'suite_type': suite.suite_type.value,
                'objective': suite.objective.value,
                'n_observables': suite.n_observables,
                'fully_commuting': comm['fully_commuting'],
                'n_commuting_groups': comm['n_commuting_groups'],
                'has_weights': suite.weights is not None,
                'description': suite.description,
            }

        consolidated['runs'][run_key] = {
            'circuit_id': cid,
            'suite_name': '_merged_',
            'n_qubits': run_data['n_qubits'],
            'merged': True,
            'merged_suites': list(suites.keys()),
            'overlap_eliminated': circuits[cid]['overlap_count'],
            'suite_metadata': suite_metadata,
            # Benchmark results
            'run_id': bench_result.run_id,
            'output_dir': str(bench_result.output_dir),
            'summary': bench_result.summary,
        }
    else:
        # === NON-MERGED: Single suite ===
        suite = run_data['suite']
        comm = suite.commutation_analysis()

        consolidated['runs'][run_key] = {
            'circuit_id': cid,
            'suite_name': run_data['suite_name'],
            'n_qubits': run_data['n_qubits'],
            'merged': False,
            # Suite metadata
            'suite_metadata': {
                run_data['suite_name']: {
                    'suite_type': suite.suite_type.value,
                    'objective': suite.objective.value,
                    'n_observables': suite.n_observables,
                    'fully_commuting': comm['fully_commuting'],
                    'n_commuting_groups': comm['n_commuting_groups'],
                    'has_weights': suite.weights is not None,
                    'description': suite.description,
                }
            },
            # Benchmark results
            'run_id': bench_result.run_id,
            'output_dir': str(bench_result.output_dir),
            'summary': bench_result.summary,
        }

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
consolidated_path = output_dir / f'consolidated_{timestamp}.json'
with open(consolidated_path, 'w') as f:
    json.dump(consolidated, f, indent=2, default=str)

print(f"Consolidated results saved: {consolidated_path}")
print(f"\nSummary:")
print(f"  Total runs: {len(all_results)}")
print(f"  Suites enabled: {list(consolidated['suites_enabled'].keys())}")
print(f"  Circuits: {consolidated['circuits_enabled']}")
print(f"  Merge optimization: {'enabled' if merge_enabled else 'disabled'}")

# Show merge info if applicable
merged_runs = [r for r in all_results.values() if r.get('merged')]
if merged_runs:
    print(f"\n  MERGED RUNS:")
    for r in merged_runs:
        cid = r['circuit_id']
        n_suites = len(r['suites'])
        overlap = circuits[cid]['overlap_count']
        print(f"    {cid}: {n_suites} suites merged, {overlap} redundant obs eliminated")

print(f"\nIndividual run directories:")
for run_key, run_data in all_results.items():
    print(f"  {run_key}: {run_data['result'].output_dir}")

Consolidated results saved: benchmark_results\consolidated_20260227_143359.json

Summary:
  Total runs: 1
  Suites enabled: ['merged', 'workload', 'stress', 'commuting', 'posthoc', 'diagnostics']
  Circuits: ['S-BELL-2']
  Merge optimization: enabled

  MERGED RUNS:
    S-BELL-2: 4 suites merged, 8 redundant obs eliminated

Individual run directories:
  S-BELL-2__merged: benchmark_results\S-BELL-2__merged_20260227_143006_39bb9b0a


---

## Summary

This notebook provides **complete benchmarking** of classical shadows vs direct measurement:

### Tasks Evaluated (Measurements Bible)

| Task | Question | Output |
|------|----------|--------|
| 1 | Worst-case N* (all obs)? | N* per protocol |
| 2 | Average N* (mean)? | N* per protocol |
| 3 | SE distribution at fixed N? | mean, median, max |
| 4 | Dominance (% wins)? | Winner + breakdown |
| 5 | Optimal pilot fraction? | % of budget |
| 6 | Bias-variance decomposition? | Bias², Var, MSE |
| 7 | Noise sensitivity? | (with sweep) |
| 8 | Adaptive efficiency? | (from pilot) |
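
As an illustration of Task 6, the sketch below decomposes the mean squared error of repeated estimates into squared bias and variance. It is a standalone example under stated assumptions, not the notebook's implementation: `bias_variance`, the synthetic estimates, and the known true value are all hypothetical.

```python
import random
from statistics import fmean, pvariance


def bias_variance(estimates, true_value):
    """Return (bias_sq, var, mse) for a list of replicate estimates."""
    mean_est = fmean(estimates)
    bias_sq = (mean_est - true_value) ** 2
    # Population variance (divide by n) so that mse == bias_sq + var exactly.
    var = pvariance(estimates, mu=mean_est)
    mse = fmean([(e - true_value) ** 2 for e in estimates])
    return bias_sq, var, mse


rng = random.Random(0)
true_value = 0.5
# Synthetic replicates with a small systematic offset plus random noise.
estimates = [true_value + 0.02 + rng.gauss(0.0, 0.1) for _ in range(200)]

bias_sq, var, mse = bias_variance(estimates, true_value)
print(f"Bias^2={bias_sq:.5f}  Var={var:.5f}  MSE={mse:.5f}")
```

Using the population variance keeps the identity MSE = Bias² + Var exact for a finite set of replicates; with the sample variance (divide by n-1) the identity holds only approximately.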

### Enhanced Analysis

- Power-law N* interpolation
- K-S distribution tests
- Bootstrap confidence intervals
- Per-observable crossover analysis
- Locality breakdown (k=1,2,3,...,n)
- Cost-normalized metrics
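
A minimal sketch of power-law N* interpolation follows. It assumes that N* denotes the smallest shot budget at which the standard error reaches a target `epsilon`, and that SE follows `c * N**b` over the measured grid; the helper names `fit_power_law` and `interpolate_n_star` and the synthetic data are hypothetical and do not reflect the QuartumSE API.

```python
import math


def fit_power_law(ns, ses):
    """Least-squares fit of log(SE) = log(c) + b * log(N); returns (c, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in ses]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    c = math.exp(y_bar - b * x_bar)
    return c, b


def interpolate_n_star(ns, ses, epsilon):
    """Solve c * N**b = epsilon for N using the fitted power law."""
    c, b = fit_power_law(ns, ses)
    return (epsilon / c) ** (1.0 / b)


# Synthetic shot grid with ideal 1/sqrt(N) scaling.
ns = [100, 400, 1600, 6400]
ses = [1.0 / math.sqrt(n) for n in ns]
n_star = interpolate_n_star(ns, ses, epsilon=0.02)
print(f"Estimated N* for epsilon=0.02: {n_star:.0f}")
```

Fitting in log-log space lets N* fall between grid points or slightly beyond the grid, but extrapolating far outside the measured range should be treated with caution.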

### Timing and Timeout Features

- **Per-protocol timeout** (`timeout_per_protocol_s`): Stops slow protocols gracefully with partial data
- **Fine-grained timing breakdown**: Pre-compute, Aer simulation, post-processing phases
- **Hardware time model** (`hw_timing_profile`): Estimates real-device execution time using gate/measurement timings (e.g., IBM Heron R2)
- **Simulator vs hardware comparison**: Shows how Aer wall time compares to estimated quantum hardware time
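
To make the hardware time model concrete, the sketch below estimates device time from per-operation durations and a shot count. It is an assumption-laden illustration, not the `HardwareTimingProfile` class imported at the top of the notebook: `ToyTimingProfile`, `estimate_hw_seconds`, and every duration are placeholders rather than published device specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTimingProfile:
    """Hypothetical per-operation durations in seconds (placeholders only)."""
    gate_1q_s: float
    gate_2q_s: float
    readout_s: float
    reset_s: float


def estimate_hw_seconds(profile, depth_1q, depth_2q, n_shots, n_circuits=1):
    """Estimate total device time as per-shot duration times shots times circuits."""
    per_shot = (
        depth_1q * profile.gate_1q_s
        + depth_2q * profile.gate_2q_s
        + profile.readout_s
        + profile.reset_s
    )
    return per_shot * n_shots * n_circuits


# Placeholder durations chosen for easy arithmetic, not taken from any device.
profile = ToyTimingProfile(gate_1q_s=50e-9, gate_2q_s=100e-9, readout_s=1e-6, reset_s=100e-6)
t = estimate_hw_seconds(profile, depth_1q=10, depth_2q=5, n_shots=5000, n_circuits=3)
print(f"Estimated hardware time: {t:.3f} s")
```

This simple model ignores queueing, compilation, and classical communication overhead, so it gives a lower bound on wall-clock time rather than a prediction.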

### Obsolete Notebooks

This notebook supersedes:
- `benchmark_shadows_vs_baselines.ipynb`
- `notebook_j_full_publication_benchmark_ghz_shadows_v0.ipynb`
- `notebook_k_locality_benchmark.ipynb`
- `notebook_l_random_bloch_benchmark.ipynb`
- `notebook_l_comprehensive_benchmark.ipynb`
- `notebook_benchmark_suite.ipynb`