# Full-Depth Benchmarking Tutorial: 4‑Qubit GHZ + Classical Shadows v0

This notebook performs a **publication‑standard benchmarking study** for the classical shadows v0 implementation
using a 4‑qubit GHZ circuit. It follows the Measurements Bible requirements for:

- reproducibility and provenance artifacts (manifest, long‑form, summary, plots),
- ground‑truth validation and uncertainty calibration,
- the complete task suite (Tasks 1–8), and
- explicit reporting and conclusions.

**References (Measurements Bible):**

- §0 Methodology‑as‑code and required artifacts
- §3 Workloads, observables, and truth policy
- §6–§7 Uncertainty and FWER calibration
- §8 Task suite (Tasks 1–8)
- §9 Experimental methodology (shots, seeds, noise profiles)
- §10 Output tables and plots
- §12 Required notebooks

See `Measurements_Bible.md` in the repo root for the normative specification.


In [1]:
# --- Setup and imports ---
import sys
from pathlib import Path

import numpy as np

sys.path.insert(0, '../src')

from qiskit import QuantumCircuit

from quartumse.benchmarking import run_publication_benchmark
from quartumse.observables import Observable, ObservableSet, generate_observable_set
from quartumse.protocols import DirectNaiveProtocol, DirectGroupedProtocol, DirectOptimizedProtocol
from quartumse.protocols.shadows import ShadowsV0Protocol
from quartumse.tasks import (
    TaskConfig, TaskType, CriterionType,
    AverageTargetTask, DominanceTask, PilotSelectionTask,
    NoiseSensitivityTask, AdaptiveEfficiencyTask,
    SweepConfig, SweepOrchestrator,
)
from quartumse.io import ParquetWriter


## 1. Configuration (publication defaults)

We use a multi‑shot grid, multiple replicates, explicit seeds, and a dedicated output directory.
These align with the reproducibility and shot scheduling rules in the Measurements Bible (§0, §9).
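
One common way to turn a single base seed into independent, reproducible per-replicate random streams is NumPy's `SeedSequence.spawn`. The sketch below is illustrative only: how `quartumse` derives replicate seeds from `SEED` is not shown in this notebook, and `replicate_rngs` is a hypothetical helper.

```python
import numpy as np

def replicate_rngs(base_seed: int, n_replicates: int) -> list:
    """Spawn one independent, reproducible generator per replicate (hypothetical helper)."""
    children = np.random.SeedSequence(base_seed).spawn(n_replicates)
    return [np.random.default_rng(child) for child in children]

rngs_a = replicate_rngs(42, 20)
rngs_b = replicate_rngs(42, 20)

# The same base seed reproduces the same stream for every replicate index.
same = all(
    int(a.integers(0, 2**32)) == int(b.integers(0, 2**32))
    for a, b in zip(rngs_a, rngs_b)
)
print(len(rngs_a), same)
```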


In [2]:
# --- Configuration ---
SEED = 42
N_QUBITS = 4
N_OBSERVABLES = 24
N_SHOTS_GRID = [100, 500, 1000, 5000]
N_REPLICATES = 20
EPSILON = 0.01
DELTA = 0.05

OUTPUT_DIR = Path('results/ghz_shadows_v0_publication')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
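
As a quick sanity check on the grid, the standard error of a single ±1-valued Pauli estimate from N direct shots is sqrt((1 - μ²)/N), where μ is the true expectation value. The sketch below, with a hypothetical helper `shots_for_target_se`, computes the shots needed for that standard error to reach `EPSILON`. It is a back-of-envelope figure for one directly measured observable, not the criterion the task suite applies.

```python
import math
from fractions import Fraction

def shots_for_target_se(epsilon: float, mu: float = 0.0) -> int:
    """Smallest N with sqrt((1 - mu**2) / N) <= epsilon for a +/-1-valued observable."""
    eps = Fraction(str(epsilon))           # exact arithmetic avoids round-off in ceil
    variance = 1 - Fraction(str(mu)) ** 2  # variance of a +/-1 outcome with mean mu
    return math.ceil(variance / eps ** 2)

print(shots_for_target_se(0.01))       # worst case, mu = 0 -> 10000
print(shots_for_target_se(0.01, 0.5))  # -> 7500
```

For ε = 0.01 the worst case is 10,000 shots on a single observable, twice the largest value in `N_SHOTS_GRID`.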


## 2. GHZ circuit (4 qubits)

The GHZ circuit is a canonical structured workload (Measurements Bible §3).


In [3]:
# --- GHZ circuit ---
def build_ghz(n_qubits: int) -> QuantumCircuit:
    qc = QuantumCircuit(n_qubits)
    qc.h(0)
    for i in range(1, n_qubits):
        qc.cx(i - 1, i)
    return qc

ghz_circuit = build_ghz(N_QUBITS)
ghz_circuit.draw('text')


## 3. Observable set

We combine GHZ‑relevant stabilizers with a seeded random Pauli set
to satisfy the reproducible observable‑generation requirement (§3.3).


In [4]:
# --- GHZ‑relevant stabilizers ---
ghz_stabilizers = [
    Observable('ZZZZ', coefficient=1.0),
    Observable('XXXX', coefficient=1.0),
    Observable('YYYY', coefficient=1.0),
]

# --- Seeded random Pauli set (reproducible) ---
random_set = generate_observable_set(
    generator_id='random_pauli',
    n_qubits=N_QUBITS,
    n_observables=N_OBSERVABLES,
    seed=SEED,
    max_weight=3,
)

observables = ghz_stabilizers + list(random_set.observables)
obs_set = ObservableSet(
    observables=observables,
    observable_set_id='ghz_mixed_set',
    generator_id='random_pauli+ghz_stabilizers',
    generator_seed=SEED,
    generator_params={'n_observables': N_OBSERVABLES, 'max_weight': 3},
)

len(obs_set)


27
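
The three stabilizers above each have ideal expectation value +1 on the 4-qubit GHZ state, which can be checked with a few lines of plain NumPy. This is an independent sketch, not the ground-truth routine used by `run_publication_benchmark`; `ghz_state` and `pauli_expectation` are hypothetical helpers.

```python
from functools import reduce

import numpy as np

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def ghz_state(n_qubits: int) -> np.ndarray:
    """Statevector (|0...0> + |1...1>) / sqrt(2)."""
    psi = np.zeros(2**n_qubits, dtype=complex)
    psi[0] = psi[-1] = 1 / np.sqrt(2)
    return psi

def pauli_expectation(psi: np.ndarray, pauli: str) -> float:
    """<psi| P |psi> for a Pauli string such as 'ZZZZ'."""
    op = reduce(np.kron, (PAULIS[p] for p in pauli))
    return float(np.real(np.vdot(psi, op @ psi)))

psi = ghz_state(4)
expectations = {p: pauli_expectation(psi, p) for p in ("ZZZZ", "XXXX", "YYYY", "ZIII")}
print(expectations)
```

The single-qubit term `ZIII` is included for contrast: its expectation value on the GHZ state is 0. Because the GHZ state is symmetric under qubit permutation, the qubit-ordering convention does not affect these values.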

## 4. Protocols under test

We benchmark classical shadows v0 against direct measurement baselines (§4).


In [5]:
# --- Protocols ---
protocols = [
    DirectNaiveProtocol(),
    DirectGroupedProtocol(),
    DirectOptimizedProtocol(),
    ShadowsV0Protocol(),  # classical shadows v0
]
[p.protocol_id for p in protocols]


['direct_naive', 'direct_grouped', 'direct_optimized', 'classical_shadows_v0']
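
For readers unfamiliar with the technique, the sketch below shows a generic random-Pauli classical shadows estimator in plain NumPy: each shot measures every qubit in a uniformly random X, Y, or Z basis, and a Pauli string of weight w is estimated as 3^w times the product of outcomes on its support when the bases match, and 0 otherwise. This is not the `ShadowsV0Protocol` implementation, whose internals are not shown in this notebook; all function names here are hypothetical.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S_DAG = np.array([[1, 0], [0, -1j]], dtype=complex)
# Rotation applied before a computational-basis readout to measure X, Y, or Z.
ROTATION = {"X": H, "Y": H @ S_DAG, "Z": np.eye(2, dtype=complex)}

def ghz_state(n_qubits):
    psi = np.zeros(2**n_qubits, dtype=complex)
    psi[0] = psi[-1] = 1 / np.sqrt(2)
    return psi

def outcome_probs(psi, bases):
    """Readout distribution after rotating each qubit into its chosen basis."""
    rotation = np.array([[1.0 + 0j]])
    for basis in bases:
        rotation = np.kron(rotation, ROTATION[basis])
    probs = np.abs(rotation @ psi) ** 2
    return probs / probs.sum()

def shadow_estimate(psi, pauli, n_shots, rng):
    """Average the single-shot estimator 3**weight * prod(outcomes) over random bases."""
    n_qubits = len(pauli)
    support = [i for i, p in enumerate(pauli) if p != "I"]
    cache, total = {}, 0.0
    for _ in range(n_shots):
        bases = "".join("XYZ"[k] for k in rng.integers(0, 3, size=n_qubits))
        if any(bases[i] != pauli[i] for i in support):
            continue  # a basis mismatch on the support contributes exactly 0
        if bases not in cache:
            cache[bases] = outcome_probs(psi, bases)
        index = int(rng.choice(2**n_qubits, p=cache[bases]))
        # Qubit i is the i-th tensor factor, i.e. the i-th most significant bit.
        bits = [(index >> (n_qubits - 1 - i)) & 1 for i in support]
        total += 3 ** len(support) * float(np.prod([1 - 2 * b for b in bits]))
    return total / n_shots

rng = np.random.default_rng(0)
psi = ghz_state(4)
for pauli in ("ZZII", "XXXX", "ZIII"):
    print(pauli, round(shadow_estimate(psi, pauli, 20000, rng), 3))
```

The 3^w factor is why high-weight observables such as `XXXX` are noisy under this scheme: only about 1 in 81 shots matches all four bases, so the per-shot variance is large compared with a direct measurement in the matching basis.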

## 5. Run publication benchmark (Tasks 1, 3, 6 + artifacts)

This helper computes the ground truth, produces long‑form and summary tables,
writes a manifest, and generates plots (§0, §10).


In [6]:
# --- Publication benchmark run ---
results = run_publication_benchmark(
    circuit=ghz_circuit,
    observable_set=obs_set,
    protocols=protocols,
    n_shots_grid=N_SHOTS_GRID,
    n_replicates=N_REPLICATES,
    seed=SEED,
    output_dir=str(OUTPUT_DIR),
    epsilon=EPSILON,
    delta=DELTA,
)

results['summary']


{'run_id': 'publication_benchmark_364c6ed4',
 'n_protocols': 4,
 'n_shots_grid': [100, 500, 1000, 5000],
 'n_replicates': 20,
 'n_observables': 27,
 'has_ground_truth': True,
 'n_long_form_rows': 8640,
 'protocols': ['direct_naive',
  'direct_grouped',
  'direct_optimized',
  'classical_shadows_v0'],
 'protocol_summaries': {'direct_naive': {'mean_se': np.float64(0.059907566496439885),
   'max_se': np.float64(0.07371990106478842),
   'median_se': np.float64(0.07353873901897406),
   'mean_abs_error': np.float64(0.04928928928928933),
   'max_abs_error': np.float64(0.24324324324324326)},
  'direct_grouped': {'mean_se': np.float64(0.03644331299092116),
   'max_se': np.float64(0.044766148103584515),
   'median_se': np.float64(0.04473713023477575),
   'mean_abs_error': np.float64(0.02740740740740745),
   'max_abs_error': np.float64(0.116)},
  'direct_optimized': {'mean_se': np.float64(0.03578104254132738),
   'max_se': np.float64(0.05652334189442215),
   'median_se': np.float64(0.039895820644
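
Two figures in the summary can be cross-checked by hand. The row count matches one long-form row per protocol, shot count, replicate, and observable, and the `direct_naive` median standard error is close to what an even split of 5000 shots across 27 observables would give for a zero-mean ±1 observable. Both readings are inferences from the printed numbers, not documented behaviour of `run_publication_benchmark`.

```python
import math

n_protocols, n_shot_counts, n_replicates, n_observables = 4, 4, 20, 27

# Assumed layout: one row per (protocol, shot count, replicate, observable).
expected_rows = n_protocols * n_shot_counts * n_replicates * n_observables
print(expected_rows)  # 8640, matching n_long_form_rows

# Assumed even split of the largest budget across observables.
shots_per_observable = 5000 // n_observables
se_zero_mean = 1 / math.sqrt(shots_per_observable)
print(shots_per_observable, round(se_zero_mean, 4))  # 185 0.0735
```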

## 6. Run remaining tasks (2, 4, 5, 7, 8)

We evaluate Tasks 2, 4, 5, and 8 on the long‑form results from the previous step.
Noise sensitivity (Task 7) requires a separate noise‑profile sweep, which is run in Section 7.


In [7]:
# --- Common inputs ---
long_form_rows = results['long_form_results']
truth_values = results['ground_truth'].truth_values if results['ground_truth'] else {}

task_outputs = {}

# Task 2: Average/weighted accuracy target
task2 = AverageTargetTask(TaskConfig(
    task_id='task2_average_target',
    task_type=TaskType.AVERAGE_TARGET,
    epsilon=EPSILON,
    delta=DELTA,
    n_grid=N_SHOTS_GRID,
    n_replicates=N_REPLICATES,
    criterion_type=CriterionType.TRUTH_BASED,
))
for protocol in protocols:
    rows = [r for r in long_form_rows if r.protocol_id == protocol.protocol_id]
    task_outputs[f'task2_{protocol.protocol_id}'] = task2.evaluate(rows, truth_values)

# Task 4: Dominance (compare shadows vs grouped baseline)
task4 = DominanceTask(TaskConfig(
    task_id='task4_dominance',
    task_type=TaskType.DOMINANCE,
    epsilon=EPSILON,
    delta=DELTA,
    n_grid=N_SHOTS_GRID,
    n_replicates=N_REPLICATES,
    criterion_type=CriterionType.TRUTH_BASED,
))
rows_shadows = [r for r in long_form_rows if r.protocol_id == 'classical_shadows_v0']
rows_grouped = [r for r in long_form_rows if r.protocol_id == 'direct_grouped']
dominance_summary = task4.compare_protocols(rows_shadows, rows_grouped, truth_values, metric='mean_error')
task_outputs['task4_dominance_summary'] = dominance_summary

# Task 5: Pilot selection + regret (uses all protocols)
task5 = PilotSelectionTask(TaskConfig(
    task_id='task5_pilot_selection',
    task_type=TaskType.PILOT_SELECTION,
    epsilon=EPSILON,
    delta=DELTA,
    n_grid=N_SHOTS_GRID,
    n_replicates=N_REPLICATES,
    criterion_type=CriterionType.TRUTH_BASED,
    additional_params={'pilot_n': N_SHOTS_GRID[0], 'target_n': N_SHOTS_GRID[-1]},
))
task_outputs['task5_pilot_selection'] = task5.evaluate(long_form_rows, truth_values)

# Task 8: Adaptive efficiency (evaluated per protocol)
task8 = AdaptiveEfficiencyTask(TaskConfig(
    task_id='task8_adaptive_efficiency',
    task_type=TaskType.ADAPTIVE_EFFICIENCY,
    epsilon=EPSILON,
    delta=DELTA,
    n_grid=N_SHOTS_GRID,
    n_replicates=N_REPLICATES,
    criterion_type=CriterionType.TRUTH_BASED,
))
for protocol in protocols:
    rows = [r for r in long_form_rows if r.protocol_id == protocol.protocol_id]
    task_outputs[f'task8_{protocol.protocol_id}'] = task8.evaluate(rows, truth_values)

task_outputs


{'task2_direct_naive': TaskOutput(task_id='task2_average_target', task_type=<TaskType.AVERAGE_TARGET: 'average_target'>, protocol_id='direct_naive', circuit_id='circuit', n_star=None, ssf=None, baseline_protocol_id=None, worst_observable_id=None, crossover_n=None, selection_accuracy=None, regret=None, metrics={'epsilon': 0.01, 'delta': 0.05, 'criterion_type': 'truth_based'}, details={'average_quality_by_n': {100: {'mean': 0.4049382716049383, 'median': 0.38271604938271614}, 500: {'mean': 0.15802469135802472, 'median': 0.15432098765432103}, 1000: {'mean': 0.11681681681681685, 'median': 0.12012012012012017}, 5000: {'mean': 0.049289289289289336, 'median': 0.051451451451451496}}, 'success_fraction_by_n': {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0}, 'weights': {'obs_b3f1d11a': 1.0, 'obs_c022ce9f': 1.0, 'obs_181f3284': 1.0, 'obs_142f8605': 1.0, 'obs_55ad10a1': 1.0, 'obs_bf92f15c': 1.0, 'obs_bf1aec14': 1.0, 'obs_a7f9ead3': 1.0, 'obs_f20b99ef': 1.0, 'obs_c5123b46': 1.0, 'obs_a73aa0db': 1.0, 'obs
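
One plausible reading of `success_fraction_by_n` is the share of replicates whose mean absolute error across observables is at most ε. The sketch below implements that reading on a small synthetic data set; the definition, the row layout, and the helper name are assumptions, not the `AverageTargetTask` implementation.

```python
from collections import defaultdict
from statistics import mean

def success_fraction_by_n(rows, truth, epsilon):
    """rows: iterable of (n_shots, replicate_id, observable_id, estimate)."""
    errors = defaultdict(list)
    for n_shots, replicate_id, observable_id, estimate in rows:
        errors[(n_shots, replicate_id)].append(abs(estimate - truth[observable_id]))
    outcomes = defaultdict(list)
    for (n_shots, _), errs in errors.items():
        outcomes[n_shots].append(mean(errs) <= epsilon)  # one pass/fail per replicate
    return {n: sum(flags) / len(flags) for n, flags in sorted(outcomes.items())}

# Synthetic toy data, for illustration only.
truth = {"a": 1.0, "b": 0.0}
rows = [
    (100, 0, "a", 0.90), (100, 0, "b", 0.10),    # mean abs error about 0.10
    (100, 1, "a", 1.00), (100, 1, "b", 0.015),   # mean abs error 0.0075
    (500, 0, "a", 0.99), (500, 0, "b", 0.00),    # mean abs error about 0.005
    (500, 1, "a", 1.00), (500, 1, "b", -0.01),   # mean abs error 0.005
]
fractions = success_fraction_by_n(rows, truth, epsilon=0.01)
print(fractions)
```

Under this reading, the `direct_naive` result above is unsurprising: its mean error at 5000 shots is about 0.049, well above ε = 0.01, which is consistent with a success fraction of 0.0 at every grid point.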

## 7. Noise sensitivity sweep (Task 7)

We run a short noise‑profile sweep (three profiles, with `max(5, N_REPLICATES // 2)` = 10 replicates each) to compute Task 7 metrics.
Noise profiles follow the canonical definitions in `src/quartumse/noise/profiles.py`.


In [8]:
# --- Noise sensitivity sweep ---
noise_profiles = ['ideal', 'readout_1e-2', 'depol_low']

sweep_config = SweepConfig(
    protocols=protocols,
    circuits=[('ghz_4', ghz_circuit)],
    observable_sets=[('ghz_mixed_set', obs_set)],
    n_grid=N_SHOTS_GRID,
    n_replicates=max(5, N_REPLICATES // 2),
    noise_profiles=noise_profiles,
    seeds={'base': SEED},
    seed_policy='noise_sweep',
)
sweep = SweepOrchestrator(sweep_config)
noise_results = sweep.run()

task7 = NoiseSensitivityTask(TaskConfig(
    task_id='task7_noise_sensitivity',
    task_type=TaskType.NOISE_SENSITIVITY,
    epsilon=EPSILON,
    delta=DELTA,
    n_grid=N_SHOTS_GRID,
    n_replicates=max(5, N_REPLICATES // 2),
    criterion_type=CriterionType.TRUTH_BASED,
    additional_params={'baseline_noise_profile': 'ideal'},
))

for protocol in protocols:
    rows = [r for r in noise_results if r.protocol_id == protocol.protocol_id]
    task_outputs[f'task7_{protocol.protocol_id}'] = task7.evaluate(rows, truth_values)

task_outputs['task7_classical_shadows_v0']


TaskOutput(task_id='task7_noise_sensitivity', task_type=<TaskType.NOISE_SENSITIVITY: 'noise_sensitivity'>, protocol_id='classical_shadows_v0', circuit_id='ghz_4', n_star=None, ssf=None, baseline_protocol_id=None, worst_observable_id=None, crossover_n=None, selection_accuracy=None, regret=None, metrics={'epsilon': 0.01, 'delta': 0.05, 'baseline_noise_profile': 'ideal', 'failure_rate': 1.0}, details={'n_star_by_noise': {'ideal': None, 'readout_1e-2': None, 'depol_low': None}, 'success_fraction_by_noise': {'ideal': {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0}, 'readout_1e-2': {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0}, 'depol_low': {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0}}, 'degradation_ratio': {'ideal': None, 'readout_1e-2': None, 'depol_low': None}}, metadata={})
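
The `None` entries in this output follow naturally if `n_star` is the smallest shot count whose success fraction reaches 1 - δ and the degradation ratio is `n_star` under noise divided by `n_star` under the baseline profile. The sketch below uses those assumed definitions (the field names come from the output, the formulas do not) and hypothetical helper names.

```python
def n_star(success_by_n, delta):
    """Smallest shot count whose success fraction is at least 1 - delta, else None."""
    passing = [n for n, frac in sorted(success_by_n.items()) if frac >= 1 - delta]
    return passing[0] if passing else None

def degradation_ratios(success_by_noise, baseline, delta):
    """n_star(profile) / n_star(baseline); None whenever either side is undefined."""
    stars = {profile: n_star(by_n, delta) for profile, by_n in success_by_noise.items()}
    base = stars[baseline]
    return {
        profile: (star / base if star is not None and base is not None else None)
        for profile, star in stars.items()
    }

# Success fractions copied from the classical_shadows_v0 output above.
observed = {
    "ideal": {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0},
    "readout_1e-2": {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0},
    "depol_low": {100: 0.0, 500: 0.0, 1000: 0.0, 5000: 0.0},
}
ratios = degradation_ratios(observed, "ideal", delta=0.05)
print(ratios)
```

With every success fraction at 0.0, no grid point qualifies, so every ratio is undefined. A larger shot grid or a looser ε would be needed before the ratios carry information.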

## 8. Save extended task results and build final report

We merge the helper outputs with Tasks 2/4/5/7/8 and write a final report
with explicit conclusions (§10 reporting requirements).


In [9]:
# --- Persist extended task results ---
writer = ParquetWriter(OUTPUT_DIR)
extra_task_results = [
    output.to_task_result(results['summary']['run_id'])
    for output in task_outputs.values()
    if hasattr(output, 'to_task_result')
]
if extra_task_results:
    writer.write_task_results(extra_task_results)

# --- Build final narrative report ---
final_report_path = OUTPUT_DIR / 'final_report.md'

summary = results['summary']
protocol_summaries = results['protocol_summaries']

conclusions = [
    f"Run ID: {summary['run_id']} with {summary['n_protocols']} protocols",
    f"Ground truth computed: {summary['has_ground_truth']}",
    "Classical shadows v0 is benchmarked against direct baselines under identical budgets.",
    "Task 7 noise sensitivity is evaluated across canonical profiles.",
    "Artifacts include long-form, summary, plots, and provenance manifest in output_dir.",
]

report_lines = [
    '# Final Benchmark Report',
    '',
    '## Configuration',
    f"- Circuit: GHZ ({N_QUBITS} qubits)",
    f"- Observables: {len(obs_set)}",
    f"- Shot grid: {N_SHOTS_GRID}",
    f"- Replicates: {N_REPLICATES}",
    f"- Epsilon: {EPSILON}, Delta: {DELTA}",
    '',
    '## Protocol Summaries (max N)',
]

for protocol_id, stats in protocol_summaries.items():
    report_lines.append(f"- {protocol_id}: {stats}")

report_lines.append('')
report_lines.append('## Task Outputs')
for key, output in task_outputs.items():
    report_lines.append(f"- {key}: {getattr(output, 'metrics', output)}")

report_lines.append('')
report_lines.append('## Conclusions')
for line in conclusions:
    report_lines.append(f"- {line}")

report_content = "\n".join(report_lines)
final_report_path.write_text(report_content, encoding="utf-8")
final_report_path

WindowsPath('results/ghz_shadows_v0_publication/final_report.md')

## 9. Conclusions (explicit)

- The benchmark executed all required tasks (1–8) and produced the full artifact set
  (long‑form, summary, plots, manifest, and task results).
- Classical shadows v0 is evaluated against direct baselines on a GHZ workload
  with consistent shot budgets and replicates.
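- Noise sensitivity (Task 7) is assessed using canonical profiles. In this run, classical
  shadows v0 did not meet the ε = 0.01 target at any shot count up to 5000 under any
  profile, so `n_star` and the degradation ratios relative to the ideal baseline are `None`.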
- The final report is saved to `results/ghz_shadows_v0_publication/final_report.md`.
