
Speed up spacelike HE with optional parallel execution #81

Merged
kvmto merged 7 commits into NVIDIA:main from kvmto:feat/parallel-spacelike-he on May 12, 2026

Conversation

@kvmto
Collaborator

@kvmto kvmto commented May 11, 2026

Summary

Adds optional data.use_parallel_spacelike support for Torch surface-code HE. When enabled with data.use_compile=True, spacelike HE uses a validated 2-partition of the stabilizer-overlap graph so independent stabilizers can run in parallel through a compile-friendly path.

This threads the option through training, QCDataGeneratorTorch, MemoryCircuitTorch, public config validation, docs, tests, and GPU CI.
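
A minimal sketch of how this kind of opt-in wiring typically looks (illustrative only; the method names, `cfg` layout, and control flow below are assumptions, not the actual QCDataGeneratorTorch internals):

```python
# Hypothetical sketch: thread an opt-in flag with a False default so that
# existing configs without the flag keep the sequential spacelike HE path.
class QCDataGeneratorTorch:
    def __init__(self, cfg):
        # getattr-style defaults keep older configs unchanged (both flags off).
        self.use_parallel_spacelike = bool(getattr(cfg.data, "use_parallel_spacelike", False))
        self.use_compile = bool(getattr(cfg.data, "use_compile", False))

    def _reduce_spacelike(self, syndromes):
        if self.use_parallel_spacelike:
            # Also runs eagerly; the speedup is realized with use_compile=True.
            return self._spacelike_he_parallel(syndromes)
        return self._spacelike_he_sequential(syndromes)  # unchanged default path
```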

Performance

Internal d=13/r=13 benchmark on 4x B200 with identical 25p noise model:

  • 2.17x faster steady-state wall time per epoch
  • 2.23x faster train time per epoch
  • 2.23x higher train throughput
  • 3.0x more completed epochs within the same 4-hour wall limit

Testing

  • 36 passed: public config + focused parallel HE tests
  • 2 passed: CUDA eager/compiled parallel spacelike tests
  • 82 passed: full homological equivalence suite

kvmto added 6 commits May 11, 2026 18:23
Adds a new data.use_parallel_spacelike flag (default False) that runs the spacelike homological-equivalence pass via a 2-partition of the stabilizer-overlap graph. The two color classes are reduced independently inside a torch.compile-friendly inner loop, cutting compiled-graph crossings per spacelike pass on GPU.

Algorithm in code/qec/surface_code/homological_equivalence_torch.py: builds a 2-coloring of the stabilizer-overlap graph at cache time, pre-packs compile inputs into cache.parallel_partition_packed, and adds a parallel weight-reduction plus weight-2 fix-equivalence path. The 2-coloring assumes the overlap graph is bipartite (holds for the rotated single-basis surface code); non-bipartite inputs are rejected by _build_spacelike_partition with a named diagnostic so misuses fail loudly.

Wiring through memory_circuit_torch.py, generator_torch.py, training/train.py plus a False default in workflows/config_validator.py keeps existing configs unchanged. Tests added in a follow-up commit.

Signed-off-by: kvmto <kmato@nvidia.com>
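
To make the cache-time 2-coloring concrete, here is a minimal BFS sketch of the idea. It mirrors the contract of _build_spacelike_partition (bipartite overlap graph in, two color classes out, loud named failure otherwise), but the names and data layout below are illustrative, not the code in homological_equivalence_torch.py:

```python
# Illustrative 2-coloring of a stabilizer-overlap graph via BFS.
from collections import deque

class NonBipartiteOverlapGraphError(ValueError):
    """Named diagnostic: the overlap graph admits no 2-partition."""

def build_spacelike_partition(num_stabilizers, overlap_edges):
    """Return two independent color classes of stabilizers, or fail loudly."""
    adj = [[] for _ in range(num_stabilizers)]
    for a, b in overlap_edges:
        adj[a].append(b)
        adj[b].append(a)

    color = [-1] * num_stabilizers
    for start in range(num_stabilizers):
        if color[start] != -1:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] == -1:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Odd cycle: the bipartite assumption is violated.
                    raise NonBipartiteOverlapGraphError(
                        f"stabilizers {u} and {v} overlap but share color {color[u]}"
                    )
    return ([i for i, c in enumerate(color) if c == 0],
            [i for i, c in enumerate(color) if c == 1])
```

The two returned classes are what would get pre-packed (padded as needed) into something like cache.parallel_partition_packed, so the compiled inner loop never repacks per call.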
Adds correctness coverage for the data.use_parallel_spacelike code path introduced in the previous commit.

code/tests/mid/test_homological_equivalence.py covers:
  • bipartite partition validity across supported distances;
  • the named-failure diagnostic on non-bipartite overlap graphs;
  • parallel weight-2 fix-equivalence moves errors off boundary stabilizers exactly like the sequential path;
  • the parallel path is idempotent and matches the sequential path on weight-2-only inputs;
  • cache.parallel_partition_packed is populated at cache-build time with correct padding for empty colors;
  • the production hot path reads the pre-packed view rather than re-packing on every call.

code/tests/test_gpu.py: CUDA test that the compiled parallel spacelike path produces bit-identical output to the eager parallel path, locking in the pack-once cache field and the float-only chunk convergence check.

All new tests gate on the same fixtures and surface-code distances as the existing HE tests; no new external dependencies, no new test infrastructure.

Signed-off-by: kvmto <kmato@nvidia.com>
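
As a flavour of the coverage described above, an illustrative pytest sketch against the partition helper sketched earlier (not the actual tests in code/tests/mid/test_homological_equivalence.py):

```python
import pytest

def test_partition_separates_overlapping_stabilizers():
    # Path graph 0-1-2-3 is bipartite: every overlapping pair must split
    # across the two color classes.
    edges = [(0, 1), (1, 2), (2, 3)]
    class_a, class_b = build_spacelike_partition(4, edges)
    for u, v in edges:
        assert (u in class_a) != (v in class_a)
    assert sorted(class_a + class_b) == [0, 1, 2, 3]

def test_non_bipartite_overlap_graph_fails_loudly():
    # A triangle contains an odd cycle, so no valid 2-partition exists.
    with pytest.raises(NonBipartiteOverlapGraphError):
        build_spacelike_partition(3, [(0, 1), (1, 2), (2, 0)])
```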
Adds user-facing documentation and discoverability for the data.use_parallel_spacelike flag introduced in the previous two commits.

README.md: new 'HE acceleration (advanced): parallel spacelike' subsection at the end of 'Configuration and advanced usage', covering how to enable (yaml + CLI), three pros (GPU speedup on rotated single-basis surface code, canonical equivalence to the sequential path with test coverage, composes with use_weight2), and four caveats (rotated single-basis only with bipartite overlap graph, use_compile=True required for the speedup, slightly higher cache-build cost and memory, GPU-targeted).

conf/config_public.yaml: one-off surfacing of use_parallel_spacelike inside the data block, with a comment explaining that other HE knobs intentionally remain in internal defaults and pointing to the README section. conf/config_pre_decoder_memory_surface_model_1_d9.yaml: list the flag alongside the existing HE knobs (timelike_he, num_he_cycles, use_weight2, max_passes_*) so the advanced config exposes the full HE surface.

No code or test changes in this commit.

Signed-off-by: kvmto <kmato@nvidia.com>
Adds a new step to the existing multi-gpu-tests job that forces data.use_compile=True + data.use_parallel_spacelike=True, exercising the parallel + compiled spacelike HE path end-to-end under DDP on 2 GPUs. The existing default-config step is left untouched so coverage of the default path does not regress.

Failure modes specific to the new combination (per-rank device pinning of the partition, torch.compile cache contention across ranks, deadlocks during the compiled inner loop) would surface as a training crash here, so this step closes the multi-GPU coverage gap that the unit tests alone do not exercise.

Bumps the multi-gpu-tests timeout-minutes from 20 to 35 to accommodate the second smoke step. The job runs only on push to main (if: github.ref == 'refs/heads/main'), matching the existing step's gating; PR builds are unaffected.

Signed-off-by: kvmto <kmato@nvidia.com>
…se_compile

The docs(he) commit surfaced `data.use_parallel_spacelike` in
`conf/config_public.yaml` and documented an `EXTRA_PARAMS=
"data.use_compile=True data.use_parallel_spacelike=True"` enable
recipe in README.md, but `validate_public_config` still only allowed
`data.{code_rotation,noise_model}`. Result: loading the shipped
`config_public.yaml` would raise `ValueError: Config field
'data.use_parallel_spacelike' is not supported in the public release`,
the documented CLI recipe would fail the same way for both keys, and
the existing `test_inference_public_model` plus the new multi-GPU CI
smoke step would crash on first run.

Fix: extend `allowed_data_keys` in
`code/workflows/config_validator.py` to include `use_compile` and
`use_parallel_spacelike`. Both default to `False` in the hidden
defaults / `getattr(..., False)` call sites, so opt-in behaviour is
unchanged; only the validator gate is relaxed. Add a focused type
check so non-boolean inputs (e.g. `data.use_parallel_spacelike: "yes"`
from a YAML edit) fail loudly instead of silently flowing through
`bool(...)` casts as truthy.

Tests: `code/tests/test_public_config.py` gets four new cases pinning
the contract -- accept + reject-non-bool for each of the two flags.
The existing 19 test_public_config cases continue to pass.

Signed-off-by: kvmto <kmato@nvidia.com>
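
A hypothetical sketch of the validator change described above (the real gate lives in code/workflows/config_validator.py; the function name and exact structure here are illustrative):

```python
ALLOWED_DATA_KEYS = {
    "code_rotation",
    "noise_model",
    # Newly allowed opt-in flags; both still default to False elsewhere.
    "use_compile",
    "use_parallel_spacelike",
}

def validate_public_data_block(data_cfg: dict) -> None:
    for key, value in data_cfg.items():
        if key not in ALLOWED_DATA_KEYS:
            raise ValueError(
                f"Config field 'data.{key}' is not supported in the public release"
            )
        # Focused type check: a YAML edit like `use_parallel_spacelike: "yes"`
        # should fail loudly rather than flow through bool(...) as truthy.
        if key in ("use_compile", "use_parallel_spacelike") and not isinstance(value, bool):
            raise ValueError(
                f"Config field 'data.{key}' must be a boolean, got {type(value).__name__}"
            )
```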
Explain the invariant-preserving behavior without implying bit-identical sequential output, and surface use_compile beside the public acceleration flag.

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested review from bmhowe23 and ivanbasov May 11, 2026 20:46
Collaborator

@ivanbasov ivanbasov left a comment

Local review of the public-repo PR. The pack-once regression from the private PR is properly fixed (verified via test_compiled_parallel_reads_pre_packed_partition_off_cache — that test's docstring even cites the prior bug, which is the right shape). Inline comments below cover five items I'd want addressed; none block merge in my view.

Inline comment threads:
  • code/workflows/config_validator.py
  • README.md (two threads)
  • code/qec/surface_code/homological_equivalence_torch.py
Clarify public flag validation, non-bit-identical HE outputs, and torch.compile cold-start behavior before merge.

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested a review from ivanbasov May 12, 2026 14:21
@kvmto kvmto merged commit b116beb into NVIDIA:main May 12, 2026
17 checks passed