
Speed up spacelike HE with optional parallel execution #81

Merged
kvmto merged 7 commits into NVIDIA:main from kvmto:feat/parallel-spacelike-he on May 12, 2026

Conversation

@kvmto
Collaborator

@kvmto kvmto commented May 11, 2026

Summary

Adds optional data.use_parallel_spacelike support for Torch surface-code HE. When enabled with data.use_compile=True, spacelike HE uses a validated 2-partition of the stabilizer-overlap graph so independent stabilizers can run in parallel through a compile-friendly path.

This threads the option through training, QCDataGeneratorTorch, MemoryCircuitTorch, public config validation, docs, tests, and GPU CI.
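
A minimal sketch of how this kind of opt-in wiring typically looks (illustrative only; the method names, `cfg` layout, and control flow below are assumptions, not the actual QCDataGeneratorTorch internals):

```python
# Hypothetical sketch: thread an opt-in flag with a False default so that
# existing configs without the flag keep the sequential spacelike HE path.
class QCDataGeneratorTorch:
    def __init__(self, cfg):
        # getattr-style defaults keep older configs unchanged (both flags off).
        self.use_parallel_spacelike = bool(getattr(cfg.data, "use_parallel_spacelike", False))
        self.use_compile = bool(getattr(cfg.data, "use_compile", False))

    def _reduce_spacelike(self, syndromes):
        if self.use_parallel_spacelike:
            # Also runs eagerly; the speedup is realized with use_compile=True.
            return self._spacelike_he_parallel(syndromes)
        return self._spacelike_he_sequential(syndromes)  # unchanged default path
```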

Performance

Internal d=13/r=13 benchmark on 4x B200 with identical 25p noise model:

  • 2.17x faster steady-state wall time per epoch
  • 2.23x faster train time per epoch
  • 2.23x higher train throughput
  • 3.0x more completed epochs within the same 4-hour wall limit

Testing

  • 36 passed: public config + focused parallel HE tests
  • 2 passed: CUDA eager/compiled parallel spacelike tests
  • 82 passed: full homological equivalence suite

kvmto added 6 commits May 11, 2026 18:23
Adds a new data.use_parallel_spacelike flag (default False) that runs the spacelike homological-equivalence pass via a 2-partition of the stabilizer-overlap graph. The two color classes are reduced independently inside a torch.compile-friendly inner loop, cutting compiled-graph crossings per spacelike pass on GPU.

Algorithm in code/qec/surface_code/homological_equivalence_torch.py: builds a 2-coloring of the stabilizer-overlap graph at cache time, pre-packs compile inputs into cache.parallel_partition_packed, and adds a parallel weight-reduction plus weight-2 fix-equivalence path. The 2-coloring assumes the overlap graph is bipartite (holds for the rotated single-basis surface code); non-bipartite inputs are rejected by _build_spacelike_partition with a named diagnostic so misuses fail loudly.

Wiring through memory_circuit_torch.py, generator_torch.py, training/train.py plus a False default in workflows/config_validator.py keeps existing configs unchanged. Tests added in a follow-up commit.

Signed-off-by: kvmto <kmato@nvidia.com>
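
To make the cache-time 2-coloring concrete, here is a minimal BFS sketch of the idea. It mirrors the contract of _build_spacelike_partition (bipartite overlap graph in, two color classes out, loud named failure otherwise), but the names and data layout below are illustrative, not the code in homological_equivalence_torch.py:

```python
# Illustrative 2-coloring of a stabilizer-overlap graph via BFS.
from collections import deque

class NonBipartiteOverlapGraphError(ValueError):
    """Named diagnostic: the overlap graph admits no 2-partition."""

def build_spacelike_partition(num_stabilizers, overlap_edges):
    """Return two independent color classes of stabilizers, or fail loudly."""
    adj = [[] for _ in range(num_stabilizers)]
    for a, b in overlap_edges:
        adj[a].append(b)
        adj[b].append(a)

    color = [-1] * num_stabilizers
    for start in range(num_stabilizers):
        if color[start] != -1:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] == -1:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Odd cycle: the bipartite assumption is violated.
                    raise NonBipartiteOverlapGraphError(
                        f"stabilizers {u} and {v} overlap but share color {color[u]}"
                    )
    return ([i for i, c in enumerate(color) if c == 0],
            [i for i, c in enumerate(color) if c == 1])
```

The two returned classes are what would get pre-packed (padded as needed) into something like cache.parallel_partition_packed, so the compiled inner loop never repacks per call.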
Adds correctness coverage for the data.use_parallel_spacelike code path introduced in the previous commit.

code/tests/mid/test_homological_equivalence.py covers:
  • bipartite partition validity across supported distances;
  • the named-failure diagnostic on non-bipartite overlap graphs;
  • parallel weight-2 fix-equivalence moves errors off boundary stabilizers exactly like the sequential path;
  • the parallel path is idempotent and matches the sequential path on weight-2-only inputs;
  • cache.parallel_partition_packed is populated at cache-build time with correct padding for empty colors;
  • the production hot path reads the pre-packed view rather than re-packing on every call.

code/tests/test_gpu.py: CUDA test that the compiled parallel spacelike path produces bit-identical output to the eager parallel path, locking in the pack-once cache field and the float-only chunk convergence check.

All new tests gate on the same fixtures and surface-code distances as the existing HE tests; no new external dependencies, no new test infrastructure.

Signed-off-by: kvmto <kmato@nvidia.com>
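
As a flavour of the coverage described above, an illustrative pytest sketch against the partition helper sketched earlier (not the actual tests in code/tests/mid/test_homological_equivalence.py):

```python
import pytest

def test_partition_separates_overlapping_stabilizers():
    # Path graph 0-1-2-3 is bipartite: every overlapping pair must split
    # across the two color classes.
    edges = [(0, 1), (1, 2), (2, 3)]
    class_a, class_b = build_spacelike_partition(4, edges)
    for u, v in edges:
        assert (u in class_a) != (v in class_a)
    assert sorted(class_a + class_b) == [0, 1, 2, 3]

def test_non_bipartite_overlap_graph_fails_loudly():
    # A triangle contains an odd cycle, so no valid 2-partition exists.
    with pytest.raises(NonBipartiteOverlapGraphError):
        build_spacelike_partition(3, [(0, 1), (1, 2), (2, 0)])
```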
Adds user-facing documentation and discoverability for the data.use_parallel_spacelike flag introduced in the previous two commits.

README.md: new 'HE acceleration (advanced): parallel spacelike' subsection at the end of 'Configuration and advanced usage', covering how to enable (yaml + CLI), three pros (GPU speedup on rotated single-basis surface code, canonical equivalence to the sequential path with test coverage, composes with use_weight2), and four caveats (rotated single-basis only with bipartite overlap graph, use_compile=True required for the speedup, slightly higher cache-build cost and memory, GPU-targeted).

conf/config_public.yaml: one-off surfacing of use_parallel_spacelike inside the data block, with a comment explaining that other HE knobs intentionally remain in internal defaults and pointing to the README section. conf/config_pre_decoder_memory_surface_model_1_d9.yaml: list the flag alongside the existing HE knobs (timelike_he, num_he_cycles, use_weight2, max_passes_*) so the advanced config exposes the full HE surface.

No code or test changes in this commit.

Signed-off-by: kvmto <kmato@nvidia.com>
Adds a new step to the existing multi-gpu-tests job that forces data.use_compile=True + data.use_parallel_spacelike=True, exercising the parallel + compiled spacelike HE path end-to-end under DDP on 2 GPUs. The existing default-config step is left untouched so coverage of the default path does not regress.

Failure modes specific to the new combination (per-rank device pinning of the partition, torch.compile cache contention across ranks, deadlocks during the compiled inner loop) would surface as a training crash here, so this step closes the multi-GPU coverage gap that the unit tests alone do not exercise.

Bumps the multi-gpu-tests timeout-minutes from 20 to 35 to accommodate the second smoke step. The job runs only on push to main (if: github.ref == 'refs/heads/main'), matching the existing step's gating; PR builds are unaffected.

Signed-off-by: kvmto <kmato@nvidia.com>
…se_compile

The docs(he) commit surfaced `data.use_parallel_spacelike` in
`conf/config_public.yaml` and documented an `EXTRA_PARAMS=
"data.use_compile=True data.use_parallel_spacelike=True"` enable
recipe in README.md, but `validate_public_config` still only allowed
`data.{code_rotation,noise_model}`. Result: loading the shipped
`config_public.yaml` would raise `ValueError: Config field
'data.use_parallel_spacelike' is not supported in the public release`,
the documented CLI recipe would fail the same way for both keys, and
the existing `test_inference_public_model` plus the new multi-GPU CI
smoke step would crash on first run.

Fix: extend `allowed_data_keys` in
`code/workflows/config_validator.py` to include `use_compile` and
`use_parallel_spacelike`. Both default to `False` in the hidden
defaults / `getattr(..., False)` call sites, so opt-in behaviour is
unchanged; only the validator gate is relaxed. Add a focused type
check so non-boolean inputs (e.g. `data.use_parallel_spacelike: "yes"`
from a YAML edit) fail loudly instead of silently flowing through
`bool(...)` casts as truthy.

Tests: `code/tests/test_public_config.py` gets four new cases pinning
the contract -- accept + reject-non-bool for each of the two flags.
The existing 19 test_public_config cases continue to pass.

Signed-off-by: kvmto <kmato@nvidia.com>
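
A hypothetical sketch of the validator change described above (the real gate lives in code/workflows/config_validator.py; the function name and exact structure here are illustrative):

```python
ALLOWED_DATA_KEYS = {
    "code_rotation",
    "noise_model",
    # Newly allowed opt-in flags; both still default to False elsewhere.
    "use_compile",
    "use_parallel_spacelike",
}

def validate_public_data_block(data_cfg: dict) -> None:
    for key, value in data_cfg.items():
        if key not in ALLOWED_DATA_KEYS:
            raise ValueError(
                f"Config field 'data.{key}' is not supported in the public release"
            )
        # Focused type check: a YAML edit like `use_parallel_spacelike: "yes"`
        # should fail loudly rather than flow through bool(...) as truthy.
        if key in ("use_compile", "use_parallel_spacelike") and not isinstance(value, bool):
            raise ValueError(
                f"Config field 'data.{key}' must be a boolean, got {type(value).__name__}"
            )
```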
Explain the invariant-preserving behavior without implying bit-identical sequential output, and surface use_compile beside the public acceleration flag.

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested review from bmhowe23 and ivanbasov May 11, 2026 20:46
Collaborator

@ivanbasov ivanbasov left a comment

Local review of the public-repo PR. The pack-once regression from the private PR is properly fixed (verified via test_compiled_parallel_reads_pre_packed_partition_off_cache — that test's docstring even cites the prior bug, which is the right shape). Inline comments below cover five items I'd want addressed; none block merge in my view.

Inline comment threads:
  • code/workflows/config_validator.py
  • README.md (two threads)
  • code/qec/surface_code/homological_equivalence_torch.py
Clarify public flag validation, non-bit-identical HE outputs, and torch.compile cold-start behavior before merge.

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested a review from ivanbasov May 12, 2026 14:21
@kvmto kvmto merged commit b116beb into NVIDIA:main May 12, 2026
17 checks passed