37 changes: 36 additions & 1 deletion .github/workflows/ci-gpu.yml
@@ -194,7 +194,7 @@ jobs:
options: -u root --security-opt seccomp=unconfined --shm-size 16g
env:
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
timeout-minutes: 20
timeout-minutes: 35
steps:
- name: Install system dependencies
run: |
@@ -255,6 +255,41 @@ jobs:
PREDECODER_TEST_SAMPLES: "2048"
PREDECODER_TRAIN_EPOCHS: "2"

- name: Multi-GPU smoke training with parallel spacelike HE (2 GPUs, DDP)
# Additive coverage on top of the default-config multi-GPU smoke above.
# Forces data.use_compile=True + data.use_parallel_spacelike=True so the
# parallel + compiled spacelike HE path runs end-to-end under DDP on
# 2 GPUs. Failure modes specific to this combination (per-rank device
# pinning of the partition, torch.compile cache contention across
# ranks, deadlocks during the compiled inner loop) surface as a
# training crash here. The existing default-config step above is
# intentionally left untouched so we do not regress on coverage of
# the default path.
shell: bash
run: |
. .venv_multigpu/bin/activate
export PREDECODER_TIMING_RUN=1
export PREDECODER_DISABLE_SDR=1
export PREDECODER_LER_FINAL_ONLY=1
export PREDECODER_INFERENCE_NUM_SAMPLES=32
export PREDECODER_INFERENCE_LATENCY_SAMPLES=0
export PREDECODER_INFERENCE_MEAS_BASIS=both
export PREDECODER_INFERENCE_NUM_WORKERS=0
EXPERIMENT_NAME=ci_multi_gpu_he WORKFLOW=train GPUS=2 \
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh 2>&1 | tee /tmp/ci_multigpu_he_train.log
r=${PIPESTATUS[0]}; [ $r -ne 0 ] && exit $r
EXPERIMENT_NAME=ci_multi_gpu_he WORKFLOW=inference GPUS=2 \
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh 2>&1 | tee /tmp/ci_multigpu_he_infer.log
r=${PIPESTATUS[0]}; [ $r -ne 0 ] && exit $r
python code/scripts/check_ler_from_log.py /tmp/ci_multigpu_he_train.log --max-ler 0.35
env:
PREDECODER_TRAIN_SAMPLES: "16384"
PREDECODER_VAL_SAMPLES: "2048"
PREDECODER_TEST_SAMPLES: "2048"
PREDECODER_TRAIN_EPOCHS: "2"

# ---------------------------------------------------------------------------
# GPU coverage: captures GPU-specific code paths missed by the CPU coverage job
# ---------------------------------------------------------------------------
60 changes: 60 additions & 0 deletions README.md
@@ -651,6 +651,66 @@ time.
- **Inference uses the trained model from `outputs/<experiment_name>/models/`**, so keep the same `EXPERIMENT_NAME` when you switch from training to inference.
- **Training auto-resumes**: if a run is interrupted, launching the same training command again (same `EXPERIMENT_NAME`) will automatically load the latest checkpoint it finds and continue training (up to the fixed 100 epochs). To force a clean restart, set `FRESH_START=1`, although we recommend changing `EXPERIMENT_NAME` instead.

### HE acceleration (advanced): parallel spacelike

The spacelike homological-equivalence (HE) pass canonicalises each
`(batch, round)` diff frame independently. By default the canonicalisation
processes stabilisers sequentially. With `data.use_parallel_spacelike: True`,
the cache build computes a 2-colouring of the stabiliser-overlap graph so that
the two colour classes can be reduced in parallel inside a
`torch.compile`-friendly inner loop. This cuts Python <-> compiled-graph
crossings per HE pass and exposes more parallelism to the GPU.
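
For intuition, the sketch below shows the colouring step in isolation,
assuming the overlap graph is given as an edge list in which an edge joins two
stabilisers whose supports share a qubit; the function name and data layout
are illustrative, not the repository's API.

```python
# Minimal sketch (illustrative, not the repository's implementation):
# BFS 2-colouring of the stabiliser-overlap graph. On a bipartite graph
# this yields the two colour classes that the parallel path reduces
# together; an odd cycle means no valid 2-colouring exists, so we fail
# loudly, naming the offending stabiliser pair.
from collections import deque

def two_colour_stabilisers(num_stabilisers, overlap_edges):
    adj = [[] for _ in range(num_stabilisers)]
    for u, v in overlap_edges:
        adj[u].append(v)
        adj[v].append(u)
    colour = [-1] * num_stabilisers  # -1 = unassigned
    for start in range(num_stabilisers):
        if colour[start] != -1:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] == -1:
                    colour[v] = colour[u] ^ 1
                    queue.append(v)
                elif colour[v] == colour[u]:
                    raise ValueError(
                        f"stabiliser-overlap graph is not bipartite: "
                        f"stabilisers {u} and {v} overlap but share a colour"
                    )
    return colour

# A path graph (as in a rotated-surface-code row) 2-colours cleanly:
assert two_colour_stabilisers(4, [(0, 1), (1, 2), (2, 3)]) == [0, 1, 0, 1]
```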

#### How to enable

In any config:

```yaml
data:
use_compile: True # required to see the speedup
use_parallel_spacelike: True
```

Or on the CLI:

```bash
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh
```

#### Pros (when to enable)

- **Faster spacelike HE on GPU** for the rotated single-basis surface code, by
amortising per-iteration Python overhead and running both colour classes
through `torch.compile` together.
- **Syndrome-equivalent to the sequential path** on supported codes: the
parallel path preserves the HE invariants and produces valid non-increasing
representatives without following the sequential stabiliser order. Outputs
are not guaranteed bit-identical to the sequential path; both are valid
representatives of the same coset (a cheap check is sketched after this
list). Coverage is added under `code/tests/mid/test_homological_equivalence.py`.
- **Composes with `data.use_weight2`** — the weight-2 fix-equivalence pass is
applied per colour.
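
To make the "same coset, not bit-identical" claim concrete, a check along the
following lines compares syndromes instead of bits. Here `H`, `e_seq`, and
`e_par` are hypothetical stand-ins (a toy parity-check matrix and the two
paths' outputs), not names from the repository.

```python
# Hypothetical check mirroring the test intent: the two paths may return
# different bit patterns, but representatives of the same coset must
# produce identical syndromes under the parity-check matrix H
# (all arithmetic over GF(2)).
import numpy as np

def syndrome_equivalent(e_seq: np.ndarray, e_par: np.ndarray, H: np.ndarray) -> bool:
    s_seq = (H @ e_seq) % 2
    s_par = (H @ e_par) % 2
    return bool(np.array_equal(s_seq, s_par))

# Toy example: e_par differs from e_seq by a stabiliser s with H @ s = 0
# (mod 2), so the bits differ but the syndromes agree.
H = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=np.int64)
s = np.array([1, 1, 0, 0], dtype=np.int64)
e_seq = np.array([1, 0, 0, 0], dtype=np.int64)
e_par = (e_seq + s) % 2
assert syndrome_equivalent(e_seq, e_par, H)
assert not np.array_equal(e_seq, e_par)
```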

#### Cons / caveats (when to leave it off)

- **Rotated single-basis surface code only.** The 2-colouring assumes the
stabiliser-overlap graph is bipartite, which holds by construction for the
rotated surface code targeted here. Color codes, non-rotated layouts,
subsystem codes and mixed-basis matrices can produce odd cycles; in that
case the cache build refuses with a diagnostic naming the offending
stabiliser pair rather than silently falling back.
- **`use_compile=True` is required** for the speedup; without it the partition
is built but the optimised compiled inner loop is not entered.
- **`torch.compile` has cold-start cost.** The first compiled call can pause
while Inductor/CUDA graph capture runs, and shape changes such as different
batch sizes or round counts can trigger recompilation (illustrated in the
sketch after this list).
- **Cache-build cost and memory grow slightly.** A packed
`parallel_partition_packed` view is materialised once at cache-build time so
the hot path only does dtype casts.
- **GPU-targeted.** The parallel path is designed for CUDA; on CPU you may
not see a speedup over the sequential path.
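
The cold-start and recompilation caveats can be reproduced in isolation with a
toy compiled function; the shapes, names, and loop body below are made up and
are not the repository's HE inner loop.

```python
# Toy demonstration of torch.compile cold start and shape-triggered
# recompilation (illustrative only; not the project's HE inner loop).
import time
import torch

@torch.compile(dynamic=False)  # static shapes: a new shape recompiles
def toggle_frames(frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # Batched XOR of every frame against the OR of a colour class's masks.
    return frames ^ masks.any(dim=0, keepdim=True)

frames = torch.zeros(32, 64, dtype=torch.bool)
masks = torch.rand(8, 64) > 0.5

t0 = time.perf_counter()
toggle_frames(frames, masks)   # first call: pays compilation cost
print(f"cold call: {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
toggle_frames(frames, masks)   # same shapes: reuses the compiled graph
print(f"warm call: {time.perf_counter() - t0:.6f}s")

t0 = time.perf_counter()
toggle_frames(torch.zeros(64, 64, dtype=torch.bool), masks)  # new batch size
print(f"new shape: {time.perf_counter() - t0:.3f}s (recompiled)")
```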

## Logging and outputs

### What gets written where
3 changes: 3 additions & 0 deletions code/data/generator_torch.py
@@ -53,6 +53,7 @@ def __init__(
use_coset_search=False,
coset_max_generators=20,
use_dense_overlap=False,
use_parallel_spacelike=False,
**_ignored,
):
if global_rank is None:
@@ -102,6 +103,7 @@ def __init__(
max_passes_w1=max_passes_w1,
use_weight2=use_weight2,
max_passes_w2=max_passes_w2,
use_parallel_spacelike=use_parallel_spacelike,
),
daemon=True,
)
@@ -211,6 +213,7 @@ def __init__(
use_coset_search=use_coset_search,
coset_max_generators=coset_max_generators,
use_dense_overlap=use_dense_overlap,
use_parallel_spacelike=use_parallel_spacelike,
)

if self._mixed: