37 changes: 36 additions & 1 deletion .github/workflows/ci-gpu.yml
@@ -194,7 +194,7 @@ jobs:
options: -u root --security-opt seccomp=unconfined --shm-size 16g
env:
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
timeout-minutes: 20
timeout-minutes: 35
steps:
- name: Install system dependencies
run: |
@@ -255,6 +255,41 @@ jobs:
PREDECODER_TEST_SAMPLES: "2048"
PREDECODER_TRAIN_EPOCHS: "2"

- name: Multi-GPU smoke training with parallel spacelike HE (2 GPUs, DDP)
# Additive coverage on top of the default-config multi-GPU smoke above.
# Forces data.use_compile=True + data.use_parallel_spacelike=True so the
# parallel + compiled spacelike HE path runs end-to-end under DDP on
# 2 GPUs. Failure modes specific to this combination (per-rank device
# pinning of the partition, torch.compile cache contention across
# ranks, deadlocks during the compiled inner loop) surface as a
# training crash here. The existing default-config step above is
# intentionally left untouched so we do not regress on coverage of
# the default path.
shell: bash
run: |
. .venv_multigpu/bin/activate
export PREDECODER_TIMING_RUN=1
export PREDECODER_DISABLE_SDR=1
export PREDECODER_LER_FINAL_ONLY=1
export PREDECODER_INFERENCE_NUM_SAMPLES=32
export PREDECODER_INFERENCE_LATENCY_SAMPLES=0
export PREDECODER_INFERENCE_MEAS_BASIS=both
export PREDECODER_INFERENCE_NUM_WORKERS=0
EXPERIMENT_NAME=ci_multi_gpu_he WORKFLOW=train GPUS=2 \
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh 2>&1 | tee /tmp/ci_multigpu_he_train.log
r=${PIPESTATUS[0]}; [ $r -ne 0 ] && exit $r
EXPERIMENT_NAME=ci_multi_gpu_he WORKFLOW=inference GPUS=2 \
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh 2>&1 | tee /tmp/ci_multigpu_he_infer.log
r=${PIPESTATUS[0]}; [ $r -ne 0 ] && exit $r
python code/scripts/check_ler_from_log.py /tmp/ci_multigpu_he_train.log --max-ler 0.35
env:
PREDECODER_TRAIN_SAMPLES: "16384"
PREDECODER_VAL_SAMPLES: "2048"
PREDECODER_TEST_SAMPLES: "2048"
PREDECODER_TRAIN_EPOCHS: "2"

# ---------------------------------------------------------------------------
# GPU coverage: captures GPU-specific code paths missed by the CPU coverage job
# ---------------------------------------------------------------------------
60 changes: 60 additions & 0 deletions README.md
@@ -651,6 +651,66 @@ time.
- **Inference uses the trained model from `outputs/<experiment_name>/models/`**, so keep the same `EXPERIMENT_NAME` when you switch from training to inference.
- **Training auto-resumes**: if a run is interrupted, launching the same training command again (same `EXPERIMENT_NAME`) will automatically load the latest checkpoint it finds and continue training (up to the fixed 100 epochs). To force a clean restart, set `FRESH_START=1`, although we recommend changing `EXPERIMENT_NAME` instead.

### HE acceleration (advanced): parallel spacelike

The spacelike homological-equivalence (HE) pass canonicalises each
`(batch, round)` diff frame independently. By default the canonicalisation
processes stabilisers sequentially. With `data.use_parallel_spacelike: True`,
the cache build computes a 2-colouring of the stabiliser-overlap graph so that
the two colour classes can be reduced in parallel inside a
`torch.compile`-friendly inner loop. This cuts Python <-> compiled-graph
crossings per HE pass and exposes more parallelism to the GPU.
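
For intuition, the sketch below shows the colouring step in isolation,
assuming the overlap graph is given as an edge list in which an edge joins two
stabilisers whose supports share a qubit; the function name and data layout
are illustrative, not the repository's API.

```python
# Minimal sketch (illustrative, not the repository's implementation):
# BFS 2-colouring of the stabiliser-overlap graph. On a bipartite graph
# this yields the two colour classes that the parallel path reduces
# together; an odd cycle means no valid 2-colouring exists, so we fail
# loudly, naming the offending stabiliser pair.
from collections import deque

def two_colour_stabilisers(num_stabilisers, overlap_edges):
    adj = [[] for _ in range(num_stabilisers)]
    for u, v in overlap_edges:
        adj[u].append(v)
        adj[v].append(u)
    colour = [-1] * num_stabilisers  # -1 = unassigned
    for start in range(num_stabilisers):
        if colour[start] != -1:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] == -1:
                    colour[v] = colour[u] ^ 1
                    queue.append(v)
                elif colour[v] == colour[u]:
                    raise ValueError(
                        f"stabiliser-overlap graph is not bipartite: "
                        f"stabilisers {u} and {v} overlap but share a colour"
                    )
    return colour

# A path graph (as in a rotated-surface-code row) 2-colours cleanly:
assert two_colour_stabilisers(4, [(0, 1), (1, 2), (2, 3)]) == [0, 1, 0, 1]
```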

#### How to enable

In any config:

```yaml
data:
use_compile: True # required to see the speedup
use_parallel_spacelike: True
```

Or on the CLI:

```bash
EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True" \
bash code/scripts/local_run.sh
```

#### Pros (when to enable)

- **Faster spacelike HE on GPU** for the rotated single-basis surface code, by
amortising per-iteration Python overhead and running both colour classes
through `torch.compile` together.
- **Syndrome-equivalent to the sequential path** on supported codes: the
parallel path preserves the HE invariants and produces valid non-increasing
representatives without following the sequential stabiliser order. Outputs
are not guaranteed bit-identical to the sequential path; both are valid
representatives of the same coset (a cheap check is sketched after this
list). Coverage is added under `code/tests/mid/test_homological_equivalence.py`.
- **Composes with `data.use_weight2`** — the weight-2 fix-equivalence pass is
applied per colour.
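
To make the "same coset, not bit-identical" claim concrete, a check along the
following lines compares syndromes instead of bits. Here `H`, `e_seq`, and
`e_par` are hypothetical stand-ins (a toy parity-check matrix and the two
paths' outputs), not names from the repository.

```python
# Hypothetical check mirroring the test intent: the two paths may return
# different bit patterns, but representatives of the same coset must
# produce identical syndromes under the parity-check matrix H
# (all arithmetic over GF(2)).
import numpy as np

def syndrome_equivalent(e_seq: np.ndarray, e_par: np.ndarray, H: np.ndarray) -> bool:
    s_seq = (H @ e_seq) % 2
    s_par = (H @ e_par) % 2
    return bool(np.array_equal(s_seq, s_par))

# Toy example: e_par differs from e_seq by a stabiliser s with H @ s = 0
# (mod 2), so the bits differ but the syndromes agree.
H = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=np.int64)
s = np.array([1, 1, 0, 0], dtype=np.int64)
e_seq = np.array([1, 0, 0, 0], dtype=np.int64)
e_par = (e_seq + s) % 2
assert syndrome_equivalent(e_seq, e_par, H)
assert not np.array_equal(e_seq, e_par)
```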

#### Cons / caveats (when to leave it off)

- **Rotated single-basis surface code only.** The 2-colouring assumes the
stabiliser-overlap graph is bipartite, which holds by construction for the
rotated surface code targeted here. Color codes, non-rotated layouts,
subsystem codes and mixed-basis matrices can produce odd cycles; in that
case the cache build refuses with a diagnostic naming the offending
stabiliser pair rather than silently falling back.
- **`use_compile=True` is required** for the speedup; without it the partition
is built but the optimised compiled inner loop is not entered.
- **`torch.compile` has cold-start cost.** The first compiled call can pause
while Inductor/CUDA graph capture runs, and shape changes such as different
batch sizes or round counts can trigger recompilation (illustrated in the
sketch after this list).
- **Cache-build cost and memory grow slightly.** A packed
`parallel_partition_packed` view is materialised once at cache-build time so
the hot path only does dtype casts.
- **GPU-targeted.** The parallel path is designed for CUDA; on CPU you may
not see a speedup over the sequential path.
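
The cold-start and recompilation caveats can be reproduced in isolation with a
toy compiled function; the shapes, names, and loop body below are made up and
are not the repository's HE inner loop.

```python
# Toy demonstration of torch.compile cold start and shape-triggered
# recompilation (illustrative only; not the project's HE inner loop).
import time
import torch

@torch.compile(dynamic=False)  # static shapes: a new shape recompiles
def toggle_frames(frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # Batched XOR of every frame against the OR of a colour class's masks.
    return frames ^ masks.any(dim=0, keepdim=True)

frames = torch.zeros(32, 64, dtype=torch.bool)
masks = torch.rand(8, 64) > 0.5

t0 = time.perf_counter()
toggle_frames(frames, masks)   # first call: pays compilation cost
print(f"cold call: {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
toggle_frames(frames, masks)   # same shapes: reuses the compiled graph
print(f"warm call: {time.perf_counter() - t0:.6f}s")

t0 = time.perf_counter()
toggle_frames(torch.zeros(64, 64, dtype=torch.bool), masks)  # new batch size
print(f"new shape: {time.perf_counter() - t0:.3f}s (recompiled)")
```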

## Logging and outputs

### What gets written where
3 changes: 3 additions & 0 deletions code/data/generator_torch.py
@@ -53,6 +53,7 @@ def __init__(
use_coset_search=False,
coset_max_generators=20,
use_dense_overlap=False,
use_parallel_spacelike=False,
**_ignored,
):
if global_rank is None:
@@ -102,6 +103,7 @@ def __init__(
max_passes_w1=max_passes_w1,
use_weight2=use_weight2,
max_passes_w2=max_passes_w2,
use_parallel_spacelike=use_parallel_spacelike,
),
daemon=True,
)
@@ -211,6 +213,7 @@ def __init__(
use_coset_search=use_coset_search,
coset_max_generators=coset_max_generators,
use_dense_overlap=use_dense_overlap,
use_parallel_spacelike=use_parallel_spacelike,
)

if self._mixed: