Add dem_sampling CPU/GPU across C++ and Python #479
Conversation
Introduce dem_sampling implementations for CPU and cuStabilizer-backed GPU paths in C++, and expose them through pybind/Python with torch tensor and device-pointer support. Add C++/Python coverage for backend paths and wire build/packaging checks so cuStabilizer requirements are enforced for shipping.

Signed-off-by: kvmto <kmato@nvidia.com>
- Use cudaMallocAsync/cudaFreeAsync for all GPU temporaries to avoid implicit device synchronization that breaks multi-stream concurrency (critical for PyTorch CUDA stream integration)
- Replace synchronous cudaMemcpy with cudaMemcpyAsync on the caller's stream for the probability D->H copy
- Add grid dimension overflow guards before every CUDA kernel launch
- Handle numShots=0 gracefully in both the C++ CPU path and the Python binding
- Binarize check_matrix with & 1u in the CPU path to match GPU kernel behavior and prevent uint8 dot-product overflow
- Clear sticky CUDA errors (cudaGetLastError) on all failure paths in the Python binding's GPU allocation/copy helpers
- Fix pre-existing test_non_default_cuda_stream assertion that compared torch.device("cuda") against torch.device("cuda", index=0)
- Add 12 new tests covering the zero-shot edge case, non-binary H matrix CPU/GPU parity, and the seedless code path (5 C++, 7 Python)
Signed-off-by: kvmto <kmato@nvidia.com>
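The `& 1u` binarization above can be sketched in plain NumPy (a minimal illustration, not the actual library code; the matrix values are made up):

```python
import numpy as np

# Hypothetical non-binary check matrix: entries 2 and 3 instead of 0/1.
H = np.array([[3, 0, 2],
              [0, 2, 1]], dtype=np.uint8)
errors = np.array([1, 1, 0], dtype=np.uint8)

# Keep only the low bit of each entry, as the GPU kernel does with `& 1u`:
# 3 -> 1, 2 -> 0. This keeps the uint8 accumulator small as well.
H_bin = H & 1

# Mod-2 syndrome from the binarized matrix.
syndrome = (H_bin @ errors) % 2
print(syndrome.tolist())  # [1, 0]
```

Applying the same masking on the CPU path keeps both backends reading only the low bit of each entry, so non-binary inputs cannot diverge between them.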
# Conflicts:
#	.github/actions/build-lib/action.yaml
#	.github/actions/build-lib/build_qec.sh
#	.github/workflows/lib_qec.yaml
#	libs/qec/unittests/CMakeLists.txt
cuStabilizer is now unconditionally required (no optional detection). PyTorch is user-installed only: numpy in = numpy out, torch CUDA in = torch CUDA out, torch CPU in = explicit error, anything else = fail.

CMake / build:
- Replace 85-line cuStabilizer detection with find_package(REQUIRED)
- Remove CUDAQ_QEC_REQUIRE_CUSTABILIZER option and all HAS_CUSTABILIZER conditionals from lib/, python/, unittests/ CMakeLists and pyproject
- Remove require-custabilizer input from action.yaml and dead REQUIRE_CUSTABILIZER env/shell logic from build_qec.sh, build_all.sh
- Remove custabilizer matrix dimension from lib_qec.yaml
- Revert docs.yaml to pre-branch state (no cuStabilizer in docs build)

C++ bindings (py_dem_sampling.cpp):
- Remove torch-to-numpy conversion from asNumpyUint8/asNumpyFloat64
- Add rejectTorchCpuTensors() to block silent numpy conversion
- Add install-torch warning in tryTorchGpuSampling via PyErr_WarnEx
- Remove #ifdef CUDAQ_QEC_HAS_CUSTABILIZER guards throughout
- Remove always-true dem_sampling_has_gpu_compiled attribute
- Remove redundant if(ok) block in tryGpuSampling
- Update pybind docstring for numpy primary / torch CUDA optional

Python layer (dem_sampling.py):
- Add warnings.warn() hint when tensor-like input detected but torch not importable
- Update module and function docstrings

Tests:
- Delete 3 redundant torch CPU tests, flip 2 to expect rejection
- Keep 4 CUDA tests unchanged
- Add test_torch_not_installed (pytest.warns) and test_random_object
- Remove unused dem_sampling_has_gpu_compiled check and qec import
- Remove #ifdef guards from test_dem_sampling.cpp

CI (test_wheels.sh):
- Run QEC tests without torch first, install torch, re-run QEC tests

Signed-off-by: kvmto <kmato@nvidia.com>
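The input-dispatch policy above (numpy in = numpy out, torch CUDA in = torch CUDA out, torch CPU in = explicit error, anything else = fail) can be sketched as follows. This is an illustration only; the function name and return values are assumptions, not the actual binding code:

```python
import numpy as np

def dispatch(probabilities):
    """Sketch of the dem_sampling input policy (names are illustrative)."""
    # Torch is only handled when the user installed it themselves.
    torch_mod = None
    try:
        import torch as torch_mod  # optional, user-installed
    except ImportError:
        pass

    if torch_mod is not None and isinstance(probabilities, torch_mod.Tensor):
        if probabilities.is_cuda:
            return "gpu-torch"  # torch CUDA in -> torch CUDA out
        # torch CPU in -> explicit error, no silent numpy conversion
        raise RuntimeError(
            "CPU torch tensors are rejected; pass a numpy array instead")

    if isinstance(probabilities, np.ndarray):
        return "cpu-numpy"      # numpy in -> numpy out

    # anything else -> fail
    raise TypeError(f"unsupported input type: {type(probabilities).__name__}")
```

The key design point is that no branch falls through to an implicit conversion: every input class either maps to exactly one backend or raises.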
…eakage

The --upgrade flag was causing pip to upgrade all listed packages to their latest versions, breaking two CI workflows:
- all_libs: numpy 1.26.4 -> 2.4.4 broke openfermion (np.string_ removed)
- docs: sphinx 8.x -> 9.1.0 broke sphinx_toolbox (autodoc.logger removed)

Pin sphinx<9 in the docs workflow and add proper quoting for version specs.

Signed-off-by: kvmto <kmato@nvidia.com>
…list inputs

- Throw RuntimeError immediately when CUDA tensor probability validation fails on the GPU path, instead of falling through to numpy conversion, which crashes on device tensors
- Accept plain Python lists as valid input (numpy auto-converts them) and update the test accordingly

Signed-off-by: kvmto <kmato@nvidia.com>
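The list-acceptance change rests on NumPy converting plain Python lists transparently, so no special-casing is needed. A minimal sketch of such a validation helper (the name and checks are illustrative, not the actual binding code):

```python
import numpy as np

def as_probabilities(values):
    # np.asarray accepts plain Python lists as well as existing ndarrays
    # (the latter pass through without a copy when dtypes already match).
    arr = np.asarray(values, dtype=np.float64)
    if arr.ndim != 1:
        raise ValueError("expected a 1-D probability vector")
    if ((arr < 0) | (arr > 1)).any():
        raise ValueError("probabilities must lie in [0, 1]")
    return arr
```

With this shape, `as_probabilities([0.25, 0.5, 0.25])` and `as_probabilities(np.array([0.25, 0.5, 0.25]))` behave identically.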
contractors.py unconditionally imported torch at module level, causing pytest collection to fail with ModuleNotFoundError when torch is not installed. This broke all QEC Python tests in CI after torch was removed from the pip install in commit 509f82c.

- contractors.py: lazy-import torch (from __future__ import annotations + TYPE_CHECKING for type hints; move the import into the einsum_torch body)
- tensor_network_decoder.py: raise RuntimeError with install instructions when torch is missing on CPU (matches the dem_sampling pattern)
- test_tensor_network_decoder.py: skip 9 decoder-construction tests when there is no torch and no GPU; the 21 utility tests run unconditionally. Wheel CI (test_wheels.sh) re-runs all tests with torch installed.

Signed-off-by: kvmto <kmato@nvidia.com>
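The lazy-import pattern described above can be sketched as follows (a simplified illustration under the same constraints, not the actual contractors.py code; the error message is made up):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so annotations below do not
    # require torch at import time.
    import torch

def einsum_torch(expr: str, *operands: "torch.Tensor"):
    # Import inside the function body so that merely importing this module
    # (e.g. during pytest collection) never raises ModuleNotFoundError.
    try:
        import torch
    except ImportError as exc:
        raise RuntimeError(
            "this contractor requires torch; install it with "
            "`pip install torch`") from exc
    return torch.einsum(expr, *operands)
```

Module import now always succeeds; the hard dependency surfaces only when the torch-backed code path is actually exercised, and then with an actionable message.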
…I conflict

Commit 55369ff removed the explicit torch==2.9.0 CUDA-indexed install from all_libs.yaml and all_libs_release.yaml while trying to make torch optional. However, both workflows still install lightning (needed for the solvers GQE tests), which hard-depends on torch. Without the explicit install, pip pulls the CPU-only torch wheel from PyPI, which bundles an old libgfortran.so.5 (GCC 7/8, only GFORTRAN_8). libcudaq-solvers.so is compiled with GCC 11 and requires GFORTRAN_10, so it crashes at import time when torch's bundled copy shadows the system library.

Fix: pre-install torch==2.9.0 from the CUDA wheel index before the main pip install in both workflows, so lightning finds torch already satisfied and never pulls the CPU wheel.

Also fix all_libs_release.yaml: remove --upgrade (causes numpy 2.x / sphinx 9.x breakage, already fixed in all_libs.yaml by 4ccae9c) and quote >=version specs to prevent bash redirect misinterpretation.

Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23
left a comment
I took a quick look at some of the packaging items and think we should align on them before going through with the full review. Specifically,
- I don't think we should modify anything related to the tensor network decoder packaging as part of this PR. (The current changes remove torch from that decoder for some reason.)
- I thought we had agreed that the maintenance and required testing infrastructure of a new optional feature (dem_sampling) was too high, so we preferred not to have the optional dependency, no? Did that turn out not to be possible?
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>

# Conflicts:
#	.github/workflows/all_libs.yaml
#	.github/workflows/all_libs_release.yaml
#	.github/workflows/lib_qec.yaml
#	libs/core/include/cuda-qx/core/kwargs_utils.h
#	libs/qec/pyproject.toml.cu12
#	libs/qec/pyproject.toml.cu13
#	libs/qec/python/bindings/py_decoder.cpp
Signed-off-by: Sachin Pisal <spisal@nvidia.com>
bmhowe23
left a comment
Thanks, Kevin! My previous packaging issues look to be fully resolved...thank you. I have just a few minor comments below.
I manually kicked off the "Build wheels" job to exercise the wheel-based workflow: https://github.com/NVIDIA/cudaqx/actions/runs/24807169194. This workflow matrixes the tests by CPU/GPU, CUDA versions, and Python versions, so it is a bit more thorough. Will monitor this evening.
Signed-off-by: Ben Howe <bhowe@nvidia.com>
The "Build wheels" job failed. Hopefully it's ok - I took the liberty of attempting a fix in b9a7270 and 9c5eced. If you object to them, we can remove them.
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23
left a comment
Thanks, Kevin! Before merging, please double check the PR summary (and therefore commit message) for accuracy in light of the review updates. In particular, I suspect this part should be removed?
Add CUDAQ_QEC_REQUIRE_CUSTABILIZER enforcement path for builds that must ship GPU support.
## Summary

Dependent on the merge of PR #479.

- Add Python GPU test coverage to the QEC GPU CI job (`pr-build-gpu` in `lib_qec.yaml`)
  - The GPU job previously only ran C++ tests via `ctest`. Python tests with GPU-dependent code paths (cuQuantum tensor network contractor, DEM sampling GPU backend, TRT decoder inference) were skipped on CPU runners and never executed on GPU runners.
- Install torch (CUDA wheel) and Python test dependencies on the GPU runner
- Add a `pytest` step targeting `test_tensor_network_decoder.py`, `test_dem_sampling.py`, and `test_trt_decoder.py`
- Expand the `ctest` regex to also include `TRTDecoder` C++ tests

## Test plan

- [ ] `pr-build-gpu` job runs successfully on amd64 and arm64
- [ ] "Run GPU Python tests" step shows previously-skipped cuQuantum/CUDA tests now running
- [ ] "Run GPU C++ tests" step includes TRTDecoder tests (amd64)
- [ ] No regressions in existing CPU test jobs (`pr-build`)

---------

Signed-off-by: kvmto <kmato@nvidia.com>
Summary

- Add `dem_sampling` (C++) with CPU and cuStabilizer-backed GPU paths, and expose it through nanobind and Python as `cudaq_qec.dem_sampling` with `backend="auto" | "cpu" | "gpu"` and NumPy / PyTorch tensor support (incl. CUDA device pointers on the GPU path).
- Add tests (`DemSamplingCPU`, `DemSamplingGPU`) and `libs/qec/python/tests/test_dem_sampling.py` covering CPU/GPU and NumPy/PyTorch inputs.

Build / packaging

- `FindcuStabilizer.cmake` auto-discovers the library via `CUSTABILIZER_ROOT`, `CUQUANTUM_ROOT`, or the active Python environment's `cuquantum-python-cuXX` wheel. If missing, CMake errors with a pip install hint.
- `cuquantum-python-cuXX>=26.03.0` is promoted to a core runtime dep of `cudaq-qec` (it ships cuStabilizer); `torch` stays optional under `tensor_network_decoder`/`all`.
- …cuquantum wheel, so `libcustabilizer.so.0` is found without `LD_LIBRARY_PATH`; CI drops the previous `CUSTABILIZER_ROOT`/`LD_LIBRARY_PATH` plumbing.

Test plan

- `DemSamplingCPU`/`DemSamplingGPU` tests in CI.
- `test_dem_sampling.py` in CI with CUDA-enabled torch.