
Add dem_sampling CPU/GPU across C++ and Python #479

Merged
kvmto merged 41 commits into NVIDIA:main from kvmto:dem_sampling_pr
Apr 24, 2026

Conversation

@kvmto
Collaborator

@kvmto kvmto commented Apr 1, 2026

Summary

  • Add dem_sampling (C++) with CPU and cuStabilizer-backed GPU paths, and
    expose it through nanobind and Python as cudaq_qec.dem_sampling with
    backend="auto" | "cpu" | "gpu" and NumPy / PyTorch tensor support
    (including CUDA device pointers on the GPU path); see the usage sketch
    after this list.
  • Tests: new C++ unit tests (DemSamplingCPU, DemSamplingGPU) and
    libs/qec/python/tests/test_dem_sampling.py covering CPU/GPU and
    NumPy/PyTorch inputs.
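
For orientation, here is a minimal usage sketch of the API described in the first bullet. Only the name cudaq_qec.dem_sampling and the backend="auto" | "cpu" | "gpu" values come from this summary; the argument names (check_matrix, error_probabilities, num_shots, seed) and the return layout are illustrative assumptions, not the confirmed signature.

```python
# Hedged sketch: argument names below are assumptions, not the confirmed
# signature of cudaq_qec.dem_sampling.
import numpy as np
import cudaq_qec

# Toy detector-error-model check matrix (detectors x error mechanisms)
# and per-mechanism error probabilities.
H = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=np.uint8)
probs = np.array([0.01, 0.02, 0.05], dtype=np.float64)

# backend="auto" | "cpu" | "gpu" per the summary; NumPy in -> NumPy out.
samples = cudaq_qec.dem_sampling(check_matrix=H,
                                 error_probabilities=probs,
                                 num_shots=1000,
                                 seed=1234,
                                 backend="auto")
print(samples.shape, samples.dtype)
```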

Build / packaging

  • FindcuStabilizer.cmake auto-discovers the library via CUSTABILIZER_ROOT,
    CUQUANTUM_ROOT, or the active Python environment's cuquantum-python-cuXX
    wheel. If missing, CMake errors with a pip install hint.
  • cuquantum-python-cuXX>=26.03.0 is promoted to a core runtime dep of
    cudaq-qec (it ships cuStabilizer); torch stays optional under
    tensor_network_decoder / all.
  • Built wheels set an RPATH pointing into the sibling cuquantum wheel, so
    libcustabilizer.so.0 is found without LD_LIBRARY_PATH; CI drops the
    previous CUSTABILIZER_ROOT / LD_LIBRARY_PATH plumbing.
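
As a rough illustration of the third discovery path (finding cuStabilizer inside the active environment's cuquantum wheel), here is a hedged Python sketch. The real logic lives in FindcuStabilizer.cmake, and the wheel layout assumed here (libcustabilizer shipped somewhere under the cuquantum package directory) is a guess.

```python
# Hedged sketch of wheel-based discovery; the real logic is CMake-side,
# and the assumed on-disk layout of the cuquantum wheel is a guess.
import glob
import importlib.util
import os

spec = importlib.util.find_spec("cuquantum")
if spec is None or not spec.submodule_search_locations:
    raise SystemExit("cuquantum not found; try: pip install 'cuquantum-python-cu12>=26.03.0'")

pkg_dir = list(spec.submodule_search_locations)[0]
hits = glob.glob(os.path.join(pkg_dir, "**", "libcustabilizer.so*"), recursive=True)
print(hits or "libcustabilizer not found under the cuquantum package")
```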

Test plan

  • C++ DemSamplingCPU / DemSamplingGPU tests in CI.
  • test_dem_sampling.py in CI with CUDA-enabled torch.
  • Wheel builds green on cu12 and cu13.

Introduce dem_sampling implementations for CPU and cuStabilizer-backed GPU paths in C++, and expose them through pybind/Python with torch tensor and device-pointer support. Add C++/Python coverage for backend paths and wire build/packaging checks so cuStabilizer requirements are enforced for shipping.

Signed-off-by: kvmto <kmato@nvidia.com>
@kvmto kvmto requested review from bmhowe23, ivanbasov and wsttiger April 1, 2026 16:02
kvmto added 2 commits April 1, 2026 16:03
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
kvmto added 14 commits April 1, 2026 22:08
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
- Use cudaMallocAsync/cudaFreeAsync for all GPU temporaries to avoid
  implicit device synchronization that breaks multi-stream concurrency
  (critical for PyTorch CUDA stream integration)
- Replace synchronous cudaMemcpy with cudaMemcpyAsync on the caller's
  stream for the probability D->H copy
- Add grid dimension overflow guards before every CUDA kernel launch
- Handle numShots=0 gracefully in both C++ CPU path and Python binding
- Binarize check_matrix with & 1u in CPU path to match GPU kernel
  behavior and prevent uint8 dot-product overflow
- Clear sticky CUDA errors (cudaGetLastError) on all failure paths in
  the Python binding's GPU allocation/copy helpers
- Fix pre-existing test_non_default_cuda_stream assertion that compared
  torch.device("cuda") against torch.device("cuda", index=0)
- Add 12 new tests covering zero-shot edge case, non-binary H matrix
  CPU/GPU parity, and seedless code path (5 C++, 7 Python)
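
The check-matrix binarization mentioned above can be pictured with a small NumPy sketch. This is an illustration of the described behavior, not the library code: binarizing with & 1 before the dot product keeps the CPU path consistent with the GPU kernel when H contains non-binary uint8 entries.

```python
# Illustration of the described CPU-path fix, not the library implementation.
import numpy as np

H = np.array([[2, 0, 1],            # non-binary entries, as in the parity tests
              [0, 3, 1]], dtype=np.uint8)
errors = np.array([1, 1, 0], dtype=np.uint8)

# Binarize with & 1 (matching the GPU kernel), then take the parity of a
# widened dot product so uint8 accumulation cannot distort the result.
syndrome = ((H & 1).astype(np.int64) @ errors.astype(np.int64)) % 2
print(syndrome)  # -> [0 1]
```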

Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
kvmto added 8 commits April 16, 2026 12:19
# Conflicts:
#	.github/actions/build-lib/action.yaml
#	.github/actions/build-lib/build_qec.sh
#	.github/workflows/lib_qec.yaml
#	libs/qec/unittests/CMakeLists.txt
cuStabilizer is now unconditionally required (no optional detection).
PyTorch is user-installed only: numpy in = numpy out, torch CUDA in =
torch CUDA out, torch CPU in = explicit error, anything else = fail.

CMake / build:
- Replace 85-line cuStabilizer detection with find_package(REQUIRED)
- Remove CUDAQ_QEC_REQUIRE_CUSTABILIZER option and all HAS_CUSTABILIZER
  conditionals from lib/, python/, unittests/ CMakeLists and pyproject
- Remove require-custabilizer input from action.yaml and dead
  REQUIRE_CUSTABILIZER env/shell logic from build_qec.sh, build_all.sh
- Remove custabilizer matrix dimension from lib_qec.yaml
- Revert docs.yaml to pre-branch state (no cuStabilizer in docs build)

C++ bindings (py_dem_sampling.cpp):
- Remove torch-to-numpy conversion from asNumpyUint8/asNumpyFloat64
- Add rejectTorchCpuTensors() to block silent numpy conversion
- Add install-torch warning in tryTorchGpuSampling via PyErr_WarnEx
- Remove #ifdef CUDAQ_QEC_HAS_CUSTABILIZER guards throughout
- Remove always-true dem_sampling_has_gpu_compiled attribute
- Remove redundant if(ok) block in tryGpuSampling
- Update pybind docstring for numpy primary / torch CUDA optional

Python layer (dem_sampling.py):
- Add warnings.warn() hint when tensor-like input detected but torch
  not importable
- Update module and function docstrings

Tests:
- Delete 3 redundant torch CPU tests, flip 2 to expect rejection
- Keep 4 CUDA tests unchanged
- Add test_torch_not_installed (pytest.warns) and test_random_object
- Remove unused dem_sampling_has_gpu_compiled check and qec import
- Remove #ifdef guards from test_dem_sampling.cpp

CI (test_wheels.sh):
- Run QEC tests without torch first, install torch, re-run QEC tests
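
The input/output policy above ("numpy in = numpy out, torch CUDA in = torch CUDA out, torch CPU in = explicit error") can be summarized with a hedged, test-style sketch. Argument names and positional order are assumptions; only the backend keyword and the policy itself come from this commit message.

```python
# Hedged sketch of the input policy; argument names/order are assumptions.
import numpy as np
import pytest
import cudaq_qec

H = np.eye(3, dtype=np.uint8)
p = np.full(3, 0.1)

out = cudaq_qec.dem_sampling(H, p, num_shots=10, backend="cpu")
assert isinstance(out, np.ndarray)               # numpy in -> numpy out

torch = pytest.importorskip("torch")             # torch is user-installed only
if torch.cuda.is_available():
    out_gpu = cudaq_qec.dem_sampling(torch.as_tensor(H, device="cuda"),
                                     torch.as_tensor(p, device="cuda"),
                                     num_shots=10, backend="gpu")
    assert out_gpu.is_cuda                       # torch CUDA in -> torch CUDA out

with pytest.raises(RuntimeError):                # torch CPU in -> explicit error
    cudaq_qec.dem_sampling(torch.as_tensor(H), torch.as_tensor(p), num_shots=10)
```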

Signed-off-by: kvmto <kmato@nvidia.com>
…eakage

The --upgrade flag was causing pip to upgrade all listed packages to
their latest versions, breaking two CI workflows:
- all_libs: numpy 1.26.4 -> 2.4.4 broke openfermion (np.string_ removed)
- docs: sphinx 8.x -> 9.1.0 broke sphinx_toolbox (autodoc.logger removed)

Pin sphinx<9 in docs workflow and add proper quoting for version specs.

Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
…list inputs

- Throw RuntimeError immediately when CUDA tensor probability validation
  fails on the GPU path, instead of falling through to numpy conversion
  which crashes on device tensors
- Accept plain Python lists as valid input (numpy auto-converts them)
  and update test accordingly
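
A quick sketch of the list-acceptance behavior (argument names remain assumptions): a plain Python list of probabilities is converted through NumPy rather than rejected.

```python
# Hedged sketch: plain lists are accepted because numpy converts them.
import numpy as np
import cudaq_qec

H = np.array([[1, 0, 1], [0, 1, 1]], dtype=np.uint8)
out = cudaq_qec.dem_sampling(H, [0.01, 0.02, 0.05], num_shots=100, backend="cpu")
print(type(out))  # expected: numpy.ndarray
```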

Signed-off-by: kvmto <kmato@nvidia.com>
contractors.py unconditionally imported torch at module level, causing
pytest collection to fail with ModuleNotFoundError when torch is not
installed. This broke all QEC python tests in CI after torch was removed
from the pip install in commit 509f82c.

- contractors.py: lazy-import torch (from __future__ import annotations +
  TYPE_CHECKING for type hints, move import into einsum_torch body)
- tensor_network_decoder.py: raise RuntimeError with install instructions
  when torch is missing on CPU (matches dem_sampling pattern)
- test_tensor_network_decoder.py: skip 9 decoder-construction tests when
  no torch and no GPU; 21 utility tests run unconditionally. Wheel CI
  (test_wheels.sh) re-runs all tests with torch installed.
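
A condensed sketch of the lazy-import pattern described for contractors.py (the einsum_torch signature here is simplified and may differ from the real one):

```python
# Sketch of the lazy-import pattern described above (simplified signature).
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers, so pytest collection never needs torch.
    import torch


def einsum_torch(expr: str, *operands: torch.Tensor) -> torch.Tensor:
    # Import inside the body so this module imports cleanly without torch.
    import torch
    return torch.einsum(expr, *operands)
```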

Signed-off-by: kvmto <kmato@nvidia.com>
…I conflict

Commit 55369ff removed the explicit torch==2.9.0 CUDA-indexed install
from all_libs.yaml and all_libs_release.yaml while trying to make torch
optional. However, both workflows still install lightning (needed for
solvers GQE tests), which hard-depends on torch. Without the explicit
install, pip pulls the CPU-only torch wheel from PyPI, which bundles an
old libgfortran.so.5 (GCC 7/8, only GFORTRAN_8). libcudaq-solvers.so
is compiled with GCC 11 and requires GFORTRAN_10, so it crashes at
import time when torch's bundled copy shadows the system library.

Fix: pre-install torch==2.9.0 from the CUDA wheel index before the
main pip install in both workflows, so lightning finds torch already
satisfied and never pulls the CPU wheel.

Also fix all_libs_release.yaml: remove --upgrade (causes numpy 2.x /
sphinx 9.x breakage, already fixed in all_libs.yaml by 4ccae9c) and
quote >=version specs to prevent bash redirect misinterpretation.

Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Copy link
Copy Markdown
Collaborator

@bmhowe23 bmhowe23 left a comment

I took a quick look at some of the packaging items and think we should align on them before going through with the full review. Specifically,

  1. I don't think we should modify anything related to the tensor network decoder packaging as part of this PR. (The current changes remove torch from that decoder for some reason.)
  2. I thought we had agreed that the maintenance and required testing
     infrastructure of a new optional feature (dem_sampling) was too high, so
     we preferred to not have the optional dependency, no? Did that turn out
     not to be possible?

kvmto and others added 9 commits April 21, 2026 15:41
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>

# Conflicts:
#	.github/workflows/all_libs.yaml
#	.github/workflows/all_libs_release.yaml
#	.github/workflows/lib_qec.yaml
#	libs/core/include/cuda-qx/core/kwargs_utils.h
#	libs/qec/pyproject.toml.cu12
#	libs/qec/pyproject.toml.cu13
#	libs/qec/python/bindings/py_decoder.cpp
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: Sachin Pisal <spisal@nvidia.com>
@kvmto kvmto requested a review from bmhowe23 April 22, 2026 21:01
Collaborator

@bmhowe23 bmhowe23 left a comment

Thanks, Kevin! My previous packaging issues look to be fully resolved...thank you. I have just a few minor comments below.

I manually kicked off the "Build wheels" job to exercise the wheel-based workflow: https://github.com/NVIDIA/cudaqx/actions/runs/24807169194. This workflow matrixes the tests by CPU/GPU, CUDA versions, and Python versions, so it is a bit more thorough. Will monitor this evening.

Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
@bmhowe23
Collaborator

The "Build wheels" job failed. Hopefully it's ok - I took the liberty of attempting a fix in b9a7270 and 9c5eced. If you object to them, we can remove them.

bmhowe23 and others added 4 commits April 23, 2026 02:38
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
Collaborator

@bmhowe23 bmhowe23 left a comment

Thanks, Kevin! Before merging, please double check the PR summary (and therefore commit message) for accuracy in light of the review updates. In particular, I suspect this part should be removed?

Add CUDAQ_QEC_REQUIRE_CUSTABILIZER enforcement path for builds that must ship GPU support.

@kvmto kvmto enabled auto-merge (squash) April 24, 2026 13:39
@kvmto kvmto merged commit 4c7201b into NVIDIA:main Apr 24, 2026
82 of 98 checks passed
kvmto added a commit that referenced this pull request Apr 27, 2026
## Summary
Depends on the merge of PR #479.

- Add Python GPU test coverage to the QEC GPU CI job (`pr-build-gpu` in
`lib_qec.yaml`)
- The GPU job previously only ran C++ tests via `ctest`. Python tests
with GPU-dependent code paths (cuQuantum tensor network contractor, DEM
sampling GPU backend, TRT decoder inference) were skipped on CPU runners
and never executed on GPU runners.
- Install torch (CUDA wheel) and Python test dependencies on the GPU
runner
- Add a `pytest` step targeting `test_tensor_network_decoder.py`,
`test_dem_sampling.py`, and `test_trt_decoder.py`
- Expand the `ctest` regex to also include `TRTDecoder` C++ tests

## Test plan

- [ ] `pr-build-gpu` job runs successfully on amd64 and arm64
- [ ] "Run GPU Python tests" step shows previously-skipped
cuQuantum/CUDA tests now running
- [ ] "Run GPU C++ tests" step includes TRTDecoder tests (amd64)
- [ ] No regressions in existing CPU test jobs (`pr-build`)

---------

Signed-off-by: kvmto <kmato@nvidia.com>