Add dem_sampling CPU/GPU across C++ and Python #479
Conversation
Introduce dem_sampling implementations for CPU and cuStabilizer-backed GPU paths in C++, and expose them through pybind/Python with torch tensor and device-pointer support. Add C++/Python coverage for backend paths and wire build/packaging checks so cuStabilizer requirements are enforced for shipping.

Signed-off-by: kvmto <kmato@nvidia.com>
- Use cudaMallocAsync/cudaFreeAsync for all GPU temporaries to avoid implicit device synchronization that breaks multi-stream concurrency (critical for PyTorch CUDA stream integration)
- Replace synchronous cudaMemcpy with cudaMemcpyAsync on the caller's stream for the probability D->H copy
- Add grid dimension overflow guards before every CUDA kernel launch
- Handle numShots=0 gracefully in both the C++ CPU path and the Python binding
- Binarize check_matrix with & 1u in the CPU path to match GPU kernel behavior and prevent uint8 dot-product overflow
- Clear sticky CUDA errors (cudaGetLastError) on all failure paths in the Python binding's GPU allocation/copy helpers
- Fix pre-existing test_non_default_cuda_stream assertion that compared torch.device("cuda") against torch.device("cuda", index=0)
- Add 12 new tests covering the zero-shot edge case, non-binary H matrix CPU/GPU parity, and the seedless code path (5 C++, 7 Python)
Signed-off-by: kvmto <kmato@nvidia.com>
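The `& 1u` binarization above can be sketched in plain NumPy (a minimal illustration, not the actual library code; the matrix values are made up):

```python
import numpy as np

# Hypothetical non-binary check matrix: entries 2 and 3 instead of 0/1.
H = np.array([[3, 0, 2],
              [0, 2, 1]], dtype=np.uint8)
errors = np.array([1, 1, 0], dtype=np.uint8)

# Keep only the low bit of each entry, as the GPU kernel does with `& 1u`:
# 3 -> 1, 2 -> 0. This keeps the uint8 accumulator small as well.
H_bin = H & 1

# Mod-2 syndrome from the binarized matrix.
syndrome = (H_bin @ errors) % 2
print(syndrome.tolist())  # [1, 0]
```

Applying the same masking on the CPU path keeps both backends reading only the low bit of each entry, so non-binary inputs cannot diverge between them.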
# Conflicts:
#	.github/actions/build-lib/action.yaml
#	.github/actions/build-lib/build_qec.sh
#	.github/workflows/lib_qec.yaml
#	libs/qec/unittests/CMakeLists.txt
cuStabilizer is now unconditionally required (no optional detection). PyTorch is user-installed only: numpy in = numpy out, torch CUDA in = torch CUDA out, torch CPU in = explicit error, anything else = fail.

CMake / build:
- Replace 85-line cuStabilizer detection with find_package(REQUIRED)
- Remove CUDAQ_QEC_REQUIRE_CUSTABILIZER option and all HAS_CUSTABILIZER conditionals from lib/, python/, unittests/ CMakeLists and pyproject
- Remove require-custabilizer input from action.yaml and dead REQUIRE_CUSTABILIZER env/shell logic from build_qec.sh, build_all.sh
- Remove custabilizer matrix dimension from lib_qec.yaml
- Revert docs.yaml to pre-branch state (no cuStabilizer in docs build)

C++ bindings (py_dem_sampling.cpp):
- Remove torch-to-numpy conversion from asNumpyUint8/asNumpyFloat64
- Add rejectTorchCpuTensors() to block silent numpy conversion
- Add install-torch warning in tryTorchGpuSampling via PyErr_WarnEx
- Remove #ifdef CUDAQ_QEC_HAS_CUSTABILIZER guards throughout
- Remove always-true dem_sampling_has_gpu_compiled attribute
- Remove redundant if(ok) block in tryGpuSampling
- Update pybind docstring for numpy primary / torch CUDA optional

Python layer (dem_sampling.py):
- Add warnings.warn() hint when tensor-like input detected but torch not importable
- Update module and function docstrings

Tests:
- Delete 3 redundant torch CPU tests, flip 2 to expect rejection
- Keep 4 CUDA tests unchanged
- Add test_torch_not_installed (pytest.warns) and test_random_object
- Remove unused dem_sampling_has_gpu_compiled check and qec import
- Remove #ifdef guards from test_dem_sampling.cpp

CI (test_wheels.sh):
- Run QEC tests without torch first, install torch, re-run QEC tests

Signed-off-by: kvmto <kmato@nvidia.com>
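The input-dispatch policy above (numpy in = numpy out, torch CUDA in = torch CUDA out, torch CPU in = explicit error, anything else = fail) can be sketched as follows. This is an illustration only; the function name and return values are assumptions, not the actual binding code:

```python
import numpy as np

def dispatch(probabilities):
    """Sketch of the dem_sampling input policy (names are illustrative)."""
    # Torch is only handled when the user installed it themselves.
    torch_mod = None
    try:
        import torch as torch_mod  # optional, user-installed
    except ImportError:
        pass

    if torch_mod is not None and isinstance(probabilities, torch_mod.Tensor):
        if probabilities.is_cuda:
            return "gpu-torch"  # torch CUDA in -> torch CUDA out
        # torch CPU in -> explicit error, no silent numpy conversion
        raise RuntimeError(
            "CPU torch tensors are rejected; pass a numpy array instead")

    if isinstance(probabilities, np.ndarray):
        return "cpu-numpy"      # numpy in -> numpy out

    # anything else -> fail
    raise TypeError(f"unsupported input type: {type(probabilities).__name__}")
```

The key design point is that no branch falls through to an implicit conversion: every input class either maps to exactly one backend or raises.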
…eakage

The --upgrade flag was causing pip to upgrade all listed packages to their latest versions, breaking two CI workflows:
- all_libs: numpy 1.26.4 -> 2.4.4 broke openfermion (np.string_ removed)
- docs: sphinx 8.x -> 9.1.0 broke sphinx_toolbox (autodoc.logger removed)

Pin sphinx<9 in the docs workflow and add proper quoting for version specs.

Signed-off-by: kvmto <kmato@nvidia.com>
…list inputs

- Throw RuntimeError immediately when CUDA tensor probability validation fails on the GPU path, instead of falling through to numpy conversion, which crashes on device tensors
- Accept plain Python lists as valid input (numpy auto-converts them) and update the test accordingly

Signed-off-by: kvmto <kmato@nvidia.com>
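The list-acceptance change rests on NumPy converting plain Python lists transparently, so no special-casing is needed. A minimal sketch of such a validation helper (the name and checks are illustrative, not the actual binding code):

```python
import numpy as np

def as_probabilities(values):
    # np.asarray accepts plain Python lists as well as existing ndarrays
    # (the latter pass through without a copy when dtypes already match).
    arr = np.asarray(values, dtype=np.float64)
    if arr.ndim != 1:
        raise ValueError("expected a 1-D probability vector")
    if ((arr < 0) | (arr > 1)).any():
        raise ValueError("probabilities must lie in [0, 1]")
    return arr
```

With this shape, `as_probabilities([0.25, 0.5, 0.25])` and `as_probabilities(np.array([0.25, 0.5, 0.25]))` behave identically.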
contractors.py unconditionally imported torch at module level, causing pytest collection to fail with ModuleNotFoundError when torch is not installed. This broke all QEC Python tests in CI after torch was removed from the pip install in commit 509f82c.

- contractors.py: lazy-import torch (from __future__ import annotations + TYPE_CHECKING for type hints; move the import into the einsum_torch body)
- tensor_network_decoder.py: raise RuntimeError with install instructions when torch is missing on CPU (matches the dem_sampling pattern)
- test_tensor_network_decoder.py: skip 9 decoder-construction tests when there is no torch and no GPU; the 21 utility tests run unconditionally. Wheel CI (test_wheels.sh) re-runs all tests with torch installed.

Signed-off-by: kvmto <kmato@nvidia.com>
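The lazy-import pattern described above can be sketched as follows (a simplified illustration under the same constraints, not the actual contractors.py code; the error message is made up):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers, so annotations below do not
    # require torch at import time.
    import torch

def einsum_torch(expr: str, *operands: "torch.Tensor"):
    # Import inside the function body so that merely importing this module
    # (e.g. during pytest collection) never raises ModuleNotFoundError.
    try:
        import torch
    except ImportError as exc:
        raise RuntimeError(
            "this contractor requires torch; install it with "
            "`pip install torch`") from exc
    return torch.einsum(expr, *operands)
```

Module import now always succeeds; the hard dependency surfaces only when the torch-backed code path is actually exercised, and then with an actionable message.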
…I conflict

Commit 55369ff removed the explicit torch==2.9.0 CUDA-indexed install from all_libs.yaml and all_libs_release.yaml while trying to make torch optional. However, both workflows still install lightning (needed for the solvers GQE tests), which hard-depends on torch. Without the explicit install, pip pulls the CPU-only torch wheel from PyPI, which bundles an old libgfortran.so.5 (GCC 7/8, only GFORTRAN_8). libcudaq-solvers.so is compiled with GCC 11 and requires GFORTRAN_10, so it crashes at import time when torch's bundled copy shadows the system library.

Fix: pre-install torch==2.9.0 from the CUDA wheel index before the main pip install in both workflows, so lightning finds torch already satisfied and never pulls the CPU wheel.

Also fix all_libs_release.yaml: remove --upgrade (causes numpy 2.x / sphinx 9.x breakage, already fixed in all_libs.yaml by 4ccae9c) and quote >=version specs to prevent bash redirect misinterpretation.

Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23
left a comment
I took a quick look at some of the packaging items and think we should align on them before going through with the full review. Specifically,
- I don't think we should modify anything related to the tensor network decoder packaging as part of this PR. (The current changes remove torch from that decoder for some reason.)
- I thought we had agreed that the maintenance and required testing infrastructure of a new optional feature (dem_sampling) was too high, so we preferred not to have the optional dependency, no? Did that turn out not to be possible?
Signed-off-by: kvmto <kmato@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>

# Conflicts:
#	.github/workflows/all_libs.yaml
#	.github/workflows/all_libs_release.yaml
#	.github/workflows/lib_qec.yaml
#	libs/core/include/cuda-qx/core/kwargs_utils.h
#	libs/qec/pyproject.toml.cu12
#	libs/qec/pyproject.toml.cu13
#	libs/qec/python/bindings/py_decoder.cpp
Signed-off-by: Sachin Pisal <spisal@nvidia.com>
bmhowe23
left a comment
Thanks, Kevin! My previous packaging issues look to be fully resolved...thank you. I have just a few minor comments below.
I manually kicked off the "Build wheels" job to exercise the wheel-based workflow: https://github.com/NVIDIA/cudaqx/actions/runs/24807169194. This workflow matrixes the tests by CPU/GPU, CUDA versions, and Python versions, so it is a bit more thorough. Will monitor this evening.
Signed-off-by: Ben Howe <bhowe@nvidia.com>
The "Build wheels" job failed. Hopefully it's ok - I took the liberty of attempting a fix in b9a7270 and 9c5eced. If you object to them, we can remove them.
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23
left a comment
Thanks, Kevin! Before merging, please double check the PR summary (and therefore commit message) for accuracy in light of the review updates. In particular, I suspect this part should be removed?
Add CUDAQ_QEC_REQUIRE_CUSTABILIZER enforcement path for builds that must ship GPU support.
## Summary

Dependent on the merge of PR #479.

- Add Python GPU test coverage to the QEC GPU CI job (`pr-build-gpu` in `lib_qec.yaml`)
  - The GPU job previously only ran C++ tests via `ctest`. Python tests with GPU-dependent code paths (cuQuantum tensor network contractor, DEM sampling GPU backend, TRT decoder inference) were skipped on CPU runners and never executed on GPU runners.
- Install torch (CUDA wheel) and Python test dependencies on the GPU runner
- Add a `pytest` step targeting `test_tensor_network_decoder.py`, `test_dem_sampling.py`, and `test_trt_decoder.py`
- Expand the `ctest` regex to also include `TRTDecoder` C++ tests

## Test plan

- [ ] `pr-build-gpu` job runs successfully on amd64 and arm64
- [ ] "Run GPU Python tests" step shows previously-skipped cuQuantum/CUDA tests now running
- [ ] "Run GPU C++ tests" step includes TRTDecoder tests (amd64)
- [ ] No regressions in existing CPU test jobs (`pr-build`)

---------

Signed-off-by: kvmto <kmato@nvidia.com>
Summary

- Add `dem_sampling` (C++) with CPU and cuStabilizer-backed GPU paths, and expose it through nanobind and Python as `cudaq_qec.dem_sampling` with `backend="auto" | "cpu" | "gpu"` and NumPy / PyTorch tensor support (incl. CUDA device pointers on the GPU path).
- Add tests (`DemSamplingCPU`, `DemSamplingGPU`) and `libs/qec/python/tests/test_dem_sampling.py` covering CPU/GPU and NumPy/PyTorch inputs.

Build / packaging

- `FindcuStabilizer.cmake` auto-discovers the library via `CUSTABILIZER_ROOT`, `CUQUANTUM_ROOT`, or the active Python environment's `cuquantum-python-cuXX` wheel. If missing, CMake errors with a pip install hint.
- `cuquantum-python-cuXX>=26.03.0` is promoted to a core runtime dep of `cudaq-qec` (it ships cuStabilizer); `torch` stays optional under `tensor_network_decoder`/`all`.
- …cuquantum wheel, so `libcustabilizer.so.0` is found without `LD_LIBRARY_PATH`; CI drops the previous `CUSTABILIZER_ROOT`/`LD_LIBRARY_PATH` plumbing.

Test plan

- `DemSamplingCPU`/`DemSamplingGPU` tests in CI.
- `test_dem_sampling.py` in CI with CUDA-enabled torch.