
Use pinned host memory for GPU state-to-NumPy conversion (#2797) #4290

Open
mitchdz wants to merge 6 commits into NVIDIA:main from mitchdz:fix/pinned-memory-numpy-conversion

Conversation

@mitchdz
Collaborator

@mitchdz mitchdz commented Apr 9, 2026

Summary

  • Fix performance bottleneck in np.array(cudaq.get_state(kernel)) for GPU-backed
    states by using pinned (cudaMallocHost) host memory instead of pageable (new[])
    for the device-to-host transfer in the pybind11 buffer protocol handler
  • Add SimulationState::toHostBuffer() virtual method so GPU backends can provide
    optimized host allocation, with a pageable fallback for non-GPU backends and for
    systems that cannot pin the requested amount of memory
  • Fix to_cupy() passing a hardcoded 1024-byte size to UnownedMemory instead of
    the actual buffer size

Motivation

Closes #2797. When np.array() is called on a GPU-resident state, the buffer
protocol handler allocates pageable host memory and copies via cudaMemcpy. With
pageable destinations, CUDA must stage through an internal pinned buffer in chunks,
severely limiting effective bandwidth. On an L4 GPU with 25 qubits (537 MB state):

  ┌───────────────────┬───────────┬────────┐
  │ Allocation        │ Bandwidth │ Time   │
  ├───────────────────┼───────────┼────────┤
  │ Pageable (before) │  1.8 GB/s │ 0.29 s │
  │ Pinned (after)    │ 13.2 GB/s │ 0.04 s │
  └───────────────────┴───────────┴────────┘

For the original reporter's 32-qubit case on GH200, the ~5.4s overhead should drop
to under 1s.
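As a sanity check on those numbers (assuming the usual complex128 amplitudes, i.e. 16 bytes per amplitude), the state size and transfer times follow directly from size / bandwidth:

```python
# Sanity-check the Motivation numbers: an n-qubit state vector holds
# 2**n complex128 amplitudes (16 bytes each), and a device-to-host copy
# takes roughly size / effective_bandwidth.

def state_bytes(num_qubits: int) -> int:
    """Size of a complex128 state vector in bytes."""
    return (2 ** num_qubits) * 16

def transfer_seconds(num_bytes: int, gb_per_s: float) -> float:
    """Copy time at a given effective bandwidth (GB/s)."""
    return num_bytes / (gb_per_s * 1e9)

size = state_bytes(25)                     # 536_870_912 bytes, i.e. ~537 MB
t_pageable = transfer_seconds(size, 1.8)   # pageable path, ~0.30 s
t_pinned = transfer_seconds(size, 13.2)    # pinned path, ~0.04 s
print(f"{size / 1e6:.0f} MB, pageable {t_pageable:.2f}s, pinned {t_pinned:.2f}s")
```

These match the measured L4 numbers above to within rounding, so the benchmark is bandwidth-bound as claimed.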

mitchdz added 2 commits April 8, 2026 18:53
- Fix performance bottleneck in `np.array(cudaq.get_state(kernel))` for
  GPU-backed states by using pinned (`cudaMallocHost`) host memory instead
  of pageable (`new[]`) for the device-to-host transfer in the pybind11
  buffer protocol handler
- Add `SimulationState::toHostBuffer()` virtual method so GPU backends can
  provide optimized host allocation, with a pageable fallback for non-GPU
  backends and for systems that cannot pin the requested amount of memory
- Fix `to_cupy()` passing a hardcoded 1024-byte size to `UnownedMemory`
  instead of the actual buffer size

Signed-off-by: mdzurick <mitch_dz@hotmail.com>
@mitchdz
Collaborator Author

mitchdz commented Apr 9, 2026

Re-opening; the last commit accidentally closed this PR (my bad).

Also, I am intentionally holding off on merging so I can do more thorough testing at larger qubit counts. My concern is that the new pinned alloc/dealloc may itself introduce more overhead.
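A minimal timing-harness sketch for that comparison; the workload here is a plain NumPy copy standing in for the real `np.array(cudaq.get_state(kernel))` conversion, which needs a GPU backend to exercise:

```python
import statistics
import time

import numpy as np

def bench(fn, reps: int = 5) -> float:
    """Median wall-clock time of fn() over reps runs, in milliseconds."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Placeholder workload: a 2**20-amplitude host-side copy. On a real GPU
# run you would time the actual state-to-NumPy conversion instead.
state = np.zeros(2 ** 20, dtype=np.complex128)
ms = bench(lambda: np.array(state, copy=True))
print(f"median copy time: {ms:.2f} ms")
```

Using the median over several reps smooths out first-touch allocation noise, which matters when the thing under test is itself an alloc/dealloc cost.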

github-actions Bot pushed a commit that referenced this pull request Apr 9, 2026
@github-actions

github-actions Bot commented Apr 9, 2026

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

github-actions Bot pushed a commit that referenced this pull request Apr 10, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

@mitchdz
Collaborator Author

mitchdz commented Apr 13, 2026

I am blocking this for now. Testing locally on an L4, I do see a regression at higher qubit counts. Below is a comparison between main and the new pinned-memory logic.

  ┌────────┬────────────┬───────────┬─────────────┬─────────┐
  │ Qubits │ State size │ main (ms) │ pinned (ms) │ Speedup │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     20 │     16 MiB │      8.68 │        6.61 │   1.31x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     21 │     32 MiB │     17.51 │       11.24 │   1.56x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     22 │     64 MiB │     34.06 │       52.25 │   0.65x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     23 │    128 MiB │     67.58 │      102.57 │   0.66x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     24 │    256 MiB │    137.60 │      208.39 │   0.66x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     25 │    512 MiB │    282.67 │      407.51 │   0.69x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     26 │      1 GiB │    595.95 │      872.12 │   0.68x │
  └────────┴────────────┴───────────┴─────────────┴─────────┘
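Recomputing the speedup column from those times (main / pinned) confirms the crossover between 32 MiB and 64 MiB, consistent with the alloc/dealloc concern above, since page-locking cost grows with allocation size:

```python
# Times (ms) copied from the table above: qubits -> (main, pinned).
results = {
    20: (8.68, 6.61),
    21: (17.51, 11.24),
    22: (34.06, 52.25),
    23: (67.58, 102.57),
    24: (137.60, 208.39),
    25: (282.67, 407.51),
    26: (595.95, 872.12),
}

for qubits, (main_ms, pinned_ms) in results.items():
    speedup = main_ms / pinned_ms
    size_mib = (2 ** qubits) * 16 / 2 ** 20  # complex128 state size
    print(f"{qubits} qubits ({size_mib:.0f} MiB): {speedup:.2f}x")
# Pinned wins below 64 MiB, then settles around a ~0.65-0.7x slowdown.
```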

Signed-off-by: mdzurick <mitch_dz@hotmail.com>
github-actions Bot pushed a commit that referenced this pull request Apr 13, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


Development

Successfully merging this pull request may close these issues.

Potential performance issue when doing np.array on statevector representation

1 participant