
Use pinned host memory for GPU state-to-NumPy conversion (#2797) #4290

Open
mitchdz wants to merge 6 commits into NVIDIA:main from mitchdz:fix/pinned-memory-numpy-conversion

Conversation

@mitchdz
Collaborator

@mitchdz mitchdz commented Apr 9, 2026

Summary

  • Fix performance bottleneck in np.array(cudaq.get_state(kernel)) for GPU-backed
    states by using pinned (cudaMallocHost) host memory instead of pageable (new[])
    for the device-to-host transfer in the pybind11 buffer protocol handler
  • Add SimulationState::toHostBuffer() virtual method so GPU backends can provide
    optimized host allocation, with a pageable fallback for non-GPU backends and for
    systems that cannot pin the requested amount of memory
  • Fix to_cupy() passing a hardcoded 1024-byte size to UnownedMemory instead of
    the actual buffer size

Motivation

Closes #2797. When np.array() is called on a GPU-resident state, the buffer
protocol handler allocates pageable host memory and copies via cudaMemcpy. With
pageable destinations, CUDA must stage through an internal pinned buffer in chunks,
severely limiting effective bandwidth. On an L4 GPU with 25 qubits (537 MB state):

  ┌───────────────────┬───────────┬────────┐
  │ Allocation        │ Bandwidth │ Time   │
  ├───────────────────┼───────────┼────────┤
  │ Pageable (before) │  1.8 GB/s │ 0.29 s │
  │ Pinned (after)    │ 13.2 GB/s │ 0.04 s │
  └───────────────────┴───────────┴────────┘

For the original reporter's 32-qubit case on GH200, the ~5.4s overhead should drop
to under 1s.
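As a sanity check on those numbers (assuming the usual complex128 amplitudes, i.e. 16 bytes per amplitude), the state size and transfer times follow directly from size / bandwidth:

```python
# Sanity-check the Motivation numbers: an n-qubit state vector holds
# 2**n complex128 amplitudes (16 bytes each), and a device-to-host copy
# takes roughly size / effective_bandwidth.

def state_bytes(num_qubits: int) -> int:
    """Size of a complex128 state vector in bytes."""
    return (2 ** num_qubits) * 16

def transfer_seconds(num_bytes: int, gb_per_s: float) -> float:
    """Copy time at a given effective bandwidth (GB/s)."""
    return num_bytes / (gb_per_s * 1e9)

size = state_bytes(25)                     # 536_870_912 bytes, i.e. ~537 MB
t_pageable = transfer_seconds(size, 1.8)   # pageable path, ~0.30 s
t_pinned = transfer_seconds(size, 13.2)    # pinned path, ~0.04 s
print(f"{size / 1e6:.0f} MB, pageable {t_pageable:.2f}s, pinned {t_pinned:.2f}s")
```

These match the measured L4 numbers above to within rounding, so the benchmark is bandwidth-bound as claimed.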

mitchdz added 2 commits April 8, 2026 18:53
- Fix performance bottleneck in `np.array(cudaq.get_state(kernel))` for
  GPU-backed states by using pinned (`cudaMallocHost`) host memory instead
  of pageable (`new[]`) for the device-to-host transfer in the pybind11
  buffer protocol handler
- Add `SimulationState::toHostBuffer()` virtual method so GPU backends can
  provide optimized host allocation, with a pageable fallback for non-GPU
  backends and for systems that cannot pin the requested amount of memory
- Fix `to_cupy()` passing a hardcoded 1024-byte size to `UnownedMemory`
  instead of the actual buffer size

Signed-off-by: mdzurick <mitch_dz@hotmail.com>
@mitchdz
Collaborator Author

mitchdz commented Apr 9, 2026

Re-opening; the last commit accidentally closed this PR (my bad).

Also, I am intentionally holding off on merging so I can do more thorough testing at larger qubit counts. My concern is that the new pinned alloc/dealloc may itself introduce more overhead.
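A minimal timing-harness sketch for that comparison; the workload here is a plain NumPy copy standing in for the real `np.array(cudaq.get_state(kernel))` conversion, which needs a GPU backend to exercise:

```python
import statistics
import time

import numpy as np

def bench(fn, reps: int = 5) -> float:
    """Median wall-clock time of fn() over reps runs, in milliseconds."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Placeholder workload: a 2**20-amplitude host-side copy. On a real GPU
# run you would time the actual state-to-NumPy conversion instead.
state = np.zeros(2 ** 20, dtype=np.complex128)
ms = bench(lambda: np.array(state, copy=True))
print(f"median copy time: {ms:.2f} ms")
```

Using the median over several reps smooths out first-touch allocation noise, which matters when the thing under test is itself an alloc/dealloc cost.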

github-actions Bot pushed a commit that referenced this pull request Apr 9, 2026
@github-actions

github-actions Bot commented Apr 9, 2026

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

github-actions Bot pushed a commit that referenced this pull request Apr 10, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

@mitchdz
Collaborator Author

mitchdz commented Apr 13, 2026

I am blocking this for now. Testing locally on an L4, I do see a regression at higher qubit counts. Below is a comparison between main and the new pinned-memory logic.

  ┌────────┬────────────┬───────────┬─────────────┬─────────┐
  │ Qubits │ State size │ main (ms) │ pinned (ms) │ Speedup │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     20 │     16 MiB │      8.68 │        6.61 │   1.31x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     21 │     32 MiB │     17.51 │       11.24 │   1.56x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     22 │     64 MiB │     34.06 │       52.25 │   0.65x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     23 │    128 MiB │     67.58 │      102.57 │   0.66x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     24 │    256 MiB │    137.60 │      208.39 │   0.66x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     25 │    512 MiB │    282.67 │      407.51 │   0.69x │
  ├────────┼────────────┼───────────┼─────────────┼─────────┤
  │     26 │      1 GiB │    595.95 │      872.12 │   0.68x │
  └────────┴────────────┴───────────┴─────────────┴─────────┘
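Recomputing the speedup column from those times (main / pinned) confirms the crossover between 32 MiB and 64 MiB, consistent with the alloc/dealloc concern above, since page-locking cost grows with allocation size:

```python
# Times (ms) copied from the table above: qubits -> (main, pinned).
results = {
    20: (8.68, 6.61),
    21: (17.51, 11.24),
    22: (34.06, 52.25),
    23: (67.58, 102.57),
    24: (137.60, 208.39),
    25: (282.67, 407.51),
    26: (595.95, 872.12),
}

for qubits, (main_ms, pinned_ms) in results.items():
    speedup = main_ms / pinned_ms
    size_mib = (2 ** qubits) * 16 / 2 ** 20  # complex128 state size
    print(f"{qubits} qubits ({size_mib:.0f} MiB): {speedup:.2f}x")
# Pinned wins below 64 MiB, then settles around a ~0.65-0.7x slowdown.
```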

Signed-off-by: mdzurick <mitch_dz@hotmail.com>
github-actions Bot pushed a commit that referenced this pull request Apr 13, 2026
@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.


Development

Successfully merging this pull request may close these issues.

Potential performance issue when doing np.array on statevector representation

1 participant