
Preserve memory pool CUDA errors and harden OOM tests #2084

Open
rwgk wants to merge 5 commits into NVIDIA:main from rwgk:nvbugs5815123_fix_masking_oom_as_invalid_value

Conversation


rwgk commented May 14, 2026

This PR has two high-level aspects:

  1. Preserve the original CUDA errors from memory pool handle helpers.
  2. Add targeted xfails based on #2084 (comment).

1. Preserve Memory Pool CUDA Errors

This PR fixes several cuda.core memory resource paths so they preserve the original CUDA error returned by the underlying handle helper.

The C++ handle helpers return an empty handle on failure and store the CUDA result in thread-local error state. Some memory pool call sites were not immediately consuming that stored error before continuing, which could let a later operation on an empty handle report a different error. Other call sites detected the empty handle, but replaced the original CUDA result with a generic RuntimeError.

This change updates the memory pool call sites to immediately call get_last_error() and HANDLE_RETURN(...) when an empty handle is returned.
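
In rough Python-flavored pseudocode, the pattern at each call site now looks like the sketch below. This is illustrative only: the real call sites are C++/Cython, and create_mempool_handle, get_last_error, and the plain RuntimeError here are stand-ins for the actual handle-layer helpers and the HANDLE_RETURN(...) macro.

CUDA_SUCCESS = 0
_last_error = CUDA_SUCCESS  # stand-in for the thread-local error state

def get_last_error():
    # Consume and reset the stored CUDA result, as the handle layer does.
    global _last_error
    err, _last_error = _last_error, CUDA_SUCCESS
    return err

def checked_create(create_mempool_handle):
    handle = create_mempool_handle()  # an empty (falsy) handle signals failure
    if not handle:
        err = get_last_error()
        if err != CUDA_SUCCESS:
            # HANDLE_RETURN(...) analogue: report the original CUDA error.
            raise RuntimeError(f"CUDA error {err} from the handle helper")
        # Unexpected case: empty handle, but no CUDA error was recorded.
        raise RuntimeError("empty handle returned but no CUDA error was recorded")
    return handle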

When a CUDA memory pool operation fails, users and tests should see the CUDA error that caused the failure. Reporting a later follow-up error can be misleading because it describes the consequence of an empty handle, not the operation that originally failed.

Preserving the original CUDA result makes failures easier to diagnose and keeps cuda.core behavior consistent with the handle-layer contract used elsewhere in the package.

The updated paths cover:

  • default device memory pool acquisition,
  • memory pool creation,
  • memory pool import from an IPC handle,
  • allocation from a memory pool, and
  • asynchronous graph memory allocation.

The fallback RuntimeError messages now only describe the unexpected case where an empty handle is returned but no CUDA error was recorded.

Relevant commits:

  • commit 4255902 — preserve memory pool CUDA errors
  • commit 52879fd — skip unsupported managed pool warnings

2. Add Targeted Mempool OOM Xfails

The error-preservation change makes additional existing mempool OOM surfaces visible as their original CUDA errors. Based on #2084 (comment), this PR adds targeted xfails so those known platform-specific failures report as expected failures instead of direct assertion failures or large cascades.

The added xfails cover:

  • cuda_bindings default mempool interoperability queries that can report CUDA_ERROR_OUT_OF_MEMORY, and
  • cuda_core graph allocation paths rooted in graph memory allocation nodes.

The cuda_core xfails are applied at shared graph-construction points where possible, rather than repeating markers across each parametrized test case.

Relevant commits:

  • commit 93e86ec — xfail default mempool interop OOM
  • commit 52a834f — xfail graph mempool allocation OOM


copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the cuda.core label May 14, 2026
rwgk self-assigned this May 14, 2026
rwgk added the P0 (High priority - Must do!) label May 14, 2026
rwgk added this to the cuda.core next milestone May 14, 2026

rwgk commented May 14, 2026

/ok to test


rwgk commented May 14, 2026

Analysis of CI failures (Cursor GPT-5.5 1M High):

All checked failed jobs have the same two failing tests and the same exception: the explicit ManagedMemoryResource(options=...) path now raises CUDA_ERROR_NOT_SUPPORTED.

The failures are consistent and look like expected fallout from PR 2084 exposing a previously masked CUDA error.

All inspected failed jobs fail the same two tests:

  • tests/test_managed_memory_warning.py::test_warning_emitted
  • tests/test_managed_memory_warning.py::test_warning_emitted_only_once

The exception is the same everywhere:

CUDA_ERROR_NOT_SUPPORTED

It happens when the test calls:

ManagedMemoryResource(options=ManagedMemoryResourceOptions(preferred_location=device_id))

That explicit managed-pool creation goes through MP_init_create_pool(...) / create_mempool_handle(...). Before PR 2084, an empty handle from that helper could slip through, so these warning tests could pass even though the managed pool was not actually created. With PR 2084, the original CUDA error is now correctly surfaced.

This is not limited to the original Windows MCDM concern. I saw the same failure signature in:

  • Windows WDDM py3.11 jobs
  • Windows MCDM py3.13/py3.14/py3.14t jobs
  • Linux WSL job

So the issue is broader: on devices where concurrent_managed_access is false, the warning tests assume an explicit managed pool can still be created. CI shows that assumption is not valid on several configurations.

Recommended fix: update test_managed_memory_warning.py so the two warning tests skip when explicit managed pool creation returns CUDA_ERROR_NOT_SUPPORTED. That should be a skip, not an xfail, because in these CI jobs the operation reports unsupported rather than an unexpected OOM or platform flake.
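
A minimal sketch of that skip, assuming the names used by the failing tests above (the import path and the exact exception type raised by cuda.core are assumptions; matching on the error name keeps the sketch independent of how the error code is exposed):

import pytest
from cuda.core.experimental import ManagedMemoryResource, ManagedMemoryResourceOptions  # path assumed

def make_explicit_managed_pool(device_id):
    # Hypothetical shared test helper: skip when the platform cannot create
    # an explicit managed pool, re-raise any other error unchanged.
    try:
        return ManagedMemoryResource(options=ManagedMemoryResourceOptions(preferred_location=device_id))
    except Exception as exc:
        if "CUDA_ERROR_NOT_SUPPORTED" in str(exc):
            pytest.skip("explicit managed pool creation not supported on this platform")
        raise

Both warning tests would then construct their pool through this helper instead of calling ManagedMemoryResource directly.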

The PR 2084 code change itself is doing what it was supposed to do: preserving the original CUDA error instead of masking it. The tests need to stop relying on the old masked behavior.

Skip managed memory warning tests when explicit managed pool creation reports unsupported, now that cuda.core preserves the underlying CUDA error.

xref: NVIDIA#2084 (comment)

rwgk commented May 14, 2026

Analysis of 2026-05-14 QA Logs

Based on logs provided by the QA team on 2026-05-14 (thanks @rluo8!), there are two additional mempool-related OOM surfaces worth handling with targeted xfails.

1. cuda_bindings: one new xfail point

The cuda-bindings run shows one failure:

tests/test_interoperability.py::test_interop_memPool

The failing call is the driver-side default mempool query:

err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS

The observed result is CUDA_ERROR_OUT_OF_MEMORY, with pool returned as None. This matches the same class of mempool OOM behavior already handled elsewhere, but this interoperability test currently asserts success directly before giving the shared xfail helper a chance to classify the failure.

Recommended insertion point:

err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
xfail_if_mempool_oom(err_dr, "cuDeviceGetDefaultMemPool", 0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS

It would also be reasonable to apply the same pattern to the runtime-side cudaDeviceGetDefaultMemPool(0) call in the same test, in case that side reports cudaErrorMemoryAllocation under the same platform condition.
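
A sketch of that runtime-side guard, mirroring the driver-side insertion point above (this assumes the shared helper also classifies runtime error codes, which would need checking):

from cuda.bindings import runtime as cudart

err_rt, pool_rt = cudart.cudaDeviceGetDefaultMemPool(0)
# xfail_if_mempool_oom is the existing shared test helper referenced above.
# Assumption: it recognizes runtime-side cudaErrorMemoryAllocation the same
# way it recognizes driver-side CUDA_ERROR_OUT_OF_MEMORY.
xfail_if_mempool_oom(err_rt, "cudaDeviceGetDefaultMemPool", 0)
assert err_rt == cudart.cudaError_t.cudaSuccess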

2. cuda_core: graph allocation cascade

The cuda-core run shows a large cascade of failures and setup errors rooted in graph allocation nodes. The common failing path is:

GraphDefinition.allocate()
GraphNode.allocate()
GN_alloc()
cuGraphAddMemAllocNode(...)
CUDA_ERROR_OUT_OF_MEMORY

Many later failures are downstream effects of this same root condition. The affected areas include graph definition topology tests, graph definition integration tests, graph memory resource tests, and object protocol tests that construct sample graph allocation nodes.

The xfail should be inserted at shared graph-construction points rather than repeated across each parametrized test case. Good candidates are:

  • cuda_core/tests/graph/test_graph_definition.py: wrap graph_spec, nonempty_graph_spec, and node_spec fixture construction.
  • cuda_core/tests/test_object_protocols.py: wrap graph-node sample fixtures that call GraphDefinition.allocate() directly or indirectly, such as sample_empty_node, sample_alloc_node, sample_free_node, sample_memset_node, and sample_memcpy_node.
  • cuda_core/tests/graph/test_graph_definition_integration.py: wrap the direct end-to-end graph-building bodies that allocate graph memory nodes.
  • cuda_core/tests/graph/test_graph_memory_resource.py: wrap graph-memory allocation paths using GraphMemoryResource.allocate(...).

A small cuda-core test helper or context manager would keep the fix localized and consistent. The helper should catch CUDAError, call the existing shared xfail_if_mempool_oom(...), and re-raise if the error is not the expected Windows MCDM mempool OOM condition.
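
A minimal sketch of such a context manager (the CUDAError import path is an assumption, and the real helper would delegate classification to the existing shared xfail_if_mempool_oom rather than matching on the error name):

import contextlib

import pytest
from cuda.core.experimental._utils.cuda_utils import CUDAError  # path assumed

@contextlib.contextmanager
def xfail_on_mempool_oom(what):
    # Hypothetical helper: report the known mempool OOM as an expected
    # failure; re-raise any other CUDA error unchanged.
    try:
        yield
    except CUDAError as exc:
        if "CUDA_ERROR_OUT_OF_MEMORY" in str(exc):
            pytest.xfail(f"known platform mempool OOM during {what}")
        raise

A fixture body could then be wrapped as with xfail_on_mempool_oom("graph_spec"): ... so the classification lives in one place instead of being repeated across each parametrized test case.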

Handle default mempool query OOMs with the shared mempool xfail helper so the interop test reports the known platform condition instead of a direct assertion failure.

xref: NVIDIA#2084 (comment) section 1
github-actions bot added the cuda.bindings label May 14, 2026
rwgk added 2 commits May 14, 2026 09:27
Handle graph allocation OOMs with the shared mempool xfail helper so graph tests report the known platform condition instead of cascading failures.

xref: NVIDIA#2084 (comment) section 2

rwgk commented May 14, 2026

/ok to test

rwgk changed the title from "cuda.core: preserve CUDA errors from memory pool handle helpers" to "Preserve memory pool CUDA errors and harden OOM tests" May 14, 2026
rwgk marked this pull request as ready for review May 14, 2026 16:51
rwgk requested a review from Andy-Jost May 14, 2026 16:51