
Preserve memory pool CUDA errors and harden OOM tests #2084

Open
rwgk wants to merge 5 commits into NVIDIA:main from rwgk:nvbugs5815123_fix_masking_oom_as_invalid_value

Conversation


rwgk commented May 14, 2026

This PR has two high-level aspects:

  1. Preserve the original CUDA errors from memory pool handle helpers.
  2. Add targeted xfails based on #2084 (comment).

1. Preserve Memory Pool CUDA Errors

This PR fixes several cuda.core memory resource paths so they preserve the original CUDA error returned by the underlying handle helper.

The C++ handle helpers return an empty handle on failure and store the CUDA result in thread-local error state. Some memory pool call sites were not immediately consuming that stored error before continuing, which could let a later operation on an empty handle report a different error. Other call sites detected the empty handle, but replaced the original CUDA result with a generic RuntimeError.

This change updates the memory pool call sites to immediately call get_last_error() and HANDLE_RETURN(...) when an empty handle is returned.
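
In rough Python-flavored pseudocode, the pattern at each call site now looks like the sketch below. This is illustrative only: the real call sites are C++/Cython, and create_mempool_handle, get_last_error, and the plain RuntimeError here are stand-ins for the actual handle-layer helpers and the HANDLE_RETURN(...) macro.

CUDA_SUCCESS = 0
_last_error = CUDA_SUCCESS  # stand-in for the thread-local error state

def get_last_error():
    # Consume and reset the stored CUDA result, as the handle layer does.
    global _last_error
    err, _last_error = _last_error, CUDA_SUCCESS
    return err

def checked_create(create_mempool_handle):
    handle = create_mempool_handle()  # an empty (falsy) handle signals failure
    if not handle:
        err = get_last_error()
        if err != CUDA_SUCCESS:
            # HANDLE_RETURN(...) analogue: report the original CUDA error.
            raise RuntimeError(f"CUDA error {err} from the handle helper")
        # Unexpected case: empty handle, but no CUDA error was recorded.
        raise RuntimeError("empty handle returned but no CUDA error was recorded")
    return handle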

When a CUDA memory pool operation fails, users and tests should see the CUDA error that caused the failure. Reporting a later follow-up error can be misleading because it describes the consequence of an empty handle, not the operation that originally failed.

Preserving the original CUDA result makes failures easier to diagnose and keeps cuda.core behavior consistent with the handle-layer contract used elsewhere in the package.

The updated paths cover:

  • default device memory pool acquisition,
  • memory pool creation,
  • memory pool import from an IPC handle,
  • allocation from a memory pool, and
  • asynchronous graph memory allocation.

The fallback RuntimeError messages now only describe the unexpected case where an empty handle is returned but no CUDA error was recorded.

Relevant commits:

  • commit 4255902 — preserve memory pool CUDA errors
  • commit 52879fd — skip unsupported managed pool warnings

2. Add Targeted Mempool OOM Xfails

The error-preservation change makes additional existing mempool OOM surfaces visible as their original CUDA errors. Based on #2084 (comment), this PR adds targeted xfails so those known platform-specific failures report as expected failures instead of direct assertion failures or large cascades.

The added xfails cover:

  • cuda_bindings default mempool interoperability queries that can report CUDA_ERROR_OUT_OF_MEMORY, and
  • cuda_core graph allocation paths rooted in graph memory allocation nodes.

The cuda_core xfails are applied at shared graph-construction points where possible, rather than repeating markers across each parametrized test case.

Relevant commits:

  • commit 93e86ec — xfail default mempool interop OOM
  • commit 52a834f — xfail graph mempool allocation OOM


copy-pr-bot Bot commented May 14, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions bot added the cuda.core label May 14, 2026
rwgk self-assigned this May 14, 2026
rwgk added the P0 (High priority - Must do!) label May 14, 2026
rwgk added this to the cuda.core next milestone May 14, 2026

rwgk commented May 14, 2026

/ok to test


rwgk commented May 14, 2026

Analysis of CI failures (Cursor GPT-5.5 1M High):

All checked failed jobs have the same two failing tests and the same exception: the explicit ManagedMemoryResource(options=...) path now raises CUDA_ERROR_NOT_SUPPORTED.

The failures are consistent and look like expected fallout from PR 2084 exposing a previously masked CUDA error.

All inspected failed jobs fail the same two tests:

  • tests/test_managed_memory_warning.py::test_warning_emitted
  • tests/test_managed_memory_warning.py::test_warning_emitted_only_once

The exception is the same everywhere:

CUDA_ERROR_NOT_SUPPORTED

It happens when the test calls:

ManagedMemoryResource(options=ManagedMemoryResourceOptions(preferred_location=device_id))

That explicit managed-pool creation goes through MP_init_create_pool(...) / create_mempool_handle(...). Before PR 2084, an empty handle from that helper could slip through, so these warning tests could pass even though the managed pool was not actually created. With PR 2084, the original CUDA error is now correctly surfaced.

This is not limited to the original Windows MCDM concern. I saw the same failure signature in:

  • Windows WDDM py3.11 jobs
  • Windows MCDM py3.13/py3.14/py3.14t jobs
  • Linux WSL job

So the issue is broader: on devices where concurrent_managed_access is false, the warning tests assume an explicit managed pool can still be created. CI shows that assumption is not valid on several configurations.

Recommended fix: update test_managed_memory_warning.py so the two warning tests skip when explicit managed pool creation returns CUDA_ERROR_NOT_SUPPORTED. That should be a skip, not an xfail, because in these CI jobs the operation reports unsupported rather than an unexpected OOM or platform flake.
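
A minimal sketch of that skip, assuming the names used by the failing tests above (the import path and the exact exception type raised by cuda.core are assumptions; matching on the error name keeps the sketch independent of how the error code is exposed):

import pytest
from cuda.core.experimental import ManagedMemoryResource, ManagedMemoryResourceOptions  # path assumed

def make_explicit_managed_pool(device_id):
    # Hypothetical shared test helper: skip when the platform cannot create
    # an explicit managed pool, re-raise any other error unchanged.
    try:
        return ManagedMemoryResource(options=ManagedMemoryResourceOptions(preferred_location=device_id))
    except Exception as exc:
        if "CUDA_ERROR_NOT_SUPPORTED" in str(exc):
            pytest.skip("explicit managed pool creation not supported on this platform")
        raise

Both warning tests would then construct their pool through this helper instead of calling ManagedMemoryResource directly.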

The PR 2084 code change itself is doing what it was supposed to do: preserving the original CUDA error instead of masking it. The tests need to stop relying on the old masked behavior.

Skip managed memory warning tests when explicit managed pool creation reports unsupported, now that cuda.core preserves the underlying CUDA error.

xref: NVIDIA#2084 (comment)

rwgk commented May 14, 2026

Analysis of 2026-05-14 QA Logs

Based on logs provided by the QA team on 2026-05-14 (thanks @rluo8!), there are two additional mempool-related OOM surfaces worth handling with targeted xfails.

1. cuda_bindings: one new xfail point

The cuda-bindings run shows one failure:

tests/test_interoperability.py::test_interop_memPool

The failing call is the driver-side default mempool query:

err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS

The observed result is CUDA_ERROR_OUT_OF_MEMORY, with pool returned as None. This matches the same class of mempool OOM behavior already handled elsewhere, but this interoperability test currently asserts success directly before giving the shared xfail helper a chance to classify the failure.

Recommended insertion point:

err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
xfail_if_mempool_oom(err_dr, "cuDeviceGetDefaultMemPool", 0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS

It would also be reasonable to apply the same pattern to the runtime-side cudaDeviceGetDefaultMemPool(0) call in the same test, in case that side reports cudaErrorMemoryAllocation under the same platform condition.
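
A sketch of that runtime-side guard, mirroring the driver-side insertion point above (this assumes the shared helper also classifies runtime error codes, which would need checking):

from cuda.bindings import runtime as cudart

err_rt, pool_rt = cudart.cudaDeviceGetDefaultMemPool(0)
# xfail_if_mempool_oom is the existing shared test helper referenced above.
# Assumption: it recognizes runtime-side cudaErrorMemoryAllocation the same
# way it recognizes driver-side CUDA_ERROR_OUT_OF_MEMORY.
xfail_if_mempool_oom(err_rt, "cudaDeviceGetDefaultMemPool", 0)
assert err_rt == cudart.cudaError_t.cudaSuccess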

2. cuda_core: graph allocation cascade

The cuda-core run shows a large cascade of failures and setup errors rooted in graph allocation nodes. The common failing path is:

GraphDefinition.allocate()
GraphNode.allocate()
GN_alloc()
cuGraphAddMemAllocNode(...)
CUDA_ERROR_OUT_OF_MEMORY

Many later failures are downstream effects of this same root condition. The affected areas include graph definition topology tests, graph definition integration tests, graph memory resource tests, and object protocol tests that construct sample graph allocation nodes.

The xfail should be inserted at shared graph-construction points rather than repeated across each parametrized test case. Good candidates are:

  • cuda_core/tests/graph/test_graph_definition.py: wrap graph_spec, nonempty_graph_spec, and node_spec fixture construction.
  • cuda_core/tests/test_object_protocols.py: wrap graph-node sample fixtures that call GraphDefinition.allocate() directly or indirectly, such as sample_empty_node, sample_alloc_node, sample_free_node, sample_memset_node, and sample_memcpy_node.
  • cuda_core/tests/graph/test_graph_definition_integration.py: wrap the direct end-to-end graph-building bodies that allocate graph memory nodes.
  • cuda_core/tests/graph/test_graph_memory_resource.py: wrap graph-memory allocation paths using GraphMemoryResource.allocate(...).

A small cuda-core test helper or context manager would keep the fix localized and consistent. The helper should catch CUDAError, call the existing shared xfail_if_mempool_oom(...), and re-raise if the error is not the expected Windows MCDM mempool OOM condition.
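
A minimal sketch of such a context manager (the CUDAError import path is an assumption, and the real helper would delegate classification to the existing shared xfail_if_mempool_oom rather than matching on the error name):

import contextlib

import pytest
from cuda.core.experimental._utils.cuda_utils import CUDAError  # path assumed

@contextlib.contextmanager
def xfail_on_mempool_oom(what):
    # Hypothetical helper: report the known mempool OOM as an expected
    # failure; re-raise any other CUDA error unchanged.
    try:
        yield
    except CUDAError as exc:
        if "CUDA_ERROR_OUT_OF_MEMORY" in str(exc):
            pytest.xfail(f"known platform mempool OOM during {what}")
        raise

A fixture body could then be wrapped as with xfail_on_mempool_oom("graph_spec"): ... so the classification lives in one place instead of being repeated across each parametrized test case.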

Handle default mempool query OOMs with the shared mempool xfail helper so the interop test reports the known platform condition instead of a direct assertion failure.

xref: NVIDIA#2084 (comment) section 1
github-actions bot added the cuda.bindings label May 14, 2026
rwgk added 2 commits May 14, 2026 09:27
Handle graph allocation OOMs with the shared mempool xfail helper so graph tests report the known platform condition instead of cascading failures.

xref: NVIDIA#2084 (comment) section 2

rwgk commented May 14, 2026

/ok to test

rwgk changed the title from "cuda.core: preserve CUDA errors from memory pool handle helpers" to "Preserve memory pool CUDA errors and harden OOM tests" May 14, 2026
rwgk marked this pull request as ready for review May 14, 2026 16:51
rwgk requested a review from Andy-Jost May 14, 2026 16:51