Preserve memory pool CUDA errors and harden OOM tests#2084
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
**Cursor GPT-5.5 1M High**

Analysis of CI failures: all failed jobs checked have the same two failing tests and the same exception, raised by the explicit managed-pool creation path. The failures are consistent and look like expected fallout from PR 2084 exposing a previously masked CUDA error.

The exception is identical everywhere. It happens when the test calls the explicit managed-pool creation, and it is not limited to the original Windows MCDM concern: I saw the same failure signature in other jobs as well.

So the issue is broader: on devices where explicit managed pool creation is unsupported, the tests now see the original CUDA error instead of the old masked one. Recommended fix: update the managed memory warning tests to skip when explicit managed pool creation reports unsupported. The PR 2084 code change itself is doing what it was supposed to do: preserving the original CUDA error instead of masking it. The tests need to stop relying on the old masked behavior.
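The recommended skip can be sketched in miniature. This is a hypothetical illustration, not the actual test code: the helper name is an assumption, and a plain exception stands in for `pytest.skip` so the sketch runs anywhere.

```python
# Hypothetical sketch: skip managed-memory warning tests when explicit
# managed pool creation reports "operation not supported", now that the
# original CUDA error is preserved instead of masked.

CUDA_ERROR_NOT_SUPPORTED = 801  # CUresult value for unsupported operations


class SkipTest(Exception):
    """Stand-in for pytest.skip so the sketch runs without pytest."""


def skip_if_managed_pool_unsupported(err: int) -> None:
    # Consume the now-preserved CUDA error and turn it into a test skip.
    if err == CUDA_ERROR_NOT_SUPPORTED:
        raise SkipTest("managed memory pools are not supported on this device")


# A test would call the helper right after the explicit pool creation attempt.
try:
    skip_if_managed_pool_unsupported(CUDA_ERROR_NOT_SUPPORTED)
except SkipTest as exc:
    print(f"skipped: {exc}")
```

In a real test the helper would call `pytest.skip(...)` instead of raising the stand-in exception; the point is only that the check keys on the preserved CUDA error, not on a generic `RuntimeError`.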
Skip managed memory warning tests when explicit managed pool creation reports unsupported, now that cuda.core preserves the underlying CUDA error. xref: NVIDIA#2084 (comment)
**Analysis of 2026-05-14 QA Logs**

Based on logs provided by the QA team on 2026-05-14 (thanks @rluo8!), there are two additional mempool-related OOM surfaces worth handling with targeted xfails.

**1. cuda_bindings: one new xfail point**

The cuda-bindings run shows one failure. The failing call is the driver-side default mempool query:

```python
err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS
```

The observed result is `CUDA_ERROR_OUT_OF_MEMORY`. Recommended insertion point:

```python
err_dr, pool = cuda.cuDeviceGetDefaultMemPool(0)
xfail_if_mempool_oom(err_dr, "cuDeviceGetDefaultMemPool", 0)
assert err_dr == cuda.CUresult.CUDA_SUCCESS
```

It would also be reasonable to apply the same pattern to the runtime-side query.

**2. cuda_core: graph allocation cascade**

The cuda-core run shows a large cascade of failures and setup errors rooted in graph allocation nodes; the common failing path is the construction of graph memory allocation nodes. Many later failures are downstream effects of this same root condition. The affected areas include graph definition topology tests, graph definition integration tests, graph memory resource tests, and object protocol tests that construct sample graph allocation nodes.

The xfail should be inserted at shared graph-construction points rather than repeated across each parametrized test case. A small cuda-core test helper or context manager would keep the fix localized and consistent; the helper should catch the known OOM error and report it as an expected failure.
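The two insertion styles above can be sketched together. This is a hypothetical, self-contained illustration: `xfail_if_mempool_oom` and `xfail_on_mempool_oom` are assumed names, and plain exceptions stand in for `pytest.xfail` and the cuda.core OOM error so the sketch runs without pytest or a GPU.

```python
# Hypothetical sketch of the two xfail styles discussed above.
from contextlib import contextmanager

CUDA_ERROR_OUT_OF_MEMORY = 2  # CUresult value for out-of-memory


class XFailed(Exception):
    """Stand-in for the exception pytest.xfail raises."""


class MempoolOOM(Exception):
    """Stand-in for the CUDA OOM error surfaced by cuda.core."""


def xfail_if_mempool_oom(err: int, api: str, device: int) -> None:
    # Style 1 (cuda_bindings): check the returned error code directly,
    # right after the failing API call and before the success assertion.
    if err == CUDA_ERROR_OUT_OF_MEMORY:
        raise XFailed(f"known platform OOM from {api} on device {device}")


@contextmanager
def xfail_on_mempool_oom(where: str):
    # Style 2 (cuda_core): wrap a shared graph-construction point, so the
    # OOM handling lives in one place instead of every parametrized test.
    try:
        yield
    except MempoolOOM as exc:
        raise XFailed(f"known platform OOM while {where}") from exc


# Usage at a shared construction point:
with xfail_on_mempool_oom("building a sample graph allocation node"):
    pass  # node construction would go here
```

The context-manager form is what makes the cuda_core cascade manageable: each shared fixture or helper that builds allocation nodes wraps its construction once, and every downstream test inherits the expected-failure behavior.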
Handle default mempool query OOMs with the shared mempool xfail helper so the interop test reports the known platform condition instead of a direct assertion failure. xref: NVIDIA#2084 (comment) section 1
Handle graph allocation OOMs with the shared mempool xfail helper so graph tests report the known platform condition instead of cascading failures. xref: NVIDIA#2084 (comment) section 2
/ok to test
This PR has two high-level aspects:
1. Preserve Memory Pool CUDA Errors
This PR fixes several `cuda.core` memory resource paths so they preserve the original CUDA error returned by the underlying handle helper.

The C++ handle helpers return an empty handle on failure and store the CUDA result in thread-local error state. Some memory pool call sites were not immediately consuming that stored error before continuing, which could let a later operation on an empty handle report a different error. Other call sites detected the empty handle, but replaced the original CUDA result with a generic `RuntimeError`.

This change updates the memory pool call sites to immediately call `get_last_error()` and `HANDLE_RETURN(...)` when an empty handle is returned.

When a CUDA memory pool operation fails, users and tests should see the CUDA error that caused the failure. Reporting a later follow-up error can be misleading because it describes the consequence of an empty handle, not the operation that originally failed.
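The call-site contract described above can be modeled in a short, hypothetical Python sketch. The real call sites are C++/Cython; every name here is illustrative, with `threading.local` standing in for the thread-local error state.

```python
# Hypothetical model of the handle-helper contract: a failing helper
# returns an empty handle and records the CUDA result in thread-local
# state; the call site must consume that error immediately.
import threading

_tls = threading.local()

CUDA_SUCCESS = 0
CUDA_ERROR_OUT_OF_MEMORY = 2


def _create_pool_handle(fail: bool):
    """Stand-in for a C++ handle helper."""
    if fail:
        _tls.last_error = CUDA_ERROR_OUT_OF_MEMORY
        return None  # empty handle on failure
    _tls.last_error = CUDA_SUCCESS
    return object()


def get_last_error() -> int:
    return getattr(_tls, "last_error", CUDA_SUCCESS)


class CUDAError(Exception):
    pass


def create_pool(fail: bool = False):
    handle = _create_pool_handle(fail)
    if handle is None:
        # Consume the stored error NOW, before any follow-up operation on
        # the empty handle can report a different, misleading error.
        err = get_last_error()
        if err != CUDA_SUCCESS:
            raise CUDAError(f"CUDA error {err}")
        # Fallback: empty handle but no recorded CUDA error.
        raise RuntimeError("empty handle with no recorded CUDA error")
    return handle
```

The key ordering point is that `get_last_error()` is read before anything else touches the empty handle, so the exception carries the original CUDA result rather than a consequence of it.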
Preserving the original CUDA result makes failures easier to diagnose and keeps `cuda.core` behavior consistent with the handle-layer contract used elsewhere in the package.

The updated paths cover:

The fallback `RuntimeError` messages now only describe the unexpected case where an empty handle is returned but no CUDA error was recorded.

Relevant commits:
2. Add Targeted Mempool OOM Xfails
The error-preservation change makes additional existing mempool OOM surfaces visible as their original CUDA errors. Based on #2084 (comment), this PR adds targeted xfails so those known platform-specific failures report as expected failures instead of direct assertion failures or large cascades.
The added xfails cover:

- `cuda_bindings` default mempool interoperability queries that can report `CUDA_ERROR_OUT_OF_MEMORY`, and
- `cuda_core` graph allocation paths rooted in graph memory allocation nodes.

The `cuda_core` xfails are applied at shared graph-construction points where possible, rather than repeating markers across each parametrized test case.

Relevant commits:
xrefs: