Make checkpoint tests fail on missing required binding symbols #2150
rwgk wants to merge 2 commits into
Ensure checkpoint tests distinguish missing required cuda.bindings symbols from genuinely unsupported environments.
/ok to test
PR 2150 first CI failure analysis
Workflow: https://github.com/NVIDIA/cuda-python/actions/runs/26591678170
Commit: 293258d
Workflow result: failed.
High-level result
The build and non-test infrastructure mostly passed:
The failures are concentrated in test matrix jobs. There were 37 failed test jobs plus the final status aggregation job. Failure counts by CUDA version:
Failure counts by platform:
Failure mode 1: CUDA 13.3 missing
I looked into the CUDA 12.9 failures from the first PR #2150 CI run. The short version: these failures look separate from the CUDA 13.3 In
    grep -r -i GpuPair /usr/local/cuda-12.9
returns no matches. The CUDA 12.9
    typedef struct CUcheckpointRestoreArgs_st {
        cuuint64_t reserved[8]; /**< Reserved for future use, must be zeroed */
    } CUcheckpointRestoreArgs;
That matches the CUDA 12.9 CI failure mode from https://github.com/NVIDIA/cuda-python/actions/runs/26591678170: Linux CUDA 12.9 jobs now fail during So my current interpretation is:
Possible follow-up direction: keep missing required symbols as failures for APIs that should exist in the active CUDA version, but treat the CUDA 12.9/no-
Keep baseline CUDA checkpoint coverage active for CUDA versions whose headers do not expose GPU remapping structs, while still failing when required base checkpoint bindings such as CUcheckpointRestoreArgs are missing. Gate only the GPU migration path on CUcheckpointGpuPair so CUDA 12.9 can exercise state, lock, checkpoint, restore-without-mapping, and unlock.
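The split described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the PR's actual test code: the helper names (`missing`, `plan_checkpoint_steps`), the symbol tuples, and the `SimpleNamespace` stand-ins for the bindings module are all assumptions made so the example runs without CUDA.

```python
from types import SimpleNamespace

# Assumption: illustrative symbol lists, not the PR's actual lists.
BASE_SYMBOLS = ("CUcheckpointRestoreArgs",)
GPU_REMAP_SYMBOLS = ("CUcheckpointGpuPair",)

BASE_STEPS = ["state", "lock", "checkpoint", "restore-without-mapping", "unlock"]


def missing(namespace, names):
    """Return the names that the namespace does not expose."""
    return [name for name in names if not hasattr(namespace, name)]


def plan_checkpoint_steps(driver):
    """Pick the checkpoint steps to exercise for a bindings-like namespace."""
    absent = missing(driver, BASE_SYMBOLS)
    if absent:
        # Base symbols are required everywhere: fail instead of skipping.
        raise AssertionError(f"missing required checkpoint symbols: {absent}")
    steps = list(BASE_STEPS)
    if not missing(driver, GPU_REMAP_SYMBOLS):
        # Only headers that expose the GPU remapping struct get this path.
        steps.append("gpu-migration")
    return steps


# Hypothetical stand-ins for two header generations (no CUDA required).
cuda_12_9_like = SimpleNamespace(CUcheckpointRestoreArgs=object)
newer_like = SimpleNamespace(
    CUcheckpointRestoreArgs=object, CUcheckpointGpuPair=object
)
```

With these stand-ins, `plan_checkpoint_steps(cuda_12_9_like)` returns only the baseline steps, while `plan_checkpoint_steps(newer_like)` also includes the migration step; a namespace lacking `CUcheckpointRestoreArgs` raises rather than being silently skipped.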
/ok to test
PR 2150 second CI failure analysis
Workflow: https://github.com/NVIDIA/cuda-python/actions/runs/26596635176
Commit: cd730c1
Current workflow state at inspection time:
High-level result
The second CI run matches expectations after splitting baseline checkpoint support from GPU-remapping support. All completed failures are CUDA 13.3.0 test jobs. CUDA 12.9.1 and CUDA 13.0.2 jobs that completed are passing. Failure counts by CUDA version:
Failure counts by platform:
Remaining failure mode: CUDA 13.3 missing
Closes #2149
Summary
- Update the cuda.core checkpoint test availability guard so it still skips truly unsupported environments, but no longer skips missing required cuda.bindings symbols.
- Add a cuda.bindings completeness test for the checkpoint symbols required by cuda.core.checkpoint, including CUcheckpointRestoreArgs.
Context
This is a follow-up to #2144 and fixes the test coverage gap tracked in #2149.
The CUDA 13.3.0 CUcheckpointRestoreArgs generation issue fixed by #2144 could pass the existing test flow because the cuda.core checkpoint tests treated all RuntimeErrors from checkpoint._get_driver() as an unsupported environment. That included:
This PR keeps the intended skips for genuinely unsupported configurations, but lets missing required binding attributes propagate as test failures.
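One way to picture the narrowed guard is the sketch below. It is not the code in this PR: `UnsupportedEnvironmentError`, `should_skip`, and the three fake driver loaders are hypothetical names, and how the real `checkpoint._get_driver()` signals each condition is an assumption here. The point is only that the guard catches a narrow error type rather than every `RuntimeError`.

```python
class UnsupportedEnvironmentError(RuntimeError):
    """Hypothetical marker for environments that truly cannot run the tests."""


def should_skip(get_driver):
    """Return True only when the environment is genuinely unsupported.

    Any other exception, including a plain RuntimeError caused by a missing
    binding symbol, propagates and fails the test run.
    """
    try:
        get_driver()
    except UnsupportedEnvironmentError:
        return True
    return False


# Hypothetical loaders standing in for the three situations.
def driver_ok():
    return object()


def driver_unsupported():
    raise UnsupportedEnvironmentError("checkpoint API not supported here")


def driver_missing_symbol():
    raise RuntimeError("required binding symbol is missing")
```

Under these assumptions, `should_skip(driver_unsupported)` is `True`, `should_skip(driver_ok)` is `False`, and `should_skip(driver_missing_symbol)` raises, which is the behavior the old catch-all skip was hiding.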
Validation
On the pre-#2144 base, these focused tests now expose the breakage:
fails during collection with:
and:
fails with:
After PR #2144 lands and this branch is rebased onto it, the focused checkpoint tests should pass and demonstrate that the original generation issue is fixed while the error-masking skip is closed.
Related