Move enum explanations and health checks from cuda_core to cuda_bindings #1805
rwgk wants to merge 12 commits into NVIDIA:main
…VIDIA#1712) The explanation dicts are fundamentally tied to the bindings version, so they belong in cuda_bindings. This copies them (keeping the cuda_core originals for backward compatibility) and adds the corresponding health tests under cuda_bindings/tests/. Made-with: Cursor
These tests now live in cuda_bindings/tests/test_enum_explanations.py, where they belong alongside the explanation dicts they verify. Made-with: Cursor
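To illustrate what a health check over an explanation dict might verify, here is a minimal sketch. It is not the actual content of test_enum_explanations.py: the enum `FakeResult`, the dict `FAKE_EXPLANATIONS`, and the helper `check_explanations` are hypothetical stand-ins, and the real tests may check different properties.

```python
import enum


class FakeResult(enum.IntEnum):
    """Hypothetical stand-in for an error enum shipped by the bindings."""
    SUCCESS = 0
    ERROR_INVALID_VALUE = 1
    ERROR_OUT_OF_MEMORY = 2


# Hypothetical stand-in for an explanation dict keyed by error code.
FAKE_EXPLANATIONS = {
    0: "The call returned with no errors.",
    1: "A parameter was outside the acceptable range of values.",
    2: "The call could not allocate enough memory.",
}


def check_explanations(enum_cls, explanations):
    """Return (missing, stale): enum codes without an explanation, and
    explanation keys that no longer match any enum member."""
    enum_codes = {int(member) for member in enum_cls}
    explained = set(explanations)
    missing = sorted(enum_codes - explained)
    stale = sorted(explained - enum_codes)
    return missing, stale


if __name__ == "__main__":
    print(check_explanations(FakeResult, FAKE_EXPLANATIONS))
```

Because a check of this kind compares the dict against the enum of whichever bindings are installed, it is sensitive to the bindings version, which is consistent with keeping it next to the dicts.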
…llback (NVIDIA#1712) Each explanation module now tries to import the authoritative dict from cuda.bindings._utils (ModuleNotFoundError-guarded) and falls back to its own copy for older cuda-bindings that don't ship it yet. Smoke tests added for both dicts. Made-with: Cursor
NVIDIA#1712) Rename explanation dicts to _EXPLANATIONS / _FALLBACK_EXPLANATIONS, add _CTK_MAJOR_MINOR_PATCH to each module, and enforce that the cuda_core fallback copy is as new as (and in-sync with) cuda_bindings. Parametrize the smoke and version-check tests to cover both driver and runtime without duplication. Made-with: Cursor
…tring (NVIDIA#1712) Made-with: Cursor
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
/ok to test
For easy reference, the CI at commit fb12195 was successful: (I'm about to push git merge master, which will hide it. Not rerunning the CI for now, waiting for a review.)
What's stopping us from moving this codegen into the code generator and re-exporting it here to avoid breaking stuff? We can't continue to live with steps like "copy x manually". Let's just do the work to move it to the generator. It doesn't really make sense that we've got tools parsing C headers in Python and producing code from that, and yet we're still copying dictionaries by hand.
RUNTIME_CUDA_ERROR_EXPLANATIONS = {
_FALLBACK_EXPLANATIONS = {
    0: (
        "The API call returned with no errors. In the case of query calls, this"
Do we have to duplicate this error text list? Can we hoist it into a central location?
> Do we have to duplicate this error text list?
Originally my proposal was to avoid this copy (see the #1712 issue description, Backward compatibility section), but @leofang argued for vendoring (see issue comments).
> Can we hoist it into a central location?
Not if we want future cuda-core releases to produce the enhanced error messages even when used in combination with cuda-bindings releases made before this PR was merged.
On balance, I still feel the better compromise is to delete this copy, and to change cuda_core/cuda/core/_utils/cuda_utils.pyx to skip enhancing the error messages if the dict is not in cuda-bindings. It's really only a nice-to-have that will be easy to get back by using the latest cuda-bindings.
rparolin left a comment
Please remove the duplicated error text array.
I totally agree, but this PR is about solving nvbug 5932944, which is related to but different from the code-gen question. I opened cuda-python-private issue 289 to track your suggestion.
…gs (NVIDIA#1712) Remove the vendored explanation dicts from cuda_core. cuda_utils.pyx now imports directly from cuda.bindings._utils with a ModuleNotFoundError fallback to an empty dict, so error messages gracefully degrade when paired with older cuda-bindings that don't ship the dicts. Made-with: Cursor
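The guarded import described in this commit could look roughly like the sketch below. The helper `load_explanations` is hypothetical, and the submodule name in the trailing comment is an assumption; the commit message only says the dicts are imported from cuda.bindings._utils.

```python
import importlib


def load_explanations(module_name, attr_name):
    """Return the explanation dict from module_name, or an empty dict when
    that module is not available (e.g. an older cuda-bindings release)."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The dict is only a supplement to the error message, so degrade
        # gracefully instead of failing at import time.
        return {}
    return getattr(module, attr_name)


# Assumed usage; the submodule name below is a guess, not taken from the PR:
# DRIVER_CU_RESULT_EXPLANATIONS = load_explanations(
#     "cuda.bindings._utils.driver_cu_result_explanations",
#     "DRIVER_CU_RESULT_EXPLANATIONS",
# )

if __name__ == "__main__":
    print(load_explanations("no_such_package_xyz._utils", "EXPLANATIONS"))
```

Catching only ModuleNotFoundError, rather than every ImportError, means a module that exists but fails to import for another reason still surfaces as an error.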
Done, Cursor said this:
I converted this PR back to Draft mode while retesting.
/ok to test
…#1712) Restore DRIVER_CU_RESULT_EXPLANATIONS / RUNTIME_CUDA_ERROR_EXPLANATIONS as the dict names in cuda_bindings and remove the _CTK_MAJOR_MINOR_PATCH / _EXPLANATIONS indirection that is no longer needed without the cuda_core fallback copies. Made-with: Cursor
/ok to test
Closes #1712
The DRIVER_CU_RESULT_EXPLANATIONS and RUNTIME_CUDA_ERROR_EXPLANATIONS dicts are fundamentally tied to the cuda-bindings release (they must match the enums shipped in that release). Having them live exclusively in cuda_core meant the health-check tests failed whenever cuda_core was tested against a different version of cuda-bindings (nvbug 5932944).

Changes

The dicts now live in cuda_bindings/cuda/bindings/_utils/ as the single authoritative source (renamed to _EXPLANATIONS with a _CTK_MAJOR_MINOR_PATCH version tag).
cuda_utils.pyx now imports directly from cuda.bindings._utils, with a ModuleNotFoundError fallback to an empty dict.
The health-check tests now live in cuda_bindings/tests/test_enum_explanations.py, where they belong alongside the dicts they verify.

Impact on error messages for cuda-core users

When cuda-core raises a CUDAError, it tries to include a human-readable explanation of the error code (e.g. "This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values").

With this change:

_utils): Error messages fall back to the driver/runtime error name and description string obtained from cuGetErrorString / cudaGetErrorString. The explanations are a nice-to-have supplement, and the error name + description are still informative. Upgrading to a current cuda-bindings release restores the full explanations.
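The graceful degradation described above can be sketched as follows. This is not the code in cuda_utils.pyx: `format_cuda_error` is a hypothetical helper, and the name and description strings are illustrative values of the kind the driver reports.

```python
def format_cuda_error(name, description, code, explanations):
    """Build an error message from the error name and description, and
    append the longer explanation only when one is available for code."""
    message = f"{name}: {description}"
    explanation = explanations.get(code)
    if explanation:
        message += f"\n{explanation}"
    return message


if __name__ == "__main__":
    explanations = {
        1: (
            "This indicates that one or more of the parameters passed to the"
            " API call is not within an acceptable range of values."
        ),
    }
    # With the dict available, the explanation is appended.
    print(format_cuda_error("CUDA_ERROR_INVALID_VALUE", "invalid argument", 1, explanations))
    # With an empty dict (older cuda-bindings), only name and description remain.
    print(format_cuda_error("CUDA_ERROR_INVALID_VALUE", "invalid argument", 1, {}))
```

Since the empty dict behaves like a dict with no matching key, the caller needs no special case for older cuda-bindings.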