cuda.core: audit of C-style error-code return patterns in public API #1951

@mdboom

Description

Motivation

cuda.core is intended to be a high-level Pythonic wrapper around lower-level bindings in cuda.bindings. In idiomatic Python, errors should be communicated via exceptions rather than requiring callers to inspect return values. This audit looked at all public functions and methods in cuda.core for places where the C convention of returning error/status codes leaks through — or more broadly, anywhere the caller must inspect the returned object for correctness rather than relying on exception flow.

Summary

The codebase is largely well-designed. The HANDLE_RETURN() macro and handle_return() function consistently convert CUDA error codes into Python exceptions across the vast majority of the API. However, there are several notable deviations.
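To make the pattern concrete, here is a hedged sketch (illustrative only, not the actual cuda.core implementation) of the conversion the audit refers to: the low-level bindings return `(status, *values)` tuples, and a single helper turns any non-success status into a Python exception so callers never inspect codes.

```python
# Stand-in constant and exception type; the real names live in cuda.bindings.
CUDA_SUCCESS = 0

class CUDAError(Exception):
    """Raised when a CUDA call returns a non-success status."""

def handle_return(result):
    """Unpack a (status, *values) tuple, raising on any non-success status."""
    status, *values = result
    if status != CUDA_SUCCESS:
        raise CUDAError(f"CUDA call failed with status {status}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)
```

The findings below are the places where this exception-first discipline is bypassed.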

Findings

1. Event.is_done — boolean derived from CUDA error code

_event.pyx: Converts CUDA_SUCCESS → True and CUDA_ERROR_NOT_READY → False. The caller must inspect the return value rather than relying on exception flow. This is a common idiom in async GPU APIs and is arguably reasonable for polling, but it is worth noting as a deliberate deviation from pure exception-based error handling.
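An illustrative sketch of this polling idiom (the constants and function name are stand-ins, not the cuda.core implementation): the two expected status codes map to a boolean, and anything else is promoted to an exception rather than silently consumed.

```python
# Stand-in status codes mimicking the CUDA ones.
CUDA_SUCCESS = 0
CUDA_ERROR_NOT_READY = 600

class CUDAError(Exception):
    pass

def is_done(status):
    """Return True if the event completed, False if it is still pending."""
    if status == CUDA_SUCCESS:
        return True
    if status == CUDA_ERROR_NOT_READY:
        return False
    # Any other code is a genuine failure, not a polling outcome.
    raise CUDAError(f"unexpected CUDA status: {status}")
```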

2. Program.pch_status — string status code the caller must interpret

_program.pyx: Returns "created", "not_attempted", "failed", or None. The "failed" case is notable — PCH creation failure is reported as a string value rather than raised as an exception. The caller must know to check for "failed" and handle it. Internally, the helper _read_pch_status() also uses None as a sentinel for "heap exhausted, retry needed" (a classic C-style error pattern, though internal-only).

3. Linker.get_error_log() / get_info_log() — unchecked CUDA calls

_linker.pyx: These return diagnostic strings, but the underlying CUDA calls to nvJitLinkGetErrorLogSize / nvJitLinkGetErrorLog are not checked via HANDLE_RETURN — the results are used directly without error checking. If these calls fail, the failure is silently ignored.

4. _MP_deallocate silently swallows CUDA_ERROR_INVALID_CONTEXT

_memory_pool.pyx: The deallocation path explicitly suppresses CUDA_ERROR_INVALID_CONTEXT. The function is marked noexcept so it cannot raise, but this means a real error (e.g., deallocating after context destruction) is silently ignored. Callers have no way to know deallocation failed.

5. DeviceProperties._get_attribute() returns a default on CUDA_ERROR_INVALID_VALUE

_device.pyx: When querying device attributes, CUDA_ERROR_INVALID_VALUE (which often means "this attribute isn't supported on this GPU") is silently converted to a default value (typically 0) rather than raising. A caller reading device.properties.some_attribute could get 0 and not know whether the attribute is genuinely 0 or unsupported on their hardware.
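The ambiguity can be sketched as follows (hypothetical code; a dict stands in for the driver attribute query): returning a default on "invalid value" makes a genuine 0 indistinguishable from "unsupported", whereas a distinct sentinel lets the caller tell the two apart.

```python
# A sentinel object is never equal to any real attribute value.
UNSUPPORTED = object()

def get_attribute(attrs, name):
    """Return the attribute's value, or UNSUPPORTED when it is absent."""
    try:
        return attrs[name]
    except KeyError:  # stands in for CUDA_ERROR_INVALID_VALUE here
        return UNSUPPORTED
```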

6. Kernel._get_arguments_info() uses CUDA_ERROR_INVALID_VALUE as end-of-list sentinel

_module.pyx: Loops calling cuKernelGetParamInfo until it gets CUDA_ERROR_INVALID_VALUE, which it interprets as "no more parameters" rather than an error. This mirrors the C API convention. Any genuinely invalid-value error would also be silently consumed.
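An illustrative sketch of this end-of-list idiom (names are stand-ins for the cuKernelGetParamInfo loop, not the cuda.core code): iteration stops on a specific error code, which would also silently consume a genuine invalid-value error.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_VALUE = 1

def collect_params(get_param_info):
    """Call get_param_info(i) for i = 0, 1, ... until INVALID_VALUE."""
    params = []
    i = 0
    while True:
        status, info = get_param_info(i)
        if status == CUDA_ERROR_INVALID_VALUE:
            break  # interpreted as "no more parameters" -- or a masked error
        params.append(info)
        i += 1
    return params
```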

7. Device_resolve_device_id() returns 0 on CUDA_ERROR_INVALID_CONTEXT

_device.pyx: When no context exists, instead of raising, it defaults to device 0 (mimicking cudart behavior). This is an internal function but affects public API behavior — Device(None) silently falls back to device 0 rather than informing the caller there is no active context.

8. DMR_mempool_get_access() — returns magic strings instead of a typed enum

_device_memory_resource.pyx: Returns "rw", "r", or "". The empty string "" (meaning "no access") is a value the caller must check — attempting to use a buffer without access would only fail later at a less helpful point. A proper enum would make this more self-documenting and less error-prone.
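A hedged sketch of what an enum-based return could look like (hypothetical names, not the actual cuda.core API): a typed enum replaces the magic strings, so "no access" becomes an explicit, self-documenting value instead of an empty string the caller must remember to test for.

```python
from enum import Enum

class PoolAccess(Enum):
    """Typed replacement for the raw "rw" / "r" / "" status strings."""
    READ_WRITE = "rw"
    READ = "r"
    NONE = ""

def access_from_raw(raw):
    """Map the raw string status onto the typed enum."""
    return PoolAccess(raw)
```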

Suggestions

Ranked roughly from most to least impactful:

  1. Program.pch_status returning "failed" — consider raising an exception (or at least a warning) during compile() when PCH creation fails, rather than silently storing a status string the user must remember to check.
  2. Linker.get_error_log() / get_info_log() — check the CUDA return values from the underlying log-retrieval calls via HANDLE_RETURN.
  3. _MP_deallocate suppressing CUDA_ERROR_INVALID_CONTEXT — at minimum log a warning so failures are observable.
  4. DeviceProperties returning 0 for unsupported attributes — consider raising AttributeError or returning a distinct sentinel so callers can distinguish "genuinely 0" from "not supported".
  5. DMR_mempool_get_access — return a proper enum rather than magic strings.
  6. Kernel._get_arguments_info() end-of-list sentinel — document or assert that CUDA_ERROR_INVALID_VALUE is only expected at the boundary, to avoid masking real errors.
  7. Device_resolve_device_id() defaulting to device 0 — consider raising when there is no active context, rather than silently choosing a device.

Not flagged (correct patterns)

For completeness, these were reviewed and found to handle errors properly:

  • Graph.update() — raises CUDAError with diagnostic info on GRAPH_EXEC_UPDATE_FAILURE
  • GraphBuilder.complete() / _instantiate_graph() — raises RuntimeError with error reason
  • Event.__sub__() — handles error codes inline but always raises exceptions with contextual messages
  • All close() methods — delegate to C++ RAII handles; idempotent no-op behavior is standard
  • All memory resource allocate() / deallocate() public methods — consistently use HANDLE_RETURN or raise_if_driver_error()
  • All Stream, Device, Context public methods — consistently raise via HANDLE_RETURN
  • All graph node factory methods — consistently raise via HANDLE_RETURN
  • system subpackage functions — consistently raise ValueError / RuntimeError on failure

Metadata

Assignees

No one assigned

    Labels

    cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements)
