
Re-wrap most of the enums in cuda.bindings.nvml for cuda.core.system (#2014)

Merged
mdboom merged 12 commits into NVIDIA:main from mdboom:re-expose-enums
May 5, 2026

Conversation

Contributor

@mdboom mdboom commented May 4, 2026

This rewraps most of the enums (except the extremely large and unorganized FieldId) for cuda.core.system rather than passing them directly through. This also creates an optional string interface for all of these enum values.
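As a rough illustration of the dual enum-or-string interface described above (the enum name and members here are made up for illustration, not taken from the PR; on Python 3.10 a `str` mixin approximates the `StrEnum` the PR uses):

```python
from enum import Enum

class ClockId(str, Enum):  # str mixin stands in for StrEnum on 3.10
    CURRENT = "current"
    CUSTOMER_BOOST_MAX = "customer_boost_max"

def normalize(value) -> ClockId:
    """Accept either a ClockId member or its string value."""
    return ClockId(value)

assert normalize("current") is ClockId.CURRENT
assert normalize(ClockId.CURRENT) is ClockId.CURRENT
# String-flavored: members compare equal to their string values.
assert ClockId.CURRENT == "current"
```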

@mdboom mdboom requested review from leofang and rwgk May 4, 2026 17:25
@github-actions bot added the cuda.core label (Everything related to the cuda.core module) May 4, 2026
@mdboom mdboom self-assigned this May 4, 2026
@mdboom mdboom added this to the cuda.core v1.0.0 milestone May 4, 2026
@mdboom mdboom added the P0 label (High priority - Must do!) May 4, 2026

Contributor

rwgk commented May 4, 2026

PR 2014 Agent Review

  • Size: 16 files, +800 / -191
  • Reviewer: Claude Opus 4.7 (1M context, Max thinking) — initial review

With small manual edits: I reduced the Compatibility subsection. I deleted the Next steps section entirely. The rest looks useful to me.

The "Sync guard (structural gap)" finding wasn't in the initial automatic review; I asked about it in a follow-on prompt. It seems pretty easy to at least add basic protections.


High-level summary (asked for by reviewer)

This PR is the implementation of issue
#1995 ("Replace string
literals with enums in public API"), which is itself a sub-item of the
#1919 "Audit cuda.core
API for 1.0 release" epic. The motivation, captured in the issue thread
(especially the 2026-05-01 comment), is the cuda.core 1.0 milestone (#43, due
2026-05-07): once 1.0 ships, the team is locked into a multi-year support
window, so any awkward-to-fix public API surface must be cleaned up now. The
decision was to:

  • For "logically enum-like" things, accept either an enum or a string
    everywhere, and return a string-flavored enum (StrEnum).
  • For NVML-derived enums (currently _FastEnum ints), wrap them in
    cuda.core.system with hand-curated StrEnums so users see a uniform
    Enum | str interface and don't get C-style names like
    EVENT_TYPE_XID_CRITICAL_ERROR or TEMPERATURE_THRESHOLD_SHUTDOWN leaking
    into the public API.

The PR rewraps most NVML enums (except FieldId, intentionally) into
StrEnum types, adds _<NAME>_MAPPING / _<NAME>_INV_MAPPING dicts for
round-tripping, accepts Enum | str everywhere, and cleans up a few adjacent
things while the file is open:

  • Pstates → plain int (0..15) with None for unknown.
  • PciInfo.get_throughput(counter) → rx_throughput / tx_throughput
    properties.
  • device.brand → free-form str (with "Unknown" fallback).
  • NvLink version → (major, minor) tuple.

It also depends on backports.strenum for Python 3.10 (cuda_core supports
3.10+).
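The forward/inverse mapping pattern described above might look roughly like this; the `nvml` values and all names below are placeholders for illustration, not the actual binding:

```python
from enum import Enum
from types import SimpleNamespace

# Stand-in for the nvml binding's integer enum values (assumed, illustrative).
nvml = SimpleNamespace(CLOCK_ID_CURRENT=0, CLOCK_ID_CUSTOMER_BOOST_MAX=4)

class ClockId(str, Enum):
    CURRENT = "current"
    CUSTOMER_BOOST_MAX = "customer_boost_max"

# Forward: driver value -> public StrEnum; inverse: public -> driver value.
_CLOCK_ID_MAPPING = {
    nvml.CLOCK_ID_CURRENT: ClockId.CURRENT,
    nvml.CLOCK_ID_CUSTOMER_BOOST_MAX: ClockId.CUSTOMER_BOOST_MAX,
}
_CLOCK_ID_INV_MAPPING = {v: k for k, v in _CLOCK_ID_MAPPING.items()}

# Round trip: driver value -> StrEnum -> driver value.
assert _CLOCK_ID_INV_MAPPING[_CLOCK_ID_MAPPING[4]] == 4
```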

The direction is right and consistent with the design agreed in #1995.
However, the PR has several concrete bugs, a doc-build risk, a structural gap
(no mechanism to keep the wrappers in sync with the bindings — see "Sync
guard" below), and a few breaking-change footnotes that are not currently
called out in the PR body.

Findings (most severe first)

Bugs

Bug — test_nvlink references a removed symbol.
cuda_core/tests/system/test_system_device.py:759 still does
assert isinstance(version, system.NvlinkVersion). NvlinkVersion is no
longer in the system namespace (the __all__ change in _device.pyx
dropped it; _nvlink.pxi now returns a plain tuple[int, int]). This test
will fail with AttributeError: module 'cuda.core.system' has no attribute 'NvlinkVersion' on any host with NVLink. Replace with
assert isinstance(version, tuple) plus length / element checks.

Bug — supported_pstates can iterate beyond the valid range.
cuda_core/cuda/core/system/_device.pyx:1001 walks
nvml.device_get_supported_performance_states(...) and only filters out
PSTATE_UNKNOWN (= 32). The NVML header says unused trailing slots are
PSTATE_UNKNOWN, but if the driver ever returned a value outside 0..15
(e.g., a future PSTATE_*), _pstate_to_int would silently return
int(x) - 0, producing an out-of-contract integer. Either drop the value or
raise. The same risk applies to device.performance_state
(cuda_core/cuda/core/system/_device.pyx:971).

Bug — _pstate_to_enum name is wrong; takes an int and returns an int.
cuda_core/cuda/core/system/_device.pyx:29 is named _pstate_to_enum but
the body is
return int(pstate) + int(nvml.Pstates.PSTATE_0). It just shifts an int, and
since PSTATE_0 = 0 it's literally the identity for valid input. The
misnomer is confusing; rename to _int_to_pstate (or inline). The cast back
through int(...) is also redundant.

Bug — error messages report bit index, not the failing bit value.
In current_clock_event_reasons / supported_clock_event_reasons
(cuda_core/cuda/core/system/_device.pyx:670 and
cuda_core/cuda/core/system/_device.pyx:691), in CoolerInfo.target
(cuda_core/cuda/core/system/_cooler.pxi:76), and in
get_supported_event_types
(cuda_core/cuda/core/system/_device.pyx:811), the error path is:

for reason in _unpack_bitmask(reasons):       # reason is a bit *index* (0,1,2,...)
    try:
        output_reason = _CLOCKS_EVENT_REASONS_MAPPING[1 << reason]  # bit *value*
    except KeyError:
        raise ValueError(f"Unknown clock event reason bit: {reason}")  # reports index

The lookup uses 1 << reason but the error message reports reason. If a
future driver introduces a new bit, the user will see a confusing message
("bit: 9" when the real unmapped value is 0x200). Use 1 << reason in the
message and also include the bit index for context.
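A corrected error path, sketched with a toy mapping (the mapping contents and bit values are illustrative, not the PR's):

```python
def _unpack_bitmask(mask):
    """Yield the index of each set bit (sketch of the helper described above)."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1

_CLOCKS_EVENT_REASONS_MAPPING = {0x1: "gpu_idle"}  # illustrative subset

def decode_reasons(reasons):
    out = []
    for bit_index in _unpack_bitmask(reasons):
        bit_value = 1 << bit_index
        try:
            out.append(_CLOCKS_EVENT_REASONS_MAPPING[bit_value])
        except KeyError:
            # Report the actual unmapped bit *value*, with the index for context.
            raise ValueError(
                f"Unknown clock event reason bit value {bit_value:#x}"
                f" (bit index {bit_index})"
            )
    return out
```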

Bug — device.brand silently downgrades several brands to "Unknown".
_BRAND_TYPE_MAPPING (cuda_core/cuda/core/system/_device.pyx:95) is
missing BRAND_QUADRO_RTX, BRAND_NVIDIA_RTX, and BRAND_NVIDIA (all
defined in nvml.BrandType). The old code returned a typed BrandType enum
that could be matched against any of those; the new code uses
.get(..., "Unknown") so a Quadro RTX or "NVIDIA"-branded card will now
report as "Unknown" instead of its real brand. Either add the missing
entries or fall through to nvml.BrandType(brand).name.

Bug — pre-existing Device.__new__() call in EventData.device.
cuda_core/cuda/core/system/_event.pxi:64 reads
device = Device.__new__() (no cls argument). This is a pre-existing bug,
not introduced by this PR, but the PR is the right time to fix it because the
file is being touched. It will raise
TypeError: object.__new__(): not enough arguments the first time
event.device is accessed.

Sync guard (structural gap)

There is no mechanism today, and none added by this PR, to ensure the
cuda_core enum wrappers stay in sync with the underlying cuda_bindings
NVML enums. I checked for tests over __members__, code-gen hooks, CI
diff-checks, and runtime self-checks; none exist. The only thing close is the
runtime fallback pattern this PR introduces — _<NAME>_MAPPING.get(value, default) for the inbound direction and try: ... except KeyError: raise ValueError(...) for the outbound direction. That catches missing mappings
at the point of first encounter (a property access on a real device), but
only when the NVML driver actually returns the new value on the test host —
which CI labs don't control. Several of the bugs above (the missing
BRAND_* entries, the dropped THERMAL_GPU_RELATED) survived precisely
because nothing else flags them.

This is a real risk for cuda.core 1.0:

  • cuda_bindings is auto-generated from the NVML header. The header at the
    top of cuda_bindings/cuda/bindings/_internal/_fast_enum.py confirms it is
    generated, and the nvml.pyx comment says
    "automatically generated across versions from 12.9.1 to 13.2.0". NVML
    enums grow on every CUDA toolkit refresh.
  • This PR turns the wrappers into hand-curated StrEnums. So every NVML
    release will silently widen the binding's _FastEnum, while the
    cuda.core wrapper stays stuck at whatever was current when the wrapper
    was written. The fallback .get() pattern then either reports None /
    "Unknown", or — for outbound (user → driver) — raises ValueError for a
    value that NVML actually supports.
  • The repo already has a closed bug
    (#1712) for a similar
    problem (the explanation dicts going out of sync between cuda_core and
    cuda_bindings), and a known related bug
    (#1663) about Cython
    type-redefinition. So this category of drift is a known-recurring footgun
    in the codebase.

What a guard could look like, in rough order of cost vs. coverage:

  1. Cheapest — a single parametrized test in
    cuda_core/tests/system/ that imports both cuda.bindings.nvml and
    cuda.core.system, and for each wrapper asserts that every NVML member
    has a corresponding entry in _<NAME>_MAPPING (or is on a documented
    allow-list). Sketch:

    import pytest
    from cuda.bindings import nvml
    from cuda.core.system import _device
    
    WRAPPER_TO_BINDING = [
        (_device._ADDRESSING_MODE_MAPPING, nvml.DeviceAddressingModeType,
         {"DEVICE_ADDRESSING_MODE_NONE"}),
        (_device._AFFINITY_SCOPE_MAPPING, nvml.AffinityScope, set()),
        (_device._GPU_TOPOLOGY_LEVEL_MAPPING, nvml.GpuTopologyLevel, set()),
        (_device._EVENT_TYPE_MAPPING, nvml.EventType, set()),
        (_device._BRAND_TYPE_MAPPING, nvml.BrandType, {"BRAND_COUNT"}),
        # ... one entry per wrapper
    ]
    
    @pytest.mark.parametrize("mapping, binding, intentionally_unmapped",
                             WRAPPER_TO_BINDING)
    def test_wrapper_covers_all_binding_members(mapping, binding,
                                                intentionally_unmapped):
        binding_keys = set(binding.__members__) - intentionally_unmapped
        mapped_keys = (
            {m.name for m in mapping.keys() if isinstance(m, binding)}
            | {m.name for m in mapping.values() if isinstance(m, binding)}
        )
        missing = binding_keys - mapped_keys
        assert not missing, (
            f"{binding.__name__} is missing wrapper entries for: {missing}"
        )

    The intentionally_unmapped set is the explicit allow-list (e.g. *_COUNT
    sentinels, deprecated APP_CLOCK_*, the typo'd
    P2P_STATUS_CHIPSET_NOT_SUPPORED, etc.). When NVML adds a new member,
    this test fails on every CI host (no GPU required), and a maintainer
    either adds a wrapper entry or extends the allow-list with a comment
    explaining why.

  2. Medium — a small import-time _validate_mappings() behind
    if __debug__: or behind a CUDA_PYTHON_VALIDATE_ENUMS=1 env var. Same
    idea as option 1, but lives next to the mappings so the failure mode is
    ImportError on cuda.core.system. I'd lean against this for a 1.0
    library — too noisy at import — but it's an option.

  3. Heavier — code-gen the mappings. Since cuda_bindings is already
    generator-driven, the _<NAME>_MAPPING dicts could be too: feed the same
    NVML header → emit a _generated_mappings.pxi with the forward dict, plus
    a side file listing "human-curated" StrEnum names that humans then
    maintain. The generator emits a placeholder # TODO map <new_member>
    comment when a new NVML member appears, which fails CI via a regex check.
    This is the strongest guarantee but is a much bigger change.

  4. Adjacent — pin a binding floor. cuda_core already has
    cuda-bindings[all]==12.* / 13.* in extras; if the wrappers target a
    specific NVML enum surface, pin cuda-bindings>=X,<Y so a user who
    mixes cuda-core 1.0.0 with a newer cuda-bindings gets a clean failure
    rather than silent "Unknown" / ValueError. (Issue
    #1715 was exactly the
    inverse problem — cuda.core demanding an unreleased bindings version —
    so the team already has scar tissue here.)

Given the milestone is May 7 and this review already turned up missing brand
entries and a missing cooler target, I'd push for option #1 as a blocker
for 1.0: it's ~80 lines, table-driven, no GPU required, and the allow-list
doubles as documentation for why certain NVML members are deliberately not
surfaced (deprecated / sentinel / typo / composite bitmask). I'm happy to
draft that test against the current branch so it's ready to drop in on top of
this PR.

Behavior / compatibility

Behavior — get_supported_event_types includes EventType.NONE mapping,
but the bitmask path can never produce it.

EventType.NONE = "none" is added to the wrapper but the bitmask path can
never produce it (no bit is set). _EVENT_TYPE_MAPPING includes
nvml.EventType.NONE → EventType.NONE only because it is needed for the
EventData.event_type property. That's fine, but consider documenting that
EventType.NONE is reserved for "no event" and isn't a registrable type.
Currently device.register_events([EventType.NONE]) is a silent no-op
(bitmask stays 0), which is surprising; consider raising.
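If raising were preferred, the outbound path could reject the sentinel explicitly. A toy sketch, with member names and the bit table assumed for illustration (not the PR's code):

```python
from enum import Enum

class EventType(str, Enum):  # member names assumed from the review text
    NONE = "none"
    XID_CRITICAL_ERROR = "xid_critical_error"

_EVENT_TYPE_BITS = {EventType.XID_CRITICAL_ERROR: 1 << 3}  # hypothetical bit

def _events_to_bitmask(events):
    mask = 0
    for e in events:
        e = EventType(e)  # accepts enum member or string
        if e is EventType.NONE:
            raise ValueError("EventType.NONE is not a registrable event type")
        mask |= _EVENT_TYPE_BITS[e]
    return mask
```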

Compatibility — Call out that this PR is a breaking change in the PR description /
release notes. The breaking label is missing.

Behavior — device.brand: BrandType → str (also see the bug above).
Switching to a free-form str (with "Unknown" fallback) means callers
can't reliably enumerate brands or do == against a name they didn't see in
CI. Was a Brand StrEnum considered and rejected? If yes, mention it in
the PR body so reviewers don't ask again.

Behavior — NvLink.version: NvlinkVersion → tuple[int, int]. This is a
clean improvement, but it's not in the issue #1995 scope. It would be good to
call out in the PR description that it's an additional change so it doesn't
surprise users.

Behavior — PciInfo.get_throughput(counter) → rx_throughput / tx_throughput properties.
Same — this is good
cleanup, but it changes the public API shape and isn't in the PR title or
description. Worth noting.

Docs

Docs — three private-doc enums still listed as cyclasses.
cuda_core/docs/source/api_private.rst:74 keeps
system._device.GpuP2PCapsIndex, system._device.GpuP2PStatus, and
system._device.GpuTopologyLevel under
:template: autosummary/cyclass.rst. After this PR they are pure-Python
StrEnums, not Cython classes. They should be moved to the lower section
(cuda_core/docs/source/api_private.rst:92+) alongside AddressingMode,
AffinityScope, etc., or rendered with the default template; otherwise the
docs build is likely to warn or render incorrectly.

Docs — Device.performance_state documentation feels split.
cuda_core/cuda/core/system/_device.pyx:957 returns int | None. The
current doc and runtime contract say "0 is highest, 15 is lowest, None if
unknown". That's reasonable, but dynamic_pstates_info and
register_events([EventType.PSTATE]) still use Pstate concepts; users who
read those docstrings have to mentally context-switch. Consider documenting
once on system.Device and cross-referencing.

Docs — stale references to old enum names.

  • cuda_core/cuda/core/system/_event.pxi:79,
    cuda_core/cuda/core/system/_event.pxi:92, and
    cuda_core/cuda/core/system/_event.pxi:105 still reference
    EventType.EVENT_TYPE_XID_CRITICAL_ERROR in docstrings. The new value is
    EventType.XID_CRITICAL_ERROR. The Sphinx links will fail (or worse,
    render as broken refs).
  • cuda_core/cuda/core/system/_system_events.pyx:168 example uses
    SystemEventType.SYSTEM_EVENT_TYPE_GPU_DRIVER_UNBIND. The new name is
    SystemEventType.UNBIND.
  • cuda_core/cuda/core/system/_event.pxi:95 (gpu_instance_id) and
    cuda_core/cuda/core/system/_event.pxi:108 (compute_instance_id)
    docstrings still use the old EVENT_TYPE_XID_CRITICAL_ERROR name.

Tests

Test — temperature-thresholds slice loses one threshold.
cuda_core/tests/system/test_system_device.py:680 is
for threshold in list(system.TemperatureThresholds)[:-1]:. Pre-PR, [:-1]
stripped TEMPERATURE_THRESHOLD_COUNT. Post-PR, the new StrEnum has 8 real
values and no sentinel; [:-1] now strips GPS_CURR instead. Drop the
[:-1]. Same audit for any other [:-1] over an enum in tests.
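For illustration, a toy version of the post-PR enum (member names assumed) shows why the [:-1] slice now drops a real member instead of a sentinel:

```python
from enum import Enum

class TemperatureThresholds(str, Enum):  # no COUNT sentinel post-PR
    SHUTDOWN = "shutdown"
    SLOWDOWN = "slowdown"
    GPS_CURR = "gps_curr"

# [:-1] silently skips GPS_CURR, a real threshold:
assert list(TemperatureThresholds)[:-1] == [
    TemperatureThresholds.SHUTDOWN,
    TemperatureThresholds.SLOWDOWN,
]
```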

Test — register_events test misses positive case for str input.
cuda_core/tests/system/test_system_device.py:265-286 was simplified, but it
only checks register_events([]), register_events(0) (now correctly
rejected), and a pre-existing typed-list path on the systems test. It would
be good to also assert that device.register_events("xid_critical_error")
and device.register_events([EventType.XID_CRITICAL_ERROR]) both work, since
the whole point of this PR is dual support.

Style / nits

Style — module-level docstring assignment is repeated and noisy.
The pattern

class ClockId(StrEnum):
    """..."""
    CURRENT = "current"
ClockId.CURRENT.__doc__ = "Current actual clock value."

is duplicated across 8+ enums. Per #1995, mdboom called this "ugly but
works." Consider a small helper, e.g.
_set_member_docs(ClockId, {"CURRENT": "...", "CUSTOMER_BOOST_MAX": "..."}),
both for readability and to make it easy to lift these into autodoc later.
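A possible shape for such a helper; _set_member_docs and the member docs below are assumptions for illustration, not code in the PR:

```python
from enum import Enum

def _set_member_docs(enum_cls, docs):
    """Attach per-member __doc__ strings; raises KeyError on unknown names."""
    for name, doc in docs.items():
        enum_cls[name].__doc__ = doc

class ClockId(str, Enum):
    CURRENT = "current"
    CUSTOMER_BOOST_MAX = "customer_boost_max"

_set_member_docs(ClockId, {
    "CURRENT": "Current actual clock value.",
    "CUSTOMER_BOOST_MAX": "Customer-defined maximum boost clock.",
})
```

A typo'd member name fails loudly at import time (KeyError), which is a small advantage over scattered `X.Y.__doc__ = ...` assignments.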

Style — import sys + version branch duplicated in two .pyx files.
cuda_core/cuda/core/system/_device.pyx:8 and
cuda_core/cuda/core/system/_system_events.pyx:8 have identical
if sys.version_info >= (3, 11): guards. With
cuda_core/pyproject.toml:22 requiring >=3.10, this is necessary. Consider
centralizing in a shared _compat.pxi (or in _device_utils.pxi) and
include-ing it from both, so future bumps to 3.11 are a single edit.

Style — inverse mappings constructed inconsistently.
_GPU_TOPOLOGY_LEVEL_INV_MAPPING, _EVENT_TYPE_INV_MAPPING,
_THERMAL_TARGET_INV_MAPPING, _SYSTEM_EVENT_TYPE_INV_MAPPING are all built
as {v: k for k, v in <forward>.items()}. Several other dicts (e.g.,
_CLOCK_ID_MAPPING, _GPU_P2P_CAPS_INDEX_MAPPING) are forward-only and used
only one way. Be explicit and consistent: either always have both, or comment
why the asymmetry exists.

Style — get_thermal_settings has a stranded TODO comment.
cuda_core/cuda/core/system/_temperature.pxi:257 has
# TODO: The above docstring is from the NVML header, but it doesn't seem to make sense.
after the docstring. It would be clearer in the docstring itself, or fixed
properly while you're here.

Misc — nvml.GpuP2PStatus typo handled but with redundant key.
_GPU_P2P_STATUS_MAPPING (cuda_core/cuda/core/system/_device.pyx:164) maps
both P2P_STATUS_CHIPSET_NOT_SUPPORED (typo) and
P2P_STATUS_CHIPSET_NOT_SUPPORTED. Per the binding, these are aliases (same
int value), so one of these dict entries silently overwrites the other at
construction time. That's harmless, but worth a comment so a future cleanup
doesn't accidentally drop the typo'd alias and break callers using older
NVML.

Misc — CoolerTarget.GPU_RELATED not exposed.
nvml.CoolerTarget.THERMAL_GPU_RELATED is a composite bitmask
(GPU | MEMORY | POWER_SUPPLY); the new CoolerTarget StrEnum drops it
entirely. If the underlying NVML field is set to THERMAL_GPU_RELATED (and
the device reports it as a single composite rather than three individual
bits), the new code may iterate three bits and produce a longer list than
before. That's probably what's wanted, but worth a sentence in the property
docstring — and is exactly the sort of thing the sync-guard test would have
flagged.

Open questions

  1. Was a Brand StrEnum considered for device.brand and rejected? The
    current free-form str makes typed comparisons in user code impossible.
  2. Were the NvlinkVersion → tuple[int, int] and
    PciInfo.get_throughput → rx_throughput / tx_throughput changes intended
    to land here, or as separate PRs? They expand the scope beyond issue
    #1995 ("Replace string literals with enums in public API").
  3. Is the team OK with the silent breaking change for callers who were
    passing nvml.<EnumType>.X directly? If yes, please call it out in the PR
    body and add the breaking label; if no, consider a one-release deprecation
    path that accepts the old _FastEnum value with a DeprecationWarning.

@leofang leofang added the breaking label (Breaking changes are introduced) May 4, 2026
Contributor Author

mdboom commented May 4, 2026

<rant>I think these epic auto-reviews are really hard to address, because they include a lot of reasonable comments with a lot of noise. Can whatever robo-tool being used actually comment on lines in the PR? I already pre-reviewed this with the same model, so a lot of these ideas I already rejected, but then of course there are mistakes that this found that my run didn't so it's not entirely without value. Just seems like a lot of extra time vs. how GitHub was designed to be used 🤷</rant>

I am going to respond to everything I think is wrong. If not mentioned, assume I have addressed and fixed it.

Bug — test_nvlink references a removed symbol. cuda_core/tests/system/test_system_device.py:759 still does
assert isinstance(version, system.NvlinkVersion)

Nope.

 rg system\.NvLinkVersion

Bug — supported_pstates can iterate beyond the valid range.
cuda_core/cuda/core/system/_device.pyx:1001 walks
nvml.device_get_supported_performance_states(...) and only filters out
PSTATE_UNKNOWN (= 32). The NVML header says unused trailing slots are
PSTATE_UNKNOWN, but if the driver ever returned a value outside 0..15
(e.g., a future PSTATE_*), _pstate_to_int would silently return
int(x) - 0, producing an out-of-contract integer.

Disagree. If NVML doesn't follow the enum contract, all bets are off. I don't think we do this kind of defensive programming elsewhere. And it's Python -- it will raise, not segfault.

Bug — _pstate_to_enum name is wrong; takes an int and returns an int.
cuda_core/cuda/core/system/_device.pyx:29 is named _pstate_to_enum but
the body is
return int(pstate) + int(nvml.Pstates.PSTATE_0)

Since the enums used are Python enums, not Cython enums, we need to type it this way. Confusing, sure, but it's an internal convenience function. The fact that it's a no-op is fine -- it's to be resilient to changes in the underlying enum.

Bug — device.brand silently downgrades several brands to "Unknown".

Yep. Looks like your Opus 4.7 caught the error from my Opus 4.7 🙃

Behavior — get_supported_event_types includes EventType.NONE mapping,
but the bitmask path can never produce it. ... Currently device.register_events([EventType.NONE]) is a silent no-op
(bitmask stays 0), which is surprising; consider raising.

I think this is fine as-is.

Behavior — device.brand: BrandType → str

Yes, brands aren't really designed to be acted on programmatically -- they are primarily just a name to display to the user.

Docs — Device.performance_state documentation feels split.

Yes, this is the downside of moving away from numerical enums to an int. I'm not sure I agree with the solution.

Style — module-level docstring assignment is repeated and noisy.

Yes, but it's very explicit. I think the model's suggestion here is too magical.

Style — import sys + version branch duplicated in two .pyx files.

Again, this is fine, and pretty standard practice. If we do want to unify Python compat code, we should do it as a separate sweep.

Style — inverse mappings constructed inconsistently.

Python doesn't have dead code elimination, so we shouldn't create private objects that we would never use.

Style — get_thermal_settings has a stranded TODO comment.

No. This is where it should go. It can't go above the docstring.

Misc — nvml.GpuP2PStatus typo handled but with redundant key.
_GPU_P2P_STATUS_MAPPING (cuda_core/cuda/core/system/_device.pyx:164) maps
both P2P_STATUS_CHIPSET_NOT_SUPPORED (typo) and
P2P_STATUS_CHIPSET_NOT_SUPPORTED. Per the binding, these are aliases (same
int value), so one of these dict entries silently overwrites the other at
construction time.

I don't think we can guarantee that will always be the case, so it's safer as-is.

Was a Brand StrEnum considered for device.brand and rejected? The
current free-form str makes typed comparisons in user code impossible.

Yes, I think that's the right choice.

Were the NvlinkVersion → tuple[int, int] and
PciInfo.get_throughput → rx_throughput / tx_throughput changes intended
to land here, or as separate PRs? They expand the scope beyond #1995.

Yes. The scope was really to "be Pythonic", not to strictly adhere to wrapping enums as-is. If that were the goal, why bother with this at all?

Is the team OK with the silent breaking change for callers who were
passing nvml.&lt;EnumType&gt;.X directly? If yes, please call it out in the PR
body and add the breaking label; if no, consider a one-release deprecation
path that accepts the old _FastEnum value with a DeprecationWarning.

Yes, prior to 1.0 we are fine with any breaking change without notice.

Sync guards

The agent's analysis of this problem really gets to the heart of why I wasn't sure any of this was a good idea, though a lot of its analysis is based on an incomplete understanding of the version compatibility and guarantees between cuda_core and cuda_bindings. However, I plan to experiment with its first suggestion. If that doesn't bear immediate fruit, we can separate it out into a separate issue and consider sync guards across all of cuda_core. That is not blocking for 1.0.

Contributor

rwgk commented May 4, 2026

Can whatever robo-tool being used actually comment on lines in the PR?

It can, but I intentionally didn't make use of that feature. I believe it'll make it even harder to know what came from where and when. — I figure an agent can easily go from the details in the comments to code edits, so figured it's better to have one review in a one-piece comment.

I already pre-reviewed this with the same model, so a lot of these ideas I already rejected,

We could post such things as comments on the PRs before sending them out for review. — But again, I wouldn't want to have the tool auto-post comments. I agree there is a lot of noise, I want to weed out what I can, and actually read at least once what it wrote. So I always post my comments manually. (Maybe one day that'll be futile, as the tools get better, but I don't think we're there yet.)

    cdef object _pstate_to_int(object pstate):
        if pstate == nvml.Pstates.PSTATE_UNKNOWN:
            return None
        return int(pstate) - int(nvml.Pstates.PSTATE_0)
Contributor

I wouldn't let this go unchecked (bugs happen), but keep it simple:

    assert int(pstate) >= int(nvml.Pstates.PSTATE_0)

This is the only item my agent still pulled out as noteworthy.

@leofang leofang left a comment (Member)

Thanks, Mike, no major issues found.

Comment thread cuda_core/cuda/core/system/_device.pyx Outdated
Comment thread cuda_core/cuda/core/system/_clock.pxi Outdated
Comment thread cuda_core/cuda/core/system/_clock.pxi
Comment thread cuda_core/cuda/core/system/_event.pxi
Comment thread cuda_core/cuda/core/system/_device.pyx Outdated
    import warnings

    from cuda.bindings import nvml
    from cuda.bindings._internal._fast_enum import FastEnum
Member

Q: Do we need a try-except to check if FastEnum exists (and fall back to IntEnum if not)?

Contributor Author

Oh, yeah, I suppose this limits us to a recent-ish cuda_bindings. But all of cuda.core.system is already limited in the same way... Anyway, it can't hurt to be careful.

Member

That is actually a good point -- IIRC FastEnum was introduced at the same time when the nvml bindings were released?

Contributor Author

It was a little bit after. So I think doing a try/except ImportError thing is not a bad idea, at least for a little while.
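The try/except discussed here might look like the following; the fallback to IntEnum follows the earlier question in this thread, and the internal module path is the one quoted above (it may change across cuda_bindings releases):

```python
try:
    from cuda.bindings._internal._fast_enum import FastEnum
except ImportError:
    # Older cuda_bindings without FastEnum: fall back to IntEnum,
    # which also supports integer-valued member lookup.
    from enum import IntEnum as FastEnum
```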

@mdboom mdboom enabled auto-merge (squash) May 5, 2026 00:38
    return nvml.device_get_max_customer_boost_clock(self._handle, self._clock_type)

    def get_min_max_clock_of_pstate_mhz(self, pstate: Pstates) -> tuple[int, int]:
    def get_min_max_clock_of_pstate_mhz(self, pstate: int) -> tuple[int, int]:
Collaborator

Are we intending to move to an integral type here: Pstates -> int?


    import sys
    if sys.version_info >= (3, 11):
        from enum import StrEnum
Collaborator

Nit: Would it be better to have a common .py file that holds this implementation and all other cuda-python code can just import it? Rather than copy this version check to all the sites that need it?


    from cuda.bindings import nvml
    try:
        from cuda.bindings._internal._fast_enum import FastEnum
Collaborator

Nit: Ditto above about hoisting this backwards compat code into a common location that can be referenced.

The type of event that was triggered.
"""
return EventType(self._event_data.event_type)
return _EVENT_TYPE_MAPPING[self._event_data.event_type]
Collaborator

Nit: Bare dict lookup will raise KeyError if the driver returns an event type not in _EVENT_TYPE_MAPPING. Consider .get(...) with a sentinel/None (or wrap with a clearer error) so a property accessor doesn't blow up on values introduced by newer drivers.

The :obj:`~SystemEventType` that was triggered.
"""
return SystemEventType(self._event_data.event_type)
return _SYSTEM_EVENT_TYPE_MAPPING[self._event_data.event_type]
Collaborator

Nit: Same as in _event.pxi — bare lookup raises KeyError on unmapped values. Consider .get(...) with a fallback for forward-compat with newer drivers.

Contributor Author

This is fine and not worth the performance impact. The value comes from C++ code. It would be a runtime, not a user error, if that were ever to happen.

For all CUDA-capable discrete products with fans.
"""
return FanControlPolicy(nvml.device_get_fan_control_policy_v2(self._handle, self._fan))
return _FAN_CONTROL_POLICY_MAPPING[nvml.device_get_fan_control_policy_v2(self._handle, self._fan)]
Collaborator

Nit: Bare _FAN_CONTROL_POLICY_MAPPING[...] lookup will raise KeyError for any policy value the wrapper doesn't know. Other wrappers in this PR use .get(..., fallback) — worth being consistent.

Contributor Author

The other wrappers use .get because the value is either explicitly unbounded in the NVML docs or comes from the user. That is not the case here.

Comment on lines +29 to +31
    assert (
        int(pstate) >= 0 and int(pstate) <= 15
    ), f"Invalid P-state: {pstate}. Must be between 0 and 15 inclusive, or PSTATE_UNKNOWN."
Collaborator

assert is stripped under python -O, so this bounds check silently disappears in optimized runs. Prefer raising ValueError for runtime input validation.

Contributor Author

Again, we don't need to validate values coming from NVML, IMHO.

    return NvlinkVersion(nvml.device_get_nvlink_version(self._device._handle, self._link))
    version = nvml.device_get_nvlink_version(self._device._handle, self._link)
    if version == nvml.NvlinkVersion.VERSION_INVALID:
        raise RuntimeError(f"Invalid NvLink version returned for device")
Collaborator

f-string with no {} interpolation — either drop the f prefix or include the offending version value in the message (more useful for debugging). Ruff would flag this as F541; cython-lint won't.

Comment on lines +191 to +193
# Typo in upstream library
nvml.GpuP2PStatus.P2P_STATUS_CHIPSET_NOT_SUPPORED: GpuP2PStatus.CHIPSET_NOT_SUPPORTED,
nvml.GpuP2PStatus.P2P_STATUS_CHIPSET_NOT_SUPPORTED: GpuP2PStatus.CHIPSET_NOT_SUPPORTED,
Collaborator

The typo'd P2P_STATUS_CHIPSET_NOT_SUPPORED and the corrected P2P_STATUS_CHIPSET_NOT_SUPPORTED alias to the same integer value, so the second entry overwrites the first in the dict — the first line is dead. Either drop one or add a comment noting it's intentional dedup coverage.

Comment on lines +20 to +23
    if sys.version_info >= (3, 11):
        from enum import StrEnum
    else:
        from backports.strenum import StrEnum
Collaborator

Nit: This StrEnum / backports.strenum compat block is duplicated yet again here. Folds into the hoist suggested in the existing threads on _device.pyx (r3185426568, r3185428204).

Contributor Author

See #2019.

@mdboom mdboom merged commit ecd558a into NVIDIA:main May 5, 2026
94 checks passed

github-actions Bot commented May 5, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.


Labels

  • breaking — Breaking changes are introduced
  • cuda.core — Everything related to the cuda.core module
  • P0 — High priority - Must do!

4 participants