Re-wrap most of the enums in cuda.bindings.nvml for cuda.core.system.#2014
Re-wrap most of the enums in cuda.bindings.nvml for cuda.core.system.#2014mdboom merged 12 commits intoNVIDIA:mainfrom
Conversation
This comment has been minimized.
This comment has been minimized.
PR 2014 Agent Review
With small manual edits: I reduced the Compatibility subsection. I deleted the Next steps section entirely. The rest looks useful to me. The "Sync guard (structural gap)" finding wasn't in the initial automatic review; I asked about it in a follow-on prompt. It seems pretty easy to at least add basic protections. High-level summary (asked for by reviewer)This PR is the implementation of issue
The PR rewraps most NVML enums (except
It also depends on The direction is right and consistent with the design agreed in #1995. Findings (most severe first)BugsBug — Bug — Bug — Bug — error messages report bit index, not the failing bit value. for reason in _unpack_bitmask(reasons): # reason is a bit *index* (0,1,2,...)
try:
output_reason = _CLOCKS_EVENT_REASONS_MAPPING[1 << reason] # bit *value*
except KeyError:
raise ValueError(f"Unknown clock event reason bit: {reason}") # reports indexThe lookup uses Bug — Bug — pre-existing Sync guard (structural gap)There is no mechanism today, and none added by this PR, to ensure the This is a real risk for
What a guard could look like, in rough order of cost vs. coverage:
Given the milestone is May 7 and this review already turned up missing brand Behavior / compatibilityBehavior — Compatibility — Call out that this PR is a breaking change in the PR description / Behavior — Behavior — Behavior — DocsDocs — three private-doc enums still listed as cyclasses. Docs — Docs — stale references to old enum names.
TestsTest — temperature-thresholds slice loses one threshold. Test — Style / nitsStyle — module-level docstring assignment is repeated and noisy. class ClockId(StrEnum):
"""..."""
CURRENT = "current"
ClockId.CURRENT.__doc__ = "Current actual clock value."is duplicated across 8+ enums. Per #1995 mdboom called this "ugly but Style — Style — inverse mappings constructed inconsistently. Style — Misc — Misc — Open questions
|
|
I am going to respond to everything I think is wrong. If not mentioned, assume I have addressed and fixed it.
Nope.
Disagree. If NVML doesn't follow the enum contract, all bets are off. I don't think we do this kind of defensive programming elsewhere. And it's Python -- it will raise, not segfault.
Since the enums used are Python enums, not Cython enums, we need to type it this way. Confusing, sure, but it's an internal convenience function. The fact that it's a no-op is fine -- it's to be resilient to changes in the underlying enum.
Yep. Looks like your Opus 4.7 caught the error from my Opus 4.7 🙃
I think this is fine as-is.
Yes, brands aren't really designed to be acted on programmatically -- they are primarily just a name to display to the user.
Yes, this is the downside of moving away from numerical enums to an int. I'm not sure I agree with the solution.
Yes, but it's very explicit. I think the model's suggestion here is too magical.
Again, this is fine, and pretty standard practice. If we do want to unify Python compat code, we should do it as a separate sweep.
Python doesn't have dead code elimination, so we shouldn't create private objects that we would never use.
No. This is where it should go. It can't go above the docstring.
I don't think we can guarantee that will always be the case, so it's safer as-is.
Yes, I think that's the right choice.
Yes. The scope was really to "be Pythonic", not to strictly adhere to wrapping enums as-is. If that were the goal, why bother with this at all?
Yes, prior to 1.0 we are fine with any breaking change without notice. Sync guardsThe agent's analysis of this problem really gets to the heart of why I wasn't sure any of this was a good idea. Though a lot of its analysis is based on an incomplete understanding about version compatibility and guarantees between |
It can, but I intentionally didn't make use of that feature. I believe it'll make it even harder to know what came from where and when. — I figure an agent can easily go from the details in the comments to code edits, so figured it's better to have one review in a one-piece comment.
We could post such things as comments on the PRs before sending them out for review. — But again, I wouldn't want to have the tool auto-post comments. I agree there is a lot of noise, I want to weed out what I can, and actually read at least once what it wrote. So I always post my comments manually. (Maybe one day that'll be futile, as the tools get better, but I don't think we're there yet.) |
| cdef object _pstate_to_int(object pstate): | ||
| if pstate == nvml.Pstates.PSTATE_UNKNOWN: | ||
| return None | ||
| return int(pstate) - int(nvml.Pstates.PSTATE_0) |
There was a problem hiding this comment.
I wouldn't let this go unchecked (bugs happen), but keep it simple:
assert int(pstate) >= int(nvml.Pstates.PSTATE_0)
This is the only item my agent still pulled out as noteworthy.
leofang
left a comment
There was a problem hiding this comment.
Thanks, Mike, no major issues found.
| import warnings | ||
|
|
||
| from cuda.bindings import nvml | ||
| from cuda.bindings._internal._fast_enum import FastEnum |
There was a problem hiding this comment.
Q: Do we need a try-except to check if FastEnum exists (and fall back to IntEnum if not)?
There was a problem hiding this comment.
Oh, yeah, I suppose this limits us to a recent-ish cuda_bindings. But all of cuda.core.system is already limited in the same way... Anyway, it can't hurt to be careful.
There was a problem hiding this comment.
That is actually a good point -- IIRC FastEnum was introduced at the same time when the nvml bindings were released?
There was a problem hiding this comment.
It was a little bit after. So I think doing a try/except ImportError thing is not a bad idea, at least for a little while.
| return nvml.device_get_max_customer_boost_clock(self._handle, self._clock_type) | ||
|
|
||
| def get_min_max_clock_of_pstate_mhz(self, pstate: Pstates) -> tuple[int, int]: | ||
| def get_min_max_clock_of_pstate_mhz(self, pstate: int) -> tuple[int, int]: |
There was a problem hiding this comment.
Are we intending to move to a integral type here: Pstates -> int?
|
|
||
| import sys | ||
| if sys.version_info >= (3, 11): | ||
| from enum import StrEnum |
There was a problem hiding this comment.
Nit: Would it better to have a common .py file that holds this implementation and all other cuda-python code can just import it? Rather than copy this version check to all the sites that need it?
|
|
||
| from cuda.bindings import nvml | ||
| try: | ||
| from cuda.bindings._internal._fast_enum import FastEnum |
There was a problem hiding this comment.
Nit: Ditto above about hoisting this backwards compat code into a common location that can be referenced.
| The type of event that was triggered. | ||
| """ | ||
| return EventType(self._event_data.event_type) | ||
| return _EVENT_TYPE_MAPPING[self._event_data.event_type] |
There was a problem hiding this comment.
Nit: Bare dict lookup will raise KeyError if the driver returns an event type not in _EVENT_TYPE_MAPPING. Consider .get(...) with a sentinel/None (or wrap with a clearer error) so a property accessor doesn't blow up on values introduced by newer drivers.
| The :obj:`~SystemEventType` that was triggered. | ||
| """ | ||
| return SystemEventType(self._event_data.event_type) | ||
| return _SYSTEM_EVENT_TYPE_MAPPING[self._event_data.event_type] |
There was a problem hiding this comment.
Nit: Same as in _event.pxi — bare lookup raises KeyError on unmapped values. Consider .get(...) with a fallback for forward-compat with newer drivers.
There was a problem hiding this comment.
This is fine and not worth the performance impact. The value comes from C++ code. It would be a runtime, not a user error, if that were ever to happen.
| For all CUDA-capable discrete products with fans. | ||
| """ | ||
| return FanControlPolicy(nvml.device_get_fan_control_policy_v2(self._handle, self._fan)) | ||
| return _FAN_CONTROL_POLICY_MAPPING[nvml.device_get_fan_control_policy_v2(self._handle, self._fan)] |
There was a problem hiding this comment.
Nit: Bare _FAN_CONTROL_POLICY_MAPPING[...] lookup will raise KeyError for any policy value the wrapper doesn't know. Other wrappers in this PR use .get(..., fallback) — worth being consistent.
There was a problem hiding this comment.
The other wrappers use .get because the value is either explicitly unbounded in the NVML docs or comes from the user. That is not the case here.
| assert ( | ||
| int(pstate) >= 0 and int(pstate) <= 15 | ||
| ), f"Invalid P-state: {pstate}. Must be between 0 and 15 inclusive, or PSTATE_UNKNOWN." |
There was a problem hiding this comment.
assert is stripped under python -O, so this bounds check silently disappears in optimized runs. Prefer raising ValueError for runtime input validation.
There was a problem hiding this comment.
Again, we don't need to validate values coming from NVML, IMHO.
| return NvlinkVersion(nvml.device_get_nvlink_version(self._device._handle, self._link)) | ||
| version = nvml.device_get_nvlink_version(self._device._handle, self._link) | ||
| if version == nvml.NvlinkVersion.VERSION_INVALID: | ||
| raise RuntimeError(f"Invalid NvLink version returned for device") |
There was a problem hiding this comment.
f-string with no {} interpolation — either drop the f prefix or include the offending version value in the message (more useful for debugging). Ruff would flag this as F541; cython-lint won't.
| # Typo in upstream library | ||
| nvml.GpuP2PStatus.P2P_STATUS_CHIPSET_NOT_SUPPORED: GpuP2PStatus.CHIPSET_NOT_SUPPORTED, | ||
| nvml.GpuP2PStatus.P2P_STATUS_CHIPSET_NOT_SUPPORTED: GpuP2PStatus.CHIPSET_NOT_SUPPORTED, |
There was a problem hiding this comment.
The typo'd P2P_STATUS_CHIPSET_NOT_SUPPORED and the corrected P2P_STATUS_CHIPSET_NOT_SUPPORTED alias to the same integer value, so the second entry overwrites the first in the dict — the first line is dead. Either drop one or add a comment noting it's intentional dedup coverage.
| if sys.version_info >= (3, 11): | ||
| from enum import StrEnum | ||
| else: | ||
| from backports.strenum import StrEnum |
There was a problem hiding this comment.
Nit: This StrEnum / backports.strenum compat block is duplicated yet again here. Folds into the hoist suggested in the existing threads on _device.pyx (r3185426568, r3185428204).
|
This rewraps most of the enums (except the extremely large and unorganized
FieldId) forcuda.core.systemrather than passing them directly through. This also creates an optional string interface for all of these enum values.