Check Hardware Accelerated GPU Scheduling (HAGS) status on Windows #1279
On Linux (including WSL) the file cuda_python/README.md is a real symlink, whereas Git on Windows checks it out as a plain file containing "../README.md" (without a trailing LF). When pre-commit runs under WSL, the end-of-file-fixer hook rewrites this file, and Git Bash can no longer handle the symlink-emulation file correctly, resulting in errors on subsequent git operations.
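One way to avoid this under WSL (a sketch, assuming the repository uses a standard `.pre-commit-config.yaml` with the pre-commit-hooks project; the `rev` is illustrative) is to exclude the symlink-emulation file from the end-of-file-fixer hook:

```yaml
# Hypothetical excerpt of .pre-commit-config.yaml; hook id is from
# https://github.com/pre-commit/pre-commit-hooks, rev is illustrative.
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.6.0
  hooks:
    - id: end-of-file-fixer
      exclude: ^cuda_python/README\.md$
```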
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
Is any of the functionality of CUDA Python impacted by Hardware Accelerated Scheduling? Outside of the timing implications of when the work is performed by the driver, is there anything in non-test code that we need to check for Hardware Accelerated Scheduling and do something different if it's enabled vs not enabled?
@rwgk What is the issue that you are trying to resolve with …
The main purpose of this PR is to protect end-users from silent failures. See commit 7e340af, which puts the safety guard right where it matters.

Current status of this PR: The protection works, but if it fires (…) the next step is to figure out how to make the test recover cleanly.

(Note that I'm planning to remove the bindings for the …)

(I'm also thinking of adding caching of the …)
/ok to test |
/ok to test |
nvml.dll is not part of the CTK but is installed with the CUDA driver. Adding --exclude nvml.dll to the delvewheel repair command prevents delvewheel from searching for this DLL during wheel repair, since it will be available system-wide at runtime.
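The adjusted repair step might look like the following (a sketch; the wheel and output paths are placeholders, and the option names follow recent delvewheel releases):

```shell
# Sketch of the wheel-repair invocation; paths are placeholders.
delvewheel repair --exclude nvml.dll \
    --wheel-dir wheelhouse/ \
    dist/cuda_core-*-win_amd64.whl
```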
/ok to test |
Add a comment block documenting the return codes, matching the style of hags_status.h. Also convert the file from binary to text mode per .gitattributes rules.
Use plural 'timings' to better match the verb 'obtain', which pairs more naturally with concrete measurements/results rather than an abstract concept.
Restore the symlink that was accidentally converted to a regular file, likely due to Windows/WSL2 interaction. This restores the file mode from 100644 (regular file) back to 120000 (symlink) to match main.
Replace os.name == 'nt' with sys.platform == 'win32' in:
- cuda_core/build_hooks.py (2 occurrences)
- cuda_core/tests/test_event.py (1 occurrence)

This matches the pattern used throughout the rest of the codebase for consistency.
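As a quick illustration of the two spellings (a sketch; the `IS_WINDOWS` name is hypothetical, not from the codebase):

```python
import os
import sys

# os.name is coarse-grained ("posix" or "nt"), while sys.platform
# distinguishes "linux", "darwin", "win32", etc. Note that "win32"
# is reported on 64-bit Windows as well.
IS_WINDOWS = sys.platform == "win32"

# On every supported platform the two spellings agree for Windows
# detection, so the change is purely about codebase consistency.
assert IS_WINDOWS == (os.name == "nt")
```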
/ok to test |
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
@kkraus14 asked:
I don't really know, but I know that SWQA ran into only this one error when they (evidently) missed turning on HAGS.

This PR is ready for review now, i.e. the work on the safety guard is complete, and it should be very easy to insert it elsewhere if we find more APIs that need to be guarded.
kkraus14 left a comment:
This isn't the right approach. We shouldn't disable using timed Events on Windows if the CUDA driver doesn't disable them.
For now, we should change the test to not be sensitive to the underlying scheduling of the driver in regards to creating and recording CUDA events.
This PR only prompts users to properly configure their system so that they get meaningful results. That's not disabling. We could just show a warning and continue, but that would most likely be overlooked in many situations. Simply adjusting the test code would also leave users falling into a trap where timings appear valid but are actually incorrect.
    def __sub__(self, other: Event):
        # return self - other (in milliseconds)
        if not self.is_timing_disabled and not other.is_timing_disabled:
            ensure_wddm_with_hags()
This will throw if hardware accelerated scheduling isn't enabled, no?
Responding to this question in isolation:
Yes, it'll show this message:
"Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled when the "
"Windows WDDM driver model is in use in order to obtain reliable CUDA event "
"timings. Please enable HAGS in the Windows graphics settings or switch to a "
"non-WDDM driver model."
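The Python-level shape of that guard could be sketched as follows (hypothetical; the real check queries the Windows driver model and HAGS state via a small C extension, not via boolean parameters):

```python
# Hypothetical sketch of the guard's behavior; in the actual PR the
# WDDM/HAGS state is detected natively, not passed in as arguments.
_HAGS_MESSAGE = (
    "Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled when the "
    "Windows WDDM driver model is in use in order to obtain reliable CUDA event "
    "timings. Please enable HAGS in the Windows graphics settings or switch to a "
    "non-WDDM driver model."
)

def ensure_wddm_with_hags(is_wddm: bool, hags_enabled: bool) -> None:
    # Raise only for the problematic combination: WDDM without HAGS.
    if is_wddm and not hags_enabled:
        raise RuntimeError(_HAGS_MESSAGE)
```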
    for (i = 0; EnumDisplayDevicesW(NULL, i, &dd, 0); ++i) {
        if (dd.StateFlags & DISPLAY_DEVICE_PRIMARY_DEVICE) {
            foundPrimary = TRUE;
What about cases where a) CUDA compute is being done on the non-primary GPU (i.e. you've got a separate GPU with no display that you use for compute), or b) laptops with the integrated GPUs as default (that are non-NVIDIA), where the discrete GPU is the NVIDIA one.
That granularity is indeed missing in this PR, because I didn't want to add the complexity to query based on the actual device active when the cuEventElapsedTime() call is reached.
How likely are those cases? Could it cause unwanted side-effects if HAGS is enabled?
They are not meaningless. Without HAGS, Windows does some level of batching in the scheduling, which means that there could be some inherent delay due to the batching, but it accurately captures what is actually happening under the hood as far as when the actual driver calls are executed.
Not using HAGS is still a properly configured system. There are compatibility reasons to not use HAGS. If the underlying CUDA driver that we're calling into allows the calls as expected, why wouldn't we?
Left a comment, but we're throwing when trying to compare event timings, which is effectively disabling, no? In theory someone could use those event timings outside of Python, but in that case why would we block them from being used in Python as well?
Ah, I didn't consider that. The errors reported by SWQA made me think they are meaningless, e.g.: we're expecting 500 ms, but measured a value close to zero. I could reproduce this consistently.

If we assume those measurements could be meaningful, I could simply change the test to …

Do we even need to check for WDDM/HAGS in that case? What we really want in our test is to exercise the bindings. If we get back any value, we know already that our bindings are working. The values are purely what comes back from code we don't actually intend to exercise. Unit tests for that code most likely exist elsewhere.

Maybe moot, but just to explain what was on my mind: Expectations for safety/correctness are generally much higher for Python, compared to C/C++.
Connected offline with @rwgk; summarizing the conversation here. The issue is that without HAGS, Windows will batch up calls into the CUDA driver and introduce delays that are hidden from the CPU. What this means is that calling the event record doesn't guarantee that the event recording actually runs then, but that it's added to a batch that will be submitted at a later time. Meanwhile, the …

Given we're just calling into underlying C++ libraries here, we think the best path forward is to just ensure the elapsed time is greater than 0 and avoid being in the business of handling scheduling or timing tolerances.
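The relaxed check discussed here can be sketched without any CUDA hardware (a hypothetical stand-in; `FakeEvent` is illustrative and not part of cuda.core):

```python
import time

class FakeEvent:
    """Hypothetical stand-in for cuda.core's Event, to illustrate the
    relaxed assertion: check sign, not magnitude."""
    def __init__(self, t: float):
        self._t = t
    def __sub__(self, other: "FakeEvent") -> float:
        return (self._t - other._t) * 1000.0  # elapsed time in milliseconds

start = FakeEvent(time.perf_counter())
time.sleep(0.01)
end = FakeEvent(time.perf_counter())
elapsed_ms = end - start

# Without HAGS, driver-call batching can distort the magnitude of the
# measurement, so only assert that a positive elapsed time came back;
# do not assert a tolerance around the nominal sleep duration.
assert elapsed_ms > 0
```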
Based on discussions on chat: Closing this PR, in favor of the far simpler PR #1285 |
Removed preview folders for the following PRs:
- PR #1279
The main purpose of this PR is to guard against silent failures: cuEventElapsedTime() results with the Windows WDDM driver model, but Hardware Accelerated GPU Scheduling (HAGS) not enabled.

Incorrect timing measurements can silently produce misleading performance metrics, causing developers to draw wrong conclusions about optimization effectiveness or leading to incorrect timeout calculations in production code. Without HAGS enabled, CUDA events on WDDM systems may report unreliable timings that appear valid but are fundamentally incorrect.
This change also eliminates recurring issues in QA testing where accidentally disabled HAGS settings caused spurious test failures that were time-consuming to diagnose.
Q:
A:
Short answer: the risk is effectively zero on "normal" supported Windows desktops/servers. If cuda-core can run there at all, user32.dll and gdi32.dll will be present, so the _event module will load.

Why this is safe

What libraries=["user32", "gdi32"] does: at build time, MSVC links your extension against the import libs user32.lib and gdi32.lib. At runtime, Windows will require the DLLs USER32.dll and GDI32.dll. These DLLs are core Windows components. On any supported Windows Client or Server SKU that people realistically use, both USER32.dll and GDI32.dll are always present in C:\Windows\System32. They've been there for decades; even most non-GUI server configurations still ship them.

pip users normally get wheels, not rebuilds. For most end users, pip install cuda-core will download a prebuilt wheel; the extension was already linked against user32 and gdi32 on our build machines. pip just unpacks the .pyd; it does not relink. If USER32.dll/GDI32.dll were missing, the entire Windows install would already be badly broken (and a ton of other software, including Python itself in many cases, would also fail).

The only realistic way this could fail is building from source (sdist) on a very stripped-down Windows image/container where the Windows SDK is available but user32.lib/gdi32.lib or the corresponding DLLs have been removed.
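For illustration, the link-time dependency can be declared on a setuptools Extension roughly like this (a sketch; the module and source file names are hypothetical, not the actual names used in cuda-core's build hooks):

```python
from setuptools import Extension

# Hypothetical sketch of declaring Windows link libraries for an
# extension module; names here are illustrative placeholders.
hags_ext = Extension(
    "cuda.core._event_hags",
    sources=["cuda_core/_hags_status.c"],
    # Resolved by MSVC to user32.lib / gdi32.lib at build time,
    # and to USER32.dll / GDI32.dll at runtime.
    libraries=["user32", "gdi32"],
)
```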