
Conversation


@rwgk rwgk commented Nov 20, 2025

The main purpose of this PR is to guard against silent failures:

  • Invalid cuEventElapsedTime() results on Windows when the WDDM driver model is in use but Hardware Accelerated GPU Scheduling (HAGS) is not enabled.

Incorrect timing measurements can silently produce misleading performance metrics, causing developers to draw wrong conclusions about optimization effectiveness or leading to incorrect timeout calculations in production code. Without HAGS enabled, CUDA events on WDDM systems may report unreliable timings that appear valid but are fundamentally incorrect.

This change also eliminates recurring issues in QA testing where accidentally disabled HAGS settings caused spurious test failures that were time-consuming to diagnose.


Q:

How likely is it that the _event module fails to load because the required libs (user32, gdi32) are not available on external workstations?

To clarify: cuda-core is distributed via pypi, and installed with pip install cuda-core onto more-or-less random Windows systems.

A:

Short answer: the risk is effectively zero on "normal" supported Windows desktops/servers. If cuda-core can run there at all, user32.dll and gdi32.dll will be present, so the _event module will load.

Why this is safe

What libraries=["user32", "gdi32"] does:

At build time, MSVC links your extension against the import libs user32.lib and gdi32.lib. At runtime, Windows will require the DLLs:

  • USER32.dll
  • GDI32.dll

These DLLs are core Windows components.
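For illustration only, a link requirement like this is typically declared through a setuptools Extension; the module name and source file below are hypothetical, not cuda-core's actual build configuration:

```python
# Hypothetical sketch of declaring Win32 link dependencies in setuptools;
# the module name and source file are illustrative only.
import sys
from setuptools import Extension

ext = Extension(
    name="_event",                  # hypothetical extension module name
    sources=["src/_event_stub.c"],  # hypothetical source file
    # On Windows, MSVC resolves these to user32.lib / gdi32.lib at link
    # time, which makes the resulting .pyd require USER32.dll / GDI32.dll.
    libraries=["user32", "gdi32"] if sys.platform == "win32" else [],
)
print(ext.libraries)
```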

On any supported Windows Client or Server SKU where people realistically:

  • run Python
  • install CUDA
  • and use GPUs

both USER32.dll and GDI32.dll are always present in C:\Windows\System32. They’ve been there for decades; even most non-GUI server configurations still ship them.

pip users normally get wheels, not rebuilds.

For most end users:

  • pip install cuda-core will download a prebuilt wheel.
  • That wheel was linked against user32 and gdi32 on our build machines.
  • At install time, pip just unpacks the .pyd; it does not relink.
  • At import time, Python loads the .pyd; if USER32.dll / GDI32.dll were missing, the entire Windows install would already be badly broken (and a ton of other software, including Python itself in many cases, would also fail).

Source builds on “weird” environments.

The only realistic way this could fail is:

  • Someone installs from source (sdist) on a very stripped-down Windows image/container where the Windows SDK is available but user32.lib / gdi32.lib or the corresponding DLLs have been removed.
  • That’s an extremely exotic setup, and such an environment is already a bad fit for CUDA and Python GPU workloads.
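If one wanted a belt-and-suspenders runtime probe (not part of this PR, purely illustrative), the standard library can check whether these DLLs are even locatable:

```python
# Sketch: probe for the core Win32 DLLs the extension links against.
# Not part of the PR; purely illustrative.
import sys
import ctypes.util

def core_win32_dlls_present() -> bool:
    """True if user32/gdi32 can be located (trivially True off Windows,
    where the check does not apply)."""
    if sys.platform != "win32":
        return True
    return all(
        ctypes.util.find_library(name) is not None
        for name in ("user32", "gdi32")
    )

print("core DLLs present:", core_win32_dlls_present())
```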

rwgk added 3 commits November 20, 2025 10:58
…e-fixer

On Linux (including WSL) the file cuda_python/README.md is a real symlink,
whereas on Windows Git it is checked out as a plain file containing
"../README.md" (without a trailing LF). When pre-commit runs under WSL, the
end-of-file-fixer hook rewrites this file, and Git Bash can no longer handle
the symlink-emulation file correctly, resulting in errors on subsequent git
operations.

copy-pr-bot bot commented Nov 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented Nov 20, 2025

/ok to test

@rwgk rwgk changed the title WIP — Check Windows Hardware Acceleration setting (critical for event timings) Check Hardware Accelerated GPU Scheduling (HAGS) status on Windows Nov 20, 2025

@kkraus14
Collaborator

Is any of the functionality of CUDA Python impacted by Hardware Accelerated Scheduling? Outside of the timing due to implications of when the work is performed by the driver, is there anything in non-test code that we need to check for Hardware Accelerated Scheduling and do something different if it's enabled vs not enabled?

@rparolin
Collaborator

@rwgk What is the issue that you are trying to resolve with cudaEventElapsedTime? Are we seeing hardcoded timeouts failing in the code somewhere?


rwgk commented Nov 21, 2025

@kkraus14 @rparolin

The main purpose of this PR is to protect end-users from silent failures.

See commit 7e340af, which puts the safety guard right where it matters.

Current status of this PR:

The protection works, but if it fires (xfail added with commit 7e340af), the stream handling runs off the tracks. See below for completeness.

The next step is to figure out how to make the test recover cleanly.

(Note that I'm planning to remove the bindings for the hags_status() and wddm_driver_mode_is_in_use() helper functions.)

(I'm also thinking of adding caching of the ensure_hags_is_enabled_if_wddm_driver_model_is_in_use() result, after everything else is fully tested.)
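For illustration, the guard described above can be sketched as follows; the function name and error message mirror the PR, but the signature and inputs here are assumptions (the real implementation queries native Windows APIs):

```python
# Hypothetical sketch of the safety guard; the real implementation queries
# native Windows APIs. wddm_in_use / hags_enabled are assumed inputs.
def ensure_hags_is_enabled_if_wddm_driver_model_is_in_use(
    wddm_in_use: bool, hags_enabled: bool
) -> None:
    if wddm_in_use and not hags_enabled:
        raise RuntimeError(
            "Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled "
            "when the Windows WDDM driver model is in use in order to obtain "
            "reliable CUDA event timings."
        )
```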

============================= test session starts =============================
platform win32 -- Python 3.13.9, pytest-9.0.1, pluggy-1.6.0 -- C:\Users\rgrossekunst\wrk\forked\cuda-python\TestVenv\Scripts\python.exe
cachedir: .pytest_cache
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core
configfile: pytest.ini
plugins: benchmark-5.2.3
collecting ... collected 18 items

tests/test_event.py::test_event_init_disabled PASSED
tests/test_event.py::test_timing_success
LOOOK hags=2

LOOOK wddm=1
XFAIL (HAGS is not fully en...)
tests/test_event.py::test_is_sync_busy_waited PASSED
tests/test_event.py::test_sync PASSED
tests/test_event.py::test_is_done PASSED
tests/test_event.py::test_error_timing_disabled PASSED
tests/test_event.py::test_error_timing_recorded XFAIL (HAGS is not f...)
tests/test_event.py::test_error_timing_incomplete XFAIL (HAGS is not...)
tests/test_event.py::test_event_device PASSED
tests/test_event.py::test_event_context PASSED
tests/test_event.py::test_event_subclassing PASSED
tests/test_event.py::test_event_equality_reflexive PASSED
tests/test_event.py::test_event_inequality_different_events PASSED
tests/test_event.py::test_event_type_safety PASSED
tests/test_event.py::test_event_hash_consistency PASSED
tests/test_event.py::test_event_hash_equality PASSED
tests/test_event.py::test_event_dict_key PASSED
pytest : Traceback (most recent call last):
At line:1 char:1
+ pytest -ra -s -v .\tests\test_event.py *>&1 | Tee-Object -FilePath '. ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "cuda/core/experimental/_memory/_buffer.pyx", line 117, in cuda.core.experimental._memory._buffer.Buffer.close
    Buffer_close(self, stream)
  File "cuda/core/experimental/_memory/_buffer.pyx", line 288, in cuda.core.experimental._memory._buffer.Buffer_close
    self._memory_resource.deallocate(self._ptr, self._size, s)
  File "C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\cuda\core\experimental\_memory\_legacy.py", line 65, in deallocate
    stream.sync()
    ~~~~~~~~~~~^^
  File "cuda/core/experimental/_stream.pyx", line 258, in cuda.core.experimental._stream.Stream.sync
    HANDLE_RETURN(cydriver.cuStreamSynchronize(self._handle))
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 60, in cuda.core.experimental._utils.cuda_utils.HANDLE_RETURN
    return _check_driver_error(err)
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 78, in cuda.core.experimental._utils.cuda_utils._check_driver_error
    raise CUDAError(f"{name.decode()}: {expl}")
cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if
the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e.
3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using
::cuCtxFromGreenCtx API.
C:\Users\rgrossekunst\wrk\forked\cuda-python\TestVenv\Lib\site-packages\_pytest\unraisableexception.py:67: PytestUnraisableExceptionWarning: Exception ignored in:
'cuda.core.experimental._memory._buffer.Buffer.__dealloc__'

Traceback (most recent call last):
  File "cuda/core/experimental/_memory/_buffer.pyx", line 117, in cuda.core.experimental._memory._buffer.Buffer.close
    Buffer_close(self, stream)
  File "cuda/core/experimental/_memory/_buffer.pyx", line 288, in cuda.core.experimental._memory._buffer.Buffer_close
    self._memory_resource.deallocate(self._ptr, self._size, s)
  File "C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\cuda\core\experimental\_memory\_legacy.py", line 65, in deallocate
    stream.sync()
    ~~~~~~~~~~~^^
  File "cuda/core/experimental/_stream.pyx", line 258, in cuda.core.experimental._stream.Stream.sync
    HANDLE_RETURN(cydriver.cuStreamSynchronize(self._handle))
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 60, in cuda.core.experimental._utils.cuda_utils.HANDLE_RETURN
    return _check_driver_error(err)
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 78, in cuda.core.experimental._utils.cuda_utils._check_driver_error
    raise CUDAError(f"{name.decode()}: {expl}")
cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if
the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e.
3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using
::cuCtxFromGreenCtx API.

Enable tracemalloc to get traceback where the object was allocated.
See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info.
  warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
tests/test_event.py::test_event_set_membership PASSED

=========================== short test summary info ===========================
XFAIL tests/test_event.py::test_timing_success - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
XFAIL tests/test_event.py::test_error_timing_recorded - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
XFAIL tests/test_event.py::test_error_timing_incomplete - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
======================== 15 passed, 3 xfailed in 1.42s ========================


rwgk commented Nov 22, 2025

/ok to test


rwgk commented Nov 23, 2025

/ok to test

nvml.dll is not part of the CTK but is installed with the CUDA driver.
Adding --exclude nvml.dll to the delvewheel repair command prevents
delvewheel from searching for this DLL during wheel repair, since it
will be available system-wide at runtime.

rwgk commented Nov 23, 2025

/ok to test

Add a comment block documenting the return codes, matching the style
of hags_status.h. Also convert the file from binary to text mode
per .gitattributes rules.
rwgk added 3 commits November 22, 2025 23:02
Use plural 'timings' to better match the verb 'obtain', which pairs
more naturally with concrete measurements/results rather than an
abstract concept.
Restore the symlink that was accidentally converted to a regular file,
likely due to Windows/WSL2 interaction. This restores the file mode
from 100644 (regular file) back to 120000 (symlink) to match main.
Replace os.name == 'nt' with sys.platform == 'win32' in:
- cuda_core/build_hooks.py (2 occurrences)
- cuda_core/tests/test_event.py (1 occurrence)

This matches the pattern used throughout the rest of the codebase
for consistency.

rwgk commented Nov 23, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review November 23, 2025 18:28

copy-pr-bot bot commented Nov 23, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented Nov 23, 2025

@kkraus14 asked:

Is any of the functionality of CUDA Python impacted by Hardware Accelerated Scheduling? Outside of the timing due to implications of when the work is performed by the driver, is there anything in non-test code that we need to check for Hardware Accelerated Scheduling and do something different if it's enabled vs not enabled?

I don't really know, but I know that SWQA ran into only this one error when they (evidently) missed turning on HAGS.

This PR is ready for review now, i.e. the work on the safety guard is complete, and it should be very easy to insert it elsewhere, if we find more APIs that need to be guarded.

@rwgk rwgk requested a review from mdboom November 23, 2025 18:29
@leofang leofang added the triage Needs the team's attention label Nov 24, 2025
@leofang leofang requested review from kkraus14 and tpn November 24, 2025 00:47
@kkraus14 kkraus14 left a comment

This isn't the right approach. We shouldn't disable using timed Events on Windows if the CUDA driver doesn't disable them.

For now, we should change the test to not be sensitive to the underlying scheduling of the driver in regards to creating and recording CUDA events.


rwgk commented Nov 24, 2025

This isn't the right approach. We shouldn't disable using timed Events on Windows if the CUDA driver doesn't disable them.

The cuEventElapsedTime() results are meaningless on Windows when using the WDDM driver model without HAGS enabled.

This PR only prompts users to properly configure their system so that they get meaningful results. That's not disabling. We could just show a warning and continue, but that'll most likely get overlooked in many situations. Simply adjusting the test code would also leave users falling into a trap where timings appear valid but are actually incorrect.

    def __sub__(self, other: Event):
        # return self - other (in milliseconds)
        if not self.is_timing_disabled and not other.is_timing_disabled:
            ensure_wddm_with_hags()
Collaborator

This will throw if hardware accelerated scheduling isn't enabled, no?

Collaborator Author

Responding to this question in isolation:

Yes, it'll show this message:

    "Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled when the "
    "Windows WDDM driver model is in use in order to obtain reliable CUDA event "
    "timings. Please enable HAGS in the Windows graphics settings or switch to a "
    "non-WDDM driver model."


    for (i = 0; EnumDisplayDevicesW(NULL, i, &dd, 0); ++i) {
        if (dd.StateFlags & DISPLAY_DEVICE_PRIMARY_DEVICE) {
            foundPrimary = TRUE;

What about cases where a) CUDA compute is being done on a non-primary GPU (i.e. you've got a separate GPU with no display that you use for compute), or b) laptops where a non-NVIDIA integrated GPU is the default and the discrete GPU is the NVIDIA one?

Collaborator Author

That granularity is indeed missing in this PR, because I didn't want to add the complexity of querying based on the actual device active when the cuEventElapsedTime() call is reached.

How likely are those cases? Could it cause unwanted side-effects if HAGS is enabled?

@kkraus14
Collaborator

The cuEventElapsedTime() results are meaningless on Windows when using the WDDM driver model without HAGS enabled.

They are not meaningless. Without HAGS, Windows does some level of batching in the scheduling, which means that there could be some inherent delay due to the batching, but it accurately captures what is actually happening under the hood as far as when the actual driver calls are executed.

This PR only prompts users to properly configure their system so that they get meaningful results.

Not using HAGS is still a properly configured system. There are compatibility reasons to not use HAGS. If the underlying CUDA driver that we're calling into allows the calls as expected, why wouldn't we?

That's not disabling. We could just show a warning and continue, but that'll most likely get overlooked in many situations. Simply adjusting the test code would also leave users falling into a trap where timings appear valid but are actually incorrect.

Left a comment, but we're throwing when trying to compare event timing which is effectively disabling, no? In theory someone could use those events timings outside of Python, but in that case why would we block it from being used in Python as well?


rwgk commented Nov 24, 2025

They are not meaningless.

Ah, I didn't consider that.

The errors reported by SWQA made me think they are meaningless, e.g.:

>       assert delay_ms - generous_tolerance <= elapsed_time_ms < delay_ms + generous_tolerance
E       assert (500.0 - 100) <= 0.003871999913826585

We're expecting 500 ms, but measured a value close to zero.

I could reproduce this consistently:

qa_bindings_windows_tests_log_2025-11-13+173730.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.003135999897494912
qa_bindings_windows_tests_log_2025-11-13+213245.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.003488000016659498
qa_bindings_windows_tests_log_2025-11-13+225102.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.002208000048995018
qa_bindings_windows_tests_log_2025-11-13+225745.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.002623999956995249
qa_bindings_windows_tests_log_2025-11-13+235819.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.004191999789327383
qa_bindings_windows_tests_log_2025-11-14+002909.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.0023360000923275948
qa_bindings_windows_tests_log_2025-11-14+011036.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.004255999810993671

If we assume those measurements could be meaningful, I could simply change the test to xfail.

Do we even need to check for WDDM/HAGS in that case?

What we really want in our test is to exercise the bindings. If we get back any value, we already know that our bindings are working. The values are purely what comes back from code we don't actually intend to exercise. Unit tests for that code most likely exist elsewhere.

Maybe moot, but just to explain what was on my mind:

In theory someone could use those events timings outside of Python, but in that case why would we block it from being used in Python as well?

Expectations for safety/correctness are generally much higher for Python, compared to C/C++.

@kkraus14
Collaborator

Connected offline with @rwgk, summarizing the conversation here.

The issue here is that without HAGS, Windows will batch up calls into the CUDA driver and introduce delays that are hidden from the CPU. What this means is that calling the event record doesn't guarantee that the event recording actually runs then, but that it's added to a batch that will be submitted at a later time. Meanwhile, the time.sleep(0.5) will start immediately and can actually finish before that batch actually runs. If we replaced the time.sleep with a basic 1 thread kernel using something like __nanosleep, then the event timing would be accurate (ignoring kernel pre-emption from graphics workloads, in which case the actual event timing would be correct, but wouldn't match the amount of time the kernel is crafted for).

Given we're just calling into underlying C++ libraries here, we think the best path forward is to just ensure the elapsed time is greater than 0 and avoid being in the business of handling scheduling or timing tolerances.
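As a rough illustration (assumed shape, not the actual test code in cuda_core/tests/test_event.py), the relaxed check could be as simple as:

```python
import math

def elapsed_time_is_sane(elapsed_time_ms: float) -> bool:
    # Only verify that cuEventElapsedTime() round-trips a positive, finite
    # value through the bindings; make no assumptions about driver-side
    # batching or scheduling delays.
    return math.isfinite(elapsed_time_ms) and elapsed_time_ms > 0.0
```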


rwgk commented Nov 24, 2025

Based on discussions on chat: Closing this PR, in favor of the far simpler PR #1285

@rwgk rwgk closed this Nov 24, 2025
@rwgk rwgk deleted the cuda_core_hags branch November 24, 2025 23:16
github-actions bot pushed a commit that referenced this pull request Nov 25, 2025
Removed preview folders for the following PRs:
- PR #1279
