
Conversation


@rwgk rwgk commented Nov 20, 2025

The main purpose of this PR is to guard against silent failures:

  • Invalid cuEventElapsedTime() results on Windows when the WDDM driver model is in use but Hardware Accelerated GPU Scheduling (HAGS) is not enabled.

Incorrect timing measurements can silently produce misleading performance metrics, causing developers to draw wrong conclusions about optimization effectiveness or leading to incorrect timeout calculations in production code. Without HAGS enabled, CUDA events on WDDM systems may report unreliable timings that appear valid but are fundamentally incorrect.

This change also eliminates recurring issues in QA testing where accidentally disabled HAGS settings caused spurious test failures that were time-consuming to diagnose.


Q:

How likely is it that the _event module fails to load because the required libs (user32, gdi32) are not available on external workstations?

To clarify: cuda-core is distributed via pypi, and installed with pip install cuda-core onto more-or-less random Windows systems.

A:

Short answer: the risk is effectively zero on "normal" supported Windows desktops/servers. If cuda-core can run there at all, user32.dll and gdi32.dll will be present, so the _event module will load.

Why this is safe

What libraries=["user32", "gdi32"] does:

At build time, MSVC links your extension against the import libs user32.lib and gdi32.lib. At runtime, Windows will require the DLLs:

  • USER32.dll
  • GDI32.dll

These DLLs are core Windows components.
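For illustration only, a link requirement like this is typically declared through a setuptools Extension; the module name and source file below are hypothetical, not cuda-core's actual build configuration:

```python
# Hypothetical sketch of declaring Win32 link dependencies in setuptools;
# the module name and source file are illustrative only.
import sys
from setuptools import Extension

ext = Extension(
    name="_event",                  # hypothetical extension module name
    sources=["src/_event_stub.c"],  # hypothetical source file
    # On Windows, MSVC resolves these to user32.lib / gdi32.lib at link
    # time, which makes the resulting .pyd require USER32.dll / GDI32.dll.
    libraries=["user32", "gdi32"] if sys.platform == "win32" else [],
)
print(ext.libraries)
```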

On any supported Windows Client or Server SKU where people realistically:

  • run Python
  • install CUDA
  • and use GPUs

both USER32.dll and GDI32.dll are always present in C:\Windows\System32. They’ve been there for decades; even most non-GUI server configurations still ship them.

pip users normally get wheels, not rebuilds.

For most end users:

  • pip install cuda-core will download a prebuilt wheel.
  • That wheel was linked against user32 and gdi32 on our build machines.
  • At install time, pip just unpacks the .pyd; it does not relink.
  • At import time, Python loads the .pyd; if USER32.dll / GDI32.dll were missing, the entire Windows install would already be badly broken (and a ton of other software, including Python itself in many cases, would also fail).

Source builds on “weird” environments.

The only realistic way this could fail is:

  • Someone installs from source (sdist) on a very stripped-down Windows image/container where the Windows SDK is available but user32.lib / gdi32.lib or the corresponding DLLs have been removed.
  • That’s an extremely exotic setup, and such an environment is already a bad fit for CUDA and Python GPU workloads.
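If one wanted a belt-and-suspenders runtime probe (not part of this PR, purely illustrative), the standard library can check whether these DLLs are even locatable:

```python
# Sketch: probe for the core Win32 DLLs the extension links against.
# Not part of the PR; purely illustrative.
import sys
import ctypes.util

def core_win32_dlls_present() -> bool:
    """True if user32/gdi32 can be located (trivially True off Windows,
    where the check does not apply)."""
    if sys.platform != "win32":
        return True
    return all(
        ctypes.util.find_library(name) is not None
        for name in ("user32", "gdi32")
    )

print("core DLLs present:", core_win32_dlls_present())
```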

rwgk added 3 commits November 20, 2025 10:58
…e-fixer

On Linux (including WSL) the file cuda_python/README.md is a real symlink,
whereas on Windows Git it is checked out as a plain file containing
"../README.md" (without a trailing LF). When pre-commit runs under WSL, the
end-of-file-fixer hook rewrites this file, and Git Bash can no longer handle
the symlink-emulation file correctly, resulting in errors on subsequent git
operations.

copy-pr-bot bot commented Nov 20, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented Nov 20, 2025

/ok to test

@rwgk rwgk changed the title WIP — Check Windows Hardware Acceleration setting (critical for event timings) Check Hardware Accelerated GPU Scheduling (HAGS) status on Windows Nov 20, 2025

@kkraus14
Collaborator

Is any of the functionality of CUDA Python impacted by Hardware Accelerated Scheduling? Outside of the timing due to implications of when the work is performed by the driver, is there anything in non-test code that we need to check for Hardware Accelerated Scheduling and do something different if it's enabled vs not enabled?

@rparolin
Collaborator

@rwgk What is the issue that you are trying to resolve with cudaEventElapsedTime? Are we seeing hardcoded timeouts failing in the code somewhere?


rwgk commented Nov 21, 2025

@kkraus14 @rparolin

The main purpose of this PR is to protect end-users from silent failures.

See commit 7e340af, which puts the safety guard right where it matters.

Current status of this PR:

The protection works, but if it fires (xfail added with commit 7e340af), the stream handling runs off the tracks. See below for completeness.

The next step is to figure out how to make the test recover cleanly.

(Note that I'm planning to remove the bindings for the hags_status() and wddm_driver_mode_is_in_use() helper functions.)

(I'm also thinking of adding caching of the ensure_hags_is_enabled_if_wddm_driver_model_is_in_use() result, after everything else is fully tested.)
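For illustration, the guard described above can be sketched as follows; the function name and error message mirror the PR, but the signature and inputs here are assumptions (the real implementation queries native Windows APIs):

```python
# Hypothetical sketch of the safety guard; the real implementation queries
# native Windows APIs. wddm_in_use / hags_enabled are assumed inputs.
def ensure_hags_is_enabled_if_wddm_driver_model_is_in_use(
    wddm_in_use: bool, hags_enabled: bool
) -> None:
    if wddm_in_use and not hags_enabled:
        raise RuntimeError(
            "Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled "
            "when the Windows WDDM driver model is in use in order to obtain "
            "reliable CUDA event timings."
        )
```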

============================= test session starts =============================
platform win32 -- Python 3.13.9, pytest-9.0.1, pluggy-1.6.0 -- C:\Users\rgrossekunst\wrk\forked\cuda-python\TestVenv\Scripts\python.exe
cachedir: .pytest_cache
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core
configfile: pytest.ini
plugins: benchmark-5.2.3
collecting ... collected 18 items

tests/test_event.py::test_event_init_disabled PASSED
tests/test_event.py::test_timing_success
LOOOK hags=2

LOOOK wddm=1
XFAIL (HAGS is not fully en...)
tests/test_event.py::test_is_sync_busy_waited PASSED
tests/test_event.py::test_sync PASSED
tests/test_event.py::test_is_done PASSED
tests/test_event.py::test_error_timing_disabled PASSED
tests/test_event.py::test_error_timing_recorded XFAIL (HAGS is not f...)
tests/test_event.py::test_error_timing_incomplete XFAIL (HAGS is not...)
tests/test_event.py::test_event_device PASSED
tests/test_event.py::test_event_context PASSED
tests/test_event.py::test_event_subclassing PASSED
tests/test_event.py::test_event_equality_reflexive PASSED
tests/test_event.py::test_event_inequality_different_events PASSED
tests/test_event.py::test_event_type_safety PASSED
tests/test_event.py::test_event_hash_consistency PASSED
tests/test_event.py::test_event_hash_equality PASSED
tests/test_event.py::test_event_dict_key PASSED
pytest : Traceback (most recent call last):
At line:1 char:1
+ pytest -ra -s -v .\tests\test_event.py *>&1 | Tee-Object -FilePath '. ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (Traceback (most recent call last)::String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

  File "cuda/core/experimental/_memory/_buffer.pyx", line 117, in cuda.core.experimental._memory._buffer.Buffer.close
    Buffer_close(self, stream)
  File "cuda/core/experimental/_memory/_buffer.pyx", line 288, in cuda.core.experimental._memory._buffer.Buffer_close
    self._memory_resource.deallocate(self._ptr, self._size, s)
  File "C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\cuda\core\experimental\_memory\_legacy.py", line 65, in deallocate
    stream.sync()
    ~~~~~~~~~~~^^
  File "cuda/core/experimental/_stream.pyx", line 258, in cuda.core.experimental._stream.Stream.sync
    HANDLE_RETURN(cydriver.cuStreamSynchronize(self._handle))
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 60, in cuda.core.experimental._utils.cuda_utils.HANDLE_RETURN
    return _check_driver_error(err)
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 78, in cuda.core.experimental._utils.cuda_utils._check_driver_error
    raise CUDAError(f"{name.decode()}: {expl}")
cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if
the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e.
3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using
::cuCtxFromGreenCtx API.
C:\Users\rgrossekunst\wrk\forked\cuda-python\TestVenv\Lib\site-packages\_pytest\unraisableexception.py:67: PytestUnraisableExceptionWarning: Exception ignored in:
'cuda.core.experimental._memory._buffer.Buffer.__dealloc__'

Traceback (most recent call last):
  File "cuda/core/experimental/_memory/_buffer.pyx", line 117, in cuda.core.experimental._memory._buffer.Buffer.close
    Buffer_close(self, stream)
  File "cuda/core/experimental/_memory/_buffer.pyx", line 288, in cuda.core.experimental._memory._buffer.Buffer_close
    self._memory_resource.deallocate(self._ptr, self._size, s)
  File "C:\Users\rgrossekunst\wrk\forked\cuda-python\cuda_core\cuda\core\experimental\_memory\_legacy.py", line 65, in deallocate
    stream.sync()
    ~~~~~~~~~~~^^
  File "cuda/core/experimental/_stream.pyx", line 258, in cuda.core.experimental._stream.Stream.sync
    HANDLE_RETURN(cydriver.cuStreamSynchronize(self._handle))
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 60, in cuda.core.experimental._utils.cuda_utils.HANDLE_RETURN
    return _check_driver_error(err)
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 78, in cuda.core.experimental._utils.cuda_utils._check_driver_error
    raise CUDAError(f"{name.decode()}: {expl}")
cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if
the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e.
3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using
::cuCtxFromGreenCtx API.

Enable tracemalloc to get traceback where the object was allocated.
See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info.
  warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
tests/test_event.py::test_event_set_membership PASSED

=========================== short test summary info ===========================
XFAIL tests/test_event.py::test_timing_success - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
XFAIL tests/test_event.py::test_error_timing_recorded - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
XFAIL tests/test_event.py::test_error_timing_incomplete - HAGS is not fully enabled while the Windows WDDM driver model is in use; event timing tests are expected to fail in this configuration.
======================== 15 passed, 3 xfailed in 1.42s ========================


rwgk commented Nov 22, 2025

/ok to test


rwgk commented Nov 23, 2025

/ok to test

nvml.dll is not part of the CTK but is installed with the CUDA driver.
Adding --exclude nvml.dll to the delvewheel repair command prevents
delvewheel from searching for this DLL during wheel repair, since it
will be available system-wide at runtime.

rwgk commented Nov 23, 2025

/ok to test

Add a comment block documenting the return codes, matching the style
of hags_status.h. Also convert the file from binary to text mode
per .gitattributes rules.
rwgk added 3 commits November 22, 2025 23:02
Use plural 'timings' to better match the verb 'obtain', which pairs
more naturally with concrete measurements/results rather than an
abstract concept.
Restore the symlink that was accidentally converted to a regular file,
likely due to Windows/WSL2 interaction. This restores the file mode
from 100644 (regular file) back to 120000 (symlink) to match main.
Replace os.name == 'nt' with sys.platform == 'win32' in:
- cuda_core/build_hooks.py (2 occurrences)
- cuda_core/tests/test_event.py (1 occurrence)

This matches the pattern used throughout the rest of the codebase
for consistency.

rwgk commented Nov 23, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review November 23, 2025 18:28

copy-pr-bot bot commented Nov 23, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


rwgk commented Nov 23, 2025

@kkraus14 asked:

Is any of the functionality of CUDA Python impacted by Hardware Accelerated Scheduling? Outside of the timing due to implications of when the work is performed by the driver, is there anything in non-test code that we need to check for Hardware Accelerated Scheduling and do something different if it's enabled vs not enabled?

I don't really know, but I know that SWQA ran into only this one error when they (evidently) missed turning on HAGS.

This PR is ready for review now, i.e. the work on the safety guard is complete, and it should be very easy to insert it elsewhere, if we find more APIs that need to be guarded.

@rwgk rwgk requested a review from mdboom November 23, 2025 18:29
@leofang leofang added the triage Needs the team's attention label Nov 24, 2025
@leofang leofang requested review from kkraus14 and tpn November 24, 2025 00:47
@kkraus14 kkraus14 left a comment

This isn't the right approach. We shouldn't disable using timed Events on Windows if the CUDA driver doesn't disable them.

For now, we should change the test to not be sensitive to the underlying scheduling of the driver in regards to creating and recording CUDA events.


rwgk commented Nov 24, 2025

This isn't the right approach. We shouldn't disable using timed Events on Windows if the CUDA driver doesn't disable them.

The cuEventElapsedTime() results are meaningless on Windows when using the WDDM driver model without HAGS enabled.

This PR only prompts users to properly configure their system so that they get meaningful results. That's not disabling. We could just show a warning and continue, but that'll most likely get overlooked in many situations. Simply adjusting the test code would also leave users falling into a trap where timings appear valid but are actually incorrect.

    def __sub__(self, other: Event):
        # return self - other (in milliseconds)
        if not self.is_timing_disabled and not other.is_timing_disabled:
            ensure_wddm_with_hags()
Collaborator

This will throw if hardware accelerated scheduling isn't enabled, no?

Collaborator Author

Responding to this question in isolation:

Yes, it'll show this message:

    "Hardware Accelerated GPU Scheduling (HAGS) must be fully enabled when the "
    "Windows WDDM driver model is in use in order to obtain reliable CUDA event "
    "timings. Please enable HAGS in the Windows graphics settings or switch to a "
    "non-WDDM driver model."


    for (i = 0; EnumDisplayDevicesW(NULL, i, &dd, 0); ++i) {
        if (dd.StateFlags & DISPLAY_DEVICE_PRIMARY_DEVICE) {
            foundPrimary = TRUE;

What about cases where a) CUDA compute is being done on a non-primary GPU (i.e. you've got a separate GPU with no display that you use for compute), or b) laptops where a non-NVIDIA integrated GPU is the default and the discrete GPU is the NVIDIA one?

Collaborator Author

That granularity is indeed missing in this PR, because I didn't want to add the complexity of querying based on the actual device active when the cuEventElapsedTime() call is reached.

How likely are those cases? Could it cause unwanted side-effects if HAGS is enabled?

@kkraus14
Collaborator

The cuEventElapsedTime() results are meaningless on Windows when using the WDDM driver model without HAGS enabled.

They are not meaningless. Without HAGS, Windows does some level of batching in the scheduling, which means that there could be some inherent delay due to the batching, but it accurately captures what is actually happening under the hood as far as when the actual driver calls are executed.

This PR only prompts users to properly configure their system so that they get meaningful results.

Not using HAGS is still a properly configured system. There are compatibility reasons to not use HAGS. If the underlying CUDA driver that we're calling into allows the calls as expected, why wouldn't we?

That's not disabling. We could just show a warning and continue, but that'll most likely get overlooked in many situations. Simply adjusting the test code would also leave users falling into a trap where timings appear valid but are actually incorrect.

Left a comment, but we're throwing when trying to compare event timing which is effectively disabling, no? In theory someone could use those events timings outside of Python, but in that case why would we block it from being used in Python as well?


rwgk commented Nov 24, 2025

They are not meaningless.

Ah, I didn't consider that.

The errors reported by SWQA made me think they are meaningless, e.g.:

>       assert delay_ms - generous_tolerance <= elapsed_time_ms < delay_ms + generous_tolerance
E       assert (500.0 - 100) <= 0.003871999913826585

We're expecting 500 ms, but measured a value close to zero.

I could reproduce this consistently:

qa_bindings_windows_tests_log_2025-11-13+173730.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.003135999897494912
qa_bindings_windows_tests_log_2025-11-13+213245.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.003488000016659498
qa_bindings_windows_tests_log_2025-11-13+225102.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.002208000048995018
qa_bindings_windows_tests_log_2025-11-13+225745.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.002623999956995249
qa_bindings_windows_tests_log_2025-11-13+235819.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.004191999789327383
qa_bindings_windows_tests_log_2025-11-14+002909.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.0023360000923275948
qa_bindings_windows_tests_log_2025-11-14+011036.txt:FAILED tests\test_event.py::test_timing_success - assert (500.0 - 100) <= 0.004255999810993671

If we assume those measurements could be meaningful, I could simply change the test to xfail.

Do we even need to check for WDDM/HAGS in that case?

What we really want in our test is to exercise the bindings. If we get back any value, we already know that our bindings are working. The values are purely what comes back from code we don't actually intend to exercise. Unit tests for that code most likely exist elsewhere.

Maybe moot, but just to explain what was on my mind:

In theory someone could use those events timings outside of Python, but in that case why would we block it from being used in Python as well?

Expectations for safety/correctness are generally much higher for Python, compared to C/C++.

@kkraus14
Collaborator

Connected offline with @rwgk, summarizing the conversation here.

The issue here is that without HAGS, Windows will batch up calls into the CUDA driver and introduce delays that are hidden from the CPU. What this means is that calling the event record doesn't guarantee that the event recording actually runs then, but that it's added to a batch that will be submitted at a later time. Meanwhile, the time.sleep(0.5) will start immediately and can actually finish before that batch actually runs. If we replaced the time.sleep with a basic 1 thread kernel using something like __nanosleep, then the event timing would be accurate (ignoring kernel pre-emption from graphics workloads, in which case the actual event timing would be correct, but wouldn't match the amount of time the kernel is crafted for).

Given we're just calling into underlying C++ libraries here, we think the best path forward is to just ensure the elapsed time is greater than 0 and avoid being in the business of handling scheduling or timing tolerances.
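As a rough illustration (assumed shape, not the actual test code in cuda_core/tests/test_event.py), the relaxed check could be as simple as:

```python
import math

def elapsed_time_is_sane(elapsed_time_ms: float) -> bool:
    # Only verify that cuEventElapsedTime() round-trips a positive, finite
    # value through the bindings; make no assumptions about driver-side
    # batching or scheduling delays.
    return math.isfinite(elapsed_time_ms) and elapsed_time_ms > 0.0
```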


rwgk commented Nov 24, 2025

Based on discussions on chat: Closing this PR, in favor of the far simpler PR #1285

@rwgk rwgk closed this Nov 24, 2025
@rwgk rwgk deleted the cuda_core_hags branch November 24, 2025 23:16
github-actions bot pushed a commit that referenced this pull request Nov 25, 2025
Removed preview folders for the following PRs:
- PR #1279
