@rwgk
Collaborator

@rwgk rwgk commented Nov 24, 2025

This change replaces the timing-based event test's use of time.sleep() with a GPU-side nanosleep kernel, eliminating flakiness from OS/driver timing characteristics while maintaining deterministic test behavior.

Changes

  • Added NanosleepKernel helper that uses __nanosleep() to create a guaranteed 20 ms GPU-side delay
  • Updated test_timing_success to use the nanosleep kernel instead of time.sleep()
  • Removed OS-specific timing tolerance logic (Windows/WSL special cases)
  • Simplified assertions to check for finite elapsed time and a minimum threshold (>10ms)

Benefits

  • Deterministic: GPU-side delay is consistent across platforms, eliminating flakiness on Windows/WDDM and WSL
  • Simpler: Removes platform-specific tolerance calculations and OS timing dependencies
  • Reliable: Tests Event.__sub__ functionality without depending on OS timer resolution or driver scheduling behavior

This PR makes PR #1279 obsolete.

The previous test attempted to measure a real sleep delay between two
event records, which introduced flakiness (especially on Windows/WDDM)
and tested OS/driver timing behavior rather than the __sub__ implementation
itself.

This change replaces the test with a minimal, deterministic version that:

  * records two back-to-back events on the same stream
  * synchronizes on the second event to ensure both timestamps are valid
  * asserts that cuEventElapsedTime returns a finite, non-negative float

This exercises the success path of Event.__sub__ without depending on
actual GPU/OS timing characteristics, or requiring artificial GPU work.
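
A minimal sketch of the three steps listed above, assuming a stream object whose `record()` returns an event that supports `sync()` and subtraction, as the events discussed in this PR do. The helper name `elapsed_ms_back_to_back` and the `FakeEvent`/`FakeStream` stand-ins are hypothetical; they exist only so the flow can be exercised without a GPU, and this is not the test code from the PR.

```python
import math


def elapsed_ms_back_to_back(stream, options=None):
    """Record two events back to back and return their elapsed time in ms."""
    e1 = stream.record(options=options)
    e2 = stream.record(options=options)
    e2.sync()  # wait for the second event so both timestamps are valid
    delta_ms = e2 - e1  # exercises the success path of Event.__sub__
    assert isinstance(delta_ms, float)
    assert math.isfinite(delta_ms) and delta_ms >= 0.0
    return delta_ms


# Hypothetical stand-ins so the sketch runs without CUDA.
class FakeEvent:
    def __init__(self, timestamp_ms):
        self.timestamp_ms = timestamp_ms

    def sync(self):
        pass  # a real event would block until the GPU reaches it

    def __sub__(self, other):
        return float(self.timestamp_ms - other.timestamp_ms)


class FakeStream:
    def __init__(self):
        self._now_ms = 0.0

    def record(self, options=None):
        self._now_ms += 0.25  # pretend a little time passes per record
        return FakeEvent(self._now_ms)


print(elapsed_ms_back_to_back(FakeStream()))
```

With a real stream, the only inputs would be the stream itself and event options that enable timing; the checks stay the same.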
@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Collaborator Author

rwgk commented Nov 24, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review November 25, 2025 00:40
@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 25, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@leofang leofang self-assigned this Nov 25, 2025
@leofang leofang added the enhancement, P0, and cuda.core labels Nov 25, 2025
@leofang leofang added this to the cuda.core beta 10 milestone Nov 25, 2025
@leofang
Member

leofang commented Nov 25, 2025

LGTM but it'd be nice for @kkraus14 to take another look.

kkraus14
kkraus14 previously approved these changes Nov 25, 2025
Collaborator

@kkraus14 kkraus14 left a comment

LGTM! Thanks @rwgk

rwgk added 2 commits November 24, 2025 22:33
…ic timing

Replace the back-to-back event record test with a version that uses a
__nanosleep kernel between events. This ensures a guaranteed positive
elapsed time (delta_ms > 10) without depending on OS/driver timing
characteristics or requiring artificial GPU work beyond the minimal
nanosleep delay.

The kernel sleeps for 20ms (double the assertion threshold of 10ms),
providing a large safety margin above the ~0.5 microsecond resolution
of cudaEventElapsedTime, making this test deterministic and non-flaky
across platforms including Windows/WDDM.

Replace the single __nanosleep() call with a clock64()-based loop to ensure the
kernel actually waits for the full 20 ms. A single __nanosleep() call does not
guarantee the full sleep duration, which caused measured times to be orders of
magnitude lower than expected (~0.2 ms instead of ~20 ms).

The new implementation:
- Uses clock64() to measure actual elapsed time
- Loops until 20ms worth of clock cycles have elapsed
- Uses __nanosleep(1000000) inside the loop to yield and avoid a 100% busy-wait spin on the GPU

This ensures delta_ms > 10 assertion is reliable and the test passes
deterministically.
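
The loop described in this commit message could look roughly like the sketch below. The kernel source is held in a Python string, as it would be before runtime compilation. The names `NANOSLEEP_KERNEL_SRC`, `nanosleep_kernel`, `cycles_to_wait`, and `ms_to_cycles` are illustrative and not taken from the PR. The sketch assumes `clock64()` ticks at the SM clock rate and that the device clock rate is available in kHz, so one millisecond corresponds to `clock_rate_khz` cycles.

```python
# Illustrative CUDA source; not the kernel from the PR.
NANOSLEEP_KERNEL_SRC = r"""
extern "C" __global__ void nanosleep_kernel(long long cycles_to_wait)
{
    long long start = clock64();
    // Keep checking the clock: one __nanosleep() call may return early.
    while (clock64() - start < cycles_to_wait) {
        __nanosleep(1000000);  // ask for ~1 ms so the warp yields instead of spinning
    }
}
"""


def ms_to_cycles(duration_ms: int, clock_rate_khz: int) -> int:
    """Convert milliseconds to clock cycles; 1 kHz is one cycle per millisecond."""
    return duration_ms * clock_rate_khz


# Example: a 20 ms wait on a device reporting a 1.5 GHz (1_500_000 kHz) clock.
print(ms_to_cycles(20, 1_500_000))
```

Two caveats: `__nanosleep()` requires compute capability 7.0 or newer, and if the reported rate is a peak rate, a GPU running below it would simply wait longer than the target, which still satisfies a minimum-threshold assertion.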
@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

/ok to test

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

I (actually cursor-agent) added a couple commits to insert a kernel that causes a delay of 20 ms. I was surprised to see that __nanosleep by itself didn't lead to predictable behavior, is that expected?

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

> LGTM! Thanks @rwgk

Oops, sorry, I somehow missed this response.

I was curious to see how much trouble it is to add the kernel, and apart from the nanosleep surprise, cursor did that in seconds.

I'll let the test finish, then we can still decide if we want to keep the kernel, or remove it again.

@kkraus14
Collaborator

My 2c is that we should remove it in the name of simplicity. This adds a lot of machinery, and a lot of things that can go wrong, just to test event timing, but I don't have a strong opinion if you think it is valuable.

I would move all of the kernel definition and compilation out of the test into the module if we decide to keep it though.

@leofang leofang assigned rwgk and unassigned leofang Nov 25, 2025
# Using a 10 ms threshold (half the sleep duration) provides a large safety margin above
# the ~0.5 microsecond resolution of cudaEventElapsedTime, making this test deterministic
# and non-flaky.
assert delta_ms > 10
Member

Q: Should equality be included?

Suggested change
- assert delta_ms > 10
+ assert delta_ms >= 10

Collaborator Author

Technically: Because of the large safety margin (about 20 ms expected against a 10 ms threshold), it shouldn't matter at all.

Readability aspect: Making an effort to be precise here would send the wrong message, by distracting from the large safety margin.
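
To make the margin concrete, here is the arithmetic behind the code comment, using only the numbers quoted in this thread (20 ms sleep, 10 ms threshold, ~0.5 microsecond timer resolution). The variable names are illustrative.

```python
SLEEP_NS = 20_000_000      # kernel delay: 20 ms
THRESHOLD_NS = 10_000_000  # assertion threshold: 10 ms
RESOLUTION_NS = 500        # ~0.5 microsecond event timer resolution

margin_ns = SLEEP_NS - THRESHOLD_NS
margin_in_ticks = margin_ns // RESOLUTION_NS

# The expected reading sits this many timer ticks above the threshold, so
# `>` and `>=` can only differ on a reading of exactly 10 ms.
print(margin_ns, margin_in_ticks)
```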

@kkraus14 kkraus14 dismissed their stale review November 25, 2025 18:15

significant changes

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

I verified conclusively that test_event_elapsed_time_basic (as of commit 5eba5ac)
PASSES with the Windows WDDM driver model and HAGS (Hardware-Accelerated GPU Scheduling) OFF.

Command used:

python -m pytest -ra -s -vv tests\test_event.py::test_event_elapsed_time_basic --count=1000

Complete test script: test_event_elapsed_time_basic.cmd

Full log: test_event_elapsed_time_basic_hags_off_2025-11-25+105446_log.txt

For completeness: The same pytest command also passes with HAGS ON.

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

> I would move all of the kernel definition and compilation out of the test into the module if we decide to keep it though.

Given all the effort that went into this already (mostly before this PR, including the back and forth with SWQA) I don't want to stop a few inches before the finish line.

I'll move the new kernel code into a module and address Leo's comments.

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

> I would move all of the kernel definition and compilation out of the test into the module if we decide to keep it though.

Done. That particular step was again just seconds worth of cursor-agent effort.

I changed the naming back to be more similar to the original code, and polished the comments manually, to convince myself it's all accurate.

Interactive testing with HAGS OFF still passes (pytest -ra -s -v tests\test_event.py::test_timing_success --count=1000).

I'll reboot my machine to get it back to HAGS ON (where I want to keep it) and then test again, while the CI is running here.

@rwgk
Collaborator Author

rwgk commented Nov 25, 2025

/ok to test

@rwgk
Collaborator Author

rwgk commented Nov 26, 2025

/ok to test

Collaborator

@kkraus14 kkraus14 left a comment

LGTM!

@rwgk rwgk changed the title from "Replace timing-based event test with deterministic elapsed-time check" to "Replace OS sleep with GPU nanosleep kernel in event timing test" Nov 26, 2025
@rwgk
Collaborator Author

rwgk commented Nov 26, 2025

Thanks for all the feedback!

@rwgk rwgk merged commit f0af76d into NVIDIA:main Nov 26, 2025
61 checks passed
@rwgk rwgk deleted the cuda_core_test_event_sub_basic branch November 26, 2025 16:35
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels

cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)

4 participants