
WDDM and cudaEventElapsedTime #451

Closed
ptheywood opened this issue Feb 23, 2021 · 1 comment · Fixed by #640

Comments

@ptheywood
Member

util/CUDAEventTimer.h uses cudaEvent_t objects created in the default stream and cudaEventElapsedTime to record elapsed time with high precision.
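As a rough illustration (not the actual util/CUDAEventTimer.h source), the event-based timing pattern in question looks roughly like this; it requires a CUDA-capable device to run:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch of default-stream cudaEvent_t timing, as described above.
int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);  // enqueued in the default stream
    // ... work to be timed (kernel launches, memcpys, host code) ...
    cudaEventRecord(stop);

    // The elapsed time is only valid once the stop event has completed.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

On WDDM, both cudaEventRecord calls pass through the command buffer, which is what makes host time between them unreliable.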

Under Linux (and presumably Windows TCC mode) this appears to be accurate regardless of what happens between the events.

Under Windows on WDDM devices, the cudaEventRecord calls appear to be handled through the WDDM command buffers (not too surprising). This, however, prevents accurate timing of host code.
Timing of device code between the events looks OK based on profiler output, but host code executed after the first cudaEventRecord is issued and before the WDDM command buffer is flushed (e.g. a thread sleep) may not be captured.

For now, the timing is probably fine, as in our case it will almost always be used on Linux or TCC-mode devices.
We should investigate this, and potentially fall back to using std::chrono::steady_clock on Windows WDDM. That would be fine for timing full simulations, but may not be precise enough for timing individual steps or individual layers.

The precision of steady_clock is implementation-specific. std::chrono::high_resolution_clock at a glance looks like a better choice; however, it is also implementation-specific, being an alias for either std::chrono::steady_clock or std::chrono::system_clock depending on the compiler. We want a monotonic clock for timing purposes, so system_clock is not a good choice.

@ptheywood
Member Author

ptheywood commented Jul 15, 2021

util::detail::CUDAEventTimer and util::detail::event::SteadyClockTimer provide similar but not identical APIs, as CUDA event timers require synchronisation prior to storing / accessing the elapsed time, rather than it being immediately available after stop().

Instead, if the CUDAEventTimer's sync() method became part of getElapsedMilliseconds(), then a common base class util::detail::Timer could be created. This would then allow CUDASimulation to select the timer based on the selected device, favouring the CUDAEventTimer, but falling back to SteadyClockTimer if required, i.e. if on Windows and the selected device is in WDDM mode, or just on Windows in general.

It would probably also be worth making the precision/accuracy of the timer used available to the user, via stdout/stderr or via a method somewhere (bool CUDASimulation::timerIsHighPrecision? float CUDASimulation::timerPrecisionMilliseconds?), as this might be useful if the timing is being used to report performance in a paper, for instance.

  • Move CUDAEventTimer::sync into the body of CUDAEventTimer::getElapsedMilliseconds
  • Create common base/virtual class util::detail::Timer
    • Default ctor/dtor
    • void start()
    • void stop()
    • float getElapsedMilliseconds()
  • Update use of CUDAEventTimers to instead get the appropriate timer for the active GPU?
    • Maybe a static factory method on the base timer class?
  • Optionally add a method/stdout logging so the precision of the timer is made available.
    • Static method / constexpr on each timer implementation reporting the precision value or if it is "high" precision as a bool?
    • Method on CUDASimulation either outputting to stdout/stderr (if -v -t) or getting the precision of the timer(s) used.
    • Method on CUDAEnsemble outputting the precision of the ensemble timer?
  • Ensure docstrings / online documentation report this somewhere.

It appears difficult to get the accuracy / resolution of steady_clock timers in real terms. The std::ratio accessible from the class is just std::nano, which only describes the values the representation could express, not the actual minimum interval measurable between two timing events.
