Timing #84

Open
Robadob opened this issue Apr 22, 2022 · 1 comment

Robadob commented Apr 22, 2022

There should probably be some coverage of timing.

  • Performance troubleshooting should reference how timing can be collected within FGPU2.
  • Timing/performance metrics should be detailed in the logging pages (emphasise that performance metrics are always logged in ensembles??).
  • Maybe some advanced discussion regarding accuracy/WDDM, the impact of CUDA event timers, etc.

ptheywood commented Apr 22, 2022

Maybe some advanced discussion regarding accuracy/WDDM, the impact of CUDA event timers, etc.

CUDA event timers have a resolution of "around 0.5 microseconds", and timing only behaves as intended when the events are recorded in the NULL (default) stream:

Computes the elapsed time between two events (in milliseconds with a resolution of around 0.5 microseconds).

If either event was last recorded in a non-NULL stream, the resulting time may be greater than expected (even if both used the same stream handle). This happens because the cudaEventRecord() operation takes place asynchronously and there is no guarantee that the measured latency is actually just between the two events. Any number of other different stream operations could execute in between the two measured events, thus altering the timing in a significant way.

source
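
For reference, a minimal sketch of the default-stream pattern the quoted documentation describes (the kernel and launch configuration below are placeholders, not FLAME GPU code):

```cpp
// Minimal sketch: CUDA event timing in the NULL (default) stream.
// Error checking omitted for brevity; the kernel is a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() { /* work being timed */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);      // recorded in the default stream
    kernel<<<256, 256>>>();
    cudaEventRecord(stop);       // also recorded in the default stream
    cudaEventSynchronize(stop);  // block until 'stop' has actually occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds, ~0.5us resolution
    printf("kernel took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```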

Under WDDM, due to how the WDDM command buffers work, cudaEvent-based timing is only meaningful for pure device code (unless you add an immediate stream/event/device sync after recording). See FLAMEGPU/FLAMEGPU2#451.
The current implementation in FLAME GPU uses std::steady_clock timers when the GPU is running under WDDM.
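
An illustrative sketch of that host-side fallback (not FLAME GPU's actual code), assuming a hypothetical kernel as the work being measured; the synchronise before the second sample is what makes the measurement meaningful under WDDM:

```cpp
// Illustrative sketch (not FLAME GPU's actual code): host-side timing with
// std::steady_clock. The cudaDeviceSynchronize() is what makes this valid
// under WDDM; without it only the asynchronous launch would be timed.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() { /* work being timed */ }

int main() {
    const auto start = std::chrono::steady_clock::now();
    kernel<<<256, 256>>>();
    cudaDeviceSynchronize();  // ensure the GPU work has finished before stopping the clock
    const auto stop = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    printf("kernel + sync took %f ms\n", ms);
    return 0;
}
```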

std::steady_clock timers are generally not as good, and because they are implementation- and hardware-specific, a known accuracy/precision can't be documented. It might be possible to estimate one at runtime, though (see the sketch below). They might not be precise enough to give useful per-step or per-layer timing, depending on the model.
std::high_resolution_clock sounds like it should be better, but it is implementation-defined: on MSVC it is just an alias for std::steady_clock, while GCC (libstdc++) aliases it to std::system_clock, which is not suitable for performance timing (it is not monotonic).
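
A rough sketch of such a runtime estimate: repeatedly sample the clock and keep the smallest non-zero step observed. Note this measures an effective resolution (tick granularity plus sampling overhead), not a hardware spec:

```cpp
// Rough sketch: estimating std::steady_clock's effective resolution at runtime
// by recording the smallest non-zero step observed between successive samples.
// This conflates tick granularity with call overhead, so it is an upper bound
// on achievable precision rather than a true hardware figure.
#include <algorithm>
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    auto min_step = clock::duration::max();
    for (int i = 0; i < 1000; ++i) {
        const auto a = clock::now();
        auto b = clock::now();
        while (b == a) b = clock::now();  // spin until the clock advances
        min_step = std::min(min_step, b - a);
    }
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(min_step);
    printf("smallest observed step: %lld ns\n", static_cast<long long>(ns.count()));
    return 0;
}
```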
