Perform synchronization on a worker thread #2025

maleadt · 2023-08-11T13:20:58Z

As recommended by NVIDIA, instead of polling the context/stream/event, perform the synchronization on a dedicated thread. This is supported on Julia 1.9+, which has support for foreign threads. It's not particularly fast at 5 μs per call, but it's significantly better than the previous slow path (which took at least 25 μs, and could sometimes stall for much longer when the event loop was busy).
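To illustrate the general idea of handing a blocking wait to a dedicated worker, here is a minimal sketch. It is not the code in this PR: it swaps in a plain Julia task and `Channel`s instead of a foreign thread, and `BlockingWorker`, `run_blocking`, and `fake_stream_synchronize` are hypothetical names, with `sleep` standing in for the driver's synchronization call.

```julia
# Sketch only: a long-lived worker that runs blocking calls on behalf of callers.
# All names here are hypothetical and do not come from CUDA.jl.

struct BlockingWorker
    requests::Channel{Tuple{Function, Channel{Any}}}
end

function BlockingWorker()
    requests = Channel{Tuple{Function, Channel{Any}}}(Inf)
    # The worker drains the request queue. With more than one Julia thread,
    # this task may run on a different thread than the caller.
    Threads.@spawn begin
        for (f, reply) in requests
            outcome = try
                (true, f())        # success: forward the result
            catch err
                (false, err)       # failure: forward the exception
            end
            put!(reply, outcome)
        end
    end
    return BlockingWorker(requests)
end

function run_blocking(worker::BlockingWorker, f)
    reply = Channel{Any}(1)
    put!(worker.requests, (f, reply))
    # take! suspends only the calling task, so other tasks keep running
    # and the caller does not need to poll.
    ok, value = take!(reply)
    ok || throw(value)
    return value
end

# Stand-in for a blocking synchronization call (assumption for the demo).
fake_stream_synchronize() = (sleep(0.01); :done)

worker = BlockingWorker()
result = run_blocking(worker, fake_stream_synchronize)
```

The caller waits on a reply channel rather than repeatedly querying for completion, which is the property the description above is after; exceptions raised by the blocking call are rethrown in the caller.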

TODO: try to improve performance of the core mechanism.

cc @vchuravy

Alternative to #2014; @lcw could you test whether this is acceptable? Note that it requires 1.9.2 or 1.10.

vchuravy · 2023-08-11T14:35:25Z

lib/cudadrv/synchronization.jl

+    #  any user will just submit work that makes it block
+
+    # we don't know what the size of uv_thread_t is, so reserve enough space
+    tid = Ref{NTuple{32, UInt8}}(ntuple(i -> 0, 32))


We should export an accessor from Julia to get this `sizeof`.

maleadt · 2023-08-11

Yeah... 32 bytes ought to be enough for anybody? 😅
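For readers unfamiliar with the pattern in the diff, the sketch below shows how a fixed-size byte buffer can be reserved for an opaque handle whose exact size is not known from Julia. The constant name `HANDLE_BYTES` is hypothetical, and the pointer-sized requirement checked at the end is an assumption for illustration, not a verified `sizeof(uv_thread_t)`.

```julia
# Reserve a generously sized, zero-initialized buffer for an opaque C handle.
const HANDLE_BYTES = 32  # same 32-byte guess as in the diff above
tid = Ref{NTuple{HANDLE_BYTES, UInt8}}(ntuple(_ -> 0x00, HANDLE_BYTES))

# The buffer really is 32 bytes of plain bits, so it can be passed to C by reference.
buffer_size = sizeof(tid[])

# ASSUMPTION: the handle is pointer-sized, as thread handles are on common
# 64-bit platforms. An accessor exported by Julia would replace this guess.
required = sizeof(Ptr{Cvoid})
fits = required <= buffer_size
```

Over-allocating is safe as long as the buffer is at least as large as the real type; the accessor suggested above would remove the guesswork.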

codecov · 2023-08-11T14:46:24Z

Codecov Report

Patch coverage: 95.89% and project coverage change: +0.24% 🎉

Comparison is base (38fb707) 62.31% compared to head (e56c6a8) 62.55%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2025      +/-   ##
==========================================
+ Coverage   62.31%   62.55%   +0.24%     
==========================================
  Files         151      152       +1     
  Lines       12842    12920      +78     
==========================================
+ Hits         8002     8082      +80     
+ Misses       4840     4838       -2

| Files Changed | Coverage Δ | |
|---|---|---|
| lib/cudadrv/events.jl | `94.11% <ø> (-1.81%)` | ⬇️ |
| lib/cudadrv/state.jl | `80.87% <ø> (+2.18%)` | ⬆️ |
| lib/cudadrv/stream.jl | `95.12% <ø> (+0.18%)` | ⬆️ |
| src/pool.jl | `69.03% <ø> (ø)` | |
| lib/utils/memoization.jl | `90.56% <80.00%> (+0.36%)` | ⬆️ |
| lib/cudadrv/synchronization.jl | `96.06% <96.06%> (ø)` | |
| lib/cudadrv/context.jl | `72.07% <100.00%> (-1.21%)` | ⬇️ |
| lib/utils/threading.jl | `92.30% <100.00%> (+8.09%)` | ⬆️ |

lcw · 2023-08-12T05:08:43Z

Thanks for working on this!

I had @aaustin141 rerun the benchmark we were using at JuliaCon. Here is
what he found.

This PR is a significant improvement: the amount of dead time spent in
epoll_wait() is down from 100+ μs (and often a lot more than that) to
about 15-20 μs. See the before profile

and after profile

But it's still not as good as switching to blocking synchronization, in
which we have a scant 5 μs or so of dead time in cuStreamSynchronize()
before MPI starts. See the blocking profile below.

Here are some times required to run 20 V-cycles of @aaustin141's multigrid code
(1,048,576 degree-5 elements for 37,748,736 DG and 26,224,641 CG DoFs on
the finest level; times averaged over 3 runs):

Scheme                       Time (s)    Speedup (vs. 1 Rank)
-------------------------------------------------------------
1 Rank                       2.71        1.00
2 Ranks, Old Non-Blocking    1.85        1.46
2 Ranks, New Non-Blocking    1.71        1.58
2 Ranks, Blocking            1.58        1.72

So, we recommend adding the new non-blocking code but would still like a
blocking option.
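The speedup column in the table above is the 1-rank time divided by each scheme's time. A short sketch that reproduces it from the reported times, and also computes the relative gain between successive 2-rank schemes (that last ratio is derived here and is not a figure reported in the table):

```julia
# Times in seconds, copied from the table above.
schemes = ["1 Rank", "2 Ranks, Old Non-Blocking",
           "2 Ranks, New Non-Blocking", "2 Ranks, Blocking"]
times = [2.71, 1.85, 1.71, 1.58]

# Speedup relative to the single-rank run.
speedups = times[1] ./ times

# Relative gain of each 2-rank scheme over the previous one.
gain_new_over_old = times[2] / times[3]
gain_blocking_over_new = times[3] / times[4]

for (scheme, t, s) in zip(schemes, times, speedups)
    println(rpad(scheme, 28), rpad(t, 8), round(s; digits = 2))
end
```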

maleadt · 2023-08-12T08:03:25Z

> So, we recommend adding the new non-blocking code but would still like a
> blocking option.

OK, that's too bad. I added a preference to control the synchronization kind, which feels like a more idiomatic way than an environment variable (despite what I said earlier). Does that work too?
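For context, package preferences are read by Preferences.jl from a `LocalPreferences.toml` next to the active `Project.toml`. The sketch below only renders what such an entry could look like, using the `TOML` standard library. The key name `nonblocking_synchronization` is an assumption, since this thread does not name the preference; check the CUDA.jl documentation for the actual key and accepted values.

```julia
using TOML  # standard library

# ASSUMPTION: the key name is illustrative; it is not stated in this thread.
prefs = Dict("CUDA" => Dict("nonblocking_synchronization" => false))

# Render the table that would go into LocalPreferences.toml.
text = sprint(TOML.print, prefs)
print(text)
```

Preferences are loaded when the package is precompiled, so a change typically requires restarting Julia before it takes effect.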

lcw · 2023-08-12T14:50:50Z

Yes, the preference is fine. Thanks.

maleadt added 5 commits August 11, 2023 14:14

Add thread-based nonblocking synchronization.

d5048f2

Make LazyInitialized/at-memoize resilient to thread adoption and changing nthreads.

5d95bfc

Always check for exceptions.

83e0bed

Lower requirement to 1.9.2; we aren't macOS.

730e3e9

Remove outdated comment.

d4eb8d0

maleadt added the enhancement and performance labels Aug 11, 2023

Remove comment about premature optimization.

a5482f3

vchuravy reviewed Aug 11, 2023

lib/utils/memoization.jl (resolved)

Remove redundant locking.

7c6d309

Introduce a preference to control synchronization kind.

e56c6a8

lcw mentioned this pull request Aug 12, 2023

Make nonblocking synchronize optional #2014

Closed

maleadt merged commit 4cd4d14 into master Aug 12, 2023
1 check passed

maleadt deleted the tb/foreign_thread_sync branch August 12, 2023 17:55

maleadt mentioned this pull request Aug 14, 2023

Stream synchronization is slow when waiting on the event from CUDA #1910

Closed

maleadt mentioned this pull request Sep 20, 2023

CompatHelper: bump compat for "CUDA" to "5" JuliaGPU/GemmKernels.jl#155

Merged

simonbyrne mentioned this pull request Oct 17, 2023

Investigate causes of poor scaling in multi-GPU runs CliMA/ClimaAtmos.jl#2222

Open

maleadt mentioned this pull request Apr 22, 2024

Task scheduling can result in delays when synchronizing #1525

Closed
