Automatic task-based concurrency using local streams #662
Conversation
force-pushed from f8cc025 to 47876ce
Changing the stream on a task now updates the active task-bound handles of libraries like CUBLAS and CUDNN. That means switching streams isn't entirely free, so code that switches streams just to perform a single operation might pay a noticeable cost. That doesn't seem like a very realistic pattern though, so I don't think it's worth additional complexity (like lazily changing the handle's stream).
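A minimal sketch of what such a switch entails (the helper name and the task-local-storage key are illustrative, not CUDA.jl's internals; `CUBLAS.handle()` and `cublasSetStream_v2` are the actual calls):

```julia
using CUDA
using CUDA.CUBLAS: handle, cublasSetStream_v2

# Illustrative only: switching a task's stream also rebinds the task's cached
# CUBLAS handle, which is why a stream switch isn't entirely free.
function set_task_stream!(s::CuStream)
    task_local_storage(:CuStream, s)   # remember the stream for this task
    cublasSetStream_v2(handle(), s)    # keep the task-bound handle in sync
    return s
end
```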
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   77.79%   78.08%   +0.28%
==========================================
  Files         117      118       +1
  Lines        7035     7132      +97
==========================================
+ Hits         5473     5569      +96
- Misses       1562     1563       +1
```
Continue to review full report at Codecov.
Current overhead: …
If we keep the stream as a keyword argument to low-level functions, I agree. For KA purposes I need to run a short sequence of ops on the same stream and then restore the previous one, or I need to pass an explicit stream object for …
Is the switching overhead low enough now? We could add some stream kwargs, but then again, where to draw the line? I imagine I'd have to add back most of the ones removed in this PR. I could imagine adding it back for kernel launches though, as …
vchuravy
left a comment
I like this direction a whole lot! I would push for making most memory operations use the async variant by default (IIRC the non-async ones are device-level syncs) and then emit a sync on the stream for async=false, i.e. have async represent the semantics w.r.t. the host. KA uses a busy-wait (https://github.com/JuliaGPU/KernelAbstractions.jl/blob/c9c5daa579653877cd9f7893633d4fc224881cce/src/backends/cuda.jl#L63-L72), but that might not be worth it.
I will open a corresponding PR to KA to test these changes there.
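For reference, a minimal sketch of what such a busy-wait could look like on top of CUDA.jl's event API (the function name is made up; recent CUDA.jl spells the polling call `isdone`, older CUDAdrv called it `query`):

```julia
using CUDA

# Busy-wait on an event recorded in a stream: instead of blocking inside the
# driver, poll and yield so other Julia tasks can run in the meantime.
function busywait_synchronize(s::CuStream=stream())
    ev = CuEvent()
    CUDA.record(ev, s)
    while !CUDA.isdone(ev)  # has all prior work on the stream finished?
        yield()             # give other tasks a chance to run
    end
    return
end
```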
That's interesting. That way, the …
force-pushed from ceba01f to 8134da5
Update: I retained stream arguments to most functions for the purpose of KA.jl. At the same time though, the performance overhead of querying/switching streams has been greatly reduced, so the new API could be used too. I've also pushed a commit that assigns each task its own non-blocking stream, so tasks will overlap their computations automatically (I would have expected this to break something at least, but apparently we don't accidentally rely on default-stream semantics). And memory operations now always use the async APIs and synchronize only the current stream (not the whole device), but still default to synchronous behavior.
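A rough sketch of that per-task stream idea (the helper is illustrative, and the constructor keyword is as in recent CUDA.jl; `STREAM_NON_BLOCKING` is the driver flag that opts out of synchronizing with the legacy default stream):

```julia
using CUDA

# Illustrative: lazily give each Julia task its own non-blocking stream, so
# work submitted from different tasks can overlap on the GPU.
function task_stream()
    get!(task_local_storage(), :CuStream) do
        CuStream(; flags=CUDA.STREAM_NON_BLOCKING)
    end::CuStream
end
```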
force-pushed from b977dd3 to 967d9b4
force-pushed from edb2cc5 to 726d6dd
Well, this is promising. By handling blocking ourselves now during `synchronize()` (so waiting tasks can yield to one another), the following example gets concurrent execution:

```julia
using CUDA, LinearAlgebra

# dummy calculation (that does not allocate or otherwise synchronize the GPU)
function run(a, b, c)
    NVTX.@range "mul!" mul!(c, a, b)  # uses CUBLAS, so needs a library handle
    NVTX.@range "broadcast!" broadcast!(sin, c, c)
end

# one "iteration", performing the above calculation twice in two tasks
# and comparing the output.
function iteration(a, b, c)
    x, y = missing, missing
    NVTX.@range "iteration" @sync begin
        @async begin
            x = NVTX.@range "run 1" run(a, b, c)
            synchronize()
        end
        @async begin
            y = NVTX.@range "run 2" run(a, b, c)
            synchronize()
        end
    end
    # no need to synchronize here, as both tasks have been synchronized already
    x == y
end

function main(N=1024)
    a = CUDA.rand(N, N)
    b = CUDA.rand(N, N)
    c = CUDA.rand(N, N)
    synchronize() # to make sure we can use this data from other tasks

    NVTX.@range "warmup" iteration(a, b, c)
    GC.gc(true) # we want to collect and cache the library handles used during warmup
    NVTX.@range "main" iteration(a, b, c)
end
```

With some improvements in caching library handles (turns out creating them is expensive, so we can't naively do so each time a new task performs its first library call), we get very nice concurrent execution from the above example (which doesn't even use unreasonably-large inputs to hide latency). The tiny marks at the bottom are calls to …
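The handle caching mentioned there could look roughly like this (a generic sketch with made-up names, not CUDA.jl's actual internals): idle handles get parked in a global pool instead of being destroyed, and a task's first library call grabs one from the pool rather than paying for handle creation.

```julia
# Illustrative handle pool; `HANDLE_CACHE`, `take_handle` and `return_handle`
# are placeholder names.
const HANDLE_CACHE = Any[]
const HANDLE_LOCK  = ReentrantLock()

# Take a cached handle if one is available, otherwise create a fresh one.
function take_handle(create_handle)
    lock(HANDLE_LOCK) do
        isempty(HANDLE_CACHE) ? create_handle() : pop!(HANDLE_CACHE)
    end
end

# Park a handle for reuse by the next task that needs one.
function return_handle(handle)
    lock(HANDLE_LOCK) do
        push!(HANDLE_CACHE, handle)
    end
end
```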
Have them synchronize only the currently-active stream, if requested.
This makes tasks overlap by default, and means we don't have to bother with the default stream semantics (ptsz/ptds).
Reduces the cost of querying the stream from ~20ns to ~4ns.
We use a task-local stream now, so don't need these default stream semantics.
Since we now use explicit per-task streams, we don't rely on these default stream semantics anymore.
Async H2D copies require pinned memory; if the host memory isn't pinned, the copy will just execute synchronously.
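For example (hedged; `CUDA.pin` is the host-memory registration helper from more recent CUDA.jl releases):

```julia
using CUDA

# Page-locked ("pinned") host memory lets host-to-device copies run
# asynchronously on the task's stream; with plain pageable memory the driver
# falls back to a synchronous copy instead of erroring.
host = rand(Float32, 1024, 1024)
CUDA.pin(host)                             # register the buffer with the driver
dev = CuArray{Float32}(undef, size(host))
copyto!(dev, host)                         # can now overlap with other work
```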
That way we can yield to other tasks.
This ensures newly-created tasks don't have to spend time on creating these handles (which often requires memory allocations and/or global synchronization) during their first library call.
This makes stream switching take 500ns instead of 80ns, but it simplifies handle management (making sure any active handles immediately use the new stream, reducing the risk of a mismatch there) and avoids needless stream switches when switching tasks. Task switches seem likely to happen much more frequently, so it makes sense to pay the overhead when switching streams rather than when querying them. Furthermore, many APIs take stream arguments again for KA.jl's sake, so it won't have to switch streams globally anyway.
Now that we don't check for errors on _every_ API call, it's possible an exception doesn't get caught by CUDA.check_exceptions, but results in a CUDA error instead.
This state management is too tricky.
force-pushed from b6b4afd to 1b4ebb2

This is the next big step (after JuliaGPU/CUDAnative.jl#609 and #395) in marrying Julia's task-based parallelism with the CUDA APIs. By managing our own default stream, and using that instead of CUDA's default stream objects, we can ensure that all operations (kernel launches, library calls, etc) happen using the stream that's active for the current task. That should make it much easier to isolate independent operations. For example:
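The original example isn't reproduced here (the accompanying profiler screenshot isn't included), but a minimal sketch of the behavior being described could look like this, with work submitted from a spawned task landing on that task's own stream:

```julia
using CUDA

a = CUDA.rand(1024, 1024)
b = CUDA.rand(1024, 1024)

t = @async begin
    b .*= 2        # runs on this task's own (local) stream
    synchronize()
end

a .+= 1            # runs on the main task's stream, free to overlap
synchronize()
wait(t)
```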
As you can see, the operations in the global stream are independent from the ones executed in a local stream context 🎉 Here, that results in overlapping execution, which is great for performance.
Remains to be done/decided:
- Should `synchronize` now default to synchronizing the current stream, or should it still be a device-wide sync?
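Either way, both behaviors stay expressible (hedged sketch; `device_synchronize` is CUDA.jl's device-wide variant):

```julia
using CUDA

synchronize()              # wait for work on the current task's stream
synchronize(stream())      # same thing, with the stream made explicit
CUDA.device_synchronize()  # wait for all outstanding work on the device
```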