Deprecate non-blocking sync, and always call the synchronization API. #1213
Conversation
CUDA uses synchronization calls for certain events, like releasing memory.
Sad, but I agree the threaded approach likely won't be better (although have we looked into `@threadcall` for the blocking call into CUDA?).
```julia
context of dynamic parallelism.
"""
device_synchronize() = nonblocking_synchronize()
# XXX: can we put the device docstring in dynamic_parallelism.jl?
```
`@doc` should let you do that.
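For reference, a minimal sketch of what that would look like (the function `f` and its docstring are hypothetical, just to illustrate attaching documentation to an existing binding with `@doc`):

```julia
# Hypothetical sketch: attach a docstring to an already-defined function via
# @doc, instead of placing the string literal directly above the definition.
f() = nothing

@doc """
    f()

Do nothing, quickly.
""" f
```

The attached docstring is then what `?f` (or `@doc f`) shows.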
Yes, but there are 2 instances of `device_synchronize`, and calling `@doc` twice overwrites the previous one, IIUC.
Hmm,

Maaaaybe
Oh my:

```julia
julia> f() = @threadcall(ptr, Nothing, ())
f (generic function with 1 method)

julia> @benchmark f()
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.090 μs … 27.698 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.361 μs ±  1.091 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▇▇▃▄▁ ▃█▂▁▂▂
  ▁▄█████▅▃▃███████▇▅▃▃▃▃▂▂▂▂▃▄▄▃▂▂▂▃▂▃▂▃▂▃▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▂▂ ▃
  5.09 μs         Histogram: frequency by time         9.68 μs <

 Memory estimate: 400 bytes, allocs estimate: 10.

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 923 evaluations.
 Range (min … max):  113.769 ns … 203.703 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     118.427 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   118.797 ns ±   4.090 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁ █▆▃▃ ▄▅▅▇▄ █▅▇▅ ▃▃▁▂
  ▁▁▁▁▁▁▁▂▂▃▂▇█▆▅▄████▇██████████▆████▅██▆▆▂▅▄▄▃▂▂▄▂▂▁▁▂▂▁▁▁▁▁▁ ▄
  114 ns          Histogram: frequency by time          124 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

I was just getting my hopes up after figuring out the horrible incantation of calling two CUDA APIs from a `@threadcall`:

```julia
# synchronization is a blocking API call, so we use @threadcall to make it non-blocking.
# however, contexts are thread-bound, so we need to first set the context on that thread.
# that requires additional API calls, so we use a @cfunction to perform those.
# that's the function below, which needs to take care not to call any Julia code.
function _synchronize_threadcall(context, context_f, stream, stream_f)
    # tell the compiler that f != C_NULL, avoiding jl_throw-ing null checks
    assume(context_f != C_NULL)
    res = ccall(context_f, CUresult, (CUcontext,), context)
    if res == SUCCESS
        assume(stream_f != C_NULL)
        res = ccall(stream_f, CUresult, (CUstream,), stream)
    end
    return res
end

# lazily-initialized handles passed to the threadcalled function
const _set_context_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_cfunction = Ref{Ptr{Cvoid}}(C_NULL)
```
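The `@threadcall` mechanism itself can be exercised without CUDA. A hedged sketch (using libc's `usleep` as a stand-in for the blocking driver call, so POSIX systems only) showing that the blocking C call runs on a libuv worker thread while the Julia scheduler stays free:

```julia
# Sketch: @threadcall performs the ccall on a libuv worker thread, so other
# Julia tasks keep running while the C function blocks. libc's usleep stands
# in for the blocking CUDA API call (POSIX systems only).
done = Ref(false)
t = @async begin
    @threadcall(:usleep, Cint, (Cuint,), 100_000)  # block ~100 ms off-thread
    done[] = true
end
yield()          # the blocking call is now in flight on a worker thread...
wait(t)          # ...and this task was free to do other work meanwhile
@assert done[]
```

This is the same shape as the `_synchronize_threadcall` approach above, minus the context-switching `@cfunction` trampoline that CUDA's thread-bound contexts require.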
```julia
"""
    synchronize([stream::CuStream])

Wait until `stream` has finished executing, with `stream` defaulting to the stream
associated with the current Julia task.

See also: [`device_synchronize`](@ref)
"""
function synchronize(stream::CuStream=stream(); blocking=nothing)
    if blocking !== nothing
        Base.depwarn("the blocking keyword to synchronize() has been deprecated", :synchronize)
    end

    # perform the synchronization API call using @threadcall to avoid blocking in libcuda
    if _sync_stream_cfunction[] == C_NULL
        lib = Libdl.dlopen(libcuda())
        _set_context_f[] = Libdl.dlsym(lib, "cuCtxSetCurrent")
        _sync_stream_f[] = Libdl.dlsym(lib, "cuStreamSynchronize")
        _sync_stream_cfunction[] =
            @cfunction(_synchronize_threadcall, CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}))
    end
    @check @threadcall(_sync_stream_cfunction[], CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}),
                       context().handle, _set_context_f[], stream.handle, _sync_stream_f[])

    check_exceptions()
end
```
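For context, a hypothetical usage sketch of this API (assumes CUDA.jl is loaded and a GPU is present, so it is not runnable here; the array sizes are arbitrary):

```julia
using CUDA

a = CUDA.rand(1024)
b = a .+ 1       # broadcasts launch asynchronously on the task-local stream
synchronize()    # block until the stream's work is done, without spinning in libcuda
```

The point of the PR is that this `synchronize()` always issues the actual driver synchronization call rather than only polling nonblockingly.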
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1213      +/-   ##
==========================================
- Coverage   80.48%   80.47%   -0.02%
==========================================
  Files         119      119
  Lines        8385     8394       +9
==========================================
+ Hits         6749     6755       +6
- Misses       1636     1639       +3
```

Continue to review full report at Codecov.
> CUDA uses synchronization calls for certain events, like releasing memory.

Adds around 150 ns to a no-op `synchronize()`, which took around 200 ns before, so that's not great, but I don't think we have another option. Also, if we were to sync on a separate thread (one of the alternatives to doing both a yielding loop & a blocking API call), the overhead would likely be higher anyway.
cc @vchuravy