Deprecate non-blocking sync, and always call the synchronization API. #1213

Merged: 1 commit into master on Oct 22, 2021

Conversation

@maleadt (Member) commented on Oct 21, 2021

CUDA uses synchronization calls for certain events, like releasing memory. This adds around 150 ns to a no-op synchronize, which took around 200 ns before; that's not great, but I don't think we have another option. Also, if we were to sync on a separate thread (one of the alternatives to doing both a yielding loop & a blocking API call), it's likely that the overhead would be higher anyway.

cc @vchuravy
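
For context, a minimal sketch (not the actual CUDA.jl code) of what "always call the synchronization API" means here; `unsafe_stream_query` and `unsafe_stream_synchronize` are hypothetical thin ccall wrappers around the driver's cuStreamQuery and cuStreamSynchronize:

# minimal sketch, assuming unsafe_stream_query / unsafe_stream_synchronize are
# thin wrappers around cuStreamQuery / cuStreamSynchronize
function synchronize_sketch(stream)
    # yielding loop: poll the stream so other Julia tasks keep running
    while unsafe_stream_query(stream) == ERROR_NOT_READY
        yield()
    end
    # always finish with a real synchronization API call: the driver only
    # performs certain work (like releasing memory) inside such calls,
    # which is where the extra ~150 ns comes from
    unsafe_stream_synchronize(stream)
end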

Commit: CUDA uses synchronization calls for certain events, like releasing memory.

@vchuravy (Member) left a comment:

Sad, but I agree the threaded approach likely won't be better (although have we looked into @threadcall for the blocking call into CUDA?).

context of dynamic parallelism.
"""
device_synchronize() = nonblocking_synchronize()
# XXX: can we put the device docstring in dynamic_parallelism.jl?
Member:
@doc should let you do that

maleadt (Member Author):
Yes, but there are 2 instances of device_synchronize, and calling @doc twice overwrites the previous docstring, IIUC.
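
For reference, a small standalone sketch of the concern (hypothetical docstrings, not the ones in this PR): two zero-argument definitions of device_synchronize share a docstring key, so documenting the second replaces the first, and attaching with @doc twice to the same binding behaves the same way.

"""
    device_synchronize()

Docstring next to the host-side definition.
"""
device_synchronize() = nothing  # stand-in for the host-side method

"""
    device_synchronize()

Docstring next to the device-side definition (dynamic parallelism).
"""
device_synchronize() = nothing  # stand-in; same zero-argument signature

# `?device_synchronize` now shows only the second docstring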

@maleadt (Member Author) commented on Oct 21, 2021

Hmm, @threadcall is interesting. Problem is that the thread performing the call needs the correct context to be active (contexts are thread-bound). We could try @threadcall-ing a @cfunction that first sets the context and then performs the synchronization. Seems risky.

@maleadt (Member Author) commented on Oct 21, 2021

Maaaaybe

julia> foo() = (ccall(:write, Cint, (Cint, Ptr{Cvoid}, Csize_t), 1, C_NULL, 0); nothing)
foo (generic function with 1 method)

julia> foo()

julia> ptr = @cfunction(foo, Nothing, ())
Ptr{Nothing} @0x00007f2d3c0786d0

julia> ccall(ptr, Nothing, ())

julia> @threadcall(ptr, Nothing, ())

# NO SEGFAULT YAY

@maleadt (Member Author) commented on Oct 21, 2021

Oh my @threadcall is awfully slow 😭

julia> f() = @threadcall(ptr, Nothing, ())
f (generic function with 1 method)

julia> @benchmark f()
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.090 μs … 27.698 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.361 μs ±  1.091 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▇▇▃▄▁   ▃█▂▁▂▂                                            
  ▁▄█████▅▃▃███████▇▅▃▃▃▃▂▂▂▂▃▄▄▃▂▂▂▃▂▃▂▃▂▃▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▂▂ ▃
  5.09 μs        Histogram: frequency by time        9.68 μs <

 Memory estimate: 400 bytes, allocs estimate: 10.

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 923 evaluations.
 Range (min … max):  113.769 ns … 203.703 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     118.427 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   118.797 ns ±   4.090 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁   █▆▃▃ ▄▅▅▇▄ █▅▇▅ ▃▃▁▂                           
  ▁▁▁▁▁▁▁▂▂▃▂▇█▆▅▄████▇██████████▆████▅██▆▆▂▅▄▄▃▂▂▄▂▂▁▁▂▂▁▁▁▁▁▁ ▄
  114 ns           Histogram: frequency by time          124 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I was just getting my hopes up after figuring out the horrible incantation of calling two CUDA APIs from a @threadcall:

# synchronization is a blocking API call, so we use @threadcall to make it non-blocking.
# however, contexts are thread-bound, so we need to first set the context on that thread.
# that requires additional API calls, so we use a @cfunction to perform those.
# that's the function below, which needs to take care not to call any Julia code.
function _synchronize_threadcall(context, context_f, stream, stream_f)
    # avoid calls to jl_throw checking if f==C_NULL
    assume(context_f != C_NULL)
    res = ccall(context_f, CUresult, (CUcontext,), context)
    if res == SUCCESS
        assume(stream_f != C_NULL)
        res = ccall(stream_f, CUresult, (CUstream,), stream)
    end
    return res
end

# lazily-initialized handles passed to the threadcalled function
const _set_context_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_cfunction = Ref{Ptr{Cvoid}}(C_NULL)

"""
    synchronize([stream::CuStream])

Wait until `stream` has finished executing, with `stream` defaulting to the stream
associated with the current Julia task.

See also: [`device_synchronize`](@ref)
"""
function synchronize(stream::CuStream=stream(); blocking=nothing)
    if blocking !== nothing
        Base.depwarn("the blocking keyword to synchronize() has been deprecated", :synchronize)
    end

    # lazily initialize the function pointers and @cfunction passed to @threadcall
    if _sync_stream_cfunction[] == C_NULL
        lib = Libdl.dlopen(libcuda())
        _set_context_f[] = Libdl.dlsym(lib, "cuCtxSetCurrent")
        _sync_stream_f[] = Libdl.dlsym(lib, "cuStreamSynchronize")
        _sync_stream_cfunction[] =
            @cfunction(_synchronize_threadcall, CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}))
    end

    # perform the synchronization API call using @threadcall to avoid blocking in libcuda
    @check @threadcall(_sync_stream_cfunction[], CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}),
                       context().handle, _set_context_f[], stream.handle, _sync_stream_f[])

    check_exceptions()
end

@codecov bot commented on Oct 21, 2021

Codecov Report

Merging #1213 (ef3d9a2) into master (bab6e31) will decrease coverage by 0.01%.
The diff coverage is 95.12%.


@@            Coverage Diff             @@
##           master    #1213      +/-   ##
==========================================
- Coverage   80.48%   80.47%   -0.02%     
==========================================
  Files         119      119              
  Lines        8385     8394       +9     
==========================================
+ Hits         6749     6755       +6     
- Misses       1636     1639       +3     
Impacted Files           Coverage Δ
src/utilities.jl         78.46% <50.00%> (-9.42%) ⬇️
lib/cudadrv/stream.jl    93.44% <90.90%> (+5.94%) ⬆️
lib/cudadrv/context.jl   75.83% <100.00%> (+0.83%) ⬆️
lib/cudadrv/events.jl    95.83% <100.00%> (+0.27%) ⬆️
src/array.jl             91.15% <100.00%> (+0.05%) ⬆️
src/pool.jl              76.79% <100.00%> (-0.58%) ⬇️
src/state.jl             79.66% <100.00%> (+0.23%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt merged commit 80a5978 into master on Oct 22, 2021.
@maleadt deleted the tb/sync branch on October 22, 2021 at 05:30.
simonbyrne pushed a commit to simonbyrne/CUDA.jl that referenced this pull request on Nov 13, 2023:
CUDA uses synchronization calls for certain events, like releasing memory.