Deprecate non-blocking sync, and always call the synchronization API. #1213

Merged: 1 commit into master on Oct 22, 2021

Conversation

@maleadt (Member) commented on Oct 21, 2021

CUDA uses synchronization calls for certain events, like releasing memory. This adds around 150 ns to a no-op synchronize, which took around 200 ns before; that's not great, but I don't think we have another option. Also, if we were to sync on a separate thread (one of the alternatives to doing both a yielding loop & a blocking API call), it's likely that the overhead would be higher anyway.

cc @vchuravy
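
For context, a minimal sketch (not the actual CUDA.jl code) of what "always call the synchronization API" means here; `unsafe_stream_query` and `unsafe_stream_synchronize` are hypothetical thin ccall wrappers around the driver's cuStreamQuery and cuStreamSynchronize:

# minimal sketch, assuming unsafe_stream_query / unsafe_stream_synchronize are
# thin wrappers around cuStreamQuery / cuStreamSynchronize
function synchronize_sketch(stream)
    # yielding loop: poll the stream so other Julia tasks keep running
    while unsafe_stream_query(stream) == ERROR_NOT_READY
        yield()
    end
    # always finish with a real synchronization API call: the driver only
    # performs certain work (like releasing memory) inside such calls,
    # which is where the extra ~150 ns comes from
    unsafe_stream_synchronize(stream)
end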

Commit: CUDA uses synchronization calls for certain events, like releasing memory.

@vchuravy (Member) left a comment:

Sad, but I agree the threaded approach likely won't be better (although have we looked into @threadcall for the blocking call into CUDA?).

context of dynamic parallelism.
"""
device_synchronize() = nonblocking_synchronize()
# XXX: can we put the device docstring in dynamic_parallelism.jl?
Member:
@doc should let you do that

maleadt (Member Author):
Yes, but there are 2 instances of device_synchronize, and calling @doc twice overwrites the previous docstring, IIUC.
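
For reference, a small standalone sketch of the concern (hypothetical docstrings, not the ones in this PR): two zero-argument definitions of device_synchronize share a docstring key, so documenting the second replaces the first, and attaching with @doc twice to the same binding behaves the same way.

"""
    device_synchronize()

Docstring next to the host-side definition.
"""
device_synchronize() = nothing  # stand-in for the host-side method

"""
    device_synchronize()

Docstring next to the device-side definition (dynamic parallelism).
"""
device_synchronize() = nothing  # stand-in; same zero-argument signature

# `?device_synchronize` now shows only the second docstring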

@maleadt (Member Author) commented on Oct 21, 2021

Hmm, @threadcall is interesting. Problem is that the thread performing the call needs the correct context to be active (contexts are thread-bound). We could try @threadcall-ing a @cfunction that first sets the context and then performs the synchronization. Seems risky.

@maleadt (Member Author) commented on Oct 21, 2021

Maaaaybe

julia> foo() = (ccall(:write, Cint, (Cint, Ptr{Cvoid}, Csize_t), 1, C_NULL, 0); nothing)
foo (generic function with 1 method)

julia> foo()

julia> ptr = @cfunction(foo, Nothing, ())
Ptr{Nothing} @0x00007f2d3c0786d0

julia> ccall(ptr, Nothing, ())

julia> @threadcall(ptr, Nothing, ())

# NO SEGFAULT YAY

@maleadt (Member Author) commented on Oct 21, 2021

Oh my @threadcall is awfully slow 😭

julia> f() = @threadcall(ptr, Nothing, ())
f (generic function with 1 method)

julia> @benchmark f()
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.090 μs … 27.698 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.067 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.361 μs ±  1.091 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▇▇▃▄▁   ▃█▂▁▂▂                                            
  ▁▄█████▅▃▃███████▇▅▃▃▃▃▂▂▂▂▃▄▄▃▂▂▂▃▂▃▂▃▂▃▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▂▂ ▃
  5.09 μs        Histogram: frequency by time        9.68 μs <

 Memory estimate: 400 bytes, allocs estimate: 10.

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 923 evaluations.
 Range (min … max):  113.769 ns … 203.703 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     118.427 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   118.797 ns ±   4.090 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁   █▆▃▃ ▄▅▅▇▄ █▅▇▅ ▃▃▁▂                           
  ▁▁▁▁▁▁▁▂▂▃▂▇█▆▅▄████▇██████████▆████▅██▆▆▂▅▄▄▃▂▂▄▂▂▁▁▂▂▁▁▁▁▁▁ ▄
  114 ns           Histogram: frequency by time          124 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I was just getting my hopes up after figuring out the horrible incantation of calling two CUDA APIs from a @threadcall:

# synchronization is a blocking API call, so we use @threadcall to make it non-blocking.
# however, contexts are thread-bound, so we need to first set the context on that thread.
# that requires additional API calls, so we use a @cfunction to perform those.
# that's the function below, which needs to take care not to call any Julia code.
function _synchronize_threadcall(context, context_f, stream, stream_f)
    # avoid calls to jl_throw checking if f==C_NULL
    assume(context_f != C_NULL)
    res = ccall(context_f, CUresult, (CUcontext,), context)
    if res == SUCCESS
        assume(stream_f != C_NULL)
        res = ccall(stream_f, CUresult, (CUstream,), stream)
    end
    return res
end

# lazily-initialized handles passed to the threadcalled function
const _set_context_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_f = Ref{Ptr{Cvoid}}(C_NULL)
const _sync_stream_cfunction = Ref{Ptr{Cvoid}}(C_NULL)

"""
    synchronize([stream::CuStream])

Wait until `stream` has finished executing, with `stream` defaulting to the stream
associated with the current Julia task.

See also: [`device_synchronize`](@ref)
"""
function synchronize(stream::CuStream=stream(); blocking=nothing)
    if blocking !== nothing
        Base.depwarn("the blocking keyword to synchronize() has been deprecated", :synchronize)
    end

    # lazily initialize the function pointers and @cfunction passed to @threadcall
    if _sync_stream_cfunction[] == C_NULL
        lib = Libdl.dlopen(libcuda())
        _set_context_f[] = Libdl.dlsym(lib, "cuCtxSetCurrent")
        _sync_stream_f[] = Libdl.dlsym(lib, "cuStreamSynchronize")
        _sync_stream_cfunction[] =
            @cfunction(_synchronize_threadcall, CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}))
    end

    # perform the synchronization API call using @threadcall to avoid blocking in libcuda
    @check @threadcall(_sync_stream_cfunction[], CUresult,
                       (CUcontext, Ptr{Cvoid}, CUstream, Ptr{Cvoid}),
                       context().handle, _set_context_f[], stream.handle, _sync_stream_f[])

    check_exceptions()
end

@codecov bot commented on Oct 21, 2021

Codecov Report

Merging #1213 (ef3d9a2) into master (bab6e31) will decrease coverage by 0.01%.
The diff coverage is 95.12%.


@@            Coverage Diff             @@
##           master    #1213      +/-   ##
==========================================
- Coverage   80.48%   80.47%   -0.02%     
==========================================
  Files         119      119              
  Lines        8385     8394       +9     
==========================================
+ Hits         6749     6755       +6     
- Misses       1636     1639       +3     
Impacted Files           Coverage Δ
src/utilities.jl         78.46% <50.00%> (-9.42%) ⬇️
lib/cudadrv/stream.jl    93.44% <90.90%> (+5.94%) ⬆️
lib/cudadrv/context.jl   75.83% <100.00%> (+0.83%) ⬆️
lib/cudadrv/events.jl    95.83% <100.00%> (+0.27%) ⬆️
src/array.jl             91.15% <100.00%> (+0.05%) ⬆️
src/pool.jl              76.79% <100.00%> (-0.58%) ⬇️
src/state.jl             79.66% <100.00%> (+0.23%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt merged commit 80a5978 into master on Oct 22, 2021.
@maleadt deleted the tb/sync branch on October 22, 2021 at 05:30.
simonbyrne pushed a commit to simonbyrne/CUDA.jl that referenced this pull request on Nov 13, 2023:
CUDA uses synchronization calls for certain events, like releasing memory.