Support dot product on GPU between CuArrays with inconsistent eltypes #1240
Conversation
The performance is comparable to the CUBLAS implementation based on my local test.
Looking good!
lib/cublas/linalg.jl
```julia
end
k = @cuda launch=false kernel(x, y, res, T)
config = launch_configuration(k.fun)
k(x, y, res, T; threads=min(length(x), config.threads, MAX_THREADS), blocks=config.blocks)
```
Doesn't `blocks` also need to be clamped?
Because I just do a thread-level reduction here, I think `blocks` doesn't need to be clamped?
I spent a little time cleaning up your implementation, but it surprisingly turned out a lot slower:

```julia
function LinearAlgebra.dot(x::StridedCuArray{T1}, y::StridedCuArray{T2}) where {T1,T2}
    n = length(x)
    n == length(y) || throw(DimensionMismatch("dot product arguments have lengths $(length(x)) and $(length(y))"))
    res = CUDA.zeros(promote_type(T1, T2), 1)

    function kernel(x, y, res::AbstractArray{T}) where T
        neutral = zero(T)
        val = neutral

        # grid-stride loop
        i0 = (blockIdx().x - 1i32) * blockDim().x
        @inbounds for i in i0:(blockDim().x * gridDim().x):length(x)
            # reduce_block synchronizes, so the entire block needs to participate
            j = i + threadIdx().x
            local_val = j <= length(x) ? x[j] * y[j] : neutral
            val += reduce_block(+, local_val, neutral, #=shuffle=# Val(true))
        end
        sync_threads()

        # finalize the computation
        if threadIdx().x == 1i32
            @inbounds CUDA.@atomic res[] += val
        end
        return
    end

    kernel = @cuda launch=false kernel(x, y, res)
    config = launch_configuration(kernel.fun)
    threads = min(config.threads, n)
    blocks = min(config.blocks, cld(n, threads))
    kernel(x, y, res; threads, blocks)

    CUDA.@allowscalar res[]
end
```

Some ideas might be worth porting to your implementation though, e.g. the grid-stride loop that doesn't require a `MAX_THREADS` clamp.
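For reference, here is the grid-stride pattern in isolation, as a generic sketch of the technique rather than this PR's kernel: each thread starts at its global index and steps by the total number of threads in the grid, so a single launch configuration covers any input length without clamping `threads` to `length(xs)`.

```julia
using CUDA

function increment_kernel(xs)
    index  = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # each thread visits index, index+stride, index+2stride, ...
    @inbounds for i in index:stride:length(xs)
        xs[i] += one(eltype(xs))
    end
    return
end

xs = CUDA.zeros(Float32, 1_000_000)
k = @cuda launch=false increment_kernel(xs)
config = launch_configuration(k.fun)  # occupancy-based suggestion
k(xs; threads=config.threads, blocks=config.blocks)
```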
Good suggestion. I've made the necessary changes and used the grid-stride loop.

Hmm, @maleadt any idea why the CI complains:
Local benchmark:
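A minimal harness for such a comparison (my own sketch; the sizes and the BenchmarkTools usage are assumptions, not from the PR) could look like:

```julia
using CUDA, LinearAlgebra, BenchmarkTools

x  = CUDA.rand(Float32, 2^22)
y  = CUDA.rand(Float32, 2^22)
yb = CuArray(rand(Bool, 2^22))

@btime CUDA.@sync dot($x, $y)   # matching eltypes: dispatches to CUBLAS
@btime CUDA.@sync dot($x, $yb)  # mixed eltypes: the generic kernel from this PR
```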
Ah yes, only doing the reduce once per block was probably the reason this performed slower. I'll push some other improvements.
The line before is important:

ComplexF64, a 128-bit type, isn't supported by any atomic operation. It should be safe to split the atomic on the real and imaginary parts here; I wonder if we can generalize this. Anyway, I'll disable that test for now.
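A minimal sketch of that real/imaginary split (my own illustration, not the PR's code): reinterpret the one-element ComplexF64 accumulator as two Float64s on the host, then issue two 64-bit atomics in the kernel.

```julia
using CUDA

function accum_kernel(parts, val::ComplexF64)
    CUDA.@atomic parts[1] += real(val)  # 64-bit atomic on the real part
    CUDA.@atomic parts[2] += imag(val)  # 64-bit atomic on the imaginary part
    return
end

res   = CUDA.zeros(ComplexF64, 1)
parts = reinterpret(Float64, res)  # length-2 Float64 view of the same memory
@cuda threads=4 accum_kernel(parts, 1.0 + 2.0im)
CUDA.@allowscalar res[]  # 4.0 + 8.0im: four threads each added 1.0 + 2.0im
```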
Force-pushed from 7cbdf9b to 634b17c.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #1240      +/-   ##
==========================================
- Coverage   80.17%   80.16%   -0.02%
==========================================
  Files         119      119
  Lines        8390     8420      +30
==========================================
+ Hits         6727     6750      +23
- Misses       1663     1670       +7
```
Continue to review full report at Codecov.
Thanks! Pretty elegant!
Ah, I wasn't thinking about things like this. But yeah, I think complex addition is a cool application. By the way, using atomic adds like this will not guarantee a deterministic result for floats. Is that OK? I'm bringing it up here since it seems like this PR is the first patch that introduces this kind of non-determinism.
Hmm, that's a good point. It's hard/inefficient to do synchronization across blocks (which may be executing on different multiprocessors, at different points in time), so if we want to be able to easily use a variable number of blocks, that inconsistency is pretty much unavoidable. The alternative, a fully deterministic reduction, would need a second pass over per-block partial results.
Yeah, I can imagine using atomics like this can improve performance on the GPU. I think deterministic output is still preferable as the default for reproducibility/debuggability. But maybe it'd be nice to also have a mode that is non-deterministic but fast.
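To make the concern concrete: floating-point addition is not associative, so the order in which blocks land their atomic adds affects the rounding. A CPU-side illustration:

```julia
xs = Float32[1f8, 1f0, -1f8]

foldl(+, xs)             # (1f8 + 1f0) - 1f8 == 0.0f0: the 1f0 is lost to rounding
foldl(+, xs[[1, 3, 2]])  # (1f8 - 1f8) + 1f0 == 1.0f0: another order, another result
```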
Without this change, `dot(::CuArray{Float32}, ::CuArray{Bool})` will fall back to the default implementation, which triggers scalar operations. I simply calculate the sum of a view here; I'm not sure whether there is a more efficient solution. It seems `cublas.dot` does not have any Boolean-related signature.
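One GPU-friendly way to express that fallback (a sketch with hypothetical naming; the PR's actual code may differ) is to mask and reduce with broadcasting, which keeps everything on the device:

```julia
using CUDA, LinearAlgebra

function dot_bool(x::CuArray{T}, y::CuArray{Bool}) where T
    length(x) == length(y) ||
        throw(DimensionMismatch("dot product arguments have lengths $(length(x)) and $(length(y))"))
    sum(x .* y)  # broadcast and reduction both run as GPU kernels
end

x = CuArray(rand(Float32, 1024))
y = CuArray(rand(Bool, 1024))
dot_bool(x, y) ≈ dot(Array(x), Array(y))  # agrees with the CPU result
```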