
Support dot product on GPU between CuArrays with inconsistent eltypes #1240


Merged: 15 commits merged into JuliaGPU:master on Nov 17, 2021

Conversation

@findmyway (Contributor)

Without this change, dot(::CuArray{Float32}, ::CuArray{Bool}) will fall back to the default implementation, which triggers scalar operations.

I simply calculate the sum of a view here; I'm not sure whether there is a more efficient solution. It seems cublas.dot does not have any boolean-related signature.
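
For context, a minimal GPU-friendly fallback along these lines could also be written as a fused mapreduce rather than the view-based sum described above (a sketch only, not the PR's actual code; dot_mixed is a hypothetical name):

using CUDA, LinearAlgebra

# Hypothetical sketch: fuse the elementwise product and the sum into a single
# GPU mapreduce, so the generic fallback's scalar iteration is never hit.
function dot_mixed(x::CuArray{T1}, y::CuArray{T2}) where {T1,T2}
    length(x) == length(y) ||
        throw(DimensionMismatch("dot product arguments have different lengths"))
    mapreduce(*, +, x, y; init=zero(promote_type(T1, T2)))
end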

@findmyway findmyway marked this pull request as draft November 14, 2021 14:52
@findmyway findmyway marked this pull request as ready for review November 15, 2021 05:21
@findmyway findmyway requested a review from maleadt November 16, 2021 15:17
@findmyway (Contributor Author)

Based on my local tests, the performance is comparable to the CUBLAS implementation.

@findmyway findmyway changed the title Support dot product on bool cuarray Support dot product on GPU between CuArrays with inconsistent eltypes Nov 16, 2021
@maleadt (Member) left a comment

Looking good!

end
k = @cuda launch=false kernel(x, y, res, T)
config = launch_configuration(k.fun)
k(x, y, res, T; threads=min(length(x), config.threads, MAX_THREADS), blocks=config.blocks)
@maleadt (Member)

Doesn't blocks also need to be clamped?

@findmyway (Contributor Author)

Because I only do a thread-level reduction here, I think blocks doesn't need to be clamped?
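
(For reference, a launch that clamps both dimensions to the problem size would look roughly like the sketch below, mirroring the cleaned-up version further down the thread:)

config = launch_configuration(k.fun)
threads = min(length(x), config.threads)
blocks = min(config.blocks, cld(length(x), threads))  # never launch more blocks than there is work for
k(x, y, res, T; threads, blocks)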

@maleadt (Member) commented Nov 16, 2021

I spent a little time cleaning up your implementation, but it surprisingly turned out a lot slower:

function LinearAlgebra.dot(x::StridedCuArray{T1}, y::StridedCuArray{T2}) where {T1,T2}
    n = length(x)
    n==length(y) || throw(DimensionMismatch("dot product arguments have lengths $(length(x)) and $(length(y))"))

    res = CUDA.zeros(promote_type(T1, T2), 1)

    function kernel(x, y, res::AbstractArray{T}) where T
        neutral = zero(T)
        val = neutral

        # grid-stride loop
        i0 = (blockIdx().x-1i32)*blockDim().x
        @inbounds for i in i0:(blockDim().x*gridDim().x):length(x)
            # reduce_block synchronizes, so the entire block needs to participate
            j = i + threadIdx().x
            local_val = j <= length(x) ? x[j]*y[j] : neutral
            val += reduce_block(+, local_val, neutral, #=shuffle=# Val(true))
        end
        sync_threads()

        # finalize the computation
        if threadIdx().x == 1i32
            @inbounds CUDA.@atomic res[] += val
        end
        return
    end

    kernel = @cuda launch=false kernel(x, y, res)
    config = launch_configuration(kernel.fun)
    threads = min(config.threads, n)
    blocks = min(config.blocks, cld(n, threads))
    kernel(x, y, res; threads, blocks)

    CUDA.@allowscalar res[]
end

Some ideas might be worth porting to your implementation though, e.g. the grid-stride loop that doesn't require a cld. It's surprising that the shfl-based reduce_block is slower than your simple reduction, but I don't have the time to look into this closely.

@findmyway (Contributor Author) commented Nov 17, 2021

Good suggestion. I've made the necessary changes and used reduce_block instead. The code is much simpler now. reduce_block is quite general and flexible. 👍


Hmm, @maleadt, any idea why the CI complains:

Reason: unsupported dynamic function invocation (call to atomic_cas!)

https://buildkite.com/julialang/cuda-dot-jl/builds/2378#45381236-4004-4821-98a7-895cc76a4440/219-1213


Local benchmark:
julia> @benchmark CUDA.@sync dot($(cu(rand(N))), $(cu(rand(Bool, N))))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.680 μs … 431.477 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.892 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.139 μs ±   4.804 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▅▇▇█▆▄▄▃▄▄▃▁▂▁                                         
  ▂▂▂▃▄▅████████████████▇▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂ ▄
  24.7 μs         Histogram: frequency by time         29.9 μs <

 Memory estimate: 2.33 KiB, allocs estimate: 43.

julia> @benchmark CUDA.@sync dot($(cu(rand(N))), $(cu(rand(N))))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  27.526 μs … 90.905 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.185 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.724 μs ±  2.125 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▅██▄▁▁▄▇▇▄▁  ▁▃▂                             ▁▂▂▁      ▂
  ▄▄▃▅█████████████▇███▇▆▆▆▇▆▇▇▆▇▇▇█▇▆▆▇▆▇▅▆▅▆▅▇▆▆▇████████▇█ █
  27.5 μs      Histogram: log(frequency) by time      36.9 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark CUDA.@sync dot($(cu(rand(N))), $(cu(rand(Float16, N))))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.711 μs … 169.144 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     30.070 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.252 μs ±   2.084 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▂▅▇▇▇█▇▆▅▅▃▂▁                                       
  ▂▂▂▂▂▂▃▃▄▇██████████████▆▆▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
  28.7 μs         Histogram: frequency by time         33.6 μs <

 Memory estimate: 2.33 KiB, allocs estimate: 43.

@findmyway findmyway requested a review from maleadt November 17, 2021 04:25
@maleadt (Member) left a comment

Ah yes, only doing the reduce once per block was probably the reason this performed slower. I'll push some other improvements.

@maleadt (Member) commented Nov 17, 2021

Reason: unsupported dynamic function invocation (call to atomic_cas!)

The line before is important:

ERROR: InvalidIRError: compiling kernel kernel(CuDeviceVector{Bool, 1}, CuDeviceVector{ComplexF64, 1}, CuDeviceVector{ComplexF64, 1}, Val{true}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to atomic_cas!)

ComplexF64, a 128-bit type, isn't supported by any atomic operation. It should be safe to split the atomic on the real and imaginary part here; I wonder if we can generalize this to @atomic. I guess that's what @tkf hinted at with tearable atomics in JuliaLang/julia#43065?
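
(A minimal sketch of that idea, not what this PR actually does: accumulate the real and imaginary parts in two separate Float64 slots, each of which does support hardware atomics, and reassemble the complex result on the host.)

using CUDA

# Hypothetical sketch: split the 128-bit complex accumulator into two 64-bit
# real accumulators so that plain Float64 atomics can be used.
res = CUDA.zeros(Float64, 2)  # res[1] holds the real part, res[2] the imaginary part

# inside the kernel, instead of `CUDA.@atomic res[] += val`:
#     CUDA.@atomic res[1] += real(val)
#     CUDA.@atomic res[2] += imag(val)

# on the host, once the kernel has finished:
result = ComplexF64(Array(res)...)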

Anyway, I'll disable that test for now.

@maleadt maleadt force-pushed the dot_between_float_bool branch from 7cbdf9b to 634b17c on November 17, 2021 07:50
@codecov (bot) commented Nov 17, 2021

Codecov Report

Merging #1240 (634b17c) into master (8caad95) will decrease coverage by 0.01%.
The diff coverage is 73.33%.

@@            Coverage Diff             @@
##           master    #1240      +/-   ##
==========================================
- Coverage   80.17%   80.16%   -0.02%     
==========================================
  Files         119      119              
  Lines        8390     8420      +30     
==========================================
+ Hits         6727     6750      +23     
- Misses       1663     1670       +7     
Impacted Files          Coverage Δ
lib/cublas/linalg.jl    81.81% <ø> (ø)
src/accumulate.jl       100.00% <ø> (+2.94%) ⬆️
src/linalg.jl           78.84% <73.33%> (-7.52%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@findmyway (Contributor Author)

Thanks! Pretty elegant!

@maleadt maleadt merged commit 9055c10 into JuliaGPU:master Nov 17, 2021
@tkf (Contributor) commented Nov 17, 2021

I guess that's what @tkf hinted at with tearable atomics in JuliaLang/julia#43065?

Ah, I wasn't thinking about things like this. But yeah, I think complex addition is a cool application.

By the way, using atomic add like this will not guarantee a deterministic result for floats. Is it OK? I'm bringing it up here since it seems like this PR is the first patch that introduces the @atomic-based reduction (I just grepped the repo).
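
(For reference: floating-point addition is not associative, so the order in which the atomic adds land can change the rounded result. A quick illustration at the REPL:)

julia> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
false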

@maleadt (Member) commented Nov 18, 2021

By the way, using atomic add like this will not guarantee a deterministic result for floats. Is it OK? I'm bringing it up here since it seems like this PR is the first patch that introduces the @atomic-based reduction (I just grepped the repo).

Hmm, that's a good point. It's hard/inefficient to do synchronization across blocks (which may be executing on different multiprocessors, at different points in time), so if we want to be able to easily use a variable number of blocks, that inconsistency is pretty much unavoidable. The alternative, like mapreduce currently does, is to launch multiple kernels to perform the final reduction in a single block. I guess that may be preferable in some situations...
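
(A sketch of that deterministic alternative, reusing the names from the kernel posted earlier in this thread; hypothetical, and not what mapreduce actually does internally:)

# Hypothetical sketch: each block writes its partial sum into its own slot,
# so no atomics are involved, and the per-block results are combined in a
# second, fixed-order reduction, giving a reproducible answer.
partials = CUDA.zeros(promote_type(T1, T2), blocks)

# inside the kernel, thread 1 of every block would do:
#     @inbounds partials[blockIdx().x] = val

# after the kernel has run: reduce one value per block
result = sum(partials)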

@tkf (Contributor) commented Nov 18, 2021

Yeah, I can imagine that using atomics like this can improve performance on a GPU. I think deterministic output is still preferable as the default for reproducibility/debuggability, but maybe it'd be nice to also have a mode that is non-deterministic but fast.
