
Use geam for + and - operations with CuMatrix{<:CublasFloat} #1775

Merged (1 commit, Feb 17, 2023)

Conversation

@amontoison (Member)
No description provided.

@vchuravy (Member)
Is it faster?

@amontoison (Member, Author) commented Feb 16, 2023

Yes, we have a significant speed-up when we use a transpose / adjoint wrapper.
I did some benchmarks before the PR.
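For reference, a minimal sketch of the kind of dispatch this PR adds (the exact method list and signatures in the merged code may differ; this only illustrates the idea, using the `CUBLAS.geam('T', 'N', A, B)` form shown in the benchmarks below):

```julia
using CUDA, CUDA.CUBLAS
using LinearAlgebra: Transpose

# Hypothetical sketch: route + on a transposed CuMatrix through CUBLAS.geam,
# which applies the transpose inside the BLAS kernel instead of materializing
# it element-by-element with the generic broadcast fallback.
function Base.:+(A::Transpose{T,<:CuMatrix{T}}, B::CuMatrix{T}) where {T<:CUBLAS.CublasFloat}
    # parent(A) is the untransposed storage; 'T' tells geam to transpose it.
    CUBLAS.geam('T', 'N', parent(A), B)
end
```

Analogous methods would cover `Adjoint` wrappers (via `'C'`), both-sides-wrapped cases, and `-` (via a negative scaling factor).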

@maleadt (Member) commented Feb 17, 2023

> I did some benchmarks before the PR.
Including those here since Slack forgets things:

```julia
using CUDA, CUDA.CUBLAS, BenchmarkTools

T = ComplexF64
n = 10000
A = CUDA.rand(T, n, n)
B = CUDA.rand(T, n, n)

@btime CUDA.@sync CUBLAS.geam('N', 'N', A, B)  # 11.611 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A + B                        # 12.375 ms (85 allocations: 5.27 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'N', A, B)  # 11.691 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync transpose(A) + B             # 39.687 ms (86 allocations: 5.28 KiB)

@btime CUDA.@sync CUBLAS.geam('C', 'N', A, B)  # 11.675 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A' + B                       # 39.724 ms (86 allocations: 5.28 KiB)

@btime CUDA.@sync CUBLAS.geam('C', 'C', A, B)  # 12.232 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A' + B'                      # 76.601 ms (87 allocations: 5.30 KiB)

T = Float64
n = 1000
A = CUDA.rand(T, n, n)
B = CUDA.rand(T, n, n)

@btime CUDA.@sync CUBLAS.geam('N', 'N', A, B)  # 70.670 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A + B                        # 75.876 μs (34 allocations: 2.23 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'N', A, B)  # 72.377 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A' + B                       # 123.523 μs (35 allocations: 2.25 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'T', A, B)  # 73.332 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A' + B'                      # 246.884 μs (36 allocations: 2.27 KiB)
```

i.e., with transposed inputs CUBLAS is significantly faster. That makes sense: element-wise operations can be evaluated in any order, so our generic kernel just indexes the arrays naively. With a transposed input, that naive indexing traverses the parent array along the wrong dimension, so the memory accesses are not coalescable.
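To illustrate the coalescing point, here is a hypothetical naive kernel (not the actual broadcast implementation) that adds `transpose(A)` to `B`:

```julia
using CUDA

# Naive element-wise add of transpose(A) and B into C.
# Threads along x read consecutive elements of B and C (coalesced),
# but the read A[j, i] strides by the leading dimension of A, so
# neighbouring threads touch memory locations far apart (non-coalesced).
function naive_add_transpose!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        @inbounds C[i, j] = A[j, i] + B[i, j]  # A[j, i]: strided read
    end
    return nothing
end
```

`geam` avoids this by transposing inside the BLAS kernel with a tiled access pattern that keeps reads and writes coalesced.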

@maleadt maleadt added cuda array Stuff about CuArray. performance How fast can we go? labels Feb 17, 2023
@maleadt maleadt merged commit 042df36 into JuliaGPU:master Feb 17, 2023
@amontoison amontoison deleted the geam+- branch February 17, 2023 14:21
simonbyrne pushed a commit to simonbyrne/CUDA.jl that referenced this pull request Nov 13, 2023