
Use geam for + and - operations with CuMatrix{<:CublasFloat} #1775

Merged (1 commit, Feb 17, 2023)

Conversation

@amontoison (Member)
No description provided.

@vchuravy (Member)
Is it faster?

@amontoison (Member, Author) commented Feb 16, 2023

Yes, we have a significant speed-up when we use a transpose / adjoint wrapper.
I did some benchmarks before the PR.
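For reference, a minimal sketch of the kind of dispatch this PR adds (the exact method list and signatures in the merged code may differ; this only illustrates the idea, using the `CUBLAS.geam('T', 'N', A, B)` form shown in the benchmarks below):

```julia
using CUDA, CUDA.CUBLAS
using LinearAlgebra: Transpose

# Hypothetical sketch: route + on a transposed CuMatrix through CUBLAS.geam,
# which applies the transpose inside the BLAS kernel instead of materializing
# it element-by-element with the generic broadcast fallback.
function Base.:+(A::Transpose{T,<:CuMatrix{T}}, B::CuMatrix{T}) where {T<:CUBLAS.CublasFloat}
    # parent(A) is the untransposed storage; 'T' tells geam to transpose it.
    CUBLAS.geam('T', 'N', parent(A), B)
end
```

Analogous methods would cover `Adjoint` wrappers (via `'C'`), both-sides-wrapped cases, and `-` (via a negative scaling factor).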

@maleadt (Member) commented Feb 17, 2023

> I did some benchmarks before the PR.
Including those here since Slack forgets things:

```julia
using CUDA, CUDA.CUBLAS, BenchmarkTools

T = ComplexF64
n = 10000
A = CUDA.rand(T, n, n)
B = CUDA.rand(T, n, n)

@btime CUDA.@sync CUBLAS.geam('N', 'N', A, B)  # 11.611 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A + B                        # 12.375 ms (85 allocations: 5.27 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'N', A, B)  # 11.691 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync transpose(A) + B             # 39.687 ms (86 allocations: 5.28 KiB)

@btime CUDA.@sync CUBLAS.geam('C', 'N', A, B)  # 11.675 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A' + B                       # 39.724 ms (86 allocations: 5.28 KiB)

@btime CUDA.@sync CUBLAS.geam('C', 'C', A, B)  # 12.232 ms (53 allocations: 3.16 KiB)
@btime CUDA.@sync A' + B'                      # 76.601 ms (87 allocations: 5.30 KiB)

T = Float64
n = 1000
A = CUDA.rand(T, n, n)
B = CUDA.rand(T, n, n)

@btime CUDA.@sync CUBLAS.geam('N', 'N', A, B)  # 70.670 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A + B                        # 75.876 μs (34 allocations: 2.23 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'N', A, B)  # 72.377 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A' + B                       # 123.523 μs (35 allocations: 2.25 KiB)

@btime CUDA.@sync CUBLAS.geam('T', 'T', A, B)  # 73.332 μs (5 allocations: 176 bytes)
@btime CUDA.@sync A' + B'                      # 246.884 μs (36 allocations: 2.27 KiB)
```

i.e., with transposed inputs CUBLAS is significantly faster. That makes sense: element-wise operations can be evaluated in any order, so our generic kernel just indexes the arrays naively. With a transposed input, that naive indexing traverses the parent array along the wrong dimension, so the memory accesses are not coalescable.
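To illustrate the coalescing point, here is a hypothetical naive kernel (not the actual broadcast implementation) that adds `transpose(A)` to `B`:

```julia
using CUDA

# Naive element-wise add of transpose(A) and B into C.
# Threads along x read consecutive elements of B and C (coalesced),
# but the read A[j, i] strides by the leading dimension of A, so
# neighbouring threads touch memory locations far apart (non-coalesced).
function naive_add_transpose!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        @inbounds C[i, j] = A[j, i] + B[i, j]  # A[j, i]: strided read
    end
    return nothing
end
```

`geam` avoids this by transposing inside the BLAS kernel with a tiled access pattern that keeps reads and writes coalesced.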

@maleadt maleadt added cuda array Stuff about CuArray. performance How fast can we go? labels Feb 17, 2023
@maleadt maleadt merged commit 042df36 into JuliaGPU:master Feb 17, 2023
@amontoison amontoison deleted the geam+- branch February 17, 2023 14:21
simonbyrne pushed a commit to simonbyrne/CUDA.jl that referenced this pull request Nov 13, 2023