Unreasonablely slow copy kernel #1301

Closed · GiggleLiu opened this issue Jan 2, 2022 · 4 comments · Fixed by #1303
GiggleLiu (Contributor) commented Jan 2, 2022

I tried to implement a permutedims kernel, but it is much slower than the PyTorch version. I then removed all the computation, leaving only a copy kernel, and it is still very slow.

julia> using CUDA: @cartesianidx, AbstractGPUArray, gpu_call, @linearidx
julia> using CUDA, BenchmarkTools, Random

julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(ctx, dest, src)
               LI = @linearidx dest
               @inbounds dest[LI] = src[LI]
               return
           end
           gpu_call(copy_kernel, dest, src)
           return dest
       end
mycopy! (generic function with 1 method)

julia> t = CUDA.randn(fill(2, 28)...);

julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 33 samples with 1 evaluation.
 Range (min … max):  154.534 ms … 155.534 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     154.973 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   154.982 ms ± 225.369 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

As a comparison, PyTorch:

In [19]: import torch

In [20]: t = torch.zeros((2,)*28, device="cuda:0");

In [21]: timeit t.permute(tuple(torch.randperm(28))).clone(); torch.cuda.synchronize()
2.83 ms ± 600 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

The copy method from Julia Base:

julia> @benchmark CUDA.@sync CUDA.copy($t)
BenchmarkTools.Trial: 136 samples with 1 evaluation.
 Range (min … max):   7.204 ms … 243.923 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):      7.496 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.795 ms ±  77.836 ms  ┊ GC (mean ± σ):  0.48% ± 0.20%

CUDA.jl version:

(@v1.7) pkg> st CUDA
      Status `~/.julia/environments/v1.7/Project.toml`
  [052768ef] CUDA v3.6.2

The GPU is a V100, and the system CUDA version is 11.4.

Related issues:
#1298 under-Peter/OMEinsum.jl#133

GiggleLiu changed the title from "Unreasonable slow copy kernel" to "Unreasonablely slow copy kernel" on Jan 2, 2022
GiggleLiu (Contributor, Author) commented Jan 2, 2022

This is how I used to write CUDA kernels.

julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(dest, src)
               LI = (blockIdx().x-1) * blockDim().x + threadIdx().x
               @inbounds dest[LI] = src[LI]
               return
           end
           @cuda threads=256 blocks=length(dest)÷256 copy_kernel(dest, src)
           return dest
       end
mycopy! (generic function with 1 method)

julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 1744 samples with 1 evaluation.
 Range (min … max):  2.840 ms …   6.606 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.851 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.862 ms ± 173.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▃▅▆██▇▇▄▃▄▃                                    
  ▂▁▂▂▂▃▃▃▃▄▅▇██████████████▆▅▅▅▄▄▃▃▄▃▃▃▂▂▃▃▂▂▂▂▃▂▂▂▂▂▁▂▂▁▂▁▂ ▄
  2.84 ms         Histogram: frequency by time        2.87 ms <

 Memory estimate: 256 bytes, allocs estimate: 3.

Haha, what is happening with @linearidx?

maleadt (Member) commented Jan 2, 2022

Try to debug which launch configuration it is using. For something as simple as a copy kernel, the heuristics may be bad.
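
For reference, a minimal sketch of how one could query the launch configuration that CUDA.jl's occupancy API suggests for the hand-written kernel, using the documented @cuda launch=false / launch_configuration pattern. This is not necessarily what gpu_call's own heuristic picks; the kernel is just a restatement of the one from this issue.

using CUDA

function copy_kernel(dest, src)
    LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds dest[LI] = src[LI]
    return
end

t = CUDA.randn(fill(2, 28)...);
d = copy(t);

# Compile without launching, then ask the occupancy API what it would pick.
kernel = @cuda launch=false copy_kernel(d, t)
config = launch_configuration(kernel.fun)
@show config.threads config.blocks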

GiggleLiu (Contributor, Author) commented Jan 2, 2022

It turns out @linearidx is slow because length(arr) is very slow when the array has many dimensions. If I compute this quantity on the CPU and pass it into the kernel, the speed improves several-fold.
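
A minimal sketch of that workaround, assuming a 1-D launch with a fixed block size (mycopy_fastlen! is a made-up name): the only change from the hand-written kernel above is that the length is computed once on the host and passed in as a plain Int, so the device code never has to fold over the 28 dimensions.

using CUDA

function mycopy_fastlen!(dest, src)
    function copy_kernel(dest, src, len)
        LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        LI > len && return                 # bounds check against the host-computed length
        @inbounds dest[LI] = src[LI]
        return
    end
    len = length(dest)                     # computed once, on the CPU
    threads = 256
    blocks = cld(len, threads)
    @cuda threads=threads blocks=blocks copy_kernel(dest, src, len)
    return dest
end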

Do you have any recommended way to profile the program? Note that NSight and nvprof cannot look into individual instructions. I wish there were an API for counting the number of generated instructions per line, because finding this kind of pitfall is very time consuming.

maleadt (Member) commented Jan 3, 2022

Do you have any recommended method to profile program, note nsight and nvprof can not look into individual instructions.

NSight Systems is for application profiling; use NSight Compute for looking into a kernel. Source-code matching is not ideal though, because we don't yet emit the proper inlined_at info, but the PTX code should be high-level enough for basic profiling.
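
As a concrete starting point, CUDA.jl's reflection macros can dump the generated code for a kernel without any external profiler. A sketch, reusing the hand-written kernel from above:

using CUDA

function copy_kernel(dest, src)
    LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds dest[LI] = src[LI]
    return
end

t = CUDA.randn(fill(2, 28)...);
d = copy(t);

# Dump the generated PTX; @device_code_llvm and @device_code_sass work the same way.
@device_code_ptx @cuda threads=256 blocks=length(d)÷256 copy_kernel(d, t)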

It turns out @linearidx is slow because length(arr) is very slow when the arr dimension is high. I compute this quantity on CPU and use it in kernel, then the speed improved for several times.

Interesting. We could always store the length too, a la STORE_ARRAY_LEN in Base, but that's a trade-off of course (increasing the object size for every user).
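
To illustrate the trade-off: a wrapper that stores the length alongside the data makes length an O(1) field read instead of a product over all dimensions, at the cost of one extra Int per object. This is only a host-side sketch of the idea, not how GPUArrays or CUDA.jl are implemented, and LengthCached is a made-up name; a real device-side version would also need Adapt.jl support to be passed into kernels.

struct LengthCached{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    len::Int
end
LengthCached(a::AbstractArray) = LengthCached(a, length(a))

Base.size(a::LengthCached) = size(a.data)
Base.length(a::LengthCached) = a.len    # O(1) field read, no fold over 28 dimensions
Base.IndexStyle(::Type{<:LengthCached}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(a::LengthCached, i::Int) = a.data[i]
Base.@propagate_inbounds Base.setindex!(a::LengthCached, v, i::Int) = a.data[i] = v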
