Unreasonably slow copy kernel #1301
Comments
This is how I used to write CUDA kernels:

```julia
julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(dest, src)
               LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
               @inbounds dest[LI] = src[LI]
               return
           end
           @cuda threads=256 blocks=length(dest)÷256 copy_kernel(dest, src)
           return dest
       end
mycopy! (generic function with 1 method)
```
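For comparison, a sketch of the same copy kernel that lets CUDA.jl's occupancy API choose the launch configuration instead of hard-coding `threads=256`, and that adds a bounds check so the grid may safely over-cover the array (untested here; `mycopy2!` is a hypothetical name, and the occupancy-based sizing is one approach, not necessarily what `copyto!` does internally):

```julia
using CUDA

function mycopy2!(dest::CuArray, src::CuArray)
    function copy_kernel(dest, src, n)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        # Bounds check: the grid may contain more threads than elements.
        if i <= n
            @inbounds dest[i] = src[i]
        end
        return
    end
    n = length(dest)
    # Compile without launching, then query the occupancy API for a
    # configuration suited to this kernel on this device.
    kernel = @cuda launch=false copy_kernel(dest, src, n)
    config = launch_configuration(kernel.fun)
    threads = min(n, config.threads)
    blocks = cld(n, threads)  # cld covers the tail when n % threads != 0
    kernel(dest, src, n; threads, blocks)
    return dest
end
```

Note that the original `blocks=length(dest)÷256` silently skips the trailing elements whenever the length is not a multiple of 256; `cld` plus a bounds check avoids that.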
```julia
julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 1744 samples with 1 evaluation.
 Range (min … max):  2.840 ms … 6.606 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.851 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.862 ms ± 173.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▃▅▆██▇▇▄▃▄▃
  ▂▁▂▂▂▃▃▃▃▄▅▇██████████████▆▅▅▅▄▄▃▃▄▃▃▃▂▂▃▃▂▂▂▂▃▂▂▂▂▂▁▂▂▁▂▁▂ ▄
  2.84 ms        Histogram: frequency by time        2.87 ms <

 Memory estimate: 256 bytes, allocs estimate: 3.
```

Haha, what is happening?
Try to debug which launch configuration it is using. For something as simple as a copy kernel, the heuristics may be bad.
Do you have any recommended method to profile the program?
Nsight Systems is for application profiling; use Nsight Compute for looking into a kernel. Source-code matching is not ideal though, because we don't yet emit the proper inlined_at info, but the PTX code should be high-level enough for basic profiling.
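For reference, a kernel-level profile can be collected from the command line with Nsight Compute. A minimal invocation might look like the following (assuming `ncu` and `nsys` are on the PATH, and `copy_bench.jl` is a hypothetical script that launches the kernel):

```shell
# Profile every kernel launched by the Julia process; --set full collects
# the full section set (slower, but includes memory-throughput metrics).
ncu --set full --target-processes all -o copy_profile julia copy_bench.jl

# Nsight Systems, by contrast, records a timeline of the whole application:
nsys profile -o copy_timeline julia copy_bench.jl
```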
Interesting. We could always store the length too, a la …
I tried to implement a permutedims kernel; however, it is much slower than the PyTorch version. Then I tried to delete all the computation and leave only a copy kernel, and it is still very slow.
As a comparison, PyTorch: (timing output omitted).
The `copy` method in Julia Base / the CUDA version: (timing output omitted).
The GPU is a V100, and the system CUDA version is 11.4.
Related issues:
#1298 under-Peter/OMEinsum.jl#133