I tried to implement a `permutedims` kernel, but it is much slower than the PyTorch version. I then removed all the computation and left only a copy kernel — it is still very slow.
```julia
julia> using CUDA: @cartesianidx, AbstractGPUArray, gpu_call, @linearidx

julia> using CUDA, BenchmarkTools, Random

julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(ctx, dest, src)
               LI = @linearidx dest
               @inbounds dest[LI] = src[LI]
               return
           end
           gpu_call(copy_kernel, dest, src)
           return dest
       end
mycopy! (generic function with 1 method)
```
```julia
julia> t = CUDA.randn(fill(2, 28)...);

julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 33 samples with 1 evaluation.
 Range (min … max):  154.534 ms … 155.534 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     154.973 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   154.982 ms ± 225.369 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
```
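One plausible cause (my assumption, not confirmed): for a 28-dimensional array, indexing with a linear index forces a linear-to-Cartesian conversion (a chain of div/mod operations per element) inside the generic kernel. If that is the bottleneck, flattening both arrays first should recover the plain copy speed — an untested sketch, reusing the `mycopy!` defined above:

```julia
# Hypothetical workaround sketch (assumes the cost is per-element index
# arithmetic for the 28-dim shape, which this issue does not yet confirm).
using CUDA, BenchmarkTools

t = CUDA.randn(fill(2, 28)...)
d = copy(t)

# vec is a zero-copy reshape on CuArray, so the same bytes are copied,
# but the kernel now indexes a 1-D array instead of a 28-D one.
@benchmark CUDA.@sync mycopy!($(vec(d)), $(vec(t)))
```

If this runs at `copy` speed, the overhead is in the high-dimensional indexing rather than in `gpu_call` itself.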
As a comparison, PyTorch:
```python
In [19]: import torch

In [20]: t = torch.zeros((2,)*28, device="cuda:0");

In [21]: %timeit t.permute(tuple(torch.randperm(28))).clone(); torch.cuda.synchronize()
2.83 ms ± 600 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The `copy` method from Julia Base, for reference:
```julia
julia> @benchmark CUDA.@sync CUDA.copy($t)
BenchmarkTools.Trial: 136 samples with 1 evaluation.
 Range (min … max):  7.204 ms … 243.923 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):     7.496 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   36.795 ms ±  77.836 ms ┊ GC (mean ± σ):  0.48% ± 0.20%
```
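Another fast baseline worth comparing against (a sketch, not part of the original report): `copyto!` between same-typed `CuArray`s dispatches to a device-to-device memory copy rather than a generic indexing kernel, so it should be shape-independent:

```julia
# Sketch: copyto! between matching CuArrays performs a raw device-to-device
# copy, so the 28-dimensional shape should not matter here.
using CUDA, BenchmarkTools

t = CUDA.randn(fill(2, 28)...)
d = similar(t)

@benchmark CUDA.@sync copyto!($d, $t)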
CUDA version:
```
(@v1.7) pkg> st CUDA
Status `~/.julia/environments/v1.7/Project.toml`
  [052768ef] CUDA v3.6.2
```
The GPU is a V100, and the system CUDA version is 11.4.
Related issues: #1298, under-Peter/OMEinsum.jl#133