Unreasonablely slow copy kernel #1301

Closed · GiggleLiu opened this issue Jan 2, 2022 · 4 comments · Fixed by #1303
GiggleLiu (Contributor) commented Jan 2, 2022

I tried to implement a permutedims kernel, but it is much slower than the PyTorch version. I then removed all the computation, leaving only a copy kernel, and it is still very slow.

julia> using CUDA: @cartesianidx, AbstractGPUArray, gpu_call, @linearidx
julia> using CUDA, BenchmarkTools, Random

julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(ctx, dest, src)
               LI = @linearidx dest
               @inbounds dest[LI] = src[LI]
               return
           end
           gpu_call(copy_kernel, dest, src)
           return dest
       end
mycopy! (generic function with 1 method)

julia> t = CUDA.randn(fill(2, 28)...);

julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 33 samples with 1 evaluation.
 Range (min … max):  154.534 ms … 155.534 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     154.973 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   154.982 ms ± 225.369 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

As a comparison, PyTorch:

In [19]: import torch

In [20]: t = torch.zeros((2,)*28, device="cuda:0");

In [21]: timeit t.permute(tuple(torch.randperm(28))).clone(); torch.cuda.synchronize()
2.83 ms ± 600 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

The copy method from Julia Base:

julia> @benchmark CUDA.@sync CUDA.copy($t)
BenchmarkTools.Trial: 136 samples with 1 evaluation.
 Range (min … max):   7.204 ms … 243.923 ms  ┊ GC (min … max): 0.00% … 0.65%
 Time  (median):      7.496 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.795 ms ±  77.836 ms  ┊ GC (mean ± σ):  0.48% ± 0.20%

CUDA.jl version:

(@v1.7) pkg> st CUDA
      Status `~/.julia/environments/v1.7/Project.toml`
  [052768ef] CUDA v3.6.2

The GPU is a V100, and the system CUDA version is 11.4.

Related issues:
#1298 under-Peter/OMEinsum.jl#133

GiggleLiu changed the title from "Unreasonable slow copy kernel" to "Unreasonablely slow copy kernel" on Jan 2, 2022
GiggleLiu (Contributor, Author) commented Jan 2, 2022

This is how I used to write CUDA kernels.

julia> function mycopy!(dest::AbstractGPUArray, src::AbstractGPUArray)
           function copy_kernel(dest, src)
               LI = (blockIdx().x-1) * blockDim().x + threadIdx().x
               @inbounds dest[LI] = src[LI]
               return
           end
           @cuda threads=256 blocks=length(dest)÷256 copy_kernel(dest, src)
           return dest
       end
mycopy! (generic function with 1 method)

julia> @benchmark CUDA.@sync mycopy!($(copy(t)), $t)
BenchmarkTools.Trial: 1744 samples with 1 evaluation.
 Range (min … max):  2.840 ms …   6.606 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.851 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.862 ms ± 173.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▃▅▆██▇▇▄▃▄▃                                    
  ▂▁▂▂▂▃▃▃▃▄▅▇██████████████▆▅▅▅▄▄▃▃▄▃▃▃▂▂▃▃▂▂▂▂▃▂▂▂▂▂▁▂▂▁▂▁▂ ▄
  2.84 ms         Histogram: frequency by time        2.87 ms <

 Memory estimate: 256 bytes, allocs estimate: 3.

Haha, what is happening with @linearidx?

maleadt (Member) commented Jan 2, 2022

Try to debug which launch configuration it is using. For something as simple as a copy kernel, the heuristics may be bad.
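
For reference, a minimal sketch of how one could query the launch configuration that CUDA.jl's occupancy API suggests for the hand-written kernel, using the documented @cuda launch=false / launch_configuration pattern. This is not necessarily what gpu_call's own heuristic picks; the kernel is just a restatement of the one from this issue.

using CUDA

function copy_kernel(dest, src)
    LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds dest[LI] = src[LI]
    return
end

t = CUDA.randn(fill(2, 28)...);
d = copy(t);

# Compile without launching, then ask the occupancy API what it would pick.
kernel = @cuda launch=false copy_kernel(d, t)
config = launch_configuration(kernel.fun)
@show config.threads config.blocks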

GiggleLiu (Contributor, Author) commented Jan 2, 2022

It turns out @linearidx is slow because length(arr) is very slow when the array has many dimensions. If I compute this quantity on the CPU and pass it into the kernel, the speed improves several-fold.
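
A minimal sketch of that workaround, assuming a 1-D launch with a fixed block size (mycopy_fastlen! is a made-up name): the only change from the hand-written kernel above is that the length is computed once on the host and passed in as a plain Int, so the device code never has to fold over the 28 dimensions.

using CUDA

function mycopy_fastlen!(dest, src)
    function copy_kernel(dest, src, len)
        LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        LI > len && return                 # bounds check against the host-computed length
        @inbounds dest[LI] = src[LI]
        return
    end
    len = length(dest)                     # computed once, on the CPU
    threads = 256
    blocks = cld(len, threads)
    @cuda threads=threads blocks=blocks copy_kernel(dest, src, len)
    return dest
end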

Do you have any recommended way to profile the program? Note that NSight and nvprof cannot look into individual instructions. I wish there were an API for counting the number of generated instructions per line, because finding this kind of pitfall is very time consuming.

maleadt (Member) commented Jan 3, 2022

Do you have any recommended method to profile program, note nsight and nvprof can not look into individual instructions.

NSight Systems is for application profiling; use NSight Compute for looking into a kernel. Source-code matching is not ideal though, because we don't yet emit the proper inlined_at info, but the PTX code should be high-level enough for basic profiling.
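
As a concrete starting point, CUDA.jl's reflection macros can dump the generated code for a kernel without any external profiler. A sketch, reusing the hand-written kernel from above:

using CUDA

function copy_kernel(dest, src)
    LI = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds dest[LI] = src[LI]
    return
end

t = CUDA.randn(fill(2, 28)...);
d = copy(t);

# Dump the generated PTX; @device_code_llvm and @device_code_sass work the same way.
@device_code_ptx @cuda threads=256 blocks=length(d)÷256 copy_kernel(d, t)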

It turns out @linearidx is slow because length(arr) is very slow when the arr dimension is high. I compute this quantity on CPU and use it in kernel, then the speed improved for several times.

Interesting. We could always store the length too, a la STORE_ARRAY_LEN in Base, but that's a trade-off of course (increasing the object size for every user).
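
To illustrate the trade-off: a wrapper that stores the length alongside the data makes length an O(1) field read instead of a product over all dimensions, at the cost of one extra Int per object. This is only a host-side sketch of the idea, not how GPUArrays or CUDA.jl are implemented, and LengthCached is a made-up name; a real device-side version would also need Adapt.jl support to be passed into kernels.

struct LengthCached{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    len::Int
end
LengthCached(a::AbstractArray) = LengthCached(a, length(a))

Base.size(a::LengthCached) = size(a.data)
Base.length(a::LengthCached) = a.len    # O(1) field read, no fold over 28 dimensions
Base.IndexStyle(::Type{<:LengthCached}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(a::LengthCached, i::Int) = a.data[i]
Base.@propagate_inbounds Base.setindex!(a::LengthCached, v, i::Int) = a.data[i] = v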
