
gather on empty CUDA array #413

Closed
YichengDWu wants to merge 1 commit into FluxML:master from YichengDWu:master

@YichengDWu

Close #411

@YichengDWu YichengDWu changed the title gather on empty CUDA array gather on empty CUDA array May 31, 2022
@mcabbott
Member

mcabbott commented May 31, 2022

It's possible that a shortcut like that should be added here, for the CPU case. But #411 is specifically about the GPU case, and I think it will need to be fixed in NNlibCUDA.jl.

One thing that deserves a little thought is that this should probably remain an error (and ideally be tested):

julia> NNlib.gather(rand(0, 32), [2, 17, 33])
ERROR: BoundsError: attempt to access 0×32 Matrix{Float64} at index [1:0, 33]

And while here, it would be good if empty cases like this could also be checked:

julia> NNlib.gather(rand(7, 32), Int[])
7×0 Matrix{Float64}
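
The two cases above could be pinned down with tests along these lines. This is only a sketch: `gather_columns` is a hypothetical pure-Julia stand-in for the matrix/vector case, not NNlib's implementation, so the block runs without NNlib installed. In a real test suite the calls would go to `NNlib.gather` instead. The third test assumes that gathering valid indices from a zero-row source should return an empty array.

```julia
using Test

# Hypothetical stand-in for gather on a matrix with a vector of column indices.
# It is NOT NNlib's code; it only mimics the column-by-column copy.
function gather_columns(src::AbstractMatrix, idx::AbstractVector{<:Integer})
    dst = similar(src, size(src, 1), length(idx))
    for (j, k) in enumerate(idx)
        # view checks k against the column axis, so an out-of-range index
        # throws a BoundsError even when src has zero rows
        dst[:, j] .= view(src, :, k)
    end
    return dst
end

@testset "gather with empty arrays" begin
    # Out-of-range index must stay an error, even for an empty source
    @test_throws BoundsError gather_columns(rand(0, 32), [2, 17, 33])
    # Empty index gives a result with zero columns
    @test size(gather_columns(rand(7, 32), Int[])) == (7, 0)
    # Empty source with valid indices gives a result with zero rows (assumed)
    @test size(gather_columns(rand(0, 32), [2, 17])) == (0, 2)
end
```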

@YichengDWu
Author

BTW, I found it useful when I have a bunch of things like NN(vcat(u,x)), where x can sometimes just be empty.
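
For illustration, a minimal sketch of that pattern on the CPU. The sizes and the names `u`, `x`, and `input` are made up for the example, and no actual network is involved:

```julia
u = rand(Float32, 3, 5)      # 3 features for a batch of 5
x = zeros(Float32, 0, 5)     # optional extra input that happens to have 0 features

# vcat accepts the empty block, so the caller needs no special case
input = vcat(u, x)
@assert size(input) == (3, 5)
```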

@YichengDWu
Author

I think it will need to be fixed in NNlibCUDA.jl.

Exactly, I misread your comment. Sorry for the misguided attempt; let me try again.

@YichengDWu
Author

Here is another strange behavior:

julia> src = CuArray(rand(2,3))
2×3 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.18921   0.00229533  0.440581
 0.140795  0.638388    0.751325

julia> NNlib.gather(src,cu[1,4])
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
 0.18921   0.0
 0.140795  0.0

julia> NNlib.gather(src,[1,4])
ERROR: BoundsError: attempt to access 2×3 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer} at index [1:2, 4]
Stacktrace:
 [1] throw_boundserror(A::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, I::Tuple{Base.Slice{Base.OneTo{Int64}}, Int64})
   @ Base .\abstractarray.jl:691
 [2] checkbounds
   @ .\abstractarray.jl:656 [inlined]
 [3] view
   @ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\array.jl:617 [inlined]
 [4] _view(X::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, colons::Tuple{Colon}, k::Int64)
   @ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\scatter.jl:38
 [5] gather!(dst::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, src::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, idx::Vector{Int64})
   @ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:27
 [6] gather(src::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, idx::Vector{Int64})
   @ NNlib C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:77
 [7] top-level scope
   @ REPL[155]:1
 [8] top-level scope
   @ C:\Users\Luffy\.julia\packages\CUDA\qAl31\src\initialization.jl:52

julia> @which NNlib.gather(src,[1,4])
gather(src::AbstractArray{Tsrc, Nsrc}, idx::AbstractArray{Tidx, Nidx}) where {Tsrc, Nsrc, Nidx, Tidx} in NNlib at C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:70

julia> @which NNlib.gather(src,cu[1,4])
gather(src::AbstractArray{Tsrc, Nsrc}, idx::AbstractArray{Tidx, Nidx}) where {Tsrc, Nsrc, Nidx, Tidx} in NNlib at C:\Users\Luffy\.julia\packages\NNlib\hydo3\src\gather.jl:70

@mcabbott We can successfully check the bounds if the index is on the CPU, but not on the GPU? I know this could be fixed by adding a checkbounds call when src and idx both live on the GPU, but I don't understand why only one of the two cases works here.
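
The up-front check mentioned here might look roughly like the sketch below. `check_gather_bounds` is a hypothetical helper, not part of NNlib or NNlibCUDA, and it only covers integer indices into the last dimension of src. It is written against plain arrays; whether the same reduction would be acceptable for a CuArray index is an assumption that has not been tested here.

```julia
# Hypothetical helper: validate every index before any gathering happens.
function check_gather_bounds(src::AbstractArray, idx::AbstractArray{<:Integer})
    valid = axes(src, ndims(src))   # allowed indices along the last dimension
    # One reduction over idx instead of one bounds check per view
    all(i -> checkindex(Bool, valid, i), idx) || throw(BoundsError(src, idx))
    return nothing
end

src = rand(2, 3)
check_gather_bounds(src, [1, 3])      # passes silently
# check_gather_bounds(src, [1, 4])    # would throw a BoundsError
```

In the stack trace above, the CPU-index call reaches `checkbounds` through `view`, once per index, which is where its error comes from. A check like this one would have to run explicitly before the GPU path.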

@YichengDWu
Author

YichengDWu commented May 31, 2022

Do we ever need to move the index to the GPU? Everything seems to work fine if we keep it on the CPU.

One benefit is speed:

julia> @btime NNlib.gather($a,$idx)
  400.234 ms (450005 allocations: 28.99 MiB)
10×50000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.691634   0.873781   0.0642042  0.229622  0.0919693  0.88087      …  0.331066  0.401409    0.486825   0.0794942  0.0328875  0.538867
 0.382004   0.269381   0.652076   0.814351  0.0334432  0.949356       0.981492  0.486789    0.538543   0.0939153  0.0317709  0.738783
 0.64629    0.794243   0.704002   0.662857  0.938788   0.917456       0.11773   0.0184704   0.6812     0.699423   0.94094    0.298974
 0.693101   0.311339   0.281524   0.146332  0.459633   0.00642134     0.918656  0.225414    0.0749762  0.92406    0.272871   0.5222
 0.716304   0.708746   0.911626   0.912521  0.773224   0.549931       0.574009  0.00617868  0.715682   0.441322   0.0636071  0.310628
 0.550515   0.0741554  0.588083   0.106769  0.785537   0.391265    …  0.900756  0.357758    0.724709   0.727474   0.281789   0.758916
 0.0764654  0.534348   0.079201   0.758459  0.424882   0.173804       0.136813  0.982668    0.0272927  0.803712   0.524672   0.287456
 0.133258   0.908437   0.23821    0.859476  0.0171796  0.580579       0.637107  0.69076     0.0927079  0.7699     0.433433   0.787452
 0.847404   0.561789   0.846151   0.928472  0.327801   0.75679        0.107646  0.128363    0.685991   0.99785    0.818783   0.8978
 0.642568   0.383775   0.24313    0.823817  0.80557    0.814838       0.488513  0.706078    0.158599   0.904244   0.670385   0.977725

julia> @btime NNlib.gather($a,$idx_gpu)
  4.917 μs (10 allocations: 512 bytes)
10×50000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.691634   0.873781   0.0642042  0.229622  0.0919693  0.88087      …  0.331066  0.401409    0.486825   0.0794942  0.0328875  0.538867
 0.382004   0.269381   0.652076   0.814351  0.0334432  0.949356       0.981492  0.486789    0.538543   0.0939153  0.0317709  0.738783
 0.64629    0.794243   0.704002   0.662857  0.938788   0.917456       0.11773   0.0184704   0.6812     0.699423   0.94094    0.298974
 0.693101   0.311339   0.281524   0.146332  0.459633   0.00642134     0.918656  0.225414    0.0749762  0.92406    0.272871   0.5222
 0.716304   0.708746   0.911626   0.912521  0.773224   0.549931       0.574009  0.00617868  0.715682   0.441322   0.0636071  0.310628
 0.550515   0.0741554  0.588083   0.106769  0.785537   0.391265    …  0.900756  0.357758    0.724709   0.727474   0.281789   0.758916
 0.0764654  0.534348   0.079201   0.758459  0.424882   0.173804       0.136813  0.982668    0.0272927  0.803712   0.524672   0.287456
 0.133258   0.908437   0.23821    0.859476  0.0171796  0.580579       0.637107  0.69076     0.0927079  0.7699     0.433433   0.787452
 0.847404   0.561789   0.846151   0.928472  0.327801   0.75679        0.107646  0.128363    0.685991   0.99785    0.818783   0.8978
 0.642568   0.383775   0.24313    0.823817  0.80557    0.814838       0.488513  0.706078    0.158599   0.904244   0.670385   0.977725


Development

Successfully merging this pull request may close these issues.

gather is not friendly with matrix of size 0 on GPU

3 participants