Assignment using logical indexing #131

Closed

avik-pal opened this issue May 22, 2019 · 26 comments · Fixed by #724
Labels
cuda array (Stuff about CuArray), enhancement (New feature or request)

Comments

@avik-pal

Is your feature request related to a problem? Please describe.
I am trying to assign to a CuArray in the following manner. It works on the CPU but fails to compile on the GPU.

Describe the solution you'd like
a[cond] .= b
a -> CuArray of size x
cond -> Bool CuArray of size x
b -> CuArray with the same size as a[cond]

Describe alternatives you've considered
Transferring to the CPU is not a viable alternative, as the transfer overhead is massive: a has on the order of 100,000 elements.

@avik-pal
Author

A minimal example of this would be

a = cu(rand(100))
cond = a .> 0.5
b = cu(rand(size(a[cond])...))

a[cond] .= b

@dpsanders

This should be a small modification of JuliaGPU/CuArrays.jl#290

@avik-pal
Author

This is what I am currently doing

function kernel_place!(a, b, c, x)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    # bounds check (<= so the last element is handled too), then apply the mask
    if i <= length(c) && c[i]
        a[i] = b[x[i]]
    end
    return nothing
end

a = cu(rand(10))
cond = a .> 0.5
b = cu(rand(size(a[cond])...))
idx = cumsum(cond);

@cuda threads=12 kernel_place!(a, b, cond, idx)

@dpsanders

Great, so now just wrap that into a setindex function for CuArrays, in a similar way that getindex is done in the PR I referenced.

@dpsanders

I think it would probably be better to do a[cond] .= b where b is the same size as a, rather than the same size as a[cond]. I guess it depends on your use case.

@avik-pal
Author

If b is of the same size then we can simply do

function kernel_place!(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c) && c[i]
        a[i] = b[i]
    end
    return nothing
end

But a[cond] .= b won't work in this case. We would have to do a[cond] .= b[cond]
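
For comparison, a small CPU sketch of the two conventions (illustrative sizes assumed):

a = rand(10); b = rand(10); cond = a .> 0.5
a[cond] .= b[cond]            # b has the same size as a
a[cond] .= b[1:count(cond)]   # b has the same size as a[cond]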

@dpsanders

You can do a[c[i]] = b[c[i]].

@dpsanders

No actually a[i] = b[i] should work?

@avik-pal
Author

So should I do a PR for the case where b has the same size as a? IMO a[cond] .= b is more general, since b[cond] is in essence a modified b.

@dpsanders

Yes, please go ahead and make that PR with a[cond] .= b.

@maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
@maleadt added the cuda array and enhancement labels on May 27, 2020
@bjarthur
Contributor

+1 for support for logical indexing into CuArrays

the workaround is to use findall(cond) and linear indexing
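
a minimal sketch of that workaround (CUDA.jl assumed to be loaded):

using CUDA

a = CuArray(rand(Float32, 100))
cond = a .> 0.5f0
idx = findall(cond)   # integer indices, still on the GPU
a[idx] .= 0f0         # assignment through integer indexing works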

@bjarthur
Contributor

bjarthur commented Feb 1, 2021

@avik-pal i would really find this PR useful. is there a reason you didn't follow through? mind if i pick it up if you don't have time?

@avik-pal
Author

avik-pal commented Feb 2, 2021

@bjarthur sure take it up. It just slipped my mind.

@bjarthur
Contributor

bjarthur commented Feb 2, 2021

@dpsanders it doesn't seem that setindex is what we want to create a new method for (or even setindex!). rather, assignment with logical indexing calls materialize!:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> Meta.@lower a[i]    ### logical getting calls getindex()
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.getindex(a, i)
└──      return %1
))))

julia> Meta.@lower a[i] .= 7  ### but logical setting calls materialize!()
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 7)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

i'm trying to hack together a CUDA-specific materialize! method, and am getting a "KernelError: passing and using non-bitstype argument" error:

julia> import Base: materialize!

julia> @inline function materialize!(::CUDA.CuArrayStyle{1}, dest, bc::Broadcast.Broadcasted{Style}) where {Style}
           @info dest
           @info bc
           function kernel(ys, xs)
               i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

               @inbounds if i <= length(ys) 
                   ys[i] = xs[i]
               end
               return
           end

           function configurator(kernel)
               config = launch_configuration(kernel.fun)

               threads = Base.min(length(dest), config.threads)
               blocks = cld(length(dest), threads)

               return (threads=threads, blocks=blocks)
           end

           @cuda name="logical_setindex!" config=configurator kernel(dest, bc)

           return dest
       end
materialize! (generic function with 12 methods)

julia> a[i] .= 7
[ Info: [2]
[ Info: Base.Broadcast.Broadcasted(identity, (7,))
ERROR: GPU compilation of kernel logical_setindex!(SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) failed
KernelError: passing and using non-bitstype argument

Argument 2 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.


Stacktrace:
 [1] check_invocation(::GPUCompiler.CompilerJob, ::LLVM.Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:68
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:238 [inlined]
 [3] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{var"#kernel#1",Tuple{SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] kernel at ./REPL[8]:5 [inlined]
 [11] cached_compilation at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::var"#kernel#1", ::Type{Tuple{SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}}}; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:109 [inlined]
 [14] materialize! at ./REPL[8]:22 [inlined]
 [15] materialize!(::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:845
 [16] top-level scope at REPL[9]:1

suggestions would be much appreciated. thanks!

@dpsanders

Ah, interesting -- it seems vectorised assignment is different, as I guess it must be. Apologies for the misdirection.
I don't know how to help out with the issue.

@maleadt
Member

maleadt commented Feb 3, 2021

Vectorized assignment calls broadcast, which should work on the GPU already. So you don't need to implement materialize!; we already provide implementations of the lower-level copyto! methods in GPUArrays.jl.
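
For comparison, a plain broadcast assignment without a logical view already goes through those copyto! methods; a minimal sketch:

using CUDA

a = CuArray([1, 2, 3])
a .= 7     # lowers to materialize!, handled by GPUArrays.jl's copyto!
a .+= 1    # in-place fused broadcast works as well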

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt but it doesn't work. in a fresh julia session without the materialize! method above i get the same "not isbits" error:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> a[i].=7
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.


Stacktrace:
 [1] check_invocation(::GPUCompiler.CompilerJob, ::LLVM.Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:68
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:238 [inlined]
 [3] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12",Tuple{CUDA.CuKernelContext,SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] broadcast_kernel at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:60 [inlined]
 [11] cached_compilation at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::GPUArrays.var"#broadcast_kernel#12", ::Type{Tuple{CUDA.CuKernelContext,SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] cufunction at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:294 [inlined]
 [14] #launch_heuristic#853 at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:19 [inlined]
 [15] launch_heuristic at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [16] copyto! at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:66 [inlined]
 [17] copyto! at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:76 [inlined]
 [18] materialize!(::CUDA.CuArrayStyle{1}, ::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:848
 [19] materialize!(::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:845
 [20] top-level scope at REPL[4]:1

i'd really like to get this working. any advice would be appreciated.

@maleadt
Member

maleadt commented Feb 3, 2021

OK, this turned out a little more intricate than I expected. Let's start with the expansion of this broadcast:

julia> Meta.@lower a[i] .= 1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 1)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

julia> x1 = Base.dotview(a, i)
1-element view(::CuArray{Int64, 1}, [2]) with eltype Int64:
 2

julia> dump(x1)
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{Vector{Int64}}
    1: Array{Int64}((1,)) [2]
  offset1: Int64 0
  stride1: Int64 0

The problem is in the SubArray returned by (dot)view, which contains CPU indices. Peeking a little deeper, that seems to be caused by Base.unsafe_view on a LogicalIndex:

julia> dump(Base.unsafe_view(a, CuArray([1])))
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{CuArray{Int64, 1}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{CuArray{Int64, 1}}
    1: CuArray{Int64, 1}
      baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac01c00)
      offset: Int64 0
      dims: Tuple{Int64}
        1: Int64 1
      state: CUDA.ArrayState CUDA.ARRAY_MANAGED
      ctx: CuContext
        handle: Ptr{Nothing} @0x0000000003b838d0
  offset1: Int64 0
  stride1: Int64 0

julia> dump(Base.unsafe_view(a, Base.LogicalIndex(CuArray([true]))))
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{Vector{Int64}}
    1: Array{Int64}((1,)) [1]
  offset1: Int64 0
  stride1: Int64 0

That function just calls the SubArray constructor, which seems to remove the GPU-ness from these logical indices when ensure_indexable is called, which in turn calls collect on the CuArray. So this is another case where it's ambiguous what collect means; the point of collecting a LogicalIndex is presumably to get a dense array, but here it also results in a conversion to a CPU array. We can bypass that using the following method:

Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) = (I[1], Base.ensure_indexable(Base.tail(I))...)

Things then crash when launching the kernel because creating a LogicalIndex{CuDeviceArray} performs an operation on the device array, which is not a valid thing to do. We can hack around this using:

@eval function adapt_structure(to, A::Base.LogicalIndex{T}) where {T}
      # LogicalIndex's constructor performs a costly (and sometimes impossible) `count`,
      # so recreate the struct using low-level Expr(:new).
      mask = adapt(to, A.mask)
      $(Expr(:new, :(Base.LogicalIndex{T, typeof(mask)}), :mask, :(A.sum)))
end

Turns out SubArray does the same (how have we not run into this?), so:

@eval function adapt_structure(to, A::SubArray{T,N,<:Any,<:Any,L}) where {T,N,L}
    parent = adapt(to, A.parent)
    indices = adapt(to, A.indices)
    $(Expr(:new, :(SubArray{T, N, typeof(parent), typeof(indices), L}), :parent, :indices, :(A.offset1), :(A.stride1)))
end

This gets us to a kernel, but it fails to compile:

Reason: unsupported dynamic function invocation (call to error(s::Vararg{Any, N}) where N in Base at error.jl:40)
Stacktrace:
 [1] error_if_canonical_getindex
   @ abstractarray.jl:1177
 [2] getindex
   @ abstractarray.jl:1163
 [3] reindex
   @ subarray.jl:244
 [4] setindex!
   @ subarray.jl:307
 [5] _setindex!
   @ abstractarray.jl:1291
 [6] setindex!
   @ abstractarray.jl:1261
 [7] broadcast_kernel
   @ ~/Julia/pkg/GPUArrays/src/host/broadcast.jl:59

And that's basically because:

julia> Base.error_if_canonical_getindex(IndexCartesian(),Base.LogicalIndex([true]),1)
ERROR: getindex not defined for Base.LogicalIndex{Int64, Vector{Bool}}

So our broadcast implementation can't handle LogicalIndex because you can't index into one (as per the docstring, that would be an O(n) operation). Instead of returning a Bool-based LogicalIndex from ensure_indexable, we need a kernel here that collects the Int-based indices. findall does that, so:

Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) = (findall(I[1].mask), Base.ensure_indexable(Base.tail(I))...)

Et voila:

julia> a=CuArray([1,2,3])
3-element CuArray{Int64, 1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool, 1}:
 0
 1
 0

julia> a[i].=7
1-element view(::CuArray{Int64, 1}, [2]) with eltype Int64:
 7

julia> a
3-element CuArray{Int64, 1}:
 1
 7
 3

TL;DR: we need collect(::Base.LogicalIndex{<:Any,<:CuArray}) to perform a findall and return integer indices. I wonder how breaking it would be to have collect methods not return an Array, and have users explicitly call Array if they want to move stuff to the CPU (cc @vchuravy).
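
A minimal sketch of that direction (an assumption about the eventual fix, not necessarily what #724 ended up doing; the exact method and dispatch may differ):

# Hypothetical: collecting a GPU-backed LogicalIndex does a findall on the
# mask, keeping the integer indices on the GPU instead of copying to the CPU.
Base.collect(I::Base.LogicalIndex{<:Any,<:CuArray}) = findall(I.mask)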

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt i think you might not understand that we're talking about logical indexing here with non-contiguous views.

my attempt to make a new materialize! method was motivated by the fact that dotview and broadcasted both work:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> Meta.@lower a[i].=7
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 7)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

julia> dest = Base.dotview(a, i)
1-element view(::CuArray{Int64,1}, [2]) with eltype Int64:
 2

julia> bc = Base.broadcasted(Base.identity, 7)
Base.Broadcast.Broadcasted(identity, (7,))

julia> Base.materialize!(dest, bc)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.

the good news is that i now have a materialize! which works for my specific use case:

import Base: materialize!

@inline function materialize!(::CUDA.CuArrayStyle{1}, dest, bc::Broadcast.Broadcasted{Style}) where {Style}
    function kernel(ys, is, xs)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

        @inbounds if i <= length(is)
            ys[is[i]] = xs[i]
        end
        return
    end

    function configurator(kernel)
        config = launch_configuration(kernel.fun)

        threads = Base.min(length(dest), config.threads)
        blocks = cld(length(dest), threads)

        return (threads=threads, blocks=blocks)
    end

    length(bc)==length(dest.indices[1]) || throw(DimensionMismatch())
    @cuda name="logical_materialize!" config=configurator kernel(dest.parent, CuArray(dest.indices[1]), length(bc)==1 ? bc : CuArray(bc.args[1]))

    return dest
end

julia> a[i].=7
1-element view(::CuArray{Int64,1}, [2]) with eltype Int64:
 7

julia> a
3-element CuArray{Int64,1}:
 1
 7
 3

julia> i=CuArray([true,false,true])
3-element CuArray{Bool,1}:
 1
 0
 1

julia> a[i].=[10,11]
2-element view(::CuArray{Int64,1}, [1, 3]) with eltype Int64:
 10
 11

julia> a
3-element CuArray{Int64,1}:
 10
  7
 11

i think this should be useful to others, but there is more work that needs to be done to make it work for more than one dimension. let me know if you concur and i'll work up a PR.

@maleadt
Member

maleadt commented Feb 3, 2021

@maleadt i think you might not understand that we're talking about logical indexing here with non-contiguous views.

Sure I do, and no: rewriting materialize! is the wrong approach, I think. See my comment: by fixing how the LogicalIndex is treated we can reuse the existing broadcast kernel.

That the view doesn't work is also pretty clear from your error:

julia> Base.materialize!(dest, bc)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.

i.e. there's a CPU array in there.

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt thanks for the detailed investigation! my last post was made near simultaneously, so i had not read yours yet.

i'm concerned about the performance degradation from your suggestion to use a findall. my workaround so far has been to simply do a[findall(i)] .= 7. three such expressions account for 30% of my runtime. i will shortly time my custom materialize! and get back to you.

@maleadt
Member

maleadt commented Feb 3, 2021

True, it might make for a good optimization. It should depend on the other fixes I mentioned though, because now you're doing a memory copy to the CPU when doing view, and back to the GPU when launching the kernel, both of which are ridiculously expensive too. I also wonder if we really can't reuse the existing broadcast kernel, without the findall.

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

turns out my bespoke materialize! cannot do a[i] += 7, so i'd have to develop it further for my use case. i tried your solution, and while it worked of course, i found that a .= ifelse.(i, 7, a) is faster, and also faster is a .+= i*7. so i think i'm good.
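
in code, those mask-based workarounds look roughly like this (broadcast dots assumed; a is a CuArray and i a Bool CuArray of the same size):

a .= ifelse.(i, 7, a)   # overwrite the masked elements, no logical view involved
a .+= i .* 7            # add 7 only where the mask is true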

thanks again so much for diving into this. i hope to see your solution above merged into the codebase at some point, for the cases where my workarounds are not applicable.

@bjarthur
Contributor

bjarthur commented Feb 5, 2021

i just noticed that @maleadt's solution above does not work for 2-D arrays :(

julia> using CUDA

julia> import Base: ensure_indexable

julia> Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) =
                 (I[1], Base.ensure_indexable(Base.tail(I))...)

julia> Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) =
                 (findall(I[1].mask), Base.ensure_indexable(Base.tail(I))...)

julia> @eval function adapt_structure(to, A::Base.LogicalIndex{T}) where {T}
            # LogicalIndex's constructor performs a costly (and sometimes impossible) `count`,
            # so recreate the struct using low-level Expr(:new).
            mask = adapt(to, A.mask)
            $(Expr(:new, :(Base.LogicalIndex{T, typeof(mask)}), :mask, :(A.sum)))
       end
adapt_structure (generic function with 1 method)

julia> @eval function adapt_structure(to, A::SubArray{T,N,<:Any,<:Any,L}) where {T,N,L}
            parent = adapt(to, A.parent)
            indices = adapt(to, A.indices)
            $(Expr(:new, :(SubArray{T, N, typeof(parent), typeof(indices), L}), :parent, :indices, :(A.offset1), :(A.stride1)))
       end
adapt_structure (generic function with 2 methods)

julia> a=CuArray([1 2; 3 4; 5 6])
3×2 CuArray{Int64,2}:
 1  2
 3  4
 5  6

julia> i=CuArray([true,false,true])
3-element CuArray{Bool,1}:
 1
 0
 1

julia> a[i,:]
ERROR: ReadOnlyMemoryError()
Stacktrace:
 [1] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/base.jl:82 [inlined]
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:7 [inlined]
 [3] pointerref at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:7 [inlined]
 [4] unsafe_load at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:79 [inlined]
 [5] arrayref at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:80 [inlined]
 [6] getindex at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:99 [inlined]
 [7] iterate at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:145 [inlined]
 [8] iterate at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:144 [inlined]
 [9] _foldl_impl at ./reduce.jl:56 [inlined]
 [10] foldl_impl at ./reduce.jl:48 [inlined]
 [11] mapfoldl_impl at ./reduce.jl:44 [inlined]
 [12] #mapfoldl#204 at ./reduce.jl:160 [inlined]
 [13] _mapreduce_dim at ./reducedim.jl:315 [inlined]
 [14] #mapreduce#620 at ./reducedim.jl:310 [inlined]
 [15] #count#624 at ./reducedim.jl:390 [inlined]
 [16] #count#623 at ./reducedim.jl:389 [inlined]
 [17] count at ./reducedim.jl:389 [inlined]
 [18] LogicalIndex at ./multidimensional.jl:636 [inlined]
 [19] LogicalIndex at ./multidimensional.jl:638 [inlined]
 [20] adapt_structure(::CUDA.Adaptor, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}) at /groups/scicompsoft/home/arthurb/.julia/packages/Adapt/8kQMV/src/wrappers.jl:12
 [21] adapt at /groups/scicompsoft/home/arthurb/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
 [22] cudaconvert at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:143 [inlined]
 [23] map at ./tuple.jl:159 [inlined]
 [24] map at ./tuple.jl:160 [inlined] (repeats 2 times)
 [25] #launch_heuristic#853 at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [26] launch_heuristic at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [27] gpu_call(::typeof(GPUArrays.getindex_kernel), ::CuArray{Int64,2}, ::CuArray{Int64,2}, ::Tuple{Int64,Int64}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Base.Slice{Base.OneTo{Int64}}; target::CuArray{Int64,2}, total_threads::Nothing, threads::Nothing, blocks::Nothing, name::Nothing) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/device/execution.jl:61
 [28] gpu_call(::typeof(GPUArrays.getindex_kernel), ::CuArray{Int64,2}, ::CuArray{Int64,2}, ::Tuple{Int64,Int64}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Base.Slice{Base.OneTo{Int64}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/device/execution.jl:46
 [29] _getindex(::CuArray{Int64,2}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Vararg{Any,N} where N) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:135
 [30] getindex(::CuArray{Int64,2}, ::CuArray{Bool,1}, ::Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:125
 [31] top-level scope at REPL[9]:1

julia> Array(a)
3×2 Array{Int64,2}:
 1  2
 3  4
 5  6

julia> ans[i,:]
2×2 Array{Int64,2}:
 1  2
 5  6

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> a[i]
2-element CuArray{Int64,1}:
 1
 3

@bjarthur
Contributor

bjarthur commented Feb 8, 2021

oops, my last comment was not about assignment, but just N-D logical indexing, which already has an issue filed.

@zhangcx93

zhangcx93 commented Mar 26, 2021

This doesn't seem to be fixed for me; i'm using CUDA.jl 2.6.2 with Julia 1.6 stable. According to the changelog this PR should be merged already.

a = CuArray([1,2,3,4,5])
a[a .> 3] .= 10

this still won't compile, while the CPU version works.

a = [1,2,3,4,5]
a[ a .> 3] .= 10

the error message is this:

ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, which is not isbits:
.indices is of type Tuple{Vector{Int64}} which is not isbits.
.1 is of type Vector{Int64} which is not isbits.

Stacktrace:
[1] check_invocation(job::GPUCompiler.CompilerJob, entry::LLVM.Function)
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\validation.jl:68
[2] macro expansion
@ ~.julia\packages\GPUCompiler\XwWPj\src\driver.jl:287 [inlined]
[3] macro expansion
@ ~.julia\packages\TimerOutputs\4QAIk\src\TimerOutput.jl:206 [inlined]
[4] macro expansion
@ ~.julia\packages\GPUCompiler\XwWPj\src\driver.jl:286 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\utils.jl:62
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:306
[7] check_cache
@ ~.julia\packages\GPUCompiler\XwWPj\src\cache.jl:44 [inlined]
[8] cached_compilation
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:60 [inlined]
[9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12", Tuple{CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64}}}, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\cache.jl:0
[10] cufunction(f::GPUArrays.var"#broadcast_kernel#12", tt::Type{Tuple{CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:294
[11] cufunction
@ ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:288 [inlined]
[12] macro expansion
@ ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:102 [inlined]
[13] #launch_heuristic#280
@ ~.julia\packages\CUDA\qEV3Y\src\gpuarrays.jl:17 [inlined]
[14] launch_heuristic
@ ~.julia\packages\CUDA\qEV3Y\src\gpuarrays.jl:17 [inlined]
[15] copyto!
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:66 [inlined]
[16] copyto!
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:76 [inlined]
[17] materialize!
@ .\broadcast.jl:894 [inlined]
[18] materialize!(dest::SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(identity), Tuple{Int64}})
@ Base.Broadcast .\broadcast.jl:891
[19] top-level scope
@ REPL[5]:1
