Assignment using logical indexing #131

Closed

avik-pal opened this issue May 22, 2019 · 26 comments · Fixed by #724
Labels
cuda array (Stuff about CuArray), enhancement (New feature or request)

Comments

@avik-pal

Is your feature request related to a problem? Please describe.
I am trying to assign to a CuArray in the following manner. It works on the CPU but fails to compile on the GPU.

Describe the solution you'd like
a[cond] .= b
a -> CuArray of size x
cond -> Bool CuArray of size x
b -> CuArray with the same size as a[cond]

Describe alternatives you've considered
Transferring to the CPU is not a viable alternative, as the transfer overhead is massive: a has on the order of 100,000 elements.

@avik-pal
Author

A minimal example of this would be

a = cu(rand(100))
cond = a .> 0.5
b = cu(rand(size(a[cond])...))

a[cond] .= b

@dpsanders

This should be a small modification of JuliaGPU/CuArrays.jl#290

@avik-pal
Author

This is what I am currently doing

function kernel_place!(a, b, c, x)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    # bounds check (<= so the last element is handled too), then apply the mask
    if i <= length(c) && c[i]
        a[i] = b[x[i]]
    end
    return nothing
end

a = cu(rand(10))
cond = a .> 0.5
b = cu(rand(size(a[cond])...))
idx = cumsum(cond);

@cuda threads=12 kernel_place!(a, b, cond, idx)

@dpsanders

Great, so now just wrap that into a setindex function for CuArrays, in a similar way that getindex is done in the PR I referenced.

@dpsanders

I think it would probably be better to do a[cond] .= b where b is the same size as a, rather than the same size as a[cond]. I guess it depends on your use case.

@avik-pal
Author

If b is of the same size then we can simply do

function kernel_place!(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c) && c[i]
        a[i] = b[i]
    end
    return nothing
end

But a[cond] .= b won't work in this case. We would have to do a[cond] .= b[cond]
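
For comparison, a small CPU sketch of the two conventions (illustrative sizes assumed):

a = rand(10); b = rand(10); cond = a .> 0.5
a[cond] .= b[cond]            # b has the same size as a
a[cond] .= b[1:count(cond)]   # b has the same size as a[cond]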

@dpsanders

You can do a[c[i]] = b[c[i]].

@dpsanders

No actually a[i] = b[i] should work?

@avik-pal
Author

So should I do a PR for the case where b has the same size as a? IMO a[cond] .= b is more general, since b[cond] is in essence a modified b.

@dpsanders

Yes, please go ahead and make that PR with a[cond] .= b.

@maleadt transferred this issue from JuliaGPU/CuArrays.jl on May 27, 2020
@maleadt added the cuda array and enhancement labels on May 27, 2020
@bjarthur
Contributor

+1 for support for logical indexing into CuArrays

the workaround is to use findall(cond) and linear indexing
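
a minimal sketch of that workaround (CUDA.jl assumed to be loaded):

using CUDA

a = CuArray(rand(Float32, 100))
cond = a .> 0.5f0
idx = findall(cond)   # integer indices, still on the GPU
a[idx] .= 0f0         # assignment through integer indexing works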

@bjarthur
Contributor

bjarthur commented Feb 1, 2021

@avik-pal i would really find this PR useful. is there a reason you didn't follow through? mind if i pick it up if you don't have time?

@avik-pal
Author

avik-pal commented Feb 2, 2021

@bjarthur sure take it up. It just slipped my mind.

@bjarthur
Contributor

bjarthur commented Feb 2, 2021

@dpsanders it doesn't seem that setindex is what we want to create a new method for (or even setindex!). rather, assignment with logical indexing calls materialize!:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> Meta.@lower a[i]    ### logical getting calls getindex()
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.getindex(a, i)
└──      return %1
))))

julia> Meta.@lower a[i] .= 7  ### but logical setting calls materialize!()
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 7)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

i'm trying to hack together a CUDA-specific materialize! method, and am getting a "KernelError: passing and using non-bitstype argument" error:

julia> import Base: materialize!

julia> @inline function materialize!(::CUDA.CuArrayStyle{1}, dest, bc::Broadcast.Broadcasted{Style}) where {Style}
           @info dest
           @info bc
           function kernel(ys, xs)
               i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

               @inbounds if i <= length(ys) 
                   ys[i] = xs[i]
               end
               return
           end

           function configurator(kernel)
               config = launch_configuration(kernel.fun)

               threads = Base.min(length(dest), config.threads)
               blocks = cld(length(dest), threads)

               return (threads=threads, blocks=blocks)
           end

           @cuda name="logical_setindex!" config=configurator kernel(dest, bc)

           return dest
       end
materialize! (generic function with 12 methods)

julia> a[i] .= 7
[ Info: [2]
[ Info: Base.Broadcast.Broadcasted(identity, (7,))
ERROR: GPU compilation of kernel logical_setindex!(SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) failed
KernelError: passing and using non-bitstype argument

Argument 2 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.


Stacktrace:
 [1] check_invocation(::GPUCompiler.CompilerJob, ::LLVM.Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:68
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:238 [inlined]
 [3] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{var"#kernel#1",Tuple{SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] kernel at ./REPL[8]:5 [inlined]
 [11] cached_compilation at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::var"#kernel#1", ::Type{Tuple{SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}}}; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:109 [inlined]
 [14] materialize! at ./REPL[8]:22 [inlined]
 [15] materialize!(::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:845
 [16] top-level scope at REPL[9]:1

suggestions would be much appreciated. thanks!

@dpsanders

Ah, interesting -- it seems vectorised assignment is different, as I guess it must be. Apologies for the misdirection.
I don't know how to help out with the issue.

@maleadt
Member

maleadt commented Feb 3, 2021

Vectorized assignment calls broadcast, which should work on the GPU already. So you don't need to implement materialize!; we already provide implementations of the lower-level copyto! methods in GPUArrays.jl.
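
For comparison, a plain broadcast assignment without a logical view already goes through those copyto! methods; a minimal sketch:

using CUDA

a = CuArray([1, 2, 3])
a .= 7     # lowers to materialize!, handled by GPUArrays.jl's copyto!
a .+= 1    # in-place fused broadcast works as well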

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt but it doesn't work. in a fresh julia session without the materialize! method above i get the same "not isbits" error:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> a[i].=7
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.


Stacktrace:
 [1] check_invocation(::GPUCompiler.CompilerJob, ::LLVM.Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:68
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:238 [inlined]
 [3] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12",Tuple{CUDA.CuKernelContext,SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] broadcast_kernel at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:60 [inlined]
 [11] cached_compilation at /groups/scicompsoft/home/arthurb/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::GPUArrays.var"#broadcast_kernel#12", ::Type{Tuple{CUDA.CuKernelContext,SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] cufunction at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:294 [inlined]
 [14] #launch_heuristic#853 at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:19 [inlined]
 [15] launch_heuristic at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [16] copyto! at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:66 [inlined]
 [17] copyto! at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/broadcast.jl:76 [inlined]
 [18] materialize!(::CUDA.CuArrayStyle{1}, ::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:848
 [19] materialize!(::SubArray{Int64,1,CuArray{Int64,1},Tuple{Array{Int64,1}},false}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}) at ./broadcast.jl:845
 [20] top-level scope at REPL[4]:1

i'd really like to get this working. any advice would be appreciated.

@maleadt
Member

maleadt commented Feb 3, 2021

OK, this turned out a little more intricate than I expected. Let's start with the expansion of this broadcast:

julia> Meta.@lower a[i] .= 1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 1)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

julia> x1 = Base.dotview(a, i)
1-element view(::CuArray{Int64, 1}, [2]) with eltype Int64:
 2

julia> dump(x1)
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{Vector{Int64}}
    1: Array{Int64}((1,)) [2]
  offset1: Int64 0
  stride1: Int64 0

The problem is in the SubArray returned by (dot)view, which contains CPU indices. Peeking a little deeper, that seems to be caused by Base.unsafe_view on a LogicalIndex:

julia> dump(Base.unsafe_view(a, CuArray([1])))
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{CuArray{Int64, 1}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{CuArray{Int64, 1}}
    1: CuArray{Int64, 1}
      baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac01c00)
      offset: Int64 0
      dims: Tuple{Int64}
        1: Int64 1
      state: CUDA.ArrayState CUDA.ARRAY_MANAGED
      ctx: CuContext
        handle: Ptr{Nothing} @0x0000000003b838d0
  offset1: Int64 0
  stride1: Int64 0

julia> dump(Base.unsafe_view(a, Base.LogicalIndex(CuArray([true]))))
SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}
  parent: CuArray{Int64, 1}
    baseptr: CuPtr{Nothing} CuPtr{Nothing}(0x00007f83bac00000)
    offset: Int64 0
    dims: Tuple{Int64}
      1: Int64 3
    state: CUDA.ArrayState CUDA.ARRAY_MANAGED
    ctx: CuContext
      handle: Ptr{Nothing} @0x0000000003b838d0
  indices: Tuple{Vector{Int64}}
    1: Array{Int64}((1,)) [1]
  offset1: Int64 0
  stride1: Int64 0

That function just calls the SubArray constructor, which seems to remove the GPU-ness from these logical indices when ensure_indexable is called, which in turn calls collect on the CuArray. So this is another case where it's ambiguous what collect means; the point of collecting a LogicalIndex is presumably to get a dense array, but here it also results in a conversion to a CPU array. We can bypass that using the following method:

Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) = (I[1], Base.ensure_indexable(Base.tail(I))...)

Things then crash when launching the kernel because creating a LogicalIndex{CuDeviceArray} performs an operation on the device array, which is not a valid thing to do. We can hack around this using:

@eval function adapt_structure(to, A::Base.LogicalIndex{T}) where {T}
      # LogicalIndex's constructor performs a costly (and sometimes impossible) `count`,
      # so recreate the struct using low-level Expr(:new).
      mask = adapt(to, A.mask)
      $(Expr(:new, :(Base.LogicalIndex{T, typeof(mask)}), :mask, :(A.sum)))
end

Turns out SubArray does the same (how have we not run into this?), so:

@eval function adapt_structure(to, A::SubArray{T,N,<:Any,<:Any,L}) where {T,N,L}
    parent = adapt(to, A.parent)
    indices = adapt(to, A.indices)
    $(Expr(:new, :(SubArray{T, N, typeof(parent), typeof(indices), L}), :parent, :indices, :(A.offset1), :(A.stride1)))
end

This gets us to a kernel, but it fails to compile:

Reason: unsupported dynamic function invocation (call to error(s::Vararg{Any, N}) where N in Base at error.jl:40)
Stacktrace:
 [1] error_if_canonical_getindex
   @ abstractarray.jl:1177
 [2] getindex
   @ abstractarray.jl:1163
 [3] reindex
   @ subarray.jl:244
 [4] setindex!
   @ subarray.jl:307
 [5] _setindex!
   @ abstractarray.jl:1291
 [6] setindex!
   @ abstractarray.jl:1261
 [7] broadcast_kernel
   @ ~/Julia/pkg/GPUArrays/src/host/broadcast.jl:59

And that's basically because:

julia> Base.error_if_canonical_getindex(IndexCartesian(),Base.LogicalIndex([true]),1)
ERROR: getindex not defined for Base.LogicalIndex{Int64, Vector{Bool}}

So our broadcast implementation can't handle LogicalIndex because you can't index into one (as per the docstring, that would be an O(n) operation). Instead of returning a Bool-based LogicalIndex from ensure_indexable, we need a kernel here that collects the Int-based indices. findall does that, so:

Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) = (findall(I[1].mask), Base.ensure_indexable(Base.tail(I))...)

Et voila:

julia> a=CuArray([1,2,3])
3-element CuArray{Int64, 1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool, 1}:
 0
 1
 0

julia> a[i].=7
1-element view(::CuArray{Int64, 1}, [2]) with eltype Int64:
 7

julia> a
3-element CuArray{Int64, 1}:
 1
 7
 3

TL;DR: we need collect(::Base.LogicalIndex{<:Any,<:CuArray}) to perform a findall and return integer indices. I wonder how breaking it would be to have collect methods not return an Array, and have users explicitly call Array if they want to move stuff to the CPU (cc @vchuravy).
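
A minimal sketch of that direction (an assumption about the eventual fix, not necessarily what #724 ended up doing; the exact method and dispatch may differ):

# Hypothetical: collecting a GPU-backed LogicalIndex does a findall on the
# mask, keeping the integer indices on the GPU instead of copying to the CPU.
Base.collect(I::Base.LogicalIndex{<:Any,<:CuArray}) = findall(I.mask)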

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt i think you might not understand that we're talking about logical indexing here with non-contiguous views.

my attempt to make a new materialize! method was motivated by the fact that dotview and broadcasted both work:

julia> using CUDA

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> i=CuArray([false,true,false])
3-element CuArray{Bool,1}:
 0
 1
 0

julia> Meta.@lower a[i].=7
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(a, i)
│   %2 = Base.broadcasted(Base.identity, 7)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

julia> dest = Base.dotview(a, i)
1-element view(::CuArray{Int64,1}, [2]) with eltype Int64:
 2

julia> bc = Base.broadcasted(Base.identity, 7)
Base.Broadcast.Broadcasted(identity, (7,))

julia> Base.materialize!(dest, bc)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.

the good news is that i now have a materialize! which works for my specific use case:

import Base: materialize!

@inline function materialize!(::CUDA.CuArrayStyle{1}, dest, bc::Broadcast.Broadcasted{Style}) where {Style}
    function kernel(ys, is, xs)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x

        @inbounds if i <= length(is)
            ys[is[i]] = xs[i]
        end
        return
    end

    function configurator(kernel)
        config = launch_configuration(kernel.fun)

        threads = Base.min(length(dest), config.threads)
        blocks = cld(length(dest), threads)

        return (threads=threads, blocks=blocks)
    end

    length(bc)==length(dest.indices[1]) || throw(DimensionMismatch())
    @cuda name="logical_materialize!" config=configurator kernel(dest.parent, CuArray(dest.indices[1]), length(bc)==1 ? bc : CuArray(bc.args[1]))

    return dest
end

julia> a[i].=7
1-element view(::CuArray{Int64,1}, [2]) with eltype Int64:
 7

julia> a
3-element CuArray{Int64,1}:
 1
 7
 3

julia> i=CuArray([true,false,true])
3-element CuArray{Bool,1}:
 1
 0
 1

julia> a[i].=[10,11]
2-element view(::CuArray{Int64,1}, [1, 3]) with eltype Int64:
 10
 11

julia> a
3-element CuArray{Int64,1}:
 10
  7
 11

i think this should be useful to others, but there is more work that needs to be done to make it work for more than one dimension. let me know if you concur and i'll work up a PR.

@maleadt
Member

maleadt commented Feb 3, 2021

@maleadt i think you might not understand that we're talking about logical indexing here with non-contiguous views.

Sure I do, and no: rewriting materialize! is the wrong approach, I think. See my comment: by fixing how the LogicalIndex is treated we can reuse the existing broadcast kernel.

That the view doesn't work is also pretty clear from your error:

julia> Base.materialize!(dest, bc)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64,1,CuDeviceArray{Int64,1,1},Tuple{Array{Int64,1}},false}, which is not isbits:
  .indices is of type Tuple{Array{Int64,1}} which is not isbits.
    .1 is of type Array{Int64,1} which is not isbits.

i.e. there's a CPU array in there.

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

@maleadt thanks for the detailed investigation! my last post was made near simultaneously, so i had not read yours yet.

i'm concerned about the performance degradation from your suggestion to use a findall. my workaround so far has been to simply do a[findall(i)] .= 7. three such expressions account for 30% of my runtime. i will shortly time my custom materialize! and get back to you.

@maleadt
Member

maleadt commented Feb 3, 2021

True, it might make for a good optimization. It should depend on the other fixes I mentioned though, because now you're doing a memory copy to the CPU when doing view, and back to the GPU when launching the kernel, both of which are ridiculously expensive too. I also wonder if we really can't reuse the existing broadcast kernel, without the findall.

@bjarthur
Contributor

bjarthur commented Feb 3, 2021

turns out my bespoke materialize! cannot do a[i] += 7, so i'd have to develop it further for my use case. i tried your solution, and while it worked of course, i found that a .= ifelse.(i, 7, a) is faster, and also faster is a .+= i*7. so i think i'm good.
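
in code, those mask-based workarounds look roughly like this (broadcast dots assumed; a is a CuArray and i a Bool CuArray of the same size):

a .= ifelse.(i, 7, a)   # overwrite the masked elements, no logical view involved
a .+= i .* 7            # add 7 only where the mask is true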

thanks again so much for diving into this. i hope to see your solution above merged into the codebase at some point, for the cases where my workarounds are not applicable.

@bjarthur
Contributor

bjarthur commented Feb 5, 2021

i just noticed that @maleadt's solution above does not work for 2-D arrays :(

julia> using CUDA

julia> import Base: ensure_indexable

julia> Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) =
                 (I[1], Base.ensure_indexable(Base.tail(I))...)

julia> Base.ensure_indexable(I::Tuple{Base.LogicalIndex{<:Any,<:CuArray}, Vararg{Any}}) =
                 (findall(I[1].mask), Base.ensure_indexable(Base.tail(I))...)

julia> @eval function adapt_structure(to, A::Base.LogicalIndex{T}) where {T}
            # LogicalIndex's constructor performs a costly (and sometimes impossible) `count`,
            # so recreate the struct using low-level Expr(:new).
            mask = adapt(to, A.mask)
            $(Expr(:new, :(Base.LogicalIndex{T, typeof(mask)}), :mask, :(A.sum)))
       end
adapt_structure (generic function with 1 method)

julia> @eval function adapt_structure(to, A::SubArray{T,N,<:Any,<:Any,L}) where {T,N,L}
            parent = adapt(to, A.parent)
            indices = adapt(to, A.indices)
            $(Expr(:new, :(SubArray{T, N, typeof(parent), typeof(indices), L}), :parent, :indices, :(A.offset1), :(A.stride1)))
       end
adapt_structure (generic function with 2 methods)

julia> a=CuArray([1 2; 3 4; 5 6])
3×2 CuArray{Int64,2}:
 1  2
 3  4
 5  6

julia> i=CuArray([true,false,true])
3-element CuArray{Bool,1}:
 1
 0
 1

julia> a[i,:]
ERROR: ReadOnlyMemoryError()
Stacktrace:
 [1] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/base.jl:82 [inlined]
 [2] macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:7 [inlined]
 [3] pointerref at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:7 [inlined]
 [4] unsafe_load at /groups/scicompsoft/home/arthurb/.julia/packages/LLVM/MZvb3/src/interop/pointer.jl:79 [inlined]
 [5] arrayref at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:80 [inlined]
 [6] getindex at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:99 [inlined]
 [7] iterate at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:145 [inlined]
 [8] iterate at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/device/array.jl:144 [inlined]
 [9] _foldl_impl at ./reduce.jl:56 [inlined]
 [10] foldl_impl at ./reduce.jl:48 [inlined]
 [11] mapfoldl_impl at ./reduce.jl:44 [inlined]
 [12] #mapfoldl#204 at ./reduce.jl:160 [inlined]
 [13] _mapreduce_dim at ./reducedim.jl:315 [inlined]
 [14] #mapreduce#620 at ./reducedim.jl:310 [inlined]
 [15] #count#624 at ./reducedim.jl:390 [inlined]
 [16] #count#623 at ./reducedim.jl:389 [inlined]
 [17] count at ./reducedim.jl:389 [inlined]
 [18] LogicalIndex at ./multidimensional.jl:636 [inlined]
 [19] LogicalIndex at ./multidimensional.jl:638 [inlined]
 [20] adapt_structure(::CUDA.Adaptor, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}) at /groups/scicompsoft/home/arthurb/.julia/packages/Adapt/8kQMV/src/wrappers.jl:12
 [21] adapt at /groups/scicompsoft/home/arthurb/.julia/packages/Adapt/8kQMV/src/Adapt.jl:40 [inlined]
 [22] cudaconvert at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:143 [inlined]
 [23] map at ./tuple.jl:159 [inlined]
 [24] map at ./tuple.jl:160 [inlined] (repeats 2 times)
 [25] #launch_heuristic#853 at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [26] launch_heuristic at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [27] gpu_call(::typeof(GPUArrays.getindex_kernel), ::CuArray{Int64,2}, ::CuArray{Int64,2}, ::Tuple{Int64,Int64}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Base.Slice{Base.OneTo{Int64}}; target::CuArray{Int64,2}, total_threads::Nothing, threads::Nothing, blocks::Nothing, name::Nothing) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/device/execution.jl:61
 [28] gpu_call(::typeof(GPUArrays.getindex_kernel), ::CuArray{Int64,2}, ::CuArray{Int64,2}, ::Tuple{Int64,Int64}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Base.Slice{Base.OneTo{Int64}}) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/device/execution.jl:46
 [29] _getindex(::CuArray{Int64,2}, ::Base.LogicalIndex{Int64,CuArray{Bool,1}}, ::Vararg{Any,N} where N) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:135
 [30] getindex(::CuArray{Int64,2}, ::CuArray{Bool,1}, ::Function) at /groups/scicompsoft/home/arthurb/.julia/packages/GPUArrays/jhRU7/src/host/indexing.jl:125
 [31] top-level scope at REPL[9]:1

julia> Array(a)
3×2 Array{Int64,2}:
 1  2
 3  4
 5  6

julia> ans[i,:]
2×2 Array{Int64,2}:
 1  2
 5  6

julia> a=CuArray([1,2,3])
3-element CuArray{Int64,1}:
 1
 2
 3

julia> a[i]
2-element CuArray{Int64,1}:
 1
 3

@bjarthur
Contributor

bjarthur commented Feb 8, 2021

oops, my last comment was not about assignment, but just N-D logical indexing, which already has an issue filed.

@zhangcx93

zhangcx93 commented Mar 26, 2021

This doesn't seem to be fixed for me; i'm using CUDA.jl 2.6.2 with Julia 1.6 stable. According to the changelog this PR should be merged already.

a = CuArray([1,2,3,4,5])
a[a .> 3] .= 10

this still won't compile, while the CPU version works.

a = [1,2,3,4,5]
a[ a .> 3] .= 10

the error message is this:

ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, which is not isbits:
.indices is of type Tuple{Vector{Int64}} which is not isbits.
.1 is of type Vector{Int64} which is not isbits.

Stacktrace:
[1] check_invocation(job::GPUCompiler.CompilerJob, entry::LLVM.Function)
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\validation.jl:68
[2] macro expansion
@ ~.julia\packages\GPUCompiler\XwWPj\src\driver.jl:287 [inlined]
[3] macro expansion
@ ~.julia\packages\TimerOutputs\4QAIk\src\TimerOutput.jl:206 [inlined]
[4] macro expansion
@ ~.julia\packages\GPUCompiler\XwWPj\src\driver.jl:286 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\utils.jl:62
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:306
[7] check_cache
@ ~.julia\packages\GPUCompiler\XwWPj\src\cache.jl:44 [inlined]
[8] cached_compilation
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:60 [inlined]
[9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12", Tuple{CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64}}}, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~.julia\packages\GPUCompiler\XwWPj\src\cache.jl:0
[10] cufunction(f::GPUArrays.var"#broadcast_kernel#12", tt::Type{Tuple{CUDA.CuKernelContext, SubArray{Int64, 1, CuDeviceVector{Int64, 1}, Tuple{Vector{Int64}}, false}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Int64}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:294
[11] cufunction
@ ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:288 [inlined]
[12] macro expansion
@ ~.julia\packages\CUDA\qEV3Y\src\compiler\execution.jl:102 [inlined]
[13] #launch_heuristic#280
@ ~.julia\packages\CUDA\qEV3Y\src\gpuarrays.jl:17 [inlined]
[14] launch_heuristic
@ ~.julia\packages\CUDA\qEV3Y\src\gpuarrays.jl:17 [inlined]
[15] copyto!
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:66 [inlined]
[16] copyto!
@ ~.julia\packages\GPUArrays\WV76E\src\host\broadcast.jl:76 [inlined]
[17] materialize!
@ .\broadcast.jl:894 [inlined]
[18] materialize!(dest::SubArray{Int64, 1, CuArray{Int64, 1}, Tuple{Vector{Int64}}, false}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(identity), Tuple{Int64}})
@ Base.Broadcast .\broadcast.jl:891
[19] top-level scope
@ REPL[5]:1
