axpby! support for BFloat16 #1399

Closed
bjarthur opened this issue Feb 23, 2022 · 3 comments · Fixed by JuliaGPU/GPUArrays.jl#394
Labels: enhancement (New feature or request)

@bjarthur (Contributor) commented Feb 23, 2022

axpby! with BFloat16 falls back to an iterative method. I was hoping CuTensor might support it, but no joy, and it seems slower for Float32 anyway.

So I wrote a simple array-interface method for axpby! that supports BFloat16 and is competitively fast. Would this code (see the method definition near the bottom of the REPL transcript below), or a better version of it, perhaps one that calls a new scal! and axpy! to mimic the CUBLAS dataflow (sketched after the transcript), fit in anywhere? Happy to submit a PR with some guidance.

julia> using CUDA, LinearAlgebra, BFloat16s, BenchmarkTools, CUDA.CUTENSOR

julia> X=rand(100000);

julia> Y=rand(100000);

julia> cuF32X=CuArray{Float32}(X);

julia> cuF32Y=CuArray{Float32}(Y);

julia> cuB16X=CuArray{BFloat16}(X);

julia> cuB16Y=CuArray{BFloat16}(Y);

julia> cutB16X=CuTensor(cuB16X,['a']);

julia> cutB16Y=CuTensor(cuB16Y,['a']);

julia> cutF32X=CuTensor(cuF32X,['a']);

julia> cutF32Y=CuTensor(cuF32Y,['a']);

julia> @benchmark CUDA.@sync axpby!(0.5,$cuF32X,1.5,$cuF32Y)   ### FAST Float32
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.631 μs … 123.606 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.046 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.377 μs ±   2.983 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆▇██▇▆▆▅▃▂▂▂▂▂▁▁ ▁▁▂▂▂▁ ▁▁▁                                 ▃
  ████████████████████████████▇▇▆▅▅▄▆▃▅▄▅▄▅▅▅▁▄▅▅▄▁▅▅▃▅▄▅▅▃▄▃▅ █
  11.6 μs       Histogram: log(frequency) by time      17.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y)   ### VERY SLOW BFloat16
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000014ce20540010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:56
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.740 s …   2.754 s  ┊ GC (min … max): 0.23% … 0.23%
 Time  (median):     2.747 s              ┊ GC (median):    0.23%
 Time  (mean ± σ):   2.747 s ± 10.005 ms  ┊ GC (mean ± σ):  0.23% ± 0.00%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.74 s         Histogram: frequency by time        2.75 s <

 Memory estimate: 137.33 MiB, allocs estimate: 900000.

julia> @benchmark CUDA.@sync axpby!(0.5,$cutF32X,1.5,$cutF32Y)   ### KINDA SLOW Float32 w/ CuTensor
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.807 μs … 457.077 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.021 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.800 μs ±   5.090 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▇█▅▃                                                 
  ▂▄▆██▇▆▆██████▇▆▄▄▄▃▄▅▆█▆▆▆▄▅▄▄▃▂▂▃▂▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  17.8 μs         Histogram: frequency by time         28.1 μs <

 Memory estimate: 2.28 KiB, allocs estimate: 40.

julia> @benchmark CUDA.@sync axpby!(0.5,$cutB16X,1.5,$cutB16Y)   ### BROKEN BFloat16 w/ CuTensor
ERROR: CUTENSORError: an invalid value was used as an argument (code 7, CUTENSOR_STATUS_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUTENSOR.cutensorStatus_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/error.jl:58
  [2] macro expansion
    @ ~/.julia/packages/CUDA/KnJGx/lib/cutensor/error.jl:71 [inlined]
  [3] cutensorInitTensorDescriptor(handle::Base.RefValue{CUDA.CUTENSOR.cutensorHandle_t}, desc::Base.RefValue{CUDA.CUTENSOR.cutensorTensorDescriptor_t}, numModes::Int64, extent::Vector{Int64}, stride::Vector{Int64}, dataType::Type, unaryOp::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/utils/call.jl:26
  [4] CUDA.CUTENSOR.CuTensorDescriptor(a::CuTensor{BFloat16, 1}; size::Tuple{Int64}, strides::Tuple{Int64}, eltype::Type, op::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/wrappers.jl:45
  [5] elementwiseBinary!(alpha::Number, A::CuTensor, opA::CUDA.CUTENSOR.cutensorOperator_t, gamma::Number, C::CuTensor, opC::CUDA.CUTENSOR.cutensorOperator_t, D::CuTensor, opAC::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/wrappers.jl:190
  [6] axpby!
    @ ~/.julia/packages/CUDA/KnJGx/lib/cutensor/interfaces.jl:37 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/CUDA/KnJGx/src/utilities.jl:25 [inlined]
  [8] var"##core#298"(cutB16X#296::CuTensor{BFloat16, 1}, cutB16Y#297::CuTensor{BFloat16, 1})
    @ Main ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:483
  [9] var"##sample#299"(::Vector{CuTensor{BFloat16, 1}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:489
 [10] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:99
 [11] #invokelatest#2
    @ ./essentials.jl:718 [inlined]
 [12] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:34 [inlined]
 [13] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:117
 [14] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:169 [inlined]
 [15] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:169
 [16] top-level scope
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:393
 [17] top-level scope
    @ ~/.julia/packages/CUDA/KnJGx/src/initialization.jl:52

### proposed method definition
julia> LinearAlgebra.axpby!(alpha::Number, x::StridedCuArray{BFloat16},
                            beta::Number,  y::StridedCuArray{BFloat16}) = y .= x.*alpha .+ y.*beta

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y)   ### FAST BFloat16 (w/o CuTensor) !!!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.507 μs … 137.971 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.397 μs ±   2.797 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇█▇▄▁                                                        
  ▅█████▇▆▄▄▃▃▂▂▃▄▅▅▅▅▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  13.5 μs         Histogram: frequency by time         21.9 μs <

 Memory estimate: 3.14 KiB, allocs estimate: 42.
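
For reference, the scal!/axpy! variant mentioned above might look something like the following. This is only a sketch built on broadcasting (so everything still runs as GPU kernels); bf16_scal!, bf16_axpy!, and bf16_axpby! are placeholder names, not existing CUDA.jl functions:

# y = beta*y, as a broadcast kernel (placeholder name)
bf16_scal!(beta::Number, y::StridedCuArray{BFloat16}) = (y .*= beta; y)

# y += alpha*x, as a broadcast kernel (placeholder name)
bf16_axpy!(alpha::Number, x::StridedCuArray{BFloat16}, y::StridedCuArray{BFloat16}) = (y .+= alpha .* x; y)

# scal! followed by axpy! mirrors the CUBLAS dataflow: y = alpha*x + beta*y
bf16_axpby!(alpha, x, beta, y) = bf16_axpy!(alpha, x, bf16_scal!(beta, y))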
bjarthur added the enhancement label Feb 23, 2022
@maleadt (Member) commented Feb 24, 2022

proposed method definition

julia> LinearAlgebra.axpby!(alpha::Number, x::StridedCuArray{BFloat16},
                            beta::Number,  y::StridedCuArray{BFloat16}) = y .= x.*alpha .+ y.*beta

This would be a good fallback definition, and a great addition to GPUArrays.jl in https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/host/linalg.jl.
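
Something along these lines, generalized to AbstractGPUArray so every backend benefits (just a sketch; the eventual PR should also validate shapes):

using LinearAlgebra, GPUArrays

function LinearAlgebra.axpby!(alpha::Number, x::AbstractGPUArray, beta::Number, y::AbstractGPUArray)
    # guard against mismatched shapes before launching the kernel
    axes(x) == axes(y) || throw(DimensionMismatch("x and y must have the same shape"))
    y .= x .* alpha .+ y .* beta   # one fused broadcast kernel on the device
    return y
end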

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y) ### VERY SLOW BFloat16
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000014ce20540010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations do not execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:56

FWIW, benchmarking a method that does scalar iteration is not really interesting, as it's known to be slow and just a fallback for correctness. Hence the warning.

@bjarthur (Contributor, Author) commented Feb 24, 2022

Great! I'll work on a PR to GPUArrays. By "fallback" do you mean that the proper way to do this would be to add a kernel to CUDA.jl?

@maleadt (Member) commented Feb 24, 2022

by "fallback" you mean that the proper way to do this would be to add a kernel to CUDA.jl?

If CUBLAS or a similar vendor library provides an accelerated version for some types, we prefer that one (because it's likely to be faster). In that sense, a generic definition like yours is a fallback for when we can't use vendor libraries.
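
Schematically, the dispatch layering would look like this (signatures simplified; vendor_axpby! is a placeholder for the CUBLAS-backed wrapper, not a real function):

# generic fallback in GPUArrays.jl: works for any eltype, compiled to a broadcast kernel
LinearAlgebra.axpby!(a::Number, x::AbstractGPUArray, b::Number, y::AbstractGPUArray) =
    (y .= x .* a .+ y .* b; y)

# narrower method in CUDA.jl for eltypes the vendor library supports; Julia picks
# the most specific matching method, so e.g. Float32 arrays take this path
LinearAlgebra.axpby!(a::Number, x::StridedCuArray{Float32}, b::Number, y::StridedCuArray{Float32}) =
    vendor_axpby!(a, x, b, y)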
