axpby! support for BFloat16 #1399

Closed
bjarthur opened this issue Feb 23, 2022 · 3 comments · Fixed by JuliaGPU/GPUArrays.jl#394
Labels: enhancement (New feature or request)

@bjarthur (Contributor) commented Feb 23, 2022

axpby! with BFloat16 falls back to an iterative method. I was hoping CuTensor might support it, but no joy, and it seems slower for Float32 anyway.

So I wrote a simple array-interface method for axpby! that supports BFloat16 and is competitively fast. Would this code (see the method definition near the bottom of the REPL transcript below), or a better version of it, perhaps one that calls a new scal! and axpy! to mimic the CUBLAS dataflow (sketched after the transcript), fit in anywhere? Happy to submit a PR with some guidance.

julia> using CUDA, LinearAlgebra, BFloat16s, BenchmarkTools, CUDA.CUTENSOR

julia> X=rand(100000);

julia> Y=rand(100000);

julia> cuF32X=CuArray{Float32}(X);

julia> cuF32Y=CuArray{Float32}(Y);

julia> cuB16X=CuArray{BFloat16}(X);

julia> cuB16Y=CuArray{BFloat16}(Y);

julia> cutB16X=CuTensor(cuB16X,['a']);

julia> cutB16Y=CuTensor(cuB16Y,['a']);

julia> cutF32X=CuTensor(cuF32X,['a']);

julia> cutF32Y=CuTensor(cuF32Y,['a']);

julia> @benchmark CUDA.@sync axpby!(0.5,$cuF32X,1.5,$cuF32Y)   ### FAST Float32
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.631 μs … 123.606 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.046 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.377 μs ±   2.983 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▆▇██▇▆▆▅▃▂▂▂▂▂▁▁ ▁▁▂▂▂▁ ▁▁▁                                 ▃
  ████████████████████████████▇▇▆▅▅▄▆▃▅▄▅▄▅▅▅▁▄▅▅▄▁▅▅▃▅▄▅▅▃▄▃▅ █
  11.6 μs       Histogram: log(frequency) by time      17.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y)   ### VERY SLOW BFloat16
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000014ce20540010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:56
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.740 s …   2.754 s  ┊ GC (min … max): 0.23% … 0.23%
 Time  (median):     2.747 s              ┊ GC (median):    0.23%
 Time  (mean ± σ):   2.747 s ± 10.005 ms  ┊ GC (mean ± σ):  0.23% ± 0.00%

  █                                                       █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.74 s         Histogram: frequency by time        2.75 s <

 Memory estimate: 137.33 MiB, allocs estimate: 900000.

julia> @benchmark CUDA.@sync axpby!(0.5,$cutF32X,1.5,$cutF32Y)   ### KINDA SLOW Float32 w/ CuTensor
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.807 μs … 457.077 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.021 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.800 μs ±   5.090 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▇█▅▃                                                 
  ▂▄▆██▇▆▆██████▇▆▄▄▄▃▄▅▆█▆▆▆▄▅▄▄▃▂▂▃▂▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  17.8 μs         Histogram: frequency by time         28.1 μs <

 Memory estimate: 2.28 KiB, allocs estimate: 40.

julia> @benchmark CUDA.@sync axpby!(0.5,$cutB16X,1.5,$cutB16Y)   ### BROKEN BFloat16 w/ CuTensor
ERROR: CUTENSORError: an invalid value was used as an argument (code 7, CUTENSOR_STATUS_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUTENSOR.cutensorStatus_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/error.jl:58
  [2] macro expansion
    @ ~/.julia/packages/CUDA/KnJGx/lib/cutensor/error.jl:71 [inlined]
  [3] cutensorInitTensorDescriptor(handle::Base.RefValue{CUDA.CUTENSOR.cutensorHandle_t}, desc::Base.RefValue{CUDA.CUTENSOR.cutensorTensorDescriptor_t}, numModes::Int64, extent::Vector{Int64}, stride::Vector{Int64}, dataType::Type, unaryOp::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/utils/call.jl:26
  [4] CUDA.CUTENSOR.CuTensorDescriptor(a::CuTensor{BFloat16, 1}; size::Tuple{Int64}, strides::Tuple{Int64}, eltype::Type, op::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/wrappers.jl:45
  [5] elementwiseBinary!(alpha::Number, A::CuTensor, opA::CUDA.CUTENSOR.cutensorOperator_t, gamma::Number, C::CuTensor, opC::CUDA.CUTENSOR.cutensorOperator_t, D::CuTensor, opAC::CUDA.CUTENSOR.cutensorOperator_t)
    @ CUDA.CUTENSOR ~/.julia/packages/CUDA/KnJGx/lib/cutensor/wrappers.jl:190
  [6] axpby!
    @ ~/.julia/packages/CUDA/KnJGx/lib/cutensor/interfaces.jl:37 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/CUDA/KnJGx/src/utilities.jl:25 [inlined]
  [8] var"##core#298"(cutB16X#296::CuTensor{BFloat16, 1}, cutB16Y#297::CuTensor{BFloat16, 1})
    @ Main ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:483
  [9] var"##sample#299"(::Vector{CuTensor{BFloat16, 1}}, __params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:489
 [10] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctrial, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:99
 [11] #invokelatest#2
    @ ./essentials.jl:718 [inlined]
 [12] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:34 [inlined]
 [13] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:117
 [14] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:169 [inlined]
 [15] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:169
 [16] top-level scope
    @ ~/.julia/packages/BenchmarkTools/3UOOD/src/execution.jl:393
 [17] top-level scope
    @ ~/.julia/packages/CUDA/KnJGx/src/initialization.jl:52

### proposed method definition
julia> LinearAlgebra.axpby!(alpha::Number, x::StridedCuArray{BFloat16},
                            beta::Number,  y::StridedCuArray{BFloat16}) = y .= x.*alpha .+ y.*beta

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y)   ### FAST BFloat16 (w/o CuTensor) !!!
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.507 μs … 137.971 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.397 μs ±   2.797 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇█▇▄▁                                                        
  ▅█████▇▆▄▄▃▃▂▂▃▄▅▅▅▅▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  13.5 μs         Histogram: frequency by time         21.9 μs <

 Memory estimate: 3.14 KiB, allocs estimate: 42.
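
For reference, the scal!/axpy! variant mentioned above might look something like the following. This is only a sketch built on broadcasting (so everything still runs as GPU kernels); bf16_scal!, bf16_axpy!, and bf16_axpby! are placeholder names, not existing CUDA.jl functions:

# y = beta*y, as a broadcast kernel (placeholder name)
bf16_scal!(beta::Number, y::StridedCuArray{BFloat16}) = (y .*= beta; y)

# y += alpha*x, as a broadcast kernel (placeholder name)
bf16_axpy!(alpha::Number, x::StridedCuArray{BFloat16}, y::StridedCuArray{BFloat16}) = (y .+= alpha .* x; y)

# scal! followed by axpy! mirrors the CUBLAS dataflow: y = alpha*x + beta*y
bf16_axpby!(alpha, x, beta, y) = bf16_axpy!(alpha, x, bf16_scal!(beta, y))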
bjarthur added the enhancement label Feb 23, 2022
@maleadt (Member) commented Feb 24, 2022

proposed method definition

julia> LinearAlgebra.axpby!(alpha::Number, x::StridedCuArray{BFloat16},
                            beta::Number,  y::StridedCuArray{BFloat16}) = y .= x.*alpha .+ y.*beta

This would be a good fallback definition, and a great addition to GPUArrays.jl in https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/host/linalg.jl.
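
Something along these lines, generalized to AbstractGPUArray so every backend benefits (just a sketch; the eventual PR should also validate shapes):

using LinearAlgebra, GPUArrays

function LinearAlgebra.axpby!(alpha::Number, x::AbstractGPUArray, beta::Number, y::AbstractGPUArray)
    # guard against mismatched shapes before launching the kernel
    axes(x) == axes(y) || throw(DimensionMismatch("x and y must have the same shape"))
    y .= x .* alpha .+ y .* beta   # one fused broadcast kernel on the device
    return y
end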

julia> @benchmark CUDA.@sync axpby!(0.5,$cuB16X,1.5,$cuB16Y) ### VERY SLOW BFloat16
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000014ce20540010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations do not execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/umZob/src/host/indexing.jl:56

FWIW, benchmarking a method that does scalar iteration is not really interesting, as it's known to be slow and just a fallback for correctness. Hence the warning.

@bjarthur (Contributor, Author) commented Feb 24, 2022

Great! I'll work on a PR to GPUArrays. By "fallback" do you mean that the proper way to do this would be to add a kernel to CUDA.jl?

@maleadt (Member) commented Feb 24, 2022

by "fallback" you mean that the proper way to do this would be to add a kernel to CUDA.jl?

If CUBLAS or a similar vendor library provides an accelerated version for some types, we prefer that one (because it's likely to be faster). In that sense, a generic definition like yours is a fallback for when we can't use vendor libraries.
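
Schematically, the dispatch layering would look like this (signatures simplified; vendor_axpby! is a placeholder for the CUBLAS-backed wrapper, not a real function):

# generic fallback in GPUArrays.jl: works for any eltype, compiled to a broadcast kernel
LinearAlgebra.axpby!(a::Number, x::AbstractGPUArray, b::Number, y::AbstractGPUArray) =
    (y .= x .* a .+ y .* b; y)

# narrower method in CUDA.jl for eltypes the vendor library supports; Julia picks
# the most specific matching method, so e.g. Float32 arrays take this path
LinearAlgebra.axpby!(a::Number, x::StridedCuArray{Float32}, b::Number, y::StridedCuArray{Float32}) =
    vendor_axpby!(a, x, b, y)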
