
Conversation

@DhairyaLGandhi
Member

This isn't sufficient to run BFloat16 kernels yet, but it's a start towards getting CUDNN's BFloat16 type recognised. Currently the type is mapped from BFloat16s.jl, which is already a dependency of CUDA.jl, but it would hopefully be replaced by the language's own version once that is added.

@maleadt
Member

maleadt commented Aug 9, 2021

This will probably require some logic like we have for gemmEx, to determine an appropriate compute type for a given set of inputs:

function gemmExComputeType(TA, TB, TC, m, k, n)
    if TA !== TB
        return nothing
    end
    sig = (TA, TC)

    # gemmEx requires sm_50 or higher
    cap = capability(device())
    if cap < v"5"
        return nothing
    end

    # source: CUBLAS Features and Technical Specifications
    if Float16 in sig && cap < v"5.3"
        return nothing
    end

    math_mode = CUDA.math_mode()
    reduced_precision = CUDA.math_precision()

    if sig === (Float16, Float16)
        # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_16F_PEDANTIC : CUBLAS_COMPUTE_16F
    end

    if m%4 == 0 && n%4 == 0 && k%4 == 0 && sig === (Int8, Int32)
        CUDA.version() >= v"11.2" && return nothing # NVIDIA bug #3221266
        # Int32=Int8*Int8 requires m,n,k to be multiples of 4
        # https://forums.developer.nvidia.com/t/cublasgemmex-cant-use-cuda-r-8i-compute-type-on-gtx1080/58100/2
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32I_PEDANTIC : CUBLAS_COMPUTE_32I
    end

    if math_mode == CUDA.FAST_MATH
        if sig === (Float32, Float32) ||
           sig === (Complex{Float32}, Complex{Float32})
            if reduced_precision === :Float16
                return CUBLAS_COMPUTE_32F_FAST_16F
            elseif reduced_precision === :BFloat16
                return CUBLAS_COMPUTE_32F_FAST_16BF
            elseif reduced_precision === :TensorFloat32
                return CUBLAS_COMPUTE_32F_FAST_TF32
            else
                throw(ArgumentError("Unknown reduced precision type $reduced_precision"))
            end
        end
    end

    if sig === (Float16, Float16) ||
       sig === (Int8, Float32) ||
       sig === (Float16, Float32) ||
       sig === (Float32, Float32) ||
       sig === (Complex{Int8}, Complex{Float32}) ||
       sig === (Complex{Float32}, Complex{Float32})
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32F_PEDANTIC : CUBLAS_COMPUTE_32F
    end

    if sig === (Float64, Float64) ||
       sig === (Complex{Float64}, Complex{Float64})
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_64F_PEDANTIC : CUBLAS_COMPUTE_64F
    end

    # BFloat16 support was added in CUDA 11
    if version() >= v"11"
        if sig === (BFloat16, BFloat16) ||
           sig === (BFloat16, Float32)
            return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32F_PEDANTIC : CUBLAS_COMPUTE_32F
        end
    end

    return nothing
end
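
To illustrate how this resolves for a few signatures (not part of the PR; it assumes an sm_80 device, CUDA 11+, the default math mode, and that the function above is reachable, e.g. as CUDA.CUBLAS.gemmExComputeType):

using CUDA, BFloat16s
using CUDA.CUBLAS

CUBLAS.gemmExComputeType(Float16, Float16, Float16, 256, 256, 256)
# -> CUBLAS_COMPUTE_16F
CUBLAS.gemmExComputeType(BFloat16, BFloat16, Float32, 256, 256, 256)
# -> CUBLAS_COMPUTE_32F (BFloat16 inputs accumulate in 32-bit)
CUBLAS.gemmExComputeType(BFloat16, BFloat16, BFloat16, 256, 256, 256)
# -> CUBLAS_COMPUTE_32F
CUBLAS.gemmExComputeType(Float64, Float32, Float64, 256, 256, 256)
# -> nothing (mismatched A/B element types are rejected)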

For non-mutating APIs, we may want to extend this (both for matrix multiplication and for the DNN APIs you want to wrap) so that it also figures out an appropriate output type (e.g. depending on the CUDA math mode). Doing all of this ad hoc for every mixed-mode API seems bad though, so we probably need a more systematic solution.
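
As a purely hypothetical sketch of what such an output-type helper could look like (the name and the widening policy here are made up for illustration; this is exactly the policy question the comment above leaves open, not something CUDA.jl provides):

using CUDA, BFloat16s

function gemmExOutputType(TA, TB)
    TA !== TB && return nothing

    # under fast math keep the reduced-precision result type; otherwise widen
    # half-width inputs to Float32 so the 32-bit accumulation isn't thrown away
    if TA === Float16 || TA === BFloat16
        return CUDA.math_mode() == CUDA.FAST_MATH ? TA : Float32
    end

    return TA
end

A non-mutating matmul wrapper could then allocate `similar(A, gemmExOutputType(eltype(A), eltype(B)), (size(A, 1), size(B, 2)))` and forward to the mutating path.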

maleadt added the cuda libraries label (Stuff about CUDA library wrappers) on Aug 9, 2021
@DhairyaLGandhi
Member Author

Right. I was thinking of following what NVIDIA suggests for accumulation etc., since those are likely the best-tested versions of these kernels. I'm still a bit unsure how to choose the math mode. I'm assuming there's a complementary math mode for bfloats, as there is for Float32 and Float64?

@maleadt
Member

maleadt commented Aug 9, 2021

I'm still a bit unsure how to choose the math mode. I'm assuming there's a complementary math mode for bfloats, as there is for Float32 and Float64?

I'm not sure what you mean. We have a CUDA.jl math mode:

CUDA.jl/src/state.jl, lines 18 to 30 at 92622ed:

@enum MathMode begin
    # use prescribed precision and standardized arithmetic for all calculations.
    # this may serialize operations, and reduce performance.
    PEDANTIC_MATH

    # use at least the required precision, and allow reordering operations for performance.
    DEFAULT_MATH

    # additionally allow downcasting operations for better use of hardware resources.
    # whenever possible the `precision` flag passed to `math_mode!` will be used
    # to constrain those downcasts.
    FAST_MATH
end
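
For reference, the mode and the reduced-precision flag are what gemmExComputeType above queries via CUDA.math_mode() and CUDA.math_precision(); the keyword form of math_mode! here is my reading of the `precision` flag mentioned in the enum's comments, so treat it as an assumption:

using CUDA

CUDA.math_mode!(CUDA.FAST_MATH; precision=:BFloat16)
CUDA.math_mode()       # -> CUDA.FAST_MATH
CUDA.math_precision()  # -> :BFloat16
# with these settings, gemmExComputeType(Float32, Float32, Float32, ...) above
# would return CUBLAS_COMPUTE_32F_FAST_16BF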

When performing API calls, we either convert that math mode to the library-specific one (for old APIs), or use it to determine which compute type to request (for new APIs, which 'express' the math mode through the compute type).
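
Roughly, a call site would look like this (a sketch only: it assumes the CUBLAS.gemm!/gemmEx! wrappers with their usual signatures, and is not the actual CUDA.jl dispatch code):

using CUDA
using CUDA.CUBLAS

function sketch_matmul!(C::CuMatrix, A::CuMatrix, B::CuMatrix)
    computeT = CUBLAS.gemmExComputeType(eltype(A), eltype(B), eltype(C),
                                        size(A, 1), size(A, 2), size(B, 2))
    if computeT !== nothing
        # new-style API: the math mode is expressed through the compute type
        CUBLAS.gemmEx!('N', 'N', one(eltype(C)), A, B, zero(eltype(C)), C)
    else
        # old-style API: relies on the math mode that was set on the handle
        CUBLAS.gemm!('N', 'N', one(eltype(C)), A, B, zero(eltype(C)), C)
    end
    return C
end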

@DhairyaLGandhi
Member Author

I meant the likes of CUBLAS_COMPUTE_32F_PEDANTIC; sorry, I should have been clearer. I'm assuming those are what count as the library-specific ones. I'll have to scour the codebase to see everywhere we need to dispatch to BF16 that isn't already handled.

@maleadt
Member

maleadt commented Aug 9, 2021

CUBLAS_COMPUTE_32F_PEDANTIC

That's the 'new-style' math mode, specified per API via the compute type. For older CUBLAS APIs we need to set the per-handle math mode:

function math_mode!(handle, mode)
    flags = 0

    # https://github.com/facebookresearch/faiss/issues/1385
    if version() > v"11"
        flags = CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
    end

    flags |= if mode == CUDA.PEDANTIC_MATH
        # prevent use of tensor cores
        if version() < v"11"
            CUBLAS_DEFAULT_MATH
        else
            CUBLAS_PEDANTIC_MATH
        end
    elseif mode == CUDA.DEFAULT_MATH
        # use tensor cores, but don't reduce precision
        if version() < v"11"
            CUBLAS_TENSOR_OP_MATH
        else
            CUBLAS_DEFAULT_MATH
        end
    elseif mode == CUDA.FAST_MATH
        # we'll additionally select a compute-mode with reduced precision whenever possible
        if version() < v"11"
            CUBLAS_TENSOR_OP_MATH
        else
            CUBLAS_TF32_TENSOR_OP_MATH
        end
    end

    cublasSetMathMode(handle, flags)
    return
end
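
Usage would be along these lines (assuming CUBLAS.handle() returns the task-local cuBLAS handle, as used elsewhere in the wrappers), pushing the package-level math mode onto the handle before making old-style calls:

using CUDA
using CUDA.CUBLAS

CUBLAS.math_mode!(CUBLAS.handle(), CUDA.math_mode())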
