
Conversation

@DhairyaLGandhi
Member

This isn't sufficient to run BFloat16 kernels yet, but it's a start towards getting CUDNN's BFloat16 type recognised. Currently the type is mapped from BFloat16s.jl, which is already a dependency of CUDA.jl, but it would hopefully be replaced by the language's own version once that is added.

@maleadt
Member

maleadt commented Aug 9, 2021

This will probably require some logic like we have for gemmEx, to determine an appropriate compute type for a given set of inputs:

function gemmExComputeType(TA, TB, TC, m, k, n)
    if TA !== TB
        return nothing
    end
    sig = (TA, TC)

    # gemmEx requires sm_50 or higher
    cap = capability(device())
    if cap < v"5"
        return nothing
    end

    # source: CUBLAS Features and Technical Specifications
    if Float16 in sig && cap < v"5.3"
        return nothing
    end

    math_mode = CUDA.math_mode()
    reduced_precision = CUDA.math_precision()

    if sig === (Float16, Float16)
        # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_16F_PEDANTIC : CUBLAS_COMPUTE_16F
    end

    if m%4 == 0 && n%4 == 0 && k%4 == 0 && sig === (Int8, Int32)
        CUDA.version() >= v"11.2" && return nothing # NVIDIA bug #3221266
        # Int32=Int8*Int8 requires m,n,k to be multiples of 4
        # https://forums.developer.nvidia.com/t/cublasgemmex-cant-use-cuda-r-8i-compute-type-on-gtx1080/58100/2
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32I_PEDANTIC : CUBLAS_COMPUTE_32I
    end

    if math_mode == CUDA.FAST_MATH
        if sig === (Float32, Float32) ||
           sig === (Complex{Float32}, Complex{Float32})
            if reduced_precision === :Float16
                return CUBLAS_COMPUTE_32F_FAST_16F
            elseif reduced_precision === :BFloat16
                return CUBLAS_COMPUTE_32F_FAST_16BF
            elseif reduced_precision === :TensorFloat32
                return CUBLAS_COMPUTE_32F_FAST_TF32
            else
                throw(ArgumentError("Unknown reduced precision type $reduced_precision"))
            end
        end
    end

    if sig === (Float16, Float16) ||
       sig === (Int8, Float32) ||
       sig === (Float16, Float32) ||
       sig === (Float32, Float32) ||
       sig === (Complex{Int8}, Complex{Float32}) ||
       sig === (Complex{Float32}, Complex{Float32})
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32F_PEDANTIC : CUBLAS_COMPUTE_32F
    end

    if sig === (Float64, Float64) ||
       sig === (Complex{Float64}, Complex{Float64})
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_64F_PEDANTIC : CUBLAS_COMPUTE_64F
    end

    # BFloat16 support was added in CUDA 11
    if version() >= v"11"
        if sig === (BFloat16, BFloat16) ||
           sig === (BFloat16, Float32)
            return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_32F_PEDANTIC : CUBLAS_COMPUTE_32F
        end
    end

    return nothing
end
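
To illustrate how this resolves for a few signatures (not part of the PR; it assumes an sm_80 device, CUDA 11+, the default math mode, and that the function above is reachable, e.g. as CUDA.CUBLAS.gemmExComputeType):

using CUDA, BFloat16s
using CUDA.CUBLAS

CUBLAS.gemmExComputeType(Float16, Float16, Float16, 256, 256, 256)
# -> CUBLAS_COMPUTE_16F
CUBLAS.gemmExComputeType(BFloat16, BFloat16, Float32, 256, 256, 256)
# -> CUBLAS_COMPUTE_32F (BFloat16 inputs accumulate in 32-bit)
CUBLAS.gemmExComputeType(BFloat16, BFloat16, BFloat16, 256, 256, 256)
# -> CUBLAS_COMPUTE_32F
CUBLAS.gemmExComputeType(Float64, Float32, Float64, 256, 256, 256)
# -> nothing (mismatched A/B element types are rejected)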

For non-mutating APIs, we may want to extend this (both for matrix multiplication and for the DNN APIs you want to wrap) so that it also figures out an appropriate output type (e.g. depending on the CUDA math mode). Doing all of this ad hoc for every mixed-mode API seems bad though, so we probably need a more systematic solution.
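
As a purely hypothetical sketch of what such an output-type helper could look like (the name and the widening policy here are made up for illustration; this is exactly the policy question the comment above leaves open, not something CUDA.jl provides):

using CUDA, BFloat16s

function gemmExOutputType(TA, TB)
    TA !== TB && return nothing

    # under fast math keep the reduced-precision result type; otherwise widen
    # half-width inputs to Float32 so the 32-bit accumulation isn't thrown away
    if TA === Float16 || TA === BFloat16
        return CUDA.math_mode() == CUDA.FAST_MATH ? TA : Float32
    end

    return TA
end

A non-mutating matmul wrapper could then allocate `similar(A, gemmExOutputType(eltype(A), eltype(B)), (size(A, 1), size(B, 2)))` and forward to the mutating path.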

maleadt added the cuda libraries label (Stuff about CUDA library wrappers) on Aug 9, 2021
@DhairyaLGandhi
Member Author

Right. I was thinking of following what NVIDIA suggests for accumulation etc., since those are likely the best-tested versions of these kernels. I'm still a bit unsure how to choose the math mode. I'm assuming there's a complementary math mode for bfloats, as there is for Float32 and Float64?

@maleadt
Member

maleadt commented Aug 9, 2021

I'm still a bit unsure how to choose the math mode. I'm assuming there's a complementary math mode for bfloats, as there is for Float32 and Float64?

I'm not sure what you mean. We have a CUDA.jl math mode:

CUDA.jl/src/state.jl, lines 18 to 30 at 92622ed:

@enum MathMode begin
    # use prescribed precision and standardized arithmetic for all calculations.
    # this may serialize operations, and reduce performance.
    PEDANTIC_MATH

    # use at least the required precision, and allow reordering operations for performance.
    DEFAULT_MATH

    # additionally allow downcasting operations for better use of hardware resources.
    # whenever possible the `precision` flag passed to `math_mode!` will be used
    # to constrain those downcasts.
    FAST_MATH
end
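
For reference, the mode and the reduced-precision flag are what gemmExComputeType above queries via CUDA.math_mode() and CUDA.math_precision(); the keyword form of math_mode! here is my reading of the `precision` flag mentioned in the enum's comments, so treat it as an assumption:

using CUDA

CUDA.math_mode!(CUDA.FAST_MATH; precision=:BFloat16)
CUDA.math_mode()       # -> CUDA.FAST_MATH
CUDA.math_precision()  # -> :BFloat16
# with these settings, gemmExComputeType(Float32, Float32, Float32, ...) above
# would return CUBLAS_COMPUTE_32F_FAST_16BF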

When performing API calls, we either convert that math mode to the library-specific one (for old APIs), or use it to determine which compute type to request (for new APIs, which 'express' the math mode through the compute type).
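
Roughly, a call site would look like this (a sketch only: it assumes the CUBLAS.gemm!/gemmEx! wrappers with their usual signatures, and is not the actual CUDA.jl dispatch code):

using CUDA
using CUDA.CUBLAS

function sketch_matmul!(C::CuMatrix, A::CuMatrix, B::CuMatrix)
    computeT = CUBLAS.gemmExComputeType(eltype(A), eltype(B), eltype(C),
                                        size(A, 1), size(A, 2), size(B, 2))
    if computeT !== nothing
        # new-style API: the math mode is expressed through the compute type
        CUBLAS.gemmEx!('N', 'N', one(eltype(C)), A, B, zero(eltype(C)), C)
    else
        # old-style API: relies on the math mode that was set on the handle
        CUBLAS.gemm!('N', 'N', one(eltype(C)), A, B, zero(eltype(C)), C)
    end
    return C
end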

@DhairyaLGandhi
Member Author

I meant the likes of CUBLAS_COMPUTE_32F_PEDANTIC; sorry, I should have been clearer. I'm assuming those are what count as the library-specific ones. I'll have to scour the codebase to see everywhere we need to dispatch to BF16 that isn't already handled.

@maleadt
Member

maleadt commented Aug 9, 2021

CUBLAS_COMPUTE_32F_PEDANTIC

That's the 'new-style' math mode, specified per API via the compute type. For older CUBLAS APIs we need to set the per-handle math mode:

function math_mode!(handle, mode)
    flags = 0

    # https://github.com/facebookresearch/faiss/issues/1385
    if version() > v"11"
        flags = CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION
    end

    flags |= if mode == CUDA.PEDANTIC_MATH
        # prevent use of tensor cores
        if version() < v"11"
            CUBLAS_DEFAULT_MATH
        else
            CUBLAS_PEDANTIC_MATH
        end
    elseif mode == CUDA.DEFAULT_MATH
        # use tensor cores, but don't reduce precision
        if version() < v"11"
            CUBLAS_TENSOR_OP_MATH
        else
            CUBLAS_DEFAULT_MATH
        end
    elseif mode == CUDA.FAST_MATH
        # we'll additionally select a compute-mode with reduced precision whenever possible
        if version() < v"11"
            CUBLAS_TENSOR_OP_MATH
        else
            CUBLAS_TF32_TENSOR_OP_MATH
        end
    end

    cublasSetMathMode(handle, flags)
    return
end
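
Usage would be along these lines (assuming CUBLAS.handle() returns the task-local cuBLAS handle, as used elsewhere in the wrappers), pushing the package-level math mode onto the handle before making old-style calls:

using CUDA
using CUDA.CUBLAS

CUBLAS.math_mode!(CUBLAS.handle(), CUDA.math_mode())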
