Is your feature request related to a problem? Please describe.
TE fused attention currently treats bias_type != NO_BIAS as implying that backward should request dBias from cuDNN FE. However, cuDNN FE models bias input and dBias output independently (set_bias vs set_dbias), and frameworks can have bias tensors that affect attention but do not require gradients. This matters for training cases where an additive attention bias is used as a fixed mask or frozen score modifier. We still need backward for Q/K/V, but do not need dBias. Requesting dBias unnecessarily can disable otherwise-supported cuDNN kernels or trigger plan-build failures for kernels that support bias input but not dBias.
Describe the solution you'd like
TE fused attention should distinguish “bias input is present” from “bias gradient is requested”.
Concretely, plumb a dbias_requested / bias_requires_grad flag through backend selection and backward execution:
• Backend selection should allow POST_SCALE_BIAS when dbias_requested == false for cuDNN kernels that support bias input
but not dBias.
• Backward graph construction should call sdpa_backward_options.set_bias(bias) whenever bias input is present.
• Backward graph construction should call sdpa_backward_options.set_dbias(dBias) only when the framework actually requests
a bias gradient.
• PyTorch can derive this from core_attention_bias.requires_grad.
• JAX can derive this from whether bias is included in value_and_grad(..., argnums=...).
So, in the cuDNN FE graph construction:
if (bias_type != NVTE_NO_BIAS) {
    sdpa_backward_options.set_bias(bias);
    if (dbias_requested) {
        sdpa_backward_options.set_dbias(dBias);
    }
}
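The backend-selection side of the same idea can be sketched in Python. This is a minimal illustration, not TE's actual selection code: `KernelCaps`, `supports_bias_input`, `supports_dbias`, and `fused_attn_allowed` are hypothetical names introduced here, and the bias type is passed as a plain string for simplicity. Only the `dbias_requested` flag and the rule it encodes come from the proposal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelCaps:
    """Hypothetical capability record for one cuDNN kernel family."""
    supports_bias_input: bool
    supports_dbias: bool


def fused_attn_allowed(bias_type: str, dbias_requested: bool, caps: KernelCaps) -> bool:
    """Return True if the fused backend can serve this backward pass."""
    if bias_type == "no_bias":
        return True  # nothing bias-related to check
    if not caps.supports_bias_input:
        return False
    # dBias support only matters when a bias gradient is actually requested.
    return caps.supports_dbias or not dbias_requested


# A kernel that accepts bias input but cannot produce dBias (assumed example).
bias_only = KernelCaps(supports_bias_input=True, supports_dbias=False)
frozen_ok = fused_attn_allowed("post_scale_bias", False, bias_only)
trainable_ok = fused_attn_allowed("post_scale_bias", True, bias_only)
```

With the current behavior (bias present implies dBias requested), both cases would be rejected. With the flag, only the trainable-bias case falls back to another backend.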
For PyTorch usage, a frozen bias should still use fused attention without requesting dBias:
bias = make_additive_attention_bias(...)
bias.requires_grad_(False)
attn = transformer_engine.pytorch.DotProductAttention(...)
out = attn(q, k, v, core_attention_bias_type="post_scale_bias", core_attention_bias=bias)
Backward should compute dQ/dK/dV, but not request dBias from cuDNN FE.
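As a rough sketch of how each framework layer might derive the flag, assuming hypothetical helper names (`dbias_requested_from_tensor` and `dbias_requested_from_argnums` are not existing TE functions): the first helper reads only a `requires_grad` attribute, so a stand-in object is used here instead of a real tensor. The second mirrors the fact that `argnums` in `jax.value_and_grad` may be a single integer or a sequence of integers.

```python
from types import SimpleNamespace
from typing import Any, Sequence, Union


def dbias_requested_from_tensor(bias: Any) -> bool:
    """PyTorch-style: a bias gradient is needed only if the tensor requires grad."""
    if bias is None:
        return False
    return bool(getattr(bias, "requires_grad", False))


def dbias_requested_from_argnums(argnums: Union[int, Sequence[int]], bias_argnum: int) -> bool:
    """JAX-style: a bias gradient is needed only if bias is a differentiated argument."""
    if isinstance(argnums, int):
        return argnums == bias_argnum
    return bias_argnum in argnums


# Stand-ins for tensors; only the requires_grad attribute is inspected.
frozen = SimpleNamespace(requires_grad=False)
trainable = SimpleNamespace(requires_grad=True)
```

Either result would then be passed down as `dbias_requested` to backend selection and backward graph construction.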
Describe alternatives you've considered
One workaround is to disable fused attention whenever bias is present for kernels that do not support dBias. This is safe
but overly conservative, because cuDNN FE may support bias input even when it does not support dBias.
Another workaround is for users to encode fixed masks through attn_mask_type instead of additive bias. That only works for
built-in mask patterns and does not cover arbitrary dense score modifiers, frozen relative-position bias, or other
application-specific additive biases.
A third option is to keep using bias_type alone and infer dBias from bias shape. This is fragile because whether dBias is
needed is an autograd property, not a tensor-shape property.
Additional context
This came up while enabling D=256 backward fused attention on Blackwell/SM10x. The cuDNN FE path can distinguish
set_bias(...) from set_dbias(...), but TE’s common fused attention path currently does not expose that distinction in
backend selection. As a result, TE may reject or fail to use kernels that would be valid for bias-input-only training.