Fused attention should distinguish bias input from dBias request #3082

**Is your feature request related to a problem? Please describe.**

TE fused attention currently treats `bias_type != NO_BIAS` as implying that backward should request `dBias` from cuDNN FE. However, cuDNN FE models bias input and dBias output independently (`set_bias` vs `set_dbias`), and frameworks can have bias tensors that affect attention but do not require gradients. This matters for training cases where an additive attention bias is used as a fixed mask or frozen score modifier: backward is still needed for Q/K/V, but `dBias` is not. Requesting `dBias` unnecessarily can disable otherwise-supported cuDNN kernels or trigger plan-build failures for kernels that support bias input but not `dBias`.

**Describe the solution you'd like**
TE fused attention should distinguish “bias input is present” from “bias gradient is requested”.
Concretely, plumb a `dbias_requested` / `bias_requires_grad` flag through backend selection and backward execution:

- Backend selection should allow `POST_SCALE_BIAS` when `dbias_requested == false` for cuDNN kernels that support bias input but not `dBias`.
- Backward graph construction should call `sdpa_backward_options.set_bias(bias)` whenever bias input is present.
- Backward graph construction should call `sdpa_backward_options.set_dbias(dBias)` only when the framework actually requests a bias gradient.
- PyTorch can derive this from `core_attention_bias.requires_grad`.
- JAX can derive this from whether bias is included in `value_and_grad(..., argnums=...)`.

So, in the cuDNN FE graph construction:

```cpp
if (bias_type != NVTE_NO_BIAS) {
    sdpa_backward_options.set_bias(bias);
    if (dbias_requested) {
        sdpa_backward_options.set_dbias(dBias);
    }
}
```
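The backend-selection rule and the `set_bias` / `set_dbias` decision above can be sketched together in plain Python. This is only an illustration of the proposed logic, not TE or cuDNN FE code: `KernelCaps`, `fused_backward_eligible`, and `backward_options` are hypothetical names, and the two capability flags are assumed to be known per kernel.

```python
from dataclasses import dataclass

NO_BIAS = "no_bias"
POST_SCALE_BIAS = "post_scale_bias"


@dataclass(frozen=True)
class KernelCaps:
    """Hypothetical capability record for one cuDNN backward kernel."""

    supports_bias: bool   # kernel accepts an additive bias input
    supports_dbias: bool  # kernel can also emit a bias gradient


def fused_backward_eligible(caps: KernelCaps, bias_type: str, dbias_requested: bool) -> bool:
    """Return True if this kernel can serve the requested backward pass."""
    if bias_type == NO_BIAS:
        return True
    if not caps.supports_bias:
        return False
    # Require dBias support only when a bias gradient is actually wanted.
    return caps.supports_dbias or not dbias_requested


def backward_options(bias_type: str, dbias_requested: bool) -> dict:
    """Mirror the set_bias / set_dbias decision as plain boolean flags."""
    has_bias = bias_type != NO_BIAS
    return {"set_bias": has_bias, "set_dbias": has_bias and dbias_requested}
```

With a bias-input-only kernel, `fused_backward_eligible(KernelCaps(True, False), POST_SCALE_BIAS, dbias_requested=False)` accepts the kernel, while the same call with `dbias_requested=True` rejects it.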

For PyTorch usage, a frozen bias should still use fused attention without requesting dBias:

```python
bias = make_additive_attention_bias(...)
bias.requires_grad_(False)
out = transformer_engine.pytorch.DotProductAttention(...)(q, k, v, core_attention_bias=bias)
# Backward should compute dQ/dK/dV, but not request dBias from cuDNN FE.
```
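A rough sketch of how the framework layers might derive the flag before calling into the common fused attention path. Both helpers are hypothetical and deliberately framework-free: the first relies only on a `.requires_grad` attribute, and the second assumes non-negative `argnums` and that the caller knows the positional index of the bias argument.

```python
from typing import Optional, Sequence, Union


def dbias_requested_from_requires_grad(bias: Optional[object]) -> bool:
    """PyTorch-style: a bias gradient is wanted only if the tensor requires grad."""
    # getattr keeps the sketch usable with any object exposing .requires_grad.
    return bias is not None and bool(getattr(bias, "requires_grad", False))


def dbias_requested_from_argnums(argnums: Union[int, Sequence[int]], bias_argnum: int) -> bool:
    """JAX-style: a bias gradient is wanted only if bias is a differentiated argument."""
    selected = (argnums,) if isinstance(argnums, int) else tuple(argnums)
    return bias_argnum in selected
```

For a frozen bias, the first helper returns `False`, so backward would call `set_bias` but skip `set_dbias`. Likewise, differentiating only with respect to q, k, and v (`argnums=(0, 1, 2)` with bias at position 3) leaves the flag unset.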

**Describe alternatives you've considered**
One workaround is to disable fused attention whenever bias is present for kernels that do not support `dBias`. This is safe but overly conservative, because cuDNN FE may support bias input even when it does not support `dBias`.

Another workaround is for users to encode fixed masks through `attn_mask_type` instead of additive bias. That only works for built-in mask patterns and does not cover arbitrary dense score modifiers, frozen relative-position bias, or other application-specific additive biases.

A third option is to keep using `bias_type` alone and infer `dBias` from bias shape. This is fragile because whether `dBias` is needed is an autograd property, not a tensor-shape property.
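To illustrate why shape-based inference is fragile, here is a small sketch with two biases of identical shape that differ only in whether a gradient is wanted. `BiasSpec` is a hypothetical stand-in for a framework tensor, and the rule in `dbias_from_shape` is an invented example heuristic, not something TE does.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BiasSpec:
    """Hypothetical stand-in for a framework tensor: shape plus autograd flag."""

    shape: Tuple[int, int, int, int]
    requires_grad: bool


def dbias_from_shape(bias: BiasSpec) -> bool:
    """Invented heuristic: treat a batch-broadcast [1, h, s, s] bias as learnable."""
    return bias.shape[0] == 1


def dbias_from_autograd(bias: BiasSpec) -> bool:
    """Use the autograd property directly."""
    return bias.requires_grad


# Same shape, different training intent.
frozen_rel_pos = BiasSpec(shape=(1, 16, 128, 128), requires_grad=False)
learned_rel_pos = BiasSpec(shape=(1, 16, 128, 128), requires_grad=True)
```

The shape heuristic gives the same answer for `frozen_rel_pos` and `learned_rel_pos`, so it is necessarily wrong for one of them. Only the autograd flag separates the two cases.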

**Additional context**

This came up while enabling D=256 backward fused attention on Blackwell/SM10x. The cuDNN FE path can distinguish `set_bias(...)` from `set_dbias(...)`, but TE’s common fused attention path currently does not expose that distinction in backend selection. As a result, TE may reject or fail to use kernels that would be valid for bias-input-only training.

