# Attention Is All You Need!

The core idea behind Transformer models is the attention mechanism [[1]](https://arxiv.org/abs/1706.03762). It identifies the correlation between words, selects the most important parts of the sentence to focus on, and captures meaningful patterns and dependencies in the data. A typical attention mechanism looks like this, where the pre-softmax operations can be scaling, bias and/or masking, and the post-softmax operation is usually dropout.

<figure align="center">
<img src="attn.png" width="70%">
<figcaption> Figure 1: Dot product attention. </figcaption>
</figure>
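
As a minimal sketch of the computation in Figure 1, the following pure-Python reference handles a single attention head. The function name `dot_product_attention` and its arguments are illustrative assumptions, not Transformer Engine's API, and the post-softmax dropout is omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(q, k, v, scale=None, bias=None, mask=None):
    """Reference attention for one head (hypothetical helper).

    q: [sq][d], k: [skv][d], v: [skv][dv] as nested lists.
    bias: optional [sq][skv] additive term, applied after scaling.
    mask: optional [sq][skv] booleans, True = mask out the element.
    Assumes every query row keeps at least one unmasked key.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d) if scale is None else scale
    out = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            s = scale * sum(a * b for a, b in zip(q_i, k_j))  # pre-softmax scaling
            if bias is not None:
                s += bias[i][j]                               # pre-softmax bias
            if mask is not None and mask[i][j]:
                s = float("-inf")                             # pre-softmax masking
            scores.append(s)
        probs = softmax(scores)
        out.append([sum(p * v_j[c] for p, v_j in zip(probs, v))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
masked_out = dot_product_attention(q, k, v, mask=[[False, True]])
```

With the second key masked out, all of the attention weight falls on the first key, so `masked_out` equals the first row of `v`.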

[Transformer Engine](https://github.com/NVIDIA/TransformerEngine.git) supports the calculation of dot product attention in three frameworks, [PyTorch](https://github.com/pytorch/pytorch), [JAX](https://github.com/google/jax) and [PaddlePaddle](https://github.com/PaddlePaddle/Paddle). The dot product attention API in each framework is:
- [transformer_engine.pytorch.DotProductAttention](../api/pytorch.rst#transformer_engine.pytorch.DotProductAttention)
- [transformer_engine.jax.flax.DotProductAttention](../api/jax.rst#transformer_engine.jax.flax.DotProductAttention)
- [transformer_engine.paddle.DotProductAttention](../api/paddle.rst#transformer_engine.paddle.DotProductAttention)

## 1. Attention Backends

Transformer Engine provides multiple attention backends for each supported framework. While the framework-native implementations provide a robust baseline, the more fused, GPU-optimized backends offer better performance, for example, the flash-attention and cuDNN attention backends.

The available attention backends are listed below. The framework-native implementations are mostly named with "unfused" in the corresponding modules, while the GPU-optimized backends are named with "fused" and "flash". We will discuss the differences and similarities between "fused" and "flash" in the next three sub-sections.

| Framework | Backend (Module Name) | Module Location |
| :-------- | :-------------------- | :-------------- |
| PyTorch   | cuDNN attention (`FusedAttention`)<br> flash-attention (`FlashAttention`)<br> PyTorch-native attention (`UnfusedDotProductAttention`) | [transformer_engine.pytorch.attention](../../transformer_engine/pytorch/attention.py)      |
| JAX       | cuDNN attention (`_FusedDotProductAttention`)<br> JAX-native attention (`_UnfusedDotProductAttention`)                                | [transformer_engine.jax.flax.transformer](../../transformer_engine/jax/flax/transformer.py)   |
| PaddlePaddle    | cuDNN attention (`_te_forward`)<br> PaddlePaddle-native attention (`_pd_forward`)                                                           | [transformer_engine.paddle.layer.attention](../../transformer_engine/paddle/layer/attention.py) |

### 1.1 Flash vs. Non-Flash

The attention calculation has quadratic time and memory complexity in the sequence length, which presents a significant challenge when scaling Transformer models to longer contexts. If the sequence length doubles, the time and memory requirements quadruple, i.e. they scale as `O(N^2)`. The flash algorithm was proposed [[2]](https://arxiv.org/abs/2205.14135) to reduce the memory scaling from `O(N^2)` to `O(N)` and to cut the memory traffic that limits the runtime.

Compared to the standard, non-flash algorithm (with `O(N^2)` memory), the flash algorithm (with `O(N)` memory) employs two techniques to improve the runtime and memory efficiency.

- **Tiling:** Instead of processing the query, key, value tensors in one single step, the flash algorithm makes several passes at the data. It computes the softmax one tile at a time, and then combines the results together in a separate step. This technique allows the kernels to work on smaller chunks of the data, reduces the memory footprint, and brings down the number of reads and writes between global memory and shared memory as well. The bandwidth between global memory and shared memory is a performance bottleneck, and alleviating it allows the flash algorithm to speed up the whole calculation significantly.

- **Recomputation:** Instead of storing the full softmax matrix (`O(N^2)` in size) to global memory and loading it back in during the backward pass, the flash algorithm stores only the softmax normalization factors (`O(N)` in size). Despite the extra calculation due to the recomputation of attention scores using the softmax normalization factors, the memory reductions and bandwidth savings still allow the flash algorithm to improve the overall efficiency for attention.

<div class="alert alert-info">
<b>Note:</b> Both our flash-attention and cuDNN attention (sub-backends 1 and 2) backends are based on the flash algorithm.
</div>
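
The two techniques can be illustrated with a small pure-Python sketch for a single query row. It is a simplification under stated assumptions: values are scalars, the tile results are combined with running statistics (an online-softmax formulation) rather than in a separate step, and the function names are hypothetical. It is not the kernel used by either backend.

```python
import math

def attention_row_full(scores, values):
    """Standard path: materialize the whole softmax row, then weight the values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def attention_row_tiled(scores, values, tile=2):
    """Flash-style path: visit one tile at a time.

    Only a running max `m`, a running sum of exponentials `l`, and an
    unnormalized accumulator `acc` are kept; the full softmax row is never stored.
    Returns the output and the log-sum-exp normalization factor.
    """
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        correction = math.exp(m - m_new)  # rescale results computed under the old max
        l = l * correction + sum(math.exp(s - m_new) for s in s_tile)
        acc = acc * correction + sum(math.exp(s - m_new) * v
                                     for s, v in zip(s_tile, v_tile))
        m = m_new
    return acc / l, m + math.log(l)

def recompute_probs(scores, lse):
    """Recomputation: rebuild the softmax row from the saved O(1)-per-row factor."""
    return [math.exp(s - lse) for s in scores]

scores = [0.5, -1.0, 2.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
out_full = attention_row_full(scores, values)
out_tiled, lse = attention_row_tiled(scores, values)
probs = recompute_probs(scores, lse)
```

Both paths produce the same output up to floating-point rounding, and the probabilities rebuilt from `lse` sum to one, which is what allows the backward pass to avoid storing the full softmax matrix.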

### 1.2 flash-attention

Our flash-attention backend is a wrapper around the public `flash-attn` package [[3]](https://github.com/Dao-AILab/flash-attention), implemented by the same authors that proposed the flash algorithm [[2]](https://arxiv.org/abs/2205.14135).

`flash-attn` offers PyTorch interfaces and is integrated into Transformer Engine as the `transformer_engine.pytorch.attention.FlashAttention` module. `FlashAttention` calls `flash-attn` and also provides a few miscellaneous functionalities, such as converting the `attention_mask` to cumulative sequence lengths `cu_seqlens` in the case of `padding` mask.
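
As a rough illustration of the mask-to-`cu_seqlens` conversion, the sketch below works on plain Python lists. The helper name `mask_to_cu_seqlens` is hypothetical, the mask is flattened to `[batch][seqlen]` for readability, and it assumes the convention that `True` marks a padded (masked-out) position.

```python
from itertools import accumulate

def mask_to_cu_seqlens(padding_mask):
    """padding_mask: [batch][seqlen] booleans, True = padded (masked out)."""
    # Valid length of each sequence = number of positions that are not masked.
    seqlens = [sum(1 for masked in row if not masked) for row in padding_mask]
    # Cumulative lengths with a leading zero: entry i is where sequence i starts.
    return [0] + list(accumulate(seqlens))

mask = [
    [False, False, False, True],   # 3 valid tokens
    [False, True,  True,  True],   # 1 valid token
    [False, False, False, False],  # 4 valid tokens
]
cu_seqlens = mask_to_cu_seqlens(mask)  # [0, 3, 4, 8]
```

The last entry is the total number of valid tokens in the batch, and consecutive differences recover the individual sequence lengths.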

Transformer Engine updates its `flash-attn` requirement regularly (see `flash-attn` in [setup.py](../../setup.py)). For example, Transformer Engine 1.7 supports `flash-attn` 2.0.6+.

`flash-attn` offers significant performance improvements over the framework-native implementations. For more details, please see their own [benchmarks](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#performance).

### 1.3 cuDNN Attention

Our cuDNN attention backend is another high-performance attention implementation. It requires [cuDNN](https://developer.nvidia.com/cudnn) and [cudnn-frontend](../../3rdparty/cudnn-frontend) to run, and it offers multiple sub-backends, of which sub-backends 1 and 2 are based on the flash algorithm [[2]](https://arxiv.org/abs/2205.14135).

| Sub-Backend |  Algorithm | Precision | Sequence Length | Architecture | Docs |
| :---------- | :--------- | :-------- | :-------------- | :----------- | :--- |
| 0 | Non-Flash | BF16/FP16       | <=512       | sm80, 90 | [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#fused-attention-fprop) |
| 1 | Flash     | BF16/FP16       | Any         | sm80+    | [cuDNN](https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#fused-flash-attention-fprop),<br>[cudnn-frontend](https://github.com/NVIDIA/cudnn-frontend/blob/main/docs/operations/Attention.md#scaled-dot-product-attention) |
| 2 | Flash     | FP8             | cuDNN pre-9.0: <=512<br>cuDNN 9.0+: Any | cuDNN pre-9.0: sm90<br>cuDNN 9.0+:  sm90+ | cuDNN 9.0+: [cudnn-frontend](https://github.com/NVIDIA/cudnn-frontend/blob/main/docs/operations/Attention.md#scaled-dot-product-attention-fp8) |

The cuDNN attention backend and the flash-attention backend have several notable differences. For example, as of Transformer Engine 1.7, cuDNN 9.0 and `flash-attn` 2.4.2,

- flash-attention supports PyTorch only, and cuDNN attention supports all three frameworks (PyTorch, JAX and PaddlePaddle), as listed in the Attention Backends table.
- flash-attention does not have FP8 support, and cuDNN attention does, through its sub-backend 2.
- flash-attention supports `bshd` and `thd` formats without any transposes, but requires transposes for `sbhd` format (see section 3.1 for details on QKV layouts and formats). cuDNN attention supports `bshd`, `sbhd`, `thd` formats without transposes.
- flash-attention does not have support for `post_scale_bias`. cuDNN attention does.
- flash-attention has features such as sliding window attention and paged attention. cuDNN attention does not.
- flash-attention applies [bottom right diagonal](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#21-change-behavior-of-causal-flag) for `causal` mask in cross attention. cuDNN attention applies top left diagonal.
- flash-attention is more performant on Ampere architectures, and cuDNN attention has advantages on Hopper architectures from our benchmarking.

To compare flash-attention and cuDNN attention in performance, users can modify this script, [benchmark_attention.py](../../benchmarks/attention/benchmark_attention.py), to benchmark the model configuration of their interest. For example,

```python
model_configs = {
    #   test:             b,  h, hg,   d,   sq,  skv,   p,     mask,              bias
    "test_0": ModelConfig(2, 16, 16,  64,  512,  512, 0.0, "no_mask",         "no_bias"), # short seq
    "test_1": ModelConfig(2, 16, 16, 128, 2048, 2048, 0.0,  "causal",         "no_bias"), # longer seq, mask
    "test_2": ModelConfig(2, 16, 16, 128, 2048, 2048, 0.0,  "causal", "post_scale_bias"), # bias
    "test_3": ModelConfig(2, 32,  4, 128, 8192, 8192, 0.0,  "causal",         "no_bias"), # GQA
}
```

The script runs each config with two backends, `FlashAttention` and `FusedAttention` in PyTorch. The average time over `num_iters` iterations is reported at the end. Each iteration includes one forward pass and one backward pass. If a specific config is not supported by a particular backend, the runtimes and speedups for that backend will be 0.

```
!cd ../../benchmarks/attention/ && python benchmark_attention.py
```

```
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-38cf.qdstrm'
Generated:
    /code/fmha/github3/pr-attn-doc/TransformerEngine/benchmarks/attention/prof_test_0.nsys-rep
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-f01b.qdstrm'
Generated:
    /code/fmha/github3/pr-attn-doc/TransformerEngine/benchmarks/attention/prof_test_1.nsys-rep
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-c5af.qdstrm'
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generated:
    /code/fmha/github3/pr-attn-doc/TransformerEngine/benchmarks/attention/prof_test_2.nsys-rep
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e36c.qdstrm'
Generate
```

## 2. Backend Selection

Transformer Engine selects the appropriate backend (and sub-backend) based on the backend availability and performance.

Backend availability is determined by a number of factors such as the user input, software version, and the GPU architecture being run on. Examples of these factors include the sequence length, number of attention heads, head size, attention mask type, attention bias type, whether the model is in training mode, self or cross attention, MHA or MQA/GQA, `flash-attn`/cuDNN library versions, and the compute capability of the GPU.

If there are multiple, eligible backends, Transformer Engine selects the backend with the highest performance based on our benchmarked heuristics. Generally speaking, the following selection order is implemented. This order may change as we monitor the performance of different backends.

| Framework | Selection Order                                                                                                                              |
| :-------- | :--------------------- |
| PyTorch   | sm90: cuDNN attention > flash-attention > PyTorch-native attention<br>sm80: flash-attention > cuDNN attention > PyTorch-native attention<br>cuDNN attention: sub-backend 1 > sub-backend 0 |
| JAX       | cuDNN attention > JAX-native attention |
| PaddlePaddle    | cuDNN attention > PaddlePaddle-native attention |
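
The ordering in the table can be read as a simple preference list. The sketch below is a hypothetical illustration, not Transformer Engine's selection code: the function `pick_backend`, the short backend names, and the choice to treat every architecture below sm90 like sm80 are all assumptions.

```python
def pick_backend(framework, sm, eligible):
    """Return the first eligible backend in the documented preference order.

    framework: "pytorch", "jax" or "paddle".
    sm: compute capability as an integer, e.g. 80 or 90.
    eligible: set of backend names that passed the availability checks.
    """
    if framework == "pytorch":
        if sm >= 90:
            order = ["cudnn", "flash", "native"]
        else:
            # Assumption: architectures below sm90 follow the sm80 ordering.
            order = ["flash", "cudnn", "native"]
    else:
        order = ["cudnn", "native"]  # flash-attention is PyTorch-only
    for name in order:
        if name in eligible:
            return name
    return None  # nothing eligible

everything = {"cudnn", "flash", "native"}
on_hopper = pick_backend("pytorch", 90, everything)   # "cudnn"
on_ampere = pick_backend("pytorch", 80, everything)   # "flash"
```

Availability filtering happens first: a backend that does not support the given inputs is simply absent from `eligible`, so the next one in the order is used.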

### 2.1 Debug Information

To find out which backend is being used, users can run with this debug flag in PyTorch, `NVTE_DEBUG=1`. For example, it may print out
```
        [DotProductAttention]: using flash-attn 2.4.2
        [DotProductAttention]: using cuDNN attention (backend 1)
```

To gather more details during runtime, users can run with `NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2` in PyTorch. Please include these details when filing an issue so the Transformer Engine team understands your model and runtime configuration.
```
        [DotProductAttention]: using cuDNN attention (backend 1)
        [DotProductAttention]: dtype=torch.float16, qkv_format=sbhd, q_shape=[128, 4, 16, 64], kv_shape=[256, 4, 16, 64], qkv_layout=sbhd_sb2hd, mask_type=no_mask, bias_type=no_bias, bias_shape=None, dropout=0.0, is_training=True, context_parallel=False, compute_capability=sm90, cudnn_version=90200
```

### 2.2 User Control

A few other environment variables are also provided if users encounter a performance or convergence issue and would like to experiment with different backends.

For example, the following two variables allow users to switch on or off the flash-attention backend or the cuDNN attention backend in PyTorch.
```
        NVTE_FLASH_ATTN = 0 # disables flash-attention; default = 1
        NVTE_FUSED_ATTN = 0 # disables cuDNN attention; default = 1
```

The following variable lets users express a preference among the cuDNN attention sub-backends. However, the preferred sub-backend will only be used *if* it's eligible, i.e. if it has support for the specified input.
```
        NVTE_FUSED_ATTN_BACKEND = 0/1/2 # user preference of cuDNN attention sub-backend
```
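
From Python, the variables above can be set through `os.environ`. The helper `attention_env` below is a hypothetical convenience wrapper; the sketch assumes the variables are set before Transformer Engine reads them, so it is safest to set them before importing or running the attention module.

```python
import os

def attention_env(flash=True, fused=True, fused_sub_backend=None):
    """Build the environment-variable settings described above (hypothetical helper)."""
    env = {
        "NVTE_FLASH_ATTN": "1" if flash else "0",
        "NVTE_FUSED_ATTN": "1" if fused else "0",
    }
    if fused_sub_backend is not None:
        if fused_sub_backend not in (0, 1, 2):
            raise ValueError("sub-backend must be 0, 1 or 2")
        env["NVTE_FUSED_ATTN_BACKEND"] = str(fused_sub_backend)
    return env

# Example: disable flash-attention and state a preference for cuDNN sub-backend 1.
os.environ.update(attention_env(flash=False, fused=True, fused_sub_backend=1))
```

The sub-backend setting remains a preference only; an ineligible sub-backend is not forced.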

cuDNN attention sub-backend 1 offers two execution paths: the workspace optimization path and the non-workspace optimization path. The workspace optimization path trades memory for performance: it requires a larger amount of global memory and provides 20-30% more performance in many cases. It is available on Hopper architectures and turned on when the estimated workspace size (`batch_size x seqlen_q x seqlen_kv`) is <= 256MB.

The following environment variable allows users to control the maximum workspace size (in bytes). Please be aware of the out-of-memory risk when increasing the limit.
```
# CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT
# - unset: enables workspace optimization when required workspace is <= 256MB
#          or when bias gradient needs to be computed
# -     n: enables workspace optimization when required workspace is <= n bytes
# -    -1: enables workspace optimization always
# -     0: disables workspace optimization always
```
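
The rules in the comment block can be restated as a small decision function. This is only a sketch of the documented behavior, not the library's code: `use_workspace_opt` is a hypothetical name, "256MB" is assumed to mean 256 MiB, and the Hopper architecture requirement is not modeled.

```python
DEFAULT_LIMIT_BYTES = 256 * 1024 * 1024  # assumption: "256MB" means 256 MiB

def use_workspace_opt(required_bytes, limit=None, needs_bias_grad=False):
    """Decide whether the workspace optimization path would be enabled.

    limit: parsed value of CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT,
           or None when the variable is unset.
    """
    if limit is None:
        # Unset: default threshold, or forced on when a bias gradient is needed.
        return required_bytes <= DEFAULT_LIMIT_BYTES or needs_bias_grad
    if limit == -1:
        return True    # always enabled
    if limit == 0:
        return False   # always disabled
    return required_bytes <= limit

small = use_workspace_opt(64 * 1024 * 1024)                        # True
large = use_workspace_opt(512 * 1024 * 1024)                       # False
large_raised = use_workspace_opt(512 * 1024 * 1024, limit=2**30)   # True
```

Raising the limit, as in the last call, enables the faster path for larger problems at the cost of more global memory.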

### 2.3 Example Tests

Our [unit tests](../../tests/) offer a variety of use cases of Transformer Engine dot product attention. For example, for PyTorch, [test_dot_product_attention](../../tests/pytorch/fused_attention/test_fused_attn.py).

## 3. Backend Support

Transformer Engine supports the commonly-used features, such as self and cross attention, FP16/BF16 precisions, and dropout. However, the different backends have varying support for some other features. As of v1.7, Transformer Engine has the following support matrix.

| Attention Backend | Precision | Architecture | Sliding Window Attention | MQA/GQA | Context Parallelism | Deterministic |
| :---------------- | :-------- | :----------- | :----------------------- | :------ | :------------------ | :------------ |
| cuDNN attention<br>(PyTorch, JAX, PaddlePaddle) | BF16, FP16, FP8 |  sm80+ | No  | Yes | `bshd`,`sbhd`: Yes<br>`thd`: No | Sub-backend 0, 2: Yes<br>Sub-backend 1: Yes, if workspace optimization path |
| flash-attention<br>(PyTorch)           | BF16, FP16      |  sm80+ | Yes | Yes | `bshd`,`thd`: Yes<br>`sbhd`: No  | Yes, if `deterministic=True`                                                                                    |
| Framework-native attention<br>(PyTorch, JAX, PaddlePaddle) | BF16, FP16, FP32 |  Any   | No, unless used as a mask  | Yes | No                                  | Yes |

For example usage of some of these features, these unit tests are a good starting place: [test_dpa_swa](../../tests/pytorch/fused_attention/test_fused_attn.py), [test_te_layer_mqa_gqa](../../tests/pytorch/fused_attention/test_fused_attn.py), [test_cp_with_fused_attention](../../tests/pytorch/fused_attention/test_fused_attn_with_cp.py), and [test_cp_with_flash_attention](../../tests/pytorch/fused_attention/test_fused_attn_with_cp.py).

### 3.1 QKV Layout

Transformer Engine supports 15 memory layouts for the query `q`, key `k`, value `v` tensors. These layouts, `qkv_layout`, can be grouped into 3 QKV formats, `qkv_format`, and 5 QKV layout groups, `qkv_layout_group`, to facilitate certain memory and computational operations.

| `qkv_layout` &nbsp; &nbsp; &nbsp; &nbsp; | `qkv_layout_group`=`3hd` | `h3d` | `hd_2hd` | `hd_h2d` | `hd_hd_hd` |
| ----------: | -----------: | -----: | ----------: | ----------: | -------------: |
| `qkv_format`=`sbhd` | `sb3hd`                | `sbh3d` | `sbhd_sb2hd` | `sbhd_sbh2d` | `sbhd_sbhd_sbhd` |
| `bshd` | `bs3hd`                | `bsh3d` | `bshd_bs2hd` | `bshd_bsh2d` | `bshd_bshd_bshd` |
| `thd`  | `t3hd`                 | `th3d`  | `thd_t2hd`   | `thd_th2d`   | `thd_thd_thd`    |

Here, `b` stands for the batch size, `s` sequence length, `h` number of attention heads, `d` head dimension, and `t` total number of tokens in a batch, `t = sum(s_i) for i in 0,...,b-1`. A few examples may help understand the meaning of the QKV layouts:

- `qkv_layout`=`sb3hd`: `q`, `k`, `v` are sequence first, i.e. `s` being the leading dimension. `q`, `k`, `v` are located in the same memory chunk and are interleaved at the `h * d` dimension. `q, k, v = [qkv[:,:,i,:,:] for i in range(3)]`.

- `qkv_layout`=`bshd_bsh2d`: `q`, `k`, `v` are batch first, i.e. `b` being the leading dimension. `q`, `k`, `v` are located in two memory chunks, `q` and `kv`. `q` is contiguous, and `k`, `v` are interleaved at the `d` dimension. `k, v = [kv[:,:,:,i,:] for i in range(2)]`. The `s` in `bsh2d` is the max sequence length for `k` and `v`, and it can be different from the `s` in `bshd` for `q`. Similarly, `h` may be different for `q`, and `k`, `v`, in the case of MQA/GQA.

- `qkv_layout`=`thd_thd_thd`: `q`, `k`, `v` have variable sequence lengths in a batch. They come from three different memory allocations and are all contiguous in their allocation.

Transformer Engine supports all 15 layouts in PyTorch, and 3 layouts, `bs3hd`, `bshd_bs2hd` and `bshd_bshd_bshd`, in JAX and Paddle. In PyTorch, this utility function [transformer_engine.pytorch.attention._get_qkv_layout](../../transformer_engine/pytorch/attention.py) can be used to figure out the exact `qkv_layout` given a set of `q`, `k`, `v` tensors.
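
The naming scheme in the table is regular enough that the format and layout group can be derived from the layout string itself. The helpers below are a hypothetical sketch based only on the names in the table; they do not inspect tensors the way `_get_qkv_layout` does.

```python
def split_layout(qkv_layout):
    """Derive (qkv_format, qkv_layout_group) from a layout name by string rules."""
    first = qkv_layout.split("_")[0]
    # Format: the first tensor's dimensions without the interleave count (2 or 3).
    qkv_format = "".join(c for c in first if c.isalpha())
    # Layout group: everything except the leading s/b/t dimensions.
    qkv_layout_group = "".join(c for c in qkv_layout if c not in "sbt")
    return qkv_format, qkv_layout_group

def all_layouts():
    """Enumerate the 15 layouts as 3 formats x 5 layout groups."""
    prefixes = ["sb", "bs", "t"]  # leading dims of sbhd, bshd, thd
    groups = ["3hd", "h3d", "hd_2hd", "hd_h2d", "hd_hd_hd"]
    return [
        "_".join(prefix + part for part in group.split("_"))
        for prefix in prefixes
        for group in groups
    ]

layouts = all_layouts()
example = split_layout("bshd_bsh2d")  # ("bshd", "hd_h2d")
```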

As of v1.7, Transformer Engine has the following support matrix from different backends.

| Backend | Supported QKV Formats |
| :--------------- | :-------------------- |
| flash-attention | `bshd`, `sbhd`, `thd` (`sbhd` requires transpose operations) |
| cuDNN attention  | `bshd`, `sbhd`, `thd`  |
| Framework-native attention | `bshd`, `sbhd` (`sbhd` requires transpose operations) |

<div class="alert alert-info">
<b>Note:</b> In PyTorch, when RoPE is employed, the <code>qkv_layout</code> is converted to the corresponding <code>hd_hd_hd</code> layout. For example, the initial <code>sbh3d</code> layout in <code>pytorch.MultiHeadAttention</code> will become <code>sbhd_sbhd_sbhd</code> in <code>pytorch.DotProductAttention</code> after RoPE.
</div>


For example usage of the QKV layouts, please see [test_dpa_qkv_layout](../../tests/pytorch/fused_attention/test_fused_attn.py) and [test_dpa_qkv_layout_thd](../../tests/pytorch/fused_attention/test_fused_attn.py).

### 3.2 Attention Mask

Transformer Engine supports 5 mask types:
- `no_mask`, `padding`, `causal`, `padding_causal` (equivalent to `causal_padding`), `arbitrary`

All masks are defined as:
- `True`: masking out the corresponding element
- `False`: including the corresponding element in attention calculation

As of v1.7, Transformer Engine has the following support matrix for different mask types.

| Backend          | Supported Mask Types  | Requires `attention_mask` |
| :--------------- | :-------------------- | :------------------ |
| flash-attention | `no_mask`, `causal`, `padding`, `padding_causal` | `no_mask`, `causal`: No<br>`padding`, `padding_causal`: Yes if `cu_seqlens` not provided|
| cuDNN attention  | `no_mask`, `causal`, `padding`, `padding_causal` | No |
| Framework-native attention | `no_mask`, `causal`, `arbitrary` | `no_mask`, `causal`: No<br>`arbitrary`: Yes |

For `padding` and `padding_causal` masks, an `attention_mask` input is required for flash-attention. For self-attention, `attention_mask` should be a single tensor in `[batch_size, 1, 1, seqlen_q]` shape. For cross-attention, it should be a list of two tensors respectively in `[batch_size, 1, 1, seqlen_q]` and `[batch_size, 1, 1, seqlen_kv]` shapes. An alternative to `attention_mask` is the `cu_seqlens_q` and `cu_seqlens_kv` tensors. If both `attention_mask` and `cu_seqlens_q/kv` are passed in, Transformer Engine uses `cu_seqlens_q/kv` for attention calculations.

For `qkv_format`=`thd`, when `max_seqlen_q` and `max_seqlen_kv` are not present, Transformer Engine extracts the information from `q`, `k`, `v` tensors. This costs GPU-CPU copy and synchronization, and for performance reasons, it is recommended that users set `max_seqlen_q` and `max_seqlen_kv` when using `thd` layouts.

As of Transformer Engine 1.7, cuDNN attention does not directly support the `arbitrary` mask type. However, the same functionality can be achieved by converting the mask to a bias and running Transformer Engine with `core_attention_bias_type=post_scale_bias`. An example script for the conversion is [here](./arbitrary_mask_to_bias.py).
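
The idea behind the conversion can be sketched in a few lines of pure Python. This is an illustration of the principle, not the linked script: `mask_to_bias` is a hypothetical name, and using negative infinity for masked positions is an assumption (a large finite negative value is a common alternative when a row may be fully masked, since an all-`-inf` row makes the softmax undefined).

```python
import math

def mask_to_bias(mask, masked_value=float("-inf")):
    """mask[i][j] is True where the element must be excluded from attention."""
    return [[masked_value if m else 0.0 for m in row] for row in mask]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

mask = [[False, True, False],
        [False, False, True]]
scores = [[0.2, 1.5, -0.3],   # already scaled attention scores
          [1.0, 0.0, 2.0]]

bias = mask_to_bias(mask)
# Adding the bias after scaling reproduces the effect of the mask.
probs = [softmax([s + b for s, b in zip(s_row, b_row)])
         for s_row, b_row in zip(scores, bias)]
```

Masked positions receive exactly zero probability, while the remaining entries in each row still sum to one.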

As mentioned in Section 1.3, flash-attention uses bottom right diagonal for `causal` mask in cross attention (see `flash-attn`'s [change log](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#21-change-behavior-of-causal-flag)), while cuDNN attention uses the top left diagonal. This will be made more consistent in upcoming Transformer Engine versions.
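
The difference between the two alignments only shows up when `seqlen_q != seqlen_kv`. The sketch below builds both masks for a small cross-attention case, using this document's convention that `True` means masked out; the function names are hypothetical.

```python
def causal_mask(seqlen_q, seqlen_kv, align="top_left"):
    """Return [seqlen_q][seqlen_kv] booleans, True = masked out."""
    # Bottom-right alignment shifts the diagonal so that the last query
    # row can see every key.
    offset = 0 if align == "top_left" else seqlen_kv - seqlen_q
    return [[j > i + offset for j in range(seqlen_kv)] for i in range(seqlen_q)]

def show(mask):
    """Render a mask as strings: 1 = attended, 0 = masked out."""
    return ["".join("0" if m else "1" for m in row) for row in mask]

top_left = show(causal_mask(2, 5, "top_left"))          # ['10000', '11000']
bottom_right = show(causal_mask(2, 5, "bottom_right"))  # ['11110', '11111']
```

For self-attention, where `seqlen_q == seqlen_kv`, the offset is zero and the two alignments coincide.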

For example usage of the attention mask types, please see [test_dpa_mask](../../tests/pytorch/fused_attention/test_fused_attn.py).

### 3.3 Attention Bias

Transformer Engine supports 4 attention bias types:
- `no_bias`, `pre_scale_bias`, `post_scale_bias`, `ALiBi` (with/without custom slopes)

As of v1.7, the support matrix from different backends is,

| Backend | Bias Type | Bias Shape | Bias Data Type | Architecture |
| :------ | :-------- | :--------- | :--------- | :----------- |
| flash-attention           | `no_bias`, `ALiBi` (with slopes) | N/A | ALiBi slopes: FP32 | sm80+ |
| cuDNN attention            | `no_bias`, `post_scale_bias`, `ALiBi` (without slopes) | `post_scale_bias`: BHSS, 1HSS, B1SS, 11SS for forward, 1HSS for backward | `post_scale_bias`: same as QKV type<br>ALiBi slopes: FP32 | cuDNN 8.9.6+: sm90<br>cuDNN 9.0+: sm80, 90 |
| Framework-native attention | `no_bias`, `pre_scale_bias`, `post_scale_bias` | `post_scale_bias`: BHSS, 1HSS, B1SS, 11SS | `post_scale_bias`: same as QKV type | sm80+ |

The flash-attention backend enables `ALiBi` by asking the user to pass in an `alibi_slopes` tensor. This can be the default slopes that come with vanilla ALiBi, or custom slopes from the user.

cuDNN attention supports `ALiBi` by taking in a `Boolean` flag, instead of a tensor. cuDNN only supports the vanilla ALiBi as of v8.9.6.

The framework-native backends do not support `ALiBi` explicitly. But in PyTorch, the ALiBi bias can be converted to a regular `post_scale_bias` bias, applied to attention. The utility function `_get_alibi()` is located in `transformer_engine.pytorch.attention`.
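
To make the ALiBi-to-bias conversion concrete, here is a pure-Python sketch of vanilla ALiBi for a power-of-two number of heads. It is not the `_get_alibi()` implementation: the function names are hypothetical, the bias uses the symmetric form `-slope * |i - j|`, and how the diagonal is aligned for cross attention is not addressed.

```python
def alibi_slopes(num_heads):
    """Vanilla ALiBi slopes for a power-of-two head count: 2^(-8*i/n), i = 1..n."""
    if num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]

def alibi_bias(slope, seqlen_q, seqlen_kv):
    """Additive bias for one head: a linear penalty on the query-key distance."""
    return [[-slope * abs(i - j) for j in range(seqlen_kv)]
            for i in range(seqlen_q)]

slopes = alibi_slopes(8)                 # 1/2, 1/4, ..., 1/256
bias_head0 = alibi_bias(slopes[0], 3, 3)
```

Each head gets its own slope, so stacking `alibi_bias` over the heads gives a bias that can be passed wherever a regular `post_scale_bias` is accepted.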

For example usage of the attention bias types, please see [test_dpa_bias](../../tests/pytorch/fused_attention/test_fused_attn.py).

### 3.4 FP8 Attention

Transformer Engine supports FP8 attention, where the `MatMul` operations in attention are in FP8 for computational efficiency, and the `SoftMax` operation is in FP32 for numerical accuracy. Two options are offered to perform FP8 attention in Transformer Engine 1.7 in PyTorch. Both are through the FP8 recipe definition, `transformer_engine.common.recipe.DelayedScaling`.

- `DelayedScaling.fp8_dpa=True`: This allows the use of cuDNN attention sub-backend 2 when it has the support for the specific user inputs. With this option, `FusedAttention` takes in FP16 or BF16 tensors, calculates dot product attention in FP8, and outputs the attention logits in the same data type as the input (FP16 or BF16). Some casting operations are required at the beginning and end of the module.

- `DelayedScaling.fp8_mha=True` (experimental): This allows not only the use of cuDNN attention sub-backend 2 but also the removal of the casting operations. 

Some examples of using these two features are at [test_dpa_fp8_vs_f16](../../tests/pytorch/fused_attention/test_fused_attn.py) and [test_mha_fp8_vs_f16](../../tests/pytorch/fused_attention/test_fused_attn.py).

To disable FP8 attention for backward and only use it for forward, please set `NVTE_FP8_DPA_BWD=0`. To check if the correct configuration is being used, please run with `NVTE_DEBUG=1`. For example, with `NVTE_FP8_DPA_BWD=1`, FP8 is used in both the forward and backward passes,
```
[DotProductAttention]: using fp8_recipe.fp8_mha=False, fp8_recipe.fp8_dpa=True and NVTE_FP8_DPA_BWD=1
[DotProductAttention]: using FP8 forward
[DotProductAttention]: using FP8 backward
```