Description
Intro
This PR adds FP8 WGMMA support based on the async pipeline design of FlashMLA. The TransV part (transposing V tiles in shared memory) draws on the SmemTranspose64x64 implementation in FA3.
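For illustration only, here is a much-simplified CUDA sketch of the idea behind TransV: bouncing a 64x64 e4m3 tile through padded shared memory to produce the transposed layout that the P·V WGMMA consumes. This is not the PR's code; the kernel name and launch shape are hypothetical, and the actual SmemTranspose64x64 is considerably more optimized and runs inside the async pipeline.

```cuda
#include <cuda_fp8.h>

// Hypothetical, simplified sketch of the TransV idea: transpose one
// 64x64 e4m3 tile through padded shared memory.
__global__ void transv_64x64_sketch(const __nv_fp8_e4m3* __restrict__ in,
                                    __nv_fp8_e4m3* __restrict__ out) {
    // +4 padding on the inner dimension mitigates shared-memory bank conflicts.
    __shared__ __nv_fp8_e4m3 tile[64][64 + 4];
    const int col = threadIdx.x;                 // launched with dim3(64, 8)
    for (int row = threadIdx.y; row < 64; row += 8)
        tile[row][col] = in[row * 64 + col];     // coalesced row-major load
    __syncthreads();
    for (int row = threadIdx.y; row < 64; row += 8)
        out[row * 64 + col] = tile[col][row];    // write the transposed tile
}
```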
Currently, Q/K/V support only symmetric per-tensor quantization. Since the maximum value of P never exceeds 1, P is quantized with a direct f32tofp8_cast.
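A hedged sketch of what the device side of this could look like; the kernel names and the host-side scale convention (inv_scale = 448 / amax, 448 being the largest finite e4m3 value) are assumptions for illustration, not the PR's actual code.

```cuda
#include <cuda_fp8.h>

// Sketch (assumed names, not the PR's code): symmetric per-tensor quantization.
// The host would compute inv_scale = 448.0f / amax(x) and keep
// scale = amax / 448 for dequantization.
__global__ void quantize_per_tensor_e4m3(const float* __restrict__ x,
                                         __nv_fp8_e4m3* __restrict__ y,
                                         float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __nv_fp8_e4m3(x[i] * inv_scale);
}

// P needs no scale: softmax outputs lie in [0, 1], well inside the e4m3
// range, so a direct f32 -> fp8 cast suffices.
__global__ void cast_p_e4m3(const float* __restrict__ p,
                            __nv_fp8_e4m3* __restrict__ p8, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p8[i] = __nv_fp8_e4m3(p[i]);
}
```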
Performance
CUDA driver version: 535.183.06
NVCC version: 12.8
PyTorch version: 2.6
On the H20, MLA kernels typically exhibit a high degree of arithmetic intensity and are compute-bound. Consequently, Model FLOPs Utilization (MFU) is employed as the performance metric.
On the H800, MLA kernels are typically memory-bound. Consequently, Memory Bandwidth Utilization (MBU) is adopted to evaluate kernel performance. There is still substantial room for optimization on the H800; we look forward to working on it together.
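For reference, a small host-side sketch of how the two metrics are computed. The workload and peak figures below are placeholders (assumptions), not measurements from this PR; substitute the official numbers for your part.

```cuda
#include <cstdio>

// Sketch of the metric definitions: achieved throughput over peak throughput.
static double utilization(double achieved_per_s, double peak_per_s) {
    return achieved_per_s / peak_per_s;
}

int main() {
    // Hypothetical kernel: 2.0e12 FLOPs and 1.0e10 bytes of HBM traffic in 5 ms.
    const double seconds    = 5e-3;
    const double peak_flops = 1.0e15;  // placeholder FP8 peak (FLOP/s); check the datasheet
    const double peak_bw    = 3.35e12; // H800's commonly quoted HBM3 bandwidth, ~3.35 TB/s
    printf("MFU = %.1f%%\n", 100.0 * utilization(2.0e12 / seconds, peak_flops));
    printf("MBU = %.1f%%\n", 100.0 * utilization(1.0e10 / seconds, peak_bw));
    return 0;
}
```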
Reproduction
python3 ./tests/test_flash_mla.py --dtype e4m3