[OP] support cfp8 in blackwell mla #7876
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-21 13:50:38
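📋 Review summary
PR overview: Adds FP8 KV cache quantization support to the MLA attention prefill cache write and to the decode path on Blackwell GPUs.
Scope of changes: custom_ops/gpu_ops/append_attn/ (CUDA kernels), fastdeploy/model_executor/layers/attention/ (Python dispatch layer)
Impact tags: [Quantization] [OP]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/attention/mla_attention_backend.py:1003 | The global compiled_mla cache does not distinguish between the FP8 and FP16 kernel types |
| ❓ Question | custom_ops/gpu_ops/append_attn/mla_cache_kernel.cuh:233 | FP8 quantization uses no scale factor; the accuracy impact is unknown |
| 📝 PR conventions | - | Title and description do not comply |
📝 PR conventions check
The PR title is `commit`, which does not follow the `[Tag] description` format at all; every section of the PR description is an empty placeholder, which is non-compliant.
Suggested title (can be copied as-is):
[Quantization] Add FP8 KV cache support for MLA prefill write and Blackwell decode
Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):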
## Motivation
Add FP8 (e4m3) KV cache quantization support to the MLA (Multi-head Latent Attention) prefill cache write path and to the Blackwell GPU decode path, in order to reduce KV cache GPU memory usage and improve inference throughput on Blackwell.
## Modifications
- `custom_ops/gpu_ops/append_attn/mla_cache_kernel.cu/.cuh`: `PrefillMLAWriteCache` gains a `cache_quant_type_str` parameter; when its value is `cache_fp8`, kv_nope/kv_pe are clamped and cast before being written to the FP8 KV cache; the original BF16/FP16 path is kept for `none`
- `fastdeploy/model_executor/layers/attention/mla_attention_backend.py`: `forward_mixed` now reads `layer.cache_quant_type_str` (default `"none"`) and passes it to the prefill cache op; `mla_blackwell` gains an FP8 branch that switches to the `mla_decode_fp8` kernel when `latent_cache.dtype == uint8`
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
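- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The direction of this PR is clear, and the kernel logic for FP8 KV cache quantization is essentially correct. The global compiled_mla cache needs to distinguish between kernel types, and the accuracy impact of FP8 quantization should be backed by verification results. The PR conventions (title and description) need to be fixed before merging.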
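
The compiled_mla caching concern named in this assessment can be illustrated with a cache keyed by kernel type. This is a minimal sketch and not the PR's code: `fake_compile`, `get_compiled_mla`, `compile_count`, and the dtype strings are hypothetical stand-ins, and the real compilation step (`cute.compile` in the PR) is deliberately not called here.

```python
from typing import Callable, Dict


def fake_compile(kind: str) -> Callable[[str], str]:
    """Hypothetical stand-in for an expensive kernel compilation step."""
    # Assumption for illustration only: the FP8 kernel expects a uint8 cache,
    # the FP16 kernel expects a bfloat16 cache.
    expected = "uint8" if kind == "fp8" else "bfloat16"

    def kernel(tensor_dtype: str) -> str:
        if tensor_dtype != expected:
            raise TypeError(f"{kind} kernel got {tensor_dtype}, expected {expected}")
        return f"ran {kind} kernel"

    return kernel


# One cache entry per kernel type instead of a single global slot.
_compiled_mla: Dict[str, Callable[[str], str]] = {}
compile_count = 0


def get_compiled_mla(use_fp8_cache_kv: bool) -> Callable[[str], str]:
    """Compile each kernel variant at most once and return the matching one."""
    global compile_count
    key = "fp8" if use_fp8_cache_kv else "fp16"
    if key not in _compiled_mla:
        _compiled_mla[key] = fake_compile(key)
        compile_count += 1
    return _compiled_mla[key]


# Mixing both modes in one process no longer reuses the wrong kernel.
print(get_compiled_mla(False)("bfloat16"))
print(get_compiled_mla(True)("uint8"))
print(get_compiled_mla(True)("uint8"))
print("compilations:", compile_count)
```

Keying the cache on the flag keeps the compile-once behavior for each variant while removing the dependence on which mode happened to run first.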
from mla_decode_fp16 import BlackwellMultiHeadLatentAttentionForwardFP16

# from mla_decode_fp8 import BlackwellMultiHeadLatentAttentionForwardFP8
if use_fp8_cache_kv:
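🟡 Suggestion: the global compiled_mla cache does not distinguish between the FP8 and FP16 kernel types
Current logic: compiled_mla is compiled once, when it is None, and then reused for the lifetime of the process. If it was first initialized with use_fp8_cache_kv=False (FP16 kernel) and is later called with use_fp8_cache_kv=True in the same process, compiled_mla still points to the FP16 version, and passing an FP8 tensor triggers a dtype-mismatch runtime error (and vice versa).
Suggested fix: cache the compiled result of each kernel separately, using either two variables or a dict: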
global compiled_mla_fp8, compiled_mla_fp16
if use_fp8_cache_kv:
    if compiled_mla_fp8 is None:
        compiled_mla_fp8 = cute.compile(mla, ...)
    compiled_mla_fp8(...)
else:
    if compiled_mla_fp16 is None:
        compiled_mla_fp16 = cute.compile(mla, ...)
    compiled_mla_fp16(...)

Load<T, VecSize>(&kv_nope[ori_idx], &src_vec);
Store<T, VecSize>(src_vec, &kv_cache[tgt_idx]);

if constexpr (std::is_same_v<CT, __nv_fp8_e4m3>) {
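❓ Question: FP8 quantization does not use a scale factor
The current implementation is equivalent to static quantization with scale=1.0 (values are only clamped to the FP8 e4m3 range of ±448). If the MLA KV cache activations actually sit at a small magnitude (for example ±10), FP8 e4m3 has only about 4 exponent levels in that range, so the precision loss may not be negligible. Please confirm:
- Is there quantization accuracy comparison data for target models such as DeepSeek-R1?
- Was the scale omitted deliberately (for example because the activations are already normalized)? Please add an explanation to the Accuracy Tests section of the PR description.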
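
To make the scale question concrete, the sketch below simulates an FP8 e4m3 round trip in pure Python and compares scale=1.0 against a per-tensor scale. It rests on stated assumptions: `quantize_e4m3` and `roundtrip` are hypothetical helpers that model e4m3 as 3 mantissa bits, a smallest normal exponent of -6, saturation at ±448, and round-to-nearest-even. It does not handle NaN or infinity, it is not the kernel's actual conversion code, and the sample values are invented for illustration rather than taken from a model.

```python
import math

E4M3_MAX = 448.0        # largest finite e4m3 magnitude
E4M3_MIN_EXP = -6       # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Clamp to +/-448 and round to the nearest value on the e4m3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)        # subnormals share the minimum exponent
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    q = round(a / step) * step            # Python rounds halves to even
    return sign * min(q, E4M3_MAX)


def roundtrip(x: float, scale: float = 1.0) -> float:
    """Quantize x / scale to e4m3, then dequantize by multiplying back."""
    return quantize_e4m3(x / scale) * scale


# Illustrative values only; a real check would use captured KV cache tensors.
values = [0.001, 0.01, 0.5, 3.0, 10.0]
scale = max(abs(v) for v in values) / E4M3_MAX   # per-tensor scale

for v in values:
    plain = roundtrip(v)
    scaled = roundtrip(v, scale)
    print(
        f"{v:>6}: scale=1.0 rel err {abs(plain - v) / v:.1%}, "
        f"per-tensor scale rel err {abs(scaled - v) / v:.1%}"
    )
```

In this model the relative error of values in the normal range is bounded by the 3 mantissa bits regardless of scale, so a scale mainly helps values that would otherwise land in the subnormal range or be clamped at ±448. Whether that matters for a given model is exactly what the requested accuracy data would show.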
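CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Required tasks currently: 1 failed, 0 running, 0 pending. Merging is not recommended yet; the failure is the diff coverage gate of the main test job.
2 Task status summary
Note on the log column: failed tasks link directly to their logs; running tasks link to the corresponding job.
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 28/31 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: insufficient code coverage (confidence: high)
Failed test cases: none. The log shows
Root cause details:
Key logs:
Suggested fix:
Fix summary:
Related changes:
Link: view logs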
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7876 +/- ##
==========================================
Coverage ? 63.63%
==========================================
Files ? 462
Lines ? 64492
Branches ? 9889
==========================================
Hits ? 41037
Misses ? 20673
Partials ? 2782
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with sm_version >= 100