[Feature] support decode attention for mix #7688
lizhenyun01 wants to merge 11 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: CI is still running. 1 required task has failed (approval issue) and 6 required tasks are in progress; resolve the approval first, then check the overall result.
2 Task status summary
2.1 Required tasks: 3/10 passed
2.2 Optional tasks: 22/26 passed
3 Failure details (required only)
Approval — approval process (confidence: high)
Fix suggestion: contact the relevant RDs to complete the 4 required approvals, then rerun. Link: view logs
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
```
@@           Coverage Diff            @@
##             develop    #7688   +/- ##
==========================================
  Coverage           ?   71.28%
==========================================
  Files              ?      399
  Lines              ?    55649
  Branches           ?     8697
==========================================
  Hits               ?    39667
  Misses             ?    13225
  Partials           ?     2757
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: ❌ 1 required task has failed and must be resolved before the PR can be merged.
2 Task status summary
2.1 Required tasks: 1/2 passed
2.2 Optional tasks: 11/17 passed
3 Failure details (required only)
Approval — missing approval (confidence: high)
Fix suggestion: request approval from the relevant RDs (dangqingqing/jeff41404/liuzichang01/jiangjiajun, etc.). Link: view logs
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 12:20:34
📋 Review Summary
PR overview: adds a chunk-parallel decode attention kernel (decode_append_attention) for the Hopper (SM90+) architecture, supporting C16 (fp16/bf16 KV cache) and static C8 (INT8 KV cache), enabled via the USE_DECODE_ATTENTION=1 environment variable.
Scope of changes: custom_ops/gpu_ops/append_attention/, fastdeploy/model_executor/layers/attention/, fastdeploy/worker/, fastdeploy/envs.py
Impact tags: [OP] [Feature]
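For orientation, here is a minimal sketch of the env-var feature-flag dispatch described above. The helper names (`use_decode_attention`, `run_attention`) and the stand-in kernel functions are illustrative assumptions, not the PR's actual code:

```python
import os

def use_decode_attention() -> bool:
    # Feature flag: USE_DECODE_ATTENTION defaults to "0", i.e. legacy path.
    return os.getenv("USE_DECODE_ATTENTION", "0") == "1"

# Stand-ins for the real CUDA ops registered by the PR.
def decode_append_attention(q):
    return f"SM90+ chunk-parallel decode path: {q}"

def append_attention(q):
    return f"legacy append-attention path: {q}"

def run_attention(q):
    # Dispatch to the new kernel only when the flag is explicitly set,
    # which is what keeps the change backward compatible.
    if use_decode_attention():
        return decode_append_attention(q)
    return append_attention(q)
```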
📝 PR Convention Check
The PR title tag [Feature] is compliant, but the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (containing only HTML comment placeholders), and no Checklist item is checked; please complete them.
Suggested title (copy-paste ready):
[Feature] Add decode_append_attention kernel for C16/C8 KV cache (SM90+)
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Adds a chunk-parallel decode attention implementation (`decode_append_attention` kernel) for the Hopper (SM90+) architecture, supporting two KV cache quantization formats: C16 (fp16/bf16) and static C8 (INT8). With flash_attn enabled, it is turned on via `export USE_DECODE_ATTENTION=1`; the legacy SM80 path (`gpu_ops/append_attn/`) is unchanged.
## Modifications
- `custom_ops/gpu_ops/append_attention/`: new decode attention implementation directory, containing:
  - `attention_func.cuh`: core compute functions (QK, softmax update, PV, output write-back, multi-warp merge)
  - `decode_append_attention_c16_impl.cuh`: C16 (fp16/bf16) KV cache implementation
  - `decode_append_attention_c8_impl.cuh`: static C8 (INT8) KV cache implementation
  - `config_for_attention.cu`: decode pre-pass configuration kernel (precomputes block indices and chunk sizes)
  - `cu_tensor_map.cuh`, `mem_util.cuh`, `mma_tensor_op.cuh`, `utils.cuh`: helper headers
  - `template_config.json`: template instantiation config
- `custom_ops/gpu_ops/decode_append_attention.cu`: body of the new `decode_append_attention` CUDA op
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: matching update to the RoPE cache-write op
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the new op (`PD_BUILD_STATIC_OP`)
- `custom_ops/setup_ops.py`: adds the new source-file list and template-generation step to the SM90+ (CC >= 90) branch
- `custom_ops/utils/auto_gen_template_attention.py`: new template-generation script
- `fastdeploy/envs.py`: adds the `USE_DECODE_ATTENTION` environment variable, default 0
- `fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/ops/config_for_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: conditionally allocates the six chunking buffers (decode_block_indices, decode_tmp_workspace, etc.; see the sketch after this list)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`: integrates the new decode attention call path
- `fastdeploy/worker/gpu_model_runner.py`, `metax_model_runner.py`, `input_batch.py`: supporting adaptations
- `tests/operators/attention/`: new unit tests (test_decode_append_attention.py, test_decode_append_attention_c16.py) and a benchmark
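To make the conditional buffer-allocation item above concrete, here is a hedged sketch; the buffer names follow the list above, but the function, shapes, and dtypes are invented for illustration and are not the PR's code:

```python
import paddle

def allocate_decode_buffers(max_num_blocks: int, enable_decode_attention: bool):
    # Sketch: the chunking workspaces are only allocated when the new
    # decode path is enabled, so the legacy path pays no extra memory cost.
    if not enable_decode_attention:
        return None
    return {
        "decode_block_indices": paddle.zeros([max_num_blocks], dtype="int32"),
        "decode_tmp_workspace": paddle.zeros([max_num_blocks, 128], dtype="float32"),
    }
```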
## Usage or Command
```bash
# With flash_attn enabled, turn on decode_append_attention (SM90+/Hopper only)
export USE_DECODE_ATTENTION=1
```
## Accuracy Tests
N/A (no accuracy comparison results are provided in this PR).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attention/attention_func.cuh:88 | `#pragma unroll` placed before a non-loop statement |
| ❓ Question | fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py | The Python wrapper only checks `is_cuda()`; setting USE_DECODE_ATTENTION=1 on an SM80 machine may trigger an ImportError |
Overall assessment
The overall design of the new decode_append_attention kernel is clear, backward compatibility is preserved through an environment-variable feature flag, and tests ship in the same PR. It is recommended to complete the PR description and add an SM version check on the Python wrapper side to prevent misuse.
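A possible shape for the suggested guard, sketched under the assumption that Paddle's device-capability query is available; the helper name is illustrative, not the PR's code:

```python
import paddle

def decode_attention_supported() -> bool:
    # Guard sketch: the new kernel only builds/runs on SM90+ (Hopper),
    # so reject the feature flag on older architectures instead of
    # failing later with an ImportError.
    if not paddle.device.is_compiled_with_cuda():
        return False
    major, _minor = paddle.device.cuda.get_device_capability()
    return major >= 9
```

With such a check, the wrapper can fall back to the legacy path (or raise a clear error) on SM80 machines instead of surfacing an ImportError.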
```cpp
#pragma unroll
for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
  const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
#pragma unroll
```
🟡 Suggestion: `#pragma unroll` directive placed before a non-loop statement
The statement following the second `#pragma unroll` here is `const int j = ty;`, not a loop, so the compiler ignores the directive (and some compilers emit a warning). The pragma should be removed:
```cpp
const uint32_t tx_offset = tx / 8;
#pragma unroll
for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
  const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
  // the redundant #pragma unroll that was here has been removed
  const int j = ty;
```
Motivation
C16/static C8 attention support. Usage: with flash_attn enabled, export USE_DECODE_ATTENTION=1.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.