[Cherry-Pick][Feature] support decode attention for mix(#7688) by lizhenyun01 · Pull Request #7729 · PaddlePaddle/FastDeploy

lizhenyun01 · 2026-05-07T05:14:55Z

Motivation

Adds attention support for C16 and static C8. Usage: with flash_attn enabled, run export USE_DECODE_ATTENTION=1

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-07T05:15:01Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-07T06:58:03Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 14:56:04

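This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 3f29b01
Merge base: 66dea60 (branch: release/2.6)
View full diff
CI details

1 Job overview

❌ 1 Required job failed and must be handled first (the Approval check did not pass). Another 8 Required jobs were skipped because of upstream dependencies (not blocking merge).

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 24 (0) | 24 | 5 | 6 | 0 | 1 | 12 |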

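2 Job status summary

2.1 Required jobs: 1/10 passed

Required jobs block merging; failures must be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | `Approval` | 9s | PR issue: the PR is missing 5 required approvals (custom op / spec_decode / envs.py changes) | Ask the required RDs to approve each protected module | Job | - |
| ⏭️ | 8 required jobs skipped (downstream build/test jobs such as `FD-Build-Linux`, including the main test job `Run FastDeploy Unit Tests and Coverage`) | - | - | - | - | - |
| ✅ | The remaining 1 required job passed (`Pre Commit`) | - | - | - | - | - |

⚠️ Note on the main test job: Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage was skipped (a cascading skip caused by the failure of the upstream optional job FD-Clone-Linux / code-clone). It did not actually run and currently does not block merging.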
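
The cascading skip described in the note can be illustrated with a small sketch. It assumes a simple model in which a job is skipped whenever any upstream job failed or was itself skipped; the function name `resolve_statuses` and the dependency chain in the usage example (clone, then build, then tests) are illustrative assumptions, not the actual workflow definition.

```python
def resolve_statuses(results, depends_on):
    """Mark a job 'skipped' when any upstream job failed or was skipped.

    results:    job name -> 'passed' or 'failed' for jobs that would run.
    depends_on: job name -> list of upstream job names (assumed acyclic).
    """
    final = {}

    def status(job):
        if job not in final:
            upstream = [status(dep) for dep in depends_on.get(job, [])]
            if any(s in ("failed", "skipped") for s in upstream):
                final[job] = "skipped"
            else:
                # Jobs without a recorded result are assumed to pass.
                final[job] = results.get(job, "passed")
        return final[job]

    for job in set(results) | set(depends_on):
        status(job)
    return final


# Assumed dependency chain, for illustration only.
depends_on = {
    "FD-Build-Linux": ["FD-Clone-Linux / code-clone"],
    "Run FastDeploy Unit Tests and Coverage": ["FD-Build-Linux"],
}
results = {"FD-Clone-Linux / code-clone": "failed"}
statuses = resolve_statuses(results, depends_on)
```

In this model a single failed clone job is enough to leave every downstream job in the skipped state, which matches the pattern the report describes.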

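2.2 Optional jobs: 4/14 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | `Check PR Template` | 12s | Job | - |
| ❌ | `FD-Clone-Linux / code-clone` | 15s | Job | - |
| ❌ | `FD-Clone-Linux-ILUVATAR / code-clone` | 16s | Job | - |
| ❌ | `FD-Clone-Linux-XPU / code-clone` | 51s | Job | - |
| ❌ | `Trigger Jenkins for PR` | 41s | Job | - |
| ⏸️ | `CI_HPU` | - | - | - |
| ✅ | The remaining 4 optional jobs passed (`check_bypass / Check bypass` of each workflow) | - | - | - |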

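3 Failure details (required only)

Approval: approval check (confidence: high)

Approval

Status: ❌ Failed
Error type: approval check
Confidence: high
Root cause summary: the PR triggers 5 approval rules and lacks approval from the designated RDs
Analyzer: generic analysis (fallback)

Root cause details:
scripts/check_approval.sh detected that this PR modifies several protected modules but has not yet received a Review Approve from the corresponding owners. 5 rules were triggered:

Adding a custom Op → requires approval from at least 1 FastDeploy RD (@qingqing01 / @Jiang-Jia-Jun / @heavengate)
Adding a custom Op → requires approval from at least 1 PaddlePaddle RD (@jeff41404 / @yongqiangma)
Modifying fastdeploy/spec_decode or custom_ops/gpu_ops/speculate_decoding → requires approval from at least 1 FastDeploy RD (@freeliuzc / @Deleter-D)
Modifying fastdeploy/envs.py → requires approval from at least 1 FastDeploy RD (@Jiang-Jia-Jun / @yuanlehome / @rainyfly / @Wanglongzhi2001)
Cherry-Pick rule check → when the PR comes from the develop branch, the title must contain [Cherry-Pick] and the original PR number; otherwise approval from the corresponding RD is missing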
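
The "at least 1 approver per rule" logic listed above can be sketched as follows. This is a minimal illustration, not the contents of scripts/check_approval.sh: the `ApprovalRule` class and `unmet_rules` function are hypothetical names, only the four reviewer-based rules are modeled, and the reviewer handles are copied from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRule:
    """One rule: at least one of `candidates` must have approved."""
    name: str
    candidates: frozenset


# Rule names and reviewer handles are copied from the CI report.
RULES = [
    ApprovalRule("custom op (FastDeploy RD)",
                 frozenset({"qingqing01", "Jiang-Jia-Jun", "heavengate"})),
    ApprovalRule("custom op (PaddlePaddle RD)",
                 frozenset({"jeff41404", "yongqiangma"})),
    ApprovalRule("spec_decode / speculate_decoding",
                 frozenset({"freeliuzc", "Deleter-D"})),
    ApprovalRule("fastdeploy/envs.py",
                 frozenset({"Jiang-Jia-Jun", "yuanlehome",
                            "rainyfly", "Wanglongzhi2001"})),
]


def unmet_rules(approved_by, rules=RULES):
    """Return the names of rules with no approver among `approved_by`."""
    approved = set(approved_by)
    return [rule.name for rule in rules if not (rule.candidates & approved)]


# With no approvals, every modeled rule is unmet.
missing_all = unmet_rules([])
# One reviewer may satisfy several rules at once.
missing_some = unmet_rules(["Jiang-Jia-Jun", "jeff41404"])
```

Because @Jiang-Jia-Jun appears in two candidate lists, a single approval from that reviewer clears both the custom-op (FastDeploy) rule and the envs.py rule in this model.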

Key logs:

==> PR title: [Feature] support decode attention for mix(#7688)
There are 5 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

On the GitHub PR page, @mention the following owners and request an Approve:
- Custom Op: @qingqing01 or @Jiang-Jia-Jun or @heavengate
- Custom Op (PaddlePaddle): @jeff41404 or @yongqiangma
- spec_decode / custom_ops changes: @freeliuzc or @Deleter-D
- envs.py changes: @Jiang-Jia-Jun or @yuanlehome or @rainyfly or @Wanglongzhi2001
If this PR is a Cherry-Pick, correct the title format to [Cherry-Pick] ... (#original develop PR number) and obtain approval from one of @qingqing01 / @Jiang-Jia-Jun / @heavengate

Fix summary: ask the required RDs to approve each protected module

Link: view log

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-07 15:07:05

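📋 Review summary

PR overview: Cherry-pick of #7688. Adds decode attention support for the mix scenario (C16 fp16/bf16 KV cache and static C8 quantization), enabled through the environment variable USE_DECODE_ATTENTION=1. The CUDA kernels are currently compiled only for sm90+ (H100/Hopper) platforms
Scope of changes: custom_ops/gpu_ops/append_attention/ (CUDA kernels), model_executor/layers/attention/ops/ (Python wrappers), worker/gpu_model_runner.py, spec_decode/mtp.py, tests/operators/attention/
Affected tags: [OP] [Feature] [Speculative Decoding] [Metax]

📝 PR convention check

The title [Cherry-Pick][Feature] support decode attention for mix(#7688) follows the Cherry-Pick convention (the [Cherry-Pick][Tag] title(#original PR number) format is correct). However, the Modifications, Usage or Command, and Accuracy Tests sections of the description contain only placeholder comments and no content; they need to be filled in.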
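
The title convention mentioned here can be checked with a short regular expression. The pattern below is an assumption derived only from the format string quoted in this review ([Cherry-Pick][Tag] title(#original PR number)); it is not the rule the CI script applies, and `original_pr_number` is a hypothetical helper.

```python
import re

# Assumed pattern: "[Cherry-Pick]", one or more further "[Tag]" groups,
# free-form title text, then "(#<original PR number>)" at the very end.
CHERRY_PICK_TITLE = re.compile(
    r"^\[Cherry-Pick\](?:\[[^\[\]]+\])+\s*.+?\s*\(#(\d+)\)$"
)


def original_pr_number(title):
    """Return the original PR number if the title matches, else None."""
    match = CHERRY_PICK_TITLE.match(title.strip())
    return int(match.group(1)) if match else None


new_title = "[Cherry-Pick][Feature] support decode attention for mix(#7688)"
old_title = "[Feature] support decode attention for mix(#7688)"
```

Applied to the two titles this PR has carried, the sketch accepts the renamed title and rejects the earlier one that lacked the [Cherry-Pick] prefix.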

Suggested PR description (can be copied directly):

## Motivation

Cherry-pick from #7688. Adds decode attention support for the mix scenario (mixed Prefill+Decode requests), covering two KV cache formats: C16 (fp16/bf16 KV cache) and static C8 quantization. It is enabled through an environment variable when flash_attn is on, and currently supports only sm90+ platforms (H100/Hopper and above).

## Modifications

- New CUDA kernels: `custom_ops/gpu_ops/decode_append_attention.cu`, `decoder_write_cache_with_rope.cu`, and the C8/C16 implementations under `custom_ops/gpu_ops/append_attention/` (attention_func.cuh, decode_append_attention_c8_impl.cuh, decode_append_attention_c16_impl.cuh, etc.)
- New Python wrappers: `fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py`, `config_for_attention.py`, `decoder_write_cache_with_rope.py`
- Added the `USE_DECODE_ATTENTION` environment variable to `fastdeploy/envs.py` (default 0, disabled)
- Integrated decode attention buffer allocation and passing into forward_meta in `fastdeploy/worker/gpu_model_runner.py` and `metax_model_runner.py`
- Added pass-through of decode attention inputs in `fastdeploy/spec_decode/mtp.py`
- Added the build entry in `custom_ops/setup_ops.py` (only for cc>=90, Hopper and above)
- New tests: `tests/operators/attention/test_decode_append_attention.py`, `test_decode_append_attention_c16.py`, `benchmark_decode_attention.py`

## Usage or Command

With the flash_attn backend enabled, set the following environment variable to enable decode attention: `export USE_DECODE_ATTENTION=1` (note: the build currently supports only sm90+, Hopper and above)

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

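Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | `fastdeploy/spec_decode/mtp.py` | After the cherry-pick merge there are two identical if blocks (duplicate assignments); one of them should be removed |
| ❓ Question | `fastdeploy/worker/metax_model_runner.py` | The `fix ci` commit removed the assignment of the decode buffers to `forward_meta`. Does the Metax platform still work correctly with `USE_DECODE_ATTENTION=1`? |
| 📝 PR convention | PR description | The `Modifications`, `Usage or Command`, and `Accuracy Tests` sections are empty |

Note: the new decode attention op is compiled in setup_ops.py only when cc >= 90 (Hopper and above). On platforms such as A100 (sm80), the op is not loaded even if USE_DECODE_ATTENTION=1 is set. Consider stating the hardware requirement in the documentation or the PR description to avoid confusing users.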
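
The gating behaviour in this note can be expressed as a small predicate. This is a sketch under stated assumptions: `decode_attention_active` is a hypothetical helper, the flag is compared as the string "1" (how FastDeploy actually parses it is not shown in this PR page), and the flash_attn requirement is not modeled. The compute capability is passed in as a tuple so the example stays self-contained.

```python
def decode_attention_active(env, compute_capability):
    """Return True only if the flag is set AND the GPU is sm90 or newer.

    env:                mapping such as os.environ.
    compute_capability: (major, minor) tuple, e.g. (9, 0) for H100.
    """
    # Default "0" (disabled), matching the default described for envs.py.
    flag = env.get("USE_DECODE_ATTENTION", "0")
    major, minor = compute_capability
    # cc >= 90 mirrors the build condition quoted from setup_ops.py.
    return flag == "1" and major * 10 + minor >= 90


on_h100 = decode_attention_active({"USE_DECODE_ATTENTION": "1"}, (9, 0))
on_a100 = decode_attention_active({"USE_DECODE_ATTENTION": "1"}, (8, 0))
```

A check of this kind, run at startup, would let a deployment log a warning when the variable is set on hardware where the op was never built.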

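Overall assessment

The implementation is clearly structured: the CUDA kernels, Python wrappers, SM gating, and tests are all in place, and the Cherry-Pick format is correct. The main issues are the duplicated code block in mtp.py left by the cherry-pick merge, which needs cleanup, and the decode buffer assignment path on the Metax platform, which the author needs to confirm.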

lizhenyun01 added 11 commits May 7, 2026 13:03

support c8 decode attention

2265dc9

support c16 attention && backend

4c922bc

opt kernel

de6450d

fix

111230a

opt larger batch

03263a0

inplace out

cb64cb3

fix input_batch && remove fast_math

b1acb37

fix xpu

a5e394f

fix bug

6a5b3c6

fix ci

307e5a8

opt and fix mtp

3f29b01

lizhenyun01 had a problem deploying to Metax_ci May 7, 2026 05:14 — with GitHub Actions Failure

lizhenyun01 changed the title ~~[Feature] support decode attention for mix(#7688)~~ [Cherry-Pick][Feature] support decode attention for mix(#7688) May 7, 2026

PaddlePaddle-bot reviewed May 7, 2026

lizhenyun01 wants to merge 11 commits into PaddlePaddle:release/2.6 from lizhenyun01:dec_attn_2.6
