
[Cherry-Pick][Feature] support decode attention for mix(#7688)#7729

Open
lizhenyun01 wants to merge 11 commits into PaddlePaddle:release/2.6 from lizhenyun01:dec_attn_2.6


@lizhenyun01
Collaborator

Motivation

Adds support for C16 and static C8 attention. Usage: with flash_attn enabled, run export USE_DECODE_ATTENTION=1.
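A minimal sketch of how an on/off switch like this is commonly read in Python, assuming the flag is an integer that defaults to 0 (disabled). The helper name `use_decode_attention` is hypothetical and is not taken from `fastdeploy/envs.py`, whose actual parsing may differ.

```python
import os


def use_decode_attention(environ=None) -> bool:
    """Return True when USE_DECODE_ATTENTION is set to a non-zero integer.

    Hypothetical helper for illustration only; FastDeploy's own handling
    of this variable may differ.
    """
    env = os.environ if environ is None else environ
    raw = env.get("USE_DECODE_ATTENTION", "0")
    try:
        return int(raw) != 0
    except ValueError:
        # Treat unparsable values as "disabled" rather than raising.
        return False
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the process environment.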

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 7, 2026

Thanks for your contribution!

@lizhenyun01 lizhenyun01 changed the title [Feature] support decode attention for mix(#7688) [Cherry-Pick][Feature] support decode attention for mix(#7688) May 7, 2026
@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 14:56:04

This CI report was generated from the following code (refreshed every 30 minutes):


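1 Task overview

1 Required task failed and needs attention first (the Approval check did not pass). Another 8 Required tasks were skipped because of upstream dependencies (they do not block merging).

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 24 (0) | 24 | 5 | 6 | 0 | 1 | 12 |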

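2 Task status summary

2.1 Required tasks: 1/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 9s | PR issue: the PR lacks 5 required approvals (custom op / spec_decode / envs.py changes) | Contact the required RDs to approve each protected module | Job | - |
| ⏭️ | 8 required tasks skipped (downstream build/test tasks such as FD-Build-Linux, including the main test task Run FastDeploy Unit Tests and Coverage) | - | - | - | - | - |
| ✅ | The remaining 1 required task passed (Pre Commit) | - | - | - | - | - |

⚠️ Note on the main test task: Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage is in the skipped state (a cascading skip caused by the failure of the upstream optional task FD-Clone-Linux / code-clone). It did not actually run and currently does not block merging.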

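2.2 Optional tasks: 4/14 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Check PR Template | 12s | Job | - |
| ❌ | FD-Clone-Linux / code-clone | 15s | Job | - |
| ❌ | FD-Clone-Linux-ILUVATAR / code-clone | 16s | Job | - |
| ❌ | FD-Clone-Linux-XPU / code-clone | 51s | Job | - |
| ❌ | Trigger Jenkins for PR | 41s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The remaining 4 optional tasks passed (the check_bypass / Check bypass job of each workflow) | - | - | - |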

3 Failure details (required only)

Approval: approval check (confidence: high)

Approval

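  • Status: ❌ Failed
  • Error type: approval check
  • Confidence: high
  • Root cause summary: the PR triggers 5 approval rules and lacks approval from the designated RDs
  • Analyzer: generic analysis (fallback)

Root cause details:
scripts/check_approval.sh detected that this PR modifies several protected modules but has not yet received a Review Approve from the corresponding owners. 5 rules were triggered:

  1. Adding a custom Op → requires approval from at least 1 FastDeploy RD (@qingqing01 / @Jiang-Jia-Jun / @heavengate)
  2. Adding a custom Op → requires approval from at least 1 PaddlePaddle RD (@jeff41404 / @yongqiangma)
  3. Modifying fastdeploy/spec_decode or custom_ops/gpu_ops/speculate_decoding → requires approval from at least 1 FastDeploy RD (@freeliuzc / @Deleter-D)
  4. Modifying fastdeploy/envs.py → requires approval from at least 1 FastDeploy RD (@Jiang-Jia-Jun / @yuanlehome / @rainyfly / @Wanglongzhi2001)
  5. Cherry-Pick rule check → a PR cherry-picked from the develop branch must carry [Cherry-Pick] and the original PR number in its title, or the corresponding RD approval has not been obtained

Key log lines: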

==> PR title: [Feature] support decode attention for mix(#7688)
There are 5 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

  1. On the GitHub PR page, @-mention the following owners and request an Approve:
    • Custom Op: @qingqing01 @Jiang-Jia-Jun @heavengate
    • Custom Op (PaddlePaddle): @jeff41404 @yongqiangma
    • spec_decode / custom_ops changes: @freeliuzc @Deleter-D
    • envs.py changes: @Jiang-Jia-Jun @yuanlehome @rainyfly @Wanglongzhi2001
  2. If this PR is a cherry-pick, change the title to the form [Cherry-Pick] ... (#original develop PR number) and obtain approval from one of @qingqing01 / @Jiang-Jia-Jun / @heavengate

Fix summary: contact the required RDs to approve each protected module

Link: view log


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-07 15:07:05

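📋 Review summary

PR overview: cherry-pick of #7688. Adds decode attention support for the mix scenario (C16 fp16/bf16 KV cache and static C8 quantization), enabled with the environment variable USE_DECODE_ATTENTION=1. The CUDA kernels are currently compiled only for sm90+ (H100/Hopper) platforms.
Scope of changes: custom_ops/gpu_ops/append_attention/ (CUDA kernels), model_executor/layers/attention/ops/ (Python wrappers), worker/gpu_model_runner.py, spec_decode/mtp.py, tests/operators/attention/
Impact tags: [OP] [Feature] [Speculative Decoding] [Metax]

📝 PR convention check

The title [Cherry-Pick][Feature] support decode attention for mix(#7688) follows the Cherry-Pick convention (the [Cherry-Pick][Tag] title(#original PR number) format is correct). However, the Modifications, Usage or Command, and Accuracy Tests sections of the description contain only placeholder comments and are otherwise empty; they need to be filled in.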
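The title convention can be illustrated with a small regular-expression check. This is only a sketch under the assumption that the expected shape is `[Cherry-Pick]`, then one or more bracketed tags, then free text ending in `(#<number>)`. The function name `is_cherry_pick_title` is hypothetical, and the real rule in scripts/check_approval.sh is not shown in this thread and may differ.

```python
import re

# Assumed shape: "[Cherry-Pick][Tag] title(#1234)", with one or more tags.
CHERRY_PICK_TITLE = re.compile(
    r"^\[Cherry-Pick\](\[[^\[\]]+\])+\s*.+\(#\d+\)\s*$"
)


def is_cherry_pick_title(title: str) -> bool:
    """Return True if the title matches the assumed Cherry-Pick format."""
    return CHERRY_PICK_TITLE.match(title) is not None
```

Under this assumed pattern the current title, `[Cherry-Pick][Feature] support decode attention for mix(#7688)`, matches, while the earlier title without the `[Cherry-Pick]` prefix does not.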

Suggested PR description (can be copied directly):

## Motivation

Cherry-Pick from #7688. Adds decode attention support for the mix scenario (mixed Prefill+Decode requests), covering two KV cache formats: C16 (fp16/bf16 KV cache) and static C8 quantization. With flash_attn enabled, the feature is turned on through an environment variable. Only sm90+ platforms (H100/Hopper and above) are currently supported.

## Modifications

- New CUDA kernels: `custom_ops/gpu_ops/decode_append_attention.cu`, `decoder_write_cache_with_rope.cu`, and the C8/C16 implementations under `custom_ops/gpu_ops/append_attention/` (attention_func.cuh, decode_append_attention_c8_impl.cuh, decode_append_attention_c16_impl.cuh, etc.)
- New Python wrappers: `fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py`, `config_for_attention.py`, `decoder_write_cache_with_rope.py`
- `fastdeploy/envs.py`: adds the `USE_DECODE_ATTENTION` environment variable (default 0, disabled)
- `fastdeploy/worker/gpu_model_runner.py`, `metax_model_runner.py`: integrate decode attention buffer allocation and pass the buffers into forward_meta
- `fastdeploy/spec_decode/mtp.py`: passes the decode attention inputs through
- `custom_ops/setup_ops.py`: adds the build entry (only for cc>=90, Hopper and above)
- New tests: `tests/operators/attention/test_decode_append_attention.py`, `test_decode_append_attention_c16.py`, `benchmark_decode_attention.py`

## Usage or Command

With the flash_attn backend enabled, set the following environment variable to enable decode attention: `export USE_DECODE_ATTENTION=1` (note: the op is currently compiled only for sm90+ platforms, Hopper and above)

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

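Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/spec_decode/mtp.py | After the cherry-pick merge there are two identical if blocks (duplicate assignment); one of them should be removed |
| ❓ Question | fastdeploy/worker/metax_model_runner.py | The "fix ci" commit removed the assignment of the decode buffers to forward_meta. Does the Metax platform still work correctly with USE_DECODE_ATTENTION=1? |
| 📝 PR convention | PR description | The Modifications, Usage or Command, and Accuracy Tests sections are empty |

Note: in setup_ops.py the new decode attention op is compiled only when cc >= 90 (Hopper and above). On platforms such as the A100 (sm80) the op is not loaded even if USE_DECODE_ATTENTION=1 is set. The hardware requirement should be stated in the documentation or the PR description to avoid confusing users.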
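The interaction between the environment flag and the hardware gate can be sketched as follows. This is an illustrative helper only: `resolve_decode_attention` and `MIN_COMPUTE_CAPABILITY` are hypothetical names, the threshold of 90 comes from the review note above, and the actual logic in setup_ops.py and the runtime is not reproduced here.

```python
from typing import Tuple

# sm90 (Hopper) threshold, as stated in the review note; assumed here.
MIN_COMPUTE_CAPABILITY = 90


def resolve_decode_attention(compute_capability: int, flag_value: str) -> Tuple[bool, str]:
    """Decide whether decode attention can be used, with a reason string.

    compute_capability is an integer such as 80 (A100) or 90 (H100).
    flag_value is the raw value of USE_DECODE_ATTENTION.
    """
    requested = flag_value.strip() == "1"
    if not requested:
        return False, "USE_DECODE_ATTENTION is not set to 1"
    if compute_capability < MIN_COMPUTE_CAPABILITY:
        # The op would not have been compiled for this architecture.
        return False, (
            f"sm{compute_capability} is below sm{MIN_COMPUTE_CAPABILITY}; "
            "the decode attention op is not available"
        )
    return True, "decode attention enabled"
```

Returning a reason alongside the boolean gives the caller something to log, which addresses the reviewer's concern that a user on sm80 could set the flag and see no effect without explanation.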

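Overall assessment

The implementation is well structured overall: the CUDA kernels, Python wrappers, SM gating, and tests are all in place, and the Cherry-Pick format is correct. The main issues are the duplicated code block in mtp.py left by the cherry-pick merge, which needs cleaning up, and the decode buffer assignment path on the Metax platform, which the author needs to confirm.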
