
[Feature] support decode attention for mix #7688

Open
lizhenyun01 wants to merge 11 commits into PaddlePaddle:develop from lizhenyun01:h_dec_attn

Conversation

@lizhenyun01
Collaborator

Motivation

Adds support for C16 / static C8 attention. Usage: with flash_attn enabled, run export USE_DECODE_ATTENTION=1.

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 1, 2026

Thanks for your contribution!

@PaddlePaddle-bot

PaddlePaddle-bot commented May 1, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-07 12:24:45

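CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

CI is still running: 1 Required task has failed (approval issue) and 6 Required tasks are in progress. Resolve the approvals first, then check the overall result.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 36 (0) | 36 | 25 | 2 | 7 | 2 | 0 |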

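2 Task status summary

2.1 Required tasks: 3/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR issue: 4 required approvals are missing; the PR touches custom ops and key files | Contact the relevant RDs to complete the 4 required approvals, then rerun | Job | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | Job | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | Job | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏳ | xpu_8cards_case_test / run_xpu_8cards_cases | - | Running | - | Job | - |
| ✅ | The remaining 3 required tasks passed | - | - | - | - | - |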

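2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Check PR Template | 15s | Job | - |
| ⏳ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The remaining 22 optional tasks passed | - | - | - |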

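3 Failure details (required only)

Approval: approval process (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: approval process
  • Confidence: high
  • Root cause summary: the PR is missing 4 required approvals; it touches custom ops and key files
  • Analyzer: generic analysis (fallback)

Root cause details:
The check_approval.sh script detected 4 missing approvals. This PR adds custom-op code and modifies key paths including fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding, and fastdeploy/envs.py. Under the repository rules, each item must be approved by the designated FastDeploy RD or PaddlePaddle RD.

Key log: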

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
3. You must have one FastDeploy RD (Jiang-Jia-Jun, yuanlehome, rainyfly, Wanglongzhi2001) approval for modifying [fastdeploy/envs.py].
There are 4 approved errors.
##[error]Process completed with exit code 6.
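
The four rules in the log above each ask for one approval from a named set of reviewers. A minimal Python sketch of that rule shape follows. It is only an illustration: the real logic lives in check_approval.sh, which is not shown in this thread, and the `RULES` table and `missing_approvals` helper are hypothetical names. In particular, letting one reviewer satisfy several rules is an assumption here.

```python
# Illustrative model of the four approval rules quoted in the log above.
# Not the actual check_approval.sh logic; names here are hypothetical.
RULES = [
    ("adding custom op (FastDeploy RD)",
     {"qingqing01", "Jiang-Jia-Jun", "heavengate"}),
    ("adding custom op (PaddlePaddle RD)",
     {"jeff41404", "yongqiangma"}),
    ("modifying fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding",
     {"freeliuzc", "Deleter-D"}),
    ("modifying fastdeploy/envs.py",
     {"Jiang-Jia-Jun", "yuanlehome", "rainyfly", "Wanglongzhi2001"}),
]


def missing_approvals(approvers):
    """Return the descriptions of rules that none of the approvers satisfies."""
    approved = set(approvers)
    return [desc for desc, allowed in RULES if not (allowed & approved)]


if __name__ == "__main__":
    # With no approvals yet, every rule is reported as missing.
    for desc in missing_approvals([]):
        print("missing approval for:", desc)
```

Under this model, an empty approver list leaves all four rules unsatisfied, which matches the count reported in the log.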

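Suggested fixes:

  1. Custom op approval (FastDeploy RD): ask any one of @dangqingqing / @jiangjiajun / @DENGKAIPENG to approve
  2. Custom op approval (PaddlePaddle RD): ask any one of @gaoxiang / @mayongqiang to approve
  3. spec_decode/custom_ops modification approval: ask any one of @liuzichang01 / @wangyanpeng04 to approve
  4. envs.py modification approval: ask any one of @jiangjiajun / @liuyuanle / @chenjian26 / @wanglongzhi to approve

Fix summary: contact the relevant RDs to complete the 4 required approvals, then rerun

Link: view log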


@codecov-commenter

codecov-commenter commented May 1, 2026

Codecov Report

❌ Patch coverage is 68.29268% with 26 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@0397ab5). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l_executor/layers/attention/append_attn_backend.py | 0.00% | 10 Missing and 1 partial ⚠️ |
| ...el_executor/layers/attention/flash_attn_backend.py | 33.33% | 4 Missing and 2 partials ⚠️ |
| fastdeploy/spec_decode/mtp.py | 66.66% | 5 Missing ⚠️ |
| ...cutor/layers/attention/ops/config_for_attention.py | 85.71% | 0 Missing and 1 partial ⚠️ |
| ...or/layers/attention/ops/decode_append_attention.py | 88.88% | 0 Missing and 1 partial ⚠️ |
| ...ers/attention/ops/decoder_write_cache_with_rope.py | 88.88% | 0 Missing and 1 partial ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 85.71% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7688   +/-   ##
==========================================
  Coverage           ?   71.28%           
==========================================
  Files              ?      399           
  Lines              ?    55649           
  Branches           ?     8697           
==========================================
  Hits               ?    39667           
  Misses             ?    13225           
  Partials           ?     2757           
| Flag | Coverage Δ |
|---|---|
| GPU | 71.28% <68.29%> (?) |
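
The percentages in this report can be cross-checked from the raw counts. The sketch below is a worked arithmetic check, not Codecov's own code. The project figures come straight from the Coverage Diff block and the per-file counts from the table. The patch size of 82 lines is not stated in the report; it is inferred here from "68.29268% with 26 lines missing" and should be read as an assumption.

```python
def coverage_percent(hits, total):
    """Coverage expressed as the percentage of lines hit."""
    return 100.0 * hits / total


# Project-level figures from the "Coverage Diff" block.
hits, misses, partials = 39667, 13225, 2757
lines = hits + misses + partials            # should equal the "Lines" row
project = coverage_percent(hits, lines)

# Patch-level: per-file "Missing" and "partial" counts from the table,
# in the order the files are listed.
missing = [10, 4, 5, 0, 0, 0, 0]
partial = [1, 2, 0, 1, 1, 1, 1]
uncovered = sum(missing) + sum(partial)     # lines reported as missing coverage

# ASSUMPTION: total changed lines, inferred from the reported percentage.
patch_lines = 82
patch = coverage_percent(patch_lines - uncovered, patch_lines)

if __name__ == "__main__":
    print(f"project coverage: {project:.2f}%")
    print(f"patch coverage:   {patch:.5f}%")
```

Note that the headline "26 lines missing" counts partially covered lines together with fully missed ones, which is why the two per-file columns are summed.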



@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 20:58:21

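CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

❌ 1 Required task has failed and must be handled before the PR can be merged.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 19 (0) | 19 | 12 | 2 | 2 | 3 | 0 |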

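2 Task status summary

2.1 Required tasks: 1/2 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR issue: the PR is missing 4 required approvals (custom ops and sensitive files) | Contact the relevant RDs for approval (dangqingqing/jeff41404, etc.) | Job | - |
| ✅ | The remaining 1 required task passed | - | - | - | - | - |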

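2.2 Optional tasks: 11/17 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Check PR Template | 12s | Job | - |
| ⏳ | xpu_build_test / xpu-build-test | - | Job | - |
| ⏳ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ⏸️ | FD-Build-Linux / fd-build | - | - | - |
| ✅ | The remaining 11 optional tasks passed | - | - | - |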

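3 Failure details (required only)

Approval: missing approvals (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: missing approvals
  • Confidence: high
  • Root cause summary: the PR is missing 4 required approvals; it touches custom ops and sensitive files
  • Analyzer: generic analysis (fallback)

Root cause details:
The check_approval.sh script detected that this PR is missing 4 required approvals. The PR modifies custom-op directories (fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding) and a sensitive configuration file (fastdeploy/envs.py). These changes need an explicit Approve from the owners of the corresponding modules before the PR can be merged.

Key log: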

0. You must have one FastDeploy RD (qingqing01, Jiang-Jia-Jun, heavengate) approval for adding custom op.
1. You must have one PaddlePaddle RD (jeff41404, yongqiangma) approval for adding custom op.
2. You must have one FastDeploy RD (freeliuzc, Deleter-D) approval for modifying [fastdeploy/spec_decode, custom_ops/gpu_ops/speculate_decoding].
3. You must have one FastDeploy RD (Jiang-Jia-Jun, yuanlehome, rainyfly, Wanglongzhi2001) approval for modifying [fastdeploy/envs.py].
There are 4 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Ask any one FastDeploy RD among @dangqingqing / @jiangjiajun / @DENGKAIPENG to approve (custom op)
  2. Ask any one PaddlePaddle RD among @gaoxiang / @mayongqiang to approve (custom op)
  3. Ask any one FastDeploy RD among @liuzichang01 / @wangyanpeng04 to approve (spec_decode/custom_ops directories)
  4. Ask any one FastDeploy RD among @jiangjiajun / @liuyuanle / @chenjian26 / @wanglongzhi to approve (envs.py)

Fix summary: ask the relevant RDs to approve (dangqingqing/jeff41404/liuzichang01/jiangjiajun, etc.)

Link: view log


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-07 12:20:34

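📋 Review summary

PR overview: adds a chunk-parallel decode attention kernel, decode_append_attention, for the Hopper (SM90+) architecture. It supports C16 (fp16/bf16 KV cache) and static C8 (INT8 KV cache) and is enabled through the USE_DECODE_ATTENTION=1 environment variable.
Change scope: custom_ops/gpu_ops/append_attention/, fastdeploy/model_executor/layers/attention/, fastdeploy/worker/, fastdeploy/envs.py
Impact tags: [OP] [Feature]

📝 PR convention check

The PR title tag [Feature] is compliant. However, the three sections ## Modifications, ## Usage or Command, and ## Accuracy Tests are all empty (they contain only HTML comment placeholders), and no Checklist item is ticked. Please fill these in.

Suggested title (can be copied directly):

  • [Feature] Add decode_append_attention kernel for C16/C8 KV cache (SM90+)

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):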

## Motivation
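Adds a chunk-parallel decode attention implementation (the `decode_append_attention` kernel) for the Hopper (SM90+) architecture, supporting two quantization formats: C16 (fp16/bf16 KV cache) and static C8 (INT8 KV cache). With flash_attn enabled, turn it on via `export USE_DECODE_ATTENTION=1`; the legacy SM80 path (`gpu_ops/append_attn/`) is unchanged.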

## Modifications
- `custom_ops/gpu_ops/append_attention/`: new directory for the decode attention implementation, containing:
  - `attention_func.cuh`: core compute functions (QK, softmax update, PV, output write-back, multi-warp merge)
  - `decode_append_attention_c16_impl.cuh`: C16 (fp16/bf16) KV cache implementation
  - `decode_append_attention_c8_impl.cuh`: static C8 (INT8) KV cache implementation
  - `config_for_attention.cu`: pre-decode configuration kernel (precomputes block indices and chunk size)
  - `cu_tensor_map.cuh`, `mem_util.cuh`, `mma_tensor_op.cuh`, `utils.cuh`: helper headers
  - `template_config.json`: template instantiation configuration
- `custom_ops/gpu_ops/decode_append_attention.cu`: body of the new `decode_append_attention` CUDA op
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: companion update to the op that writes the cache with RoPE
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the new op (`PD_BUILD_STATIC_OP`)
- `custom_ops/setup_ops.py`: adds the source file list and the template generation step in the SM90+ (CC >= 90) branch
- `custom_ops/utils/auto_gen_template_attention.py`: new template generation script
- `fastdeploy/envs.py`: adds the `USE_DECODE_ATTENTION` environment variable (default 0)
- `fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/ops/config_for_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: conditionally allocates the chunking buffers (6 in total, including decode_block_indices and decode_tmp_workspace)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`: integrates the new decode attention call path
- `fastdeploy/worker/gpu_model_runner.py`, `metax_model_runner.py`, `input_batch.py`: companion adaptations
- `tests/operators/attention/`: new unit tests (test_decode_append_attention.py, test_decode_append_attention_c16.py) and a benchmark

## Usage or Command
```bash
# With flash_attn enabled, turn on decode_append_attention (SM90+/Hopper only)
export USE_DECODE_ATTENTION=1
```

## Accuracy Tests
N/A (this PR does not provide accuracy comparison results)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

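Issues

| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attention/attention_func.cuh:88 | #pragma unroll is misplaced before a non-loop statement |
| ❓ Question | fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py | The Python wrapper only checks is_cuda(); setting USE_DECODE_ATTENTION=1 on an SM80 machine may trigger an ImportError |

Overall assessment

The overall design of the new decode_append_attention kernel is clear. The environment-variable feature switch keeps the change backward compatible, and test files are submitted with it. Recommended follow-ups: complete the PR description, and add an SM version check on the Python wrapper side so that users cannot enable the path on unsupported hardware by mistake.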
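
The SM version check recommended above could take roughly the following shape. This is a sketch under stated assumptions, not the PR's code: `use_decode_attention` and `MIN_SM_FOR_DECODE_ATTENTION` are hypothetical names, the compute capability is passed in as a plain `(major, minor)` tuple rather than queried from the framework, and the flag is read directly from the environment instead of through fastdeploy/envs.py. Raising a descriptive error instead of silently falling back is also a design choice made here for illustration.

```python
import os

# Hopper threshold, taken from the "SM90+" statement in the review summary.
MIN_SM_FOR_DECODE_ATTENTION = 90


def use_decode_attention(compute_capability, flash_attn_enabled, env=None):
    """Decide whether the decode_append_attention path should be taken.

    compute_capability: (major, minor) tuple, e.g. (9, 0) for SM90.
        How it is obtained is left to the caller (hypothetical here).
    flash_attn_enabled: whether the flash_attn backend is active.
    env: mapping to read USE_DECODE_ATTENTION from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Default "0" mirrors the documented default of the flag.
    requested = env.get("USE_DECODE_ATTENTION", "0") == "1"
    if not (requested and flash_attn_enabled):
        return False

    major, minor = compute_capability
    sm = major * 10 + minor
    if sm < MIN_SM_FOR_DECODE_ATTENTION:
        # Fail early with a clear message instead of a later ImportError.
        raise RuntimeError(
            f"USE_DECODE_ATTENTION=1 requires SM{MIN_SM_FOR_DECODE_ATTENTION}+, "
            f"but the device reports SM{sm}."
        )
    return True


if __name__ == "__main__":
    print(use_decode_attention((9, 0), True, {"USE_DECODE_ATTENTION": "1"}))
```

A guard of this kind would sit in front of the import of the compiled op, so that an SM80 machine with the flag set gets an actionable message rather than an import failure.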

#pragma unroll
for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
#pragma unroll

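🟡 Suggestion: the #pragma unroll directive is misplaced before a non-loop statement

The statement that this #pragma unroll applies to is const int j = ty;, which is not a loop, so the compiler ignores the directive (some compilers emit a warning). The pragma should be removed.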

  const uint32_t tx_offset = tx / 8;
#pragma unroll
  for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
    const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
    // Remove the redundant #pragma unroll here
    const int j = ty;

3 participants