[Feature] support decode attention for mix #7688
lizhenyun01 wants to merge 11 commits into PaddlePaddle:develop
Conversation
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: CI is still running. 1 required task has failed (approval issue) and 6 required tasks are in progress; resolve the approval first, then check the overall result.
2 Task status summary
2.1 Required tasks: 3/10 passed
2.2 Optional tasks: 22/26 passed
3 Failure details (required only)
Approval — approval process (confidence: high)
Fix suggestion: contact the relevant RDs to complete the 4 required approvals, then rerun. Link: view logs
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
```
@@           Coverage Diff            @@
##             develop    #7688   +/- ##
==========================================
  Coverage           ?   71.28%
==========================================
  Files              ?      399
  Lines              ?    55649
  Branches           ?     8697
==========================================
  Hits               ?    39667
  Misses             ?    13225
  Partials           ?     2757
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: ❌ 1 required task has failed and must be resolved before the PR can be merged.
2 Task status summary
2.1 Required tasks: 1/2 passed
2.2 Optional tasks: 11/17 passed
3 Failure details (required only)
Approval — missing approval (confidence: high)
Fix suggestion: request approval from the relevant RDs (dangqingqing/jeff41404/liuzichang01/jiangjiajun, etc.). Link: view logs
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-07 12:20:34
📋 Review Summary
PR overview: adds a chunk-parallel decode attention kernel (decode_append_attention) for the Hopper (SM90+) architecture, supporting C16 (fp16/bf16 KV cache) and static C8 (INT8 KV cache), enabled via the USE_DECODE_ATTENTION=1 environment variable.
Scope of changes: custom_ops/gpu_ops/append_attention/, fastdeploy/model_executor/layers/attention/, fastdeploy/worker/, fastdeploy/envs.py
Impact tags: [OP] [Feature]
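For orientation, here is a minimal sketch of the env-var feature-flag dispatch described above. The helper names (`use_decode_attention`, `run_attention`) and the stand-in kernel functions are illustrative assumptions, not the PR's actual code:

```python
import os

def use_decode_attention() -> bool:
    # Feature flag: USE_DECODE_ATTENTION defaults to "0", i.e. legacy path.
    return os.getenv("USE_DECODE_ATTENTION", "0") == "1"

# Stand-ins for the real CUDA ops registered by the PR.
def decode_append_attention(q):
    return f"SM90+ chunk-parallel decode path: {q}"

def append_attention(q):
    return f"legacy append-attention path: {q}"

def run_attention(q):
    # Dispatch to the new kernel only when the flag is explicitly set,
    # which is what keeps the change backward compatible.
    if use_decode_attention():
        return decode_append_attention(q)
    return append_attention(q)
```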
📝 PR Convention Check
The PR title tag [Feature] is compliant, but the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (containing only HTML comment placeholders), and no Checklist item is checked; please complete them.
Suggested title (copy-paste ready):
[Feature] Add decode_append_attention kernel for C16/C8 KV cache (SM90+)
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Adds a chunk-parallel decode attention implementation (`decode_append_attention` kernel) for the Hopper (SM90+) architecture, supporting two KV cache quantization formats: C16 (fp16/bf16) and static C8 (INT8). With flash_attn enabled, it is turned on via `export USE_DECODE_ATTENTION=1`; the legacy SM80 path (`gpu_ops/append_attn/`) is unchanged.
## Modifications
- `custom_ops/gpu_ops/append_attention/`: new decode attention implementation directory, containing:
  - `attention_func.cuh`: core compute functions (QK, softmax update, PV, output write-back, multi-warp merge)
  - `decode_append_attention_c16_impl.cuh`: C16 (fp16/bf16) KV cache implementation
  - `decode_append_attention_c8_impl.cuh`: static C8 (INT8) KV cache implementation
  - `config_for_attention.cu`: decode pre-pass configuration kernel (precomputes block indices and chunk sizes)
  - `cu_tensor_map.cuh`, `mem_util.cuh`, `mma_tensor_op.cuh`, `utils.cuh`: helper headers
  - `template_config.json`: template instantiation config
- `custom_ops/gpu_ops/decode_append_attention.cu`: body of the new `decode_append_attention` CUDA op
- `custom_ops/gpu_ops/decoder_write_cache_with_rope.cu`: matching update to the RoPE cache-write op
- `custom_ops/gpu_ops/cpp_extensions.cc`: registers the new op (`PD_BUILD_STATIC_OP`)
- `custom_ops/setup_ops.py`: adds the new source-file list and template-generation step to the SM90+ (CC >= 90) branch
- `custom_ops/utils/auto_gen_template_attention.py`: new template-generation script
- `fastdeploy/envs.py`: adds the `USE_DECODE_ATTENTION` environment variable, default 0
- `fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/ops/config_for_attention.py`: Python wrapper
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: conditionally allocates the six chunking buffers (decode_block_indices, decode_tmp_workspace, etc.; see the sketch after this list)
- `fastdeploy/model_executor/layers/attention/flash_attn_backend.py`: integrates the new decode attention call path
- `fastdeploy/worker/gpu_model_runner.py`, `metax_model_runner.py`, `input_batch.py`: supporting adaptations
- `tests/operators/attention/`: new unit tests (test_decode_append_attention.py, test_decode_append_attention_c16.py) and a benchmark
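To make the conditional buffer-allocation item above concrete, here is a hedged sketch; the buffer names follow the list above, but the function, shapes, and dtypes are invented for illustration and are not the PR's code:

```python
import paddle

def allocate_decode_buffers(max_num_blocks: int, enable_decode_attention: bool):
    # Sketch: the chunking workspaces are only allocated when the new
    # decode path is enabled, so the legacy path pays no extra memory cost.
    if not enable_decode_attention:
        return None
    return {
        "decode_block_indices": paddle.zeros([max_num_blocks], dtype="int32"),
        "decode_tmp_workspace": paddle.zeros([max_num_blocks, 128], dtype="float32"),
    }
```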
## Usage or Command
```bash
# With flash_attn enabled, turn on decode_append_attention (SM90+/Hopper only)
export USE_DECODE_ATTENTION=1
```
## Accuracy Tests
N/A (no accuracy comparison results are provided in this PR).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/append_attention/attention_func.cuh:88 | `#pragma unroll` placed before a non-loop statement |
| ❓ Question | fastdeploy/model_executor/layers/attention/ops/decode_append_attention.py | The Python wrapper only checks `is_cuda()`; setting USE_DECODE_ATTENTION=1 on an SM80 machine may trigger an ImportError |
Overall assessment
The overall design of the new decode_append_attention kernel is clear, backward compatibility is preserved through an environment-variable feature flag, and tests ship in the same PR. It is recommended to complete the PR description and add an SM version check on the Python wrapper side to prevent misuse.
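A possible shape for the suggested guard, sketched under the assumption that Paddle's device-capability query is available; the helper name is illustrative, not the PR's code:

```python
import paddle

def decode_attention_supported() -> bool:
    # Guard sketch: the new kernel only builds/runs on SM90+ (Hopper),
    # so reject the feature flag on older architectures instead of
    # failing later with an ImportError.
    if not paddle.device.is_compiled_with_cuda():
        return False
    major, _minor = paddle.device.cuda.get_device_capability()
    return major >= 9
```

With such a check, the wrapper can fall back to the legacy path (or raise a clear error) on SM80 machines instead of surfacing an ImportError.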
```cpp
#pragma unroll
for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
  const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
#pragma unroll
```
🟡 Suggestion: `#pragma unroll` directive placed before a non-loop statement
The statement following the second `#pragma unroll` here is `const int j = ty;`, not a loop, so the compiler ignores the directive (and some compilers emit a warning). The pragma should be removed:
```cpp
const uint32_t tx_offset = tx / 8;
#pragma unroll
for (uint32_t fx = 0; fx < num_frags_x; ++fx) {
  const uint32_t base_offset = q_idx_base + fx * 16 + tx_offset;
  // the redundant #pragma unroll that was here has been removed
  const int j = ty;
```
Motivation
C16/static C8 attention support. Usage: with flash_attn enabled, export USE_DECODE_ATTENTION=1.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.