[Optimization] Decode attention support by lizhenyun01 · Pull Request #5767 · PaddlePaddle/FastDeploy

lizhenyun01 · 2025-12-25T09:19:52Z

Motivation

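Part 1 of the attention optimization and refactoring:
- Refactor attention to merge the speculative-decoding and non-speculative-decoding branches, removing redundant logic
- Split decoder_write_cache_with_rope into a standalone operator for easier maintenance
- Add a decode attention backend; it currently supports only the D (decode) node under PD disaggregation
- Optimize the decode attention C8 kernel; with group_size=14, single-step speculative-decoding scenarios improve by 5% to 113%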
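For readers new to the terms in the last bullet, the NumPy sketch below shows what one decode-attention step over an int8 ("C8") KV cache computes. It makes two assumptions. First, "C8" means an int8 KV cache that is dequantized with static, precomputed per-channel scales. Second, `group_size` is the number of query heads that share one KV head (GQA), so `group_size=14` could be, for example, 28 query heads over 2 KV heads. The helper names are hypothetical. This is a reference for the math only. It is not FastDeploy's API and not its CUDA kernel.

```python
import numpy as np


def quantize_static_c8(x, scale):
    """Quantize float values to int8 with a fixed (static) per-channel scale."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def dequantize_static_c8(x_i8, scale):
    """Map int8 cache entries back to float32 using the same static scale."""
    return x_i8.astype(np.float32) * scale


def decode_attention_gqa_c8(q, k_i8, v_i8, k_scale, v_scale):
    """Single-token decode attention over an int8 KV cache (hypothetical helper).

    q:               [num_q_heads, head_dim] query of the newly decoded token
    k_i8, v_i8:      [num_kv_heads, seq_len, head_dim] int8 cache for one request
    k_scale, v_scale: [num_kv_heads, 1, head_dim] static per-channel scales
    """
    num_q_heads, head_dim = q.shape
    num_kv_heads = k_i8.shape[0]
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    k = dequantize_static_c8(k_i8, k_scale)
    v = dequantize_static_c8(v_i8, v_scale)
    out = np.empty((num_q_heads, head_dim), dtype=np.float32)
    for h in range(num_kv_heads):
        rows = slice(h * group_size, (h + 1) * group_size)
        # Every query head in this group attends to the same K/V slice.
        logits = q[rows] @ k[h].T / np.sqrt(head_dim)  # [group_size, seq_len]
        logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[rows] = probs @ v[h]
    return out


# Usage: group_size = 14 with 2 KV heads (28 query heads) and 16 cached tokens.
rng = np.random.default_rng(0)
num_kv_heads, group_size, head_dim, seq_len = 2, 14, 64, 16
q = rng.standard_normal((num_kv_heads * group_size, head_dim)).astype(np.float32)
k = rng.standard_normal((num_kv_heads, seq_len, head_dim)).astype(np.float32)
v = rng.standard_normal((num_kv_heads, seq_len, head_dim)).astype(np.float32)
# Static scales are fixed ahead of time; here they come from the data for brevity.
k_scale = np.abs(k).max(axis=1, keepdims=True) / 127.0
v_scale = np.abs(v).max(axis=1, keepdims=True) / 127.0
k_i8 = quantize_static_c8(k, k_scale)
v_i8 = quantize_static_c8(v, v_scale)
out = decode_attention_gqa_c8(q, k_i8, v_i8, k_scale, v_scale)
print(out.shape)  # (28, 64)
```

All `group_size` query heads in a group read the same K/V slice. A fused kernel can therefore load and dequantize each int8 KV tile once per group instead of once per query head. This is one plausible reason why a dedicated C8 decode kernel pays off at larger group sizes, though the PR does not describe its kernel design.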

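TODO:
- Refactor ROPE and write_cache, and merge the remaining branches such as speculative decoding
- C16 and C4 support
- Improve unit tests
- Gradually replace append_attention with the new backend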

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2025-12-25T09:20:00Z

Thanks for your contribution!


codecov-commenter · 2026-01-13T16:08:03Z

Codecov Report

❌ Patch coverage is 35.21127% with 92 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fb37423). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...ayers/attention/decode_append_attention_backend.py	21.90%	81 Missing and 1 partial ⚠️
fastdeploy/platforms/cuda.py	0.00%	2 Missing and 1 partial ⚠️
fastdeploy/spec_decode/mtp.py	0.00%	1 Missing and 1 partial ⚠️
fastdeploy/worker/gpu_model_runner.py	0.00%	1 Missing and 1 partial ⚠️
...cutor/layers/attention/ops/config_for_attention.py	85.71%	0 Missing and 1 partial ⚠️
...or/layers/attention/ops/decode_append_attention.py	88.88%	0 Missing and 1 partial ⚠️
...ers/attention/ops/decoder_write_cache_with_rope.py	88.88%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #5767   +/-   ##
==========================================
  Coverage           ?   66.55%           
==========================================
  Files              ?      389           
  Lines              ?    51364           
  Branches           ?     8005           
==========================================
  Hits               ?    34186           
  Misses             ?    14720           
  Partials           ?     2458

Flag	Coverage Δ
GPU	`66.55% <35.21%> (?)`

Flags with carried forward coverage won't be shown.


lizhenyun01 added 2 commits December 25, 2025 16:59

[Optimizer] Support decode attention static c8 op

12f25a3

[Feature] Support decode attention backend

154d98e

lizhenyun01 had a problem deploying to Metax_ci December 25, 2025 09:19 — with GitHub Actions Error

code style fix

a3a9b47

lizhenyun01 had a problem deploying to Metax_ci December 25, 2025 09:25 — with GitHub Actions Failure

heavengate pushed a commit that referenced this pull request Jan 9, 2026

[Cherry-Pick][Optimization]Decode attention support(#5767) (#5833)

2e04b4e

* [Optimizer] Support decode attention static c8 op
* [Feature] Support decode attention backend
* code style fix

lizhenyun01 added 5 commits January 12, 2026 19:37

fix bug

d93011c

fix code style

ae52944

fix bug

c0c09e9

[BugFix] fix search for config of decode attention

693f94a

fix

920645e

lizhenyun01 had a problem deploying to Metax_ci January 12, 2026 11:38 — with GitHub Actions Failure

lizhenyun01 added 2 commits January 13, 2026 19:57

support unittest

48e3cb8

close use_fast_math temporarily

f806fac

lizhenyun01 had a problem deploying to Metax_ci January 13, 2026 11:58 — with GitHub Actions Failure

Merge branch 'develop' into decode_attention

cd348e1

lizhenyun01 temporarily deployed to Metax_ci January 13, 2026 12:09 — with GitHub Actions Inactive

lizhenyun01 added 2 commits February 3, 2026 10:18

Merge remote-tracking branch 'origin/develop' into decode_attention

d196327

resolve conflict

c5973ec

lizhenyun01 temporarily deployed to Metax_ci February 3, 2026 02:22 — with GitHub Actions Inactive

resolve conflict

0b708aa

lizhenyun01 temporarily deployed to Metax_ci February 3, 2026 04:10 — with GitHub Actions Inactive

lizhenyun01 wants to merge 14 commits into PaddlePaddle:develop from lizhenyun01:decode_attention

lizhenyun01 commented Dec 25, 2025 •

edited

Loading

Uh oh!

paddle-bot bot commented Dec 25, 2025

Uh oh!

codecov-commenter commented Jan 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lizhenyun01 commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Checklist

Uh oh!

paddle-bot bot commented Dec 25, 2025

Uh oh!

codecov-commenter commented Jan 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lizhenyun01 commented Dec 25, 2025 •

edited

Loading

codecov-commenter commented Jan 13, 2026 •

edited

Loading