【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]#7718
Conversation
Adds rank-2 block_tables_headwise plumbing for c16 multi-query attention path. Updates template_config.json so the codegen produces explicit instantiations matching the new impl signature (added optional block_table_headwise param).
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes): 1 Task overview ✅ No Required tasks are currently failing, but 7 workflows are in
2 Task status summary 2.1 Required tasks: 0/0 passed
No Required task data yet. 2.2 Optional tasks — 1/2 passed
3 Failure details (required only): no required tasks failed.
- gpu_model_runner: `_maybe_slice_block_tables_headwise` is now is_dummy_or_profile_run-aware, so the captured CUDA graph records a non-null sidecar; identity-stride dummy seeding is aligned with the kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: `InputBatch.swap_states` + `ProposerInputBatch.swap_states` clone-then-copy the `block_tables_headwise[i*kv_local:(i+1)*kv_local]` row groups so head-wise rows follow slot moves on both the target and proposer paths.
- gpu_model_runner._process_reorder: clear `forward_batch_reqs_list` in place before repopulating from `share_inputs.index_to_batch_id`; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); the -1 sentinel reads block 0 as a harmless placeholder, and the SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for the PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
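The clone-then-copy swap of head-wise row groups described above can be sketched as follows. This is a minimal pure-Python illustration with hypothetical names and list-of-lists shapes, not the actual `InputBatch.swap_states` implementation: each request slot owns `kv_local` consecutive rows, so swapping two slots must move whole row groups.

```python
# Minimal sketch (hypothetical names/shapes) of the clone-then-copy row-group
# swap: request slots i1 and i2 each own kv_local consecutive rows of
# block_tables_headwise, so swapping slots must swap the whole groups.
def swap_headwise_rows(block_tables_headwise, i1, i2, kv_local):
    a, b = i1 * kv_local, i2 * kv_local
    # clone first so the second copy does not read already-overwritten rows
    tmp = [row[:] for row in block_tables_headwise[a:a + kv_local]]
    block_tables_headwise[a:a + kv_local] = block_tables_headwise[b:b + kv_local]
    block_tables_headwise[b:b + kv_local] = tmp
```

The clone step is what makes the swap safe when the two slices are assigned in sequence.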
self.input_batch is not constructed yet during _dummy_prefill_inputs and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local crashed the worker before the bench server could start. Use self.model_config.kv_num_heads (set in init_share_inputs before warmup), which has the same TP-aware value.
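The fallback described above can be sketched as a small guard. The function and attribute layout here are a hypothetical simplification of the runner, not FastDeploy's actual code:

```python
from types import SimpleNamespace

# Hypothetical sketch: during dummy prefill / CUDA-graph capture the runner
# has no input_batch yet, so read the TP-aware head count from model_config
# instead of dereferencing a missing attribute.
def local_kv_num_heads(runner):
    batch = getattr(runner, "input_batch", None)
    if batch is not None:
        return batch.kv_num_heads_local
    return runner.model_config.kv_num_heads  # same TP-aware value
```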
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in [0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in {-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any request whose allocated IDs cross the num_gpu_blocks boundary (i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a fail-fast assert to catch any residual OOB. The hotfix is bench-only; the canonical fix (per-head independent allocator pools) is deferred to PR1 v5 (RFC-PR1-reanchored.md §3). Also adds FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs:
.checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
.checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
.checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
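The boundary normalization described in this commit can be sketched in a few lines. This is an illustrative pure-Python version of the `flat % num_gpu_blocks` mapping with the fail-fast check, not the actual `append_attn_backend.py` hotfix:

```python
# Sketch of the bench-only boundary normalization (Option B): PR1 emits flat
# global IDs in [0, num_gpu_blocks * kv_num_heads); the kernel ABI expects
# per-head local IDs in {-1} ∪ [0, num_gpu_blocks). -1 eviction sentinels
# pass through untouched.
def normalize_block_ids(flat_ids, num_gpu_blocks):
    local = [fid if fid < 0 else fid % num_gpu_blocks for fid in flat_ids]
    # fail-fast: catch any residual out-of-bounds ID before the kernel does
    assert all(b == -1 or 0 <= b < num_gpu_blocks for b in local)
    return local
```

Note the trade-off the later revert calls out: the modulo silently aliases distinct flat IDs from different heads onto the same local ID, which is why it is bench-only.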
…mixed Boolean fancy indexing and .item() CPU sync inside forward_mixed crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported). The paddle.where normalization is graph-safe (static-shape elementwise ops). Assert was debug-only; normalization alone is the actual OOB fix.
- prefix_cache_manager: replace the shared flat heap with kv_num_heads independent heaps; allocate/recycle now per-head with the rank-2 [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert the flat % num_gpu_blocks HOTFIX (silent aliasing); replace with an FD_T53_DEBUG_BLOCK_TABLES gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
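The per-head heap structure this commit introduces can be sketched with `heapq`. The class name and method surface here are hypothetical; the real allocator lives in `prefix_cache_manager.py`:

```python
import heapq

# Sketch (assumed names) of the per-head allocator: kv_num_heads independent
# min-heaps over the same local ID space [0, num_blocks), so every head hands
# out IDs the kernel ABI accepts directly, with no cross-head flat offsets.
class PerHeadBlockPool:
    def __init__(self, kv_num_heads, num_blocks):
        self.heaps = [list(range(num_blocks)) for _ in range(kv_num_heads)]
        for h in self.heaps:
            heapq.heapify(h)

    def allocate(self, head):
        # pops the smallest free local block ID for this head
        return heapq.heappop(self.heaps[head])

    def recycle(self, head, block_id):
        heapq.heappush(self.heaps[head], block_id)
```

Because each head draws from its own heap over [0, num_blocks), no modulo normalization is needed at the kernel boundary.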
….data<int>() Adds dtype guards before `.data<int>()` reads of: set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks (in AppendAttentionKernel, lines 100-105/186/187/285), and mask_offset.get() (in AppendAttention L599 and AppendAttentionWithOutput L763). Catches an accidental INT64/FP dtype before it becomes UB. Matches the existing PD_CHECK style from set_flags.cu / set_mask_value.cu.
…p_size Guards against silent under-allocation when kv_num_heads_global is not a multiple of tp_size (and >= tp_size). The kv<tp replication path is explicitly excluded from the assert, preserving existing GQA/MQA behavior.
…l-recycle accuracy PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the available_gpu_resource property zeros out fractional values when fewer than kv_num_heads logical blocks are free, causing the metric to underreport partial recycle progress. The scheduler can then refuse admissible requests because it sees 0 capacity even though several heads' worth of blocks are actually available. Switch to float division so the metric matches the legacy [0, 1] continuous value-domain and dashboards / scheduler see true availability. Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot) Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
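The metric fix above reduces to integer vs float division. A minimal sketch with a hypothetical function name and illustrative numbers shows how truncation underreports partial recycle progress:

```python
# Sketch of the metric fix: with integer division, fewer than kv_num_heads
# free logical blocks truncates to 0 and the scheduler sees no capacity;
# float division restores the continuous [0, 1] value domain.
def available_gpu_resource(free_head_blocks, kv_num_heads, num_gpu_blocks):
    truncated = (free_head_blocks // kv_num_heads) / num_gpu_blocks   # old
    continuous = (free_head_blocks / kv_num_heads) / num_gpu_blocks   # fixed
    return truncated, continuous
```

With 3 free head-blocks, 4 KV heads, and 10 GPU blocks, the old form reports 0.0 capacity while the fixed form reports 0.075.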
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why recycle_request_swa_head_cache short-circuits on total_tokens % block_size != 0. Document the rationale: the in-flight decode token is mid-write to the tail block, so releasing it now races with the next decode write. Recycle resumes on the next step that lands on a clean boundary. Comment-only change. No code semantics altered. Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot) Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
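The short-circuit rationale documented above amounts to a boundary check. A hypothetical one-line predicate makes the condition concrete:

```python
# Sketch of the documented short-circuit: if the decode stream is mid-write
# to the tail block (token count not on a block boundary), skip recycle this
# step and retry when a clean boundary is reached.
def can_recycle_tail_block(total_tokens, block_size):
    return total_tokens % block_size == 0
```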
Acknowledged on the
Also addressed in this push:
Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.

@PaddlePaddle-bot — re: A3 (operator C++ unit tests for …). Both are acknowledged and deliberately deferred out of this PR:
Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges. — bob-cloudforge
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-06 17:41:47
📋 Review summary
PR overview: introduces discrete head-wise block_idx for AppendAttention so SWA heads walk a shorter per-head table while full-attention heads keep the original table, reducing unnecessary KV page reads; also adds a per-head free-heap allocator and SWA recycle state management.
Change scope: custom_ops/gpu_ops/append_attn/, fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py
Impact tags: [OP] [KVCache] [Scheduler] [FDConfig]
📝 PR convention check
Title issues: [Kernel] is not in the official tag list; the [cf] suffix and the 【Hackathon 10th Spring No.53】 prefix are both non-standard; the convention requires exactly one official tag in the title.
Suggested title (copy-paste ready):
[Feature] Optimize AppendAttention for discrete head-wise block_idx
Suggested PR description (copy-paste ready, full checklist §D2 structure):
## Motivation
Hackathon 10th Spring Task No.53, PR 2/2. When SWA and full-attention heads coexist, the current AppendAttention walks a uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (physical rank-2: `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk shorter/sparser rows while full heads keep the full-context row, reducing block-id loads and K/V page reads (under the recycle OFF baseline). The ABI is additive: callers without `block_tables_headwise` take the legacy path unchanged.
## Modifications
- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through the three entry functions; add an INT32 dtype guard (`PD_CHECK`); add `sink_size`/`head_wise_full_hidden` parameters
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: replace the uniform `block_tables` row walk with per-head rows; keep the `-1` sentinel guard, with the SWA mask zeroing the contribution. c8/c4 deferred to PR3
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}`, `cpp_extensions.cc`: thread `block_tables_headwise` through the kernel headers, template config, and the PHI op signature
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add a `_get_block_tables_headwise()` helper; thread `sink_size` and `head_wise_full_hidden`
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only; guard the `use_output=True` path
- `fastdeploy/engine/sched/resource_manager_v1.py`: add a GQA divisibility guard and head-wise SWA recycle state-management methods
- `fastdeploy/cache_manager/prefix_cache_manager.py`: add per-head independent free-heap allocation/recycling; mutual-exclusion guard between head-wise mode and the prefix cache
- `fastdeploy/config.py`: engine-main FDConfig T53 fixture mirroring the worker-side config
- `tests/cache_manager/` (8 tests) + `tests/layers/test_append_attention_head_wise_shapes.py`: head-wise contract smoke tests
## Usage or Command
```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5 # first half of KV heads designated SWA
```
Spec acceptance: recycle OFF, 1D uniform vs 2D discrete block_idx, both TTFT and TBT improve by ≥5%.
## Accuracy Tests
| `block_idx` mode | Hardware | TTFT (ms) | TBT (ms) | Δ TTFT | Δ TBT |
|---|---|---|---|---|---|
| 1D (uniform) | H100 / H20 / B200 | TBD | TBD | baseline | baseline |
| 2D (discrete, optimized) | same | TBD | TBD | **≥+5%** | **≥+5%** |
Requires acceptance on H/B cards (cc @luotao1); must be completed before merge.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:197 | available_gpu_resource reads an empty list; in head-wise mode it always returns 0.0, so the scheduler falsely concludes resources are exhausted |
| 🟡 Suggestion | fastdeploy/engine/sched/resource_manager_v1.py:299 | _num_swa_heads() uses assert for runtime configuration validation; it is silently stripped under -O |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py | per checklist A6: gpu_model_runner.py changed; confirm whether the other hardware runners (xpu/dcu/gcu/hpu, etc.) need the same change |
Overall assessment
The PR architecture is clean, the additive ABI stays backward compatible, and defensive touches such as the INT32 dtype guard and the -1 sentinel clamp are commendable. However, the P0 bug caused by the misspelled available_gpu_resource attribute (singular vs plural) makes the scheduler reject every new request in head-wise mode; it must be fixed and re-tested before merge.
@property
def available_gpu_resource(self):
    if getattr(self, "head_wise", False) and self.num_gpu_blocks > 0:
        head_free = len(getattr(self, "gpu_free_head_wise_block_list", []))
🔴 Bug: available_gpu_resource reads gpu_free_head_wise_block_list (singular, always an empty list) rather than gpu_free_head_wise_block_lists (plural, the per-head heaps that actually hold the free blocks).
In _init_head_wise_free_list:
self.gpu_free_head_wise_block_list = []  # ← always empty, compatibility placeholder
So when head_wise=True, head_free = 0 and available_gpu_resource always returns 0.0; the scheduler falsely concludes resources are exhausted, and all new requests are rejected/blocked.
Suggested fix:
head_free = sum(len(h) for h in self.gpu_free_head_wise_block_lists)
return (head_free / max(1, self.kv_num_heads)) / self.num_gpu_blocks

return 0
tp_size = max(1, int(getattr(self.config.parallel_config, "tensor_parallel_size", 1) or 1))
# GQA/MQA divisibility guard: when kv >= tp, kv must be divisible by tp
# (Paddle's TP shards KV heads evenly). The kv < tp replication path is
🟡 Suggestion: _num_swa_heads() uses assert for runtime configuration validation. Under Python -O (optimized mode), asserts are stripped entirely, so the check silently disappears and the subsequent kv_num_heads_global // tp_size can raise ZeroDivisionError or produce an incorrect head allocation.
Suggested change — raise explicitly:
if (kv_num_heads_global >= tp_size) and (kv_num_heads_global % tp_size != 0):
raise ValueError(
f"GQA/MQA constraint violated: kv_num_heads={kv_num_heads_global} "
f"not divisible by tp_size={tp_size} "
f"(only kv<tp replication path or exact divisibility supported)"
)
PR2 Body —
【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Motivation
Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (rank-2 logical `[batch, kv_head, block]`, physical `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk a shorter / sparser row while full heads preserve the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle OFF benchmark.

The ABI is additive: callers without `block_tables_headwise` use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications
Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

- `custom_ops/gpu_ops/`: `append_attention.cu`, `append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}`, `cpp_extensions.cc`
- `fastdeploy/`: `cache_manager/prefix_cache_manager.py`, `engine/sched/resource_manager_v1.py`, `worker/{gpu_model_runner.py, input_batch.py, worker_process.py}`, `model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}`, `engine/request.py`, `spec_decode/mtp.py`, `config.py`, `envs.py`
- `tests/`: `tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py`, `tests/layers/test_append_attention_head_wise_shapes.py`
- `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`, `.gitignore`

PR2-only delta (changes added on top of PR1 #7717)
- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through `AppendAttentionKernel`, `AppendAttention`, and `AppendAttentionWithOutput`; add `PD_CHECK(.dtype() == INT32)` dtype guards on every Python-supplied `.data<int>()` read (`set_max_lengths`, `encoder_num_blocks`, `kv_num_blocks`, `decoder_num_blocks`, `mask_offset`); make `block_tables_headwise` keyword-only on the Python op; add `sink_size`/`head_wise_full_hidden` parameters; thread `sink_size` into `append_attention_with_output_gpu()` (was hardcoded `0`).
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: replace the uniform `block_tables` row walk with per-head row selection from `block_tables_headwise` when present; preserve the existing `block_id < 0 → 0` clamp at the load site (`-1` sentinel = evicted SWA slot, mask zeroes the contribution). c8/c4 variants deferred to PR3.
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}`, `custom_ops/gpu_ops/cpp_extensions.cc`: thread the `block_tables_headwise` tensor through kernel headers, template config, and the PHI op signature.
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add a `_get_block_tables_headwise(forward_meta)` helper (per-call read of `forward_meta`, then `forward_meta.cache_manager`, else `None`); thread the tensor as a kwarg into both `append_attention()` and `append_attention_with_output()` call sites; pass `sink_size` and `head_wise_full_hidden` to the with-output path.
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only on both ops; guard `head_wise_full_hidden > 0` in the `use_output=True` path with `assert head_wise_full_hidden == 0` (dual-call merge stays in `append_attention()` only; with-output path deferred to PR3).
- `fastdeploy/engine/sched/resource_manager_v1.py`: add an `assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0)` GQA divisibility guard before `kv_num_heads_global // tp_size`.
- `tests/layers/test_append_attention_head_wise_shapes.py`

The c16 kernel is the only flavor consumed in PR2.
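The `-1` sentinel contract described above can be illustrated with a pure-Python sketch. The helper name and shapes are hypothetical; the real clamp lives at the load site in `multiquery_attention_c16_impl.cuh`:

```python
# Pure-Python sketch of the kernel's -1 sentinel handling: an evicted SWA
# slot reads block 0 as a harmless placeholder, and the SWA mask zeroes that
# position's contribution, so the placeholder value never affects the output.
def gather_kv_block(block_row, pos, block_size, kv_blocks):
    block_id = block_row[pos // block_size]
    clamped = max(block_id, 0)            # -1 sentinel → block 0 placeholder
    value = kv_blocks[clamped][pos % block_size]
    mask = 0.0 if block_id < 0 else 1.0   # mask zeroes the contribution
    return value * mask
```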
c8 / c4 / write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. Safety in PR2 = legacy uniform `block_tables` walk + existing `block_id < 0` fallback + SWA mask zero-contribution.

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No `Co-authored-by` trailer; prose acknowledgement only.

Usage or Command
No user-facing API change. The optimized path is active when PR1 provides `block_tables_headwise` and head-wise SWA is enabled:

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform `block_idx` against 2D discrete `block_idx`.

Accuracy Tests
Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%, reported per `block_idx` mode.

Benchmark: `FastDeploy/benchmarks/serving/benchmark_serving.py` with `benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml`.
FastDeploy/benchmarks/serving/benchmark_serving.pywithbenchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml.Correctness gates before push:
block_tables_headwise=Nonelegacy path unchanged.use_output=Trueanduse_output=Falseboth consume the same head-wise table contract.-1sentinel rows skip before K/V pointer derivation.tests/cache_manager/andtests/layers/green locally.CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks
Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for `block_tables_headwise`).

Checklist

- `pre-commit run --all-files` clean
- … (`block_tables_headwise`)
- … `Co-authored-by` trailer for PR2