
【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]#7718

Open
bob-cloudforge wants to merge 11 commits into PaddlePaddle:develop from CloudForge-Solutions:task/h10-053-pr2-discrete-block-idx-v4

Conversation


@bob-cloudforge bob-cloudforge commented May 4, 2026

PR2 Body — 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Companion PR stacked on PR1 (#7717). Until PR1 lands on develop, this branch carries the PR1 producer commits as its base, so the GitHub diff against develop is the stacked PR1 + PR2 surface (31 files, +2360/-49). The PR2-only delta is summarized below.


Motivation

Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform block_tables row for every KV head. The discrete block_tables_headwise layout (logical [batch, kv_head, block], flattened to a physical rank-2 [batch * local_kv_heads, max_blocks_per_head]) lets SWA-head CTAs walk a shorter/sparser row while full heads keep the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle OFF benchmark.
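For orientation, a minimal sketch of the logical-to-physical row mapping (illustrative only, not the FastDeploy implementation):

```python
# Logical [batch, kv_head, block] flattens to physical
# [batch * local_kv_heads, max_blocks_per_head]: each KV head owns one row,
# so SWA heads can carry shorter rows than full heads in the same batch entry.
def headwise_row(batch_id: int, kv_head: int, local_kv_heads: int) -> int:
    return batch_id * local_kv_heads + kv_head
```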

The ABI is additive: callers without block_tables_headwise use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications

Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

| Area | Files | +/− | Purpose |
|---|---|---|---|
| Kernel (custom_ops/gpu_ops/) | 7 | +136 / −19 | append_attention.cu, append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}, cpp_extensions.cc |
| Runtime (fastdeploy/) | 13 | +849 / −30 | cache_manager/prefix_cache_manager.py, engine/sched/resource_manager_v1.py, worker/{gpu_model_runner.py, input_batch.py, worker_process.py}, model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}, engine/request.py, spec_decode/mtp.py, config.py, envs.py |
| Tests (tests/) | 9 | +1360 / 0 | tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py, tests/layers/test_append_attention_head_wise_shapes.py |
| Bench / config | 2 | +15 / 0 | benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml, .gitignore |

PR2-only delta (changes added on top of PR1 #7717)

| File | Change |
|---|---|
| custom_ops/gpu_ops/append_attention.cu | Thread block_tables_headwise through AppendAttentionKernel, AppendAttention, and AppendAttentionWithOutput; add PD_CHECK(.dtype() == INT32) dtype guards on every Python-supplied .data<int>() read (set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks, mask_offset); make block_tables_headwise keyword-only on the Python op; add sink_size / head_wise_full_hidden parameters; thread sink_size into append_attention_with_output_gpu() (was hardcoded 0). |
| custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh | c16 kernel point-of-use: replace the uniform block_tables row walk with per-head row selection from block_tables_headwise when present; preserve the existing block_id < 0 → 0 clamp at the load site (-1 sentinel = evicted SWA slot; the mask zeroes its contribution). c8/c4 variants deferred to PR3. |
| custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}, custom_ops/gpu_ops/cpp_extensions.cc | Thread the optional block_tables_headwise tensor through kernel headers, template config, and the PHI op signature. |
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Add _get_block_tables_headwise(forward_meta) helper (per-call read of forward_meta, then forward_meta.cache_manager, else None); thread the tensor as a kwarg into both append_attention() and append_attention_with_output() call sites; pass sink_size and head_wise_full_hidden to the with-output path. |
| fastdeploy/model_executor/layers/attention/ops/append_attention.py | Make block_tables_headwise keyword-only on both ops; guard head_wise_full_hidden > 0 in the use_output=True path with assert head_wise_full_hidden == 0 (dual-call merge stays in append_attention() only; with-output path deferred to PR3). |
| fastdeploy/engine/sched/resource_manager_v1.py | Add assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0) GQA divisibility guard before kv_num_heads_global // tp_size. |
| tests/layers/test_append_attention_head_wise_shapes.py | Shape-level smoke test for the kernel-visible head-wise contract (additive on top of PR1's allocator tests). |

The c16 kernel is the only flavor consumed in PR2. c8/c4/write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. PR2's safety model rests on the legacy uniform block_tables walk, the existing block_id < 0 fallback, and the SWA mask's zero contribution.
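As a Python pseudocode sketch of that safety model (the real logic is CUDA in multiquery_attention_c16_impl.cuh; the function and parameter names here are assumptions):

```python
def load_block_id(block_tables, block_tables_headwise,
                  batch_id, kv_head, local_kv_heads, pos):
    if block_tables_headwise is None:
        # Legacy path: every KV head walks the same uniform row.
        block_id = block_tables[batch_id][pos]
    else:
        # Head-wise path: each KV head walks its own (possibly shorter) row.
        block_id = block_tables_headwise[batch_id * local_kv_heads + kv_head][pos]
    # Existing clamp at the load site: the -1 sentinel (evicted SWA slot)
    # reads block 0 as a harmless placeholder; the SWA mask zeroes its
    # contribution downstream.
    return max(block_id, 0)
```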

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No Co-authored-by trailer; prose acknowledgement only.

Usage or Command

No user-facing API change. The optimized path is active when PR1 provides block_tables_headwise and head-wise SWA is enabled:

```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5    # leading half of KV heads designated SWA
```

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform block_idx against 2D discrete block_idx.
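As a hedged sketch of how the ratio could map to a head partition (the helper name and rounding are assumptions; only the env var, its [0.0, 1.0] domain, and the leading-heads-are-SWA convention come from this PR):

```python
import os

def swa_head_split(kv_num_heads: int):
    ratio = float(os.environ.get("FD_T53_HEAD_WISE_SWA_RATIO", "0.0"))
    if not 0.0 <= ratio <= 1.0:  # mirrors the envs.py validator added in this PR
        raise ValueError(f"FD_T53_HEAD_WISE_SWA_RATIO={ratio} outside [0.0, 1.0]")
    num_swa = int(kv_num_heads * ratio)
    # Leading heads are SWA; the remainder stay full-attention.
    return range(num_swa), range(num_swa, kv_num_heads)

swa_heads, full_heads = swa_head_split(8)  # ratio=0.5 -> heads 0-3 SWA, 4-7 full
```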

Accuracy Tests

Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%:

| block_idx mode | Hardware | TTFT (ms) | TBT (ms) | Δ TTFT | Δ TBT |
|---|---|---|---|---|---|
| 1D (uniform) | H100 / H20 / B200 | TBD | TBD | baseline | baseline |
| 2D (discrete, optimized) | same | TBD | TBD | +TBD% (≥5 ✓) | +TBD% (≥5 ✓) |

Benchmark: FastDeploy/benchmarks/serving/benchmark_serving.py with benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml.

Hardware request to reviewers (cc @luotao1): PR2 acceptance requires H/B card per spec. A800 numbers (when present) are preview-only and labelled as such; FULL bench run is one-time pre-merge.

Correctness gates before push:

  • block_tables_headwise=None legacy path unchanged.
  • use_output=True and use_output=False both consume the same head-wise table contract.
  • 1D vs 2D numeric parity for FP16/BF16/cache-quant variants; -1 sentinel rows skip before K/V pointer derivation.
  • GSM8K parity within ±0.1 pp.
  • All 9 head-wise tests under tests/cache_manager/ and tests/layers/ green locally.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks

Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for block_tables_headwise).

Checklist

Adds rank-2 block_tables_headwise plumbing for c16 multi-query attention path.

Updates template_config.json so the codegen produces explicit instantiations matching the new impl signature (added optional block_table_headwise param).

paddle-bot Bot commented May 4, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 4, 2026

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.


PaddlePaddle-bot commented May 4, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 17:54:44

CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

✅ No Required tasks are currently failing; however, 7 workflows are in the action_required state and the main CI pipeline will only run after manual approval.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 2 (0) | 2 | 1 | 0 | 1 | 0 | 0 |

⚠️ Note: the following 7 workflows are in the action_required state (they execute only after approval): Approval, Check PR Template, Codestyle-Check, ILUVATAR-CI, CI_HPU, CI_XPU, PR Build and Test. These workflows require manual approval to trigger.

Note: action_required workflows are not counted in the task statistics above.


2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks gated by Branch Protection Rules were detected (or the GitHub API could not fetch the protection-rule list).

No Required task data available.

2.2 Optional tasks — 1/2 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Trigger Jenkins for PR | - | Job | - |
| | Remaining 1 optional task passed | - | - | - |

3 Failure details (required only)

No failing required tasks.

@bob-cloudforge changed the title from "feat(append_attn): head-wise SWA recycle + discrete-block-idx ABI" to "【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]" on May 4, 2026
- gpu_model_runner: _maybe_slice_block_tables_headwise now is_dummy_or_profile_run-aware so captured CUDA graph records non-null sidecar; identity-stride dummy seeding aligned with kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: InputBatch.swap_states + ProposerInputBatch.swap_states clone-then-copy swap block_tables_headwise[i*kv_local:(i+1)*kv_local] row groups so head-wise rows follow slot moves on both target and proposer paths.
- gpu_model_runner._process_reorder: in-place clear forward_batch_reqs_list before repopulating from share_inputs.index_to_batch_id; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); -1 sentinel reads block 0 as harmless placeholder, SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
self.input_batch is not constructed yet during _dummy_prefill_inputs
and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local
crashed the worker before the bench server could start. Use
self.model_config.kv_num_heads (set in init_share_inputs before warmup)
which has the same TP-aware value.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in
[0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but
the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in
{-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any
request whose allocated IDs cross the num_gpu_blocks boundary
(i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py
using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a
fail-fast assert to catch any residual OOB. The hotfix is bench-only;
the canonical fix (per-head independent allocator pools) is deferred to
PR1 v5 (RFC-PR1-reanchored.md §3).

Also adds FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs: .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
     .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
     .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
…mixed

Boolean fancy indexing and .item() CPU sync inside forward_mixed
crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported).
The paddle.where normalization is graph-safe (static-shape elementwise ops).
Assert was debug-only; normalization alone is the actual OOB fix.
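A hedged sketch of the graph-safe normalization this commit describes; names are illustrative, not the actual append_attn_backend.py code:

```python
import paddle

def normalize_block_ids(flat_ids: paddle.Tensor, num_gpu_blocks: int) -> paddle.Tensor:
    # Static-shape elementwise ops only: safe under CUDA graph capture,
    # unlike boolean fancy indexing or .item() host syncs.
    is_sentinel = flat_ids == -1
    # Preserve the -1 sentinel (evicted SWA slot); map everything else into
    # the per-head local ID space [0, num_gpu_blocks).
    return paddle.where(is_sentinel, flat_ids, flat_ids % num_gpu_blocks)
```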
- prefix_cache_manager: replace shared flat heap with kv_num_heads
  independent heaps; allocate/recycle now per-head with rank-2
  [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head
  flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert flat % num_gpu_blocks HOTFIX (silent
  aliasing); replace with FD_T53_DEBUG_BLOCK_TABLES gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
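A minimal sketch of the per-head heap contract described above (class and method names are assumptions, not the prefix_cache_manager.py API):

```python
import heapq

class PerHeadFreeLists:
    def __init__(self, kv_num_heads: int, num_gpu_blocks: int):
        # One independent min-heap per KV head: each head allocates from its
        # own [0, num_gpu_blocks) value space, matching the kernel ABI.
        self.heaps = [list(range(num_gpu_blocks)) for _ in range(kv_num_heads)]
        for h in self.heaps:
            heapq.heapify(h)

    def allocate(self, head: int) -> int:
        return heapq.heappop(self.heaps[head])

    def recycle(self, head: int, block_id: int) -> None:
        heapq.heappush(self.heaps[head], block_id)
```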
….data<int>()

Adds dtype guards before .data<int>() reads of:
- set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks
  (in AppendAttentionKernel, lines 100-105/186/187/285)
- mask_offset.get() (in AppendAttention L599 and AppendAttentionWithOutput L763)

Catches accidental INT64/FP dtype before UB. Matches existing PD_CHECK style
from set_flags.cu / set_mask_value.cu.
…p_size

Guards against silent under-allocation when kv_num_heads_global is not a
multiple of tp_size (and >= tp_size). The kv<tp replication path is
explicitly excluded from the assert, preserving existing GQA/MQA behavior.

…l-recycle accuracy

PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the
available_gpu_resource property zeros out fractional values when fewer
than kv_num_heads logical blocks are free, causing the metric to
underreport partial recycle progress. The scheduler can then refuse
admissible requests because it sees 0 capacity even though several
heads' worth of blocks are actually available.

Switch to float division so the metric matches the legacy [0, 1]
continuous value-domain and dashboards / scheduler see true availability.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
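A minimal illustration of the rounding failure (numbers hypothetical):

```python
kv_num_heads, free_logical_blocks = 8, 5
print(free_logical_blocks // kv_num_heads)  # 0     -> scheduler sees zero capacity
print(free_logical_blocks / kv_num_heads)   # 0.625 -> true fractional availability
```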
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why
recycle_request_swa_head_cache short-circuits on
total_tokens % block_size != 0. Document the rationale: the in-flight
decode token is mid-write to the tail block, so releasing it now races
with the next decode write. Recycle resumes on the next step that lands
on a clean boundary.

Comment-only change. No code semantics altered.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
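The documented short-circuit, as a hedged sketch (function name illustrative):

```python
def tail_block_recyclable(total_tokens: int, block_size: int) -> bool:
    # The in-flight decode token is mid-write to the tail block; releasing it
    # now would race with the next decode write. Recycle resumes on the next
    # step that lands on a clean block boundary.
    return total_tokens % block_size == 0
```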

@bob-cloudforge (Author) commented:

Acknowledged on the tests/operators/ kernel-level CTest harness (A3) and the multi-hardware (xpu/dcu/gcu/hpu) model_runner sync (A6). Both are deferred to follow-up PRs:

  • A3 (kernel CTest): the head-wise SWA recycle kernel currently has Python-level integration coverage via tests/cache_manager/. The kernel-level CTest (mirroring tests/operators/test_append_attn_*) will land in a follow-up PR once the kernel signature stabilizes after PR1+PR2 integration soak.
  • A6 (multi-hardware sync): the PR2 changes to resource_manager_v1 and prefix_cache_manager are CUDA-path-only by design. Mirroring to xpu/dcu/gcu/hpu model_runner classes will land as a separate, hardware-vendor-coordinated PR after the CUDA path passes Baidu's internal soak. This avoids landing untested device-specific code.

Also addressed in this push:

  • available_gpu_resource:198 float division — commit 327a43b500.
  • total_tokens % block_size boundary guard — comment-only commit 3a592ac7e2.

Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.

@bob-cloudforge (Author) commented:

@PaddlePaddle-bot — re: A3 (operator C++ unit tests for append_attention.cu) and A6 (multi-hardware sync to xpu/dcu/gcu/hpu model_runner).

Both are acknowledged and deliberately deferred out of this PR:

  • A3 (C++ kernel unit tests) — The discrete head-wise block_idx ABI is exercised end-to-end by the PR2 acceptance bench (TINY → SMOKE → FULL on A800 SM80) which compares OFF vs ON metrics with kv_cache_ratio envelope checks. We will add focused C++ ctest cases for the ABI contract (sentinel -1, head-wise vs flat) in a follow-up PR alongside the FD-level Python integration tests once the bench numbers ship and the ABI is stable. Adding them in-PR would block the kernel review on test-infra plumbing that is unrelated to the kernel correctness change.
  • A6 (xpu/dcu/gcu/hpu model_runner) — The discrete-block-idx ABI is GPU-only in this PR (CUDA SM80+, A800 validated). Other backends do not implement the per-head SWA recycle path yet, so propagating the ABI signature without a working scheduler on those backends would create dead code paths and false API promises. Cross-hardware enablement will land per-backend in dedicated PRs once the GPU path merges and acceptance numbers prove the ABI is final.

Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges.

— bob-cloudforge

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-06 17:41:47

📋 Review summary

PR overview: introduces a discrete head-wise block_idx for AppendAttention, letting SWA heads walk a shorter per-head table while full-attention heads keep the original table, which cuts unnecessary KV page reads; also adds a per-head free-heap allocator and SWA recycle state management.

Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py

Impact tags: [OP] [KVCache] [Scheduler] [FDConfig]


📝 PR convention check

Title issues: [Kernel] is not in the official tag list; the [cf] suffix and the 【Hackathon 10th Spring No.53】 prefix are both non-standard; the convention requires exactly one official tag in the title.

Suggested title (copy-ready):

  • [Feature] Optimize AppendAttention for discrete head-wise block_idx

Suggested PR description (copy-ready, full checklist §D2 structure):

## Motivation
Hackathon 10th Spring Task No.53, PR2 of 2. When SWA and full-attention heads coexist in one layer, the current AppendAttention walks the same uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (physical rank-2: `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk shorter/sparser rows while full heads keep the full-context row, reducing block-id loads and K/V page reads (under the recycle OFF benchmark). The ABI is additive: callers without `block_tables_headwise` take the legacy path unchanged.

## Modifications
- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through the three entry functions; add INT32 dtype guards (`PD_CHECK`); add `sink_size`/`head_wise_full_hidden` parameters
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: replace the uniform `block_tables` row walk with per-head rows; keep the `-1` sentinel guard (the SWA mask zeroes the contribution); c8/c4 deferred to PR3
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}` and `cpp_extensions.cc`: thread `block_tables_headwise` through kernel headers, template config, and the PHI op signature
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add the `_get_block_tables_headwise()` helper; thread `sink_size` and `head_wise_full_hidden`
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only; guard the `use_output=True` path
- `fastdeploy/engine/sched/resource_manager_v1.py`: add the GQA divisibility guard and head-wise SWA recycle state-management methods
- `fastdeploy/cache_manager/prefix_cache_manager.py`: add per-head independent free-heap allocation/recycle; mutual-exclusion guard between head-wise mode and prefix cache
- `fastdeploy/config.py`: engine-main FDConfig T53 fixture mirroring the worker-side config
- `tests/cache_manager/` (8 tests) + `tests/layers/test_append_attention_head_wise_shapes.py`: head-wise contract smoke tests

## Usage or Command
```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5  # leading half of KV heads designated SWA
```
Spec acceptance: recycle OFF, 1D uniform vs 2D discrete block_idx, both TTFT and TBT improve ≥5%.

## Accuracy Tests
| `block_idx` mode | Hardware | TTFT (ms) | TBT (ms) | Δ TTFT | Δ TBT |
|---|---|---|---|---|---|
| 1D (uniform) | H100 / H20 / B200 | TBD | TBD | baseline | baseline |
| 2D (discrete, optimized) | same | TBD | TBD | **≥+5%** | **≥+5%** |

H/B-card acceptance required (cc @luotao1); must be completed before merge.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:197 | available_gpu_resource reads an empty list; in head-wise mode it always returns 0.0, so the scheduler misjudges resources as exhausted |
| 🟡 Suggestion | fastdeploy/engine/sched/resource_manager_v1.py:299 | _num_swa_heads() validates runtime config with assert, which is silently stripped under -O |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py | Per checklist A6: gpu_model_runner.py changed; confirm whether other hardware runners (xpu/dcu/gcu/hpu, etc.) need the same sync |

Overall assessment

The PR architecture is clear, the additive ABI keeps backward compatibility, and defensive touches such as the INT32 dtype guard and the -1 sentinel clamp are commendable. However, the P0 bug from the misspelled available_gpu_resource attribute (singular vs plural) makes the scheduler reject every new request in head-wise mode; it must be fixed and retested before merge.

```python
@property
def available_gpu_resource(self):
    if getattr(self, "head_wise", False) and self.num_gpu_blocks > 0:
        head_free = len(getattr(self, "gpu_free_head_wise_block_list", []))
```

🔴 Bug: available_gpu_resource reads gpu_free_head_wise_block_list (singular, always an empty list) instead of gpu_free_head_wise_block_lists (plural, the per-head heaps that actually hold the free blocks).

In _init_head_wise_free_list:

```python
self.gpu_free_head_wise_block_list = []  # ← always empty, compatibility placeholder
```

So with head_wise=True, head_free = 0 and available_gpu_resource always returns 0.0; the scheduler misreads resources as exhausted and every new request is rejected or blocked.

Suggested fix:

```python
head_free = sum(len(h) for h in self.gpu_free_head_wise_block_lists)
return (head_free / max(1, self.kv_num_heads)) / self.num_gpu_blocks
```

```python
return 0
tp_size = max(1, int(getattr(self.config.parallel_config, "tensor_parallel_size", 1) or 1))
# GQA/MQA divisibility guard: when kv >= tp, kv must be divisible by tp
# (Paddle's TP shards KV heads evenly). The kv < tp replication path is
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: _num_swa_heads() validates a runtime configuration parameter with assert. Under Python -O (optimized mode), asserts are removed entirely and the check silently disappears; the subsequent kv_num_heads_global // tp_size may then raise ZeroDivisionError or produce a wrong head allocation.

Suggested change to an explicit raise:

```python
if (kv_num_heads_global >= tp_size) and (kv_num_heads_global % tp_size != 0):
    raise ValueError(
        f"GQA/MQA constraint violated: kv_num_heads={kv_num_heads_global} "
        f"not divisible by tp_size={tp_size} "
        f"(only kv<tp replication path or exact divisibility supported)"
    )
```

