
【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]#7718

Open
bob-cloudforge wants to merge 11 commits into PaddlePaddle:develop from CloudForge-Solutions:task/h10-053-pr2-discrete-block-idx-v4

Conversation


@bob-cloudforge bob-cloudforge commented May 4, 2026

PR2 Body — 【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]

Companion PR stacked on PR1 (#7717). Until PR1 lands on develop, this branch carries the PR1 producer commits as its base, so the GitHub diff against develop is the stacked PR1 + PR2 surface (31 files, +2360/-49). The PR2-only delta is summarized below.


Motivation

Hackathon 10th Spring Task No.53 PR2 of 2. Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.

When SWA and full-attention heads coexist in one layer, the current AppendAttention path walks the same uniform block_tables row for every KV head. The discrete block_tables_headwise layout (logical [batch, kv_head, block], flattened to a physical rank-2 [batch * local_kv_heads, max_blocks_per_head]) lets SWA-head CTAs walk a shorter/sparser row while full heads keep the existing full-context row. That reduces unnecessary block-id loads and K/V page reads under the required recycle OFF benchmark.
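For orientation, a minimal sketch of the logical-to-physical row mapping (illustrative only, not the FastDeploy implementation):

```python
# Logical [batch, kv_head, block] flattens to physical
# [batch * local_kv_heads, max_blocks_per_head]: each KV head owns one row,
# so SWA heads can carry shorter rows than full heads in the same batch entry.
def headwise_row(batch_id: int, kv_head: int, local_kv_heads: int) -> int:
    return batch_id * local_kv_heads + kv_head
```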

The ABI is additive: callers without block_tables_headwise use the legacy path unchanged; callers with the head-wise table take the new kernel-visible fast path.

Modifications

Total stacked diff: 31 files, +2360/-49, grouped below. The PR2-only delta block lists what this PR adds on top of PR1.

Stacked surface (PR1 producer + PR2 kernel + shared tests)

| Area | Files | +/− | Purpose |
|---|---|---|---|
| Kernel (custom_ops/gpu_ops/) | 7 | +136 / −19 | append_attention.cu, append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_impl.cuh, multiquery_attention_c16_kernel.h, template_config.json}, cpp_extensions.cc |
| Runtime (fastdeploy/) | 13 | +849 / −30 | cache_manager/prefix_cache_manager.py, engine/sched/resource_manager_v1.py, worker/{gpu_model_runner.py, input_batch.py, worker_process.py}, model_executor/{forward_meta.py, layers/attention/append_attn_backend.py, layers/attention/ops/append_attention.py, models/paddleformers/base.py}, engine/request.py, spec_decode/mtp.py, config.py, envs.py |
| Tests (tests/) | 9 | +1360 / 0 | tests/cache_manager/test_{per_head_heaps, head_wise_freelist, head_wise_extend_validation, head_wise_abort_reset, head_wise_tp_consistency, swa_recycle, swa_recycle_legacy_relief, benchmark_head_wise_swa}.py, tests/layers/test_append_attention_head_wise_shapes.py |
| Bench / config | 2 | +15 / 0 | benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml, .gitignore |

PR2-only delta (changes added on top of PR1 #7717)

| File | Change |
|---|---|
| custom_ops/gpu_ops/append_attention.cu | Thread block_tables_headwise through AppendAttentionKernel, AppendAttention, and AppendAttentionWithOutput; add PD_CHECK(.dtype() == INT32) dtype guards on every Python-supplied .data<int>() read (set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks, mask_offset); make block_tables_headwise keyword-only on the Python op; add sink_size / head_wise_full_hidden parameters; thread sink_size into append_attention_with_output_gpu() (was hardcoded 0). |
| custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh | c16 kernel point-of-use: replace the uniform block_tables row walk with per-head row selection from block_tables_headwise when present; preserve the existing block_id < 0 → 0 clamp at the load site (-1 sentinel = evicted SWA slot; the mask zeroes its contribution). c8/c4 variants deferred to PR3. |
| custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}, custom_ops/gpu_ops/cpp_extensions.cc | Thread the optional block_tables_headwise tensor through kernel headers, template config, and the PHI op signature. |
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Add _get_block_tables_headwise(forward_meta) helper (per-call read of forward_meta, then forward_meta.cache_manager, else None); thread the tensor as a kwarg into both append_attention() and append_attention_with_output() call sites; pass sink_size and head_wise_full_hidden to the with-output path. |
| fastdeploy/model_executor/layers/attention/ops/append_attention.py | Make block_tables_headwise keyword-only on both ops; guard head_wise_full_hidden > 0 in the use_output=True path with assert head_wise_full_hidden == 0 (dual-call merge stays in append_attention() only; with-output path deferred to PR3). |
| fastdeploy/engine/sched/resource_manager_v1.py | Add assert (kv_num_heads_global < tp_size) or (kv_num_heads_global % tp_size == 0) GQA divisibility guard before kv_num_heads_global // tp_size. |
| tests/layers/test_append_attention_head_wise_shapes.py | Shape-level smoke test for the kernel-visible head-wise contract (additive on top of PR1's allocator tests). |

The c16 kernel is the only flavor consumed in PR2. c8/c4/write-path mirrors and the graph-blacklist update are intentionally deferred to PR3. PR2's safety model rests on the legacy uniform block_tables walk, the existing block_id < 0 fallback, and the SWA mask's zero contribution.
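As a Python pseudocode sketch of that safety model (the real logic is CUDA in multiquery_attention_c16_impl.cuh; the function and parameter names here are assumptions):

```python
def load_block_id(block_tables, block_tables_headwise,
                  batch_id, kv_head, local_kv_heads, pos):
    if block_tables_headwise is None:
        # Legacy path: every KV head walks the same uniform row.
        block_id = block_tables[batch_id][pos]
    else:
        # Head-wise path: each KV head walks its own (possibly shorter) row.
        block_id = block_tables_headwise[batch_id * local_kv_heads + kv_head][pos]
    # Existing clamp at the load site: the -1 sentinel (evicted SWA slot)
    # reads block 0 as a harmless placeholder; the SWA mask zeroes its
    # contribution downstream.
    return max(block_id, 0)
```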

Clean-room note: PR2 uses public PR #6702 only as behavior/reference context. No Co-authored-by trailer; prose acknowledgement only.

Usage or Command

No user-facing API change. The optimized path is active when PR1 provides block_tables_headwise and head-wise SWA is enabled:

```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5    # leading half of KV heads designated SWA
```

Spec acceptance must be measured with timely SWA recycle OFF, comparing 1D uniform block_idx against 2D discrete block_idx.
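As a hedged sketch of how the ratio could map to a head partition (the helper name and rounding are assumptions; only the env var, its [0.0, 1.0] domain, and the leading-heads-are-SWA convention come from this PR):

```python
import os

def swa_head_split(kv_num_heads: int):
    ratio = float(os.environ.get("FD_T53_HEAD_WISE_SWA_RATIO", "0.0"))
    if not 0.0 <= ratio <= 1.0:  # mirrors the envs.py validator added in this PR
        raise ValueError(f"FD_T53_HEAD_WISE_SWA_RATIO={ratio} outside [0.0, 1.0]")
    num_swa = int(kv_num_heads * ratio)
    # Leading heads are SWA; the remainder stay full-attention.
    return range(num_swa), range(num_swa, kv_num_heads)

swa_heads, full_heads = swa_head_split(8)  # ratio=0.5 -> heads 0-3 SWA, 4-7 full
```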

Accuracy Tests

Spec PR2 acceptance — recycle OFF; H/B card; 1D uniform vs 2D discrete; both TTFT and TBT improve ≥5%:

| block_idx mode | Hardware | TTFT (ms) | TBT (ms) | Δ TTFT | Δ TBT |
|---|---|---|---|---|---|
| 1D (uniform) | H100 / H20 / B200 | TBD | TBD | baseline | baseline |
| 2D (discrete, optimized) | same | TBD | TBD | +TBD% (≥5 ✓) | +TBD% (≥5 ✓) |

Benchmark: FastDeploy/benchmarks/serving/benchmark_serving.py with benchmarks/yaml/eb45-21b-a3b-32k-bf16-kv50-512s.yaml.

Hardware request to reviewers (cc @luotao1): PR2 acceptance requires H/B card per spec. A800 numbers (when present) are preview-only and labelled as such; FULL bench run is one-time pre-merge.

Correctness gates before push:

  • block_tables_headwise=None legacy path unchanged.
  • use_output=True and use_output=False both consume the same head-wise table contract.
  • 1D vs 2D numeric parity for FP16/BF16/cache-quant variants; -1 sentinel rows skip before K/V pointer derivation.
  • GSM8K parity within ±0.1 pp.
  • All 9 head-wise tests under tests/cache_manager/ and tests/layers/ green locally.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7718/checks

Depends on: #7717 (ResourceManagerV1 head-wise SWA recycle, producer for block_tables_headwise).

Checklist

Adds rank-2 block_tables_headwise plumbing for c16 multi-query attention path.

Updates template_config.json so the codegen produces explicit instantiations matching the new impl signature (added optional block_table_headwise param).

paddle-bot Bot commented May 4, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 4, 2026

CLAassistant commented May 4, 2026

CLA assistant check
All committers have signed the CLA.


PaddlePaddle-bot commented May 4, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-06 17:54:44

CI report generated from the following code (refreshed every 30 minutes):


1 Task overview

✅ No Required tasks are currently failing; however, 7 workflows are in the action_required state and the main CI pipeline will only run after manual approval.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 2 (0) | 2 | 1 | 0 | 1 | 0 | 0 |

⚠️ Note: the following 7 workflows are in the action_required state (they execute only after approval): Approval, Check PR Template, Codestyle-Check, ILUVATAR-CI, CI_HPU, CI_XPU, PR Build and Test. These workflows require manual approval to trigger.

Note: action_required workflows are not counted in the task statistics above.


2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks gated by Branch Protection Rules were detected (or the GitHub API could not fetch the protection-rule list).

No Required task data available.

2.2 Optional tasks — 1/2 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| | Trigger Jenkins for PR | - | Job | - |
| | Remaining 1 optional task passed | - | - | - |

3 Failure details (required only)

No failing required tasks.

@bob-cloudforge changed the title from "feat(append_attn): head-wise SWA recycle + discrete-block-idx ABI" to "【Hackathon 10th Spring No.53】[Feature][Kernel] Optimize AppendAttention for discrete head-wise block_idx [cf]" on May 4, 2026
- gpu_model_runner: _maybe_slice_block_tables_headwise now is_dummy_or_profile_run-aware so captured CUDA graph records non-null sidecar; identity-stride dummy seeding aligned with kernel shape assert (dim0 == bsz * kv_num_heads).
- input_batch: InputBatch.swap_states + ProposerInputBatch.swap_states clone-then-copy swap block_tables_headwise[i*kv_local:(i+1)*kv_local] row groups so head-wise rows follow slot moves on both target and proposer paths.
- gpu_model_runner._process_reorder: in-place clear forward_batch_reqs_list before repopulating from share_inputs.index_to_batch_id; prevents stale tail entries from leaking into logprob-settings consumers (Option A: post-hoc rebuild).
- gpu_model_runner: docstring corrected to match C16 kernel sentinel handling (multiquery_attention_c16_impl.cuh L215-223 / L605-613); -1 sentinel reads block 0 as harmless placeholder, SWA mask zeroes the contribution. No fallback to flat block_tables.
- benchmarks/yaml: add eb45-21b-a3b-32k-bf16-kv50-512s.yaml for PR2 bench geometry.

Refs: T53 PR2 PaddlePaddle#7718.
self.input_batch is not constructed yet during _dummy_prefill_inputs
and CUDA-graph capture, so reading self.input_batch.kv_num_heads_local
crashed the worker before the bench server could start. Use
self.model_config.kv_num_heads (set in init_share_inputs before warmup)
which has the same TP-aware value.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in
[0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but
the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in
{-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any
request whose allocated IDs cross the num_gpu_blocks boundary
(i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py
using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a
fail-fast assert to catch any residual OOB. The hotfix is bench-only;
the canonical fix (per-head independent allocator pools) is deferred to
PR1 v5 (RFC-PR1-reanchored.md §3).

Also adds FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs: .checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
     .checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
     .checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
…mixed

Boolean fancy indexing and .item() CPU sync inside forward_mixed
crash CUDA graph capture (cudaError 900 cudaErrorStreamCaptureUnsupported).
The paddle.where normalization is graph-safe (static-shape elementwise ops).
Assert was debug-only; normalization alone is the actual OOB fix.
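A hedged sketch of the graph-safe normalization this commit describes; names are illustrative, not the actual append_attn_backend.py code:

```python
import paddle

def normalize_block_ids(flat_ids: paddle.Tensor, num_gpu_blocks: int) -> paddle.Tensor:
    # Static-shape elementwise ops only: safe under CUDA graph capture,
    # unlike boolean fancy indexing or .item() host syncs.
    is_sentinel = flat_ids == -1
    # Preserve the -1 sentinel (evicted SWA slot); map everything else into
    # the per-head local ID space [0, num_gpu_blocks).
    return paddle.where(is_sentinel, flat_ids, flat_ids % num_gpu_blocks)
```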
- prefix_cache_manager: replace shared flat heap with kv_num_heads
  independent heaps; allocate/recycle now per-head with rank-2
  [kv_num_heads][N] nested-list contract per RFC-PR2 §3
- gpu_model_runner: warmup base = idx * fill_blocks (not cross-head
  flat); rank-2 buffer shape preserved per kernel ABI
- append_attn_backend: revert flat % num_gpu_blocks HOTFIX (silent
  aliasing); replace with FD_T53_DEBUG_BLOCK_TABLES gated assert
- tests: 4 per-head value-space invariants, no MagicMock
- .gitignore: ignore runs/ bench output dir

Closes T53-PR2-OOB-blocker (kernel ABI now matches producer).
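A minimal sketch of the per-head heap contract described above (class and method names are assumptions, not the prefix_cache_manager.py API):

```python
import heapq

class PerHeadFreeLists:
    def __init__(self, kv_num_heads: int, num_gpu_blocks: int):
        # One independent min-heap per KV head: each head allocates from its
        # own [0, num_gpu_blocks) value space, matching the kernel ABI.
        self.heaps = [list(range(num_gpu_blocks)) for _ in range(kv_num_heads)]
        for h in self.heaps:
            heapq.heapify(h)

    def allocate(self, head: int) -> int:
        return heapq.heappop(self.heaps[head])

    def recycle(self, head: int, block_id: int) -> None:
        heapq.heappush(self.heaps[head], block_id)
```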
….data<int>()

Adds dtype guards before .data<int>() reads of:
- set_max_lengths, encoder_num_blocks, kv_num_blocks, decoder_num_blocks
  (in AppendAttentionKernel, lines 100-105/186/187/285)
- mask_offset.get() (in AppendAttention L599 and AppendAttentionWithOutput L763)

Catches accidental INT64/FP dtype before UB. Matches existing PD_CHECK style
from set_flags.cu / set_mask_value.cu.
…p_size

Guards against silent under-allocation when kv_num_heads_global is not a
multiple of tp_size (and >= tp_size). The kv<tp replication path is
explicitly excluded from the assert, preserving existing GQA/MQA behavior.

…l-recycle accuracy

PaddlePaddle-bot review on PR PaddlePaddle#7718 noted that integer division in the
available_gpu_resource property zeros out fractional values when fewer
than kv_num_heads logical blocks are free, causing the metric to
underreport partial recycle progress. The scheduler can then refuse
admissible requests because it sees 0 capacity even though several
heads' worth of blocks are actually available.

Switch to float division so the metric matches the legacy [0, 1]
continuous value-domain and dashboards / scheduler see true availability.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
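A minimal illustration of the rounding failure (numbers hypothetical):

```python
kv_num_heads, free_logical_blocks = 8, 5
print(free_logical_blocks // kv_num_heads)  # 0     -> scheduler sees zero capacity
print(free_logical_blocks / kv_num_heads)   # 0.625 -> true fractional availability
```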
PaddlePaddle-bot review on PR PaddlePaddle#7718 asked why
recycle_request_swa_head_cache short-circuits on
total_tokens % block_size != 0. Document the rationale: the in-flight
decode token is mid-write to the tail block, so releasing it now races
with the next decode write. Recycle resumes on the next step that lands
on a clean boundary.

Comment-only change. No code semantics altered.

Refs: review on PR PaddlePaddle#7718 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
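The documented short-circuit, as a hedged sketch (function name illustrative):

```python
def tail_block_recyclable(total_tokens: int, block_size: int) -> bool:
    # The in-flight decode token is mid-write to the tail block; releasing it
    # now would race with the next decode write. Recycle resumes on the next
    # step that lands on a clean block boundary.
    return total_tokens % block_size == 0
```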

@bob-cloudforge (Author) commented:

Acknowledged on the tests/operators/ kernel-level CTest harness (A3) and the multi-hardware (xpu/dcu/gcu/hpu) model_runner sync (A6). Both are deferred to follow-up PRs:

  • A3 (kernel CTest): the head-wise SWA recycle kernel currently has Python-level integration coverage via tests/cache_manager/. The kernel-level CTest (mirroring tests/operators/test_append_attn_*) will land in a follow-up PR once the kernel signature stabilizes after PR1+PR2 integration soak.
  • A6 (multi-hardware sync): the PR2 changes to resource_manager_v1 and prefix_cache_manager are CUDA-path-only by design. Mirroring to xpu/dcu/gcu/hpu model_runner classes will land as a separate, hardware-vendor-coordinated PR after the CUDA path passes Baidu's internal soak. This avoids landing untested device-specific code.

Also addressed in this push:

  • available_gpu_resource:198 float division — commit 327a43b500.
  • total_tokens % block_size boundary guard — comment-only commit 3a592ac7e2.

Please let us know if either deferral blocks merge — happy to scope an inline minimal version if so.

@bob-cloudforge (Author) commented:

@PaddlePaddle-bot — re: A3 (operator C++ unit tests for append_attention.cu) and A6 (multi-hardware sync to xpu/dcu/gcu/hpu model_runner).

Both are acknowledged and deliberately deferred out of this PR:

  • A3 (C++ kernel unit tests) — The discrete head-wise block_idx ABI is exercised end-to-end by the PR2 acceptance bench (TINY → SMOKE → FULL on A800 SM80) which compares OFF vs ON metrics with kv_cache_ratio envelope checks. We will add focused C++ ctest cases for the ABI contract (sentinel -1, head-wise vs flat) in a follow-up PR alongside the FD-level Python integration tests once the bench numbers ship and the ABI is stable. Adding them in-PR would block the kernel review on test-infra plumbing that is unrelated to the kernel correctness change.
  • A6 (xpu/dcu/gcu/hpu model_runner) — The discrete-block-idx ABI is GPU-only in this PR (CUDA SM80+, A800 validated). Other backends do not implement the per-head SWA recycle path yet, so propagating the ABI signature without a working scheduler on those backends would create dead code paths and false API promises. Cross-hardware enablement will land per-backend in dedicated PRs once the GPU path merges and acceptance numbers prove the ABI is final.

Both items are tracked in the task checkpoint and will be raised as separate PRs after this one merges.

— bob-cloudforge

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-06 17:41:47

📋 Review summary

PR overview: introduces a discrete head-wise block_idx for AppendAttention, letting SWA heads walk a shorter per-head table while full-attention heads keep the original table, which cuts unnecessary KV page reads; also adds a per-head free-heap allocator and SWA recycle state management.

Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/cache_manager/, fastdeploy/engine/sched/, fastdeploy/config.py, fastdeploy/worker/gpu_model_runner.py

Impact tags: [OP] [KVCache] [Scheduler] [FDConfig]


📝 PR convention check

Title issues: [Kernel] is not in the official tag list; the [cf] suffix and the 【Hackathon 10th Spring No.53】 prefix are both non-standard; the convention requires exactly one official tag in the title.

Suggested title (copy-ready):

  • [Feature] Optimize AppendAttention for discrete head-wise block_idx

Suggested PR description (copy-ready, full checklist §D2 structure):

## Motivation
Hackathon 10th Spring Task No.53, PR2 of 2. When SWA and full-attention heads coexist in one layer, the current AppendAttention walks the same uniform `block_tables` row for every KV head. The discrete `block_tables_headwise` layout (physical rank-2: `[batch * local_kv_heads, max_blocks_per_head]`) lets SWA-head CTAs walk shorter/sparser rows while full heads keep the full-context row, reducing block-id loads and K/V page reads (under the recycle OFF benchmark). The ABI is additive: callers without `block_tables_headwise` take the legacy path unchanged.

## Modifications
- `custom_ops/gpu_ops/append_attention.cu`: thread `block_tables_headwise` through the three entry functions; add INT32 dtype guards (`PD_CHECK`); add `sink_size`/`head_wise_full_hidden` parameters
- `custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh`: replace the uniform `block_tables` row walk with per-head rows; keep the `-1` sentinel guard (the SWA mask zeroes the contribution); c8/c4 deferred to PR3
- `custom_ops/gpu_ops/append_attn/{append_attention_c16_impl.cuh, append_attention_kernel.h, multiquery_attention_c16_kernel.h, template_config.json}` and `cpp_extensions.cc`: thread `block_tables_headwise` through kernel headers, template config, and the PHI op signature
- `fastdeploy/model_executor/layers/attention/append_attn_backend.py`: add the `_get_block_tables_headwise()` helper; thread `sink_size` and `head_wise_full_hidden`
- `fastdeploy/model_executor/layers/attention/ops/append_attention.py`: make `block_tables_headwise` keyword-only; guard the `use_output=True` path
- `fastdeploy/engine/sched/resource_manager_v1.py`: add the GQA divisibility guard and head-wise SWA recycle state-management methods
- `fastdeploy/cache_manager/prefix_cache_manager.py`: add per-head independent free-heap allocation/recycle; mutual-exclusion guard between head-wise mode and prefix cache
- `fastdeploy/config.py`: engine-main FDConfig T53 fixture mirroring the worker-side config
- `tests/cache_manager/` (8 tests) + `tests/layers/test_append_attention_head_wise_shapes.py`: head-wise contract smoke tests

## Usage or Command
```bash
export FD_HEAD_WISE_KV_CACHE=1
export FD_T53_HEAD_WISE_SWA_RATIO=0.5  # leading half of KV heads designated SWA
```
Spec acceptance: recycle OFF, 1D uniform vs 2D discrete block_idx, both TTFT and TBT improve ≥5%.

## Accuracy Tests
| `block_idx` mode | Hardware | TTFT (ms) | TBT (ms) | Δ TTFT | Δ TBT |
|---|---|---|---|---|---|
| 1D (uniform) | H100 / H20 / B200 | TBD | TBD | baseline | baseline |
| 2D (discrete, optimized) | same | TBD | TBD | **≥+5%** | **≥+5%** |

H/B-card acceptance required (cc @luotao1); must be completed before merge.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:197 | available_gpu_resource reads an empty list; in head-wise mode it always returns 0.0, so the scheduler misjudges resources as exhausted |
| 🟡 Suggestion | fastdeploy/engine/sched/resource_manager_v1.py:299 | _num_swa_heads() validates runtime config with assert, which is silently stripped under -O |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py | Per checklist A6: gpu_model_runner.py changed; confirm whether other hardware runners (xpu/dcu/gcu/hpu, etc.) need the same sync |

Overall assessment

The PR architecture is clear, the additive ABI keeps backward compatibility, and defensive touches such as the INT32 dtype guard and the -1 sentinel clamp are commendable. However, the P0 bug from the misspelled available_gpu_resource attribute (singular vs plural) makes the scheduler reject every new request in head-wise mode; it must be fixed and retested before merge.

```python
@property
def available_gpu_resource(self):
    if getattr(self, "head_wise", False) and self.num_gpu_blocks > 0:
        head_free = len(getattr(self, "gpu_free_head_wise_block_list", []))
```

🔴 Bug: available_gpu_resource reads gpu_free_head_wise_block_list (singular, always an empty list) instead of gpu_free_head_wise_block_lists (plural, the per-head heaps that actually hold the free blocks).

In _init_head_wise_free_list:

```python
self.gpu_free_head_wise_block_list = []  # ← always empty, compatibility placeholder
```

So with head_wise=True, head_free = 0 and available_gpu_resource always returns 0.0; the scheduler misreads resources as exhausted and every new request is rejected or blocked.

Suggested fix:

```python
head_free = sum(len(h) for h in self.gpu_free_head_wise_block_lists)
return (head_free / max(1, self.kv_num_heads)) / self.num_gpu_blocks
```

```python
return 0
tp_size = max(1, int(getattr(self.config.parallel_config, "tensor_parallel_size", 1) or 1))
# GQA/MQA divisibility guard: when kv >= tp, kv must be divisible by tp
# (Paddle's TP shards KV heads evenly). The kv < tp replication path is
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: _num_swa_heads() validates a runtime configuration parameter with assert. Under Python -O (optimized mode), asserts are removed entirely and the check silently disappears; the subsequent kv_num_heads_global // tp_size may then raise ZeroDivisionError or produce a wrong head allocation.

Suggested change to an explicit raise:

```python
if (kv_num_heads_global >= tp_size) and (kv_num_heads_global % tp_size != 0):
    raise ValueError(
        f"GQA/MQA constraint violated: kv_num_heads={kv_num_heads_global} "
        f"not divisible by tp_size={tp_size} "
        f"(only kv<tp replication path or exact divisibility supported)"
    )
```

