【Hackathon 10th Spring No.53】[Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 [cf]#7717
Conversation
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: all required tasks have passed (no failing required task detected); the PR can be considered for merging.
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks — 1/2 passed
3 Failure details (required only): no failing required tasks.
The PR1 head-wise allocator (PaddlePaddle#7717) emits flat global block IDs in [0, num_gpu_blocks * kv_num_heads) from a single shared min-heap, but the PR2 discrete kernel (PaddlePaddle#7718) ABI L1 expects per-head local IDs in {-1} ∪ [0, num_gpu_blocks). This causes cudaIllegalAddress on any request whose allocated IDs cross the num_gpu_blocks boundary (i.e. immediately on head index ≥ ceil(num_gpu_blocks / num_blocks)).

This commit normalizes IDs at the backend boundary in append_attn_backend.py using `local = flat % num_gpu_blocks` (sentinel -1 preserved), with a fail-fast assert to catch any residual OOB. The hotfix is bench-only; the canonical fix (per-head independent allocator pools) is deferred to PR1 v5 (RFC-PR1-reanchored.md §3). Also adds an FD_T53_HEAD_WISE_SWA_RATIO ∈ [0.0, 1.0] validator.

Refs:
.checkpoints/h10/task-53/design/PR2-HOTFIX-SPEC.md (Option B, OPUS-GATE PASS)
.checkpoints/h10/task-53/design/CONTRACT-ORACLE.md (I2, I7)
.checkpoints/h10/task-53/design/RFC-PR2-reanchored.md (ABI L1)

Files: 2 changed (1 backend hotfix, 1 envs validator)
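A minimal sketch of the boundary normalization described in that commit, assuming a NumPy block-table array; the function name is illustrative, not the actual `append_attn_backend.py` code:

```python
import numpy as np

def normalize_head_wise_block_ids(flat_ids: np.ndarray, num_gpu_blocks: int) -> np.ndarray:
    """Map flat global IDs in [0, num_gpu_blocks * kv_num_heads) onto the per-head
    local IDs in {-1} U [0, num_gpu_blocks) expected by the discrete kernel ABI.
    The -1 recycle sentinel is preserved unchanged."""
    local_ids = np.where(flat_ids < 0, -1, flat_ids % num_gpu_blocks)
    # Fail fast if any ID is still out of range for the kernel.
    in_range = (local_ids == -1) | ((local_ids >= 0) & (local_ids < num_gpu_blocks))
    assert in_range.all(), "head-wise block id out of range after normalization"
    return local_ids
```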
…not cache-ids) PaddlePaddle-bot flagged that _init_head_wise_free_list and the allocate/recycle paths exported the raw length of gpu_free_head_wise_block_list as free_gpu_block_num. That list holds num_gpu_blocks * kv_num_heads per-(block, head) cache ids, so the metric inflated by kv_num_heads (e.g. 8x for ERNIE-21B-A3B-Paddle). Divide by max(1, kv_num_heads) at all three sites so the exported counter stays in logical-block units, consistent with the legacy gpu_free_block_list semantics that downstream dashboards rely on.

Refs: review on PR PaddlePaddle#7717 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
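A one-line illustration of the unit conversion (the helper name is hypothetical): the head-wise free list counts per-(block, head) slots, so dividing by the head count restores logical-block units.

```python
def exported_free_gpu_block_num(head_wise_free_slots: int, kv_num_heads: int) -> int:
    """head_wise_free_slots counts per-(block, head) entries; divide by the head
    count so the exported metric is in logical-block units (the exact rounding
    choice here is illustrative, not copied from the diff)."""
    return head_wise_free_slots // max(1, kv_num_heads)
```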
…sink-safe) PaddlePaddle-bot review on PR PaddlePaddle#7717 flagged the four 'if (block_id < 0) { block_id = 0; }' fallbacks in the c16 multiquery attention kernel as potentially unsafe — accessing block 0 when block_id == -1 looks like a silent OOB. Document the actual contract: block_id == -1 is the SWA recycle sentinel written by recycle_request_swa_head_cache (T53 PR1). The SWA mask built from chunk_start/chunk_end zeroes any contribution from this aged-out region in softmax, so the value loaded from block 0 is mathematically masked away.

SAFETY argument: when sink_size > 0, recycle_from_floor = sink_blocks guarantees the sink window is never recycled, so block_id == -1 cannot occur inside the attended sink region.

This is a comment-only change. No code semantics altered.

Refs: review on PR PaddlePaddle#7717 (PaddlePaddle-bot)
Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
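A tiny illustrative sketch (names are assumptions, not the PR's code) of the recycle-floor invariant behind that SAFETY argument: with sink_size > 0, recycling never starts below the sink blocks, so the -1 sentinel cannot land inside the attended sink region.

```python
def swa_recycle_floor(sink_size: int, block_size: int) -> int:
    """Lowest block index eligible for SWA recycling (the recycle_from_floor).
    Blocks [0, floor) hold the attention sink and are never recycled."""
    sink_blocks = (sink_size + block_size - 1) // block_size  # ceil division
    return sink_blocks
```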
Thanks for the review. Re:

@PaddlePaddle-bot — re: These four gates (
If you would prefer the CLI surface added in this PR instead of a follow-up, please confirm and I will append the — bob-cloudforge
PR1 backport of PR2 commit 327a43b. Avoids integer truncation underestimating available KV blocks when head_free % kv_num_heads != 0, which caused the scheduler to see 0 capacity on partial recycles and trigger false OOM rejections.

Signed-off-by: bob-cloudforge <bob@cloudforge.solutions>
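A sketch of the rounding issue that backport addresses; the helper name and exact expression are illustrative, inferred from the commit message rather than copied from the diff:

```python
def available_logical_blocks(head_free: int, kv_num_heads: int) -> int:
    """head_free counts per-(block, head) free slots. Plain truncation
    (head_free // kv_num_heads) reports 0 logical blocks whenever
    head_free < kv_num_heads, so a partial recycle looked like zero capacity
    and triggered false OOM rejections. Rounding up avoids the underestimate."""
    heads = max(1, kv_num_heads)
    return (head_free + heads - 1) // heads  # ceil division
```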
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-06 18:24:58
📋 Review Summary
PR overview: adds head-wise SWA cache recycling to the V1 KV-cache scheduler (ResourceManagerV1 + PrefixCacheManager) so that models mixing SWA and full-attention heads (ERNIE-4.5-21B-A3B) can release cache promptly once a SWA head moves past its sliding window; benchmarks show a throughput gain of roughly +57% to +76%.
Scope of change: cache_manager/, engine/sched/, custom_ops/gpu_ops/append_attn/, config.py, model_executor/models/
Impact tags: [KVCache] [Scheduler] [OP] [FDConfig] [Models]
📝 PR Convention Check
Title issues: the current title 【Hackathon 10th Spring No.53】[Feature][KVCache] Support head-wise SWA cache recycle in ResourceManagerV1 [cf] breaks the convention in three places: ① the unofficial prefix 【Hackathon 10th Spring No.53】; ② two tags [Feature][KVCache] (the convention requires exactly one official tag); ③ the unofficial suffix [cf].
Description issues: the Checklist in the PR body uses custom items instead of the §D2 standard template items, and items that are already done (unit tests added, accuracy data provided) remain unchecked.
Suggested title (copy-paste ready):
[Feature] Support head-wise SWA cache recycle in ResourceManagerV1
Suggested PR description (copy-paste ready; Checklist ticked to match the actual state):
## Motivation
Hackathon 10th Spring Task No.53 — performance optimization of discrete KV Cache management and the AppendAttention operator (PR1 of 2). For models that mix SWA (Sliding-Window Attention) heads and full-attention heads inside the same layer (e.g. ERNIE-4.5-21B-A3B), the V1 KV-cache scheduling path (`ResourceManagerV1` + `PrefixCacheManager`) shares **one** `block_idx` **across all heads**, so SWA heads keep occupying cache after their window has ended and throughput suffers. This PR implements a head-wise SWA layout that recycles a SWA head's cache promptly once it moves past the sliding window, equivalent to what PR #6702 did for V0.
## Modifications
| Module | Changes |
|---|---|
| `fastdeploy/cache_manager/prefix_cache_manager.py` | New per-request head-wise GPU free list; `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise`; TP-aware sizing (`num_key_value_heads // tp_size`) |
| `fastdeploy/engine/sched/resource_manager_v1.py` | New `recycle_request_swa_head_cache` (monotonic per-head cursor advance); `_should_skip_swa_recycle_for_overlap` (detects in-flight transfers); P4 cleanup in `_free_blocks` |
| `fastdeploy/model_executor/models/paddleformers/base.py` | Default-off ERNIE SWA fixture (window/sink/skip-freq/ratio), toggled by `FD_T53_HEAD_WISE_SWA_FIXTURE=1` |
| `fastdeploy/config.py` | Engine-main FDConfig fixture that mirrors the worker-side `head_wise_swa_ratio` injection so `ResourceManagerV1._should_use_head_wise_swa` reads the correct model_config |
| `custom_ops/gpu_ops/append_attn/` | New optional `block_table_hw` / `block_tables_headwise` parameters; SWA sentinel guard (`block_id=-1` falls back to 0) |
| Mutual-exclusion guard | `enable_prefix_caching=True + FD_HEAD_WISE_KV_CACHE=1` raises an exception at initialization |
| Environment variable | `FD_HEAD_WISE_KV_CACHE=0` by default; when disabled, behaviour is identical to mainline |
## Usage or Command
```bash
# Enable head-wise V1 cache + timely SWA recycle (all four variables must be set together)
export FD_T53_HEAD_WISE_SWA_FIXTURE=1   # engine-main FDConfig fixture
export ENABLE_V1_KVCACHE_SCHEDULER=1    # on by default, listed explicitly for clarity
export FD_HEAD_WISE_KV_CACHE=1          # enable per-head block tables
export FD_T53_HEAD_WISE_SWA_RATIO=1.0   # SWA recycle ratio (recycle is enabled when > 0)
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
--max-model-len 32768
```
## Accuracy Tests
**Round 2 (128 prompts, A800-80GB):**
| Configuration | Output throughput (tok/s) | Δ |
|---|---|---|
| head-wise + recycle OFF | 706.29 | baseline |
| head-wise + recycle ON | 1107.98 | **+56.9%** ≥30% ✓ |
**Round 3 (1024 prompts, A800-80GB):**
| Configuration | Output throughput (tok/s) | Δ |
|---|---|---|
| head-wise + recycle OFF | 722.93 | baseline |
| head-wise + recycle ON | 1270.87 | **+75.8%** ≥30% ✓ |
Round 3 integrity: `completed=1024/1024`, `errors=0`, TTFT improved by -48.0% (2708 s → 1407 s).
Benchmark configuration: random fixed-IO dataset, input ≈ 10.6k tokens avg / output ≈ 4k tokens avg, request-rate=8, seed=42, A800-80GB (SM80).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/cache_manager/prefix_cache_manager.py:196 | available_gpu_resource returns 0.0 during the head-wise start-up window, which may trigger false OOM reports |
| ❓ Question | custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh:223 | The CUDA sentinel block_id=0 relies on the SWA mask fully covering recycled positions; the cross-request memory access needs additional verification |
| 🟡 Suggestion | tests/operators/ (missing) | A3 mandatory check: append_attention gains a new block_tables_headwise parameter, but there is no corresponding operator unit test under tests/operators/ |
Overall assessment
The overall design is clear: the default-off env gate, the P4 resource-leak fix, and the TP-aware head-count computation are all handled carefully, and the A800 benchmark results comfortably exceed the spec requirement (+56.9%~+75.8% ≥ 30%). The main concerns are the 0.0 return of available_gpu_resource during the head-wise start-up window and the mask dependence of the CUDA sentinel block_id=0; the missing operator tests under tests/operators/ need to be added before merge.
Reviewed code (fastdeploy/cache_manager/prefix_cache_manager.py):

```python
@property
def available_gpu_resource(self):
    if getattr(self, "head_wise", False) and self.num_gpu_blocks > 0:
```
🟡 Suggestion: available_gpu_resource risks returning 0.0 during a start-up window when head_wise=True.
__init__ initializes gpu_free_head_wise_block_list to an empty list [], while _init_head_wise_free_list() is only called from update_cache_config. If any component checks available_gpu_resource between the two, it gets (0/kv_num_heads)/num_gpu_blocks = 0.0, and the scheduler wrongly concludes resources are exhausted and rejects all requests.
Suggestion: return the legacy-path value at initialization until the head-wise free list has been populated:
```python
@property
def available_gpu_resource(self):
    if getattr(self, "head_wise", False) and self.num_gpu_blocks > 0:
        free_list = getattr(self, "gpu_free_head_wise_block_list", None)
        if not free_list:  # not yet initialized: fall back to the legacy path
            return len(self.gpu_free_block_list) / self.num_gpu_blocks
        head_free = len(free_list)
        return (head_free / max(1, self.kv_num_heads)) / self.num_gpu_blocks
    return len(self.gpu_free_block_list) / self.num_gpu_blocks if self.num_gpu_blocks > 0 else 0.0
```

Reviewed code (custom_ops/gpu_ops/append_attn/multiquery_attention_c16_impl.cuh):

```cpp
// window is never recycled, so block_id==-1 cannot occur inside the
// attended sink region and the fallback to block 0 is provably out of range.
int block_id = __ldg(&block_table_now[kv_idx_base / BLOCK_SIZE]);
if (block_id < 0) {
```
❓ Question: the fallback to block_id = 0 for the recycled sentinel relies on the SWA mask correctly covering every recycled position.
The SAFETY guarantee in the comment is that "the SWA mask is built from chunk_start/chunk_end and zeroes out the contribution of this aged-out region". However:
- If there is a timing window between the block-alignment check in recycle_request_swa_head_cache (skipped when total_tokens % block_size != 0) and chunk_start in the attention kernel, some block_id=-1 slots may sit in a region not yet covered by the mask.
- Block 0 is a legitimately allocated physical block, and the GPU really does read its KV data (even when the mask is 0). Under high concurrency this does not affect numerical correctness, but it is a cross-request memory access.
Suggested verification to add to the PR: a) describe precisely how the recycle trigger timing lines up with the mask boundary; or b) use a dedicated sentinel block (e.g. a pre-allocated all-zero copy at block_id=0) instead of reusing a live block's physical address.
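A small Python reference sketch (illustrative only, not the kernel) of why a value loaded through the block-0 fallback is numerically inert wherever the mask does cover it: positions outside [chunk_start, chunk_end) get a -inf logit before softmax, so their attention weight is exactly zero.

```python
import numpy as np

def masked_softmax_weights(logits: np.ndarray, chunk_start: int, chunk_end: int) -> np.ndarray:
    """Mask every position outside [chunk_start, chunk_end) before softmax.
    Whatever KV values were loaded for those positions (e.g. via the block-0
    fallback for a recycled slot) receive a weight of exactly 0.
    Assumes at least one position per row remains unmasked."""
    masked = logits.astype(float)
    idx = np.arange(masked.shape[-1])
    masked[..., (idx < chunk_start) | (idx >= chunk_end)] = -np.inf
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```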
Motivation
Hackathon 10th Spring Task No.53 — performance optimization of discrete KV Cache management and the AppendAttention operator (PR1 of 2). Spec: https://github.com/PaddlePaddle/community/blob/master/hackathon/hackathon_10th/【Hackathon_10th】开源贡献个人挑战赛春节特别季—任务合集.md#no53.
For models that mix Sliding-Window Attention (SWA) heads with full-attention heads inside the same layer, today's V1 KV-cache scheduling path (`ResourceManagerV1` + `PrefixCacheManager`, gated by the default-on `ENABLE_V1_KVCACHE_SCHEDULER=1`) allocates one shared `block_idx` per layer for all heads. SWA heads finish their window long before full-attn heads, but their cache stays pinned until the whole layer evicts. Throughput suffers.

This PR teaches the V1 scheduler + `PrefixCacheManager` to manage `block_idx` per head (head-wise SWA layout) and recycle a SWA head's cache as soon as it crosses its window — the per-head equivalent of what PR #6702 did for V0.

Authorship: this PR is independently designed and implemented by the submitter for Hackathon 10th Spring No.53. The earlier community PR #6702 (V0, not merged) is referenced as prior art only; no code is lifted unattributed. Any future contributor work will be acknowledged via per-commit `Co-authored-by` trailers.

RFC: PaddlePaddle/community#1364.
Modifications
| Module | Changes |
|---|---|
| `fastdeploy/cache_manager/prefix_cache_manager.py` | Per-head GPU free list (`gpu_free_block_list_head_wise[head]`); `allocate_gpu_blocks_head_wise` / `recycle_gpu_blocks_head_wise`; TP-aware sizing (`num_key_value_heads // tp_size`) |
| `fastdeploy/engine/sched/resource_manager_v1.py` | `recycle_request_swa_head_cache` (per-head cursor advance ≥ window + sink); `_should_skip_swa_recycle_for_overlap` (per-request `cache_swap_metadata` / `cache_evict_metadata` inspection); P4 cleanup in `_free_blocks` |
| `fastdeploy/model_executor/models/paddleformers/base.py` | Default-off ERNIE SWA fixture, toggled by `FD_T53_HEAD_WISE_SWA_FIXTURE=1` |
| `fastdeploy/config.py` | Mirrors the `paddleformers/base.py` head-wise SWA attribute injection so `ResourceManagerV1._should_use_head_wise_swa` (engine-main) sees the same `model_config.head_wise_swa_ratio` as the worker; gated on `FD_T53_HEAD_WISE_SWA_FIXTURE` |
| Mutual-exclusion guard | `enable_prefix_caching=True + FD_HEAD_WISE_KV_CACHE=1` raises at `PrefixCacheManager.__init__` |
| Env gate | `FD_HEAD_WISE_KV_CACHE=0` default — bit-identical behaviour when disabled |
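A minimal, illustrative sketch (simplified assumptions, not the actual `PrefixCacheManager` code) of the per-head free-list bookkeeping the table above describes: one pool of per-(block, head) slots served from a shared min-heap, with capacity exported in logical-block units.

```python
import heapq

class HeadWiseFreeList:
    """Simplified model of the head-wise allocator: one free pool of
    (block, head) slots, with capacity reported in logical-block units."""

    def __init__(self, num_gpu_blocks: int, kv_num_heads: int):
        self.num_gpu_blocks = num_gpu_blocks
        self.kv_num_heads = kv_num_heads
        # One slot id per (block, head) pair, kept in a shared min-heap.
        self.free = list(range(num_gpu_blocks * kv_num_heads))
        heapq.heapify(self.free)

    def allocate_gpu_blocks_head_wise(self, num_slots: int) -> list:
        return [heapq.heappop(self.free) for _ in range(num_slots)]

    def recycle_gpu_blocks_head_wise(self, slots: list) -> None:
        for slot in slots:
            heapq.heappush(self.free, slot)

    @property
    def free_gpu_block_num(self) -> int:
        # Exported in logical-block units, matching the legacy counter semantics.
        return len(self.free) // max(1, self.kv_num_heads)
```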
Tests use real lightweight objects + `object.__new__` / AST or shape oracles (no `MagicMock`-only). PR2, not PR1, owns kernel-visible `block_tables_headwise` / FP8 scale-layout changes.

PR2 (separate) lands the AppendAttention rank-2 `block_tables_headwise` ABI + ForwardMeta wiring + `kv_num_heads` field as a frozen-shape parameter; PR1 keeps `share_inputs.block_tables` 2D and reaches the +30% recycle gate via cache-manager-side changes only.

Usage or Command
Accuracy Tests
Spec PR1 acceptance — throughput up ≥30% with timely SWA recycle vs without, same VRAM, fixed-IO dataset, V1 KV-cache scheduler on (`ENABLE_V1_KVCACHE_SCHEDULER=1`, default):

Round 2 (gate run — 128 prompts): recycle OFF 706.29 tok/s → recycle ON 1107.98 tok/s output throughput, +56.9% (≥30% ✓).

Round 3 (full run — 1024 prompts): recycle OFF 722.93 tok/s → recycle ON 1270.87 tok/s output throughput, +75.8% (≥30% ✓).

Round 3 integrity: `completed=1024/1024` in both arms, `errors=0`, mean TTFT improved -48.0% (2,708 s → 1,407 s).

Benchmark: `FastDeploy/benchmarks/benchmark_serving.py` — random fixed-IO dataset, input ≈ 10.6k tokens avg / output ≈ 4k tokens avg, request-rate=8, seed=42, `--ignore-eos`, server `--max-concurrency=8192`, YAML `eb45-21b-a3b-32k-bf16-kv50-512s.yaml` (kv_cache_max_ratio=0.50, max_seq_len=512). Fixed-IO integrity: both arms produce identical `total_input_tokens=1,356,656` / `total_output_tokens=518,946` for the 128-prompt gate run. Round 2 harness gate: `completed=128`, `nonempty_errors=0`. Round 3 target: `completed=1024`.

Correctness:
`tests/cache_manager/test_head_wise_*.py`, `tests/cache_manager/test_swa_recycle*.py`, and `tests/layers/test_append_attention_head_wise_shapes.py` — real `_FakeCacheManager` + `object.__new__(ResourceManagerV1)` + AST/shape oracles. No `MagicMock`-only tests.

CI run: https://github.com/PaddlePaddle/FastDeploy/pull/7717/checks
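A minimal sketch, assuming the module path from the Modifications table and a hypothetical fake, of the `object.__new__(ResourceManagerV1)` construction style these tests rely on (bypassing `__init__` so the recycle logic can be exercised against a lightweight cache-manager stand-in):

```python
# Hypothetical test sketch: illustrates the object.__new__ construction style,
# not the actual tests/cache_manager code.
from fastdeploy.engine.sched.resource_manager_v1 import ResourceManagerV1

class _FakeCacheManager:
    """Minimal stand-in that records recycled head-wise slots."""

    def __init__(self):
        self.recycled = []

    def recycle_gpu_blocks_head_wise(self, slots):
        self.recycled.extend(slots)

def make_resource_manager(fake_cache_manager):
    # Bypass __init__ (which needs a full engine config) and attach only the
    # attributes the head-wise recycle path is assumed to read.
    rm = object.__new__(ResourceManagerV1)
    rm.cache_manager = fake_cache_manager
    return rm
```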
Companion PR: #7718 (AppendAttention rank-2 head-wise block_idx kernel optimisation)
Checklist
`pre-commit run --all-files` clean