[BugFix] Add a daemon thread to decode instances to monitor preallocated block timeouts #7965
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a configurable timeout-based cleanup mechanism to reclaim decode-side preallocated blocks when prefill never arrives, preventing long-lived resource leaks.
Changes:
- Introduces the `FD_DECODE_PREALLOC_BLOCK_TIMEOUT` environment setting for reclaim timeout configuration.
- Starts a background monitor thread to periodically scan and reclaim timed-out preallocations.
- Emits warnings/errors and returns a 408-like `RequestOutput` when reclaiming occurs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/envs.py | Adds a new env var to configure decode prealloc reclamation timeout. |
| fastdeploy/engine/common_engine.py | Adds a background monitor thread to reclaim timed-out preallocated blocks and notify request completion. |
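The background monitor described in the table can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `PeriodicMonitor`, `scan_fn`, and the `threading.Event`-based stop are hypothetical choices made so the loop can be shut down promptly, whereas the excerpted diff uses `while self.running` with `time.sleep`.

```python
import threading


class PeriodicMonitor:
    """Hypothetical helper: calls scan_fn every `interval` seconds on a daemon thread."""

    def __init__(self, scan_fn, interval):
        self._scan_fn = scan_fn
        self._interval = interval
        self._stop = threading.Event()
        # daemon=True so the thread never blocks interpreter shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns True once stop() is called, which ends the loop
        while not self._stop.wait(self._interval):
            try:
                self._scan_fn()
            except Exception as exc:
                # Keep the monitor alive if a single scan fails
                print(f"monitor error: {exc}")
```

Waiting on an event instead of sleeping is a design choice for this sketch only; it lets `stop()` return without waiting out a full interval.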
| check_interval = max(10, min(timeout / 10, 60)) | ||
| while self.running: | ||
| time.sleep(check_interval) |
| for req_id in timed_out: | ||
| with self.resource_manager.lock: | ||
| if req_id not in self.resource_manager.requests: | ||
| continue | ||
| if any(r.request_id == req_id for r in self.resource_manager.running): | ||
| continue | ||
| self.llm_logger.warning(f"Reclaiming preallocated blocks for {req_id}: timeout {timeout}s") | ||
| self.resource_manager.pre_recycle_resource(req_id) |
| except Exception as e: | ||
| self.llm_logger.error(f"Prealloc timeout monitor error: {e}") |
| # Timeout (seconds) for D to reclaim preallocated blocks if P never follows through. 0 to disable. | ||
| "FD_DECODE_PREALLOC_BLOCK_TIMEOUT": lambda: int(os.getenv("FD_DECODE_PREALLOC_BLOCK_TIMEOUT", "600")), |
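To make the interplay between the timeout setting and the scan interval concrete, here is a small sketch. The function names `read_prealloc_timeout` and `check_interval` are hypothetical; the default of 600 and the clamp expression are taken from the diff excerpts above, and how the engine handles a value of 0 is not shown in those excerpts.

```python
import os


def read_prealloc_timeout(environ=os.environ):
    """Parse the timeout in seconds; per the diff comment, 0 means disabled."""
    return int(environ.get("FD_DECODE_PREALLOC_BLOCK_TIMEOUT", "600"))


def check_interval(timeout):
    """Scan about ten times per timeout window, clamped to 10..60 seconds."""
    return max(10, min(timeout / 10, 60))
```

With this clamp, any timeout of 600 seconds or more is scanned once a minute, and any timeout of 100 seconds or less is scanned every 10 seconds.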
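CI report generated from the following code (updated every 30 minutes): 1 Required jobs: 8/10 passed
2 Failure details 🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: medium). Analyzer: ci_analyze_unittest_fastdeploy | Error type: PR issue | Confidence: medium. Failed cases:
Key logs:
This PR, in
Suggested fix:
Related changes: 🟡 Approval: manual approval required. This job requires manual approval; CI continues only after the approval is complete. Please approve it manually. |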
| "FD_HPU_MEASUREMENT_MODE": lambda: os.getenv("FD_HPU_MEASUREMENT_MODE", "0"), | ||
| "FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS": lambda: int(os.getenv("FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS", "30")), | ||
| # Timeout (seconds) for D to reclaim preallocated blocks if P never follows through. 0 to disable. | ||
| "FD_DECODE_PREALLOC_BLOCK_TIMEOUT": lambda: int(os.getenv("FD_DECODE_PREALLOC_BLOCK_TIMEOUT", "600")), |
Please increase this timeout a bit to avoid timeouts caused by requests queuing on the P instance. It could be changed to 20 minutes (1200s).
| try: | ||
| now = time.time() | ||
| with self.resource_manager.lock: | ||
| skip_ids = ( |
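This does not feel very robust. Consider adding an is_preallocated flag to the request: set it to True in the earlier _process_allocate_resource_requests function and set it to False in _process_prefilled_requests after the request is processed successfully. The check here can then rely on the is_preallocated flag.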
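The flag-based approach suggested in this comment might look like the sketch below. Everything here is illustrative: the `Request` dataclass, the `prealloc_time` field, and the helper names are assumptions standing in for the engine's real request object and for the two `_process_*` functions named in the comment.

```python
import time
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for the engine's request object."""
    request_id: str
    is_preallocated: bool = False
    prealloc_time: float = 0.0


def mark_preallocated(req, now=None):
    # Would be called where blocks are preallocated for the request
    req.is_preallocated = True
    req.prealloc_time = time.time() if now is None else now


def mark_prefilled(req):
    # Would be called once the prefilled request is processed successfully
    req.is_preallocated = False


def find_timed_out(requests, timeout, now=None):
    """Return ids of requests still flagged as preallocated after `timeout` seconds."""
    now = time.time() if now is None else now
    return [
        rid for rid, req in requests.items()
        if req.is_preallocated and now - req.prealloc_time > timeout
    ]
```

The scan then depends on a single explicit flag rather than on inferring the state from membership in several collections.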
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7965 +/- ##
==========================================
Coverage ? 67.86%
==========================================
Files ? 467
Lines ? 65216
Branches ? 10013
==========================================
Hits ? 44260
Misses ? 18112
Partials ? 2844
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-02 13:39:08
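📋 Review summary
PR overview: Adds a daemon thread to Decode instances that monitors preallocated block timeouts, preventing permanent block leaks when a P instance crashes
Scope of changes: Engine (common_engine.py), environment variables (envs.py)
Impact tags: [Engine] [PD Disaggregation]
Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/envs.py:194 | The PR description states a default of 600s, but the code actually uses 1200s; the description and the code are inconsistent |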
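Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | check-then-act TOCTOU race window |
Additional notes on F1:
`_prealloc_timeout_monitor` scans under the lock to build the `timed_out` list, releases the lock, and then calls `pre_recycle_resource` for each entry. Between releasing the lock and reclaiming, `_process_prefilled_requests` may already have completed prefill for the request (setting `is_preallocated=False` and calling `add_prefilled_request`), yet the monitor still runs `pre_recycle_resource` and frees its blocks, so blocks belonging to an active decode request are reclaimed by mistake. Recommendation: inside the `for req_id in timed_out` loop, before calling `pre_recycle_resource`, re-acquire the lock and check that `req.is_preallocated` is still True.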
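The recommended re-check can be sketched as follows. `ResourceManager` and `Req` here are hypothetical stand-ins, and the sketch assumes `pre_recycle_resource` does not itself acquire the same non-reentrant lock; if it does in the real code, the check and the recycle would need a reentrant lock or a different structure.

```python
import threading


class Req:
    """Hypothetical request carrying the is_preallocated flag."""

    def __init__(self, is_preallocated):
        self.is_preallocated = is_preallocated


class ResourceManager:
    """Hypothetical stand-in for the engine's resource manager."""

    def __init__(self):
        self.lock = threading.Lock()
        self.requests = {}   # request_id -> Req
        self.recycled = []   # ids whose blocks were reclaimed

    def pre_recycle_resource(self, req_id):
        # Assumed not to take self.lock internally
        self.requests.pop(req_id, None)
        self.recycled.append(req_id)


def reclaim(rm, timed_out):
    """Reclaim only requests that are still preallocated when re-checked under the lock."""
    for req_id in timed_out:
        with rm.lock:
            req = rm.requests.get(req_id)
            # Re-validate: prefill may have completed since the scan
            if req is None or not req.is_preallocated:
                continue
            rm.pre_recycle_resource(req_id)
```

Because the check and the recycle happen inside the same critical section, a request that finished prefill after the scan is skipped rather than having its blocks freed.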
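📝 PR convention check
The PR title format is compliant ([BugFix] is an official tag) and the description structure is largely complete. However, the Usage or Command and Accuracy Tests sections contain only HTML comments (empty once rendered), and the Checklist ticks "Add unit tests" although the diff includes no test code and gives no reason.
Suggested PR description (can be copied directly)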
## Motivation
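In the PD disaggregation architecture, a P instance that receives a request asks the D instance to preallocate blocks. If the P instance crashes or hits a network failure after making that request, the blocks on the D instance leak permanently, and the accumulated leaks eventually cause 429 errors (block exhaustion).
## Modifications
1. **`fastdeploy/envs.py`**: adds the environment variable `FD_DECODE_PREALLOC_BLOCK_TIMEOUT` (default 1200s, set to 0 to disable)
2. **`fastdeploy/engine/common_engine.py`**: adds the `_prealloc_timeout_monitor()` daemon thread method, which periodically checks for requests that were preallocated but did not complete before the timeout, automatically reclaims their blocks, and returns a 408 error code
## Usage or Command
Control the timeout through the environment variable:
```bash
export FD_DECODE_PREALLOC_BLOCK_TIMEOUT=1200  # default 1200s, set to 0 to disable
```
## Accuracy Tests
N/A (this change does not affect model output)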
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
  - This change is a daemon-thread-level bugfix; there are no unit tests yet, and adding integration tests later is recommended
- [x] Provide accuracy results.
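- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The overall approach is sound: reclaiming on timeout from a daemon thread addresses the block leak in the PD disaggregation architecture. The main points to watch are the mismatch between the description and the code's default value, and the previously reported TOCTOU race, which is still not fixed and may cause the blocks of normally prefilled requests to be reclaimed by mistake.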
| "FD_HPU_MEASUREMENT_MODE": lambda: os.getenv("FD_HPU_MEASUREMENT_MODE", "0"), | ||
| "FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS": lambda: int(os.getenv("FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS", "30")), | ||
| # Timeout (seconds) for D to reclaim preallocated blocks if P never follows through. 0 to disable. | ||
| "FD_DECODE_PREALLOC_BLOCK_TIMEOUT": lambda: int(os.getenv("FD_DECODE_PREALLOC_BLOCK_TIMEOUT", "1200")), |
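🟡 Suggestion: The PR description states a default timeout of 600s, but the actual default in the code is "1200" (1200 seconds).
A mismatch between the description and the implementation will mislead users when they configure it. Please make them consistent: if 1200s is the intended value, fix the PR description; if 600s is the intended value, fix the default here.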
…on timeouts.
Motivation
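In the PD disaggregation architecture, a P instance that receives a request asks the D instance to preallocate blocks. If the P instance crashes or hits a network failure after making that request, the blocks on the D instance leak permanently, and the accumulated leaks eventually cause 429 errors (block exhaustion).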
Modifications
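Code changes (2 files)
- fastdeploy/envs.py: adds the environment variable FD_DECODE_PREALLOC_BLOCK_TIMEOUT (default 600s, set to 0 to disable)
- fastdeploy/engine/common_engine.py: adds the _prealloc_timeout_monitor() daemon thread method, which periodically checks for requests that were preallocated but did not complete before the timeout and automatically reclaims their blocks
Usage or Command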
Accuracy Tests
Checklist
- Add at least a tag in the PR title. Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.