[feature] provide lock free scheduler #7937
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7937 +/- ##
==========================================
Coverage ? 63.73%
==========================================
Files ? 470
Lines ? 65254
Branches ? 9991
==========================================
Hits ? 41590
Misses ? 20857
Partials ? 2807
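CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Not all required tasks have passed: 2 required tasks are currently failing and 0 required tasks are pending. Fix the coverage gate failure first, then complete the Approval.
2 Task status summary
Note on the log column: failed tasks link directly to the CI log; running tasks link to the Job.
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 28/32 passed
3 Failure details (required only)
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage gate failed (confidence: high)
Failed cases: none. The log shows that the unit tests passed; the failure occurred in the coverage check step.
Root cause details:
Key log:
Suggested fix:
Fix summary: add unit test coverage for the new scheduling code.
Related change:
Approval: manual Approval required (confidence: high)
This Job requires manual Approval; CI continues only after the approval is granted.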
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-27 15:49:11
📋 Review summary
PR overview: adds the lock-free scheduler ResourceManagerV2 (message-queue based) and a prefill_schedule() scheduling path in V1 dedicated to P (prefill) instances.
Change scope: fastdeploy/engine/sched/, fastdeploy/engine/common_engine.py, fastdeploy/output/token_processor.py, fastdeploy/envs.py
Impact tags: [Scheduler] [Engine] [PD Disaggregation] [DataProcessor]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/common_engine.py:1131 | When splitwise_role=="prefill", prefill_schedule() is called unconditionally; the default ResourceManager has no such method, so the call raises AttributeError |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | event.wait() has no timeout, so it can block forever | |
| F2 | A race window exists between release_slot and _enqueue_msg | |
| F3 | _process_add_request copies the V1 logic by hand instead of calling super() | |
| F4 | Duplicate runtime check of ENABLE_V1_KVCACHE_SCHEDULER in token_processor.py | |
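For F1, a bounded wait could look roughly like the sketch below. It assumes a one-shot result holder built on `threading.Event`; the class name `_Future`, the `timeout_s` parameter, and the choice to raise `TimeoutError` are illustrative assumptions, not code from this PR.

```python
import threading


class _Future:
    """Hypothetical one-shot result holder; not the PR's actual class."""

    def __init__(self):
        self._event = threading.Event()
        self._result = None

    def set_result(self, value):
        # Called by the scheduling thread once the query has been answered.
        self._result = value
        self._event.set()

    def result(self, timeout_s=5.0):
        # Event.wait returns False when the timeout elapses before set().
        if not self._event.wait(timeout=timeout_s):
            raise TimeoutError(f"scheduler did not reply within {timeout_s}s")
        return self._result


def demo():
    # One future that is answered shortly after the caller starts waiting.
    answered = _Future()
    threading.Timer(0.01, answered.set_result, args=("ok",)).start()
    first = answered.result(timeout_s=1.0)

    # One future that is never answered: the caller gets an error, not a hang.
    unanswered = _Future()
    try:
        unanswered.result(timeout_s=0.05)
        second = "no timeout"
    except TimeoutError:
        second = "timed out"
    return first, second
```

With a timeout, a caller whose message is never processed fails loudly instead of hanging; how the caller should recover (retry, abort the request) is left open here.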
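📝 PR convention check
The title tag [feature] is not in the official tag list (the capitalization does not match), and every section of the description is empty. This PR revision has fixed neither issue.
Suggested title (can be copied directly):
[Scheduler] provide lock-free scheduler (ResourceManagerV2)
Suggested PR description (can be copied directly)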
## Motivation
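The V1 scheduler protects state changes with a global lock. Under high concurrency, external threads (token_processor, the splitwise connector, etc.) contend for that lock frequently, which hurts scheduling throughput. V2 uses a message queue to serialize all state changes onto the single writer thread that runs schedule(), eliminating lock contention.
## Modifications
- Add `fastdeploy/engine/sched/request_manager.py`: a unified wrapper for the slot allocation/release interface (`acquire_slot` / `release_slot` / `get_available_position` / `available_batch`)
- Add `fastdeploy/engine/sched/resource_manager_v2.py`: inherits from V1, replaces the global lock with `_NoOpLock`, and makes calls from external threads asynchronous through a `deque` message queue; the synchronous query methods for P/D disaggregation are implemented with a `threading.Event`-based Future mechanism
- `fastdeploy/engine/sched/__init__.py`: export `RequestManager`, `ResourceManagerV1`, `ResourceManagerV2`
- `fastdeploy/engine/common_engine.py`: add an `ENABLE_V2_KVCACHE_SCHEDULER` branch that is initialized in preference to V1; switch the P-instance scheduling path to `prefill_schedule()`
- `fastdeploy/envs.py`: add the environment variable `ENABLE_V2_KVCACHE_SCHEDULER` (default 0)
- `fastdeploy/output/token_processor.py`: add a V2 branch in `_recycle_resources` that calls `finish_requests` (synchronous enqueue) instead of V1's `finish_requests_async`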
## Usage or Command
```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
export ENABLE_V2_KVCACHE_SCHEDULER=1
# The V2 lock-free scheduler takes effect once the service is started
```
## Accuracy Tests
N/A (the scheduler change does not affect model output)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
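The lock-free scheduler design is clear: a message queue serializes state changes onto the scheduling thread. This revision adds the prefill_schedule() path dedicated to P instances, but the method is called unconditionally on every resource_manager instance, which is a compatibility bug that must be fixed. The outstanding findings F1 (event.wait without a timeout) and F2 (race window) should be followed up soon.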
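The single-writer idea can be illustrated with a minimal sketch: external threads only enqueue messages, and the scheduling thread is the only one that mutates state. The class `SingleWriterScheduler`, its fields, and the message tuples are hypothetical stand-ins rather than the PR's actual `ResourceManagerV2`; the sketch relies on `collections.deque` having thread-safe `append` and `popleft`.

```python
import threading
from collections import deque


class SingleWriterScheduler:
    """Hypothetical stand-in for the V2 design; not the PR's actual class."""

    def __init__(self, num_slots):
        self._msgs = deque()        # written by any thread, drained by schedule()
        self._waiting = deque()     # touched only by the scheduling thread
        self._running = {}          # request_id -> slot, scheduling thread only
        self._free_slots = list(range(num_slots))

    # Callable from any thread: enqueue only, never mutate scheduler state.
    def add_request(self, request_id):
        self._msgs.append(("add", request_id))

    def finish_request(self, request_id):
        self._msgs.append(("finish", request_id))

    # Must be called from the single scheduling thread.
    def schedule(self):
        # Drain queued messages and apply them in arrival order.
        while True:
            try:
                kind, request_id = self._msgs.popleft()
            except IndexError:
                break
            if kind == "add":
                self._waiting.append(request_id)
            elif kind == "finish" and request_id in self._running:
                self._free_slots.append(self._running.pop(request_id))
        # Admit waiting requests while free slots remain.
        scheduled = []
        while self._waiting and self._free_slots:
            request_id = self._waiting.popleft()
            self._running[request_id] = self._free_slots.pop()
            scheduled.append(request_id)
        return scheduled


def demo():
    sched = SingleWriterScheduler(num_slots=2)
    producers = [
        threading.Thread(target=sched.add_request, args=(f"req-{i}",))
        for i in range(3)
    ]
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    first = sched.schedule()          # two slots, three requests
    sched.finish_request(first[0])    # frees one slot on the next schedule()
    second = sched.schedule()
    return len(first), len(second), len(sched._running)
```

Because every mutation happens inside `schedule()`, no lock is needed around `_running` or `_free_slots`; the trade-off is that effects of `finish_request` become visible only on the next scheduling pass.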
    # 2. Schedule requests
    batch_request, error_tasks = self.resource_manager.schedule()
    if self.cfg.scheduler_config.splitwise_role == "prefill":
        batch_request, error_tasks = self.resource_manager.prefill_schedule()
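🔴 Bug: When ENABLE_V1_KVCACHE_SCHEDULER=0 (the default ResourceManager is used) and splitwise_role=="prefill", the call to self.resource_manager.prefill_schedule() here raises AttributeError, because the default ResourceManager has no such method.
prefill_schedule() is defined only in ResourceManagerV1 (and in V2, which inherits from it), but the current code calls it unconditionally on every resource_manager instance.
Suggested fix: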
    if self.cfg.scheduler_config.splitwise_role == "prefill" and hasattr(self.resource_manager, "prefill_schedule"):
        batch_request, error_tasks = self.resource_manager.prefill_schedule()
    else:
        batch_request, error_tasks = self.resource_manager.schedule()

Alternatively, define prefill_schedule() in the ResourceManager base class and have it fall back to schedule(), so that the interface stays consistent.
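The base-class alternative might look like the following sketch. The classes are simplified stand-ins that return placeholder batches, and it is an assumption here that `ResourceManagerV1` can inherit (or otherwise share an interface) with the default `ResourceManager`; the helper `run_schedule` is hypothetical.

```python
class ResourceManager:
    """Simplified stand-in for the default manager; the real class holds more state."""

    def schedule(self):
        return ["default-batch"], []

    def prefill_schedule(self):
        # Fallback: managers without a dedicated prefill path reuse schedule().
        return self.schedule()


class ResourceManagerV1(ResourceManager):
    """Simplified stand-in; overrides the prefill path with its own logic."""

    def prefill_schedule(self):
        return ["prefill-batch"], []


def run_schedule(resource_manager, splitwise_role):
    # The call site no longer needs hasattr(): every manager exposes both methods.
    if splitwise_role == "prefill":
        return resource_manager.prefill_schedule()
    return resource_manager.schedule()
```

Compared with the `hasattr` check, this keeps the capability decision inside the manager classes, so new call sites cannot forget the guard.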