[Optim][Cherry-pick] Reduce preemption occurrence when blocks not enough (#5696) #5808
Merged
Jiang-Jia-Jun merged 8 commits into PaddlePaddle:release/2.4 on Jan 8, 2026
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | release/2.4 | #5808 |
|---|---|---|
| Coverage | ? | 58.81% |
| Files | ? | 329 |
| Lines | ? | 40836 |
| Branches | ? | 6221 |
| Hits | ? | 24019 |
| Misses | ? | 14945 |
| Partials | ? | 1872 |
Pull request overview
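This optimization PR is cherry-picked from #5696 and aims to reduce how often preemption occurs when block resources run short. It introduces a reserve-block mechanism: when scheduling a new prefill request, the scheduler sets aside some blocks for the requests that are currently decoding, which avoids the repeated "preempt -> schedule prefill -> preempt" rescheduling cycle.
Main changes:
- Adds three new environment variables that control the reserve-block mechanism: the initial reserve count, the decay rate, and the minimum reserve count
- Modifies the scheduling logic so that the check for whether a new prefill request can be scheduled accounts for blocks reserved for the currently running decode requests
- Implements decay of the reserve count: it decreases gradually during normal scheduling and resets to the initial value when preemption occurs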
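
The decay-and-reset behavior described in the last bullet can be sketched roughly as follows. This is only an illustration, not the code from `resource_manager_v1.py`: the class `ReserveBlockState`, its field names, and the subtractive form of the decay are all assumptions, and the values 2 and 4 are made up. Only the initial value 16 is borrowed from the default stated in the PR description.

```python
from dataclasses import dataclass, field


@dataclass
class ReserveBlockState:
    """Hypothetical tracker for the per-decode-request block reserve.

    The names and the subtractive decay are assumptions made for
    illustration; they are not taken from the FastDeploy source.
    """

    init_reserve: int = 16  # initial reserve (16 is the default named in the PR description)
    decay_step: int = 2     # assumed: blocks removed from the reserve per normal schedule step
    min_reserve: int = 4    # assumed: floor below which the reserve never decays
    current: int = field(init=False)

    def __post_init__(self) -> None:
        # Start from the initial reserve.
        self.current = self.init_reserve

    def on_schedule_step(self) -> None:
        """Decay the reserve after a schedule step that did not preempt."""
        self.current = max(self.min_reserve, self.current - self.decay_step)

    def on_preempt(self) -> None:
        """Reset the reserve to its initial value when preemption happens."""
        self.current = self.init_reserve


if __name__ == "__main__":
    state = ReserveBlockState()
    for _ in range(3):
        state.on_schedule_step()
    print(state.current)  # reserve after three quiet steps
    state.on_preempt()
    print(state.current)  # reserve after a preemption
```

The intent of such a scheme is that the scheduler stays conservative right after a preemption and becomes more willing to admit new prefill requests as long as no further preemption occurs.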
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
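| fastdeploy/envs.py | Adds three new environment variables that configure the reserve-block mechanism (initial value, decay rate, minimum value) |
| fastdeploy/engine/sched/resource_manager_v1.py | Adds reserve-block instance variables to the ResourceManagerV1 initializer; modifies the scheduling logic in schedule() so that the check for scheduling a new prefill counts the reserved blocks; adds reserve-block reset logic to _trigger_preempt(); implements reserve-block decay at the end of each schedule() call |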
Merged commit 1e8de96 into PaddlePaddle:release/2.4. 13 of 19 checks passed.
Motivation
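When there are not enough blocks for decoding, the scheduler preempts requests that are currently decoding, releases their blocks, and assigns those blocks to the remaining decoding requests.
Under the previous scheduling logic, whenever the waiting queue held requests and the remaining blocks could hold a partial chunk (new_token_num) of the next new request, that request was scheduled for prefill. When blocks were already severely short, this caused a repeated preempt -> schedule prefill -> preempt -> schedule prefill rescheduling cycle and degraded performance.
To fix this, the scheduler now reserves some blocks for the requests that are currently decoding when it schedules a new request for prefill. A request is taken from the waiting queue for prefill only if, after subtracting the blocks reserved for each decoding request, the remaining blocks can still hold the entire request that needs prefill.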
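
The admission rule in the paragraph above can be written down as a small check. This is a minimal sketch under stated assumptions, not the actual `schedule()` logic: the function names, the parameters, and the block size of 64 tokens used in the example are all hypothetical.

```python
def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Number of KV-cache blocks needed to hold num_tokens (ceiling division)."""
    return (num_tokens + block_size - 1) // block_size


def can_schedule_prefill(
    free_blocks: int,
    num_running_decodes: int,
    reserve_per_decode: int,
    prompt_tokens: int,
    block_size: int,
) -> bool:
    """Hypothetical admission check for a waiting request.

    Blocks are first set aside for every decoding request; the waiting
    request is admitted only if its whole prompt fits in what is left.
    """
    reserved = num_running_decodes * reserve_per_decode
    usable = free_blocks - reserved
    return usable >= blocks_needed(prompt_tokens, block_size)


if __name__ == "__main__":
    # 100 free blocks, 4 decoding requests, 16 blocks reserved for each:
    # 64 blocks are held back and 36 remain usable for prefill.
    print(can_schedule_prefill(100, 4, 16, prompt_tokens=2048, block_size=64))  # needs 32 blocks
    print(can_schedule_prefill(100, 4, 16, prompt_tokens=4096, block_size=64))  # needs 64 blocks
```

Compared with admitting a request as soon as one chunk fits, this check is stricter in two ways: it discounts the reserve, and it requires room for the full request rather than a partial chunk.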
Modifications
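New environment variable: FD_RESERVE_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL
Meaning: the number of blocks to reserve for each decoding request when a new request is scheduled from the waiting queue for prefill. Defaults to 16.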
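
As a rough illustration of how such a setting might be read, the sketch below uses plain `os.getenv` with the documented default of 16. The helper `reserve_blocks_per_decode` is hypothetical; how `fastdeploy/envs.py` actually parses the variable is not shown in this PR description.

```python
import os

ENV_NAME = "FD_RESERVE_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL"


def reserve_blocks_per_decode(default: int = 16) -> int:
    """Read the reserve size from the environment (hypothetical helper).

    Falls back to the default of 16 named in the PR description when the
    variable is unset.
    """
    return int(os.getenv(ENV_NAME, str(default)))


if __name__ == "__main__":
    # Override the reserve for this process before the scheduler reads it.
    os.environ[ENV_NAME] = "32"
    print(reserve_blocks_per_decode())
```

A larger value makes the scheduler hold back more blocks for decoding and admit new prefill requests later; a smaller value does the opposite.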
Usage or Command
Enabled by default under v1.
Accuracy Tests
None
Checklist
- PR title tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.