
[Optim][Cherry-pick] Reduce preemption occurrence when blocks not enough (#5696) #5808

Merged
Jiang-Jia-Jun merged 8 commits into PaddlePaddle:release/2.4 from rainyfly:optimize_scheduler_for_preemption_release24 on Jan 8, 2026

Conversation

@rainyfly
Collaborator

@rainyfly commented Dec 29, 2025

Motivation

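When there are not enough blocks for decoding, the scheduler preempts requests that are currently decoding, releases their blocks, and hands those blocks to the remaining decode requests.
Previously, whenever the waiting queue held requests and the remaining blocks could fit the next chunk of a new request (new_token_num), the scheduler pulled that request back in for prefill. When blocks are already severely short, this leads to a repeated preempt -> schedule prefill -> preempt -> schedule prefill cycle, which degrades performance.
To fix this, when scheduling a new request for prefill, the scheduler now reserves some blocks for every request that is currently decoding. A request is taken from the waiting queue for prefill only if, after the reserve for each decoding request has been subtracted, the remaining blocks can still hold the entire request that needs prefill.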
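The old and new admission rules described above can be contrasted in a short sketch. This is a minimal illustration, not the code in resource_manager_v1.py: the helper names and the block size of 64 tokens are assumptions; only the per-decode reserve (default 16) and the "whole request must fit" condition come from this PR.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    # Number of KV-cache blocks needed to hold num_tokens (ceiling division).
    return math.ceil(num_tokens / block_size)


def old_can_schedule_prefill(num_free_blocks: int, new_token_num: int, block_size: int) -> bool:
    # Previous rule (hypothetical helper): only the next chunk had to fit.
    return num_free_blocks >= blocks_needed(new_token_num, block_size)


def new_can_schedule_prefill(
    num_free_blocks: int,
    num_running_decode: int,
    reserve_per_decode: int,
    prefill_tokens: int,
    block_size: int,
) -> bool:
    # New rule (hypothetical helper): set aside reserve_per_decode blocks for
    # every request still decoding, then require that the WHOLE prefill
    # request fits in the blocks that are left.
    reserved = num_running_decode * reserve_per_decode
    return num_free_blocks - reserved >= blocks_needed(prefill_tokens, block_size)


if __name__ == "__main__":
    # 100 free blocks, 8 decoding requests, a 4096-token prompt, a 2048-token chunk.
    print(old_can_schedule_prefill(100, 2048, 64))         # True: the 32-block chunk fits
    print(new_can_schedule_prefill(100, 8, 16, 4096, 64))  # False: 100 - 128 < 64
```

Under the old rule this request would be admitted and then likely preempted again as soon as the decoding requests needed new blocks; under the new rule it stays in the waiting queue until enough blocks are free.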

Modifications

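New environment variable: FD_RESERVE_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL
Meaning: the number of blocks to reserve for each request that is currently decoding when a new request is scheduled from the waiting queue for prefill. Default: 16.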
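A minimal sketch of reading this variable with its documented default of 16. The helper name is hypothetical; FastDeploy's own definition lives in fastdeploy/envs.py and may parse the value differently.

```python
import os

ENV_NAME = "FD_RESERVE_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL"


def reserve_blocks_per_decode() -> int:
    # Hypothetical reader: falls back to the documented default of 16 blocks.
    return int(os.getenv(ENV_NAME, "16"))


if __name__ == "__main__":
    os.environ[ENV_NAME] = "8"  # override the default for this process only
    print(reserve_blocks_per_decode())  # 8
```

In a deployment the same override would normally be applied by exporting the variable in the shell before starting the service.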

Usage or Command

Enabled by default under v1.

Accuracy Tests

None

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Dec 29, 2025

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Dec 29, 2025

Codecov Report

❌ Patch coverage is 29.16667% with 17 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.4@fb59f56).

Files with missing lines | Patch % | Lines
fastdeploy/engine/sched/resource_manager_v1.py | 29.16% | 17 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #5808   +/-   ##
==============================================
  Coverage               ?   58.81%           
==============================================
  Files                  ?      329           
  Lines                  ?    40836           
  Branches               ?     6221           
==============================================
  Hits                   ?    24019           
  Misses                 ?    14945           
  Partials               ?     1872           
Flag | Coverage Δ
GPU | 58.81% <29.16%> (?)

Flags with carried forward coverage won't be shown.


Contributor

Copilot AI left a comment


Pull request overview

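This optimization PR is cherry-picked from #5696 and aims to reduce how often preemption occurs when block resources run short. It introduces a reserve-block mechanism: when scheduling new prefill requests, the scheduler reserves part of the block pool for requests that are decoding, which avoids the repeated "preempt -> schedule prefill -> preempt" rescheduling cycle.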

Main changes:

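  • Adds three new environment variables that control the reserve-block mechanism: the initial reserve count, the decay rate, and the minimum reserve count
  • Changes the scheduling logic so that, when checking whether a new prefill request can be scheduled, blocks are reserved for the decode requests that are currently running
  • Implements a decay mechanism for the reserve: it shrinks gradually during normal scheduling and is reset to the initial value whenever preemption occurs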
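The decay-and-reset behavior in the last bullet can be modeled as a small state object. This is a sketch under assumptions: the overview does not name the two additional environment variables or give the decay formula, so the multiplicative decay_rate, the field names, and the values below are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class ReserveBlockState:
    """Hypothetical model of the per-decode block reserve; not FastDeploy's API."""

    init_reserve: int = 16    # initial reserve per decoding request
    decay_rate: float = 0.9   # ASSUMED: multiplicative factor applied per schedule() call
    min_reserve: int = 0      # lower bound for the reserve
    current: float = field(init=False)

    def __post_init__(self) -> None:
        self.current = float(self.init_reserve)

    def on_preempt(self) -> None:
        # Blocks ran out and a request was preempted: restore the full reserve.
        self.current = float(self.init_reserve)

    def on_schedule_end(self) -> None:
        # A normal scheduling round: shrink the reserve, never below the floor.
        self.current = max(float(self.min_reserve), self.current * self.decay_rate)

    @property
    def per_decode(self) -> int:
        # Whole blocks to reserve for each decoding request this round.
        return int(self.current)


if __name__ == "__main__":
    state = ReserveBlockState(init_reserve=16, decay_rate=0.5, min_reserve=2)
    for _ in range(4):
        state.on_schedule_end()  # 16 -> 8 -> 4 -> 2 -> 2 (held at the floor)
    print(state.per_decode)  # 2
    state.on_preempt()
    print(state.per_decode)  # 16
```

Decaying the reserve lets prefill admission loosen again once memory pressure eases, while resetting it on preemption tightens admission immediately after blocks have run out.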

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

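File | Description
fastdeploy/envs.py | Adds three new environment variables that configure the reserve-block mechanism (initial value, decay rate, minimum value)
fastdeploy/engine/sched/resource_manager_v1.py | Adds reserve-block instance variables in the ResourceManagerV1 initializer; changes the scheduling logic in schedule() to count the reserved blocks when checking whether a new prefill can be scheduled; adds reserve-block reset logic to _trigger_preempt(); implements reserve-block decay at the end of each schedule() call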

@Jiang-Jia-Jun merged commit 1e8de96 into PaddlePaddle:release/2.4 on Jan 8, 2026
13 of 19 checks passed

Labels

None yet

Projects

None yet


3 participants