[Model Runner] Refactor execute_model for GPU async scheduling #6176

zhoutianzi666 merged 10 commits into PaddlePaddle:develop from not_need_stop
Conversation
Thanks for your contribution!
Pull request overview
This PR optimizes the memory-copy handling of the not_need_stop flag by introducing asynchronous memory copies. The changes maintain a separate copy of the not_need_stop flag on the GPU and use pinned memory together with custom CUDA ops to manage data transfers between CPU and GPU.
Changes:
- Added new CUDA ops `get_stop` and `set_stop` for managing the stop flag on the GPU
- Added a `not_need_stop_gpu` field to `ModelOutputData` to support the GPU version of the stop flag
- Modified `update_inputs_v1.cu` to remove the synchronous memory-copy logic
- Updated the memory-allocation strategy to use pinned memory for asynchronous operations
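For context, a minimal plain-CUDA sketch (not code from this PR; buffer names are illustrative) of why pinned memory is needed for the asynchronous copies described above: `cudaMemcpyAsync` into pageable host memory silently degrades to a synchronous copy, while a page-locked buffer lets the transfer overlap with host work.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
  bool *pageable = static_cast<bool *>(std::malloc(sizeof(bool)));
  bool *pinned = nullptr;
  cudaMallocHost(&pinned, sizeof(bool));  // page-locked host memory

  bool *d_flag = nullptr;
  cudaMalloc(&d_flag, sizeof(bool));

  // Same API call, different behavior:
  cudaMemcpyAsync(pageable, d_flag, sizeof(bool),
                  cudaMemcpyDeviceToHost);  // pageable dst: host blocks until done
  cudaMemcpyAsync(pinned, d_flag, sizeof(bool),
                  cudaMemcpyDeviceToHost);  // pinned dst: returns immediately

  cudaDeviceSynchronize();
  cudaFree(d_flag);
  cudaFreeHost(pinned);
  std::free(pageable);
  return 0;
}
```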
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| custom_ops/gpu_ops/set_stop.cu | New CUDA op file implementing the get_stop and set_stop functions for reading and setting the stop flag on the GPU |
| custom_ops/gpu_ops/update_inputs_v1.cu | Removes the synchronous GPU-to-CPU memory-copy code; not_need_stop is now used directly on the GPU |
| custom_ops/setup_ops.py | Adds the new set_stop.cu file to the build list |
| fastdeploy/worker/output.py | Adds the not_need_stop_gpu field to the ModelOutputData class |
| fastdeploy/worker/gpu_model_runner.py | Imports and uses the new get_stop and set_stop ops; initializes the pinned-memory and GPU tensors |
| fastdeploy/model_executor/pre_and_post_process.py | Updates the post-processing logic to use the GPU version of the stop flag and adds an asynchronous memory copy |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@          Coverage Diff           @@
##           develop   #6176  +/-  ##
=====================================
  Coverage         ?  66.96%
=====================================
  Files            ?     384
  Lines            ?   50769
  Branches         ?    7921
=====================================
  Hits             ?   33996
  Misses           ?   14290
  Partials         ?    2483
```

Flags with carried forward coverage won't be shown.
| #include "helper.h" | ||
|
|
||
| paddle::Tensor GetStop(paddle::Tensor& not_need_stop) { | ||
| bool* not_need_stop_data = const_cast<bool*>(not_need_stop.data<bool>()); |
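The hunk above is truncated, so the body of GetStop is not visible here. As a hedged sketch only (assuming Paddle's C++ custom-op API and the `copy_to` tensor method; this is not the PR's verified implementation), a get_stop op that exposes the device-resident flag to the host could look like:

```cpp
#include "paddle/extension.h"

// Illustrative sketch: read the GPU-resident flag back as a CPU tensor.
std::vector<paddle::Tensor> GetStopSketch(const paddle::Tensor &not_need_stop) {
  // blocking=true: the caller needs the value immediately; an async variant
  // would pair a non-blocking copy with a CUDA-event synchronization later.
  return {not_need_stop.copy_to(paddle::CPUPlace(), /*blocking=*/true)};
}

PD_BUILD_OP(get_stop)
    .Inputs({"not_need_stop"})
    .Outputs({"not_need_stop_cpu"})
    .SetKernelFn(PD_KERNEL(GetStopSketch));
```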
From `fastdeploy/model_executor/pre_and_post_process.py`:

```python
# Transmit the model's output and stop generation signal via message queue.
# In the future, we will abandon this approach.
if envs.FD_USE_GET_SAVE_OUTPUT_V1:
    if save_each_rank or model_output.mp_rank == 0:
```

> If the V1 path is taken, could there be a synchronization problem here?

Not tested yet; if the operation stays on the CPU, in theory it won't trigger a sync.
From `fastdeploy/model_executor/pre_and_post_process.py`:

```python
    model_output.seq_lens_decoder,
    model_output.step_idx,
)
share_inputs["preempted_idx"][:] = 0
```

> Why was this added here?

It was only moved: this operation previously ran at the end of post_process. Now that save_output has been split out, it is placed right after it; otherwise it would interfere with the scheduler's preemption logic.
From `fastdeploy/worker/gpu_model_runner.py`:

```diff
     self._process_mm_features(req_dicts)
     if has_prefill_task or has_decode_task:
-        self.share_inputs["not_need_stop"][0] = True
+        set_stop(self.share_inputs["not_need_stop"], True)
```

> Is this because modifying the tensor directly by index would affect its CPU/GPU place?

Yes, both direct writes and direct reads are affected; we will consult the framework team about this later.
EmmonsCurse left a comment:

LGTM for skip coverage~
Motivation
To enable GPU async scheduling in the execute_model stage, the original synchronous execution flow is split into three phases: pre-processing plus model execution, post-processing, and token_id return. At the same time, not_need_stop and sampled_token_ids are transferred via asynchronous GPU→CPU copies initiated in the post-processing phase, with the necessary synchronization performed via a CUDA event in the save_output phase, reducing scheduler-level blocking and improving CPU–GPU parallelism.
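A minimal plain-CUDA sketch of the handshake this paragraph describes (names are illustrative; the PR itself drives this through Paddle tensors and custom ops): post-processing enqueues the device-to-host copy and records an event, and save_output synchronizes on that event only when the value is consumed.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  bool *h_flag = nullptr, *d_flag = nullptr;
  cudaStream_t stream;
  cudaEvent_t copy_done;

  cudaStreamCreate(&stream);
  cudaEventCreate(&copy_done);
  cudaMallocHost(&h_flag, sizeof(bool));  // pinned buffer: the async copy is real
  cudaMalloc(&d_flag, sizeof(bool));
  cudaMemset(d_flag, 1, sizeof(bool));

  // Post-processing phase: enqueue the GPU->CPU copy and record an event,
  // then return to the scheduler without blocking.
  cudaMemcpyAsync(h_flag, d_flag, sizeof(bool), cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(copy_done, stream);

  // ... CPU-side scheduling work overlaps with the in-flight copy here ...

  // save_output phase: wait only at the point the flag is actually needed.
  cudaEventSynchronize(copy_done);
  printf("not_need_stop = %d\n", *h_flag);

  cudaFreeHost(h_flag);
  cudaFree(d_flag);
  cudaEventDestroy(copy_done);
  cudaStreamDestroy(stream);
  return 0;
}
```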
Modifications
- Split execute_model (behavior unchanged from the original logic):
- Added an execute_model_overlap interface (not active in this PR; preparatory only):
- Additions to share_inputs:
- Added get_stop and set_stop custom ops (see the hedged sketch after this list):
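As referenced in the last item, a hedged sketch of a set_stop-style op (assuming Paddle's C++ custom-op API, including `paddle::Tensor::stream()` and `SetInplaceMap`; names are illustrative, not the PR's verified code). A one-thread kernel writes the flag on the tensor's stream, so the host never blocks the way a Python `tensor[0] = True` assignment on a GPU tensor can:

```cpp
#include "paddle/extension.h"

__global__ void SetFlagKernel(bool *flag, bool value) { *flag = value; }

// Inplace op: writes `value` into the GPU-resident flag without a host sync.
void SetStopSketch(const paddle::Tensor &not_need_stop, bool value) {
  auto *flag = const_cast<bool *>(not_need_stop.data<bool>());
  // Ordered on the tensor's stream; the launch returns immediately on the host.
  SetFlagKernel<<<1, 1, 0, not_need_stop.stream()>>>(flag, value);
}

PD_BUILD_OP(set_stop)
    .Inputs({"not_need_stop"})
    .Outputs({"not_need_stop_out"})
    .Attrs({"value: bool"})
    .SetInplaceMap({{"not_need_stop", "not_need_stop_out"}})
    .SetKernelFn(PD_KERNEL(SetStopSketch));
```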
Usage or Command
Accuracy Tests
Checklist
- Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the [Cherry-Pick] PR tag.