Dispatch computation OPs before communication in standalone executor #47471
PR types
Performance optimization
PR changes
Others
Describe
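Background
PR #47030 added multi-stream overlap support for the c_allreduce_sum operator, but measurements on ResNet50_bs128_pure_fp16 with 4 nodes and 32 GPUs showed no noticeable performance gain. The timeline shows many gaps in the model where computation and communication do not actually overlap: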
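Problem analysis
After c_allreduce_sum is dispatched, the streams must be synchronized. The new executor currently inserts a depend OP after each c_allreduce_sum and uses it to associate and synchronize events. This multi-stream synchronization implicitly introduces indirect dependencies.
Consider a set of mutually independent OPs waiting to be dispatched: {c_allreduce_sum + depend, compute OP1, compute OP2}. If the compute OPs are dispatched first and c_allreduce_sum + depend afterwards, computation can overlap with communication. If c_allreduce_sum + depend is dispatched first instead, the synchronization between the compute stream and the communication stream introduced by the depend OP blocks the kernels of compute OP1 and compute OP2. Their computation can then start only after communication completes, which limits the overlap between communication and computation.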
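The effect of dispatch order can be illustrated with a toy two-stream timeline. This is only a sketch under simplifying assumptions that are not part of the PR: kernel launches are asynchronous and free, all inputs are ready at time 0, and the durations are made-up numbers. The function name `makespan` and its parameters are hypothetical.

```python
def makespan(order, durations, comm_ops=("c_allreduce_sum",)):
    """Return the finish time of a toy compute-stream / comm-stream timeline.

    `order` is the host dispatch order. Every communication OP is assumed to
    be followed by a depend OP that makes the compute stream wait for it.
    """
    compute_free = 0.0  # time at which the compute stream becomes idle
    comm_free = 0.0     # time at which the communication stream becomes idle
    for op in order:
        if op in comm_ops:
            # The communication kernel runs on its own stream.
            comm_end = comm_free + durations[op]
            comm_free = comm_end
            # depend OP: compute kernels enqueued later wait for comm_end.
            compute_free = max(compute_free, comm_end)
        else:
            # Compute kernels run back to back on the compute stream.
            compute_free += durations[op]
    return max(compute_free, comm_free)


# Hypothetical durations in arbitrary time units.
durations = {"c_allreduce_sum": 4.0, "op1": 2.0, "op2": 2.0}

comm_first = makespan(["c_allreduce_sum", "op1", "op2"], durations)
compute_first = makespan(["op1", "op2", "c_allreduce_sum"], durations)
print(comm_first, compute_first)
```

In this model, dispatching the communication OP first serializes everything behind the synchronization, while dispatching the compute OPs first lets both streams run concurrently.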
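ParallelExecutor (PE) was optimized specifically for this case (#28962). The core idea is that when both communication and compute OPs are ready to be dispatched, the compute OPs go first, so that communication OPs with stream-synchronizing behavior start after the compute OPs wherever possible, which maximizes the overlap. After this optimization, the overlap achieved by PE on ResNet50_bs128_pure_fp16 with 4 nodes and 32 GPUs is as follows: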
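Optimization results
This PR applies a similar optimization to the new executor. It introduces a scheduling priority for Instruction and gives communication OPs a lower priority than compute OPs, so that compute OPs are dispatched first.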
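A minimal sketch of priority-based dispatch, assuming a ready queue ordered by a per-instruction priority. The names `Priority`, `COMM_OPS`, `Instruction` and `dispatch_order` are hypothetical illustrations and do not reflect the actual classes or enum values in the executor.

```python
import heapq
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    NORMAL = 0  # compute OPs: popped first
    LOWEST = 1  # communication OPs: popped after all ready compute OPs


# Hypothetical set of OP types treated as communication.
COMM_OPS = {"c_allreduce_sum"}


@dataclass(frozen=True)
class Instruction:
    name: str

    @property
    def priority(self) -> Priority:
        return Priority.LOWEST if self.name in COMM_OPS else Priority.NORMAL


def dispatch_order(ready):
    """Return OP names in the order a priority-aware scheduler would pop them."""
    # The sequence number keeps the original order among equal priorities.
    heap = [(instr.priority, seq, instr) for seq, instr in enumerate(ready)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, instr = heapq.heappop(heap)
        order.append(instr.name)
    return order


ready = [Instruction("c_allreduce_sum"), Instruction("matmul"), Instruction("relu")]
order = dispatch_order(ready)
print(order)
```

Using a sequence number as the tie-breaker keeps the dispatch order stable for OPs of equal priority, so only the relative position of communication OPs changes.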
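Because no multi-node ResNet test environment is currently available, transformer_big_bs2560_fp32 on a single node with 2 GPUs is used as the example for now. The overlap before and after the optimization is as follows:
Measured results for ResNet50_bs128_pure_fp16 with 4 nodes and 32 GPUs will be added later if a multi-node test environment becomes available.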
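Future plans
This PR introduces a fairly preliminary operator scheduling priority in the new executor. If more complex multi-stream scheduling optimizations are needed later, the priority system will have to be refined and improved, including: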