Dispatch computation OPs before communication in standalone executor #47471

From00 · 2022-10-29T13:30:36Z

PR types

Performance optimization

PR changes

Others

Describe

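Background
PR #47030 added multi-stream overlap support for the c_allreduce_sum operator, but measurements on ResNet50_bs128_pure_fp16 with 4 machines and 32 GPUs showed no noticeable performance improvement. The timeline (a screenshot in the original PR) shows many gaps in the model where communication and computation do not actually overlap.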

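Problem analysis
After c_allreduce_sum is dispatched, multi-stream synchronization is required (the new executor currently inserts a depend OP after each c_allreduce_sum and uses that depend OP to associate and synchronize events). This multi-stream synchronization implicitly introduces indirect dependencies.
Consider a set of ready-to-dispatch OPs with no dependencies among them: {c_allreduce_sum + depend, compute OP1, compute OP2}. If the compute OPs are dispatched first and c_allreduce_sum + depend afterwards, computation can overlap with communication. Conversely, if c_allreduce_sum + depend is dispatched first, the synchronization between the compute stream and the communication stream introduced by the depend OP blocks the kernels of OP1 and OP2, so their computation can only start after the communication completes, which limits the overlap between communication and computation.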
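The effect of dispatch order can be illustrated with a toy two-stream timeline. This is only a sketch under simplifying assumptions: kernel launches are free and asynchronous, each stream runs its work in dispatch order, and the depend OP is modeled as making the compute stream wait for the communication stream. The function name `simulate` and the durations are made up for illustration and are not taken from Paddle.

```python
def simulate(order, durations):
    """Return the finish time of a toy compute/communication timeline."""
    compute_free = 0  # time at which the compute stream becomes idle
    comm_free = 0     # time at which the communication stream becomes idle
    for op in order:
        if op == "c_allreduce_sum+depend":
            # The collective runs on the communication stream...
            comm_free += durations[op]
            # ...and depend makes all later compute work wait for it.
            compute_free = max(compute_free, comm_free)
        else:
            # Compute kernels run back to back on the compute stream.
            compute_free += durations[op]
    return max(compute_free, comm_free)


# Hypothetical durations in arbitrary time units.
durations = {"c_allreduce_sum+depend": 4, "OP1": 3, "OP2": 2}
compute_first = simulate(["OP1", "OP2", "c_allreduce_sum+depend"], durations)
comm_first = simulate(["c_allreduce_sum+depend", "OP1", "OP2"], durations)
print(compute_first, comm_first)  # prints "5 9"
```

In this model, dispatching the compute OPs first hides the 4-unit communication behind 5 units of computation, whereas dispatching communication first serializes the two.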
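ParallelExecutor (PE) has a dedicated optimization for this case (#28962). Its core idea is that when both communication and compute OPs are ready to dispatch, compute OPs are dispatched first, so that communication OPs with stream-synchronization behavior launch after the compute OPs as far as possible, which keeps the overlap as large as possible. The overlap observed in PE on ResNet50_bs128_pure_fp16 with 4 machines and 32 GPUs after that optimization was shown as a timeline screenshot in the original PR.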

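Optimization results
This PR applies a similar optimization in the new executor. It introduces a scheduling priority for Instruction and gives communication OPs a lower priority than compute OPs, so that compute OPs are dispatched first.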
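A minimal sketch of this idea follows, using the head/tail insertion that the PR description mentions for its two-level scheme. The actual change lives in the C++ executor; the `Priority` enum, the Python `Instruction` class, `COMM_OP_TYPES`, and the helper functions here are hypothetical names chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    LOWEST = 0  # communication OPs
    NORMAL = 1  # everything else (compute OPs)


# Illustrative set only; not the executor's real classification logic.
COMM_OP_TYPES = {"c_allreduce_sum"}


@dataclass
class Instruction:
    op_type: str

    @property
    def priority(self) -> Priority:
        if self.op_type in COMM_OP_TYPES:
            return Priority.LOWEST
        return Priority.NORMAL


def enqueue(ready: deque, instr: Instruction) -> None:
    """Higher priority goes to the head, lower priority to the tail."""
    if instr.priority == Priority.NORMAL:
        ready.appendleft(instr)
    else:
        ready.append(instr)


def dispatch_order(instrs):
    """Enqueue all ready instructions, then pop from the head until empty."""
    ready = deque()
    for instr in instrs:
        enqueue(ready, instr)
    order = []
    while ready:
        order.append(ready.popleft().op_type)
    return order


order = dispatch_order(
    [Instruction("c_allreduce_sum"), Instruction("matmul_v2"), Instruction("relu")]
)
print(order)  # prints "['relu', 'matmul_v2', 'c_allreduce_sum']"
```

Even though the communication OP became ready first, it is dispatched last. Note that head insertion makes compute OPs come out in last-in-first-out order among themselves, which is harmless here because the ready OPs are mutually independent.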

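Because no multi-machine ResNet test environment is currently available, transformer_big_bs2560_fp32 on a single machine with 2 GPUs is used as the example for now. The overlap before and after the optimization was compared in timeline screenshots in the original PR.

If a multi-machine test environment becomes available later, measured results for ResNet50_bs128_pure_fp16 with 4 machines and 32 GPUs will be added.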

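Future plans
This PR introduces a fairly preliminary operator scheduling priority in the new executor. If more complex multi-stream scheduling optimizations are needed later, the priority system will have to be refined and improved, including:

Introduce multi-level priorities: in practice there are currently only two priority levels, implemented by handling instructions differently when they enter the scheduling queue (higher priority is inserted at the head of the queue, lower priority at the tail). If this is extended to multiple levels, the double-ended queue used for scheduling needs to be upgraded to a priority queue. Since a priority queue may add significant enqueue and dequeue overhead, the scheduling sequence may also need to be cached.
Set priorities for more operators: currently only communication operators have a planned priority. Later, priorities should be considered for more OPs, including copy OPs and OPs with a custom stream (Support custom stream for standalone executor #47411), or, once a cost model is introduced, global OP priority planning based on cost model information.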
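The multi-level extension described in the first item could look roughly like the sketch below. It is not part of this PR: the `PriorityScheduler` class, the three priority levels, and the OP names are assumptions for illustration, and Python's `heapq` stands in for whatever priority queue the C++ executor would use. A sequence counter keeps first-in-first-out order among instructions with equal priority.

```python
import heapq
import itertools


class PriorityScheduler:
    """Toy multi-level ready queue: larger priority value is dispatched first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal priorities

    def push(self, op_type, priority):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), op_type))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Hypothetical levels: compute > copy > communication.
COMPUTE, COPY, COMM = 2, 1, 0

scheduler = PriorityScheduler()
scheduler.push("c_allreduce_sum", COMM)
scheduler.push("memcpy_d2h", COPY)
scheduler.push("matmul_v2", COMPUTE)
scheduler.push("relu", COMPUTE)

schedule = [scheduler.pop() for _ in range(len(scheduler))]
print(schedule)  # prints "['matmul_v2', 'relu', 'memcpy_d2h', 'c_allreduce_sum']"
```

Each push and pop costs O(log n) instead of the O(1) of a double-ended queue, which is the overhead the description has in mind when it suggests caching the scheduling sequence.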

paddle-bot · 2022-10-29T13:30:40Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.


zhiqiu

LGTM

Dispath computation OPs before communication in standalone executor

33de673

From00 added 2 commits October 31, 2022 03:56

Update code

7f5a749

Fix CI errors

f2c5b4c

From00 requested a review from zhiqiu November 1, 2022 06:03

From00 changed the title ~~Dispath computation OPs before communication in standalone executor~~ Dispatch computation OPs before communication in standalone executor Nov 1, 2022

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

22f0992

… dispatch-computation-before-communication-in-standalone-executor

zhiqiu approved these changes Nov 2, 2022


From00 merged commit 5ed487b into PaddlePaddle:develop Nov 2, 2022

From00 deleted the dispatch-computation-before-communication-in-standalone-executor branch April 5, 2023 09:09
