Dispatch computation OPs before communication in standalone executor #47471

Contributor

@From00 From00 commented Oct 29, 2022

PR types

Performance optimization

PR changes

Others

Describe

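Background
PR #47030 added multi-stream overlap support for the c_allreduce_sum operator, but measurements on ResNet50_bs128_pure_fp16 with 4 nodes / 32 cards showed no noticeable performance gain. The timeline shows many gaps in the model where communication and computation do not actually overlap: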
image

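Problem analysis
After c_allreduce_sum is dispatched, a multi-stream synchronization is required (the new executor currently inserts a depend OP after every c_allreduce_sum and uses that depend OP to associate and synchronize events). This multi-stream synchronization implicitly introduces indirect dependencies.
Consider a set of ready OPs with no dependencies among them: {c_allreduce_sum + depend, computation OP1, computation OP2}. If the computation OPs are dispatched first and c_allreduce_sum + depend afterwards, computation can overlap with communication. Conversely, if c_allreduce_sum + depend is dispatched first, the synchronization between the computation stream and the communication stream introduced by the depend OP blocks the kernels of OP1 and OP2, so their computation can only start after the communication has finished, which limits the overlap of communication and computation.
ParallelExecutor (PE) was optimized specifically for this case (#28962). The core idea is that when both communication and computation OPs are ready to be dispatched, the computation OPs are dispatched first, so that communication OPs, which carry stream synchronization, start after the computation OPs as far as possible and the overlap is as large as possible. After that optimization, the overlap in PE for ResNet50_bs128_pure_fp16 with 4 nodes / 32 cards is as follows: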
image
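
To make the dispatch-order argument above concrete, here is a minimal toy model, not taken from the PR. It assumes kernel launches are asynchronous and instantaneous, that there is one computation stream and one communication stream, and that a depend OP simply makes the computation stream wait for the communication stream. The function name `simulate` and all durations are hypothetical.

```python
def simulate(order):
    """Toy two-stream timeline (illustrative assumption, not executor code).

    `order` is the dispatch order, a list of (kind, duration) pairs where
    kind is "compute", "comm" or "depend". Returns the time at which all
    work on both streams has finished.
    """
    compute_ready = 0.0  # time at which the computation stream is free
    comm_ready = 0.0     # time at which the communication stream is free
    for kind, duration in order:
        if kind == "compute":
            # The kernel runs as soon as the computation stream is free.
            compute_ready += duration
        elif kind == "comm":
            # The collective runs on its own stream.
            comm_ready += duration
        elif kind == "depend":
            # The computation stream waits for the communication stream.
            compute_ready = max(compute_ready, comm_ready)
        else:
            raise ValueError(f"unknown op kind: {kind}")
    return max(compute_ready, comm_ready)


# Made-up durations: two computation OPs of 3 units, one allreduce of 5 units.
compute_first = [("compute", 3.0), ("compute", 3.0), ("comm", 5.0), ("depend", 0.0)]
comm_first = [("comm", 5.0), ("depend", 0.0), ("compute", 3.0), ("compute", 3.0)]

print(simulate(compute_first), simulate(comm_first))
```

With these made-up durations the computation-first order finishes at 6 time units, while the communication-first order finishes at 11, because the depend OP delays both computation kernels until the allreduce is done.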

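Optimization results
This PR applies a similar optimization to the new executor. It introduces a scheduling priority for Instruction and gives communication OPs a lower priority than computation OPs, so that computation OPs are dispatched first.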
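
The two-level scheme this PR describes (higher priority inserted at the head of the scheduling queue, lower priority at the tail) can be sketched roughly as follows. This is an illustration in Python, not the executor's actual code: `Priority`, `Instruction`, `enqueue` and `schedule` are hypothetical names, and only the "communication OPs get the lower priority" rule comes from the PR.

```python
from collections import deque
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    LOWEST = 0  # assumed level for communication OPs such as c_allreduce_sum
    NORMAL = 1  # assumed level for computation OPs


@dataclass
class Instruction:
    name: str
    priority: Priority = Priority.NORMAL


def enqueue(ready, instr):
    """Put a ready instruction into the double-ended scheduling queue."""
    if instr.priority == Priority.LOWEST:
        ready.append(instr)      # lower priority goes to the tail
    else:
        ready.appendleft(instr)  # higher priority goes to the head


def schedule(instructions):
    """Enqueue all ready instructions, then pop from the head until empty."""
    ready = deque()
    for instr in instructions:
        enqueue(ready, instr)
    order = []
    while ready:
        order.append(ready.popleft().name)
    return order


ready_ops = [
    Instruction("c_allreduce_sum", Priority.LOWEST),
    Instruction("op1"),
    Instruction("op2"),
]
print(schedule(ready_ops))
```

In this sketch the communication OP is always popped last, even though it became ready first. Note that head insertion makes the computation OPs come out in last-in, first-out order among themselves.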

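Since no multi-node ResNet test environment is currently available, transformer_big_bs2560_fp32 on a single node with 2 cards is used as an example for now. The overlap before and after the optimization is as follows: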
image
image

If a multi-node test environment becomes available later, measured results for ResNet50_bs128_pure_fp16 with 4 nodes / 32 cards will be added.

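Future plans
This PR introduces a fairly preliminary operator scheduling priority in the new executor. If more complex multi-stream scheduling optimizations are needed later, the priority system will have to be improved and refined, including:

  • Introduce multi-level priorities: there are currently only 2 priority levels in practice, and priority scheduling is implemented by treating them differently when they are enqueued into the scheduling queue (higher priority is inserted at the head of the queue, lower priority at the tail). If this is extended to multiple levels later, the double-ended queue used for scheduling will need to be upgraded to a priority queue. Since a priority queue may bring significant enqueue and dequeue overhead, the scheduling sequence may also need to be cached.
  • Set priorities for more operators: currently only communication operators are assigned a priority. Later, priorities should be planned for more OPs, including copy OPs and OPs with a custom stream (Support custom stream for standalone executor #47411), or a global OP priority plan could be derived from cost model information once a cost model is introduced.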
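
As a rough illustration of the multi-level direction in the first item above, the double-ended queue could be replaced by a binary-heap priority queue. The sketch below is only an assumption about how that might look: the class `ReadyQueue`, the OP names, and the three priority levels (communication below copy below computation) are hypothetical and are not part of this PR.

```python
import heapq
import itertools


class ReadyQueue:
    """Multi-level priority queue for ready OPs (hypothetical sketch).

    A larger priority value is dispatched earlier. OPs with equal priority
    are dispatched in the order they were pushed.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, name, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Pop every OP and return the resulting dispatch order."""
    order = []
    while len(queue) > 0:
        order.append(queue.pop())
    return order


queue = ReadyQueue()
queue.push("c_allreduce_sum", 0)  # assumed: communication, lowest level
queue.push("memcpy_h2d", 1)       # assumed: copy OP, middle level
queue.push("matmul", 2)           # assumed: computation, highest level
queue.push("relu", 2)
dispatch_order = drain(queue)
print(dispatch_order)
```

Each push and pop on the heap costs O(log n) instead of the O(1) of a double-ended queue, which is the overhead the item above refers to when it suggests caching the scheduling sequence.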

@paddle-bot

paddle-bot bot commented Oct 29, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@From00 From00 requested a review from zhiqiu November 1, 2022 06:03
@From00 From00 changed the title from "Dispath computation OPs before communication in standalone executor" to "Dispatch computation OPs before communication in standalone executor" Nov 1, 2022
… dispatch-computation-before-communication-in-standalone-executor
Contributor

@zhiqiu zhiqiu left a comment

LGTM

@From00 From00 merged commit 5ed487b into PaddlePaddle:develop Nov 2, 2022
@From00 From00 deleted the dispatch-computation-before-communication-in-standalone-executor branch April 5, 2023 09:09