PR types
Performance optimization
PR changes
Others
Describe
Fix cases where computation and communication fail to overlap when communication runs synchronously.
Background
While profiling ResNet, we found several cases where fused_allreduce communication could not overlap with computation:
![image](https://user-images.githubusercontent.com/10208305/100075953-9a544080-2e7b-11eb-821a-485146bcb202.png)
Locating the problem
The ops' topological dependencies are correct. Further experiments:
![image](https://user-images.githubusercontent.com/10208305/100076218-e99a7100-2e7b-11eb-9f36-5678dd0b1199.png)
![image](https://user-images.githubusercontent.com/10208305/100076402-20708700-2e7c-11eb-9a37-c135f9a4908a.png)
1. With sync_nccl_allreduce disabled, communication and computation overlap.
2. After adjusting the fused_allreduce size, communication overlaps with computation at certain sizes.
Since the topological dependencies are confirmed correct, the problem must lie in the scheduler's execution. Looking at the op-scheduling section of the timeline, one thread runs all the ops while the other thread is never handed any work.
To locate the issue in the executor, we enabled execution logging and observed the following topological dependencies.
According to the source code, scheduling proceeds as follows:
1. conv2d_grad executes; on completion, three successor ops reach in-degree 0 and become schedulable. Because eager_deletion and buffer_share have highest priority, they are inserted at the front of the queue for preferential scheduling; the queue is now (eager_deletion, buffer_share, fused_allreduce).
2. eager_deletion executes; it has no successors. The queue is now (buffer_share, fused_allreduce).
3. buffer_share executes; on completion, relu_grad reaches in-degree 0 and becomes schedulable. Per the code, it is pushed into the queue rather than dispatched immediately. The queue is now (fused_allreduce, relu_grad).
4. fused_allreduce executes. With synchronous allreduce enabled, relu_grad cannot be scheduled until the allreduce finishes, even though the two ops have no topological dependency.
5. relu_grad executes and schedules its successors.
Under this linear scheduling, whenever these ops all lie on the critical path, computation and communication fail to overlap.
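The five steps above can be reproduced with a small simulation. This is a simplified single-consumer FIFO model of the executor's ready queue, not the actual C++ executor code; the op names and in-degrees follow the trace above.

```python
from collections import deque

# Successor lists from the trace above: completing an op decrements the
# in-degree of each listed successor.
successors = {
    "conv2d_grad": ["eager_deletion", "buffer_share", "fused_allreduce"],
    "eager_deletion": [],
    "buffer_share": ["relu_grad"],
    "fused_allreduce": [],
    "relu_grad": [],
}

def linear_schedule():
    """FIFO single-consumer schedule: every op, including the synchronous
    fused_allreduce, holds the queue until it finishes."""
    in_degree = {"conv2d_grad": 0, "eager_deletion": 1, "buffer_share": 1,
                 "fused_allreduce": 1, "relu_grad": 1}
    order = []
    ready = deque(["conv2d_grad"])
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in successors[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
    return order

print(linear_schedule())
# ['conv2d_grad', 'eager_deletion', 'buffer_share', 'fused_allreduce', 'relu_grad']
```

In this order relu_grad only starts after fused_allreduce completes, although the two ops have no topological dependency between them.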
Scheduling optimizations
1. Overlap computation with communication (commit0)
![image](https://user-images.githubusercontent.com/10208305/100076704-75ac9880-2e7c-11eb-9097-f08373964295.png)
![image](https://user-images.githubusercontent.com/10208305/100076921-b2788f80-2e7c-11eb-8d86-34c5190406d3.png)
When the queue looks like (fuse_allreduce, relu_grad), dispatch all the ops behind fuse_allreduce first, so the subsequent compute ops can overlap with communication.
Profiling confirms the fix behaves as expected.
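The fix can be sketched as follows. This is an illustrative model, not the executor's real C++ implementation: `launch_order` returns the order in which ops are handed out, with ops queued behind a synchronous communication op dispatched (e.g. to worker threads) before the communication op blocks the current thread.

```python
from collections import deque

# Assumption for the sketch: comm ops are identified by name.
COMM_OPS = {"fused_allreduce"}

def launch_order(queue):
    """Return the order in which ops are launched. Ops sitting behind a
    synchronous comm op are dispatched first, so compute can overlap
    with the blocking allreduce."""
    launched = []
    q = deque(queue)
    while q:
        op = q.popleft()
        if op in COMM_OPS:
            launched.extend(q)  # hand trailing compute ops out first
            q.clear()
        launched.append(op)
    return launched

print(launch_order(["fused_allreduce", "relu_grad"]))
# ['relu_grad', 'fused_allreduce']
```

With the problematic queue from the trace, relu_grad is now launched before the thread blocks on fused_allreduce, so the two run concurrently.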
2. Allreduce scheduling optimization. In the timeline above, the sections marked 1 show compute ops being scheduled before communication ops. We therefore changed the scheduler to dispatch communication ops ahead of compute ops.
The resulting timeline shows allreduce now scheduled before the compute ops.
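The reordering can be sketched with a stable sort over the ready ops. Again this is an illustrative sketch, assuming comm ops are identified by name; the real executor expresses this as op priorities rather than a sort.

```python
# Hypothetical set of communication op names for illustration.
COMM_OPS = {"fused_allreduce", "allreduce"}

def comm_first(ready_ops):
    """Order newly-ready ops so communication launches before compute,
    letting the NCCL kernel start as early as possible. Python's sort
    is stable, so relative order within each group is preserved."""
    return sorted(ready_ops, key=lambda op: op not in COMM_OPS)

print(comm_first(["relu_grad", "fused_allreduce", "conv2d_grad"]))
# ['fused_allreduce', 'relu_grad', 'conv2d_grad']
```

Launching the allreduce first lets its kernel run on the communication stream while the compute thread proceeds with relu_grad and conv2d_grad.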
Performance tests
V100 32 GB machines. Because the test cluster contains machines with different CPU configurations, we ran different branches on the same job to rule out machine-configuration effects.
ResNet50
fp16
fp32
With 1-2 machines, communication time is low and performance is essentially flat. With 4 machines, communication takes longer and there is a small but consistent performance gain.
BERT base, seq_len=128
fp16
fp32
Because BERT's graph has many branches, the ResNet-style non-overlap case does not arise, and performance is essentially unchanged.