
optimize fast graph executor #28962

Merged: 2 commits into PaddlePaddle:develop from opt_fast_exec on Nov 26, 2020

Conversation

@wangxicoding (Contributor) commented on Nov 22, 2020

PR types

Performance optimization

PR changes

Others

Describe

Fix cases where computation and communication fail to overlap when communication runs in synchronous mode.

Background

While profiling ResNet performance, we found several cases where fuse_allreduce communication could not overlap with computation.
[profiler timeline screenshot]

Problem diagnosis

The topological dependencies between ops are correct. Further experiments:
1. With sync_nccl_allreduce disabled, communication and computation overlap.
2. After adjusting the fuse_allreduce size, communication overlaps with computation at some sizes.
So the topological dependencies are fine and the problem should be in the scheduler. Looking at the op-scheduling part of the timeline, only the compute thread is scheduling ops; no second scheduling thread is active.
[profiler timeline screenshot]
To locate the problem in the executor, we enabled the execution log and found the following topological dependencies.
[op dependency graph from the execution log]
According to the source code, scheduling proceeds as follows (a minimal sketch of this pattern is given after the list):
1. conv2d_grad runs. When it finishes, three successor ops reach in-degree 0 and become schedulable. Because eager_deletion and buffer_share have the highest priority, they are inserted at the front of the queue, which then holds (eager_deletion, buffer_share, fused_allreduce).
2. eager_deletion runs; it has no successors. The queue now holds (buffer_share, fused_allreduce).
3. buffer_share runs. When it finishes, relu_grad reaches in-degree 0 and becomes schedulable, but according to the code it is pushed into the queue rather than dispatched directly. The queue now holds (fused_allreduce, relu_grad).
4. fused_allreduce runs. With synchronous allreduce enabled, relu_grad cannot be scheduled until the allreduce completes, even though the two ops have no topological dependency.
5. relu_grad runs and schedules its successor ops.
With this linear scheduling, whenever these ops all lie on the critical path, computation and communication do not overlap.
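
The pattern above can be illustrated with a minimal, self-contained C++ sketch. The types and names (`OpNode`, `is_sync_comm`) are hypothetical and this is not the actual Paddle executor code; the sketch only reproduces the queue state after step 3 and the strictly serial consumption that leaves relu_grad stuck behind the synchronous allreduce.

```cpp
#include <deque>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical op record: name, whether it is a blocking (synchronous)
// communication op, and the kernel to run.
struct OpNode {
  std::string name;
  bool is_sync_comm;
  std::function<void()> run;
};

int main() {
  // Queue state right after step 3 above: relu_grad already has in-degree 0
  // but sits behind the synchronous fused_allreduce.
  std::deque<OpNode> ready = {
      {"fused_allreduce", true, [] { std::cout << "allreduce (blocking)\n"; }},
      {"relu_grad", false, [] { std::cout << "relu_grad\n"; }},
  };

  // The problematic pattern: strictly serial consumption of the ready queue.
  // relu_grad only starts after the synchronous allreduce returns, so
  // computation and communication never overlap here.
  while (!ready.empty()) {
    OpNode op = ready.front();
    ready.pop_front();
    op.run();
  }
  return 0;
}
```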

Scheduling optimization

1. Overlap computation with communication (commit 0).
When the queue looks like (fuse_allreduce, relu_grad), dispatch all ops queued behind fuse_allreduce before running it, so that the subsequent compute ops can overlap with the communication (see the sketch after this item).
Profiling after the change shows no problem.
[profiler timeline screenshot]
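
A minimal sketch of the commit-0 idea under the same hypothetical queue model (again, not the actual Paddle code): before running a blocking communication op, launch the independent ops queued behind it on worker threads, so computation proceeds while the allreduce blocks.

```cpp
#include <deque>
#include <functional>
#include <future>
#include <iostream>
#include <string>
#include <vector>

struct OpNode {
  std::string name;
  bool is_sync_comm;            // blocking communication op, e.g. sync allreduce
  std::function<void()> run;
};

// When a blocking communication op reaches the head of the queue, first
// dispatch everything queued behind it asynchronously, then run the
// communication op; the dispatched compute ops now overlap with it.
void RunQueue(std::deque<OpNode> ready) {
  std::vector<std::future<void>> inflight;
  while (!ready.empty()) {
    OpNode op = ready.front();
    ready.pop_front();
    if (op.is_sync_comm) {
      while (!ready.empty()) {
        inflight.push_back(std::async(std::launch::async, ready.front().run));
        ready.pop_front();
      }
    }
    op.run();
  }
  for (auto& f : inflight) f.get();  // wait for the dispatched ops
}

int main() {
  RunQueue({{"fused_allreduce", true, [] { std::cout << "allreduce\n"; }},
            {"relu_grad", false, [] { std::cout << "relu_grad\n"; }}});
  return 0;
}
```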
2. Allreduce scheduling optimization. In the timeline above, the regions marked 1 always schedule compute ops before communication ops, so this change schedules communication ops ahead of compute ops (see the sketch after this item).
The timeline now shows allreduce being scheduled ahead of the compute ops.
[profiler timeline screenshot]
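
A sketch of the second change under the same assumptions; `ReadyOp` and `Enqueue` are hypothetical names used only for illustration. Newly ready communication ops are placed ahead of compute ops in the queue, so the allreduce is issued first and its communication overlaps with the compute ops launched right after it.

```cpp
#include <deque>
#include <iostream>
#include <string>

// Hypothetical ready-op record.
struct ReadyOp {
  std::string name;
  bool is_comm;  // communication op (e.g. fused_allreduce)
};

// Give communication ops priority: they go to the front of the ready queue,
// compute ops go to the back.
void Enqueue(std::deque<ReadyOp>* queue, const ReadyOp& op) {
  if (op.is_comm) {
    queue->push_front(op);
  } else {
    queue->push_back(op);
  }
}

int main() {
  std::deque<ReadyOp> queue;
  // Ops becoming ready in this order; without the priority rule the compute
  // op that arrived first would also be scheduled first.
  Enqueue(&queue, {"relu_grad", false});
  Enqueue(&queue, {"fused_allreduce", true});
  Enqueue(&queue, {"conv2d_grad", false});

  // Prints fused_allreduce, relu_grad, conv2d_grad: the allreduce is issued
  // before the compute ops and can overlap with them.
  for (const auto& op : queue) std::cout << op.name << "\n";
  return 0;
}
```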

Performance tests

Tests run on V100 32G machines. Because the test cluster contains machines with different CPU configurations, the two branches are compared within the same job to rule out the effect of machine configuration.

ResNet50 tests

FP16

| nodes*GPUs | batch_size | develop speed | PR speed | speedup |
| --- | --- | --- | --- | --- |
| 1*1 | 256 | 1044.718 | 1042.597 | -0.20% |
| 1*1 | 128 | 1040.754 | 1046.252 | 0.53% |
| 1*1 | 64 | 947.155 | 952.221 | 0.53% |
| 1*8 | 256 | 8922.831 | 8940.075 | 0.19% |
| 1*8 | 128 | 8149.6 | 8170.382 | 0.26% |
| 1*8 | 64 | 6942.673 | 6942.018 | -0.01% |
| 2*8 | 256 | 15840.077 | 15921.634 | 0.51% |
| 2*8 | 128 | 14198.233 | 14006.84 | -1.35% |
| 2*8 | 64 | 10961.613 | 10949.08 | -0.11% |
| 4*8 | 256 | 31267.758 | 31348.686 | 0.26% |
| 4*8 | 128 | 28694.384 | 28787.301 | 0.32% |
| 4*8 | 64 | 22064.946 | 22127.404 | 0.28% |

FP32

| nodes*GPUs | batch_size | develop speed | PR speed | speedup |
| --- | --- | --- | --- | --- |
| 1*1 | 128 | 366.581 | 364.762 | -0.50% |
| 1*1 | 64 | 350.231 | 349.72 | -0.15% |
| 1*1 | 32 | 322.401 | 323.364 | 0.30% |
| 1*8 | 128 | 2789.791 | 2785.923 | -0.14% |
| 1*8 | 64 | 2635.64 | 2627.657 | -0.30% |
| 1*8 | 32 | 2471.258 | 2475.726 | 0.18% |
| 2*8 | 128 | 5679.561 | 5671.237 | -0.15% |
| 2*8 | 64 | 5286.533 | 5296.494 | 0.19% |
| 2*8 | 32 | 4414.319 | 4489.354 | 1.70% |
| 4*8 | 128 | 11152.139 | 11162.055 | 0.09% |
| 4*8 | 64 | 10321.324 | 10346.875 | 0.25% |
| 4*8 | 32 | 8599.532 | 8632.848 | 0.39% |

On 1-2 nodes, communication time is small and performance is essentially flat. On 4 nodes, where communication takes longer, there is a consistent slight improvement.

BERT base, seq_len=128

FP16

| nodes*GPUs | batch_size | develop speed | PR speed | speedup |
| --- | --- | --- | --- | --- |
| 1*1 | 160 | 270.965 | 271.751 | 0.29% |
| 1*1 | 96 | 275.518 | 276.309 | 0.29% |
| 1*1 | 64 | 257.062 | 259.251 | 0.85% |
| 1*8 | 160 | 1785.138 | 1784.106 | -0.06% |
| 1*8 | 96 | 1745.656 | 1744.256 | -0.08% |
| 1*8 | 64 | 1650.039 | 1642.526 | -0.46% |
| 2*8 | 160 | 3459.2 | 3440.833 | -0.53% |
| 2*8 | 96 | 3223.461 | 3215.489 | -0.25% |
| 2*8 | 64 | 3018.386 | 3020.835 | 0.08% |
| 4*8 | 160 | 6945.281 | 6946.192 | 0.01% |
| 4*8 | 96 | 6473.703 | 6493.536 | 0.31% |
| 4*8 | 64 | 6031.451 | 6047.765 | 0.27% |

FP32

| nodes*GPUs | batch_size | develop speed | PR speed | speedup |
| --- | --- | --- | --- | --- |
| 1*1 | 96 | 140.07 | 139.895 | -0.12% |
| 1*1 | 64 | 137.7 | 137.892 | 0.14% |
| 1*8 | 96 | 1105.955 | 1105.08 | -0.08% |
| 1*8 | 64 | 1074.707 | 1075.406 | 0.07% |
| 2*8 | 96 | 2100.927 | 2102.287 | 0.06% |
| 2*8 | 64 | 1984.305 | 1975.174 | -0.46% |
| 4*8 | 96 | 4199.608 | 4184.235 | -0.37% |
| 4*8 | 64 | 3924.045 | 3937.16 | 0.33% |

Because the BERT graph has many branch points, the non-overlap situation seen in ResNet does not occur, and performance is essentially unaffected.

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@gongweibao (Contributor) left a comment

LGTM

@wangxicoding wangxicoding merged commit 173c22a into PaddlePaddle:develop Nov 26, 2020
QingshuChen pushed a commit to QingshuChen/Paddle that referenced this pull request Nov 30, 2020
@wangxicoding wangxicoding deleted the opt_fast_exec branch August 6, 2021 10:29