PR types
Performance optimization
PR changes
Others
Describe
Fix cases where computation and communication fail to overlap when communication runs synchronously.
Background
While profiling ResNet, we found several cases where fused_allreduce communication could not overlap with computation:
![image](https://user-images.githubusercontent.com/10208305/100075953-9a544080-2e7b-11eb-821a-485146bcb202.png)
Locating the problem
The ops' topological dependencies are correct. Further experiments:
![image](https://user-images.githubusercontent.com/10208305/100076218-e99a7100-2e7b-11eb-9f36-5678dd0b1199.png)
![image](https://user-images.githubusercontent.com/10208305/100076402-20708700-2e7c-11eb-9a37-c135f9a4908a.png)
1. With sync_nccl_allreduce disabled, communication and computation overlap.
2. After adjusting the fused_allreduce size, communication overlaps with computation at certain sizes.
Since the topological dependencies are confirmed correct, the problem must lie in the scheduler's execution. Looking at the op-scheduling section of the timeline, one thread runs all the ops while the other thread is never handed any work.
To locate the issue in the executor, we enabled execution logging and observed the following topological dependencies.
According to the source code, scheduling proceeds as follows:
1. conv2d_grad executes; on completion, three successor ops reach in-degree 0 and become schedulable. Because eager_deletion and buffer_share have highest priority, they are inserted at the front of the queue for preferential scheduling; the queue is now (eager_deletion, buffer_share, fused_allreduce).
2. eager_deletion executes; it has no successors. The queue is now (buffer_share, fused_allreduce).
3. buffer_share executes; on completion, relu_grad reaches in-degree 0 and becomes schedulable. Per the code, it is pushed into the queue rather than dispatched immediately. The queue is now (fused_allreduce, relu_grad).
4. fused_allreduce executes. With synchronous allreduce enabled, relu_grad cannot be scheduled until the allreduce finishes, even though the two ops have no topological dependency.
5. relu_grad executes and schedules its successors.
Under this linear scheduling, whenever these ops all lie on the critical path, computation and communication fail to overlap.
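The five steps above can be reproduced with a small simulation. This is a simplified single-consumer FIFO model of the executor's ready queue, not the actual C++ executor code; the op names and in-degrees follow the trace above.

```python
from collections import deque

# Successor lists from the trace above: completing an op decrements the
# in-degree of each listed successor.
successors = {
    "conv2d_grad": ["eager_deletion", "buffer_share", "fused_allreduce"],
    "eager_deletion": [],
    "buffer_share": ["relu_grad"],
    "fused_allreduce": [],
    "relu_grad": [],
}

def linear_schedule():
    """FIFO single-consumer schedule: every op, including the synchronous
    fused_allreduce, holds the queue until it finishes."""
    in_degree = {"conv2d_grad": 0, "eager_deletion": 1, "buffer_share": 1,
                 "fused_allreduce": 1, "relu_grad": 1}
    order = []
    ready = deque(["conv2d_grad"])
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in successors[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
    return order

print(linear_schedule())
# ['conv2d_grad', 'eager_deletion', 'buffer_share', 'fused_allreduce', 'relu_grad']
```

In this order relu_grad only starts after fused_allreduce completes, although the two ops have no topological dependency between them.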
Scheduling optimizations
1. Overlap computation with communication (commit0)
![image](https://user-images.githubusercontent.com/10208305/100076704-75ac9880-2e7c-11eb-9097-f08373964295.png)
![image](https://user-images.githubusercontent.com/10208305/100076921-b2788f80-2e7c-11eb-8d86-34c5190406d3.png)
When the queue looks like (fuse_allreduce, relu_grad), dispatch all the ops behind fuse_allreduce first, so the subsequent compute ops can overlap with communication.
Profiling confirms the fix behaves as expected.
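The fix can be sketched as follows. This is an illustrative model, not the executor's real C++ implementation: `launch_order` returns the order in which ops are handed out, with ops queued behind a synchronous communication op dispatched (e.g. to worker threads) before the communication op blocks the current thread.

```python
from collections import deque

# Assumption for the sketch: comm ops are identified by name.
COMM_OPS = {"fused_allreduce"}

def launch_order(queue):
    """Return the order in which ops are launched. Ops sitting behind a
    synchronous comm op are dispatched first, so compute can overlap
    with the blocking allreduce."""
    launched = []
    q = deque(queue)
    while q:
        op = q.popleft()
        if op in COMM_OPS:
            launched.extend(q)  # hand trailing compute ops out first
            q.clear()
        launched.append(op)
    return launched

print(launch_order(["fused_allreduce", "relu_grad"]))
# ['relu_grad', 'fused_allreduce']
```

With the problematic queue from the trace, relu_grad is now launched before the thread blocks on fused_allreduce, so the two run concurrently.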
2. Allreduce scheduling optimization. In the timeline above, the sections marked 1 show compute ops being scheduled before communication ops. We therefore changed the scheduler to dispatch communication ops ahead of compute ops.
The resulting timeline shows allreduce now scheduled before the compute ops.
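The reordering can be sketched with a stable sort over the ready ops. Again this is an illustrative sketch, assuming comm ops are identified by name; the real executor expresses this as op priorities rather than a sort.

```python
# Hypothetical set of communication op names for illustration.
COMM_OPS = {"fused_allreduce", "allreduce"}

def comm_first(ready_ops):
    """Order newly-ready ops so communication launches before compute,
    letting the NCCL kernel start as early as possible. Python's sort
    is stable, so relative order within each group is preserved."""
    return sorted(ready_ops, key=lambda op: op not in COMM_OPS)

print(comm_first(["relu_grad", "fused_allreduce", "conv2d_grad"]))
# ['fused_allreduce', 'relu_grad', 'conv2d_grad']
```

Launching the allreduce first lets its kernel run on the communication stream while the compute thread proceeds with relu_grad and conv2d_grad.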
Performance tests
V100 32 GB machines. Because the test cluster contains machines with different CPU configurations, we ran different branches on the same job to rule out machine-configuration effects.
ResNet50
fp16
fp32
With 1-2 machines, communication time is low and performance is essentially flat. With 4 machines, communication takes longer and there is a small but consistent performance gain.
BERT base, seq_len=128
fp16
fp32
Because BERT's graph has many branches, the ResNet-style non-overlap case does not arise, and performance is essentially unchanged.