
[performance] Schedule send/recv op from GPU0 to all devices to overlap memcpy #11143

Closed
Yancey1989 opened this issue Jun 4, 2018 · 1 comment

Comments

@Yancey1989
Contributor

Yancey1989 commented Jun 4, 2018

In the current distributed training implementation, all send/recv ops are scheduled on GPU0. I have run some experiments that schedule the send/recv ops on different devices so the memcpy can overlap.

This feature improves performance by about 9%.

Setup: 2 trainers + 4 pservers, 8 GPUs per trainer, ParallelExecutor (num_threads=1)

1. develop branch

Pass = 0, Elapsed = 160, Training performance = 38.328424 imgs/s, Train accuracy = 0.011654, Test accuracy = 0.011765
Pass = 1, Elapsed = 156, Training performance = 39.400293 imgs/s, Train accuracy = 0.009553, Test accuracy = 0.009804

2. overlap memcpy branch

Pass = 0, Elapsed = 146, Training performance = 41.994200 imgs/s, Train accuracy = 0.010613, Test accuracy = 0.009649
Pass = 1, Elapsed = 144, Training performance = 42.671882 imgs/s, Train accuracy = 0.008629, Test accuracy = 0.010768

Please see the latest experiment details in the comments of PR #11221.
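
The core idea is simply to stop pinning every send/recv op to GPU0 and instead spread them across all available devices, so each GPU's D2H/H2D memcpy can proceed in parallel. A minimal sketch of that placement policy is below; the function and variable names are hypothetical and this is not the actual ParallelExecutor code, just an illustration of round-robin assignment under that assumption.

```python
# Hypothetical sketch: round-robin placement of send/recv ops across devices,
# instead of scheduling all of them on GPU0. Names (place_send_recv_ops,
# param_grads) are illustrative only.
from itertools import cycle

def place_send_recv_ops(param_grads, num_devices):
    """Assign each gradient's send/recv op a device index in round-robin order."""
    placement = {}
    devices = cycle(range(num_devices))
    for grad_name in param_grads:
        # Each transfer lands on a different GPU, so the memcpys can overlap.
        placement[grad_name] = next(devices)
    return placement

if __name__ == "__main__":
    grads = ["fc_0.w_0@GRAD", "fc_0.b_0@GRAD", "fc_1.w_0@GRAD", "fc_1.b_0@GRAD"]
    print(place_send_recv_ops(grads, num_devices=8))
    # {'fc_0.w_0@GRAD': 0, 'fc_0.b_0@GRAD': 1, 'fc_1.w_0@GRAD': 2, 'fc_1.b_0@GRAD': 3}
```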

@Yancey1989
Contributor Author

Duplicate of #11086

@Yancey1989 Yancey1989 changed the title Schedule send/recv op from gpu0 to all devices to overlap memcpy [performance]Schedule send/recv op from gpu0 to all devices to overlap memcpy Jun 5, 2018
PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from DOING to DONE Jun 20, 2018