[Performance] parallelize GPU to/from CPU tensor copying in distributed training #11086

typhoonzero · 2018-05-31T09:22:31Z

Currently, the GPU -> CPU tensor copy before send runs only on GPU0. When running the parallel executor with multiple GPUs, this can be parallelized: launch copy kernels on different GPUs at the same time, with each GPU copying different gradient variables.
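A minimal sketch of the scheduling idea, not PaddlePaddle code: it assumes every device already holds a copy of each gradient, so any device can be picked to copy a given variable. The names `assign_round_robin`, `copy_to_cpu`, and `copy_all` are hypothetical, and `copy_to_cpu` is a placeholder that touches no GPU. Round-robin assignment is one possible policy; the issue does not specify one.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def assign_round_robin(grad_names, num_devices):
    """Map each gradient variable to one device, cycling through device ids."""
    plan = defaultdict(list)
    for i, name in enumerate(grad_names):
        plan[i % num_devices].append(name)
    return dict(plan)


def copy_to_cpu(device_id, name):
    # Placeholder for a real device-to-host copy (hypothetical; no GPU is used).
    # It only records which device was responsible for the variable.
    return (name, device_id)


def copy_all(grad_names, num_devices):
    """Run one worker per device so copies issued by different devices overlap."""
    plan = assign_round_robin(grad_names, num_devices)

    def drain(device_id):
        # Each worker copies only the variables assigned to its device.
        return [copy_to_cpu(device_id, n) for n in plan[device_id]]

    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        per_device = pool.map(drain, sorted(plan))
        return dict(pair for batch in per_device for pair in batch)


if __name__ == "__main__":
    grads = ["w1@GRAD", "b1@GRAD", "w2@GRAD", "b2@GRAD"]
    print(copy_all(grads, num_devices=2))
```

With a single device this degenerates to the current behavior, where one device copies every gradient before send.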

shanyi15 · 2018-08-15T10:24:19Z

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure and thank you for your support of PaddlePaddle!

typhoonzero added this to Perf TODOs in PaddlePaddle Distributed Refactoring (Due: 201802) May 31, 2018

Yancey1989 mentioned this issue Jun 5, 2018

[performance]Schedule send/recv op from gpu0 to all devices to overlap memcpy #11143

Closed

shanyi15 closed this as completed Aug 15, 2018

PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from Perf TODOs to DONE Aug 15, 2018
