
[performance] Schedule send/recv op from GPU0 to all devices to overlap memcpy #11143

Closed
Yancey1989 opened this issue Jun 4, 2018 · 1 comment

Comments

@Yancey1989
Contributor

Yancey1989 commented Jun 4, 2018

In the current distributed training implementation, all send/recv ops are scheduled on GPU0. I have run some experiments that schedule the send/recv ops on different devices so the memcpy can overlap.

This feature improves performance by about 9%.

Setup: 2 trainers + 4 pservers, 8 GPUs per trainer, ParallelExecutor (num_threads=1)

1. develop branch

Pass = 0, Elapsed = 160, Training performance = 38.328424 imgs/s, Train accuracy = 0.011654, Test accuracy = 0.011765
Pass = 1, Elapsed = 156, Training performance = 39.400293 imgs/s, Train accuracy = 0.009553, Test accuracy = 0.009804

2. overlap memcpy branch

Pass = 0, Elapsed = 146, Training performance = 41.994200 imgs/s, Train accuracy = 0.010613, Test accuracy = 0.009649
Pass = 1, Elapsed = 144, Training performance = 42.671882 imgs/s, Train accuracy = 0.008629, Test accuracy = 0.010768

Please see the latest experiment details in the comments of PR #11221.
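
The core idea is simply to stop pinning every send/recv op to GPU0 and instead spread them across all available devices, so each GPU's D2H/H2D memcpy can proceed in parallel. A minimal sketch of that placement policy is below; the function and variable names are hypothetical and this is not the actual ParallelExecutor code, just an illustration of round-robin assignment under that assumption.

```python
# Hypothetical sketch: round-robin placement of send/recv ops across devices,
# instead of scheduling all of them on GPU0. Names (place_send_recv_ops,
# param_grads) are illustrative only.
from itertools import cycle

def place_send_recv_ops(param_grads, num_devices):
    """Assign each gradient's send/recv op a device index in round-robin order."""
    placement = {}
    devices = cycle(range(num_devices))
    for grad_name in param_grads:
        # Each transfer lands on a different GPU, so the memcpys can overlap.
        placement[grad_name] = next(devices)
    return placement

if __name__ == "__main__":
    grads = ["fc_0.w_0@GRAD", "fc_0.b_0@GRAD", "fc_1.w_0@GRAD", "fc_1.b_0@GRAD"]
    print(place_send_recv_ops(grads, num_devices=8))
    # {'fc_0.w_0@GRAD': 0, 'fc_0.b_0@GRAD': 1, 'fc_1.w_0@GRAD': 2, 'fc_1.b_0@GRAD': 3}
```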

@Yancey1989
Contributor Author

Duplicate of #11086

@Yancey1989 Yancey1989 changed the title Schedule send/recv op from gpu0 to all devices to overlap memcpy [performance]Schedule send/recv op from gpu0 to all devices to overlap memcpy Jun 5, 2018
PaddlePaddle Distributed Refactoring (Due: 201802) automation moved this from DOING to DONE Jun 20, 2018