In the current distributed training implementation, all send/recv ops are scheduled on GPU 0. I have run some experiments on scheduling the send/recv ops across different devices so that their memcpys can overlap.
This change improves performance by about 9%.
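As a minimal sketch of the placement idea (the names below are hypothetical and do not reflect PaddlePaddle's actual scheduler API): instead of pinning every send/recv op to GPU 0, assign them round-robin across all available devices, so each device's copy engine can transfer gradients in parallel.

```python
def assign_send_recv_devices(send_recv_ops, num_gpus):
    """Round-robin placement of communication ops over available GPUs.

    `send_recv_ops` is any iterable of op identifiers; the return value
    maps each op to the GPU index it should be scheduled on. This is an
    illustrative sketch of the policy, not the real implementation.
    """
    placement = {}
    for i, op in enumerate(send_recv_ops):
        # Spread ops across devices instead of putting them all on GPU 0,
        # so their device-to-host memcpys can run concurrently.
        placement[op] = i % num_gpus
    return placement

if __name__ == "__main__":
    ops = ["send_grad_fc_w", "send_grad_fc_b",
           "send_grad_conv_w", "send_grad_conv_b"]
    print(assign_send_recv_devices(ops, num_gpus=8))
    # {'send_grad_fc_w': 0, 'send_grad_fc_b': 1,
    #  'send_grad_conv_w': 2, 'send_grad_conv_b': 3}
```

With only one device hosting all communication ops, the copies serialize on that device's copy engine; distributing them is what lets the memcpy time hide behind computation.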
Setup: 2 trainers + 4 pservers, 8 GPUs per trainer, ParallelExecutor (num_threads=1)
1. develop branch
Pass = 0, Elapsed = 160, Training performance = 38.328424 imgs/s, Train accuracy = 0.011654, Test accuracy = 0.011765
Pass = 1, Elapsed = 156, Training performance = 39.400293 imgs/s, Train accuracy = 0.009553, Test accuracy = 0.009804
2. overlap memcpy branch
Pass = 0, Elapsed = 146, Training performance = 41.994200 imgs/s, Train accuracy = 0.010613, Test accuracy = 0.009649
Pass = 1, Elapsed = 144, Training performance = 42.671882 imgs/s, Train accuracy = 0.008629, Test accuracy = 0.010768
Please see the latest experiment details in the comments of PR #11221.