The overlap of communication with computation does not appear to be realized, according to the GPU log #33
Comments
I'll try to reproduce this behavior on our cluster. In the meantime, try running with the message size set to 1. This should cause distributed to allreduce each parameter individually, as soon as its gradient is ready. This may negatively affect performance, because it will disable bucketing of transfers, but it should help force communication to overlap with computation (assuming everything is working correctly).
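A minimal sketch of that suggestion, assuming the apex.parallel.DistributedDataParallel wrapper and its message_size bucketing argument as discussed in this thread; the process-group setup is illustrative and may differ by environment:

```python
import torch
import torchvision.models as models
from apex.parallel import DistributedDataParallel as DDP

# Illustrative single-node setup; init_method and rank/world-size handling
# may differ depending on how the script is launched.
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = models.resnet50().cuda()
# message_size=1 makes every gradient exceed the bucket threshold, so each
# parameter is allreduced individually as soon as its gradient is produced.
model = DDP(model, message_size=1)
```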
Thank you @mcarilli. I have tried setting message_size to 1, but still cannot observe any overlap. AllReduceKernel and AllReduceKernelSmall are only called after all of the backward computation has finished. We look forward to your test results.
Ok, I have an explanation. Any PyTorch version prior to pytorch/pytorch#7604 will not observe overlap of communication with computation, due to a bug in how PyTorch prioritized ops in the backward pass. Evidently 0.4.0 does not contain the fix. I was able to reproduce the behavior you observed (no overlap) in the upstream 0.4.0 container (pytorch/pytorch:0.4-cuda9-cudnn7-devel). I then tried another container with a much more recent version of PyTorch installed, and saw comms and computation overlapping nicely. If you try the most recent version of PyTorch with the most recent version of Apex, it should work.
The latest stable release, PyTorch 0.4.1, should contain pytorch/pytorch#7604, so it is also a viable option.
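A quick guard one might add before expecting overlap (a sketch; the version comparison assumes standard release version strings):

```python
import torch
from distutils.version import LooseVersion

# pytorch/pytorch#7604 first lands in a stable release with 0.4.1; older
# builds serialize all allreduces until the whole backward pass finishes.
if LooseVersion(torch.__version__) < LooseVersion("0.4.1"):
    print("PyTorch %s predates the backward-priority fix; "
          "comm/compute overlap will not be observed." % torch.__version__)
```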
You are right, @mcarilli. We observe the overlap now. Thank you again. This issue can be closed.
We use the apex extension with PyTorch 0.4.0. The system information is:
System: Ubuntu 16.04.4
PyTorch version: 0.4.0 with CUDA 9.1 and cuDNN 7.0.5
Python version: 3.5.2
GPU: Tesla P100 × 8
NVIDIA driver: 390.46
Model: ResNet-50
We set shared_parameter=False to enable the overlap of communication with computation (we have read the source code and found that if the value is True, communication happens only after all computation). The message_size is reduced to 10^6. We run 6 iterations and record the GPU log with the NVIDIA profiler tool.
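For concreteness, the wrapper configuration described above would look roughly like this (a sketch; shared_parameter is the keyword as written in this report, and the exact argument name may differ across apex versions):

```python
import torchvision.models as models
from apex.parallel import DistributedDataParallel as DDP

model = models.resnet50().cuda()
# shared_parameter=False (keyword as written above; it may be spelled
# differently in a given apex release) opts out of the "allreduce only after
# the whole backward pass" path; message_size=10**6 shrinks the bucket
# threshold so transfers can start earlier.
model = DDP(model, shared_parameter=False, message_size=10**6)
```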
However, we found from the GPU log that the overlap is not realized. The log of the 6th iteration is shown below. The first AllReduceKernel call comes after the MaxPoolBackward call, which is the end of the backward computation. We have checked the other iterations and found the same thing.
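For reference, one way to produce such a timeline is to run the loop under nvprof with NVTX annotations, so each backward op appears next to the communication kernels (a sketch with a stand-in model; run under something like `nvprof --profile-from-start off python script.py`):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()   # stand-in for the wrapped ResNet-50
data = torch.randn(32, 1024, device='cuda')

with torch.cuda.profiler.profile():
    model(data)                         # warm-up iteration, outside NVTX ranges
    with torch.autograd.profiler.emit_nvtx():
        loss = model(data).sum()
        loss.backward()                 # NVTX ranges mark each backward op
```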
Could you please tell us the reason, or point out our mistakes in using the apex extension? Thanks for your help.