question about gradient accumulation #286
I followed the steps described here: https://nvidia.github.io/apex/advanced.html?highlight=accumulate#gradient-accumulation-across-iterations to perform gradient accumulation.
I was expecting the same results when training my model on 8 GPUs with 4 accumulation steps as when training on 32 GPUs without accumulation (i.e., two settings with the same effective batch size), but that is not what I observed: with gradient accumulation, training does not work as well, and performance is worse for the same number of epochs. In the 8 GPUs + 4 steps case, is the gradient automatically averaged over the 32 batches, or should I divide my gradients by 4 before performing a step?
Thank you

Comments
Gradient accumulation is not considered a special mode by Amp. The repeated backward calls (`scaled_loss.backward()` inside the `amp.scale_loss` context) are just doing their job, accumulating unscaled gradients from each backward pass into the `.grad` attributes of the parameters owned by the optimizer. So if you are accumulating over 4 iterations, Amp doesn't automatically divide by 4 for you. For your particular case, if you want 8 GPUs + 4 steps of accumulation to be numerically equivalent to 32 GPUs without accumulation, you will have to divide the loss manually, for example:
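A minimal sketch of the manual division; `iters_to_accumulate` is an assumed variable name for the number of accumulation steps (4 in your case):

```python
# Average the loss over the accumulation window so that
# 8 GPUs x 4 accumulation steps matches 32 GPUs without accumulation.
# `iters_to_accumulate` is an assumed name (4 in this case).
loss = loss / iters_to_accumulate
```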
so the example snippet from the docs would become something like the following.
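A sketch following the accumulation pattern on the linked docs page, with the manual division added; `iter`, `loss`, `optimizer`, and `iters_to_accumulate` are assumed names:

```python
from apex import amp

loss = loss / iters_to_accumulate  # average over the accumulation window

if (iter + 1) % iters_to_accumulate == 0:
    # Last iteration of the window: unscale gradients and take a step.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
else:
    # Intermediate iteration: accumulate scaled gradients.
    # delay_unscale=True skips the unscale so accumulation stays consistent.
    with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()
```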
I will update the documentation to make this point clear.

Got it, thank you!

Please reopen if gradient accumulation still doesn't behave as expected with manual averaging across iterations. It won't be bitwise identical to the 32 GPU case, but it should be pretty close and should train similarly.

This is different from the example given in the docs. Which one is right?