
question about gradient accumulation #286

Closed · glample opened this issue May 2, 2019 · 4 comments


glample commented May 2, 2019

I followed the steps described here: https://nvidia.github.io/apex/advanced.html?highlight=accumulate#gradient-accumulation-across-iterations to perform gradient accumulation.
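Concretely, my training loop follows that pattern, roughly like this (just a sketch; `train_loader`, `compute_loss` and `iters_to_accumulate` stand in for my actual code):

    for iter, batch in enumerate(train_loader):
        loss = compute_loss(model, batch)
        if iter % iters_to_accumulate == 0:
            # Step iteration: unscale and update the weights
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        else:
            # Accumulation iteration: leave gradients scaled, don't step
            with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
                scaled_loss.backward()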

I was expecting to get the same results when training my model on 8 GPUs with 4 accumulation steps as on 32 GPUs without accumulation (i.e. two settings with the same effective batch size), but this is not what I observed: accumulating the gradient does not work as well, and performance is not as good for the same number of epochs. In the 8 GPUs + 4 steps case, is the gradient automatically averaged over the 32 batches, or should I divide my gradients by 4 before performing a step?

Thank you


mcarilli commented May 2, 2019

Gradient accumulation is not considered a special mode by Amp. The repeated calls to

    with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()

are just doing their job, accumulating unscaled gradients from each backward pass into the .grad attributes of the parameters owned by the optimizer. So if you are accumulating over 4 iterations, Amp doesn't automatically divide by 4 for you; the 4 backward passes simply sum into .grad. For your particular case, if you want 8 GPUs + 4 accumulation steps to be numerically equivalent to 32 GPUs without accumulation, you will have to manually say

    for param in amp.master_params(optimizer):
        param.grad.div_(iters_to_accumulate)

so the example snippet would become

if iter%iters_to_accumulate == 0:
    # Every iters_to_accumulate iterations, unscale and step
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradient clipping if desired:
    # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    for param in amp.master_params(optimizer):
        param.grad.div_(iters_to_accumulate)
    optimizer.step()
    optimizer.zero_grad()
else:
    # Otherwise, accumulate gradients, don't unscale or step.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

I will update the documentation to make this point clear.
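As a side note, an equivalent alternative (just a sketch, not what the docs show) is to divide the loss itself by iters_to_accumulate before each backward pass. Gradients are linear in the loss, so the accumulated .grad values then come out already averaged and the division loop at step time isn't needed:

    # Equivalent sketch: scale the loss instead of the accumulated gradients.
    # step_now is just shorthand for (iter % iters_to_accumulate == 0).
    loss = loss / iters_to_accumulate
    with amp.scale_loss(loss, optimizer, delay_unscale=not step_now) as scaled_loss:
        scaled_loss.backward()
    if step_now:
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()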


glample commented May 2, 2019

Got it, thank you!

glample closed this as completed May 2, 2019

mcarilli commented May 3, 2019

Please reopen if gradient accumulation still doesn't behave as expected with manual averaging across iterations. It won't be bitwise identical to the 32-GPU case, but it should be pretty close and should train similarly.

ggaemo commented May 27, 2020

> Gradient accumulation is not considered a special mode by Amp. […] I will update the documentation to make this point clear.

This is different from the example given in the docs. Which one is right?
