
question about gradient accumulation #286

Closed · glample opened this issue May 2, 2019 · 4 comments


glample commented May 2, 2019

I followed the steps described here: https://nvidia.github.io/apex/advanced.html?highlight=accumulate#gradient-accumulation-across-iterations to perform gradient accumulation.
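Concretely, my training loop follows that pattern, roughly like this (just a sketch; `train_loader`, `compute_loss` and `iters_to_accumulate` stand in for my actual code):

    for iter, batch in enumerate(train_loader):
        loss = compute_loss(model, batch)
        if iter % iters_to_accumulate == 0:
            # Step iteration: unscale and update the weights
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        else:
            # Accumulation iteration: leave gradients scaled, don't step
            with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
                scaled_loss.backward()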

I was expecting to get the same results when training my model on 8 GPUs with 4 accumulation steps as on 32 GPUs without accumulation (i.e. two settings with the same effective batch size), but this is not what I observed: accumulating the gradient does not work as well, and performance is not as good for the same number of epochs. In the 8 GPUs + 4 steps case, is the gradient automatically averaged over the 32 batches, or should I divide my gradients by 4 before performing a step?

Thank you


mcarilli commented May 2, 2019

Gradient accumulation is not considered a special mode by Amp. The repeated calls to

    with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()

are just doing their job, accumulating unscaled gradients from each backward pass into the .grad attributes of the parameters owned by the optimizer. So if you are accumulating over 4 iterations, Amp doesn't automatically divide by 4 for you; the 4 backward passes simply sum into .grad. For your particular case, if you want 8 GPUs + 4 accumulation steps to be numerically equivalent to 32 GPUs without accumulation, you will have to manually say

    for param in amp.master_params(optimizer):
        param.grad.div_(iters_to_accumulate)

so the example snippet would become

if iter%iters_to_accumulate == 0:
    # Every iters_to_accumulate iterations, unscale and step
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradient clipping if desired:
    # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    for param in amp.master_params(optimizer):
        param.grad.div_(iters_to_accumulate)
    optimizer.step()
    optimizer.zero_grad()
else:
    # Otherwise, accumulate gradients, don't unscale or step.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

I will update the documentation to make this point clear.
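As a side note, an equivalent alternative (just a sketch, not what the docs show) is to divide the loss itself by iters_to_accumulate before each backward pass. Gradients are linear in the loss, so the accumulated .grad values then come out already averaged and the division loop at step time isn't needed:

    # Equivalent sketch: scale the loss instead of the accumulated gradients.
    # step_now is just shorthand for (iter % iters_to_accumulate == 0).
    loss = loss / iters_to_accumulate
    with amp.scale_loss(loss, optimizer, delay_unscale=not step_now) as scaled_loss:
        scaled_loss.backward()
    if step_now:
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()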


glample commented May 2, 2019

Got it, thank you!

glample closed this as completed May 2, 2019

mcarilli commented May 3, 2019

Please reopen if gradient accumulation still doesn't behave as expected with manual averaging across iterations. It won't be bitwise identical to the 32-GPU case, but it should be pretty close and should train similarly.

ggaemo commented May 27, 2020

> Gradient accumulation is not considered a special mode by Amp. […] I will update the documentation to make this point clear.

This is different from the example given in the docs. Which one is right?
