
RuntimeError: function AllGatherGradBackward returned an incorrect number of gradients (expected 2, got 1) #6624

Closed
ArvinZhuang opened this issue Mar 22, 2021 · 0 comments · Fixed by #6625
Labels
bug Something isn't working help wanted Open to be worked on priority: 1 Medium priority task


@ArvinZhuang
Contributor

🐛 Bug

When using self.all_gather in training_step to gather a tensor that carries a gradient function, then computing and returning a loss from it, the following is thrown:
RuntimeError: function AllGatherGradBackward returned an incorrect number of gradients (expected 2, got 1)

I think the bug is that forward(ctx, tensor, group=group.WORLD) in the distributed.AllGatherGrad function takes two inputs (tensor and group), but backward(ctx, *grad_output) returns only one gradient. Autograd requires backward to return one gradient per forward input, using None for inputs that do not need a gradient.

class AllGatherGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, group=group.WORLD):
        ctx.group = group

        gathered_tensor = [
            torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())
        ]

        torch.distributed.all_gather(gathered_tensor, tensor, group=group)
        gathered_tensor = torch.stack(gathered_tensor, dim=0)

        return gathered_tensor

    @staticmethod
    def backward(ctx, *grad_output):
        grad_output = torch.cat(grad_output)

        torch.distributed.all_reduce(
            grad_output,
            op=torch.distributed.ReduceOp.SUM,
            async_op=False,
            group=ctx.group
        )

        # Bug: forward has two inputs (tensor, group), but only one
        # gradient is returned here.
        return grad_output[torch.distributed.get_rank()]

The error can be fixed by changing return grad_output[torch.distributed.get_rank()] to return grad_output[torch.distributed.get_rank()], None; the added None stands in as the gradient for the non-tensor group argument.
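The rule autograd enforces here can be illustrated without any distributed setup: backward must return exactly as many values as forward has inputs (excluding ctx). The sketch below mimics that arity check in plain Python; check_backward_arity and the placeholder names are hypothetical, not PyTorch internals.

```python
import inspect

def check_backward_arity(forward_fn, backward_outputs):
    # Count forward's real inputs, excluding the ctx argument.
    n_inputs = len(inspect.signature(forward_fn).parameters) - 1
    if len(backward_outputs) != n_inputs:
        raise RuntimeError(
            f"function returned an incorrect number of gradients "
            f"(expected {n_inputs}, got {len(backward_outputs)})"
        )

def forward(ctx, tensor, group=None):
    # Two real inputs: `tensor` and `group`.
    pass

# Buggy backward: one value returned for two forward inputs -> error.
try:
    check_backward_arity(forward, ("grad_tensor",))
except RuntimeError as e:
    print(e)

# Fixed backward: a gradient for `tensor`, None for the `group` arg.
check_backward_arity(forward, ("grad_tensor", None))
```

This is why appending , None resolves the error: group is a forward input, so backward must account for it even though it has no gradient.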

Environment

  • PyTorch Version: 1.7.1
  • OS: Linux
  • How you installed PyTorch: pip
  • Python version: 3.7
  • PyTorch Lightning version: 1.1.8
@ArvinZhuang ArvinZhuang added bug Something isn't working help wanted Open to be worked on labels Mar 22, 2021
@Borda Borda added the priority: 1 Medium priority task label Mar 23, 2021