Proper way to do contrastive learning with DDP & PT-Lightning #14390

@kkarrancsu You are definitely on the right track here. In the LightningModule, you have this method for gathering a tensor from all processes:

tensors_from_all = self.all_gather(my_tensor)
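
Note that all_gather stacks the per-process tensors along a new leading dimension: with a world size of N and a local tensor of shape (batch, dim), the result has shape (N, batch, dim).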

What you want is to back-propagate through this all_gather function, and this is possible if you set

tensors_from_all = self.all_gather(my_tensor, sync_grads=True)
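
By default sync_grads=False, and the gathered tensors come back detached from the autograd graph, so the loss would only back-propagate through each process's own local outputs.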

In your case, your training_step method could look something like this:

    def training_step(self, batch, batch_idx):
        # Compute this process's embeddings for its local shard of the batch.
        outputs = self(batch)
        ...

        # Gather embeddings from every process; sync_grads=True keeps the
        # gathered copies attached to the autograd graph so gradients flow
        # back through the all_gather.
        all_outputs = self.all_gather(outputs, sync_grads=True)

        # Compute the contrastive loss over the full, global batch.
        loss = contrastive_loss_fn(all_outputs, ...)
        return loss
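
The answer above leaves contrastive_loss_fn abstract. As one illustration only, here is a minimal InfoNCE-style sketch; the function name, the two-views-per-sample layout, and the temperature value are assumptions for this example, and the reshape must match whatever your model actually returns:

    import torch
    import torch.nn.functional as F

    def contrastive_loss_fn(all_outputs, temperature=0.1):
        # Assumed layout: all_outputs comes from
        # self.all_gather(outputs, sync_grads=True) and has shape
        # (world_size, local_batch, 2, dim) -- two augmented views per sample.
        world_size, local_batch, n_views, dim = all_outputs.shape

        # Collapse world_size and local_batch into one global batch.
        z = all_outputs.reshape(world_size * local_batch, n_views, dim)
        z1 = F.normalize(z[:, 0], dim=-1)
        z2 = F.normalize(z[:, 1], dim=-1)

        # Cosine-similarity logits between every view-1 and view-2
        # embedding in the global batch.
        logits = z1 @ z2.t() / temperature

        # Positives sit on the diagonal: sample i's first view should
        # match sample i's second view.
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)

Because the loss is computed over the full global batch on every process, each process sees the same loss value, and gradients reach the local embeddings through the gathered copies.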

Answer selected by kkarrancsu