
Support uneven DDP inputs with pytorch model.join #3325

Open
edenlightning opened this issue Sep 2, 2020 · 23 comments
Labels: 3rd party (Related to a 3rd-party) · distributed (Generic distributed-related topic) · feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

Comments

@edenlightning
Contributor

edenlightning commented Sep 2, 2020

See more details: pytorch/pytorch#38174

cc @Borda @tchaton @rohitgr7 @akihironitta @awaelchli

@edenlightning added the feature, help wanted, and distributed labels Sep 2, 2020
@stale

stale bot commented Oct 21, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Oct 21, 2020
@carmocca
Member

Interested in this issue! Hopefully some progress is made soon 👍

stale bot removed the won't fix label Oct 21, 2020
@xvr-hlt

xvr-hlt commented Nov 19, 2020

Interested in this also :)

@rohan-varma

Is there any progress on this issue? Happy to help in any way.

@edenlightning
Contributor Author

@rohan-varma that would be great!! Want to try and submit a draft PR? We can help from there.

@rohan-varma

@edenlightning Sounds good, I also pinged the slack channel for any feedback/discussions.

@alanhdu
Contributor

alanhdu commented Dec 11, 2020

We'd also be very interested in this feature. Let us know if there's anything we can do to help!

@rohan-varma

The PR #5141 is ready for review, in case anyone wants to take a look.

stale bot added the won't fix label Jan 14, 2021
Lightning-AI deleted a comment from stale bot Jan 14, 2021
stale bot removed the won't fix label Jan 14, 2021
@stale

stale bot commented Feb 13, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Feb 13, 2021
stale bot closed this as completed Feb 20, 2021
@edenlightning removed the won't fix label Feb 22, 2021
@edenlightning reopened this Feb 22, 2021
@edenlightning added this to the 1.3 milestone Feb 22, 2021
@ananthsub
Contributor

ananthsub commented Mar 24, 2021

I discussed this more with @rohan-varma - DDP join docs: https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel.join

This module currently does not support custom distributed collective operations in the forward pass, such as SyncBatchNorm or other custom defined collectives in the model’s forward pass.

As the LightningModule is wrapped in another module which is then wrapped with DDP, the LightningModule's training_step becomes the forward pass run by the DDP-wrapped module: https://github.com/PyTorchLightning/pytorch-lightning/blob/d471fa30b3bf95cfe601014bac544754067241ca/pytorch_lightning/plugins/training_type/ddp.py#L223-L227

As a result, any collective call (such as metric syncing or all_gather) that happens during the training step would break join. Therefore I lean towards closing this out given the caveats. @awaelchli @justusschock what do you think?
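To make the caveat concrete, here is a minimal sketch (illustrative names only, not code from this repo) of a training_step that issues its own collective and therefore would not play well with join on uneven inputs:

```python
# Illustrative only: a LightningModule whose training_step issues its own
# collective. Under join, a rank that has exhausted its batches stops
# participating, so the all_gather below would hang or mismatch on the
# remaining ranks.
import torch
import pytorch_lightning as pl


class CollectiveInStep(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # user-issued collective inside what DDP treats as the forward pass:
        gathered = self.all_gather(loss)  # breaks with uneven inputs under join
        self.log("train_loss", gathered.mean())
        return loss
```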

@awaelchli
Member

Agree, I also don't see how this can be supported at the moment.

@edenlightning added the 3rd party label Apr 15, 2021
@tmbdev

tmbdev commented Dec 25, 2021

@ananthsub commented 6 hours ago

However, the manner in which join tracks collectives can quickly run into issues with other collectives that run in the forward pass / training_step.

In PyTorch, the "with Join" construct is used as a simple wrapper around training steps. It should work in simple cases, even if there are more complex cases where it doesn't.

So, why not simply add an option to the trainer that wraps the invocations of training_step in a with Join block? That should be pretty straightforward, and it would leave it up to users to determine when with Join is the right thing to use and when it isn't.
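For reference, this is roughly what the raw construct looks like in a plain PyTorch loop, a sketch following the generic join tutorial (it assumes the usual rendezvous env vars, e.g. via torchrun or mp.spawn):

```python
# Sketch of the plain-PyTorch usage that such a Trainer option would wrap.
import torch
import torch.distributed as dist
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Deliberately uneven: rank r gets (r + 1) batches.
    batches = [torch.randn(4, 8) for _ in range(rank + 1)]

    with Join([model]):
        for batch in batches:
            optimizer.zero_grad()
            model(batch).sum().backward()
            optimizer.step()
    # Ranks that finish early have their DDP collectives shadowed by Join
    # until every rank has left the context.
```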

@ananthsub reopened this Dec 25, 2021
@awaelchli
Member

awaelchli commented Dec 28, 2021

So, why not simply add an option to the trainer that enables wrapping the invocations of training_step with with Join?

The join here is specific to PyTorch DDP. If it were implemented, it would have to live inside the DDP plugin/strategy. For simple cases it may work, but no collective calls are allowed except the ones under DDP.forward()/DDP.backward(), if I understand correctly.

If we did want to do it "correctly", we would probably have to set throw_on_early_termination=True and then handle the error in all custom collective calls, including the ones in torchmetrics. I don't know whether that would work, but it's probably not feasible.
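A rough sketch of that variant in plain PyTorch (ddp_model, dataloader and optimizer are placeholders from the usual DDP setup); the except branch is the part that is hard to guarantee for arbitrary user code and torchmetrics:

```python
# Sketch only: with throw_on_early_termination=True, every rank raises a
# RuntimeError as soon as any rank runs out of input, so all ranks can stop
# the epoch at the same iteration instead of shadowing collectives.
from torch.distributed.algorithms import Join

try:
    with Join([ddp_model], throw_on_early_termination=True):
        for batch in dataloader:
            optimizer.zero_grad()
            ddp_model(batch).sum().backward()
            optimizer.step()
except RuntimeError:
    # Every custom collective call (self.log(sync_dist=True), torchmetrics
    # syncs, user all_gathers) would also need to survive this early exit.
    pass
```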

@kaushikb11 added the priority: 0 label Jan 24, 2022
@kaushikb11 self-assigned this Jan 24, 2022
@carmocca
Member

carmocca commented Feb 3, 2022

To recap, the plan would be:

  • Enable "join" as an optional feature of the DDP strategy: Trainer(strategy=DDPStrategy(uneven_input_support: bool)). We could also add a registry string for it.
  • Add support for "joining" the training_step.
    • Is there a benefit to doing it for validation_step and test_step? Probably not
    • Could validation_step and test_step use UnrepeatedDistributedSampler just as trainer.predict? Probably yes.
  • When the feature is enabled, we don't automatically use the generic DistributedSampler as we wouldn't want to duplicate data to make inputs even.
  • We print a big warning about how this feature is experimental and describe all its caveats.
  • This would be PyTorch 1.10+ only.

Some sources:
https://pytorch.org/docs/stable/distributed.algorithms.join.html#torch.distributed.algorithms.Join
https://pytorch.org/tutorials/advanced/generic_join.html
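A sketch of what the user-facing API from the first bullet could look like; neither the uneven_input_support flag nor the "ddp_join" registry string exists yet, both are placeholders for the proposal:

```python
# Hypothetical API sketch for the proposal above; not existing Lightning code.
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(uneven_input_support=True),  # proposed flag, experimental
)

# or, via a hypothetical strategy registry string:
trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp_join")
```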

@carmocca added this to the 1.6 milestone Feb 3, 2022
@awaelchli
Member

awaelchli commented Feb 5, 2022

Is there a benefit to doing it for validation_step and test_step? Probably not

I assume there is, if collectives are being used. For example, sync_dist=True in self.log or similar. However, we don't wrap the model in DDP during val and test, so join won't be available anyway.

When the feature is enabled, we don't automatically use the generic DistributedSampler as we wouldn't want to duplicate data to make inputs even.

pytorch/pytorch#49180 is great! Hopefully this will clarify the drop_last argument, which has a slightly misleading/incomplete description :) We would indeed need the UnrepeatedDistributedSampler.

@otaj
Contributor

otaj commented Aug 17, 2022

Hi everyone, I'm gathering information on what is needed in order to support this properly.

  1. Use torch.distributed.algorithms.Join (https://pytorch.org/docs/stable/distributed.algorithms.join.html) as a context manager in which the model is run.
  2. Use UnrepeatedDistributedSamplerWrapper
  3. Check all modules that could issue syncs (such as SyncBatchNorm) and pass them as arguments to the Join context manager from 1.
  4. Figure out what to do with calls to self.log(..., sync_dist=True)

Is that it? cc @awaelchli, @carmocca.
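For item 2, the essential difference from the stock DistributedSampler is that no indices are repeated to pad every rank to equal length, so shards may be uneven. A minimal standalone sketch of the idea (not the actual Lightning class):

```python
# Minimal sketch of "unrepeated" distributed sampling: indices are split
# round-robin across ranks with no padding, so shard lengths may differ by one.
from torch.utils.data import Sampler


class UnrepeatedSamplerSketch(Sampler):
    def __init__(self, dataset_len: int, num_replicas: int, rank: int):
        self.indices = list(range(rank, dataset_len, num_replicas))

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


# With 10 samples on 4 ranks: ranks 0/1 see 3 samples, ranks 2/3 see 2,
# which is exactly the uneven-input situation Join has to handle.
```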

@justusschock
Member

@otaj Almost.

Additionally, all metrics from torchmetrics would have to be considered as well, since they are also capable of issuing syncs on their own. And in general, the user can run arbitrary syncing calls within each of the steps, which have to be considered too (that will be the trickiest part, I guess).

@otaj
Contributor

otaj commented Aug 18, 2022

oh, those torchmetrics are going to be fun... 😅 I think capturing user calls can be solved with yet another context manager (our own custom one), what do you think?
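As a very rough, entirely hypothetical illustration of that idea (none of these names exist in Lightning): a process-local flag that Lightning-controlled sync points could consult before issuing a collective, e.g. to defer the reduction rather than issue a call that already-joined ranks will never match.

```python
# Hypothetical sketch of the "yet another context manager" idea; not real code.
from contextlib import contextmanager

_UNDER_JOIN = False


@contextmanager
def capture_user_syncs():
    global _UNDER_JOIN
    _UNDER_JOIN = True
    try:
        yield
    finally:
        _UNDER_JOIN = False


def maybe_sync(tensor, sync_fn):
    # Hypothetical hook that self.log(sync_dist=True) / metric syncing could
    # route through instead of calling the collective directly.
    if _UNDER_JOIN:
        return tensor  # skip/defer the collective while under join
    return sync_fn(tensor)
```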

@justusschock
Member

If we can capture user calls with that, it might work similarly with torchmetrics. So let's ignore those metrics for now, and if you get a working solution for everything else, I'm sure we'll manage to integrate metrics with that :D

@Borda
Member

Borda commented Sep 19, 2022

let's check the option with LightningLite first 🦦

@awaelchli
Member

Here is the corresponding issue as suggested in planning: #14635
