
Support for uneven inputs in LightningDDP #5141

Closed · wants to merge 1 commit

Conversation


@rohan-varma rohan-varma commented Dec 15, 2020

What does this PR do?

Adds support for the DDP join() API for uneven inputs to PyTorch Lightning. See pytorch/pytorch#38174 for the PyTorch RFC. Fixes #3325.

It is implemented in a backwards-compatible way by gating the code behind a version check requiring torch >= 1.7.0, where this API is available. Open to discussion on better ways to ensure backwards compatibility. Since PyTorch Lightning overrides the PyTorch DDP implementation, we will likely have to change this code again for PyTorch 1.8, where these APIs have changed somewhat.
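For illustration, the version gate could look roughly like this (a minimal sketch; the flag name matches the one appearing in the diff excerpt below, but the exact implementation in this PR may differ):

from distutils.version import LooseVersion

import torch

# DDP's join() context manager and the Python-side _rebuild_buckets() hook
# are only available from torch 1.7.0 onwards.
DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE = LooseVersion(torch.__version__) >= LooseVersion("1.7.0")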

We also reintroduce the rebuild_buckets feature to PyTorch Lightning (the logic to enable it was moved to Python in PyTorch 1.7 but was not updated on the Lightning side). This is necessary because the DDP join() API assumes that buckets are rebuilt, which is always true in PyTorch DDP. It also offers a potential performance improvement by ensuring that our allreduce order matches the gradient-ready order.

Tested using the script at https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550. The join() API is also extensively tested in the PyTorch codebase.
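As a rough sketch of the uneven-inputs scenario with plain PyTorch DDP (illustrative only, not the gist itself; it would be launched with one process per rank, e.g. via torch.multiprocessing.spawn):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 1))
    # Deliberately uneven: each rank sees a different number of batches.
    batches = [torch.randn(4, 8) for _ in range(5 + rank)]

    # Without join(), ranks that exhaust their data first would hang in the
    # allreduce issued by ranks that still have batches left.
    with model.join():
        for batch in batches:
            model(batch).sum().backward()

    dist.destroy_process_group()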

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃


codecov bot commented Dec 15, 2020

Codecov Report

Merging #5141 (0f157c6) into master (84bb9db) will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #5141   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         134     134           
  Lines        9907    9907           
======================================
  Hits         9206    9206           
  Misses        701     701           

@rohan-varma rohan-varma changed the title from "[WIP] Support for uneven inputs" to "Support for uneven inputs" Dec 15, 2020
@rohan-varma rohan-varma marked this pull request as ready for review December 15, 2020 05:06
@rohan-varma rohan-varma changed the title from "Support for uneven inputs" to "Support for uneven inputs in LightningDDP" Dec 15, 2020
justusschock previously approved these changes Dec 15, 2020
@justusschock (Member) left a comment


LGTM. One minor thing about the version comparison.

Can you maybe also add a TODO for changing this with PT 1.8?

pytorch_lightning/utilities/__init__.py (review thread, outdated)
@rohan-varma (Author)

The CI error looks unrelated: https://github.com/PyTorchLightning/pytorch-lightning/pull/5141/checks?check_run_id=1558105867

Failed to import 'pytorch_lightning.profiler': no module named pytorch_lightning.profiler
Warning, treated as error:
[autosummary] failed to import 'pytorch_lightning.callbacks.Callback': no module named pytorch_lightning.callbacks.Callback
make: *** [Makefile:19: html] Error 2
Error: Process completed with exit code 2.

SeanNaren previously approved these changes Dec 15, 2020
@SeanNaren SeanNaren self-requested a review December 15, 2020 16:30
@SeanNaren (Contributor)

Failing doc checks... Will investigate!

@tchaton tchaton changed the base branch from master to release/1.2-dev December 15, 2020 20:15
@tchaton tchaton dismissed stale reviews from SeanNaren and justusschock December 15, 2020 20:15

The base branch was changed.

@carmocca (Contributor) left a comment


This is great! A few questions:

  1. How does this fit into the LightningModule lifecycle? Where should users set the context manager?
  2. Does this work with automatic optimization?
  3. Can we use the testing script you provided as a test?
  4. Is there any disadvantage of using this context manager when inputs are not uneven?

# during forward computation.
# This should be called only once during whole training period.
if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
logging.info("Reducer buckets have been rebuilt in this iteration.")
Inline review comment (Contributor):

This should probably be a debug or rank_zero_debug message.
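A sketch of what that suggestion could look like, assuming a rank_zero_debug helper along the lines of Lightning's other rank_zero_* utilities (the import path is an assumption):

# Assumed helper; Lightning also provides rank_zero_info / rank_zero_warn.
from pytorch_lightning.utilities.distributed import rank_zero_debug

if DDP_JOIN_AND_REBUILD_BUCKETS_AVAILABLE and self.reducer._rebuild_buckets():
    # Debug-level, and only on rank zero, so training logs are not flooded
    # with one "buckets rebuilt" line per process.
    rank_zero_debug("Reducer buckets have been rebuilt in this iteration.")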

@SkafteNicki SkafteNicki linked an issue Dec 16, 2020 that may be closed by this pull request
@rohan-varma (Author)

  • How does this fit into the LightningModule lifecycle? Where should users set the context manager?
  • Does this work with automatic optimization?
  • Can we use the testing script you provided as a test?
  • Is there any disadvantage of using this context manager when inputs are not uneven?

I don't have full context, but I took a stab at answering the questions below.

  1. I'm relatively new to Lightning, but my presumption is that we should use it similarly to the other DDP context managers that Lightning already supports. One example is the no_sync context manager, which Lightning appears to support (see the sketch after this list). If there are pointers on what the API for using this should look like, I'd be happy to take a look.
  2. I'm not sure what automatic optimization means in this context; does anyone else have insight into this?
  3. Does this mean we should add the script's logic as a unit test? I think that would be useful.
  4. There should be no performance disadvantage; we analyzed that case (and did more performance benchmarking) in the original PyTorch PR: Join-based API to support DDP uneven inputs pytorch/pytorch#42577
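For reference, here is a minimal sketch of how DDP's no_sync context manager is used in plain PyTorch (the function and the gradient-accumulation setup are illustrative, not code from this PR):

from torch.nn.parallel import DistributedDataParallel as DDP


def accumulate_then_sync(model: DDP, batches, optimizer):
    # Assumes the process group is initialized and `model` is already DDP-wrapped.
    *accumulation_batches, last_batch = batches

    # Inside no_sync(), gradients are accumulated locally; no allreduce is issued.
    with model.no_sync():
        for batch in accumulation_batches:
            model(batch).sum().backward()

    # The first backward outside no_sync() triggers the allreduce for all
    # accumulated gradients.
    model(last_batch).sum().backward()
    optimizer.step()
    optimizer.zero_grad()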

@stale stale bot added the "won't fix" label Dec 31, 2020
@SeanNaren SeanNaren added the "feature" label Dec 31, 2020

stale bot commented Jan 26, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the "won't fix" label Jan 26, 2021
@Borda (Member) commented Jan 29, 2021

@rohan-varma @carmocca @tchaton how is it going here? 🐰

@stale stale bot removed the "won't fix" label Jan 29, 2021
@awaelchli (Member)

We recently refactored DDP (#5185) and now rely directly on the PyTorch implementation.

I can see that these lines of code are already in PyTorch master:
https://github.com/pytorch/pytorch/blob/ebe26b81d2874992331e3cd48859c73a19517895/torch/nn/parallel/distributed.py#L674
which means this should be supported automatically by Lightning with newer PyTorch versions.


snie2012 commented Jan 30, 2021

So how can we use join with Lightning?

@carmocca (Contributor) commented Feb 1, 2021

which means this should be automatically supported by Lightning with newer pytorch versions.

should we close this then? @awaelchli

So how can we use join with Lightning?

Just as you would with plain PyTorch, but 1.2 needs to be released first.

@awaelchli (Member)

Not sure; I think we still need to insert that join context manager somewhere in Lightning, right @rohan-varma?
My comment was more about the few outdated lines in this PR.

@rohan-varma (Author)

Glad to see that the PyTorch Lightning implementation now relies directly on the PyTorch DDP implementation!

@awaelchli Yes, we would need to insert the context manager somewhere. In this script, https://gist.github.com/rohan-varma/3906e7f07669f0177801a9f753848550, I did it by directly accessing the Lightning DDP override and calling .join() on it; this should work as long as it subclasses DDP. I'm not sure how other context managers, such as DDP's no_sync for gradients, are used in Lightning; is there a pointer to that?

@awaelchli (Member)

Your example script is very useful!

The DDP no_sync is used here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c6b1a7264321ca81682dc1aaa4ba21edce67c399/pytorch_lightning/plugins/ddp_plugin.py#L148

and this context manager is then used later in the training loop when gradient accumulation is happening:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c6b1a7264321ca81682dc1aaa4ba21edce67c399/pytorch_lightning/trainer/training_loop.py#L672

@justusschock What do you think, should we allow the plugin to return a training loop context similar to what is done with the training_step_context in the new plugins?
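Purely as an illustration of that idea (the class and hook names here are hypothetical, not part of Lightning's actual plugin API):

from contextlib import contextmanager


class HypotheticalDDPPlugin:
    def __init__(self, ddp_model):
        # Assumed to be (a subclass of) torch.nn.parallel.DistributedDataParallel.
        self.ddp_model = ddp_model

    @contextmanager
    def training_loop_context(self):
        # Hypothetical hook the trainer would enter around the whole training
        # loop, keeping DDP-specific details out of the loop itself.
        if hasattr(self.ddp_model, "join"):
            # torch >= 1.7: tolerate an uneven number of batches across ranks.
            with self.ddp_model.join():
                yield
        else:
            yield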


snie2012 commented Feb 4, 2021

Is it a good idea to make join available by simply passing a flag to the trainer/datamodule?

@justusschock (Member)

@awaelchli That's probably a good idea; we could also move the DDP sync handling there, as well as the join. The training loop would be a bit cleaner then, since nothing explicitly related to DDP would be there.


skylook commented Feb 8, 2021

When will uneven inputs be supported, or is there at least a simple workaround in the meantime? Thanks.

@Borda Borda added the "bug" label Feb 8, 2021
@carmocca carmocca added this to the 1.3 milestone Feb 8, 2021
@rohan-varma (Author)

@ananthsub mentioned to me offline that he'd be taking over this PR, so cc'ing him here.

Base automatically changed from release/1.2-dev to master February 11, 2021 14:32
@tchaton (Contributor) commented Feb 22, 2021

Hey @ananthsub,

Any updates on this PR?

Best,
T.C

@edenlightning edenlightning removed the "bug" label Mar 23, 2021
@edenlightning edenlightning removed this from the 1.3 milestone Apr 15, 2021
@Borda (Member) commented May 11, 2021

@rohan-varma thx for your patience, I'll try to push it a bit...
@PyTorchLightning/core-contributors can I get your review here?

@awaelchli (Member)

In the linked issue (#3325) we discussed this and concluded that we can't integrate it at the moment.
We may want to come back to it when we enable more loop customization, but some features in Lightning wouldn't work well with join.

Labels
feature, has conflicts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • Support uneven DDP inputs with pytorch model.join
  • A question about metrics on ddp