

support launching Lightning ddp with traditional command #7480

Merged · 24 commits · Jul 14, 2021

Conversation

@awaelchli (Contributor) commented May 11, 2021

What does this PR do?

Fixes #7003

The Lightning environment provides a convenient way to launch DDP multi-GPU experiments: it automatically launches the required number of processes under the hood, as explained in the docs. However, there is currently only a hacky way for the user to prevent this when they wish to launch all processes manually, through the command line or with utilities like torch.distributed.launch. This PR adds detection of the LOCAL_RANK environment variable and uses it to determine whether processes still need to be launched.
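A minimal sketch of the detection logic described above (the helper names are hypothetical, not the exact Lightning implementation):

```python
import os

def processes_already_launched() -> bool:
    """Return True when an external launcher (or the user) has already
    set LOCAL_RANK, meaning Lightning should not spawn child processes."""
    return "LOCAL_RANK" in os.environ

def maybe_launch(num_processes: int) -> None:
    if processes_already_launched():
        # Processes were created externally (e.g. one per terminal);
        # each one simply runs the training script as-is.
        return
    # Otherwise Lightning would spawn num_processes workers itself,
    # assigning LOCAL_RANK=0..num_processes-1 to each child.
    ...
```

The key design point is that torch.distributed.launch and similar utilities set LOCAL_RANK for each worker, so its presence is a reliable signal that launching has already happened.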

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

@awaelchli awaelchli added the distributed Generic distributed-related topic label May 11, 2021
@codecov codecov bot commented May 11, 2021

Codecov Report

Merging #7480 (bb5964f) into master (6ce77a1) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #7480   +/-   ##
======================================
- Coverage      93%     92%   -0%     
======================================
  Files         216     216           
  Lines       14115   14112    -3     
======================================
- Hits        13088   13017   -71     
- Misses       1027    1095   +68     

@tchaton (Contributor) left a comment:

Nice :)

@stale stale bot commented Jun 4, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Jun 4, 2021
@awaelchli awaelchli added this to the v1.4 milestone Jun 8, 2021
@stale stale bot closed this Jun 14, 2021
@kaushikb11 kaushikb11 reopened this Jun 14, 2021
@stale stale bot removed the won't fix This will not be worked on label Jun 14, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Jun 14, 2021
@awaelchli awaelchli marked this pull request as ready for review July 7, 2021 10:24
@carmocca (Contributor) left a comment:

LGTM. Some comments:

  • Can you merge master, to make sure none of the recent changes have any impact?
  • Should we add a special test using torch.distributed.launch?

@mergify mergify bot added the has conflicts label Jul 7, 2021
@mergify mergify bot removed the has conflicts label Jul 7, 2021
@awaelchli (Contributor, Author) commented:

@carmocca adding a test for torch.distributed.launch/run would not really cover anything in this PR, as it would invoke the TorchElastic environment anyway. This PR is about the manual way of launching individual processes (say, in different terminals or from a script).

I added such calls to the special_tests.sh file, please have a look.

It adds about 10s of special test time.
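The manual launch pattern described above can be sketched as follows. This is illustrative only: `train.py` is an assumed script name, and a real DDP run would also need rendezvous variables such as MASTER_ADDR and MASTER_PORT; the point here is that each manually started process carries its own LOCAL_RANK.

```python
import os
import subprocess
import sys

def launch_manually(script: str, world_size: int) -> list:
    """Start one process per rank, as a user would do by hand in
    separate terminals. Setting LOCAL_RANK signals Lightning not to
    spawn any additional processes itself."""
    procs = []
    for rank in range(world_size):
        env = dict(os.environ)
        env["LOCAL_RANK"] = str(rank)        # rank of this worker on the node
        env["WORLD_SIZE"] = str(world_size)  # total number of workers
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    return procs
```

Running each command in a separate terminal (with the environment variables exported by hand) is equivalent; the subprocess loop just automates it.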

awaelchli and others added 2 commits July 8, 2021 01:28
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@tchaton (Contributor) left a comment:

LGTM !

@Borda Borda requested a review from justusschock July 9, 2021 11:38
@awaelchli (Contributor, Author) commented Jul 9, 2021

Yesterday the GPU tests passed flawlessly. Of course they fail today, now that I get reviews.
RIP
My PRs are cursed, always.

@mergify mergify bot added the has conflicts label Jul 9, 2021
@edenlightning edenlightning modified the milestones: v1.4, v1.3.x Jul 9, 2021
Labels: distributed (Generic distributed-related topic) · Projects: none yet · 8 participants