Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Add barriers before and after setup hook is run #7202

Merged
merged 6 commits into from
Apr 27, 2021

Conversation

ananthsub
Copy link
Contributor

@ananthsub ananthsub commented Apr 24, 2021

What does this PR do?

Reported on slack:

Hi, when prepare_data_per_node=False, aren't the non-zero nodes still supposed to wait for prepare_data to be done on node 0 rank 0 before calling setup?

Currently we delay the process group initialization until accelerator.setup_environment. prepare_data is called before, which works fine for when lightning launches subprocess/spawns the trainer processes for distributed training, but not so for when the user has launched training outside of the trainer.

This adds a barrier immediately before and after the setup hook is called.

We indirectly address this by adding barriers before setup runs (to hedge against theprepare_data case) as well as after setup hooks run in case the model defines layers inside of setup in order to synchronize before the module is wrapped in DDP.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@codecov
Copy link

codecov bot commented Apr 24, 2021

Codecov Report

Merging #7202 (da9bd6f) into master (52a5cee) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #7202    +/-   ##
=======================================
- Coverage      91%     87%    -4%     
=======================================
  Files         198     198            
  Lines       12647   12725    +78     
=======================================
- Hits        11531   11063   -468     
- Misses       1116    1662   +546     

@Evpok
Copy link

Evpok commented Apr 25, 2021

Hi, thanks for the patch! It looks sensible, but it doesn't seem to fix the issue on my end. The issue seem to be that at this point, DDP is not yet set up so the barrier is a no-op. I think the right place to put it would be between environment setup and module setup

@ananthsub
Copy link
Contributor Author

@awaelchli @tchaton this feels like a bug that prepare data can be called without other ranks waiting for it to finish, especially if setup depends on the results of prepare data being available. how can we conditionally move the process group initialization sooner so that we can use barrier when running with cluster environments that already create the processes?

@ananthsub ananthsub changed the title [fix] Add barrier after prepare data hooks are run [fix] Add barrier after setup hook is run Apr 26, 2021
Copy link

@Evpok Evpok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you still need a barrier before running setup()?

@awaelchli awaelchli added bug Something isn't working distributed Generic distributed-related topic labels Apr 26, 2021
@awaelchli
Copy link
Member

Only the barrier before the setup hook can solve the reported issue right? The barrier after setup may be optional.

@ananthsub
Copy link
Contributor Author

ananthsub commented Apr 26, 2021

Only the barrier before the setup hook can solve the reported issue right? The barrier after setup may be optional.

Yes you're right. Ideally we have a barrier immediately after prepare_data, not just before setup

The barrier after setup may be optional.

the barrier here could be helpful in DDP if users define layers in setup which we need to wait on before wrapping the module in DDP. I didn't see a barrier between setup and pre_dispatch

@carmocca
Copy link
Contributor

should these barriers be part of the trainer here: ... Or inside of each connector?

IMO inside each connector. This way the trainer code is a tiny bit easier to read :)

@ananthsub ananthsub changed the title [fix] Add barrier after setup hook is run [fix] Add barriers before and after setup hook is run Apr 27, 2021
@ananthsub ananthsub added this to the v1.3 milestone Apr 27, 2021
@ananthsub
Copy link
Contributor Author

@Evpok mind taking another look to see if this would mitigate/solve the issue for you?

@Evpok
Copy link

Evpok commented Apr 27, 2021

@ananthsub Yes it does, thank you for that!

@carmocca carmocca added the ready PRs ready to be merged label Apr 27, 2021
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@tchaton tchaton merged commit bab7225 into Lightning-AI:master Apr 27, 2021
@ananthsub ananthsub deleted the prepare-data-barrier branch April 27, 2021 16:20
@awaelchli
Copy link
Member

awaelchli commented Apr 27, 2021

Careful! You merged with TPU tests failing. We probably need to add an additional check in the barrier for TPU plugin.
cc @kaushikb11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working distributed Generic distributed-related topic ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants