Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaces ddp .spawn with subprocess #2029

Merged
merged 114 commits into from Jun 1, 2020
Merged

Replaces ddp .spawn with subprocess #2029

merged 114 commits into from Jun 1, 2020

Conversation

williamFalcon
Copy link
Contributor

@williamFalcon williamFalcon commented May 31, 2020

Problem

A ton of the reported issues on DDP when using a single node are because of the .spawn used in the implementation.

  • .test() not working properly
  • .ckpt artifacts
  • inability to use num_workers > 0 in dataloader
  • wrong visible cuda devices
  • all sorts of other issues

But more importantly... removes the need to pickle things!

Approach

Here we remove .spawn and instead use subprocess to spin-off independent script calls while treating the original script as the master.

Fixes

(Issue authors, please verify if it does)
Fixes #2028
Fixes #1981
Fixes #1972
Fixes #1970
Fixes #1943
Fixes #1942
Fixes #1890
Fixes #1834
Fixes #1831
Fixes #1774
Fixes #1710
Fixes #1704
Fixes #1694
Fixes #1682
Fixes #1479
Fixes #1461
Fixes #1376
Fixes #981
Fixes #965
Fixes #958

Maybe fixes (authors, please check if it does)
#1761
#1714
#1550
#1542
#1846

Additional checks

  • dp
  • ddp on interactive notebook or shell
  • ddp on SLURM cluster (single node)
  • ddp on SLURM cluster (multi node)
  • ddp_cpu (@neggert)

I verified the following cases, but @tullie @justusschock @ethanwharris @jeremyjordan check on your own to make sure we have no issues?

Review

Sorry to ping multiple people here lol, just want a good number of sanity checks since this is so critical.
@justusschock i had to comment out the signal thing... not sure what that was solving but it was causing the ctrl+c to not work in killing the processes. I did add a fix to the keyboard thing which now registers ctrl+c only once and avoids leaving hanging procs

@Borda Borda added the Important label Jun 1, 2020
@Borda Borda added this to in Progress in Key features - Roadmap v1.0 Jun 1, 2020
@williamFalcon williamFalcon merged commit 82a2029 into master Jun 1, 2020
Key features - Roadmap v1.0 automation moved this from in Progress to Done Jun 1, 2020
@Borda Borda deleted the ddp branch June 1, 2020 15:06
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that lowering max_epochs is a solution, just hiding the true issue :{

os.environ['LOCAL_RANK'] = '0'

# pull out the commands used to run the script and resolve the abs file path
command = sys.argv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assumes that it can be called only from cmd with Trainer like arguments...
but what about just a sprit with loaded params from a file...
cc @PyTorchLightning/core-contributors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example?
these tests were done with cli flags and also trainer args. this local rank is coming from the gpus flag in the trainer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also assumes that it is called as (python) path/to/some/script.py and not e.g. from a console entry point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _signal_kill_handler(*args):
return TrainerTrainLoopMixin.run_training_teardown(self)

orig_signal_handlers = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned @justusschock i disabled this since it was a blocker.
TBH, i don't know how this was allowed on master with the exceptions being thrown every time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with temporarily disabling this. In our CI these exceptions did not appear

tests/base/utils.py Show resolved Hide resolved
tests/trainer/test_dataloaders.py Show resolved Hide resolved
@mergify mergify bot requested a review from a team June 1, 2020 15:19
@s-rog
Copy link
Contributor

s-rog commented Jun 1, 2020

ddp on interactive notebook or shell

Does this merge make ddp work in jupyter notebooks now?

@Borda Borda mentioned this pull request Jun 1, 2020
Comment on lines -122 to +121
weights_summary: Optional[str] = 'full',
weights_summary: Optional[str] = 'top',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not reflected in the docs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind shot a fix PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes unless it was not intentional. @williamFalcon yes or no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, forgot to update docs on this.
i propose we make it top going forward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it here #2021, hope it's fine including it there.

@mergify mergify bot requested a review from a team June 1, 2020 20:11
This was referenced Jun 2, 2020
LoicGrobol added a commit to LoicGrobol/zeldarose that referenced this pull request Jun 7, 2020
justusschock pushed a commit that referenced this pull request Jun 29, 2020
* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* replace ddp spawn with subprocess

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix

* hot fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority: 0 High priority task
Projects
No open projects
7 participants