TrainerDDPMixin.resolve_root_node_address fails if the host name contains a dash #1943

Closed
LoicGrobol opened this issue May 25, 2020 · 1 comment · Fixed by #1954 or #2029
Labels
help wanted Open to be worked on

Comments

LoicGrobol (Contributor) commented May 25, 2020

🐛 Bug

When running under SLURM, if the nodes' host names contain a dash (-), MASTER_ADDR omits the number part of the master host name (which of course makes everything else fail).

If MASTER_ADDR is not set, PyTorch Lightning infers it from the SLURM node list using TrainerDDPMixin.resolve_root_node_address (see the sketch below).
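
For context, here is a rough sketch of that inference path, reconstructed from the description above and the traceback below; it is not the exact library code, and the 127.0.0.1 fallback is an assumption:

import os
from pytorch_lightning import Trainer

trainer = Trainer()

# Under SLURM, when MASTER_ADDR is not already set, the root address is derived
# from the SLURM node list before torch.distributed's env:// rendezvous runs.
if "MASTER_ADDR" not in os.environ:
    node_list = os.environ.get("SLURM_NODELIST", "127.0.0.1")
    os.environ["MASTER_ADDR"] = trainer.resolve_root_node_address(node_list)
    # With SLURM_NODELIST='jean-zay-ia[810,817-819]' this sets MASTER_ADDR='jean-zay-ia'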

To Reproduce

Steps to reproduce the behavior:

  • Use PyTorch Lightning on a SLURM cluster where the node host names contain at least one dash (e.g. jean-zay-ia810)

The job should fail with a traceback like this:

  File "<snip>/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 834, in fit
    self.ddp_train(task, model)
  File "<snip>/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 353, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)
  File "<snip>/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 957, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=proc_rank, world_size=world_size)
  File "<snip>/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "<snip>/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known

In my case, inspecting os.environ shows 'SLURM_NODELIST': 'jean-zay-ia[810,817-819]' and 'MASTER_ADDR': 'jean-zay-ia'.

The root cause is that

>>> resolve_root_node_address("jean-zay-ia[810,817-819]")
'jean-zay-ia'

Fixing it should not be hard; should I submit a PR?
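
For reference, a minimal sketch of dash-safe parsing (illustrative only, not the patch that landed via #1954 / #2029): split off the bracketed number list first, and only then resolve commas and ranges inside it, so that dashes in the host name itself are left alone.

import re

def resolve_root_node_address(root_node: str) -> str:
    # e.g. 'jean-zay-ia[810,817-819]' -> 'jean-zay-ia810'
    if '[' in root_node:
        # Keep the literal host prefix (which may itself contain dashes) intact
        # and only parse the bracketed number list.
        name, numbers = root_node.split('[', maxsplit=1)
        first = numbers.split(',', maxsplit=1)[0]  # first entry, e.g. '810' or '817-819'
        first = first.split('-', maxsplit=1)[0]    # start of a range like '817-819'
        first = re.sub(r'[^0-9]', '', first)       # strip a trailing ']' if present
        return name + first
    # No bracket expansion: SLURM gave a plain host name (or comma-separated list)
    return root_node.split(',', maxsplit=1)[0]

With that, resolve_root_node_address("jean-zay-ia[810,817-819]") gives 'jean-zay-ia810'.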

LoicGrobol added the help wanted (Open to be worked on) label on May 25, 2020
@github-actions (bot) commented:

Hi! Thanks for your contribution! Great first issue!
