
RFC: Validate Trainer settings against cluster environment settings #10107

Closed
awaelchli opened this issue Oct 24, 2021 · 2 comments · Fixed by #18292
Labels: distributed (Generic distributed-related topic), environment: slurm, feature (Is an improvement or enhancement)

Comments

awaelchli (Member) commented Oct 24, 2021

🚀 Feature

Introduce a validation check that verifies that Trainer settings such as gpus and num_nodes are compatible with the cluster environment currently in use.

Motivation

There have been several issues where users did not set the number of devices or the number of nodes correctly on clusters (#10098, #8993, #4612). The result: processes hang for no apparent reason. The user is confused and does not know what to do. It is also hard for us to debug, because often the user just reports "DDP hangs, help!", and there could be a million other reasons why DDP could stall.

Pitch

Introduce a ClusterEnvironment.validate_distributed_settings method (name up for negotiation):

from pytorch_lightning.utilities import rank_zero_warn

class ClusterEnvironment:

    def validate_distributed_settings(self, num_devices: int, num_nodes: int) -> None:
        # implementation may vary depending on the environment
        if self.world_size() != num_devices * num_nodes:
            # raise a warning, e.g., the user should check the SLURM settings
            # for the number of nodes, number of tasks per node, etc.
            rank_zero_warn("Trainer settings do not match the cluster environment settings.")

This would be an optional method on the ClusterEnvironment, and the argument list could also include other parameters.

In the accelerator connector, when setting up the environments, we would call this method. Roughly here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/pytorch_lightning/trainer/connectors/accelerator_connector.py#L794-L807
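
For illustration, a hypothetical SLURM-specific override might compare the Trainer settings against the SLURM_NNODES and SLURM_NTASKS environment variables that SLURM sets for every job. This is a minimal sketch of the idea, not part of the proposal; the exact checks and warning texts are assumptions:

import os

from pytorch_lightning.utilities import rank_zero_warn

class SLURMEnvironment(ClusterEnvironment):

    def validate_distributed_settings(self, num_devices: int, num_nodes: int) -> None:
        # SLURM_NNODES / SLURM_NTASKS are exported by SLURM for every job
        nnodes = int(os.environ.get("SLURM_NNODES", "0"))
        ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
        if nnodes and nnodes != num_nodes:
            rank_zero_warn(
                f"Trainer(num_nodes={num_nodes}) does not match SLURM_NNODES={nnodes}."
            )
        if ntasks and ntasks != num_devices * num_nodes:
            rank_zero_warn(
                f"SLURM launched {ntasks} tasks, but num_devices * num_nodes ="
                f" {num_devices * num_nodes}. Check --ntasks-per-node in your sbatch script."
            )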

Alternatives

Alternative suggestions welcome.

Additional context

This would resolve #10098, #8993, #4612, and perhaps others.



@awaelchli awaelchli added feature Is an improvement or enhancement distributed Generic distributed-related topic environment: slurm labels Oct 24, 2021
@awaelchli awaelchli self-assigned this Oct 24, 2021
@awaelchli awaelchli added this to To be approved in Lighting Sprint Oct 24, 2021
tchaton (Contributor) commented Oct 25, 2021

Dear @awaelchli,

Sounds like a good idea!

As a side note:

Do you think we should refactor:

def select_cluster_environment(self) -> ClusterEnvironment:
    if self._cluster_environment is not None:
        return self._cluster_environment
    if self.is_slurm_managing_tasks:
        env = SLURMEnvironment()
    elif TorchElasticEnvironment.is_using_torchelastic():
        env = TorchElasticEnvironment()
    elif KubeflowEnvironment.is_using_kubeflow():
        env = KubeflowEnvironment()
    elif LSFEnvironment.is_using_lsf():
        env = LSFEnvironment()
    else:
        env = LightningEnvironment()
    return env

into the following, by providing an is_of_cluster_type function:

def select_cluster_environment(self) -> ClusterEnvironment:
    if self._cluster_environment is not None:
        return self._cluster_environment
    if self.is_slurm_managing_tasks:
        env = SLURMEnvironment()
    elif TorchElasticEnvironment.is_of_cluster_type():
        env = TorchElasticEnvironment()
    elif KubeflowEnvironment.is_of_cluster_type():
        env = KubeflowEnvironment()
    elif LSFEnvironment.is_of_cluster_type():
        env = LSFEnvironment()
    else:
        env = LightningEnvironment()
    return env
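
A minimal sketch of what such a classmethod could look like on one of the environments, assuming it checks the same environment variables that the current is_using_torchelastic checks (an illustration, not a final design):

import os

class TorchElasticEnvironment(ClusterEnvironment):

    @staticmethod
    def is_of_cluster_type() -> bool:
        # TorchElastic exports these variables in every process it launches
        required = ("RANK", "GROUP_RANK", "LOCAL_RANK", "LOCAL_WORLD_SIZE")
        return all(var in os.environ for var in required)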

awaelchli (Member, Author) commented
Yes, generalizing the various is_using_xyz methods into a single method has crossed my mind too! I think that's the next logical step as we evolve this :)
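
Taking that generalization one step further, the selection itself could collapse into a loop over candidate environments. An illustrative sketch, assuming each environment implements the proposed is_of_cluster_type:

def select_cluster_environment(self) -> ClusterEnvironment:
    if self._cluster_environment is not None:
        return self._cluster_environment
    if self.is_slurm_managing_tasks:
        return SLURMEnvironment()
    # try the auto-detectable environments in priority order
    for env_type in (TorchElasticEnvironment, KubeflowEnvironment, LSFEnvironment):
        if env_type.is_of_cluster_type():
            return env_type()
    return LightningEnvironment()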

@tchaton tchaton added this to To do in Lightning Sprint via automation Nov 15, 2021
@tchaton tchaton moved this from To do to To be approved in Lightning Sprint Nov 15, 2021
@tchaton tchaton removed this from To be approved in Lightning Sprint Nov 15, 2021
@tchaton tchaton added this to To do in Lightning Sprint via automation Nov 15, 2021
@tchaton tchaton moved this from To do to To be approved in Lightning Sprint Nov 15, 2021
@awaelchli awaelchli changed the title from "Validate trainer settings against cluster environment settings" to "RFC: Validate Trainer settings against cluster environment settings" Nov 17, 2021