RFC: Validate Trainer settings against cluster environment settings #10107
Labels: distributed, environment: slurm, feature
🚀 Feature
Introduce a validation check that verifies Trainer settings such as the number of GPUs and the number of nodes are compatible with the cluster environment currently in use.
Motivation
There have been several issues where users did not set the number of devices or the number of nodes correctly on clusters (#10098, #8993, #4612). The result: processes hang for no apparent reason. The user is confused and does not know what to do. It is also hard for us to debug, because often the user just reports "DDP hangs, help!", and there could be a million other reasons why DDP could stall.
Pitch
Introduce a `ClusterEnvironment.validate_distributed_settings` method (name up for negotiation). This would be an optional method on the `ClusterEnvironment`, and the argument list could be extended with other arguments as needed.
In the accelerator connector, when setting up the environments, we would call this method. Roughly here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/pytorch_lightning/trainer/connectors/accelerator_connector.py#L794-L807
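A minimal sketch of what such a hook could look like for a SLURM environment. The class and method below are hypothetical (the method name is up for negotiation, per the pitch); `SLURM_NNODES` and `SLURM_NTASKS_PER_NODE` are real SLURM-provided environment variables, but the comparison logic is only an illustration of the idea:

```python
import os


class SLURMEnvironment:
    """Hypothetical stand-in for a Lightning ClusterEnvironment on SLURM."""

    def validate_distributed_settings(self, num_devices: int, num_nodes: int) -> None:
        """Compare Trainer settings against what SLURM actually allocated,
        and fail fast with an actionable error instead of hanging in DDP."""
        slurm_nodes = os.environ.get("SLURM_NNODES")
        if slurm_nodes is not None and int(slurm_nodes) != num_nodes:
            raise ValueError(
                f"Trainer(num_nodes={num_nodes}) does not match the SLURM "
                f"allocation (SLURM_NNODES={slurm_nodes})."
            )
        tasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
        if tasks_per_node is not None and int(tasks_per_node) != num_devices:
            raise ValueError(
                f"Trainer requested {num_devices} devices per node, but SLURM "
                f"allocated {tasks_per_node} tasks per node "
                f"(SLURM_NTASKS_PER_NODE={tasks_per_node})."
            )
```

The accelerator connector would call this once during setup, so a mismatch surfaces as a clear `ValueError` at startup rather than as a silent DDP stall at `init_process_group` time.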
Alternatives
Alternative suggestions welcome.
Additional context
This would resolve #10098, #8993, #4612, and perhaps others.