RFC: Validate Trainer settings against cluster environment settings #10107
Labels: distributed, environment: slurm, feature
🚀 Feature
Introduce a validation check that verifies Trainer settings such as the number of GPUs and the number of nodes are compatible with the cluster environment currently in use.
Motivation
There have been several issues where users did not set the number of devices or the number of nodes correctly on clusters (#10098, #8993, #4612). The result: processes hang for no apparent reason. The user is confused and does not know what to do. It is also hard for us to debug, because often the user just reports "DDP hangs, help!", and there could be a million other reasons why DDP could stall.
Pitch
Introduce a `ClusterEnvironment.validate_distributed_settings` method (name up for negotiation). This would be an optional method on the `ClusterEnvironment`, and the argument list could be extended with other arguments as needed.
In the accelerator connector, when setting up the environments, we would call this method. Roughly here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/pytorch_lightning/trainer/connectors/accelerator_connector.py#L794-L807
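A minimal sketch of what such a hook could look like for a SLURM environment. The class and method below are hypothetical (the method name is up for negotiation, per the pitch); `SLURM_NNODES` and `SLURM_NTASKS_PER_NODE` are real SLURM-provided environment variables, but the comparison logic is only an illustration of the idea:

```python
import os


class SLURMEnvironment:
    """Hypothetical stand-in for a Lightning ClusterEnvironment on SLURM."""

    def validate_distributed_settings(self, num_devices: int, num_nodes: int) -> None:
        """Compare Trainer settings against what SLURM actually allocated,
        and fail fast with an actionable error instead of hanging in DDP."""
        slurm_nodes = os.environ.get("SLURM_NNODES")
        if slurm_nodes is not None and int(slurm_nodes) != num_nodes:
            raise ValueError(
                f"Trainer(num_nodes={num_nodes}) does not match the SLURM "
                f"allocation (SLURM_NNODES={slurm_nodes})."
            )
        tasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
        if tasks_per_node is not None and int(tasks_per_node) != num_devices:
            raise ValueError(
                f"Trainer requested {num_devices} devices per node, but SLURM "
                f"allocated {tasks_per_node} tasks per node "
                f"(SLURM_NTASKS_PER_NODE={tasks_per_node})."
            )
```

The accelerator connector would call this once during setup, so a mismatch surfaces as a clear `ValueError` at startup rather than as a silent DDP stall at `init_process_group` time.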
Alternatives
Alternative suggestions welcome.
Additional context
This would resolve #10098, #8993, #4612, and perhaps others.