Bug description
Would it be possible for Lightning to raise an error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE
in case both are set?
With a single node the current behavior is:
- SLURM_NTASKS == SLURM_NTASKS_PER_NODE: everything is fine.
- SLURM_NTASKS > SLURM_NTASKS_PER_NODE: Slurm doesn't let you schedule the job and raises an error.
- SLURM_NTASKS < SLURM_NTASKS_PER_NODE: Lightning thinks there are SLURM_NTASKS_PER_NODE devices, but the job only runs on SLURM_NTASKS devices (a sketch of the requested check follows this list).
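For illustration, here is a minimal sketch of the kind of consistency check this request describes. It is not Lightning's actual code; the function name and the use of SLURM_NNODES to generalize beyond a single node are assumptions:

import os


def validate_slurm_task_layout() -> None:
    """Hypothetical check: fail fast when SLURM_NTASKS disagrees with
    SLURM_NTASKS_PER_NODE (times the node count), instead of silently
    running on fewer devices than Lightning reports."""
    ntasks = os.environ.get("SLURM_NTASKS")
    ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if ntasks is None or ntasks_per_node is None:
        return  # nothing to validate unless both variables are set
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    if int(ntasks) != int(ntasks_per_node) * nnodes:
        raise RuntimeError(
            f"SLURM_NTASKS={ntasks} does not match SLURM_NTASKS_PER_NODE="
            f"{ntasks_per_node} on {nnodes} node(s); fewer processes will be "
            "launched than the number of devices Lightning detects."
        )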
Example scripts:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
source .venv/bin/activate
srun python train_lightning.py
And train_lightning.py:
from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()
This generates the following output:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]
| Name | Type | Params | Mode
-----------------------------------------
0 | layer | Linear | 66 | train
-----------------------------------------
66 Trainable params
0 Non-trainable params
66 Total params
0.000 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.
LOCAL_RANK=0, SLURM_NTASKS=1, SLURM_NTASKS_PER_NODE=2
trainer.num_devices: 2
MEMBER: 1/1 indicates that only 1 GPU is used, but trainer.num_devices returns 2. nvidia-smi also indicates that only a single device is used.
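One way to confirm the actual process count at runtime is to query torch.distributed once the process group is up. This is purely a diagnostic sketch, not part of the original script; subclassing BoringModel and using the on_fit_start hook is just one convenient place to put the check:

import torch.distributed as dist
from pytorch_lightning.demos.boring_classes import BoringModel


class CheckedBoringModel(BoringModel):
    def on_fit_start(self) -> None:
        # In the run above this would print "actual world size: 1"
        # (matching "Starting with 1 processes" in the log),
        # even though trainer.num_devices reports 2.
        if dist.is_available() and dist.is_initialized():
            print(f"actual world size: {dist.get_world_size()}")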
Not sure if there is a valid use case for SLURM_NTASKS < SLURM_NTASKS_PER_NODE, but if there isn't, it would be awesome if Lightning could raise an error in this scenario.
The same problem also happens if --ntasks-per-node is not set. In that case Lightning assumes 2 devices (I guess based on CUDA_VISIBLE_DEVICES), but in reality only a single one is used.
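Until such a check exists in Lightning itself, a user-side guard is possible. The sketch below is only an illustration: the helper name and the SLURM_NNODES fallback are assumptions, while trainer.num_devices is the attribute shown in the report above:

import os

from pytorch_lightning import Trainer


def assert_slurm_tasks_match_devices(trainer: Trainer) -> None:
    """Hypothetical guard: raise if SLURM launched fewer tasks per node than the
    Trainer expects devices (the silent under-utilization described above)."""
    ntasks = os.environ.get("SLURM_NTASKS")
    if ntasks is None:
        return  # not running under SLURM, nothing to check
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    tasks_per_node = int(ntasks) // max(nnodes, 1)
    if tasks_per_node < trainer.num_devices:
        raise RuntimeError(
            f"SLURM launched {tasks_per_node} task(s) per node, but the Trainer "
            f"expects {trainer.num_devices} device(s); the remaining GPUs would stay idle."
        )


# Usage in the reproduction script above:
# trainer = Trainer(max_epochs=100)
# assert_slurm_tasks_match_devices(trainer)
# trainer.fit(model, datamodule)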
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- available: True
- version: 12.4
- Lightning:
- lightning-utilities: 0.11.8
- pytorch-lightning: 2.4.0
- torch: 2.5.1
- torchmetrics: 1.4.3
- torchvision: 0.20.1
- Packages:
- aenum: 3.1.15
- aiohappyeyeballs: 2.4.3
- aiohttp: 3.10.10
- aiosignal: 1.3.1
- annotated-types: 0.7.0
- antlr4-python3-runtime: 4.9.3
- attrs: 24.2.0
- autocommand: 2.2.2
- backports.tarfile: 1.2.0
- certifi: 2024.8.30
- charset-normalizer: 3.4.0
- eval-type-backport: 0.2.0
- filelock: 3.16.1
- frozenlist: 1.5.0
- fsspec: 2024.10.0
- hydra-core: 1.3.2
- idna: 3.10
- importlib-metadata: 8.0.0
- importlib-resources: 6.4.0
- inflect: 7.3.1
- jaraco.collections: 5.1.0
- jaraco.context: 5.3.0
- jaraco.functools: 4.0.1
- jaraco.text: 3.12.1
- jinja2: 3.1.4
- lightly: 1.5.13
- lightning-utilities: 0.11.8
- markupsafe: 3.0.2
- more-itertools: 10.3.0
- mpmath: 1.3.0
- multidict: 6.1.0
- networkx: 3.4.2
- numpy: 2.1.3
- nvidia-cublas-cu12: 12.4.5.8
- nvidia-cuda-cupti-cu12: 12.4.127
- nvidia-cuda-nvrtc-cu12: 12.4.127
- nvidia-cuda-runtime-cu12: 12.4.127
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.2.1.3
- nvidia-curand-cu12: 10.3.5.147
- nvidia-cusolver-cu12: 11.6.1.9
- nvidia-cusparse-cu12: 12.3.1.170
- nvidia-nccl-cu12: 2.21.5
- nvidia-nvjitlink-cu12: 12.4.127
- nvidia-nvtx-cu12: 12.4.127
- omegaconf: 2.3.0
- packaging: 24.1
- pillow: 11.0.0
- platformdirs: 4.2.2
- propcache: 0.2.0
- psutil: 6.1.0
- pyarrow: 18.0.0
- pydantic: 2.9.2
- pydantic-core: 2.23.4
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pytz: 2024.2
- pyyaml: 6.0.2
- requests: 2.32.3
- setuptools: 75.3.0
- six: 1.16.0
- sympy: 1.13.1
- tomli: 2.0.1
- torch: 2.5.1
- torchmetrics: 1.4.3
- torchvision: 0.20.1
- tqdm: 4.66.6
- triton: 3.1.0
- typeguard: 4.3.0
- typing-extensions: 4.12.2
- urllib3: 2.2.3
- wheel: 0.43.0
- yarl: 1.17.1
- zipp: 3.19.2
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.12.3
- release: 6.8.0-38-generic
- version: #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024
More info
No response