
Error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE #20391

@guarin

Bug description

Would it be possible for Lightning to raise an error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE when both are set?

With a single node the current behavior is:

  • SLURM_NTASKS == SLURM_NTASKS_PER_NODE: Everything is fine
  • SLURM_NTASKS > SLURM_NTASKS_PER_NODE: Slurm doesn't let you schedule the job and raises an error
  • SLURM_NTASKS < SLURM_NTASKS_PER_NODE: Lightning thinks there are SLURM_NTASKS_PER_NODE devices but the job only runs on SLURM_NTASKS devices.

Example scripts (note the mismatch between --ntasks=1 and --ntasks-per-node=2):

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3

source .venv/bin/activate
srun python train_lightning.py

And train_lightning.py:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()

This generates the following output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]

  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | layer | Linear | 66     | train
-----------------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.
LOCAL_RANK=0, SLURM_NTASKS=1, SLURM_NTASKS_PER_NODE=2
trainer.num_devices: 2

MEMBER: 1/1 indicates that only one process (and therefore only one GPU) is used, but trainer.num_devices returns 2. nvidia-smi also shows that only a single device is in use.
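
For completeness, the mismatch is also visible programmatically right after the Trainer is constructed. A minimal sketch, assuming it is added to main() above before trainer.fit and only covering the single-node case from this report:

# Sketch: compare the device count Lightning inferred with the task count SLURM launched.
ntasks = os.environ.get("SLURM_NTASKS")
if ntasks is not None and trainer.num_devices != int(ntasks):
    print(
        f"WARNING: trainer.num_devices={trainer.num_devices} but SLURM_NTASKS={ntasks}; "
        "only SLURM_NTASKS processes will actually run."
    )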

I'm not sure whether there is a valid use case for SLURM_NTASKS < SLURM_NTASKS_PER_NODE, but if there isn't, it would be great if Lightning could raise an error in this scenario.

The same problem also occurs if --ntasks-per-node is not set. In that case Lightning assumes 2 devices (presumably based on CUDA_VISIBLE_DEVICES), but in reality only a single one is used.
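
To make the request concrete, below is a rough sketch of the kind of check I have in mind. The function name and placement are hypothetical; it could live wherever Lightning parses the SLURM environment, and it simply compares SLURM_NTASKS against SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES when both task variables are set:

import os


def _validate_slurm_task_counts() -> None:
    # Hypothetical check: raise if SLURM_NTASKS disagrees with
    # SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES when both are set.
    ntasks = os.environ.get("SLURM_NTASKS")
    ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if ntasks is None or ntasks_per_node is None:
        return  # nothing to compare
    num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    expected = int(ntasks_per_node) * num_nodes
    if int(ntasks) != expected:
        raise RuntimeError(
            f"SLURM_NTASKS={ntasks} does not match SLURM_NTASKS_PER_NODE={ntasks_per_node} "
            f"* SLURM_JOB_NUM_NODES={num_nodes} (expected {expected}). "
            "Check the --ntasks and --ntasks-per-node values in your sbatch script."
        )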

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
  • CUDA:
    - GPU:
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - available: True
    - version: 12.4
  • Lightning:
    - lightning-utilities: 0.11.8
    - pytorch-lightning: 2.4.0
    - torch: 2.5.1
    - torchmetrics: 1.4.3
    - torchvision: 0.20.1
  • Packages:
    - aenum: 3.1.15
    - aiohappyeyeballs: 2.4.3
    - aiohttp: 3.10.10
    - aiosignal: 1.3.1
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - attrs: 24.2.0
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - certifi: 2024.8.30
    - charset-normalizer: 3.4.0
    - eval-type-backport: 0.2.0
    - filelock: 3.16.1
    - frozenlist: 1.5.0
    - fsspec: 2024.10.0
    - hydra-core: 1.3.2
    - idna: 3.10
    - importlib-metadata: 8.0.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - jaraco.collections: 5.1.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jinja2: 3.1.4
    - lightly: 1.5.13
    - lightning-utilities: 0.11.8
    - markupsafe: 3.0.2
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.1.0
    - networkx: 3.4.2
    - numpy: 2.1.3
    - nvidia-cublas-cu12: 12.4.5.8
    - nvidia-cuda-cupti-cu12: 12.4.127
    - nvidia-cuda-nvrtc-cu12: 12.4.127
    - nvidia-cuda-runtime-cu12: 12.4.127
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.2.1.3
    - nvidia-curand-cu12: 10.3.5.147
    - nvidia-cusolver-cu12: 11.6.1.9
    - nvidia-cusparse-cu12: 12.3.1.170
    - nvidia-nccl-cu12: 2.21.5
    - nvidia-nvjitlink-cu12: 12.4.127
    - nvidia-nvtx-cu12: 12.4.127
    - omegaconf: 2.3.0
    - packaging: 24.1
    - pillow: 11.0.0
    - platformdirs: 4.2.2
    - propcache: 0.2.0
    - psutil: 6.1.0
    - pyarrow: 18.0.0
    - pydantic: 2.9.2
    - pydantic-core: 2.23.4
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.4.0
    - pytz: 2024.2
    - pyyaml: 6.0.2
    - requests: 2.32.3
    - setuptools: 75.3.0
    - six: 1.16.0
    - sympy: 1.13.1
    - tomli: 2.0.1
    - torch: 2.5.1
    - torchmetrics: 1.4.3
    - torchvision: 0.20.1
    - tqdm: 4.66.6
    - triton: 3.1.0
    - typeguard: 4.3.0
    - typing-extensions: 4.12.2
    - urllib3: 2.2.3
    - wheel: 0.43.0
    - yarl: 1.17.1
    - zipp: 3.19.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.12.3
    - release: 6.8.0-38-generic
    - version: #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024

More info

No response
