
Error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE #20391

@guarin

Bug description

Would it be possible for Lightning to raise an error if SLURM_NTASKS != SLURM_NTASKS_PER_NODE when both are set?

With a single node the current behavior is:

  • SLURM_NTASKS == SLURM_NTASKS_PER_NODE: Everything is fine
  • SLURM_NTASKS > SLURM_NTASKS_PER_NODE: Slurm doesn't let you schedule the job and raises an error
  • SLURM_NTASKS < SLURM_NTASKS_PER_NODE: Lightning thinks there are SLURM_NTASKS_PER_NODE devices but the job only runs on SLURM_NTASKS devices.

Example scripts (note the mismatch between --ntasks=1 and --ntasks-per-node=2):

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3

source .venv/bin/activate
srun python train_lightning.py

And train_lightning.py:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()

This generates the following output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,2]

  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | layer | Linear | 66     | train
-----------------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode
SLURM auto-requeueing enabled. Setting signal handlers.
LOCAL_RANK=0, SLURM_NTASKS=1, SLURM_NTASKS_PER_NODE=2
trainer.num_devices: 2

MEMBER: 1/1 indicates that only one process (and therefore only one GPU) is used, but trainer.num_devices returns 2. nvidia-smi also shows that only a single device is in use.
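
For completeness, the mismatch is also visible programmatically right after the Trainer is constructed. A minimal sketch, assuming it is added to main() above before trainer.fit and only covering the single-node case from this report:

# Sketch: compare the device count Lightning inferred with the task count SLURM launched.
ntasks = os.environ.get("SLURM_NTASKS")
if ntasks is not None and trainer.num_devices != int(ntasks):
    print(
        f"WARNING: trainer.num_devices={trainer.num_devices} but SLURM_NTASKS={ntasks}; "
        "only SLURM_NTASKS processes will actually run."
    )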

I'm not sure whether there is a valid use case for SLURM_NTASKS < SLURM_NTASKS_PER_NODE, but if there isn't, it would be great if Lightning could raise an error in this scenario.

The same problem also occurs if --ntasks-per-node is not set. In that case Lightning assumes 2 devices (presumably based on CUDA_VISIBLE_DEVICES), but in reality only a single one is used.
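
To make the request concrete, below is a rough sketch of the kind of check I have in mind. The function name and placement are hypothetical; it could live wherever Lightning parses the SLURM environment, and it simply compares SLURM_NTASKS against SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES when both task variables are set:

import os


def _validate_slurm_task_counts() -> None:
    # Hypothetical check: raise if SLURM_NTASKS disagrees with
    # SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES when both are set.
    ntasks = os.environ.get("SLURM_NTASKS")
    ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if ntasks is None or ntasks_per_node is None:
        return  # nothing to compare
    num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
    expected = int(ntasks_per_node) * num_nodes
    if int(ntasks) != expected:
        raise RuntimeError(
            f"SLURM_NTASKS={ntasks} does not match SLURM_NTASKS_PER_NODE={ntasks_per_node} "
            f"* SLURM_JOB_NUM_NODES={num_nodes} (expected {expected}). "
            "Check the --ntasks and --ntasks-per-node values in your sbatch script."
        )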

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
  • CUDA:
    - GPU:
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - NVIDIA GeForce RTX 4090
    - available: True
    - version: 12.4
  • Lightning:
    - lightning-utilities: 0.11.8
    - pytorch-lightning: 2.4.0
    - torch: 2.5.1
    - torchmetrics: 1.4.3
    - torchvision: 0.20.1
  • Packages:
    - aenum: 3.1.15
    - aiohappyeyeballs: 2.4.3
    - aiohttp: 3.10.10
    - aiosignal: 1.3.1
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - attrs: 24.2.0
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - certifi: 2024.8.30
    - charset-normalizer: 3.4.0
    - eval-type-backport: 0.2.0
    - filelock: 3.16.1
    - frozenlist: 1.5.0
    - fsspec: 2024.10.0
    - hydra-core: 1.3.2
    - idna: 3.10
    - importlib-metadata: 8.0.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - jaraco.collections: 5.1.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jinja2: 3.1.4
    - lightly: 1.5.13
    - lightning-utilities: 0.11.8
    - markupsafe: 3.0.2
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.1.0
    - networkx: 3.4.2
    - numpy: 2.1.3
    - nvidia-cublas-cu12: 12.4.5.8
    - nvidia-cuda-cupti-cu12: 12.4.127
    - nvidia-cuda-nvrtc-cu12: 12.4.127
    - nvidia-cuda-runtime-cu12: 12.4.127
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.2.1.3
    - nvidia-curand-cu12: 10.3.5.147
    - nvidia-cusolver-cu12: 11.6.1.9
    - nvidia-cusparse-cu12: 12.3.1.170
    - nvidia-nccl-cu12: 2.21.5
    - nvidia-nvjitlink-cu12: 12.4.127
    - nvidia-nvtx-cu12: 12.4.127
    - omegaconf: 2.3.0
    - packaging: 24.1
    - pillow: 11.0.0
    - platformdirs: 4.2.2
    - propcache: 0.2.0
    - psutil: 6.1.0
    - pyarrow: 18.0.0
    - pydantic: 2.9.2
    - pydantic-core: 2.23.4
    - python-dateutil: 2.9.0.post0
    - pytorch-lightning: 2.4.0
    - pytz: 2024.2
    - pyyaml: 6.0.2
    - requests: 2.32.3
    - setuptools: 75.3.0
    - six: 1.16.0
    - sympy: 1.13.1
    - tomli: 2.0.1
    - torch: 2.5.1
    - torchmetrics: 1.4.3
    - torchvision: 0.20.1
    - tqdm: 4.66.6
    - triton: 3.1.0
    - typeguard: 4.3.0
    - typing-extensions: 4.12.2
    - urllib3: 2.2.3
    - wheel: 0.43.0
    - yarl: 1.17.1
    - zipp: 3.19.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.12.3
    - release: 6.8.0-38-generic
    - version: #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun 7 15:25:01 UTC 2024

More info

No response
