
Why does running Lightning on SLURM with python perform worse than with srun? #18650

Open · Muennighoff opened this issue Sep 27, 2023 · 19 comments
Labels: environment: slurm, help wanted, question, ver: 2.0.x

Muennighoff commented Sep 27, 2023

Bug description

I'm training LLMs across multiple GPUs on a single node using Nvidia/NeMo.
When launching via python train.py inside an allocation I get much worse performance than when launching directly via srun. In the first case PyTorch Lightning also raises the warning: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun. See below:

107 TFLOPS:

srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 --pty bash
source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 python train.py

172 TFLOPS:

source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 python -u train.py

Why does the first one perform worse? Is there perhaps a difference in how these two strategies launch the torch.distributed process group? (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing)
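For reference, a quick diagnostic sketch that could be dropped into train.py to compare how each rank is set up under the two launch modes (it only reads standard SLURM and torch.distributed environment variables; nothing here is specific to this setup):

import os

import torch.distributed as dist

# Print the launch-related environment each rank sees under the two setups.
keys = ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR"]
print({k: os.environ.get(k) for k in keys})

# Once the trainer has initialized distributed training, inspect the process group.
if dist.is_available() and dist.is_initialized():
    print(f"backend={dist.get_backend()} rank={dist.get_rank()} world_size={dist.get_world_size()}")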

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli

@awaelchli (Member)

I don't have an answer to your question, but I find the first way of launching very strange; I didn't know it worked and I'm surprised it does. If you ever find out why, I'd be interested.

But I hope our guide here is good. It's our recommended way to launch multinode jobs.

stas00 (Contributor) commented Sep 29, 2023

@Muennighoff, what's inside train.py?

The way you're launching it leads, in many frameworks, to DP rather than DDP (though perhaps not in PTL), so you might not be comparing apples to apples.

To make a fair comparison with srun you need an actual multi-GPU launcher, e.g. torchrun.

NB: @awaelchli has just fixed torchrun running under the SLURM env in #18618.
If you need a workaround without this fix (since PTL@HEAD won't work with NeMo), set SLURM_JOB_NAME=interactive torchrun ...; if you're using standalone PTL, just use its bleeding edge.
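Roughly, a sketch of that workaround inside the allocation (the torchrun arguments and the script name below are placeholders, not from this issue):

# Make Lightning treat the job as interactive, so it does not expect processes to be
# launched by srun, and let torchrun spawn one process per GPU (4 is a placeholder).
SLURM_JOB_NAME=interactive torchrun --nproc_per_node=4 train.py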

@konstantinjdobler (Contributor)

For single-node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns True, pass LightningEnvironment() as a plugin to the Trainer or Fabric (see the sketch below). This prevents the SLURMEnvironment from being loaded and lets the regular Lightning launcher create the processes. I haven't seen any performance degradation with this method so far. I do not use the --ntasks-per-node=4 flag in the srun, though.
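A minimal sketch of that setup (import paths assume pytorch_lightning 2.x; adjust them for your installation):

import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment, SLURMEnvironment

plugins = []
if SLURMEnvironment.detect():
    # Inside a SLURM allocation: opt out of the SLURM plugin so the regular
    # Lightning launcher spawns one process per device itself.
    plugins.append(LightningEnvironment())

trainer = pl.Trainer(accelerator="gpu", devices=4, plugins=plugins)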

@awaelchli As an additional data point for the use case of launching in an allocation: I am running jobs on multiple clusters, some using SLURM and some not. It's just easier if all jobs can be launched the same way, regardless of whether SLURM is available.

stas00 (Contributor) commented Oct 4, 2023

Thank you for sharing your insights, Konstantin

In general the problem is that https://github.com/NVIDIA/NeMo, which is what we use, abstracts all the PTL bits away, giving the user an API that doesn't expose PTL; we only have a config file for customizations, so such coding workarounds aren't available to us.

konstantinjdobler (Contributor) commented Oct 5, 2023

My bad, I didn't realize NVIDIA/NeMo has this limitation. Perhaps this will be useful to someone else who stumbles upon this thread because of SLURM-related problems. I know it took me some time to figure out why things were working on some clusters but not on those using SLURM.

@awaelchli (Member)

> For single-node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns True, pass LightningEnvironment() as a plugin to the Trainer or Fabric. This prevents the SLURMEnvironment from being loaded and lets the regular Lightning launcher create the processes. I haven't seen any performance degradation with this method so far. I do not use the --ntasks-per-node=4 flag in the srun, though.

I don't want to diverge too much from the OP's issue, but I just wanted to point out that if you run interactively on SLURM (i.e. scheduling a machine and then connecting to it), there is a simpler way than changing the code as you described: set the job name to "interactive", and Lightning will know that processes are not being launched externally, so without any code changes it behaves just like running locally on a workstation. See https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode (apologies if you see rendering issues in the docs; we are currently going through a couple of style-template updates).
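For example, a minimal interactive session along the lines of the linked docs (partition and resource flags below are placeholders):

# Request an interactive allocation whose job name tells Lightning that processes
# are not being launched externally (partition/GPU counts are placeholders).
salloc --job-name=interactive --partition=ultra --nodes=1 --gpus=4
# Inside the allocation, launch exactly as on a local workstation:
python train.py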

@Kin-Zhang

I also find that srun speeds things up, but the loss curve doesn't converge as quickly. Here are the two figures. By the way: if you are logging with wandb and run srun inside sbatch, the wandb system logging only shows one GPU's utilization, although judging by the speed all of them are used.

[figures omitted: wandb loss curves and GPU-utilization panels, with and without srun]

Without srun, the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
python 1_train.py

With srun, the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
srun python 1_train.py

schlabrendorff commented Dec 30, 2023

For me the bottleneck was that, by default, a task on my SLURM cluster only gets 1 CPU core.
Without srun, Python failed to use multiple GPUs; with srun, each task was allocated only that single core, resulting in a dataloader bottleneck (see the check below).
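A minimal sketch of checking this from inside the training script (the dataset and the worker sizing below are illustrative placeholders, not from the actual setup):

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

cores = len(os.sched_getaffinity(0))  # CPU cores this process is actually allowed to use
print(f"cores available to this task: {cores}")  # 1 means data loading will bottleneck

dataset = TensorDataset(torch.randn(1024, 16))  # placeholder dataset
# Leave one core for the training loop; with a 1-core allocation this falls back
# to num_workers=0 (data is loaded in the main process).
loader = DataLoader(dataset, batch_size=32, num_workers=max(cores - 1, 0))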

Hidden in the sbatch documentation it says that the -c, --cpus-per-task=<ncpus> flag is ignored (not inherited by srun job steps); instead one needs to set the SRUN_CPUS_PER_TASK environment variable.

I achieved good performance and full GPU utilization with this script:

#!/usr/bin/bash -l

#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=a100:4
#SBATCH --mem=180G
#SBATCH --cpus-per-task=8
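# As noted above, srun here does not inherit --cpus-per-task from sbatch, so export it for the job step: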
export SRUN_CPUS_PER_TASK=8

conda activate lit
srun python myscript.py --slurm_id ${SLURM_JOB_ID} "$@"

stas00 (Contributor) commented Dec 31, 2023

That's an interesting discovery, @schlabrendorff - though I'm not sure this is always the case.

At least it doesn't seem to impact my setup. I checked: I'm using SLURM 22.05.09, and I use:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --exclusive

and then I use srun torchrun (or another torchrun-like launcher), which launches 8 processes plus workers, and they all get at least 1 core each - note that I'm using ntasks-per-node=1. I don't repeat --cpus-per-task with srun, nor do I have SRUN_CPUS_PER_TASK set.

I wonder if this has to do with --exclusive somehow, so it gives me all the cores? In the srun manpage for this option it says:

... but job steps will be allocated all CPUs available to the job on all nodes allocated to the steps.

Oddly I don't see any cpus-per-task related entries in scontrol show config

Which SLURM version are you using? (srun --version).

And can you check if everything works if you go back to the previous setup and add --exclusive?

@schlabrendorff

My cluster runs SLURM 23.02.6, which might explain the difference!
I will try to experiment with the --exclusive flag, but I might not get the chance to do so soon.

stas00 (Contributor) commented Dec 31, 2023

Ah, that's a possibility: perhaps they planned the switch in 22.05 but didn't actually make it until 23.x.

When I get a chance I can try the reverse: removing --exclusive and checking that it still works. But it's good to prepare ahead of time regardless, so your insight is very helpful; thank you for sharing.

stas00 (Contributor) commented Dec 31, 2023

OK, I tested it: with or without --exclusive I get all the available cores in the setup above:

$ scontrol show -d job 3182_1 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

It'd be 1 and not 48 if the inheritance of --cpus-per-task from sbatch didn't work.

So it must be that the version I'm running is still not affected.

@schlabrendorff

I think the Request Defaults might also come into play in my scenario, since my cluster documentation specifies:

Request Defaults:
Unless specified, your jobs will run with the following options to salloc and sbatch options for this partition.
--time=01:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120

Could it be that your default is all (available?) CPUs?

stas00 (Contributor) commented Dec 31, 2023

How do I get the Request Defaults?

@schlabrendorff

They were listed on my university's cluster documentation page. I guess besides trying to read your /etc/slurm/slurm.conf, you could submit an empty job with absolutely no resources requested and check it with scontrol show job?

stas00 (Contributor) commented Dec 31, 2023

Great idea!

I did:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

and got:

$ scontrol show -d job 3185 | grep CPUs/Task
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

so the default is indeed 1.

If I do:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

I get 48

scontrol show -d job 3186 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

Bottom line: the "inheritance" still works in SLURM 22.05.09

stas00 (Contributor) commented Jan 1, 2024

I realized I got my diagnostics wrong: scontrol show -d job shows the sbatch/salloc settings; it doesn't know anything about srun.

Using len(os.sched_getaffinity(0)) should give us the correct diagnostics, as it shows which CPU cores the current process is eligible to use.

So here is the updated test:

$ cat test.slurm
#!/bin/bash
#SBATCH --job-name=test-cpu-cores-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00
#SBATCH --partition=x
#SBATCH --output=%x-%j.out

srun --jobid $SLURM_JOB_ID python -c 'import os; print(f"cpu cores: {len(os.sched_getaffinity(0))}")'

gives:

cpu cores: 48

The previous comment's conclusion still holds.

I documented it here

@awaelchli (Member)

There was never a clear conclusion here, was there? Is there anything we need to act on from the Lightning side?

TeddLi commented Apr 22, 2024

> Without srun, Python failed to use multiple GPUs.

Just wondering what the final results were with srun and without srun. Does srun give worse results?
