
Why does running Lightning on SLURM with python perform worse than with srun? #18650

Open · Muennighoff opened this issue Sep 27, 2023 · 19 comments
Labels: environment: slurm, help wanted, question, ver: 2.0.x

Muennighoff commented Sep 27, 2023

Bug description

I'm training LLMs across multiple GPUs on a single node using Nvidia/NeMo.
When launching via python train.py inside an allocation I get much worse performance than when launching directly via srun. In the first case PyTorch Lightning also raises the warning: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun. See below:

107 TFLOPS:

srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 --pty bash
source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 python train.py

172 TFLOPS:

source /home/niklas/miniconda/bin/activate
conda activate core
CUDA_DEVICE_MAX_CONNECTIONS=1 GPU_PER_NODE=4 srun --unbuffered --exclusive --open-mode=append --partition=ultra --nodes=1 --ntasks-per-node=4 --gpus=4 python -u train.py

Why does the first one perform worse? Is there perhaps a difference in how these two strategies launch the torch.distributed process group? (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing)
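For reference, a quick diagnostic sketch that could be dropped into train.py to compare how each rank is set up under the two launch modes (it only reads standard SLURM and torch.distributed environment variables; nothing here is specific to this setup):

import os

import torch.distributed as dist

# Print the launch-related environment each rank sees under the two setups.
keys = ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR"]
print({k: os.environ.get(k) for k in keys})

# Once the trainer has initialized distributed training, inspect the process group.
if dist.is_available() and dist.is_initialized():
    print(f"backend={dist.get_backend()} rank={dist.get_rank()} world_size={dist.get_world_size()}")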

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

No response

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli

@awaelchli (Member)

I don't have an answer to your question, but I find the first way of launching very strange; I didn't know it worked and I'm surprised it does. If you ever find out why, I'd be interested.

But I hope our guide here is good. It's our recommended way to launch multinode jobs.

stas00 (Contributor) commented Sep 29, 2023

@Muennighoff, what's inside train.py?

The way you're launching it leads, in many frameworks, to DP rather than DDP (though perhaps not in PTL), so you might not be comparing apples to apples.

To make a fair comparison with srun you need an actual multi-GPU launcher, e.g. torchrun.

NB: @awaelchli has just fixed torchrun running under the SLURM env in #18618.
If you need a workaround without this fix (since PTL@HEAD won't work with NeMo), set SLURM_JOB_NAME=interactive torchrun ...; if you're using standalone PTL, just use its bleeding edge.
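Roughly, a sketch of that workaround inside the allocation (the torchrun arguments and the script name below are placeholders, not from this issue):

# Make Lightning treat the job as interactive, so it does not expect processes to be
# launched by srun, and let torchrun spawn one process per GPU (4 is a placeholder).
SLURM_JOB_NAME=interactive torchrun --nproc_per_node=4 train.py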

@konstantinjdobler (Contributor)

For single-node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns True, pass LightningEnvironment() as a plugin to the Trainer or Fabric (see the sketch below). This prevents the SLURMEnvironment from being loaded and lets the regular Lightning launcher create the processes. I haven't seen any performance degradation with this method so far. I do not use the --ntasks-per-node=4 flag in the srun, though.
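A minimal sketch of that setup (import paths assume pytorch_lightning 2.x; adjust them for your installation):

import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment, SLURMEnvironment

plugins = []
if SLURMEnvironment.detect():
    # Inside a SLURM allocation: opt out of the SLURM plugin so the regular
    # Lightning launcher spawns one process per device itself.
    plugins.append(LightningEnvironment())

trainer = pl.Trainer(accelerator="gpu", devices=4, plugins=plugins)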

@awaelchli As an additional data point for the use case of launching in an allocation: I am running jobs on multiple clusters, some using SLURM and some not. It's just easier if all jobs can be launched the same way, regardless of whether SLURM is available.

stas00 (Contributor) commented Oct 4, 2023

Thank you for sharing your insights, Konstantin

In general the problem is that https://github.com/NVIDIA/NeMo, which is what we use, abstracts all the PTL bits away, giving the user an API that doesn't expose PTL; we only have a config file for customizations, so such coding workarounds aren't available to us.

konstantinjdobler (Contributor) commented Oct 5, 2023

My bad, I didn't realize NVIDIA/NeMo has this limitation. Perhaps this will be useful to someone else who stumbles upon this thread because of SLURM-related problems. I know it took me some time to figure out why things were working on some clusters but not on those using SLURM.

@awaelchli (Member)

> For single-node training on a SLURM-managed cluster, I usually call SLURMEnvironment.detect() and, if it returns True, pass LightningEnvironment() as a plugin to the Trainer or Fabric. This prevents the SLURMEnvironment from being loaded and lets the regular Lightning launcher create the processes. I haven't seen any performance degradation with this method so far. I do not use the --ntasks-per-node=4 flag in the srun, though.

I don't want to diverge too much from the OP's issue, but I just wanted to point out that if you run interactively on SLURM (i.e. scheduling a machine and then connecting to it), there is a simpler way than changing the code as you described: set the job name to "interactive", and Lightning will know that processes are not being launched externally, so without any code changes it behaves just like running locally on a workstation. See https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode (apologies if you see rendering issues in the docs; we are currently going through a couple of style-template updates).
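For example, a minimal interactive session along the lines of the linked docs (partition and resource flags below are placeholders):

# Request an interactive allocation whose job name tells Lightning that processes
# are not being launched externally (partition/GPU counts are placeholders).
salloc --job-name=interactive --partition=ultra --nodes=1 --gpus=4
# Inside the allocation, launch exactly as on a local workstation:
python train.py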

@Kin-Zhang

I also find that srun speeds things up, but the loss curve doesn't converge as quickly. Here are the two figures. By the way: if you are logging with wandb and run srun inside sbatch, the wandb system logging only shows one GPU's utilization, although judging by the speed all of them are used.

[figures omitted: wandb loss curves and GPU-utilization panels, with and without srun]

Without srun, the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
python 1_train.py

With srun, the sbatch file is:

#!/bin/bash
#SBATCH -J xx
#....
srun python 1_train.py

schlabrendorff commented Dec 30, 2023

For me the bottleneck was that, by default, a task on my SLURM cluster only gets 1 CPU core.
Without srun, Python failed to use multiple GPUs; with srun, each task was allocated only that single core, resulting in a dataloader bottleneck (see the check below).
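A minimal sketch of checking this from inside the training script (the dataset and the worker sizing below are illustrative placeholders, not from the actual setup):

import os

import torch
from torch.utils.data import DataLoader, TensorDataset

cores = len(os.sched_getaffinity(0))  # CPU cores this process is actually allowed to use
print(f"cores available to this task: {cores}")  # 1 means data loading will bottleneck

dataset = TensorDataset(torch.randn(1024, 16))  # placeholder dataset
# Leave one core for the training loop; with a 1-core allocation this falls back
# to num_workers=0 (data is loaded in the main process).
loader = DataLoader(dataset, batch_size=32, num_workers=max(cores - 1, 0))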

Hidden in the sbatch documentation it says that the -c, --cpus-per-task=<ncpus> flag is ignored (not inherited by srun job steps); instead one needs to set the SRUN_CPUS_PER_TASK environment variable.

I achieved good performance and full GPU utilization with this script:

#!/usr/bin/bash -l

#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=a100:4
#SBATCH --mem=180G
#SBATCH --cpus-per-task=8
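# As noted above, srun here does not inherit --cpus-per-task from sbatch, so export it for the job step: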
export SRUN_CPUS_PER_TASK=8

conda activate lit
srun python myscript.py --slurm_id ${SLURM_JOB_ID} "$@"

stas00 (Contributor) commented Dec 31, 2023

That's an interesting discovery, @schlabrendorff - though I'm not sure this is always the case.

At least it doesn't seem to impact my setup. I checked: I'm using SLURM 22.05.09, and I use:

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --exclusive

and then I use srun torchrun (or another torchrun-like launcher), which launches 8 processes plus workers, and they all get at least 1 core each - note that I'm using ntasks-per-node=1. I don't repeat --cpus-per-task with srun, nor do I have SRUN_CPUS_PER_TASK set.

I wonder if this has to do with --exclusive somehow, so it gives me all the cores? In the srun manpage for this option it says:

... but job steps will be allocated all CPUs available to the job on all nodes allocated to the steps.

Oddly I don't see any cpus-per-task related entries in scontrol show config

Which SLURM version are you using? (srun --version).

And can you check if everything works if you go back to the previous setup and add --exclusive?

@schlabrendorff

My cluster runs SLURM 23.02.6, which might explain the difference!
I will try to experiment with the --exclusive flag, but I might not get the chance to do so soon.

stas00 (Contributor) commented Dec 31, 2023

Ah, that's a possibility: perhaps they planned the switch in 22.05 but didn't actually make it until 23.x.

When I get a chance I can try the reverse: removing --exclusive and checking that it still works. But it's good to prepare ahead of time regardless, so your insight is very helpful; thank you for sharing.

stas00 (Contributor) commented Dec 31, 2023

OK, I tested it: with or without --exclusive I get all the available cores in the setup above:

$ scontrol show -d job 3182_1 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

It'd be 1 and not 48 if the inheritance of --cpus-per-task from sbatch didn't work.

So it must be that the version I'm running is still not affected.

@schlabrendorff

I think the Request Defaults might also come into play in my scenario, since my cluster documentation specifies:

Request Defaults:
Unless specified, your jobs will run with the following options to salloc and sbatch options for this partition.
--time=01:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120

Could it be that your default is all (available?) CPUs?

stas00 (Contributor) commented Dec 31, 2023

How do I get the Request Defaults?

@schlabrendorff

They were listed on my university's cluster documentation page. I guess besides trying to read your /etc/slurm/slurm.conf, you could submit an empty job with absolutely no resources requested and check it with scontrol show job?

stas00 (Contributor) commented Dec 31, 2023

Great idea!

I did:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

and got:

$ scontrol show -d job 3185 | grep CPUs/Task
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

so the default is indeed 1.

If I do:

#SBATCH --job-name=x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00

srun --jobid $SLURM_JOB_ID bash -c "date; sleep 200"

I get 48

scontrol show -d job 3186 | grep CPUs/Task
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=48 ReqB:S:C:T=0:0:*:*

Bottom line: the "inheritance" still works in SLURM 22.05.09

stas00 (Contributor) commented Jan 1, 2024

I realized I got my diagnostics wrong: scontrol show -d job shows the sbatch/salloc settings; it doesn't know anything about srun.

Using len(os.sched_getaffinity(0)) should give us the correct diagnostics, as it shows which CPU cores the current process is eligible to use.

So here is the updated test:

$ cat test.slurm
#!/bin/bash
#SBATCH --job-name=test-cpu-cores-per-task
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --time=0:10:00
#SBATCH --partition=x
#SBATCH --output=%x-%j.out

srun --jobid $SLURM_JOB_ID python -c 'import os; print(f"cpu cores: {len(os.sched_getaffinity(0))}")'

gives:

cpu cores: 48

The previous comment's conclusion still holds.

I documented it here

@awaelchli (Member)

There was never a clear conclusion here, was there? Is there anything we need to act on from the Lightning side?

TeddLi commented Apr 22, 2024

> Without srun, Python failed to use multiple GPUs.

Just wondering what the final results were with srun and without srun. Does srun give worse results?
