Why does running Lightning on SLURM with python perform worse than with srun? #18650
Comments
I don't have an answer to your question, but I find the first way of launching very strange; I didn't know this worked and I am surprised it does. If you ever find out why, I'd be interested. But I hope our guide here is good. It's our recommended way to launch multinode jobs.
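For reference, a minimal sketch of the kind of sbatch script that launch guide describes (the node/GPU counts, environment setup, and script name here are placeholders, not taken from this thread):

#!/bin/bash
#SBATCH --nodes=2                # must match num_nodes passed to the Trainer
#SBATCH --ntasks-per-node=4      # one task per GPU, matching devices in the Trainer
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00
# activate your environment here, then let srun create one process per GPU
srun python train.py

On the Python side, Trainer(accelerator="gpu", devices=4, num_nodes=2) has to agree with those directives.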
@Muennighoff, what's inside …? The way you're launching it in many frameworks leads to DP and not DDP (but perhaps not in PTL), so you might not be comparing apples to apples. nb: @awaelchli has just fixed torchrun running under a SLURM env in #18618.
For single-node training on a SLURM-managed cluster, I usually do … @awaelchli, as an additional datapoint for the use case of launching inside an allocation: I am running jobs on multiple clusters, some using SLURM and some not. It's just easier if all jobs can be launched the same way, regardless of whether SLURM is available.
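A guess at what such a coding workaround can look like (my illustration with placeholder values, not necessarily what was used above): hand Lightning a non-SLURM cluster environment so it always spawns its own worker processes, making python train.py behave the same whether or not SLURM is present:

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import LightningEnvironment

# Override cluster detection so Lightning ignores SLURM's variables and
# launches its workers itself, just like on a local workstation.
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    plugins=[LightningEnvironment()],
)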
Thank you for sharing your insights, Konstantin. In general the problem is that https://github.com/NVIDIA/NeMo, which is what we use, abstracts all the PTL bits away, giving the user an API that doesn't have PTL exposed, so we only have a config file to make any customizations and thus such coding workarounds aren't available to us.
My bad, I didn't realize NVIDIA/NeMo has this limitation. Perhaps this will be useful to someone else who stumbles upon this issue due to problems with SLURM. I know it took me some time to figure out why things were working on some clusters but not on the ones using SLURM.
Don't want to diverge too much from the OP's issue, but just wanted to hint at the fact that if you run interactively on SLURM (i.e. scheduling a machine and then connecting to it), there is a simpler way than changing the code like you described: just set the job name to "interactive" and Lightning will know that the processes are not being launched externally, so you won't have to change the code and it will be just like running locally on a workstation. https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode (apologies if you see rendering issues in the docs, we are currently going through a couple of style template updates).
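A sketch of what that can look like in practice (the resource flags are placeholders, and the exact incantation depends on your cluster): request the interactive session with the job name set to "interactive", then run the script exactly as you would locally:

# open an interactive shell on a GPU node; the job name tells Lightning not to treat this as a SLURM-managed run
srun --nodes=1 --gres=gpu:4 --job-name=interactive --pty bash
python train.py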
I also find that srun speeds things up, but meanwhile the loss curve doesn't converge quickly. (The two wandb loss-curve figures are not reproduced here.) btw, these are the two launch scripts I compared:
Without srun:
#!/bin/bash
#SBATCH -J xx
#....
python 1_train.py

With srun:
#!/bin/bash
#SBATCH -J xx
#....
srun python 1_train.py
For me the bottleneck was that by default on my SLURM cluster a task only gets 1 CPU core. Hidden in the documentation of sbatch it says that srun does not inherit the --cpus-per-task value from sbatch. I achieved good performance and full GPU utilization with this script:
#!/usr/bin/bash -l
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus=a100:4
#SBATCH --mem=180G
#SBATCH --cpus-per-task=8
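# Make the --cpus-per-task value visible to the srun step below; without this,
# each task may be limited to a single CPU on newer SLURM versions.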
export SRUN_CPUS_PER_TASK=8
conda activate lit
srun python myscript.py --slurm_id ${SLURM_JOB_ID} "$@"
That's an interesting discovery, @schlabrendorff, though I'm not sure this is always the case. At least it doesn't seem to impact my setup. I checked: I'm using SLURM 22.05.09 and I use:
…
and then I use …
I wonder if this has to do with …
Oddly, I don't see any … Which SLURM version are you using? And can you check if everything works if you go back to the previous setup and add …?
My SLURM cluster runs on …
Ah, that's a possibility, since perhaps they had planned to switch in 22.05 but didn't do it until 23.x. When I get a chance I can try the reverse: removing …
OK, I tested that with and without …
It'd be … so it must be that the version I'm running is still not affected.
I think the Request Defaults might also come into play in my scenario, since my cluster documentation specifies:
…
Could be that your default is all (available?) CPUs?
How do I get these Request Defaults?
They were listed on my university's cluster documentation page. I guess besides trying to read your …
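If the cluster documentation doesn't spell it out, one way to inspect the configured defaults (a sketch; the relevant fields vary by site and SLURM version) is to query the controller:

scontrol show config | grep -i -E 'cpus|def'
scontrol show partition gpu    # per-partition defaults, assuming a partition named "gpu"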
Great idea! I did:
…
and got:
…
so the default is indeed 1. If I do:
…
I get 48.
Bottom line: the "inheritance" still works in SLURM 22.05.09.
I realized I got my diagnostics wrong: using … So here is the updated test:
…
which gives:
…
The previous comment's conclusion still holds. I documented it here.
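For anyone who wants to repeat this kind of check, a minimal sketch (not the exact commands documented above) that shows what a job step actually receives, assuming the cluster confines tasks with cgroups/affinity:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
# If srun inherits the sbatch value, both numbers report 48; if not, the step sees only 1 CPU.
srun bash -c 'echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"; nproc'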
There was never a clear conclusion here, was there? Is there anything we need to action on the Lightning side?
Just wondering what the final results are with srun and without srun. Does srun give worse results?
@stas00 Thank you for maintaining the ML-engineering guide! I noticed that you recommend setting --ntasks-per-node=1, whereas the Lightning documentation suggests --ntasks-per-node=8 (which corresponds to the number of GPUs per node). When I tried setting it to 1, I encountered an error with Lightning. Did you experience a similar issue during your implementation?
You're correct, @srmsoumya - this was a copy-n-paste error from …; thank you very much for the heads-up.
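For context on why the two numbers differ (my summary, not from the thread): a torchrun-style launch uses one task per node and lets the launcher fork one worker per GPU, whereas Lightning's SLURM integration expects srun to create one task per GPU. Roughly, with 8 GPUs per node as an example:

# torchrun style: one task per node, the launcher spawns the per-GPU workers
#SBATCH --ntasks-per-node=1
srun torchrun --nproc_per_node=8 train.py

# Lightning SLURMEnvironment style: one task per GPU
#SBATCH --ntasks-per-node=8
srun python train.py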
Bug description
I'm training LLMs across multiple GPUs on a single node using NVIDIA/NeMo. When launching via python train.py inside of an allocation, I get much worse performance than when launching directly via srun. In the first case PyTorch Lightning also raises the warning: "The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun". See below:
107 tflops (python train.py): …
172 tflops (srun python train.py): …
Why is it the case that the first one performs worse? Is there perhaps a difference in how these two strategies launch the torch.distributed process group? (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing)
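To make the suspected difference concrete (my illustration, not from the issue): when the script is started with srun, every task already carries its own rank information in the environment, which Lightning's SLURM integration reads instead of spawning processes itself; a bare python train.py is a single task, so Lightning has to fork the remaining workers on its own. For example, inside an allocation:

srun --ntasks-per-node=4 bash -c 'echo "rank=$SLURM_PROCID/$SLURM_NTASKS local_rank=$SLURM_LOCALID host=$(hostname)"'

prints one line per task, each with a distinct SLURM_PROCID.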
What version are you seeing the problem on?
v2.0
How to reproduce the bug
No response
Error messages and logs
No response
Environment
Current environment
More info
No response
cc @awaelchli