pytorch lightning causes slurm nodes to drain #15008
Comments
This is with multi-node training, 8 GPUs per node. Tried on 2, 3, 4, and 5 nodes, and all cases cause the issue. It feels like the more nodes we use, the higher the probability that one or more will go into the drain state.
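(Side note for anyone triaging: the drain reason can be confirmed from the slurm side with standard commands; node001 below is a placeholder node name.)

sinfo -R                                        # list drained/down nodes with the recorded reason, e.g. "Kill task failed"
scontrol show node node001 | grep -i reason     # inspect the reason for a single node
scontrol update NodeName=node001 State=RESUME   # return the node to service once it is healthy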
What about single node training? The OOM should occur there too, and is likely not related to SLURM. Maybe there is a memory leak in the model?
OOMing is not the problem; single node works fine. We figured it out by talking to the people who manage our cluster: they increased the timeout limit on slurm, and that fixed our issue. It can still take 5 minutes or more for a job to exit, though.
Is your script exiting normally? Can you check what happens at the end of training: are you saving big checkpoints? Can you check if it also takes 5 minutes on a single node?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Hi, I'm facing the very same problem. A buggy solution that (sometimes) works for me is to first stop the job with W&B, wait for a few moments, and then kill the slurm job with scancel.
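A variant of the same idea using only standard scancel flags (the job id is a placeholder; KillWait is 30 s by default but cluster-dependent):

scancel --signal=TERM 123456   # ask the job to shut down gracefully and give it a moment
scancel 123456                 # plain scancel then sends SIGTERM followed by SIGKILL after KillWait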
Just following up on this, as we've seen this intermittently at my institute as well. From the slurm side, if a job finishes normally and has enough buffer time after it's done to exit out cleanly, you won't see this issue. You will, however, see it more frequently in situations where people cancel a running job (via scancel). Unfortunately, in older slurm instances, setting UnkillableStepTimeout any higher than the default doesn't behave as expected; see the slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=11103
We have been having this issue as well. We upgraded to Slurm 21.08.8-2 a few months ago and raised UnkillableStepTimeout.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
@Jonathan-LeRoux sorry I missed this, we cranked our timeouts on that cluster way up: SlurmctldTimeout=600
SlurmdTimeout=300
UnkillableStepTimeout=1200
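For reference, what these knobs control, paraphrased from the slurm.conf documentation (the values are the ones quoted above, not recommendations):

# slurm.conf
SlurmctldTimeout=600        # how long the backup controller waits for the primary before taking over
SlurmdTimeout=300           # controller marks a node DOWN if its slurmd is unresponsive for this long
UnkillableStepTimeout=1200  # how long slurmd waits for a SIGKILL'd job step to exit before draining
                            # the node with reason "Kill task failed"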
Is there anything that Lightning could do? Since Lightning is not in control of the process creation (SLURM is), I'm not sure what action could be taken here. |
@awaelchli I'm not sure, honestly. The biggest thing is making sure that when a process gets a SIGKILL, it actually dies instead of holding on for minutes before finally finishing up and quitting. I mostly stumbled on this while trying to resolve that situation and thought people might find those details helpful. I'm not sure where you even sit 'in the stack', as it were, but if you do anything involving the signal handling and teardown processes, maybe there?
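For anyone experimenting on the user side: SIGKILL itself cannot be caught, so the practical lever is reacting to the preceding SIGTERM quickly enough that the later SIGKILL finds nothing left to kill. A minimal sketch (this may override Lightning's own SLURM signal handling, e.g. auto-requeue, so treat it as a debugging aid rather than a fix):

import signal
import sys

def terminate_now(signum, frame):
    # Skip slow teardown (checkpoint flushes, logger shutdown, NCCL cleanup)
    # and exit immediately, so slurmd never reaches UnkillableStepTimeout.
    sys.exit(1)

signal.signal(signal.SIGTERM, terminate_now)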
With the timeout settings above, we still have issues once in a while. I will try to increase UnkillableStepTimeout further.
Bug description
Hello! When I train with the DDP strategy, any type of crash, like an Out Of Memory (OOM) error or an scancel of the slurm job, results in slurm nodes draining due to "Kill task failed", which means that the pytorch lightning processes running on these nodes failed to clean up after termination. I was wondering how I could fix this?
How to reproduce the bug
Error messages and logs
Environment
More info
No response
cc @awaelchli