
pytorch lightning causes slurm nodes to drain #15008

Open
meshghi opened this issue Oct 5, 2022 · 13 comments
Labels
environment: slurm, question (Further information is requested)

Comments

@meshghi

meshghi commented Oct 5, 2022

Bug description

Hello! When I train with the DDP strategy, any kind of crash, such as an Out Of Memory (OOM) error or a scancel of the slurm job, causes the slurm nodes to drain with the reason "Kill task failed", which means the PyTorch Lightning processes running on those nodes failed to clean up after termination. I was wondering how I could fix this?

How to reproduce the bug

import pytorch_lightning as pl
from pytorch_lightning import callbacks, loggers
from pytorch_lightning.strategies import DDPStrategy

# lr_monitor, checkpoint_callback, swa_ensemble, PrintCallback, slurm_job_id,
# _cfg, DataModule, and LitModelModule are defined elsewhere in the project.
callbacks_list = [
    lr_monitor,
    checkpoint_callback,
    swa_ensemble,
    PrintCallback(),
    callbacks.ModelSummary(),
    callbacks.DeviceStatsMonitor(cpu_stats=True),
]

logger = [
    loggers.tensorboard.TensorBoardLogger(save_dir="./logs", version=slurm_job_id),
]

trainer = pl.Trainer(
    gpus=_cfg.slurm_job.gpus_per_node,
    num_nodes=_cfg.slurm_job.number_of_nodes,
    accelerator="gpu",
    strategy=DDPStrategy(find_unused_parameters=False),
    plugins=pl.plugins.environments.SLURMEnvironment(auto_requeue=False),
    max_epochs=_cfg.training.fit.epochs,
    callbacks=callbacks_list,
    logger=logger,
    accumulate_grad_batches=_cfg.training.fit.accumulation_steps,
    profiler="simple",
)

data_module = DataModule(_cfg)
model_module = LitModelModule(_cfg)

trainer.fit(
    model=model_module,
    datamodule=data_module,
    ckpt_path=None,
)

Error messages and logs


# Error messages and logs here please

Environment


- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, LightningModule
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.7
- PyTorch Version (e.g., 1.10): 1.12.0
- Python version (e.g., 3.9): 3.9.12
- OS (e.g., Linux): Linux (RHEL 7.4)
- CUDA/cuDNN version: 11.7
- GPU models and configuration: RTX 5000s
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @awaelchli

meshghi added the needs triage (Waiting to be triaged by maintainers) label on Oct 5, 2022
@bhaktatejas922

This is with multi-node training, 8 GPUs per node. We tried 2, 3, 4, and 5 nodes, and all cases trigger the issue. It feels like the more nodes we use, the higher the probability that one or more of them will go into the drain state.

@awaelchli
Member

What about single node training? The OOM should occur there too, and is likely not related to SLURM. Maybe there is a memory leak in the model?

awaelchli added the question (Further information is requested) and environment: slurm labels, and removed the needs triage (Waiting to be triaged by maintainers) label, on Oct 9, 2022
@bhaktatejas922

OOMing is not the problem; single-node training works fine. We figured it out by talking to the people who manage our cluster: they increased the timeout limit on slurm, and that fixed our issue. Sometimes it will take 5 minutes or more for a job to exit, though.

@awaelchli
Member

Sometimes it will take 5 minutes or more for a job to exit

Is your script exiting normally? Can you check what happens at the end of training; are you saving big checkpoints? Can you also check whether it takes 5 minutes on a single node?
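
One way to narrow down where those minutes go is to timestamp the moment trainer.fit() returns and compare it against interpreter shutdown. A minimal sketch, assuming the trainer, model_module, and data_module objects from the reproduction snippet above; _log_shutdown_time is a hypothetical helper, not part of Lightning:

import atexit
import time

_fit_done_at = None

def _log_shutdown_time():
    # Runs at interpreter shutdown: a large gap here points at teardown
    # (checkpoint I/O, logger flushing, DDP/NCCL cleanup), not training.
    if _fit_done_at is not None:
        print(f"process exited {time.monotonic() - _fit_done_at:.1f}s after fit() returned")

atexit.register(_log_shutdown_time)

trainer.fit(model=model_module, datamodule=data_module)
_fit_done_at = time.monotonic()

Running the same check single-node versus multi-node would show whether the slow exit is specific to the multi-node DDP teardown.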

@stale

stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Nov 13, 2022
@ChristophReich1996

ChristophReich1996 commented Nov 16, 2022

Hi, I'm facing the very same problem. A flaky workaround that (sometimes) works for me is to first stop the run from W&B, wait a few moments, and then kill the slurm job with scancel. I also expect that increasing the kill timeout around scancel might be a solution here; any updates on this?

stale bot removed the won't fix (This will not be worked on) label on Nov 16, 2022
@Wildcarde

Just following up on this, as we've seen it intermittently at my institute as well. From the slurm side: if a job finishes normally and has enough buffer time to exit cleanly, you won't see this issue. You will, however, see it more frequently when someone cancels a running job (via scancel), when a job is killed for exceeding its allowed memory (OOM in slurm means the allocation was exceeded, not necessarily that the system is out of memory), or when a job runs to the end of its time limit and a kill is issued; in those cases the job hits the UnkillableStepTimeout wall. That is the maximum time allowed between the scheduler issuing a SIGKILL and the job actually quitting; if the job takes longer, the node is drained because the scheduler assumes something is wrong with it.

Unfortunately, in older slurm versions, setting this value any higher than 126 would both not work and cause other resource contention issues. This has since been fixed, so timeouts can be bumped up now, but you need to be running a newer version of slurm to do so.

slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=11103
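
As a quick check from the cluster side, the drain reason slurm recorded (e.g. "Kill task failed") can be read back with sinfo -R. A minimal wrapper, purely illustrative and not something from the bug report above:

import subprocess

def drained_node_reasons() -> str:
    # "sinfo -R" lists nodes that are down or drained together with the
    # REASON slurm recorded -- for this issue it shows "Kill task failed".
    result = subprocess.run(["sinfo", "-R"], capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(drained_node_reasons())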

@Jonathan-LeRoux

We have been having this issue as well. We upgraded to Slurm 21.08.8-2 a few months ago and set UnkillableStepTimeout=180, but it didn't help. We also tried configuring an UnkillableStepProgram that should have spat out some information, but it never ran.
@Wildcarde: what worked for you?

@stale

stale bot commented Jun 18, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Jun 18, 2023
@Wildcarde

@Jonathan-LeRoux Sorry I missed this; we cranked the timeouts on that cluster way up:

SlurmctldTimeout=600
SlurmdTimeout=300
UnkillableStepTimeout=1200

@awaelchli
Member

Is there anything that Lightning could do? Since Lightning is not in control of the process creation (SLURM is), I'm not sure what action could be taken here.

stale bot removed the won't fix (This will not be worked on) label on Aug 14, 2023
@Wildcarde

@awaelchli I'm not sure, honestly. The biggest thing is making sure that when a process gets a SIGKILL it actually dies, instead of holding on for minutes before finally finishing up and quitting. I mostly stumbled on this issue while trying to resolve that situation and thought people might find those details helpful. I'm not sure where you even sit 'in the stack', as it were, but if you do anything involving signal handling and the teardown process, maybe there?
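
One thing a training script can do on its own, independent of whatever Lightning does internally, is install a handler for the signal slurm sends before escalating to SIGKILL (typically SIGTERM) and tear the process group down explicitly. A rough sketch under that assumption; it is not a confirmed fix for the "Kill task failed" drains, since a process stuck in uninterruptible I/O will not run any Python handler:

import signal
import sys

import torch.distributed as dist

def _terminate(signum, frame):
    # Best-effort cleanup before the scheduler escalates to SIGKILL:
    # close the NCCL/Gloo process group so ranks don't linger.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(128 + signum)

# slurm typically sends SIGTERM (or whatever --signal requests) before SIGKILL.
signal.signal(signal.SIGTERM, _terminate)
signal.signal(signal.SIGINT, _terminate)

Whether this helps depends on why a rank refuses to die; it only covers the case where the Python process is still responsive when the scheduler starts killing it.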

@Jonathan-LeRoux

With the following settings, we still have issues once in a while:

SlurmctldTimeout=300
SlurmdTimeout=300
ResumeTimeout=1800
UnkillableStepTimeout=180

I will try to increase SlurmctldTimeout and UnkillableStepTimeout.
