
pytorch lightning causes slurm nodes to drain #15008

Open
meshghi opened this issue Oct 5, 2022 · 13 comments
Labels
environment: slurm, question (Further information is requested)

Comments

@meshghi

meshghi commented Oct 5, 2022

Bug description

Hello! When I train with the DDP strategy, any kind of crash, such as an Out Of Memory (OOM) error or a scancel of the slurm job, causes the slurm nodes to drain with the reason "Kill task failed", which means the PyTorch Lightning processes running on those nodes failed to clean up after termination. I was wondering how I could fix this?

How to reproduce the bug

import pytorch_lightning as pl
from pytorch_lightning import callbacks, loggers
from pytorch_lightning.strategies import DDPStrategy

# lr_monitor, checkpoint_callback, swa_ensemble, PrintCallback, slurm_job_id,
# _cfg, DataModule, and LitModelModule are defined elsewhere in the project.
callbacks_list = [
    lr_monitor,
    checkpoint_callback,
    swa_ensemble,
    PrintCallback(),
    callbacks.ModelSummary(),
    callbacks.DeviceStatsMonitor(cpu_stats=True),
]

logger = [
    loggers.tensorboard.TensorBoardLogger(save_dir="./logs", version=slurm_job_id),
]

trainer = pl.Trainer(
    gpus=_cfg.slurm_job.gpus_per_node,
    num_nodes=_cfg.slurm_job.number_of_nodes,
    accelerator="gpu",
    strategy=DDPStrategy(find_unused_parameters=False),
    plugins=pl.plugins.environments.SLURMEnvironment(auto_requeue=False),
    max_epochs=_cfg.training.fit.epochs,
    callbacks=callbacks_list,
    logger=logger,
    accumulate_grad_batches=_cfg.training.fit.accumulation_steps,
    profiler="simple",
)

data_module = DataModule(_cfg)
model_module = LitModelModule(_cfg)

trainer.fit(
    model=model_module,
    datamodule=data_module,
    ckpt_path=None,
)

Error messages and logs


# Error messages and logs here please

Environment


- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer, LightningModule
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.7
- PyTorch Version (e.g., 1.10): 1.12.0
- Python version (e.g., 3.9): 3.9.12
- OS (e.g., Linux): Linux (RHEL 7.4)
- CUDA/cuDNN version: 11.7
- GPU models and configuration: RTX 5000s
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

cc @awaelchli

meshghi added the needs triage (Waiting to be triaged by maintainers) label on Oct 5, 2022
@bhaktatejas922

This is with multi-node training, 8 GPUs per node. We tried 2, 3, 4, and 5 nodes, and all cases trigger the issue. It feels like the more nodes we use, the higher the probability that one or more of them will go into the drain state.

@awaelchli
Member

What about single node training? The OOM should occur there too, and is likely not related to SLURM. Maybe there is a memory leak in the model?

awaelchli added the question (Further information is requested) and environment: slurm labels, and removed the needs triage (Waiting to be triaged by maintainers) label, on Oct 9, 2022
@bhaktatejas922

OOMing is not the problem; single-node training works fine. We figured it out by talking to the people who manage our cluster: they increased the timeout limit on slurm, and that fixed our issue. Sometimes it will take 5 minutes or more for a job to exit, though.

@awaelchli
Member

Sometimes it will take 5 minutes or more for a job to exit

Is your script exiting normally? Can you check what happens at the end of training; are you saving big checkpoints? Can you also check whether it takes 5 minutes on a single node?
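
One way to narrow down where those minutes go is to timestamp the moment trainer.fit() returns and compare it against interpreter shutdown. A minimal sketch, assuming the trainer, model_module, and data_module objects from the reproduction snippet above; _log_shutdown_time is a hypothetical helper, not part of Lightning:

import atexit
import time

_fit_done_at = None

def _log_shutdown_time():
    # Runs at interpreter shutdown: a large gap here points at teardown
    # (checkpoint I/O, logger flushing, DDP/NCCL cleanup), not training.
    if _fit_done_at is not None:
        print(f"process exited {time.monotonic() - _fit_done_at:.1f}s after fit() returned")

atexit.register(_log_shutdown_time)

trainer.fit(model=model_module, datamodule=data_module)
_fit_done_at = time.monotonic()

Running the same check single-node versus multi-node would show whether the slow exit is specific to the multi-node DDP teardown.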

@stale

stale bot commented Nov 13, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Nov 13, 2022
@ChristophReich1996

ChristophReich1996 commented Nov 16, 2022

Hi, I'm facing the very same problem. A flaky workaround that (sometimes) works for me is to first stop the run from W&B, wait a few moments, and then kill the slurm job with scancel. I also expect that increasing the kill timeout around scancel might be a solution here; any updates on this?

stale bot removed the won't fix (This will not be worked on) label on Nov 16, 2022
@Wildcarde

Just following up on this, as we've seen it intermittently at my institute as well. From the slurm side: if a job finishes normally and has enough buffer time to exit cleanly, you won't see this issue. You will, however, see it more frequently when someone cancels a running job (via scancel), when a job is killed for exceeding its allowed memory (OOM in slurm means the allocation was exceeded, not necessarily that the system is out of memory), or when a job runs to the end of its time limit and a kill is issued; in those cases the job hits the UnkillableStepTimeout wall. That is the maximum time allowed between the scheduler issuing a SIGKILL and the job actually quitting; if the job takes longer, the node is drained because the scheduler assumes something is wrong with it.

Unfortunately, in older slurm versions, setting this value any higher than 126 would both not work and cause other resource contention issues. This has since been fixed, so timeouts can be bumped up now, but you need to be running a newer version of slurm to do so.

slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=11103
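
As a quick check from the cluster side, the drain reason slurm recorded (e.g. "Kill task failed") can be read back with sinfo -R. A minimal wrapper, purely illustrative and not something from the bug report above:

import subprocess

def drained_node_reasons() -> str:
    # "sinfo -R" lists nodes that are down or drained together with the
    # REASON slurm recorded -- for this issue it shows "Kill task failed".
    result = subprocess.run(["sinfo", "-R"], capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(drained_node_reasons())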

@Jonathan-LeRoux

We have been having this issue as well. We upgraded to Slurm 21.08.8-2 a few months ago and set UnkillableStepTimeout=180, but it didn't help. We also tried configuring an UnkillableStepProgram that should have spat out some information, but it never ran.
@Wildcarde: what worked for you?

@stale

stale bot commented Jun 18, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Jun 18, 2023
@Wildcarde

@Jonathan-LeRoux Sorry I missed this; we cranked the timeouts on that cluster way up:

SlurmctldTimeout=600
SlurmdTimeout=300
UnkillableStepTimeout=1200

@awaelchli
Member

Is there anything that Lightning could do? Since Lightning is not in control of the process creation (SLURM is), I'm not sure what action could be taken here.

stale bot removed the won't fix (This will not be worked on) label on Aug 14, 2023
@Wildcarde

@awaelchli I'm not sure, honestly. The biggest thing is making sure that when a process gets a SIGKILL it actually dies, instead of holding on for minutes before finally finishing up and quitting. I mostly stumbled on this issue while trying to resolve that situation and thought people might find those details helpful. I'm not sure where you even sit 'in the stack', as it were, but if you do anything involving signal handling and the teardown process, maybe there?
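
One thing a training script can do on its own, independent of whatever Lightning does internally, is install a handler for the signal slurm sends before escalating to SIGKILL (typically SIGTERM) and tear the process group down explicitly. A rough sketch under that assumption; it is not a confirmed fix for the "Kill task failed" drains, since a process stuck in uninterruptible I/O will not run any Python handler:

import signal
import sys

import torch.distributed as dist

def _terminate(signum, frame):
    # Best-effort cleanup before the scheduler escalates to SIGKILL:
    # close the NCCL/Gloo process group so ranks don't linger.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(128 + signum)

# slurm typically sends SIGTERM (or whatever --signal requests) before SIGKILL.
signal.signal(signal.SIGTERM, _terminate)
signal.signal(signal.SIGINT, _terminate)

Whether this helps depends on why a rank refuses to die; it only covers the case where the Python process is still responsive when the scheduler starts killing it.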

@Jonathan-LeRoux

With the following settings, we still have issues once in a while:

SlurmctldTimeout=300
SlurmdTimeout=300
ResumeTimeout=1800
UnkillableStepTimeout=180

I will try to increase SlurmctldTimeout and UnkillableStepTimeout.
