
How to disable automatic SLURM detection / signal handling? #5225

Closed
jbohnslav opened this issue Dec 21, 2020 · 4 comments
Labels
environment: slurm · feature (Is an improvement or enhancement) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

@jbohnslav

❓ Questions and Help

What is your question?

I'm running single-GPU jobs on a SLURM cluster. PyTorch Lightning uses environment variables to detect that I'm on SLURM and automatically intercepts SIGTERM signals. However, when I'm debugging I don't want SIGTERM to be bypassed; I need to know where the signal is originating.

I can't seem to tell PyTorch Lightning not to use the SLURM handler, because SLURM is detected automatically from environment variables. Is there any way to opt out of PL's default SLURM connector / SIGTERM bypass?
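For reference, a quick way to see which SLURM variables are visible to the process (a minimal sketch; which of these keys Lightning actually checks is an assumption here, SLURM_NTASKS and SLURM_JOB_NAME come up later in this thread):

import os

# List every SLURM-related environment variable visible to this process.
# If variables such as SLURM_NTASKS / SLURM_JOB_NAME are set, Lightning's
# auto-detection will treat the run as a SLURM job.
slurm_vars = {k: v for k, v in os.environ.items() if k.startswith("SLURM_")}
for key, value in sorted(slurm_vars.items()):
    print(f"{key}={value}")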

  • OS: Linux
  • Packaging: pip
  • Version: 1.1.1
@jbohnslav added the question (Further information is requested) label Dec 21, 2020
@tchaton
Contributor

tchaton commented Dec 23, 2020

Hey @williamFalcon, any idea on this one?

@tchaton added this to the 1.2 milestone Dec 23, 2020
@tchaton added the priority: 1 (Medium priority task) label Dec 23, 2020
@awaelchli
Member

awaelchli commented Dec 23, 2020

@jbohnslav a quick hack could be to delete the SLURM env variables like so:

import os

# Remove the variables Lightning uses to detect a SLURM allocation
# (pop avoids a KeyError if a variable happens to be unset).
os.environ.pop("SLURM_NTASKS", None)
os.environ.pop("SLURM_JOB_NAME", None)

at the beginning of the script; Lightning will then not detect the run as SLURM.

Another way could be to grab the original signal handler before trainer init, and then restore it on fit start, overwriting the one the trainer has set (see the callback sketch below):

import signal

# Before the Trainer is instantiated: remember the current SIGTERM handler.
original_handler = signal.getsignal(signal.SIGTERM)

# In the on_fit_start hook: restore it, overwriting Lightning's SLURM handler.
signal.signal(signal.SIGTERM, original_handler)
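A minimal sketch of wiring the restore into a Callback, assuming the Callback.on_fit_start(trainer, pl_module) hook signature of Lightning 1.x (class and variable names here are illustrative):

import signal
import pytorch_lightning as pl

class RestoreSigtermHandler(pl.Callback):
    """Restore the SIGTERM handler captured before the Trainer was created."""

    def __init__(self, handler):
        self.handler = handler

    def on_fit_start(self, trainer, pl_module):
        # Overwrite the handler Lightning's SLURM connector installed.
        signal.signal(signal.SIGTERM, self.handler)

original_handler = signal.getsignal(signal.SIGTERM)
trainer = pl.Trainer(callbacks=[RestoreSigtermHandler(original_handler)])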

(A while back I had a feature PR #3632 that added a configurable way to register signals.)

@edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@edenlightning added the feature (Is an improvement or enhancement) label and removed the question (Further information is requested) label Feb 9, 2021
@edenlightning removed this from the v1.3 milestone Apr 27, 2021
@YannDubs

YannDubs commented May 18, 2021

I just had a similar issue of wanting to deactivate SLURM detection (I use another library to deal with that, and PyTorch Lightning is only one small component of my code).

Given that deactivating SLURM detection seems to be a recurring use case (see also #6204 and #6389), I think there should really just be a flag on the Trainer (e.g. disable_hpc_detection, as suggested in #6389 (comment)).

@stale

stale bot commented Jun 17, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
