
How to disable automatic SLURM detection / signal handling? #5225

Closed
jbohnslav opened this issue Dec 21, 2020 · 4 comments
Labels
environment: slurm · feature (Is an improvement or enhancement) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

@jbohnslav

❓ Questions and Help

What is your question?

I'm running single-GPU jobs on a SLURM cluster. PyTorch Lightning uses environment variables to detect that I'm on SLURM and automatically intercepts SIGTERM signals. However, when I'm debugging I don't want SIGTERM to be bypassed; I need to know where the signal is originating.

I can't seem to tell PyTorch Lightning not to use the SLURM handler, because SLURM is detected automatically from environment variables. Is there any way to opt out of PL's default SLURM connector / SIGTERM bypass?
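For reference, a quick way to see which SLURM variables are visible to the process (a minimal sketch; which of these keys Lightning actually checks is an assumption here, SLURM_NTASKS and SLURM_JOB_NAME come up later in this thread):

import os

# List every SLURM-related environment variable visible to this process.
# If variables such as SLURM_NTASKS / SLURM_JOB_NAME are set, Lightning's
# auto-detection will treat the run as a SLURM job.
slurm_vars = {k: v for k, v in os.environ.items() if k.startswith("SLURM_")}
for key, value in sorted(slurm_vars.items()):
    print(f"{key}={value}")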

  • OS: Linux
  • Packaging: pip
  • Version: 1.1.1
@jbohnslav added the question (Further information is requested) label Dec 21, 2020
@tchaton
Contributor

tchaton commented Dec 23, 2020

Hey @williamFalcon, any idea on this one?

@tchaton added this to the 1.2 milestone Dec 23, 2020
@tchaton added the priority: 1 (Medium priority task) label Dec 23, 2020
@awaelchli
Member

awaelchli commented Dec 23, 2020

@jbohnslav a quick hack could be to delete the SLURM env variables like so:

import os

# Remove the variables Lightning uses to detect a SLURM allocation
# (pop avoids a KeyError if a variable happens to be unset).
os.environ.pop("SLURM_NTASKS", None)
os.environ.pop("SLURM_JOB_NAME", None)

at the beginning of the script; Lightning will then not detect the run as SLURM.

Another way could be to grab the original signal handler before trainer init, and then restore it on fit start, overwriting the one the trainer has set (see the callback sketch below):

import signal

# Before the Trainer is instantiated: remember the current SIGTERM handler.
original_handler = signal.getsignal(signal.SIGTERM)

# In the on_fit_start hook: restore it, overwriting Lightning's SLURM handler.
signal.signal(signal.SIGTERM, original_handler)
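A minimal sketch of wiring the restore into a Callback, assuming the Callback.on_fit_start(trainer, pl_module) hook signature of Lightning 1.x (class and variable names here are illustrative):

import signal
import pytorch_lightning as pl

class RestoreSigtermHandler(pl.Callback):
    """Restore the SIGTERM handler captured before the Trainer was created."""

    def __init__(self, handler):
        self.handler = handler

    def on_fit_start(self, trainer, pl_module):
        # Overwrite the handler Lightning's SLURM connector installed.
        signal.signal(signal.SIGTERM, self.handler)

original_handler = signal.getsignal(signal.SIGTERM)
trainer = pl.Trainer(callbacks=[RestoreSigtermHandler(original_handler)])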

(A while back I had a feature PR #3632 that added a configurable way to register signals.)

@edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@edenlightning added the feature (Is an improvement or enhancement) label and removed the question (Further information is requested) label Feb 9, 2021
@edenlightning removed this from the v1.3 milestone Apr 27, 2021
@YannDubs

YannDubs commented May 18, 2021

I just had a similar issue of wanting to deactivate SLURM detection (I use another library to deal with that, and PyTorch Lightning is only one small component of my code).

Given that deactivating SLURM detection seems to be a recurring use case (see also #6204 and #6389), I think there should really just be a flag on the Trainer (e.g. disable_hpc_detection, as suggested in #6389 (comment)).

@stale

stale bot commented Jun 17, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
