Job submitted via SGE scheduler hangs until walltime #3542
Comments
The way this is meant to work is that the Parsl scaling code, the same code that submits the batch job, also cancels the batch job at exit. That is what kills the process worker pools; the pools do not exit by themselves. You need to shut down Parsl for that to happen, though. This used to happen automatically when the workflow script exited, but modern Python is increasingly hostile to doing complicated things at interpreter shutdown, so that behaviour was removed in PR #3165. You can use Parsl as a context manager like this:
and when the That's the point at which batch jobs should be cancelled. You should see that happen in If you are still getting leftover batch jobs even with
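The cancel-on-exit behaviour described in this comment can be modelled without Parsl at all. The following is a minimal stdlib-only sketch, not Parsl's implementation: `ToyBatchJob` and `run_workflow` are hypothetical names, and the "batch job" is just a local child process that would idle for an hour unless something cancels it. In Parsl itself the analogous shape is a `with parsl.load(config):` block, where `config` is your own configuration object.

```python
import subprocess
import sys


class ToyBatchJob:
    """Hypothetical stand-in for a batch job hosting a worker pool."""

    def __enter__(self):
        # "Submit" the job: a child process that never finishes on its own,
        # much like a worker pool waiting for more tasks.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(3600)"]
        )
        return self

    def __exit__(self, exc_type, exc, tb):
        # "Cancel" the job when the with-block ends, as a scheduler's
        # cancel command would. Without this, the child idles on.
        self.proc.terminate()
        self.proc.wait()
        return False  # do not swallow exceptions from the block


def run_workflow():
    with ToyBatchJob() as job:
        # App work would go here; the worker is still alive at this point.
        alive_inside = job.proc.poll() is None
    # Leaving the block has cancelled the job, so a return code is now set.
    return alive_inside, job.proc.returncode


if __name__ == "__main__":
    print(run_workflow())
```

The point of the sketch is that the worker never decides to stop by itself: it ends only because the code that started it also cancels it when the block exits.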
Sorry, I was only able to try this now, and I can confirm that using the context manager does indeed do the trick for me, thanks! The only thing I noticed is that, even if the bash/python app itself is successful, the job ends with exit code 137 (= 128 + 9, and 9 is SIGKILL), but perhaps that's expected because the job is killed by Parsl? The Parsl script terminates with 0 as expected. One other comment: I'm not sure the documentation makes it clear that the context manager is necessary in this case. I can't pinpoint exactly which sections I was looking at though, as it was a couple of months ago now.
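The 128 + N arithmetic mentioned above follows the common POSIX shell convention for a process killed by signal N. Here is a small sketch of decoding such a status, assuming the scheduler reports exit codes using that convention; `describe_exit_status` is a hypothetical helper, not part of Parsl or SGE.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the shell's 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            # Map the number back to a name such as SIGKILL or SIGTERM.
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"


if __name__ == "__main__":
    for status in (0, 137, 143):
        print(status, "->", describe_exit_status(status))
```

On Linux, 137 decodes to SIGKILL (9) and 143 to SIGTERM (15), so 137 means the job was killed forcibly rather than asked to terminate.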
The job should be terminated by qdel. I'd usually expect something more like a SIGTERM there for batch systems in general, but I don't know exactly what's happening in your situation.
The context manager is pretty much always necessary now (due to ongoing changes in how exit/shutdown is handled in Python itself), but because this is new, a lot of documentation doesn't talk about it. If you see any documentation that does a
Describe the bug
I have a pipeline for an SGE-based cluster which looks roughly like
The bash app works fine as far as I can tell (as does the more complicated one I'm actually using; I'm showing echo hello world here just for simplicity), but the problem is that the job never finishes and is only killed by the scheduler when the requested walltime is reached.
The submit job script looks like
I can't spot anything wrong with the job script options. My understanding is that process_worker_pool.py never finishes and wait $PID waits forever. I also don't know if this is really specific to SGE; this is just where I'm experiencing the issue.
To Reproduce
Steps to reproduce the behavior, for e.g:
Expected behavior
Ideally the job would finish when the app work is done rather than running until the walltime, which may be set conservatively large. It's a waste of resources to keep a node busy doing exactly nothing.
Environment
Distributed Environment