Stop polling all jobs in Slurm to find the done ones #4431

Closed
adamnovak opened this issue Mar 30, 2023 · 1 comment · Fixed by #4471
adamnovak commented Mar 30, 2023

We should redesign the Slurm batch system so that it can ask Slurm for just the finished jobs. Then we wouldn't need to do O(n) work to find one finished job when n jobs are running, and finding n finished jobs wouldn't cost O(n^2) work.

Originally posted by @adamnovak in #2323 (comment)

As described in the linked issue, this is limiting the maximum number of running jobs to below what the cluster can support: at a certain point we spend so much time polling jobs that we have no time left to handle the finished ones or submit new ones.
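One direction for this (a sketch only, not Toil's actual batch system code) is to keep a set of submitted job IDs and make a single `sacct` call filtered by terminal states, so Slurm itself returns only the finished jobs. The `sacct` flags below (`--jobs`, `--state`, `--parsable2`, `--noheader`, `--format`) are standard Slurm options; the helper names are hypothetical:

```python
import subprocess

# Terminal Slurm state codes: COMPLETED, FAILED, CANCELLED, TIMEOUT,
# OUT_OF_MEMORY.
TERMINAL_STATES = "CD,F,CA,TO,OOM"


def parse_finished(sacct_output: str, tracked_ids: set) -> dict:
    """Parse `sacct --parsable2 --noheader --format=JobID,State` output,
    returning {job_id: state} for tracked jobs."""
    finished = {}
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        # sacct also emits rows for job steps like "123.batch"; skip those.
        if "." in job_id:
            continue
        # CANCELLED may appear as "CANCELLED by <uid>"; keep the first word.
        state = state.split()[0] if state else state
        if job_id in tracked_ids:
            finished[job_id] = state
    return finished


def poll_finished(tracked_ids: set) -> dict:
    """One query proportional to the number of finished jobs, instead of
    one query (or one scan) per tracked job."""
    out = subprocess.run(
        ["sacct", "--noheader", "--parsable2", "--format=JobID,State",
         "--jobs", ",".join(sorted(tracked_ids)),
         "--state", TERMINAL_STATES],
        capture_output=True, text=True, check=True).stdout
    return parse_finished(out, tracked_ids)
```

The batch system would then remove the returned IDs from its tracked set, so each poll only pays for jobs that actually finished since the last one.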

Issue is synchronized with this Jira Story
Issue Number: TOIL-1317

@michaelkarlcoleman

It'd be pretty hacky, but one could imagine mostly removing SLURM polling from the loop. For example, perhaps have all jobs append an "I'm done" line to a common log, which could itself be tailed. You could still poll SLURM directly every few minutes to look for stuck jobs.
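A minimal sketch of that idea, with hypothetical names: each finishing job appends a `done <jobid>` line to a shared log, and the leader remembers its byte offset between polls, so each check reads only the newly appended lines rather than scanning all jobs:

```python
import os


class DoneLogTailer:
    """Incrementally read 'done <jobid>' lines appended to a shared log.

    Each poll() resumes from the previous offset, so the cost is
    proportional to newly finished jobs, not to all running jobs.
    """

    def __init__(self, path: str):
        self.path = path
        self.offset = 0  # byte offset of the next unread data

    def poll(self) -> list:
        """Return job IDs recorded as done since the previous poll."""
        done = []
        if not os.path.exists(self.path):
            return done
        with open(self.path, "rb") as log:
            log.seek(self.offset)
            chunk = log.read().decode("utf-8")
        # Only consume complete lines; a partial trailing line from an
        # in-progress append is left for the next poll.
        complete, sep, _partial = chunk.rpartition("\n")
        if not sep:
            return done
        self.offset += len(complete.encode("utf-8")) + 1
        for line in complete.splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[0] == "done":
                done.append(parts[1])
        return done
```

Handling partial lines matters because an appending job may be mid-write when the leader polls; stopping at the last newline keeps each record intact. The separate slow poll of SLURM for stuck jobs would catch jobs that die without ever writing their line.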
