cluster-status should use squeue --iterate #81

Closed
holtgrewe opened this issue Feb 11, 2022 · 3 comments

@holtgrewe

I'm a long-term Snakemake user who started out with SGE and DRMAA (a standard actually driven forward by the SGE vendor). We switched to Slurm quite some time ago and it works fine with DRMAA, but there is one problem: our controller is drowning in an "RPC storm". What is an RPC storm? Many users performing queries such as squeue repetitively, e.g. as watch squeue. Something similar is true for sacct, but that hits the slurmdbd rather than the controller.

Why does this matter, you might wonder? Here is why: the cluster-status script uses sacct internally, as you already know. You probably use sacct because Slurm jobs are no longer visible in squeue once more than MinJobAge has passed since they finished.

So, to summarize up to here:

  • watch squeue is an antipattern in Slurm
  • the controller answers queries faster than the Slurm database daemon (slurmdbd)

Now, what do the wonderful folks at SchedMD recommend? Using the -i/--iterate option. This makes one RPC call and keeps it open; the controller then happily prints the queue to you with almost no extra load on it. Choosing a small enough -i value will also ensure that Snakemake sees all status updates.

Example output below.

Fri Feb 11 16:32:19 2022
JOBID,STATE
122577,PENDING
98589,PENDING
98588,PENDING
98587,PENDING

Fri Feb 11 16:32:24 2022
JOBID,STATE
122577,PENDING
98589,PENDING
98588,PENDING
98587,PENDING

What actionable steps do I propose, you might ask?

Good question. What we would need is to:

  • launch an squeue --me -i 10 --format='%i,%T' at startup in a background thread
  • have the thread parse through the output and memoize the job ids in a dict
  • have cluster-status query that dict

One way to implement this is to have the Slurm profile actually serve the cached values through a micro REST API.
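To make the proposal concrete, here is a minimal sketch of the cached-squeue idea, assuming Python; the names JOB_STATES, start_sidecar and job_status are hypothetical and this is not the actual profile code:

```python
# Sketch of the proposed sidecar: a single long-lived `squeue --iterate`
# process feeds a background thread that keeps a job-id -> state dict fresh.
import subprocess
import threading

JOB_STATES = {}            # job id -> state, shared cache
_LOCK = threading.Lock()


def _poll_squeue(interval=10):
    """Parse the repeating output of one squeue process started with -i."""
    proc = subprocess.Popen(
        ["squeue", "--me", "-i", str(interval), "--format=%i,%T"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        line = line.strip()
        # Skip blank lines, the timestamp lines, and the repeated header.
        if not line or "," not in line or line.startswith("JOBID"):
            continue
        jobid, state = line.split(",", 1)
        with _LOCK:
            JOB_STATES[jobid] = state


def start_sidecar():
    """Launch the polling thread once at profile startup."""
    threading.Thread(target=_poll_squeue, daemon=True).start()


def job_status(jobid):
    """What cluster-status would query instead of shelling out to sacct."""
    with _LOCK:
        return JOB_STATES.get(str(jobid), "UNKNOWN")
```

cluster-status would then call job_status(jobid) instead of running sacct for every check; the micro REST API variant would simply expose the same dict over HTTP.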

@percyfal
Collaborator

Hi @holtgrewe, thanks for the illuminating comment - I wasn't aware of the --iterate option! I have not encountered the problems you describe on our HPC, but I have had the nagging feeling that something like this is bound to happen once enough users run these commands.

Could you elaborate more specifically on what changes would be needed here and how they relate to the Snakemake pull request you reference? The sidecar process would launch squeue (or any command of choice), and cluster-status would then communicate with that process? Is there anything that could already be implemented in the profile, or should I just hang tight until the PR is complete?

@holtgrewe
Author

Hi. I stand corrected. According to Slurm support, doing watch -n 10 squeue is the same as squeue -i 10. However, caching the squeue output should still be better than issuing many sacct calls. Let me see my Snakemake PRs through and I will continue here.

@holtgrewe
Author

Addressed by #85. This will need Snakemake v7 and the PR mentioned in the ticket.

As an alternative, administrators can set up a central slurmrestd behind a caching proxy, and slurm-status.py could query this. As I have not set up slurmrestd centrally yet and it requires administrator intervention, I exclude this from the scope here.
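For illustration only, a rough sketch of what such a slurmrestd query from slurm-status.py could look like; the base URL, the API version (v0.0.37 here), and the JWT handling via SLURM_JWT are all assumptions about the local setup:

```python
# Hypothetical sketch: ask a slurmrestd (ideally behind a caching proxy) for a
# job's state instead of calling sacct. Endpoint path, API version and auth
# are site-specific assumptions, not part of this profile.
import json
import os
import sys
import urllib.request

BASE_URL = os.environ.get("SLURMRESTD_URL", "http://localhost:6820")
API_VERSION = "v0.0.37"  # adjust to the slurmrestd version actually deployed


def job_state(jobid):
    req = urllib.request.Request(
        f"{BASE_URL}/slurm/{API_VERSION}/job/{jobid}",
        headers={
            "X-SLURM-USER-NAME": os.environ["USER"],
            "X-SLURM-USER-TOKEN": os.environ["SLURM_JWT"],
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["jobs"][0]["job_state"]


if __name__ == "__main__":
    state = job_state(sys.argv[1])
    # Map Slurm states onto the words Snakemake's cluster-status expects.
    if state == "COMPLETED":
        print("success")
    elif state in ("PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"):
        print("running")
    else:
        print("failed")
```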
