cluster-status should use squeue --iterate #81
Comments
Hi @holtgrewe, thanks for the illuminating comment - I wasn't aware of the […]. Could you elaborate more specifically on what changes would be needed here and how they relate to the snakemake pull request you reference? The sidecar process would launch …
Hi. I stand corrected. According to Slurm support, doing watch -n 10 squeue is the same as squeue -i 10. However, caching the squeue output should still be better than issuing many sacct calls. Let me see my snakemake PRs through and I will continue here.
Addressed by #85. This will need Snakemake v7 and the PR mentioned in the ticket. As an alternative, administration can set up a central slurmrestd behind a caching proxy and …
I'm a long-term Snakemake user, starting out with SGE and DRMAA (a standard actually driven forward by the SGE vendor). We switched to Slurm quite some time ago and it works fine with DRMAA, but there is one problem: our controller is draining in an "RPC storm". What is an RPC storm? Many users performing queries such as `squeue` repetitively, e.g. as `watch squeue`. Something similar is true for `sacct`, but this will not hit the controller but rather the `slurmdbd`.
Why does this matter, you might wonder. Here is why: the cluster status script uses `sacct` internally, as you already know. You probably use it because Slurm jobs are no longer visible in `squeue` once more than `MinJobAge` seconds have passed since they finished.
So, to summarize up to here:
- `watch squeue` is an antipattern in Slurm
Now, what do the wonderful SchedMD recommend? Using the `-i/--iterate` function. This will make one RPC call and keep it open, and then the controller will happily print the queue to you with almost no hit on the controller. Choosing an `-i` value small enough will also ensure that snakemake can know all status updates.
Example output below.
What are the actionables that I propose, you might ask. Good question. What we would need is to
- run `squeue --me -i 10 --format='%i,%T'` at startup in a background thread
- … `dict`
- … `dict`
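The steps above could be sketched roughly as follows. This is only an illustration, not the profile's actual code: the names `parse_row`, `snapshots`, `StatusCache`, `watch` and `start_squeue_watcher` are made up here, and the parser assumes that each `squeue -i` iteration starts with a timestamp line followed by a `JOBID,STATE` header, which should be checked against the Slurm version in use.

```python
import subprocess
import threading
from typing import Dict, Iterable, Iterator, Optional, Tuple


def parse_row(line: str) -> Optional[Tuple[str, str]]:
    """Split a 'jobid,STATE' row; return None for anything else."""
    jobid, sep, state = line.rpartition(",")
    if sep and jobid[:1].isdigit() and state.isupper() and state.replace("_", "").isalpha():
        return jobid, state
    return None


def snapshots(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group squeue --iterate output into one dict per iteration.

    Assumption: a non-blank line that is neither the header nor a job row
    (i.e. the timestamp) marks the start of a new iteration.
    """
    current: Optional[Dict[str, str]] = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("JOBID"):
            continue  # blank separator or header line
        row = parse_row(line)
        if row is None:
            # Timestamp line: the previous iteration is complete.
            if current is not None:
                yield current
            current = {}
        else:
            if current is None:
                current = {}
            current[row[0]] = row[1]
    if current is not None:
        yield current


class StatusCache:
    """Thread-safe dict of job id -> state, replaced once per iteration."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[str, str] = {}

    def replace(self, states: Dict[str, str]) -> None:
        with self._lock:
            self._states = dict(states)

    def get(self, jobid: str, default: Optional[str] = None) -> Optional[str]:
        with self._lock:
            return self._states.get(jobid, default)


def watch(lines: Iterable[str], cache: StatusCache) -> None:
    """Consume squeue output and keep the cache up to date."""
    for snap in snapshots(lines):
        cache.replace(snap)


def start_squeue_watcher(cache: StatusCache, interval: int = 10):
    """Start one long-running squeue and parse it in a background thread."""
    proc = subprocess.Popen(
        ["squeue", "--me", "-i", str(interval), "--format=%i,%T"],
        stdout=subprocess.PIPE,
        text=True,
    )
    thread = threading.Thread(target=watch, args=(proc.stdout, cache), daemon=True)
    thread.start()
    return proc, thread
```

Because a snapshot is only published once the next timestamp line arrives, the cache lags by up to one interval. Jobs missing from the cache have presumably left the queue, so a status script would still need a fallback such as a single `sacct` call for those.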
One way to implement this is to have the Slurm profile serve the cached values through a micro REST API.
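A minimal sketch of such a micro REST API, using only the Python standard library, might look like the following. The URL layout (`/status/<jobid>`), the JSON shape and the helper name `make_server` are assumptions for illustration; the `states` dict stands in for whatever cache the background thread maintains.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Dict


def make_server(states: Dict[str, str], lock: threading.Lock,
                host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Build an HTTP server answering GET /status/<jobid> from `states`.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            # Hypothetical URL scheme: the last path segment is the job id.
            jobid = self.path.rsplit("/", 1)[-1]
            with lock:
                state = states.get(jobid)
            body = json.dumps({"jobid": jobid, "state": state}).encode()
            # 404 signals "not in the cache", e.g. the job already left squeue.
            self.send_response(200 if state is not None else 404)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep the sidecar quiet

    return ThreadingHTTPServer((host, port), Handler)
```

The server would be started with `threading.Thread(target=server.serve_forever, daemon=True).start()`, and the cluster status script would then issue one cheap local HTTP request per job instead of contacting the Slurm controller.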