
slurm_utils.py submit_job function is very slow #79

Open
atermanini opened this issue Oct 28, 2021 · 13 comments
@atermanini

Hello,

I am using this cookiecutter template with the latest Snakemake and SLURM.

For pipelines that spawn jobs that take a short time to execute (e.g. 30 seconds or less), even when spawning 1000 independent jobs I cannot get more than 30-40 jobs running at a time. Therefore, I am not able to use the full power of the HPC.

After many checks, I found that Snakemake takes at least 1 second to spawn each new job.

The majority of this time is spent on the last line of the slurm-submit.py script, which takes around 0.7 seconds:
print(slurm_utils.submit_job(jobscript, **sbatch_options))

Going into slurm_utils.py and inspecting the submit_job function with some timing checks, I can confirm that this is where most of the time is spent.

Is there any way to optimize this function?

Here we are running several pipelines that spawn thousands of independent jobs, each taking no more than 30 seconds to complete. If Snakemake is spawning one job per second, it is difficult to take advantage of our cluster.

Please let me know if I can help with more details / code to replicate the problem.

Thanks very much,

Best,
Alberto

@maarten-k

Hi @atermanini ,
I quickly checked the code in the function and I do not see any possible source of this slowness except the call to SLURM itself.
Can you check how long it takes to submit a job to SLURM with the sbatch command:

/usr/bin/time -v sbatch myscript.sbatch

I am afraid your SLURM system is a bit sluggish.

@atermanini
Author

atermanini commented Oct 29, 2021

Hi @maarten-k

thanks very much for your help.

The results are:

real     0m0.060s
user    0m0.004s
sys     0m0.000s

On our system the submission with sbatch is fast.
Also, when running Snakemake without the cookiecutter profile, job submission is fast
(--cluster sbatch --partition cpuq --time 00:10:00 --mem 1G --nodes 1 --ntasks 1 --cpus-per-task 1)

I am experiencing this delay of nearly one second only with the cookiecutter template.

For debugging, I took the jobscript from Snakemake's temporary folder (snakejob.align_seq.2.sh), and I also printed the "cmd" variable which is passed to sp.check_output(cmd) inside the submit_job function:

['sbatch', '--parsable', '--nodes=1', '--ntasks=1', '--cpus-per-task=1', '--mem=1000', '--time=00:10:00', '--partition=cpuq', '--job-name=align_seq', '--output=slurm_logs/align_seq/slurm-%j.out', '--error=slurm_logs/align_seq/slurm-%j.err', '/home/alberto.termanini/tests/snakemake_tests/pipeline_05/.snakemake/tmp.gspor3wl/snakejob.align_seq.2.sh']

less snakejob.align_seq.2.sh

#!/bin/bash
# properties = {"type": "single", "rule": "align_seq", "local": false, "input": ["data/0001.fastq"], "output": ["sam/0001.sam"], "wildcards": {"sample": "0001"}, "params": {}, "log": [], "threads": 1, "resources": {"tmpdir": "/tmp", "mem_mb": 1000, "time": "00:10:00"}, "jobid": 2, "cluster": {}}
 cd /home/alberto.termanini/tests/snakemake_tests/pipeline_05 && \
/home/alberto.termanini/conda_envs/snakemake/bin/python3.9 \
-m snakemake sam/0001.sam --snakefile /home/alberto.termanini/tests/snakemake_tests/pipeline_05/Snakefile \
--force --cores all --keep-target-files --keep-remote --max-inventory-time 0 \
--wait-for-files /home/alberto.termanini/tests/snakemake_tests/pipeline_05/.snakemake/tmp.31efp_wf data/0001.fastq --latency-wait 90 \
 --attempt 1 --force-use-threads --scheduler ilp \
--wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ \
  -p --allowed-rules align_seq --nocolor --notemp --no-hooks --nolock --scheduler-solver-path /home/alberto.termanini/conda_envs/snakemake/bin \
--mode 2  --default-resources "tmpdir=system_tmpdir"  && exit 0 || exit 1

Here is my testing pipeline code

less Snakefile

from os.path import join

DATA_DIR = 'data'
SAMPLES = glob_wildcards(join(DATA_DIR, "{sample}.fastq")).sample

rule all:
    input:
        'results/summary.txt'
        
rule make_summary:
    input:
        expand("sam/{sample}.sam", sample = SAMPLES)
    output:
        'results/summary.txt'
    threads: 1    
    resources:
        mem_mb=500,
        time="00:02:00"
    shell:
        'ls {input} | wc -l > {output}'

rule align_seq:
    input:
        join(DATA_DIR,'{sample}.fastq')
    output:
        'sam/{sample}.sam'
    threads: 1
    resources:
        mem_mb=1000,
        time="00:10:00"
    shell:
        "sleep 10s; "
        "echo {input} > {output}"

The files are just empty files generated with:
touch data/{0001..1000}.fastq

I hope you can help me find out why this is happening.

Thanks again,
Best, Alberto

@maarten-k

Looks like SLURM is not the culprit, and to be honest I do not know what is for now.

Going into slurm_utils.py and inspecting the submit_job function with some timing checks, I can confirm that this is where most of the time is spent.

Did you profile the submit_job function in a bit more detail?

@atermanini
Author

Hi @maarten-k ,

by profiling the submit_job function (just printing the difference between datetime.now() before and after each block of code), I saw that the command:

res = sp.check_output(cmd)

is taking the majority of the time: from 0.6 to 1 second or more (see screenshot in attachment).
This is happening both when executing Snakemake on a head node of the cluster and when running inside an sbatch script:

0:00:00.059181 0:00:00.074783 0:00:00.064700 0:00:00.059625 0:00:00.050544 0:00:00.053404 0:00:00.055185 0:00:00.055678 0:00:00.054610 0:00:00.050749 0:00:00.054997 0:00:00.081413 0:00:00.094667 0:00:00.055244 0:00:00.051311 0:00:00.110717 0:00:00.119001 0:00:00.053417

Usually, the first three or four jobs that are spawned have even higher times (2-3 seconds), then the timing goes down and stays between 0.5 and 1 second. I don't know if this is helpful: I am running Snakemake inside a conda environment installed in my home folder.
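For reference, the timing checks were essentially of this form (a minimal sketch of the instrumentation only; the real submit_job in slurm_utils.py does more, and the function name below is made up):

# minimal sketch: timing the sbatch call the way submit_job wraps it
# (assumption: cmd is the sbatch command list shown in my previous comment)
import subprocess as sp
from datetime import datetime

def timed_submit(cmd):
    before = datetime.now()
    res = sp.check_output(cmd)       # this call accounts for the 0.6-1 s per job
    print(datetime.now() - before)   # elapsed wall-clock time for the sbatch call
    return res.decode().strip()      # job id, thanks to sbatch --parsable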

Please let me know if I can make some more tests.

Thanks again,
Alberto
[screenshot attachment: timing output]

@frankier

frankier commented Nov 1, 2021

The slowness is because each submission starts a new Python interpreter to run slurm-submit.py. See #73
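A rough way to quantify that interpreter start-up cost (an illustrative sketch only; it assumes slurm-submit.py is launched as a fresh python3 process and that importing snakemake dominates its import time):

# rough sketch: time a fresh python3 process, bare and with the snakemake import
# (assumption: slurm-submit.py pays roughly this start-up cost on every submission)
import subprocess, time

for snippet in ("pass", "import snakemake"):
    start = time.perf_counter()
    subprocess.run(["python3", "-c", snippet], check=True)
    print(f"{snippet!r}: {time.perf_counter() - start:.3f}s")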

@atermanini
Author

Hi @frankier,

I am not sure that the problem is starting a new Python interpreter. That certainly takes time; however, slurm-submit.py takes around 0.7 seconds to execute, and 0.6 seconds (90%) of that is spent in the "res = sp.check_output(cmd)" call, so I think it is something more specific to this function call.

What do you think?

Best,
Alberto

@frankier

frankier commented Nov 2, 2021

Looks like I jumped the gun there since it seems like that's not where your slowness is. Did you try the alternative profile anyway? Whether it works faster or not might provide extra information.

@atermanini
Author

Hi @frankier ,

I will try the alternative profile suggested and let you know.

Thanks again,
Alberto

@atermanini
Author

Hi @frankier ,

with the "Simple SLURM" profile and status-sacct.sh, job submission was much faster: I reached a peak of around 470 running jobs. Each job executes a simple "sleep 30s" shell command, and there are 1000 jobs in total. The execution seems to be stable.

Thanks,
Alberto

@percyfal
Collaborator

Hi @atermanini , sorry for the delay. Thank you for your detailed report and the identification of the bottleneck you are experiencing when submitting your jobs. I will try to find some time to look into this. I personally haven't really had any problems with the submissions being a bit slow for the simple reason that I haven't dealt with the number of jobs that you are.

Cheers,

Per

@jdblischak

For pipelines that spawn jobs that take a short time to execute (e.g. 30 seconds or less), even when spawning 1000 independent jobs I cannot get more than 30-40 jobs running at a time. Therefore, I am not able to use the full power of the HPC.

@atermanini This was the exact same problem I was having when I decided to create smk-simple-slurm. I knew from previous experience with Snakemake prior to the introduction of the --profile flag that it had the potential to submit jobs much faster.

The slowness is because each submission starts a new Python interpreter to run slurm-submit.py.

@frankier Thanks for sharing! My working assumption is still that all the extra Python invocations are causing the slowdown, but I admit I didn't perform a detailed diagnosis of the exact bottlenecks. I just assumed that Snakemake (i.e. Python) shelling out to Bash just to re-invoke Python again over and over (for each job submission and each time each job status is checked) was contributing to the problem.

In case it could be useful for comparing submission speeds, I created the benchmark below to convince myself that Snakemake could submit jobs faster (i.e. that Slurm itself wasn't the culprit).

https://github.com/jdblischak/smk-simple-slurm/tree/7c54cba6ff59a186d8f33d4a14a535607b980868/examples/jobs-per-second
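The idea behind it is simply to push trivial jobs through sbatch in a tight loop and count submissions per second. A stand-alone illustration of that idea (not the linked benchmark itself; the cpuq partition name is just taken from the example above):

# illustration only: measure raw sbatch throughput with Snakemake out of the picture
# (assumes the cpuq partition from the example above; adjust for your cluster)
import subprocess, time

N = 50
start = time.perf_counter()
for _ in range(N):
    subprocess.run(
        ["sbatch", "--parsable", "--partition=cpuq", "--wrap", "sleep 30"],
        check=True, capture_output=True,
    )
print(f"{N / (time.perf_counter() - start):.1f} submissions per second")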

@mbhall88
Member

mbhall88 commented May 10, 2022

Just chiming in - I am new to this profile (I am the developer of the LSF profile but have just moved to an institute that uses SLURM).

If you have advanced argument conversion turned on, I believe there is an additional call to sinfo being made to check the limits for the partition/cluster:

cmd = f"sinfo -e -o %all -p {partition} {cluster}".split()
try:
output = sp.Popen(" ".join(cmd), shell=True, stdout=sp.PIPE).communicate()
except Exception as e:
print(e)
raise

I have not profiled with and without this advanced argument conversion, but it must surely add to the overhead?

I guess a first question would be: is this needed? I kind of feel that if a user is asking for more time than the partition allows, that's their issue, not this profile's?

@percyfal
Collaborator

@mbhall88 yes, the advanced argument hack with multiple calls to sinfo is certainly not ideal. Storing the cluster configuration in a file and loading it once would be one way to improve this step. Regarding whether it is needed or not, the initial motivation was to solve cases where time/memory was automatically increased (e.g. doubled in a lambda function) on failed jobs. Admittedly, this need has proven less evident with the new flags to set resources. I'll mark it for reworking/removal.
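Something along these lines is what I have in mind (only a sketch of the caching idea; the cache path and helper name are made up, and parsing of the sinfo output is omitted):

# sketch: call sinfo once and cache its output on disk for subsequent submissions
# (hypothetical cache location and helper; not code from the profile)
import json, os, subprocess as sp

CACHE = os.path.expanduser("~/.cache/slurm_partition_info.json")

def partition_info(partition, cluster=""):
    if os.path.exists(CACHE):
        with open(CACHE) as fh:
            return json.load(fh)
    cmd = f"sinfo -e -o %all -p {partition} {cluster}".split()
    output = sp.check_output(cmd).decode()
    os.makedirs(os.path.dirname(CACHE), exist_ok=True)
    with open(CACHE, "w") as fh:
        json.dump(output, fh)  # raw sinfo output; parsing left out for brevity
    return output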

Still, the issue with slow submission for many short jobs remains.
