
Variant calling using sfm does not adhere to thread limit #56

Open
matthdsm opened this issue Oct 21, 2021 · 13 comments

Comments

@matthdsm
Contributor

Hi,

I noticed that when using 'sfm' mode, elprep uses more resources than allowed (i.e. all available threads instead of the requested thread count). This does not happen when using filter.

Cheers
Matthias

@caherzee
Contributor

Hi Matthias,

How are you observing that more threads are used by elprep?

(I double-checked the source code, which seems fine, and I cannot reproduce the behaviour.)

Thanks!
Charlotte

@matthdsm
Contributor Author

I was monitoring CPU usage through htop and saw that the entire node was being used (all cores at full CPU), despite having asked for only 35 of the 80 available threads.

I'll rerun the analysis and see if this occurs again.
Matthias

@caherzee
Contributor

If possible, can you run your command with /usr/bin/time -v (e.g. /usr/bin/time -v elprep sfm input.bam output.bam) and send me the output?

Thanks!
Charlotte

@matthdsm
Contributor Author

Good thing I'm benchmarking right now!

Command being timed: "elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 35 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed"
User time (seconds): 74572.83
System time (seconds): 2917.64
Percent of CPU this job got: 2323%
Elapsed (wall clock) time (h:mm:ss or m:ss): 55:35.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 40641948  (≈ 40 GB peak RAM usage)
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 352416596
Voluntary context switches: 51149677
Involuntary context switches: 2166519
Swaps: 0
File system inputs: 0
File system outputs: 55015096
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

@caherzee
Contributor

Hmm, that seems OK: 2323% CPU works out to roughly 23 cores busy on average, which is below the 35 threads you requested. Any chance the extra threads you see are not from the elprep command?

@matthdsm
Contributor Author

Elprep was the only thing running on the node at that moment.
I'm currently running the same sample on 10 threads. So far the spike hasn't occurred.
I'll get back to you in an hour with the results.

@matthdsm
Contributor Author

The spike occurred again, starting with "Phase 2" of the process.
[Screenshot: 2021-10-21 at 14:56:35]

@matthdsm
Contributor Author

Elprep log

elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.
2021/10/21 12:07:49 Created log file at /projects/demultiplex/development/elprep/logs/elprep/elprep-2021-10-21-12-07-49-573422852-UTC.log
2021/10/21 12:07:49 Command line: [elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 10 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed]
2021/10/21 12:07:49 Executing command:
 elprep sfm NA12878.bam NA12878.elprep.bam --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --quantize-levels 0 --max-cycle 500 --known-sites dbsnp-151.elsites --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --haplotypecaller NA12878.elprep.g.vcf.gz --sorting-order coordinate --nr-of-threads 10 --timed --log-path /projects/demultiplex/development/elprep --intermediate-files-output-prefix NA12878 --intermediate-files-output-type bam
2021/10/21 12:07:49 Splitting...
2021/10/21 12:12:55 Filtering (phase 1)...
2021/10/21 12:55:34 Filtering (phase 2) and variant calling...
2021/10/21 13:14:29 Merging sam/bam...
2021/10/21 13:18:42 Merging metrics...
2021/10/21 13:18:42 Merge vcf...

time log

Command being timed: "elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 10 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed"
        User time (seconds): 67783.27
        System time (seconds): 2824.52
        Percent of CPU this job got: 1659%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10:53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45212524
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 391376706
        Voluntary context switches: 41522383
        Involuntary context switches: 1930008
        Swaps: 0
        File system inputs: 0
        File system outputs: 55015496
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@caherzee
Contributor

I think the problem might be that too many system calls are blocking in this specific case. When that happens, the Go runtime keeps launching additional OS threads, even though only 10 goroutines should ever actually be running in parallel. So, if I understand it correctly, the extra threads themselves should not be an issue, which is why we don't see much higher CPU usage than expected. I'm not sure whether I can prevent the Go runtime from doing this; I'll have to look into it a bit deeper.
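
For illustration, here is a minimal standalone Go sketch of that behaviour. This is not elprep code, and it assumes that --nr-of-threads maps onto runtime.GOMAXPROCS: GOMAXPROCS caps how many goroutines execute Go code simultaneously, but a goroutine blocked in a system call keeps its OS thread, so the runtime creates extra threads and the thread count htop reports can grow well past the limit, even though CPU usage stays bounded.

// Minimal sketch, not elprep code. Assumption: --nr-of-threads behaves like
// runtime.GOMAXPROCS. GOMAXPROCS limits goroutines executing Go code at the
// same time, but goroutines blocked in system calls (e.g. slow file I/O)
// each keep their own OS thread, so the process can show far more threads
// than the configured limit.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(10) // comparable to --nr-of-threads 10

	buf := make([]byte, 1<<20) // 1 MiB to write per goroutine
	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f, err := os.CreateTemp("", "io-demo-*")
			if err != nil {
				return
			}
			defer os.Remove(f.Name())
			defer f.Close()
			// Write and Sync block in system calls; the scheduler parks each
			// blocked goroutine on its own OS thread and hands the processor
			// slot to another goroutine.
			f.Write(buf)
			f.Sync()
		}()
	}
	wg.Wait()

	// The threadcreate profile counts OS threads the runtime has created;
	// on a slow filesystem this easily exceeds GOMAXPROCS.
	fmt.Println("GOMAXPROCS:        ", runtime.GOMAXPROCS(0))
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}

Note that runtime/debug.SetMaxThreads is only a crash threshold (10000 by default), not a soft cap, so it is not a way to keep the thread count down.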

@matthdsm
Contributor Author

I'm currently re-running with the tmp path set to a faster FS; that should help with the blocked calls, no?

@matthdsm
Contributor Author

No dice: the same result, although to a lesser extent. A slow(ish) FS seems to be the culprit in this case.

@caherzee
Contributor

That makes sense. Are you restricting the number of threads so you can run multiple jobs on the server?

@matthdsm
Contributor Author

Currently, yes. I'm running the same sample with different parameters to find the sweet spot between resource usage and speed for our hardware. So far I've noticed that throwing more resources at elprep does not scale linearly: doubling the resources only gained about 2 minutes.
