
Variant calling using sfm does not adhere to thread limit #56

Open
matthdsm opened this issue Oct 21, 2021 · 13 comments

Comments

@matthdsm
Contributor

Hi,

I noticed that when using 'sfm' mode, elprep uses more resources than allowed (i.e. all available threads instead of the requested thread count). This does not happen when using filter.

Cheers
Matthias

@caherzee
Contributor

Hi Matthias,

How are you observing that more threads are used by elprep?

(I double-checked the source code, which seems fine, and I cannot reproduce the behaviour.)

Thanks!
Charlotte

@matthdsm
Contributor Author

I was monitoring CPU usage through htop and saw that the entire node was being used (all cores at full CPU), despite having asked for only 35 of the 80 available threads.

I'll rerun the analysis and see if this occurs again.
Matthias

@caherzee
Contributor

If possible, can you run your command with /usr/bin/time -v (e.g. /usr/bin/time -v elprep sfm input.bam output.bam) and send me the output?

Thanks!
Charlotte

@matthdsm
Contributor Author

Good thing I'm benchmarking right now!

Command being timed: "elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 35 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed"
User time (seconds): 74572.83
System time (seconds): 2917.64
Percent of CPU this job got: 2323%
Elapsed (wall clock) time (h:mm:ss or m:ss): 55:35.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 40641948  (≈ 40 GB peak RAM usage)
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 352416596
Voluntary context switches: 51149677
Involuntary context switches: 2166519
Swaps: 0
File system inputs: 0
File system outputs: 55015096
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

@caherzee
Contributor

Hmm, that seems OK: 2323% CPU works out to roughly 23 cores busy on average, which is below the 35 threads you requested. Any chance the extra threads you see are not from the elprep command?

@matthdsm
Contributor Author

Elprep was the only thing running on the node at that moment.
I'm currently running the same sample on 10 threads. So far the spike hasn't occurred.
I'll get back to you in an hour with the results.

@matthdsm
Contributor Author

The spike occurred again, starting with "Phase 2" of the process.
[Screenshot: 2021-10-21 at 14:56:35]

@matthdsm
Contributor Author

Elprep log

elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.
2021/10/21 12:07:49 Created log file at /projects/demultiplex/development/elprep/logs/elprep/elprep-2021-10-21-12-07-49-573422852-UTC.log
2021/10/21 12:07:49 Command line: [elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 10 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed]
2021/10/21 12:07:49 Executing command:
 elprep sfm NA12878.bam NA12878.elprep.bam --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --quantize-levels 0 --max-cycle 500 --known-sites dbsnp-151.elsites --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --haplotypecaller NA12878.elprep.g.vcf.gz --sorting-order coordinate --nr-of-threads 10 --timed --log-path /projects/demultiplex/development/elprep --intermediate-files-output-prefix NA12878 --intermediate-files-output-type bam
2021/10/21 12:07:49 Splitting...
2021/10/21 12:12:55 Filtering (phase 1)...
2021/10/21 12:55:34 Filtering (phase 2) and variant calling...
2021/10/21 13:14:29 Merging sam/bam...
2021/10/21 13:18:42 Merging metrics...
2021/10/21 13:18:42 Merge vcf...

time log

Command being timed: "elprep sfm NA12878.bam NA12878.elprep.bam --nr-of-threads 10 --mark-duplicates --mark-optical-duplicates NA12878_duplicate_metrics.txt --optical-duplicates-pixel-distance 2500 --sorting-order coordinate --haplotypecaller NA12878.elprep.g.vcf.gz --bqsr NA12878.recall --reference /references/Hsapiens/hg38/seq/hg38.elfasta --known-sites dbsnp-151.elsites --reference-confidence GVCF --target-regions /references/Hsapiens/hg38/coverage/capture_regions/CMGG_WES_analysis_ROI_v2.bed --log-path /projects/demultiplex/development/elprep --timed"
        User time (seconds): 67783.27
        System time (seconds): 2824.52
        Percent of CPU this job got: 1659%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:10:53
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45212524
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 391376706
        Voluntary context switches: 41522383
        Involuntary context switches: 1930008
        Swaps: 0
        File system inputs: 0
        File system outputs: 55015496
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@caherzee
Contributor

I think the problem might be that too many system calls are blocking in this specific case. When that happens, the Go runtime keeps launching additional OS threads, even though only 10 goroutines should ever actually be running in parallel. So, if I understand it correctly, the extra threads themselves should not be an issue, which is why we don't see much higher CPU usage than expected. I'm not sure whether I can prevent the Go runtime from doing this; I'll have to look into it a bit deeper.
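
For illustration, here is a minimal standalone Go sketch of that behaviour. This is not elprep code, and it assumes that --nr-of-threads maps onto runtime.GOMAXPROCS: GOMAXPROCS caps how many goroutines execute Go code simultaneously, but a goroutine blocked in a system call keeps its OS thread, so the runtime creates extra threads and the thread count htop reports can grow well past the limit, even though CPU usage stays bounded.

// Minimal sketch, not elprep code. Assumption: --nr-of-threads behaves like
// runtime.GOMAXPROCS. GOMAXPROCS limits goroutines executing Go code at the
// same time, but goroutines blocked in system calls (e.g. slow file I/O)
// each keep their own OS thread, so the process can show far more threads
// than the configured limit.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(10) // comparable to --nr-of-threads 10

	buf := make([]byte, 1<<20) // 1 MiB to write per goroutine
	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f, err := os.CreateTemp("", "io-demo-*")
			if err != nil {
				return
			}
			defer os.Remove(f.Name())
			defer f.Close()
			// Write and Sync block in system calls; the scheduler parks each
			// blocked goroutine on its own OS thread and hands the processor
			// slot to another goroutine.
			f.Write(buf)
			f.Sync()
		}()
	}
	wg.Wait()

	// The threadcreate profile counts OS threads the runtime has created;
	// on a slow filesystem this easily exceeds GOMAXPROCS.
	fmt.Println("GOMAXPROCS:        ", runtime.GOMAXPROCS(0))
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}

Note that runtime/debug.SetMaxThreads is only a crash threshold (10000 by default), not a soft cap, so it is not a way to keep the thread count down.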

@matthdsm
Contributor Author

I'm currently re-running with the tmp path set to a faster FS; that should help with the blocked calls, no?

@matthdsm
Contributor Author

No dice: the same result, although to a lesser extent. A slow(ish) FS seems to be the culprit in this case.

@caherzee
Contributor

That makes sense. Are you restricting the number of threads so you can run multiple jobs on the server?

@matthdsm
Contributor Author

Currently, yes. I'm running the same sample with different parameters to find the sweet spot between resource usage and speed for our hardware. So far I've noticed that throwing more resources at elprep does not scale linearly: doubling the resources only gained about 2 minutes.
