
Threads used affects results #450

Closed
keiranmraine opened this issue Feb 8, 2021 · 17 comments

Comments

@keiranmraine

keiranmraine commented Feb 8, 2021

Hi,

This may be a known issue, or specific to the version I'm running, but I couldn't find any existing issues about it. Docker image downloaded from Docker Hub - version 2.9.4.

I've been running some tests to evaluate the most efficient way to deploy GRIDSS on our system, varying the number of threads etc.

I'm finding that the calls are varying considerably depending on the number of threads used:

  • output-2 = 2 threads
  • output-4 = 4 threads
  • output-6 = 6 threads
  • output-8 = 8 threads
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-4/test.vcf.gz | cut -f 1-5 | sort) | wc -l
46039
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-6/test.vcf.gz | cut -f 1-5 | sort) | wc -l
720
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-8/test.vcf.gz | cut -f 1-5 | sort) | wc -l
0

Along with the difference in calls, I noted the following:

  1. The VCFs aren't equally sorted. Extending the sort to REF/ALT ensures stable outputs when multiple events occur at the same position (see the sketch after this list).
  2. Although 2 threads vs 8 appears to match, there are differences in the genotype column.
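
For reference, a minimal sketch of the extended sort (the *.body.sorted file names are just illustrative):

$ zcat output-2/test.vcf.gz | grep -v '^#' | sort -k1,1 -k2,2n -k4,4 -k5,5 > output-2.body.sorted
$ zcat output-8/test.vcf.gz | grep -v '^#' | sort -k1,1 -k2,2n -k4,4 -k5,5 > output-8.body.sorted
$ diff output-2.body.sorted output-8.body.sorted | wc -l   # full-line diff, so the genotype-column differences from point 2 show up here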

Executed via:

export CPUS=8 # this value changed for each thread count
rm -rf output-${CPUS} output-${CPUS}-tmp
mkdir -p output-${CPUS} output-${CPUS}-tmp
gridss.sh \
 --reference $PWD/ref/genome.fa \
 --blacklist $PWD/ref/blacklist.bed \
 --threads $CPUS \
 --labels test \
 --assembly $PWD/output-${CPUS}/test.assembly.bam \
 --output $PWD/output-${CPUS}/test.vcf.gz \
 --workingdir output-${CPUS}-tmp \
 $PWD/inputs/INPUT.bam
@d-cameron
Member

d-cameron commented Feb 9, 2021 via email

@d-cameron
Member

Doing regression testing to see if this is a symptom of #363, or a more widespread issue.

@keiranmraine
Author

I am able to confirm that the first point of divergence is in the *.sv.bam files. All of the ASCII files generated during the preprocess step are identical (apart from paths/timestamps).

d-cameron pushed a commit that referenced this issue Feb 16, 2021
@d-cameron
Member

d-cameron commented Feb 16, 2021

Are you using a version of bwa that is not stable w.r.t. the number of threads used?

See https://www.biostars.org/p/90390/

GRIDSS calls bwa during preprocessing to identify split reads (e.g. bwa does not report all split reads & bowtie2 does not do split read alignment at all). Differences in the alignment results returned will have a downstream impact on the GRIDSS SV calls.
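
A quick way to check this outside GRIDSS is to align the same reads with different thread counts and compare the records (reads.fq and the bwa-t*.sam names below are just placeholders):

$ bwa mem -t 2 ref/genome.fa reads.fq | grep -v '^@' | sort > bwa-t2.sam
$ bwa mem -t 8 ref/genome.fa reads.fq | grep -v '^@' | sort > bwa-t8.sam
$ diff bwa-t2.sam bwa-t8.sam | wc -l   # non-zero output means the alignments depend on the thread count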

@keiranmraine
Author

We are using the image you have placed on dockerhub:

docker pull gridss/gridss:2.9.4

The input is the same file for each execution, the only variable being the number of threads passed to GRIDSS.

@keiranmraine
Author

I should mention we are moving to 2.10.2 (or greater); we are not fixed on the 2.9.x branch.

@d-cameron
Member

Just to clarify: are the .sv.bam differences in the content of the SAM records themselves, or just md5 differences due to the bwa program group (@PG) header differing because of different paths / thread counts?
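
e.g. something along these lines would separate the two (WORKDIR_2/WORKDIR_4 below stand in for the two working directories):

# the @PG header line records the exact bwa invocation, including paths and -t
$ samtools view -H WORKDIR_2/test.bam.gridss.working/test.bam.sv.bam | grep '^@PG'
# samtools view without -h omits the header, so these checksums compare the records only
$ samtools view WORKDIR_2/test.bam.gridss.working/test.bam.sv.bam | md5sum
$ samtools view WORKDIR_4/test.bam.gridss.working/test.bam.sv.bam | md5sum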

@d-cameron
Member

OK, the root cause is that bwa 0.7.17-r1188 MAPQ scores are not stable w.r.t. the number of threads used to run bwa.

@keiranmraine
Author

It is the content of the SAM records themselves (2cpu vs 4cpu):

$ comm --nocheck-order -23 <(samtools view stepwise-2-tmp/test.bam.gridss.working/test.bam.sv.bam) <(samtools view stepwise-4-tmp/test.bam.gridss.working/test.bam.sv.bam) | wc -l 
5712180

@d-cameron
Member

OK, the root cause is that bwa is not stable w.r.t. thread count due to its dynamic batch size. -K needs to be passed to bwa to force stability.
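
For example (the -K value below is illustrative, not necessarily what GRIDSS passes; reads.fq is a placeholder):

# -K fixes the number of input bases processed per batch, so batching no longer depends on -t
$ bwa mem -K 100000000 -t $CPUS ref/genome.fa reads.fq > aln.sam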

@d-cameron
Member

--externalaligner is required if you need stability, as I don't enforce deterministic call order when doing multithreaded processing against the bwa JNI API.

@keiranmraine
Author

Just checking I understand: the dev branch has a correction, but to use it we need to set the --externalaligner option.

Is there a timeline for a release/hotfix that includes the VCF ordering change and this fix, or is there the possibility of a Docker image being made available for testing? Thanks

@d-cameron
Member

The fix is on the dev branch, but you still need to use --externalaligner for it to actually take effect, since the in-process bwa calls can occur out of order as they don't go via a fastq file.

ETA for next release is end of next week.
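
For the command earlier in this thread, that would look something like the sketch below (--externalaligner is assumed here to be a plain switch; check gridss.sh --help on the release for the exact usage):

gridss.sh \
 --reference $PWD/ref/genome.fa \
 --blacklist $PWD/ref/blacklist.bed \
 --threads $CPUS \
 --externalaligner \
 --labels test \
 --assembly $PWD/output-${CPUS}/test.assembly.bam \
 --output $PWD/output-${CPUS}/test.vcf.gz \
 --workingdir output-${CPUS}-tmp \
 $PWD/inputs/INPUT.bam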

@d-cameron
Member

d-cameron commented Feb 22, 2021

The preprocess step is slower when using --externalaligner, so there will be a small performance hit if you require deterministic alignment behaviour.

@keiranmraine
Author

Can we get an updated ETA on a versioned release please?

@d-cameron
Member

I have a release candidate for v2.11.0 already prepared. Release notes have been written and I'm currently doing regression testing. It should be a few days if all goes as expected, or within a week if I pick up any regression bugs.
