Very high false positive rate #68

Open
dancooke opened this issue Jul 2, 2018 · 7 comments

Comments

@dancooke

dancooke commented Jul 2, 2018

Hi,

I'm currently evaluating VarDict on a synthetic tumour dataset similar to the ICGC DREAM set. I'm using VarDict version 1.5.2 with the following commands:

Call variants:

$ VarDict \
    -G hs37d5.fa \
    -f 0.01 \
    -N NA12878.TUMOUR \
    -b "NA12878.TUMOUR.60x.skin.bwa-mem.b37.bam|NA12878.NORMAL.30x.bwa.b37.bam" \
    -z 0 -c 1 -S 2 -E 3 \
    hs37d5.chromosomes.bed \
    | testsomatic.R \
    | var2vcf_paired.pl \
      -N "NA12878.TUMOUR|NA12878.NORMAL" \
      -f 0.01 \
      | bgzip \
      > vardict.NA12878.syntumour.skin.b37.bwa.vcf.gz
$ tabix vardict.NA12878.syntumour.skin.b37.bwa.vcf.gz

Filter somatics:

$ bcftools view \
    -i 'INFO/STATUS="StrongSomatic"' \
    -f PASS \
    -Oz -o vardict.NA12878.syntumour.skin.b37.bwa.somatic.PASS.vcf.gz \
    vardict.NA12878.syntumour.skin.b37.bwa.vcf.gz
$ tabix vardict.NA12878.syntumour.skin.b37.bwa.somatic.PASS.vcf.gz

Although VarDict achieves high sensitivity, the false positive rate is very high (FDR 0.25 compared with < 0.02 for all other callers tested). Am I doing something wrong?

Thanks,
Dan
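For readers comparing numbers: the FDR quoted above is conventionally the fraction of reported calls that are false positives, FP / (TP + FP). A minimal sketch of that arithmetic follows. The helper name `fdr` and the counts are hypothetical and chosen only to reproduce the 0.25 figure; they are not taken from the actual evaluation.

```shell
# fdr TP FP: print the false discovery rate FP / (TP + FP) to four decimals.
# Hypothetical helper for illustration; not part of VarDict or bcftools.
fdr() {
    awk -v tp="$1" -v fp="$2" 'BEGIN {
        total = tp + fp
        if (total == 0) { print "NA"; exit }   # no calls at all: FDR undefined
        printf "%.4f\n", fp / total
    }'
}

# Made-up counts: 750 true positives and 250 false positives.
fdr 750 250   # prints 0.2500
```

In practice the TP and FP counts would come from comparing the filtered VCF against the truth set with whichever comparison tool the evaluation uses.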

@chapmanb
Contributor

chapmanb commented Jul 11, 2018

Dan;
Thanks much for starting this discussion. More validation sets are always welcome and it's great you're working on this. Are you able to share the truth sets and outcomes you're seeing? We're actively working on benchmarking VarDict on harder inputs (tumor-only, FFPE, low frequency) with truth sets:

https://github.com/bcbio/bcbio_validation_workflows#somatic-low-frequency-variants

and here are the current in progress validations:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq

It would be useful to see how your results compare.

Practically we do apply additional filters after calling which help improve specificity. The gory details are here:

https://github.com/bcbio/bcbio-nextgen/blob/ed98597efe7d18ff684a8ec64bd45bd39b647bfd/bcbio/variation/vardict.py#L270

and we wrote up some details about the filters here:

http://bcb.io/2016/04/04/vardict-filtering/

Looking forward to coordinating more to help improve your specificity.

@dancooke
Author

Thanks Brad, I'll have a go with these additional filters. I'm not able to release our validation set yet. Hopefully this will be forthcoming soon, along with the Octopus paper.

@chapmanb
Contributor

Dan -- sounds good, please let me know if you run into any problems. I know it is not super generalized to run outside of bcbio.

Is it worth running octopus on the tumor-only low frequency samples I mentioned above to provide a baseline comparison? The docs mention tumor/normal as preferred but not sure if you have a reasonable expectation of decent calls for tumor-only.

Looking forward to the octopus paper and working more on the validation set you're using when it's available.

@dancooke
Author

@chapmanb I'm considering looking at tumour-only; however, tumour-normal is definitely the priority. The main issue I have with tumour-only is to what extent somatic status is relevant. From the little testing I've done with tumour-only, octopus will happily call somatic variants, but often misclassifies them as germline (especially if the VAF isn't far from 50%). I wouldn't be surprised if this is the case for any tumour-only caller. The question then is how important this actually is; from a clinical perspective it may not be relevant at all. How to deal with this when composing a validation framework is not obvious to me.
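To make the VAF point concrete, here is a rough sketch that computes a variant allele frequency from read counts and flags whether it sits close to the 50% expected for a heterozygous germline variant. The helper name `near_het`, the default window of 0.1, and the read counts are all assumptions for illustration; this is not how octopus or any other caller classifies variants.

```shell
# near_het ALT_READS DEPTH [WINDOW]: print the VAF and whether it lies within
# WINDOW of 0.5. Hypothetical helper; WINDOW defaults to an arbitrary 0.1.
near_het() {
    awk -v alt="$1" -v dp="$2" -v w="${3:-0.1}" 'BEGIN {
        if (dp <= 0) { print "NA"; exit }       # no coverage: VAF undefined
        vaf = alt / dp
        d = vaf - 0.5; if (d < 0) d = -d        # absolute distance from 0.5
        printf "%.3f\t%s\n", vaf, (d <= w ? "ambiguous" : "distinct")
    }'
}

near_het 27 60   # VAF 0.450: hard to separate from germline without a normal
near_het 3 60    # VAF 0.050: clearly away from the heterozygous expectation
```

With a matched normal the classification can use the normal's read evidence directly, which is why the ambiguity mostly affects tumour-only calling.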

@chapmanb
Contributor

Dan -- thanks for this. Misclassifying germline as somatic is definitely a problem across callers, and is a limitation of not having the matched normal to separate them effectively. I'll definitely work on including octopus in the comparisons, so you have an additional data point on octopus versus VarDict specificity (at least for these harder cases). Thanks again.

@chapmanb
Contributor

Dan;
Thanks again for the suggestions about Octopus tumor-only. To provide an additional data point for this discussion with the 0.5% validations we've been doing, I included octopus in bcbio and added it to the comparisons. Unfortunately I'm not seeing much detection at this low frequency:

https://github.com/bcbio/bcbio_validations/tree/master/somatic-lowfreq#low-frequency-umi-tagged-tumor-only-samples

so I probably need to tweak additional parameters for this type of high depth low frequency detection. Here's the implementation right now, which is pretty vanilla other than changing --min-credible-somatic-frequency:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/variation/octopus.py#L68

I'd be happy to tweak and improve to get octopus running on par with the other callers and provide a fair comparison. Thanks again for all this helpful discussion.

@dancooke
Author

Thanks for the feedback, Brad. I've opened an issue on Octopus' GitHub page regarding this; perhaps we should continue this discussion there?
