VennVcf 5.2.7 Design Document

tamsen edited this page Aug 18, 2018 · 8 revisions

Overview

VennVcf is a tool that sorts the variants from two input (g)vcf files into five output vcf files. Given input files A.gvcf and B.gvcf, VennVcf will return AnotB.vcf, AandB.vcf, BnotA.vcf, BandA.vcf, and (if configured) a consensus.vcf.

The consensus vcf may be used to gain confident calls for a set of replicates, or for AmpliconDS, to get the consensus variants between the vcfs generated from the forward and the reverse-complement probe pools.

If vcfA is a tumor sample and vcfB is a normal sample, the AnotB.vcf will be the somatic variants, and the consensus variants would be the cleaned germline variants.

If you use VennVcf to generate tumor normal (T/N) results, we suggest using Pisces in somatic mode to create vcfA and in germline mode to create vcfB.

VennVcf is depicted in a cartoon below.

Configuration

VennVcf supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

VennVcf shall accept command line arguments as a whitespace-separated list of name and value pairs. For example:

Format: dotnet VennVcf.dll [-options]

Example: dotnet VennVcf.dll -if [A.genome.vcf,B.genome.vcf] -o outdir -consensus myConsensus2.gvcf

VennVcf Argument Table:

Safety Category

  • ✅ Required input parameter
  • ❇️ Optional input parameter
  • ⚠️ Experimental or dev use only.

| Argument | Type | Default | Description | Safe? | Category |
|---|---|---|---|---|---|
| -if | string | none | File path for input vcfs, e.g., [A.genome.vcf,B.genome.vcf] | | 🛄input |
| -o, -out | string | none | output directory | | 🛄output |
| -consensus | string | none | file name of output consensus vcf | ❇️ | 🛄output |
| -Mfirst | bool | false | how we order variants in the output vcf | ❇️ | 🛄output |

VennVcf optionally accepts all of the vcf-writing command line arguments accepted by Pisces. In this way, the new consensus variants are written in the same format as the input variants. (Both VennVcf and Pisces use the same vcf writer code.)

Input

VennVcf requires as input two (g)VCF files. The (g)VCF files are assumed to be sorted. The gVCF file should be formatted uncrushed for a somatic vcf, and crushed for a diploid vcf.

Output

VennVcf outputs a somatic (uncrushed) or diploid (crushed) VCF file, depending on the ploidy configuration, with the same convention and structure as described in the Pisces SDS.

By default, VennVcf shall produce output files in the same directory as input VCF files.

If an output folder is configured, VennVcf shall produce output files in that folder. If the output folder does not exist, VennVcf shall create it.

VennVcf shall output four "venn" VCF files, containing the following variants.

  • AnotB.vcf is all the variants from file A that were not found in file B. For example, if a C→G SNP was found at position 10 in A.gvcf and NOT in B.gvcf, the C→G variant call line from A.gvcf would be written to AnotB.vcf.
  • AandB.vcf is all the variants from file A that were also found in file B. For example, if a C→G SNP was found in both A.gvcf and B.gvcf, the C→G variant call line from A.gvcf would be written to AandB.vcf.
  • BnotA.vcf is all the variants from file B that were not found in file A.
  • BandA.vcf is all the variants from file B that were also found in file A.

Note that while AandB.vcf and BandA.vcf will contain the same variants (with respect to position and alleles called), the two output files will not necessarily have the same depths, frequencies, qualities, or filters as each other. All these values carry over from the original vcf (A for AandB.vcf, and B for BandA.vcf).

VennVcf shall output one consensus (g)VCF file. The consensus.gvcf contains the variants that were found in both files and were then recalled using evidence from both vcfs. Note that the consensus vcf variants will have depth and allele support contributions from both input files, so the variant frequencies and qualities will not be the same as in the original files. Generally, there is an improvement in accuracy.

Design

VennVcf is a very simple program. It opens two reader streams and five writer streams. As it parses the original files, variant by variant, it sorts and matches the incoming variants into the correct bucket, according to the venn diagram, and writes them back out.
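The single-pass walk can be sketched in Python as follows. This is only an illustration, not the VennVcf source (which is .NET code). It assumes both inputs are sorted in the same order and that a variant is identified by chromosome, position, ref and alt. The `Variant` type, its `chrom_index` field and the `venn` function are hypothetical names.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Variant(NamedTuple):
    chrom_index: int  # hypothetical: rank of the chromosome in the shared sort order
    pos: int
    ref: str
    alt: str
    depth: int        # carried along, but not part of the matching key


def key(v: Variant) -> tuple:
    """Matching key: chromosome, position and alleles only."""
    return v[:4]


def venn(a: Iterable[Variant], b: Iterable[Variant]) -> Iterator[Tuple[str, Variant]]:
    """Walk two sorted variant streams once, labelling each record with its bucket."""
    it_a, it_b = iter(a), iter(b)
    va: Optional[Variant] = next(it_a, None)
    vb: Optional[Variant] = next(it_b, None)
    while va is not None or vb is not None:
        if vb is None or (va is not None and key(va) < key(vb)):
            yield "AnotB", va            # only in A
            va = next(it_a, None)
        elif va is None or key(vb) < key(va):
            yield "BnotA", vb            # only in B
            vb = next(it_b, None)
        else:
            yield "AandB", va            # same variant, record taken from A
            yield "BandA", vb            # same variant, record taken from B
            va, vb = next(it_a, None), next(it_b, None)


a = [Variant(1, 10, "C", "G", 100), Variant(1, 20, "A", "T", 50)]
b = [Variant(1, 10, "C", "G", 80), Variant(2, 5, "G", "A", 60)]
buckets = list(venn(a, b))
```

Because each stream is read exactly once, memory use stays constant regardless of file size. That is also why the inputs must be sorted consistently.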

When a variant is found in both pools, the variant is a candidate consensus variant. The support for the reference and alternate alleles is pooled, and the variant is recalled. If the variant is only found in one vcf (A or B), it is flagged with the "PB" pool-bias flag and will not count as a PASS in the consensus vcf. In other regards, the consensus variant must pass variant calling filters as it normally would in Pisces. The consensus building algorithms are additionally stringent, such that the filter flags (e.g., lowQ, lowDP, SB) present in the original vcf will bubble up to the consensus vcf.
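As a rough illustration of pooling, the Python sketch below assumes that pooling means summing the per-pool read counts and recomputing the variant frequency from the totals. The source does not spell out the exact arithmetic, so treat this as one plausible reading. `Support`, `pool` and `frequency` are hypothetical names.

```python
from typing import NamedTuple


class Support(NamedTuple):
    ref_count: int  # reads supporting the reference allele
    alt_count: int  # reads supporting the alternate allele
    depth: int      # total depth at the site


def pool(a: Support, b: Support) -> Support:
    """Assumed pooling: add the counts from both input vcfs."""
    return Support(a.ref_count + b.ref_count,
                   a.alt_count + b.alt_count,
                   a.depth + b.depth)


def frequency(s: Support) -> float:
    """Variant frequency as alt support over depth (0.0 when there is no coverage)."""
    return s.alt_count / s.depth if s.depth else 0.0


pool_a = Support(ref_count=90, alt_count=10, depth=100)   # 10% in pool A
pool_b = Support(ref_count=297, alt_count=3, depth=300)   # 1% in pool B
combined = pool(pool_a, pool_b)                           # 13 alt reads over depth 400
```

Under this reading the deeper pool dominates the combined frequency, so a variant seen strongly in a shallow pool can still end up with a low combined frequency.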

The variant recalling algorithms detailed below could ideally be replaced with a call to the Pisces.Logic.VariantCalling code. However, this remains a TODO.

Candidate variants are combined from both pools, under the following variant calling rules:

A) If combined VF < 1% and less than 2.6% in each pool, call REF. Recalculate Q score with reference model. Replace alt with ".".

B) If combined VF < 1% and more than 2.6% in one pool, call NO CALL + filter for Probe Pool Bias. Recalculate Q score with alt model. This will later fail our freq threshold cutoff.

C) If combined VF >= 1% and < 2.6%, call NO CALL. Recalculate Q score with alt model. This will later fail our freq threshold cutoff.

D) If combined VF >= 2.6%, call VARIANT. If < 1% in one pool, filter for Probe Pool Bias. Recalculate Q score with alt model.

E) If we end up with multiple REF calls for the same locus, combine those VCF lines into one ref call. Ref Q score model.

F) If we end up with multiple NO CALL calls for the same locus but different variants, leave those VCF lines separate. Alt Q score model.

*The "1%" and "2.6%" thresholds mentioned above are configurable. The description above assumes that the original vcfs gave a variant a pass if it had a frequency >= 1%, and that the VennVcf.exe frequency threshold was set to filter recalled consensus variants that had a frequency < 2.6%.
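Rules A through D can be read as a small decision function. The Python sketch below is illustrative only: `classify`, `Recall` and the parameter names are hypothetical, and the thresholds default to the 1% and 2.6% used in the description. The rules do not say what happens when a pool sits at exactly 2.6% with a combined frequency under 1%; this sketch assumes that case falls under rule A.

```python
from typing import NamedTuple


class Recall(NamedTuple):
    call: str        # "REF", "NOCALL" or "VARIANT"
    pool_bias: bool  # True when the Probe Pool Bias filter applies
    q_model: str     # "ref" or "alt": which model the Q score is recalculated with


def classify(vf_a: float, vf_b: float, vf_combined: float,
             low: float = 0.01, high: float = 0.026) -> Recall:
    """Apply rules A-D to the per-pool and combined variant frequencies."""
    if vf_combined >= high:
        # Rule D: a variant, biased if one pool falls under the low threshold.
        return Recall("VARIANT", min(vf_a, vf_b) < low, "alt")
    if vf_combined >= low:
        # Rule C: between the two thresholds.
        return Recall("NOCALL", False, "alt")
    if max(vf_a, vf_b) > high:
        # Rule B: low combined frequency, but one pool is above the high threshold.
        return Recall("NOCALL", True, "alt")
    # Rule A (assumed to include the exactly-at-threshold case): reference call.
    return Recall("REF", False, "ref")


rule_a = classify(0.005, 0.004, 0.0045)
rule_b = classify(0.03, 0.0, 0.0027)
rule_c = classify(0.02, 0.015, 0.017)
rule_d = classify(0.05, 0.04, 0.045)
rule_d_biased = classify(0.06, 0.005, 0.03)
```

Rules E and F act on the resulting calls per locus and are not covered by this sketch.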

Limitations

This tool was developed for a specific product where the chromosome names listed in the vcf follow the pattern "chr1, chr2... chrM". The current implementation is unfortunately limited such that if the chromosome names are "1, 2.." or "frog" or "mydecoy.L" etc., the ordering algorithm will complain. A cheap workaround is to rename your chromosomes to fit the pattern. We will fix this limitation in future releases.
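One way to apply the renaming workaround is a small pre-processing step, sketched below in Python. It is a hypothetical helper, not part of VennVcf. It only prefixes bare numeric names and "X", "Y" or "M" with "chr", rewrites only the CHROM column of data lines, and leaves header lines (including any ##contig entries) untouched. Names such as "frog" are left as they are and would still need a manual mapping.

```python
import re


def to_chr_style(name: str) -> str:
    """Prefix bare contig names such as "1" or "X" with "chr"; leave others unchanged."""
    if re.fullmatch(r"\d+|[XYM]", name):
        return "chr" + name
    return name


def rename_contig(vcf_line: str) -> str:
    """Rewrite the CHROM column of one VCF data line; header lines pass through."""
    if vcf_line.startswith("#"):
        return vcf_line
    chrom, sep, rest = vcf_line.partition("\t")
    return to_chr_style(chrom) + sep + rest


renamed = rename_contig("1\t10\t.\tC\tG")
```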
