VQR 5.2.5 Design Document

tamsen edited this page Dec 13, 2017 · 1 revision

Overview

The Variant Quality Recalibration tool (VQR) is a command line tool used to post-process gVCF files. VQR recalibrates the variant quality scores (Q scores) assigned to variants within a sample, based solely on whether particular types of variants are over-represented in that sample. The tool was developed specifically to facilitate filtering FFPE artifacts in highly degraded samples, but it is not limited to these signature events. VQR self-discovers which types of variants are over-represented, and may be used to filter out a range of system artifacts or upstream sample issues. VQR requires a (g)VCF as input and outputs an adjusted (g)VCF in which the Q scores of the affected variants have been downgraded accordingly.

Pisces VQR works for vcf and genome.vcf input files. It does NOT currently work on crushed/diploid input, because this is not an identified use case for VQR.

Anecdotally, Pisces VQR also seems to work on Strelka vcfs.

Glossary

Pisces Glossary

Configuration

VQR supports configurable parameters so that its behavior can be fine-tuned for the application context.

Format: dotnet VariantQualityRecalibration.dll [-options]

Example: dotnet VariantQualityRecalibration.dll -vcf C:\test.vcf -o C:\OutFolder

SDS ID Specification
SDS-1 VQR shall accept command line arguments as a whitespace-separated list of name and value pairs.
SDS-2 If an invalid command is given, VQR shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3 VQR command line shall be capitalization invariant.
SDS ID Specification
SDS-4 VQR shall require the command line arguments listed below:
Argument Name Type Default value Description
vcf string none File path for input vcf
SDS ID Specification
SDS-5 VQR shall optionally support the command line arguments listed below:
Argument Name Type Default value Description
-locicount integer none (-1) If a vcf is given instead of a gvcf, VQR needs the approximate number of loci to assess the error rates. (When given a gvcf, VQR can determine this itself by counting the lines in the gvcf.)
o string none. By default the output destination will be the original bam folder destination for output bam
log integer 20 in case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases.
b integer 1 reads with map quality less than this value shall be filtered
z double true reads marked as duplicate reads shall be filtered
f integer false reads marked as not proper pairs shall be filtered
q integer false reads pairs with incompatible cigar strings shall be filtered
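
The behavior required by SDS-1 through SDS-4 can be illustrated with a minimal Python sketch. This is not the VQR implementation (VQR is a .NET tool); the function name `parse_args` is hypothetical, and only the argument names `-vcf`, `-o`, and `-locicount` that appear in this document are recognized.

```python
def parse_args(argv):
    """Parse whitespace-separated name/value pairs, ignoring case in names.

    Hypothetical sketch: recognizes only the argument names that
    appear in this document.
    """
    known = {"-vcf", "-o", "-locicount"}
    valid = ", ".join(sorted(known))
    if len(argv) % 2 != 0:
        raise ValueError(
            f"arguments must be name/value pairs; valid arguments: {valid}")
    options = {}
    for name, value in zip(argv[0::2], argv[1::2]):
        key = name.lower()  # SDS-3: capitalization invariant
        if key not in known:
            # SDS-2: report the failed argument, the reason, and the valid list
            raise ValueError(
                f"unknown argument '{name}'; valid arguments: {valid}")
        options[key.lstrip("-")] = value
    if "vcf" not in options:  # SDS-4: the input vcf path is required
        raise ValueError(
            f"missing required argument '-vcf'; valid arguments: {valid}")
    return options
```

For example, `parse_args(["-VCF", "C:\\test.vcf", "-o", "C:\\OutFolder"])` yields a dictionary keyed by `vcf` and `o`, regardless of how the names were capitalized.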

Input

VQR requires one gVCF file as input. The gVCF file should be formatted such that each variant allele has its own line in the file. Pisces output has this format by default.

SDS ID Specification
SDS-6 VQR shall require one gVCF file as input.

Output

VQR outputs one gVCF file, with the same convention and structure as the input file.

SDS ID Specification
SDS-7 VQR shall produce output files in the same directory as input gVCF file.
SDS-8 VQR shall output a gVCF as described in the https://git.illumina.com/Bioinformatics/Pisces5/wiki/Pisces-VCF-Specifications document.
SDS-9 The VQR output file name shall be the input file name with ".recal" appended to the file name.

Design

VQR reads in the gVCF file and generates a "counts" file, in which it records how many variants have been called in each mutation category. There are 12 point mutation categories, as shown below. The counter also tracks insertions, deletions, reference calls, and other categories of variant, but these are not used in the recalibration step.

Mutation Category A C G T
A X A>C A>G A>T
C C>A X C>G C>T
G G>A G>C X G>T
T T>A T>C T>G X
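
The counting step can be sketched in Python as follows. This is an illustrative sketch rather than the VQR code: the function names and the category labels `Reference`, `Insertion`, `Deletion`, and `Other` are hypothetical, and it assumes tab-separated VCF records with REF in the fourth column, ALT in the fifth, and one allele per line.

```python
from collections import Counter

BASES = "ACGT"

def categorize(ref, alt):
    """Assign one VCF allele to a mutation category (hypothetical labels)."""
    ref, alt = ref.upper(), alt.upper()
    if alt in (".", ref):
        return "Reference"
    if len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES:
        return f"{ref}>{alt}"  # one of the 12 point mutation categories
    if len(ref) < len(alt):
        return "Insertion"
    if len(ref) > len(alt):
        return "Deletion"
    return "Other"

def count_categories(vcf_lines):
    """Count calls per category, skipping header and blank lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[categorize(fields[3], fields[4])] += 1
    return counts
```

Only the 12 point mutation keys (for example `C>T`) would feed the recalibration step; the remaining categories are tallied but otherwise ignored.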

Once the counts are known, the recalibration step begins. The average mutation rate is calculated for each category, and the variance across categories is also calculated. Each category whose rate exceeds the mean plus Z times the typical standard deviation is considered over-represented. The value of Z is configurable. Young samples typically have a very white (flat) profile. However, older samples with FFPE artifacts, oxidative damage, or characteristic sequencing artifacts might have a characteristic colored profile, where certain mutations are highly over-represented in the sample. These distributions generally look the same whether the observations are constrained to purely false positives (which are typically not known a priori) or include all called variants.
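
The over-representation test can be sketched as below. This is a minimal sketch under stated assumptions, not the VQR implementation: the mutation rate is taken to be the category count divided by the number of loci, the "typical standard deviation" is approximated by the population standard deviation across the 12 categories, and the function name and the value of `z` used in the example are hypothetical.

```python
import statistics

def over_represented(point_counts, num_loci, z):
    """Return the categories whose rate exceeds mean + z * standard deviation.

    point_counts maps each point mutation category (e.g. "C>T") to its count.
    Assumption: rate = count / num_loci, spread = population std deviation.
    """
    rates = {cat: n / num_loci for cat, n in point_counts.items()}
    mean = statistics.mean(rates.values())
    spread = statistics.pstdev(rates.values())
    threshold = mean + z * spread
    return {cat for cat, rate in rates.items() if rate > threshold}
```

With 11 categories at 10 calls each and `C>T` at 500 calls over 100,000 loci, calling `over_represented(counts, 100_000, z=2.0)` flags only `C>T`. With a single outlier among 12 categories, the outlier can sit at most about 3.2 population standard deviations above the mean, so very large values of `z` would never flag anything under this particular formulation.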

For samples with a balanced profile, no recalibration is performed. For samples with a highly colored noise profile, the variant Q scores are recalibrated in the following manner: the 1% noise model used by Pisces, which assumes the same noise rate for all categories of mutations, is replaced with a noise model derived from the sample-specific noise profile. Specifically, the 1% noise assumption is raised to the observed mutation rate for the over-represented categories of mutations. In this way, for an over-represented mutation to get a passing Q score, it has to distinguish itself from the baseline over-represented state of the sample. This allows for better resolution in variant/noise discrimination.
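
The effect of raising the noise assumption can be illustrated with a small Python sketch. It rests on assumptions not stated in this document: the Q score is modeled as the Phred-scaled Poisson tail probability of seeing at least the observed number of alternate reads from noise alone, Q is capped at 100, and all function and parameter names are hypothetical. The actual Pisces and VQR computations may differ.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):     # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(1.0 - cdf, 0.0)

def q_score(alt_reads, depth, noise_rate, max_q=100.0):
    """Phred-scaled chance that noise alone explains the alternate reads."""
    p = poisson_tail(alt_reads, depth * noise_rate)
    if p <= 0.0:
        return max_q
    return min(max_q, -10.0 * math.log10(p))

def recalibrate(category, alt_reads, depth, observed_rates, flagged,
                baseline=0.01):
    """Recompute Q, raising the noise rate only for flagged categories."""
    noise = baseline
    if category in flagged:
        # The noise assumption is raised, never lowered.
        noise = max(baseline, observed_rates[category])
    return q_score(alt_reads, depth, noise)
```

Under these assumptions, a call with 30 alternate reads at depth 1000 scores well above Q20 against the 1% baseline, but drops below Q20 if its category is flagged with a hypothetical observed rate of 3%. Calls in categories that are not flagged keep the baseline score.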

Results

This technique has reduced the false positive (FP) count for a range of FFPE samples, from 2 to 15 years old. For some samples, the FP count goes from several hundred calls to fewer than 10. However, not all samples see an improvement, possibly because other error modes are the source of their false positives.

Limitations

This technique only reduces the FPs that follow the particular pattern the algorithm is looking for, and is currently restricted to point mutations. This technique is adaptable and extensible for future work.
