VQR 5.2.5 Design Document

Overview

The Variant Quality Recalibration tool (VQR) is a command line tool used to post-process gVCF files. VQR recalibrates the variant quality scores (Q scores) assigned to variants within a sample, based on whether particular types of variant are over-represented in that sample. The tool was developed to facilitate the filtering of FFPE artifacts in highly degraded samples, but is not limited to these signature events. VQR self-discovers which types of variants are over-represented, and may be used to filter out a range of system artifacts or upstream sample issues. VQR requires a (g)VCF as input and outputs an adjusted (g)VCF in which the Q scores of the affected variants have been downgraded accordingly.

Pisces VQR works for vcf and genome.vcf input files. It does NOT currently work on crushed/diploid input, because this is not an identified use case for VQR.

Anecdotally, Pisces VQR also seems to work on Strelka vcfs.

Glossary

Pisces Glossary

Configuration

VQR supports configuration of parameters so that its behavior can be fine tuned depending on the application context.

Format: dotnet VariantQualityRecalibration.dll [-options]

Example: dotnet VariantQualityRecalibration.dll -vcf C:\test.vcf -o C:\OutFolder

SDS ID	Specification
SDS-1	VQR shall accept command line arguments as a whitespace-separated list of name and value pairs.
SDS-2	If an invalid command is given, VQR shall exit with an error message describing the failed argument, the reason for failure, and the list of valid commands.
SDS-3	VQR command line shall be capitalization invariant.
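The three requirements above can be illustrated with a minimal Python sketch. It is not the VQR implementation (VQR is a .NET tool); the function name `parse_args` and the error wording are assumptions made for illustration, and the option names passed in are taken from the argument tables in this document.

```python
def parse_args(argv, valid_names):
    """Parse a whitespace-split command line such as ['-vcf', 'a.vcf', '-o', 'out'].

    Hypothetical helper: option names are matched case-insensitively (SDS-3),
    and any failure reports the offending argument, the reason, and the list
    of valid options (SDS-2).
    """
    valid = {name.lower() for name in valid_names}
    usage = "Valid options: " + ", ".join("-" + n for n in sorted(valid))

    # SDS-1: arguments arrive as name/value pairs, so the token count is even.
    if len(argv) % 2 != 0:
        raise ValueError(
            f"Expected name and value pairs, but '{argv[-1]}' has no partner. {usage}"
        )

    parsed = {}
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("-"):
            raise ValueError(
                f"Invalid argument '{flag}': option names must start with '-'. {usage}"
            )
        name = flag.lstrip("-").lower()
        if name not in valid:
            raise ValueError(f"Invalid argument '{flag}': unknown option. {usage}")
        parsed[name] = value
    return parsed


if __name__ == "__main__":
    options = parse_args(
        ["-VCF", r"C:\test.vcf", "-o", r"C:\OutFolder"],
        ["vcf", "o", "locicount"],
    )
    print(options)
```

Normalizing the option name once, at parse time, keeps the rest of the program free of case handling.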

SDS ID	Specification
SDS-4	VQR shall require the command line arguments listed below:

Argument Name	Type	Default value	Description
vcf	string	none	File path for input vcf

SDS ID	Specification
SDS-5	VQR shall optionally support the command line arguments listed below:

Argument Name	Type	Default value	Description
locicount	integer	none (-1)	If a vcf is given instead of a gvcf, VQR needs the approximate number of loci to assess the error rates. (When given a gvcf, VQR can determine this itself by counting the lines in the gvcf.)
o	string	none. By default the output destination will be the original vcf folder	Destination folder for the output vcf
log	integer	20	in case of a stitching conflict, bases with qscore less than this value will automatically be disregarded in favor of the mate's bases.
b	integer	1	reads with map quality less than this value shall be filtered
z	double	true	reads marked as duplicate reads shall be filtered
f	integer	false	reads marked as not proper pairs shall be filtered
q	integer	false	reads pairs with incompatible cigar strings shall be filtered
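The locicount fallback described in the table can be sketched as follows. This is an illustrative Python sketch, not VQR code: the function name `resolve_loci_count` is hypothetical, and treating every non-header, non-blank line as one locus is an assumption based on the phrase "counting the lines in the gvcf".

```python
def resolve_loci_count(vcf_lines, locicount=-1):
    """Return the approximate number of loci used to assess error rates.

    A positive locicount (supplied by the user for a plain vcf) is used as-is.
    Otherwise the data lines are counted, which is only meaningful for a gvcf
    that has a line for every position.
    """
    if locicount > 0:
        return locicount
    # Header lines start with '#'; blank lines are ignored.
    return sum(1 for line in vcf_lines if line.strip() and not line.startswith("#"))


if __name__ == "__main__":
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t10\t.\tA\t.\t100\tPASS\t.",
        "chr1\t11\t.\tC\tT\t50\tPASS\t.",
    ]
    print(resolve_loci_count(lines))          # counted from the file
    print(resolve_loci_count(lines, 250000))  # user-supplied value wins
```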

Input

VQR requires one gVCF file as input. The gVCF file should be formatted so that each variant allele has its own line. Pisces output has this format by default.

SDS ID	Specification
SDS-6	VQR shall require one gVCF file as input.

Output

VQR outputs one gVCF file, with the same convention and structure as the input file.

SDS ID	Specification
SDS-7	VQR shall produce output files in the same directory as the input gVCF file.
SDS-8	VQR shall output a gVCF as described in the https://git.illumina.com/Bioinformatics/Pisces5/wiki/Pisces-VCF-Specifications document.
SDS-9	The VQR output file name shall be the input file name with ".recal" appended to the file name.
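The naming rule in SDS-7 and SDS-9 can be sketched in Python as below. The function name `recal_output_path` is hypothetical, and two points are assumptions: that ".recal" is appended after the full file name (extension included), and that an explicit output folder, when given, replaces the input folder. POSIX-style paths are used only to keep the example platform independent.

```python
from pathlib import PurePosixPath


def recal_output_path(input_vcf, out_dir=None):
    """Build the output path: same folder as the input unless out_dir is given,
    with '.recal' appended to the input file name (assumed literal reading)."""
    src = PurePosixPath(input_vcf)
    folder = PurePosixPath(out_dir) if out_dir else src.parent
    return str(folder / (src.name + ".recal"))


if __name__ == "__main__":
    print(recal_output_path("/data/run1/sample.genome.vcf"))
    print(recal_output_path("/data/run1/sample.genome.vcf", out_dir="/results"))
```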

Design

VQR reads in the gVCF file and generates a "counts" file, in which it records how many variants have been called in each mutation category. There are 12 point mutation categories, as shown below. The counter also tracks insertions, deletions, reference calls, and other categories of variant, but these are not used in the recalibration step.

Mutation Category	A	C	G	T
A	X	A>C	A>G	A>T
C	C>A	X	C>G	C>T
G	G>A	G>C	X	G>T
T	T>A	T>C	T>G	X
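The counting step can be sketched in Python as follows. This is a minimal illustration rather than the VQR implementation: the function names and the labels used for the non-point-mutation categories ("Reference", "Insertion", "Deletion", "Other") are hypothetical, and the sketch assumes one allele per line with REF and ALT in the fourth and fifth tab-separated columns, as in standard VCF.

```python
from collections import Counter

BASES = "ACGT"


def categorize(ref, alt):
    """Map a REF/ALT pair to one of the 12 point mutation categories,
    or to a coarse label for everything else."""
    ref, alt = ref.upper(), alt.upper()
    if alt == "." or alt == ref:
        return "Reference"
    if len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES:
        return f"{ref}>{alt}"
    if len(ref) < len(alt):
        return "Insertion"
    if len(ref) > len(alt):
        return "Deletion"
    return "Other"


def count_categories(vcf_lines):
    """Tally how many data lines fall into each category."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        counts[categorize(fields[3], fields[4])] += 1
    return counts


if __name__ == "__main__":
    lines = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t10\t.\tC\tT\t50\tPASS\t.",
        "chr1\t11\t.\tA\t.\t100\tPASS\t.",
        "chr1\t12\t.\tC\tT\t40\tPASS\t.",
        "chr1\t13\t.\tAT\tA\t60\tPASS\t.",
    ]
    for category, n in sorted(count_categories(lines).items()):
        print(category, n)
```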

Once the counts are known, the recalibration step begins. The mutation rate is calculated for each category, along with the mean and variance of those rates across categories. Each category whose rate exceeds the mean plus Z times the typical standard deviation is considered over-represented. The value of Z is configurable. Young samples typically have a very white (balanced) profile. However, older samples with FFPE artifacts, oxidative damage, or characteristic sequencing artifacts might have a characteristic colored profile, in which certain mutations are highly over-represented. These distributions generally look the same whether the observations are constrained to pure false positives (which are typically not known a priori) or include all called variants.
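The thresholding rule can be sketched in Python as below. This is one plausible reading, not the VQR implementation: the function name is hypothetical, the rate is assumed to be the category count divided by the number of loci, the "typical standard deviation" is assumed to be the population standard deviation across the 12 point mutation categories, and the Z value of 2.0 in the example is illustrative only.

```python
import statistics


def over_represented_categories(counts, loci_count, z):
    """Return the categories whose rate exceeds mean + z * standard deviation.

    counts: dict mapping each point mutation category (e.g. 'C>T') to its count.
    loci_count: approximate number of loci examined.
    """
    rates = {category: n / loci_count for category, n in counts.items()}
    values = list(rates.values())
    threshold = statistics.mean(values) + z * statistics.pstdev(values)
    return sorted(category for category, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    categories = [f"{r}>{a}" for r in "ACGT" for a in "ACGT" if r != a]

    # A balanced ("white") profile: nothing stands out.
    balanced = {category: 10 for category in categories}
    print(over_represented_categories(balanced, loci_count=100_000, z=2.0))

    # A colored profile: one category dominates.
    colored = dict(balanced)
    colored["C>T"] = 500
    print(over_represented_categories(colored, loci_count=100_000, z=2.0))
```

With only 12 categories, a single extreme category inflates the standard deviation it is measured against, so very large Z values can never be exceeded.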

For samples with a balanced profile, no recalibration is performed. For samples with a highly colored noise profile, the variant Q scores are recalibrated in the following manner: the 1% noise model used by Pisces, which assumes the same noise rate for all categories of mutations, is replaced with a noise model derived from the sample-specific noise profile. Specifically, the 1% noise assumption is raised to the observed mutation rate for the over-represented categories of mutations. In this way, for an over-represented mutation to get a passing Q score, it has to distinguish itself from the baseline over-represented state of the sample. This allows for better resolution in variant/noise discrimination.
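The effect of raising the noise assumption can be sketched in Python as below. Several things here are assumptions rather than facts from this document: the Q score is modeled as the Phred-scaled Poisson probability of seeing at least the observed number of variant reads from noise alone, the cap of 100 is arbitrary, the function names are hypothetical, and how VQR maps the observed mutation rate onto the noise parameter is not specified here. Only the 1% baseline and the "raise it for over-represented categories" rule come from the text.

```python
import math

BASELINE_NOISE = 0.01  # the 1% noise assumption named in the text
MAX_Q = 100            # hypothetical cap on reported Q scores


def category_noise_rate(category, observed_rates, over_represented):
    """Use the baseline unless the category is over-represented, in which case
    the noise assumption is raised to the observed rate (never lowered)."""
    if category in over_represented:
        return max(BASELINE_NOISE, observed_rates[category])
    return BASELINE_NOISE


def poisson_qscore(alt_reads, depth, noise_rate):
    """Phred-scaled probability of at least alt_reads noise reads at this depth.

    Assumed model for illustration; the direct summation is suitable for
    modest depths only.
    """
    lam = depth * noise_rate
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(alt_reads):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    p_noise = max(1.0 - cdf, 1e-10)  # floor keeps log10 finite
    return min(MAX_Q, int(round(-10 * math.log10(p_noise))))


if __name__ == "__main__":
    observed = {"C>T": 0.05}  # illustrative value only
    flagged = {"C>T"}
    for category in ("C>T", "A>G"):
        rate = category_noise_rate(category, observed, flagged)
        print(category, rate, poisson_qscore(alt_reads=10, depth=200, noise_rate=rate))
```

The same evidence (10 variant reads at depth 200) scores lower in the over-represented category, because it no longer stands out from that category's elevated noise floor.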

Results

This technique has reduced the false positive (FP) count for a range of FFPE samples from 2 to 15 years old. For some samples, the FP count goes from several hundred calls to fewer than 10. However, not all samples see an improvement, possibly because other error modes are the source of their false positives.

Limitations

This technique only reduces the FPs that follow the particular pattern the algorithm is looking for, and is currently restricted to point mutations. This technique is adaptable and extensible for future work.
