VCF Requirements
We expect the incoming VCF files to have the following properties:
- Build version must be specified in VCF metadata (using the `assembly` key in the `contig` header)
- Build version must be `b38` (i.e. `GRCh38`)
- The `CHROM` field must be in the format `chr##`
- All positions must be provided. If a required position is missing, it is assumed that the position could not be called.
- Should only have data for a single sample. If it's a multi-sample VCF file, only the first sample is used.
- The `QUAL` and `FILTER` columns are not read or interpreted. Any data line in the file can be used. It is left to the user to remove data that does not meet quality criteria before using the file in PharmCAT.
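The structural requirements above can be checked before a file is handed to PharmCAT. The following is a minimal sketch, not part of PharmCAT itself: `check_vcf_lines` is a hypothetical helper, the reading of `chr##` as "chr plus a number, X, Y, or M" is an assumption, and so is accepting either `b38` or `GRCh38` as the assembly value. It does not check that all required positions are present.

```python
import re

# Assumption: "chr##" is read here as "chr" plus a number, X, Y, or M.
CHROM_PATTERN = re.compile(r"^chr(\d+|X|Y|M)$")
# Assumption: either spelling of the build is acceptable in the assembly key.
ACCEPTED_ASSEMBLIES = {"b38", "GRCh38"}


def check_vcf_lines(lines):
    """Return a list of problems found in an iterable of VCF text lines."""
    problems = []
    contig_headers = []
    bad_chroms = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##contig="):
            contig_headers.append(line)
        elif line.startswith("#CHROM"):
            # Columns after the ninth (FORMAT) are sample columns.
            sample_count = len(line.split("\t")) - 9
            if sample_count > 1:
                problems.append(f"{sample_count} samples found; only the first is used")
        elif line and not line.startswith("#"):
            chrom = line.split("\t", 1)[0]
            if not CHROM_PATTERN.match(chrom):
                bad_chroms.add(chrom)
    if not contig_headers:
        problems.append("no ##contig header, so the build version is unspecified")
    for header in contig_headers:
        match = re.search(r"assembly=([^,>]+)", header)
        if match is None:
            problems.append(f"contig header has no assembly key: {header}")
        elif match.group(1).strip('"') not in ACCEPTED_ASSEMBLIES:
            problems.append(f"unexpected assembly: {match.group(1)}")
    for chrom in sorted(bad_chroms):
        problems.append(f"CHROM not in chr## format: {chrom}")
    return problems
```

An empty list means none of the checked requirements were violated. To check a file, pass an open file handle, for example `check_vcf_lines(open("sample.vcf"))`, where `sample.vcf` is a placeholder name.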
Variant representation is an ongoing problem in NGS (see, for instance, https://macarthurlab.org/2014/04/28/converting-genetic-variants-to-their-minimal-representation/). The following VCF lines all describe the same variant:
chr7 117548628 . GTTTTTTTA GTTTTTTTA,GTTTTTA . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . GTT G . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G . . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G(T)7A G(T)5A . PASS CFTR:5T GT 0/0
These differences arise from different NGS pipelines, from the way files are created (for instance, when a multi-sample file is split), and from post-processing software tools. Because PharmCAT matches these strings directly against what is in its definition files, a variant written in a different representation can cause matching problems.
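To illustrate how two of the lines above relate, here is a minimal sketch of the suffix-then-prefix trimming idea described in the linked post. It is not PharmCAT's own normalization, and the minimal form is not necessarily the form the definition files use; `minimal_representation` is a hypothetical name.

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Strip identical trailing bases first; the position is unaffected.
    while min(len(ref), len(alt)) > 1 and ref[-1] == alt[-1]:
        ref = ref[:-1]
        alt = alt[:-1]
    # Then strip identical leading bases, shifting the position right.
    while min(len(ref), len(alt)) > 1 and ref[0] == alt[0]:
        ref = ref[1:]
        alt = alt[1:]
        pos += 1
    return pos, ref, alt


# The long form of the CFTR 5T allele reduces to the short GTT/G form.
print(minimal_representation(117548628, "GTTTTTTTA", "GTTTTTA"))
# (117548628, 'GTT', 'G')
```

Both strings name the same two-base deletion, which is why a direct string comparison against a definition file can miss one of them.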
PharmCAT expects to find deletions as `REF="ATCT" ALT="."`, rather than `REF="A" ALT="."`. You will therefore need to replace all deletions in the file that use the single-letter form. If the REF is a single letter, it means no variant was found, so it is safe to replace it with the appropriate nucleotide string.
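The replacement described above could be sketched as follows. This is an illustration under stated assumptions: `EXPECTED_REF` is a hypothetical lookup table with a made-up entry, and the real nucleotide strings have to come from the PharmCAT definition files. `expand_reference_call` is likewise a hypothetical helper, not a PharmCAT function.

```python
# Hypothetical table: (CHROM, POS) -> full REF string expected at a deletion site.
# The entry below is made up for illustration only.
EXPECTED_REF = {("chr1", 1000): "ATCT"}


def expand_reference_call(line, expected_ref=EXPECTED_REF):
    """Rewrite a single-letter reference call at a known deletion site."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    full_ref = expected_ref.get((chrom, pos))
    # Only touch lines with no ALT allele and a one-letter REF that agrees
    # with the first base of the expected string.
    if full_ref and alt == "." and len(ref) == 1 and full_ref.startswith(ref):
        fields[3] = full_ref
    return "\t".join(fields)


line = "chr1\t1000\t.\tA\t.\t.\tPASS\t.\tGT\t0/0"
print(expand_reference_call(line).split("\t")[3])
# ATCT
```

Lines at other positions, or lines that already carry an ALT allele, pass through unchanged.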
PharmCAT expects to find insertions as `REF="A" ALT="ATCT"`. Simple insertions should not need updating.
We are still exploring the diversity of possible values in repeat regions.