VCF Requirements

lester-pharmgkb edited this page Oct 21, 2016 · 13 revisions

We expect the incoming VCF files to have the following properties:

  • Build version must be specified in VCF metadata (using the assembly key in the contig header)
  • Build version must be b38 (i.e. GRCh38)
  • The CHROM field must be in the format chr## (for example, chr7)
  • Every required position must be present in the file. If a required position is missing, PharmCAT assumes that the position could not be called.
  • The file should contain data for a single sample. If it is a multi-sample VCF file, only the first sample is used.
  • The QUAL and FILTER columns are not read or interpreted, so any data line in the file may be used. It is up to the user to remove data that does not meet quality criteria before running PharmCAT.
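
The requirements above can be checked before running PharmCAT. The following is a minimal sketch, not part of PharmCAT itself: the function name `check_vcf_lines`, the accepted assembly spellings, and the allowance for chrX, chrY and chrM are all assumptions made for illustration.

```python
import re

# Assumption: "chr##" is read as chr + 1-2 digits; X, Y and M are also allowed here.
CHROM_PATTERN = re.compile(r"^chr(\d{1,2}|X|Y|M)$")
# Assumption: spellings of build 38 that might appear in the assembly key.
ACCEPTED_ASSEMBLIES = {"b38", "grch38"}


def check_vcf_lines(lines):
    """Return a list of human-readable problems found in the given VCF lines."""
    problems = []
    assemblies = set()
    sample_count = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##contig="):
            # Pull the assembly key out of e.g. ##contig=<ID=chr7,assembly=GRCh38>
            match = re.search(r"assembly=([^,>]+)", line)
            if match:
                assemblies.add(match.group(1).strip('"').lower())
        elif line.startswith("#CHROM"):
            # The first 9 columns are fixed; everything after is a sample.
            sample_count = max(len(line.split("\t")) - 9, 0)
        elif line and not line.startswith("#"):
            chrom = line.split("\t", 1)[0]
            if not CHROM_PATTERN.match(chrom):
                problems.append(f"unexpected CHROM value: {chrom}")
    if not assemblies:
        problems.append("no assembly key found in any contig header")
    elif not assemblies <= ACCEPTED_ASSEMBLIES:
        problems.append(f"unexpected assembly: {sorted(assemblies)}")
    if sample_count is not None and sample_count > 1:
        problems.append(f"{sample_count} samples found; only the first is used")
    return problems
```

An empty list means none of these checks failed. The sketch does not verify that every required position is present, since that depends on the PharmCAT definition files.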

Standardizing variant representation

Variant representation is an ongoing problem in NGS (see, for instance, https://macarthurlab.org/2014/04/28/converting-genetic-variants-to-their-minimal-representation/). The following VCF lines all describe the same variant:

chr7    117548628    .    GTTTTTTTA    GTTTTTTTA,GTTTTTA    .    PASS    CFTR:Reference,CFTR:5T    GT    0/0
chr7    117548628    .    GTT     G    .    PASS    CFTR:Reference,CFTR:5T    GT    0/0
chr7    117548628    .    G    .    .    PASS    CFTR:Reference,CFTR:5T    GT    0/0
chr7    117548628    .    G(T)7A    G(T)5A    .    PASS    CFTR:5T    GT    0/0

These differences arise from the NGS pipeline used, the way files are created (for instance, when a multi-sample file is split), and post-processing software tools. Because PharmCAT matches these strings directly against what is in its definition files, a variant written in a different representation may fail to match.
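
One way to see that two of the records above are equivalent is to reduce a REF/ALT pair to its minimal representation, the approach described in the post linked earlier. The sketch below is an illustration of that trimming idea, not PharmCAT code. The function name `minimal_representation` is hypothetical, and it handles only a single ALT made of explicit bases (not `.`, comma-separated ALT lists, or notation such as `G(T)7A`).

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Strip identical trailing bases first.
    while min(len(ref), len(alt)) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then strip identical leading bases, advancing the position each time.
    while min(len(ref), len(alt)) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The long form from the first example line reduces to the form in the second.
print(minimal_representation(117548628, "GTTTTTTTA", "GTTTTTA"))
# (117548628, 'GTT', 'G')
```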

Deletions

PharmCAT expects to find deletions as REF="ATCT" ALT=".", rather than REF="A" ALT=".". You will therefore need to rewrite the REF value at every deletion position in the file. If the REF is a single letter, it means no variant was found at that position, so it is safe to replace it with the appropriate nucleotide string.
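
The replacement described above could be scripted along these lines. This is a rough sketch under stated assumptions: `expand_deletion_ref` and the `expected_refs` mapping are hypothetical, and the mapping of (chromosome, position) to the full REF string would have to be built by the user from the PharmCAT definition files.

```python
def expand_deletion_ref(line, expected_refs):
    """Rewrite a single-base, no-variant REF to the full expected REF string.

    expected_refs is a hypothetical dict: (chrom, pos) -> full REF string.
    Header lines and non-matching data lines are returned unchanged
    (minus any trailing newline).
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    full_ref = expected_refs.get((chrom, pos))
    # Only touch no-variant records (ALT ".") whose single REF base agrees
    # with the first base of the expected string.
    if full_ref and len(ref) == 1 and alt == "." and full_ref.startswith(ref):
        fields[3] = full_ref
    return "\t".join(fields)


# Illustrative mapping only; real values come from the definition files.
expected = {("chr7", 117548628): "GTT"}
record = "chr7\t117548628\t.\tG\t.\t.\tPASS\t.\tGT\t0/0"
print(expand_deletion_ref(record, expected))
```

Records that already carry a variant call are left alone, since the text above only describes the single-letter REF case as safe to replace.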

Insertions

PharmCAT expects to find insertions as REF="A" ALT="ATCT". Simple insertions should not need updating.

Repeats

We are still exploring the diversity of possible values in repeat regions.