
v1.1 update. Adding instructions for how to go from VCF to a GEMINI database ready for Seave import.

VelNZ committed May 7, 2018
1 parent 3c93b24 commit 6da36e836131896c931f065c031cc6a25fed5a98
BIN +44.8 KB (120%) Administration Guide.pdf
@@ -21,9 +21,6 @@

\usepackage{listings} % For code snippets

\usepackage{graphicx} % Required for inserting images
\graphicspath{ {images/} } % Paths to where images are stored

@@ -36,6 +33,9 @@
aboveskip=\baselineskip % Whitespace before the listing

\usepackage{graphicx} % Required for inserting images
\graphicspath{{images/}} % Paths to where images are stored

\setlength{\parindent}{0pt} % Stop paragraph indentation
\setlength{\parskip}{6pt} % Spacing between paragraphs

@@ -44,12 +44,14 @@
\newcommand{\GEMINI}{\href{}{GEMINI} } % Custom command for referring to GEMINI
\newcommand{\titlepageguidename}{Administration Guide} % Custom command for the guide name on the title page

\newcommand{\version}{v1.0} % Document version
\newcommand{\version}{v1.1} % Document version



\renewcommand{\thepage}{\roman{page}} % Roman numerals for page counter

@@ -61,6 +63,7 @@


@@ -85,6 +88,12 @@

\tableofcontents % Print the ToC


\renewcommand{\thepage}{\arabic{page}} % Arabic numerals for page counter
\setcounter{page}{1} % Reset page numbering

@@ -400,6 +409,124 @@ \section{Advanced}


\subsection{Going From VCF to GEMINI Database for Import Into Seave}

Seave manages \GEMINI databases to store and query SNV and indel variants. For a number of reasons, we have opted to decouple variant annotation and \GEMINI database creation from Seave:

\begin{itemize}
\item We wanted to handle WGS-sized data, which means >3~GB VCF files that are cumbersome to upload via a webpage.
\item Variant annotation using VEP and \GEMINI is computationally expensive for WGS data, where we typically use 16- to 32-core servers. We felt it made sense to keep this heavy compute as part of the production genomics pipeline.
\item We have a production analysis pipeline, so adding an analysis module at the end to push data into Seave was straightforward.
\end{itemize}

To make it easier to adopt Seave, this section will go through the process of starting with a VCF file through to creating a \GEMINI database that is ready for import into Seave.

We use VEP and the b37d5 (1000 Genomes + decoy) reference genome in our production pipeline, so the process outlined here follows that setup. Other options may work too, but this is what works well for us.

\subsubsection{VCF Creation}

\textbf{Germline DNA} --- VCF files are created by a GATK Best Practices production pipeline. Our primary germline variant caller is GATK HaplotypeCaller v3.3. We use gVCF mode, so each sample has a g.vcf.gz extension. These are joint-called into cohorts using GATK GenotypeGVCFs. For WGS data we then apply VQSR.

\textbf{Somatic DNA} --- Our primary somatic variant caller is Strelka2. By default, its VCF files are incompatible with \GEMINI: they lack the required GT field and the optional AD, DP and GQ fields, and always name the samples NORMAL and TUMOR. We have a script to post-process Strelka VCF files to make them compatible with \GEMINI and provide fields that are useful for Seave; see scripts/strelka\_add\_to\ in the repository for this guide. To combine multiple Strelka VCFs for multi-sample or cohort analysis of somatic variants, we use GATK CombineVariants.
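As an illustration of one part of that post-processing, the sketch below renames the generic NORMAL/TUMOR sample columns in a VCF header line. The function name is hypothetical; the real script additionally derives the GT, AD, DP and GQ fields, which is the harder part and is omitted here.

```python
# Sketch only: rename Strelka's generic NORMAL/TUMOR sample columns in the
# '#CHROM ...' header line of a VCF. The production post-processing script
# also derives GT and the optional AD/DP/GQ FORMAT fields (not shown here).
def rename_strelka_samples(header_line, normal_name, tumour_name):
    fields = header_line.rstrip("\n").split("\t")
    fields = [normal_name if f == "NORMAL"
              else tumour_name if f == "TUMOR"
              else f
              for f in fields]
    return "\t".join(fields)

line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR"
print(rename_strelka_samples(line, "Blood_A", "Tumour_A"))
```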

\subsubsection{VCF Decomposition and Normalisation}

Multi-allelic variants (e.g. REF: A, ALT: T,C) are output by most variant callers, but downstream tools often don't handle them well. The problem lies in having to deal with conflicting or extra data for each allele. The standard solution is to first decompose variants such that each of the multiple alleles becomes a separate variant sharing the same position and reference allele. The next step is to normalise these decomposed variants to adjust the position and reference allele if needed (usually in the case of SNP+Indel multi-allelic variants).
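To make this concrete, here is a sketch that decomposes a simplified multi-allelic record into per-allele records. It deliberately ignores the INFO/FORMAT bookkeeping (per-allele counts, genotype recoding) that vt handles for real VCFs.

```python
# Sketch only: split a multi-allelic VCF record (e.g. REF: A, ALT: T,C) into
# one record per alternate allele, each sharing the position and REF allele.
# vt decompose additionally adjusts the INFO/FORMAT fields for each allele.
def decompose(record):
    chrom, pos, vid, ref, alt = record.split("\t")[:5]
    return ["\t".join([chrom, pos, vid, ref, allele])
            for allele in alt.split(",")]

for rec in decompose("1\t12345\t.\tA\tT,C"):
    print(rec)  # one bi-allelic record per ALT allele, same position and REF
```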

We use the tool \href{}{Vt} (v0.5) for variant decomposition and normalisation, following the process documented in the \GEMINI documentation \href{}{here}. \GEMINI expects variants to be decomposed and normalised, and will issue a warning if it finds any that aren't, because the annotations \GEMINI adds to variants have themselves been decomposed and normalised so that as many variants as possible can be annotated.

We use the following command for Vt:

\begin{lstlisting}
zcat ~/in/vcfgz/$input_filename | sed 's/ID=AD,Number=./ID=AD,Number=R/' | vt decompose -s - | vt normalize -r genome.fa - | vt sort -o $output_file -

tabix -p vcf $output_file
\end{lstlisting}

\subsubsection{VCF Genomic Impact Annotation}

We use the \href{}{Variant Effect Predictor} (VEP) to annotate variants. Seave works well with variants annotated by VEP v74, v79 and v87; presumably the intermediate and subsequent versions work too.

There is a \href{}{newer version of VEP} which we have not thoroughly tested, so we recommend using the older \href{}{ensembl-tools-vep}. See scripts/Makefile.vep87 in the repository for this guide for directions on installing VEP 87 along with the required plugins to make the most of Seave.

We do not find the upstream\_gene\_variant and downstream\_gene\_variant annotations useful for coding or regulatory variant analyses, so we delete them. See scripts/filter\ in the repository for this guide for how this is done.

To run VEP on a VCF we use the following parameters:
\begin{lstlisting}
# Annotate the VCF with VEP
/vep/ -i ./in/vcfgz/* --species homo_sapiens --vcf -o output.vcf --stats_file "$vcfgz_prefix".vep.html --offline --fork `nproc` --no_progress --canonical --polyphen b --sift b --symbol --numbers --terms so --biotype --total_length --plugin LoF,human_ancestor_fa:false --fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,Feature_type,cDNA_position,CDS_position,Existing_variation,DISTANCE,STRAND,CLIN_SIG,LoF_flags,LoF_filter,LoF,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,CADD_raw,CADD_phred,Reliability_index,HGVSc,HGVSp --fasta /vep/homo_sapiens/87_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz --hgvs --shift_hgvs 1 --dir /vep

# VEP 79 introduced spaces into the INFO field for the SIFT and PolyPhen-2 annotations; fix them
perl -ne 'if ($_ !~ /^#/) { $_ =~ s/ /_/g; print $_; } else { print $_; }' output.vcf > output.clean.vcf
mv output.clean.vcf output.vcf

# Replace upstream_gene_variant/downstream_gene_variant with intergenic_variant
python ./ --vcf output.vcf | vcf-sort | bgzip > output.sorted.vcf.gz
tabix -p vcf output.sorted.vcf.gz
\end{lstlisting}

\subsubsection{Creating a GEMINI Database}

\GEMINI installation is straightforward from the \href{}{documentation}. We have tested Seave extensively with v0.11.*, and more recently with v0.17.*, v0.18.* and v0.19.1.

Databases are created from annotated VCF files with:

\begin{lstlisting}
gemini load -v "$vcfgz_path" --cores `nproc` --skip-gerp-bp -t VEP out.db
\end{lstlisting}

\subsubsection{Automatically Importing Databases or GBS Variants Into Seave}

Our production pipeline runs on \href{}{DNAnexus} and we have created an app there that allows the import of \GEMINI databases and GBS (long) variants into any group on multiple Seave servers.

The logic of the app for importing \GEMINI databases can be distilled to:

\begin{lstlisting}
MD5=$(md5sum "$filename" | cut -d" " -f1)
OUTCOME=$(curl '-s' '-S' '-L' "${SEAVE_URL}")
echo "Seave reported: $OUTCOME"
\end{lstlisting}

The logic of the app for importing GBS (long) variants can be distilled to:

\begin{lstlisting}
MD5=$(md5sum "$filename" | cut -d" " -f1)

all_methods=(CNVnator LUMPY Sequenza ROHmer VarpipeSV Manta CNVkit)
if [[ ! " ${all_methods[@]} " =~ " ${method} " ]]; then
    echo "A valid 'method' must be specified if data is being imported into the GBS."
    exit 1
fi

# VCF files contain the sample names in the header. Many CNV/SV callers
# produce TSV files, so the sample name must also be specified.
sample_name_methods=(CNVnator Sequenza ROHmer CNVkit)

# If the method specified doesn't include a sample name in the file to
# import, make sure the user specified one
if [[ " ${sample_name_methods[@]} " =~ " ${method} " ]]; then
    if [[ "$sample_name" == "" ]]; then
        echo "You must specify a sample name for GBS import methods that don't include a sample name in the file to import"
        exit 1
    fi
fi

OUTCOME=$(curl '-s' '-S' '-L' "${SEAVE_URL}")
echo "Seave reported: $OUTCOME"
\end{lstlisting}

\subsection{Installing Seave from the Source Code on a Private Server or Local Computer}\label{localSeaveInstallation}

Seave is written in PHP and uses MySQL for its database; as such, it is meant to run in a LAMP (Linux, Apache, MySQL, PHP) stack. If you would like to run Seave on your own hardware rather than on AWS, you have two options.
@@ -0,0 +1,63 @@
SHELL=/bin/bash -e -x

# Develop this Makefile in a cloud workstation, as sudo, since this Makefile will be executed as root:
# dx run kccg:/cloud_workstation -imax_session_length=120 --ssh --yes

.PHONY: all bioperl cpan vep vep_caches test clean

all: bioperl cpan vep vep_caches test

# install vep version 87
	cd ~/ensembl-tools-release-87/scripts/variant_effect_predictor/ && echo "y" | perl -a a -d /usr/share/perl5 1>&2
	mkdir /vep
	cp ~/ensembl-tools-release-87/scripts/variant_effect_predictor/*pl /vep 1>&2
	cp ~/ensembl-tools-release-87/scripts/variant_effect_predictor/*vcf /vep 1>&2
	cd ~/ensembl-tools-release-87/scripts/variant_effect_predictor/ && echo "y" | perl -c /vep -d /usr/share/perl5 -a p -g CADD,LoF,ExAC,dbNSFP 1>&2

# install the offline caches for human (GRCh37) and mouse (GRCm38)
	cd ~/ensembl-tools-release-87/scripts/variant_effect_predictor/ && echo "y" | perl -c /vep -d /usr/share/perl5 -a cf -s homo_sapiens -y GRCh37 1>&2
	cd ~/ensembl-tools-release-87/scripts/variant_effect_predictor/ && echo "y" | perl -c /vep -d /usr/share/perl5 -a cf -s mus_musculus -y GRCm38 1>&2

# install bioperl to /usr/share/perl5.
# Adding DEBIAN_FRONTEND=noninteractive prior to apt-get avoids the warning found at under Tips
	DEBIAN_FRONTEND=noninteractive apt-get update
	DEBIAN_FRONTEND=noninteractive apt-get install --yes bioperl

	echo -e "\n\n" | cpan
	# follows the auto installation prompt.
	cpan CPAN
	cpan Test::Base
	cpan YAML
	cpan File::Copy::Recursive
	cpan SUPER
	cpan Archive::Zip
	cpan install Bio::Root::Version
	cpan install Module::Build
	cpan DBI
	cpan JSON

	/vep/ -i /vep/example_GRCh37.vcf --species homo_sapiens --offline --fork 2 \
		-o example_GRCh37.vep.vcf --vcf \
		--fasta /vep/homo_sapiens/87_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz \
		--hgvs --shift_hgvs 1 --dir /vep
	# this will download the 100Mb mouse vcf file; the URL is valid for 2 years from February 2017.
	# It will validate the test below. Without this, the mouse part of the test will fail as the file doesn't exist.
	wget -P /vep/
	/vep/ -i /vep/Mouse_germline_v_M3768_Tumour_A_TEST.strelka.vcf.gz --species mus_musculus --offline --fork 2 \
		-o Mouse_germline_v_M3768_Tumour_A_TEST.strelka.vep.vcf --vcf --fasta /vep/mus_musculus/87_GRCm38/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz \
		--hgvs --shift_hgvs 1 --dir /vep

# To run the cleanup below, type: sudo make clean
	rm -rf /vep
	rm -f
	rm -rf ~/ensembl-tools-release-87
	rm -rf ~/.vep


@@ -0,0 +1,93 @@
import argparse
import gzip
import re
import sys

# Filter records from a VEP annotated VCF file.
# Specifically, this filters 'downstream_gene_variant' and 'upstream_gene_variant' records, and replaces
# them with intergenic_variant if doing so removes all annotations.
# Acknowledgements to Konrad Karczewski (konradjk), accessed 2016-06-26
# Mark Cowley, KCCG.

__author__ = 'marcow'


def dict2orderedValues(d, keys):
    res = []
    for key in keys:
        res.append(d[key])
    return res


def main(args):
    f = if args.vcf.endswith('.gz') else open(args.vcf)
    vep_field_names = None
    header = None
    for line in f:
        line = line.strip()

        if args.ignore:
            print >> sys.stdout, line
            continue

        # Reading header lines to get VEP and individual arrays
        if line.startswith('#'):
            print >> sys.stdout, line
            line = line.lstrip('#')
            if line.find('ID=CSQ') > -1:
                print >> sys.stdout, '##command=' + " ".join(sys.argv)
                vep_field_names = line.split('Format: ')[-1].strip('">').split('|')
                blank_record = [''] * len(vep_field_names)
                blank_record[vep_field_names.index('Consequence')] = 'intergenic_variant'
                blank_record = dict(zip(vep_field_names, blank_record))
            if line.startswith('CHROM'):
                header = line.split()
                header = dict(zip(header, range(len(header))))
            continue

        if vep_field_names is None:
            print >> sys.stderr, "VCF file does not have a VEP header line. Exiting."
            sys.exit(1)
        if header is None:
            print >> sys.stderr, "VCF file does not have a header line (CHROM POS etc.). Exiting."
            sys.exit(1)

        # Pull out annotation info from INFO and ALT fields
        # fields = vcf line, can be accessed using:
        # fields[header['CHROM']] for chromosome,
        # fields[header['ALT']] for alt allele,
        # or samples using sample names, as fields[header['sample1_name']]
        fields = line.split('\t')

        info_field = dict([(x.split('=', 1)) if '=' in x else (x, x) for x in re.split(';(?=\w)', fields[header['INFO']])])

        if 'CSQ' not in info_field:
            print >> sys.stdout, line
            continue

        # annotations = list of dicts, each corresponding to a transcript-allele pair
        # (each dict in annotations contains keys from vep_field_names)
        annotations = [dict(zip(vep_field_names, x.split('|'))) for x in info_field['CSQ'].split(',') if len(vep_field_names) == len(x.split('|'))]

        selected = [x for x in annotations if x['Consequence'] not in
                    ['downstream_gene_variant', 'upstream_gene_variant']]
        if len(selected) == 0:
            selected = [blank_record]
        hits = ','.join(['|'.join(dict2orderedValues(x, vep_field_names)) for x in selected])

        print >> sys.stdout, re.sub(';CSQ=[^\t]*', ';CSQ=' + hits, line)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--vcf', '--input', '-i', help='Input VCF file (from VEP+LoF); may be gzipped', required=True)
    parser.add_argument('--ignore', help='Pass through all variants.', action='store_true', default=False)
    args = parser.parse_args()
    if args.ignore:
        print >> sys.stderr, "Skipping up/downstream filtering"
    main(args)
