Cristina Yenyxe Gonzalez Garcia edited this page Jan 11, 2017 · 10 revisions

The main goal of the EVA pipeline is to store variants read from VCF files, calculate population statistics for them, and annotate the variants using Ensembl's Variant Effect Predictor (VEP).

In order to do so, the following jobs are available:

  • Initializing the database (work in progress)
  • Processing a genotyped VCF
  • Processing an aggregated VCF
  • Re-annotating variants (work in progress)
  • Re-calculating statistics (work in progress)

Please note this section is a work in progress and more details about the structure of each job will be added in the future.

Initializing the database (work in progress)

It is not yet necessary to run this job in order to have a ready-to-use database. In the future, some preparation will be required for improved efficiency when querying the database.

At a minimum, this preparation will involve loading a feature set for the genome assembly that the variation data is based on, so that gene and transcript names can be translated to genomic coordinates.

Processing a genotyped VCF

One of the sources of data for the EVA ETL pipeline is a Variant Call Format (VCF) file listing the sample genotypes for each variant, like the following:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.17                GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

First, the alleles from the REF and ALT columns are normalized so that different representations of a variant across multiple VCFs are recognized as the same variant. More information about the normalization process can be found in this presentation. Once normalized, variants are stored in the database.
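The presentation mentioned above describes the rules the pipeline actually applies. As a rough illustration only, one common way to normalize is to trim the bases that REF and ALT share and shift the start position to match. The sketch below assumes that approach (shared trailing bases removed first, then shared leading bases, with empty alleles allowed); `normalize_alleles` is a hypothetical helper, not a function from the EVA code base.

```python
def normalize_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT; return (start, ref, alt).

    Assumption: the common suffix is removed first, then the common
    prefix, and an allele may end up empty (e.g. for a deletion).
    """
    # Remove the common suffix; the start position is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove the common prefix, shifting the start position right.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two spellings of the same two-base deletion collapse to one form.
print(normalize_alleles(100, "ATC", "A"))   # (101, 'TC', '')
print(normalize_alleles(99, "GATC", "GA"))  # (101, 'TC', '')
```

For a multi-allelic record such as `GTC` / `G,GTCT`, each ALT allele would be normalized against REF independently under this assumption.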

Population statistics and annotation from Ensembl VEP can optionally be generated and stored along with the core variant information. These processes are shared across multiple jobs (VCF processing, re-annotation, etc.) and will be documented separately.
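To give an idea of the kind of statistic involved, allele counts and frequencies can be derived directly from the GT fields of a genotyped record. This is a minimal sketch rather than the pipeline's implementation: `allele_stats` is a hypothetical name, and the code assumes GT is the first FORMAT key, as the VCF specification requires when GT is present.

```python
import re
from collections import Counter


def allele_stats(alt_column, sample_columns):
    """Count alleles in the GT fields; return (AN, AC per ALT, AF per ALT)."""
    n_alts = 0 if alt_column == "." else len(alt_column.split(","))
    counts = Counter()
    for sample in sample_columns:
        genotype = sample.split(":")[0]  # GT is assumed to come first
        for allele in re.split(r"[|/]", genotype):
            if allele != ".":  # skip missing calls
                counts[int(allele)] += 1
    an = sum(counts.values())                       # total called alleles
    ac = [counts[i] for i in range(1, n_alts + 1)]  # one count per ALT
    af = [c / an if an else 0.0 for c in ac]        # one frequency per ALT
    return an, ac, af


# Sample columns of the rs6040355 record from the example above.
an, ac, af = allele_stats("G,T", ["1|2:21:6:23,27", "2|1:2:0:18,2", "2/2:35:4"])
print(an, ac, [round(f, 3) for f in af])  # 6 [2, 4] [0.333, 0.667]
```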

Diagram: Genotyped VCF job

Processing an aggregated VCF

Another source of data for the EVA ETL pipeline is a Variant Call Format (VCF) file that does not list sample genotypes. Because statistics such as allele frequencies cannot then be calculated from genotypes, they are often provided in the INFO column, as in the following example:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;AC=3;AN=6;DB;H2
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.17;AC=1;AN=6
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AC=2,4;AN=6;AA=T;DB
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AF=0.0;AC=0;AN=6;AA=T
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AF=0.5,0.167;AC=3,1;AN=6;AA=G

First, the alleles from the REF and ALT columns are normalized so that different representations of a variant across multiple VCFs are recognized as the same variant. More information about the normalization process can be found in this presentation. Population statistics can optionally be read from the INFO column. Once both steps are finished, the variants are stored in the database.
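Reading the statistics comes down to parsing the semicolon-separated `key=value` entries of the INFO column. The following is a minimal sketch under stated assumptions: `parse_info` and `aggregated_stats` are hypothetical helpers, only the AN, AC and AF keys declared in the header above are extracted, and missing values written as `.` are not handled.

```python
def parse_info(info_column):
    """Turn 'NS=3;AF=0.5;DB' into a dict; flags such as DB map to True."""
    info = {}
    for entry in info_column.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def aggregated_stats(info_column):
    """Extract AN, AC and AF; keys that are absent are returned as None."""
    info = parse_info(info_column)
    an = int(info["AN"]) if "AN" in info else None
    # AC and AF are declared with Number=A: one value per ALT allele.
    ac = [int(x) for x in info["AC"].split(",")] if "AC" in info else None
    af = [float(x) for x in info["AF"].split(",")] if "AF" in info else None
    return {"AN": an, "AC": ac, "AF": af}


print(aggregated_stats("NS=2;DP=10;AF=0.333,0.667;AC=2,4;AN=6;AA=T;DB"))
# {'AN': 6, 'AC': [2, 4], 'AF': [0.333, 0.667]}
```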

Annotation from Ensembl VEP can optionally be generated and stored along with the core variant information. This process is shared across multiple jobs (VCF processing, re-annotation, etc.) and will be documented separately.

Diagram: Aggregated VCF job

Re-annotating variants (work in progress)

After storing a set of annotated variants, it may be necessary to annotate them again, for reasons such as the following:

  1. A new, improved version of Ensembl VEP has been released
  2. The wrong VEP cache was selected for the original annotation

This job will run VEP over all the variants already loaded into a database and store the new annotations. In case 1, they will be added as a new set, whereas in case 2 the existing ones will be overwritten.
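The difference between the two cases can be pictured as writing under a new key versus an existing one. The sketch below is purely illustrative, since the job is still a work in progress: the variant layout, the `annotations` field, the set identifiers and the `annotate` callable (a stand-in for a VEP run) are all assumptions.

```python
def reannotate(variants, annotate, set_id):
    """Run `annotate` on every stored variant and save the result under `set_id`.

    A `set_id` that is not yet present adds a new annotation set (case 1);
    an existing `set_id` overwrites the old set (case 2).
    """
    for variant in variants:
        variant.setdefault("annotations", {})[set_id] = annotate(variant)
    return variants


variants = [{"chr": "20", "start": 14370, "ref": "G", "alt": "A",
             "annotations": {"vep_old": ["old result"]}}]

# Case 1: a newer VEP release, stored next to the existing set.
reannotate(variants, lambda v: ["new result"], "vep_new")
# Case 2: the old set was produced with the wrong cache, so replace it.
reannotate(variants, lambda v: ["corrected result"], "vep_old")
print(sorted(variants[0]["annotations"]))  # ['vep_new', 'vep_old']
```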

Re-calculating statistics (work in progress)

After storing the population statistics associated with a set of variants, it may be necessary to re-calculate them, for reasons such as the following:

  1. A new population grouping has been added
  2. The set of samples describing an existing population has changed

This job will calculate the population statistics for all the variants already loaded into a database and store the new ones. In case 1, they will be added as a new set, whereas in case 2 the existing ones will be overwritten.
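Both cases depend on which samples belong to which population, so the statistics have to be recomputed whenever that mapping changes. As a hedged sketch, assuming genotypes are available per sample, allele frequencies per population could be derived as follows; `population_frequencies` and the population names are hypothetical, and the sample names come from the genotyped example earlier on this page.

```python
from collections import Counter, defaultdict


def population_frequencies(genotypes, populations):
    """Return population -> {allele index: frequency}.

    genotypes:   sample name -> GT string, e.g. "1|0"
    populations: sample name -> population name
    """
    counts = defaultdict(Counter)
    for sample, gt in genotypes.items():
        population = populations.get(sample)
        if population is None:
            continue  # sample is not part of this grouping
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                counts[population][int(allele)] += 1
    return {
        population: {a: n / sum(c.values()) for a, n in c.items()}
        for population, c in counts.items()
    }


genotypes = {"NA00001": "0|0", "NA00002": "1|0", "NA00003": "1/1"}
grouping = {"NA00001": "POP_A", "NA00002": "POP_A", "NA00003": "POP_B"}
print(population_frequencies(genotypes, grouping))
# {'POP_A': {0: 0.75, 1: 0.25}, 'POP_B': {1: 1.0}}
```

Storing the result under a new grouping identifier would correspond to case 1, and storing it under an existing identifier to case 2.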