
# Welcome to the VariantDB_Challenge wiki!

Thank you for your interest in this project! This is how we think this project will work. Before you begin, please ensure you have read the Rules document.

## Challenge #1. The multiple source integration problem

The crux of this problem is that multiple pieces of metadata are often required to interpret the genomic data. In database-speak, this is the equivalent of using JOINs across multiple data sets. Using JOINs is a critical element, since it is infeasible to replicate all the data into the same collection/table given the size and amount of data required. This challenge is to figure out the most efficient query structure and data storage architecture to address the computationally expensive JOIN problem in the context of biological discovery.
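To make the JOIN problem concrete, here is a minimal sketch using SQLite from Python. The table and column names (`genotypes`, `sample_pop`) are purely illustrative and are not prescribed by the challenge; the point is simply that every population-level question requires joining genotype calls against sample metadata stored elsewhere.

```python
# Minimal sketch of the metadata JOIN problem. Table/column names are
# hypothetical -- any real schema for this challenge will differ.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE genotypes (variant_id TEXT, sample TEXT, gt TEXT)")
cur.execute("CREATE TABLE sample_pop (sample TEXT, population TEXT)")
cur.executemany("INSERT INTO genotypes VALUES (?,?,?)",
                [("chr1:12345:A:G", "NA12878", "0/1"),
                 ("chr1:12345:A:G", "NA12891", "0/0")])
cur.executemany("INSERT INTO sample_pop VALUES (?,?)",
                [("NA12878", "CEU"), ("NA12891", "CEU")])

# Count heterozygous calls per population: the genotype data and the
# population metadata live in different tables, so a JOIN is unavoidable.
cur.execute("""
    SELECT p.population, COUNT(*)
    FROM genotypes g
    JOIN sample_pop p ON g.sample = p.sample
    WHERE g.gt = '0/1'
    GROUP BY p.population
""")
print(cur.fetchall())   # [('CEU', 1)]
```

At the scale described below, the question becomes how to structure storage so this kind of join stays fast without duplicating the metadata into every genotype record.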

Variant Call Format (VCF) files are the most common format for storing genomic data. Each row corresponds to a place in the genome where there is a difference with respect to an arbitrary reference sequence. This lets us record only the differences from the reference genome, which is far more compact than writing out all 3.2 billion base pairs of sequence we all carry. The first 8 columns hold sample-independent attributes, whereas columns 9 to N hold information specific to each sample in the file, so the site-level information never has to be repeated for every sample. The annotations in the INFO field can be any key-value pair, not just the ones listed in the example in the specification. What's important is that many VCF files may need to be combined for any one study, so we need a flexible, scalable solution to address this.
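The sketch below illustrates that column layout on a single, made-up data line: fixed site-level columns, an INFO field of arbitrary key-value annotations, and a FORMAT column describing the per-sample fields that follow. The line content and the INFO keys are invented for illustration only; real files will differ.

```python
# One illustrative VCF data line (the values and INFO keys are made up).
line = ("1\t12345\t.\tA\tG\t50\tPASS\tDP=100;ExAC.Info.AF=0.05\t"
        "GT:AD:GQ\t0/1:60,40:99\t0/0:70,0:99")

fields = line.rstrip("\n").split("\t")
chrom, pos, vid, ref, alt, qual, flt = fields[:7]

# Column 8: arbitrary key=value (or flag) annotations shared by all samples.
info = dict(kv.split("=", 1) if "=" in kv else (kv, True)
            for kv in fields[7].split(";"))

# Column 9 names the per-sample keys; columns 10..N hold one sample each.
fmt_keys = fields[8].split(":")
samples = [dict(zip(fmt_keys, s.split(":"))) for s in fields[9:]]

print(info["ExAC.Info.AF"])   # '0.05'
print(samples[0]["GT"])       # '0/1'
```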

### Process

  1. Download the sample file. It consists of 348,110 different variants from 629 samples, and several annotations in the INFO field. For more information about this particular VCF, see our VCF-Miner manuscript. If you haven't seen a VCF file before, the specification can be found here. They are compressed, so you may need to decompress them before beginning (e.g. bgzip -dc or zcat).
  2. Download the cryptic relatedness file. The sample names are in the 2nd and 3rd columns.
  3. Download the population description file.
  • Note: the populations we are looking for in the later query are in column 3, e.g. CEU, AFR, ASN, etc.
  • Relationships are in the 4th column.
  4. Transform the VCF and *.tsv files into formats that can be loaded into your database.
  5. Import the files into your database.
  6. Perform the following query (a sketch follows this list):
  • find all the heterozygous variants (GT=0/1) in the VCF where
  • the total depth (the sum of the AD field, or DP if it exists) > 30 and
  • the alternate allele depth (the 2nd number in the AD field) > 10 and
  • the genotype quality (GQ) is > 30 and
  • the ExAC.Info.AF annotation value is < 0.1 or not labeled at all and
  • SAVANT_IMPACT=HIGH or SAVANT_IMPACT=MODERATE but
  • exclude the samples from the cryptic-relatedness file - except for the Siblings from the ASW population
  • For each population, provide the number of variants remaining
  7. Once you decide on your best query/DB structure, submit a pull request and we'll post your results.
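As a reference point, here is one hedged sketch of the step-6 filter in Python, assuming each record has already been parsed into the `(info_dict, {sample_name: call_dict})` shape from the earlier parsing example. `excluded_samples` and `sample_to_pop` are hypothetical inputs you would build from the two *.tsv files (with the ASW-sibling exception already applied); how you actually express this inside your database is the whole point of the challenge.

```python
# Sketch only: assumes records are already parsed; missing values ('.')
# and multi-allelic AD fields would need real handling in practice.
from collections import Counter

def passes(info, call):
    """Apply the per-call filters from the challenge description."""
    if call.get("GT") != "0/1":
        return False
    ad = [int(x) for x in call.get("AD", "0,0").split(",")]
    alt_depth = ad[1] if len(ad) > 1 else 0
    depth = int(call["DP"]) if "DP" in call else sum(ad)
    if depth <= 30 or alt_depth <= 10 or int(call.get("GQ", 0)) <= 30:
        return False
    if "ExAC.Info.AF" in info and float(info["ExAC.Info.AF"]) >= 0.1:
        return False
    return info.get("SAVANT_IMPACT") in ("HIGH", "MODERATE")

def count_by_population(records, sample_to_pop, excluded_samples):
    """records: iterable of (info_dict, {sample_name: call_dict}).

    Counts qualifying calls per population; whether to de-duplicate
    to distinct variants per population is a design choice.
    """
    counts = Counter()
    for info, calls in records:
        for sample, call in calls.items():
            if sample in excluded_samples:
                continue
            if passes(info, call):
                counts[sample_to_pop.get(sample, "UNKNOWN")] += 1
    return counts
```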

### Additional Notes

  • Don't commit the data files; we don't want to blow up GitHub. 😈
  • Don't just think about this single challenge question when you design your solution. Part of the difficulty in building this type of system is that you don't know what all the use cases/queries are a priori. As mentioned previously, other VCF files may have different data elements in their INFO fields but still need to be merged. New data will be generated on a weekly basis.

## Challenge #2. The continuous integration problem

Now that you've loaded your database, an investigator wants you to add another dataset to it. It sounds easy, until you learn that when you extract the number of heterozygous variants as in the previous challenge, you also need to provide a denominator for the number of variants that had data. Why is this difficult? A VCF only contains a row for a position if at least one sample in the file carries a variant there. In the new dataset, however, there are likely to be many variants that were not seen at all in your initial set, so by loading only plain VCFs you can never discern whether the previously loaded samples matched the reference at those positions or were simply missing data. This is where the gVCF comes in. It proposes to solve this N+1 problem - that is, as long as you keep each sample in a separate file. Therefore, this challenge is centered on leveraging the gVCF information and structuring it in such a way as to keep track of denominators as new data are added to the system.

The challenge is to load the gVCF files one at a time, then extract all the variants, in VCF format, that are homozygous alternate (GT=1/1) in NA12878 but homozygous reference (not missing or low quality) in the other two samples.

Why does this seem so hard? Because you aren't necessarily looking for a GT of 0/0 at that position in the other samples. In fact, if the variant isn't explicitly listed in NA12891 or NA12892, you probably won't find it unless you parse the gVCF correctly. What's worse is scale: gVCF files are MUCH larger than regular VCF files.

### Process

  1. Download the sample gVCF files NA12878, NA12891, and NA12892
  2. Parse the gVCF into a format suitable for your database design
  3. Load samples in this order: NA12878, NA12891, NA12892
  4. Return the variants in VCF format that are homozygous alternate in NA12878 but homozygous reference in NA12891 & NA12892 (a sketch of the lookup follows this list)
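A minimal sketch of the core lookup follows, assuming each sample's gVCF has been loaded into per-chromosome, position-sorted blocks where non-variant reference blocks span POS..END (from the END= INFO key) and variant rows span a single position. All function and variable names here are illustrative, not part of any prescribed design.

```python
# Hedged sketch: a real implementation would use a proper interval index
# rather than rebuilding the start list on every lookup.
import bisect

def build_index(blocks):
    """blocks: iterable of (chrom, start, end, gt, gq) tuples from one gVCF."""
    index = {}
    for chrom, start, end, gt, gq in blocks:
        index.setdefault(chrom, []).append((start, end, gt, gq))
    for chrom in index:
        index[chrom].sort()
    return index

def call_at(index, chrom, pos, min_gq=20):
    """Return the genotype of the block covering pos, or None if missing/low quality."""
    rows = index.get(chrom, [])
    starts = [r[0] for r in rows]
    i = bisect.bisect_right(starts, pos) - 1
    if i < 0:
        return None
    start, end, gt, gq = rows[i]
    if start <= pos <= end and gq >= min_gq:
        return gt
    return None

def is_hom_ref(gt):
    return gt in ("0/0", "0|0")

def filter_trio(na12878_hom_alt, na12891_idx, na12892_idx):
    """Keep NA12878 1/1 records only where both other samples are confidently hom-ref."""
    for chrom, pos, record in na12878_hom_alt:   # record = the parsed VCF row
        if is_hom_ref(call_at(na12891_idx, chrom, pos)) and \
           is_hom_ref(call_at(na12892_idx, chrom, pos)):
            yield record
```

The key point the sketch captures is that "homozygous reference" is answered by interval containment in a reference block, not by finding an explicit 0/0 row at that position.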

### Additional Notes

  • The regions are not fixed - they can be different for every sample.
  • Genotype qualities less than 20 are generally deemed uninformative, so those variants can be omitted.
  • It's probably best to leverage the range (END=) attribute and replicate the FORMAT data for that range in the output.

## Challenge #3. The cost projection problem

Now that we've evaluated performance on a relatively small data set, the next logical question is to project the cost of such a database design. These few samples are only a drop in the bucket in the grand scheme of genomics, so this next challenge is of the utmost practical importance.

Given your schema design, how much total space and memory does it currently use? In other words, how much would it cost to establish this infrastructure in order to reach 1-second query response times?

We want to project how much storage and memory would be required as the sample size increases. A proper benchmark should be based on the expected size of the 100,000 Genomes Project; from this, we can extrapolate expected costs as sample sizes increase. Care to take a guess at the cluster specs needed to support just that one project? Assume 100,000 samples with 3 million variants each.
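A back-of-the-envelope sketch of that extrapolation is below. The bytes-per-genotype figure is an assumption, not a measured value; the intent is that you replace it with a measurement from your own loaded test set (total on-disk size divided by the number of genotype records) and scale up from there.

```python
# Cost-projection sketch. bytes_per_genotype is a placeholder assumption;
# measure it from your own schema before drawing any conclusions.
samples = 100_000
variants_per_sample = 3_000_000
bytes_per_genotype = 50          # placeholder: includes keys/indexes overhead?

genotype_records = samples * variants_per_sample   # 3e11 records
raw_bytes = genotype_records * bytes_per_genotype

print(f"{genotype_records:.2e} genotype records")
print(f"~{raw_bytes / 1e12:.0f} TB before replication, annotations, or compression")
```

Even at 50 bytes per record the raw genotype matrix alone lands in the tens of terabytes, before indexes, annotations, or replication, which is why the schema choice dominates the cost.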

Please keep scale in mind when you design your solutions!

## Current Progress

| Technology | Submitter   | Schema | Import | Query | Cost Projection |
|------------|-------------|--------|--------|-------|-----------------|
| SQLite     | ?           |        |        |       |                 |
| ArangoDB   | StevenNHart |        |        |       |                 |
| MongoDB    | StevenNHart |        |        |       |                 |
| MySQL      | raymond301  |        |        |       |                 |

Put all these metrics in your README.md