This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Query supplementary

Karthik Gururaj edited this page Aug 4, 2018 · 4 revisions
It seems TileDB/GenomicsDB can only be queried by columns (contig positions) and rows (samples). What if someone wants all variants in a certain contig that satisfy a specific expression, for example a minimum allele frequency of X, BaseQRankSum > 5, etc.?

There is no 'primitive' in TileDB/GenomicsDB that lets you run that kind of query. The current approach is to query for all variants within the region of interest and then select the ones that meet your criteria. You can write a simple tool that reads the output of gt_mpi_gather from a pipe and drops the variants that don't meet your condition.
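
As a rough illustration of such a tool, here is a minimal Python sketch that reads VCF text on standard input and keeps only the records whose INFO value for a given key exceeds a threshold. The script name filter_info.py and the function names are hypothetical, and the sketch assumes gt_mpi_gather is writing uncompressed VCF text to standard output; records that lack the key are dropped.

```python
import sys


def info_value(info_field, key):
    """Return the numeric value of `key` in a VCF INFO column, or None.

    Flags (keys without '=') and non-numeric values yield None. For
    multi-valued entries only the first value is considered.
    """
    for entry in info_field.split(";"):
        name, sep, value = entry.partition("=")
        if name == key and sep:
            try:
                return float(value.split(",")[0])
            except ValueError:
                return None
    return None


def filter_vcf(lines, key, threshold):
    """Yield header lines unchanged and records with INFO/key > threshold."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record; skip it
        value = info_value(fields[7], key)
        if value is not None and value > threshold:
            yield line


# Hypothetical usage:
#   gt_mpi_gather -j query.json --produce-Broad-GVCF | python filter_info.py BaseQRankSum 5
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.stdout.writelines(filter_vcf(sys.stdin, sys.argv[1], float(sys.argv[2])))
```

The filter streams line by line, so it does not need to hold the whole query result in memory.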

Alternatively, I suggest using bcftools for many of these filters: keep the variant data in GenomicsDB, extract the variants in the region of interest from GenomicsDB, and filter them with bcftools.

See the bcftools documentation for the full list of filter options; there are many useful ones.

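For example, to select variants with an allele count of at least 5 (the -c option of bcftools view is an inclusive minimum, so use -c 6 for a count strictly greater than 5):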

gt_mpi_gather -j query.json --produce-Broad-GVCF | bcftools view -c 5

Make sure that produce_GT_field is set to true in your query.json for the above query. You can speed up execution by querying only the "GT" field in GenomicsDB (see "query_attributes" in the wiki section on queries), so that other fields are not included in the query.
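
One way to apply both settings without hand-editing the file is sketched below in Python. It only touches the two keys named above, produce_GT_field and query_attributes, and copies everything else unchanged. The function names and file paths are hypothetical. Whether other attributes must be listed alongside "GT" for your output format is not covered here, so check the wiki section on queries.

```python
import json


def restrict_to_gt(query):
    """Return a copy of a query configuration that requests only the GT field."""
    updated = dict(query)
    updated["produce_GT_field"] = True
    updated["query_attributes"] = ["GT"]
    return updated


def rewrite_query_file(src_path, dst_path):
    """Read a query JSON file, apply restrict_to_gt, and write the result."""
    with open(src_path) as src:
        query = json.load(src)
    with open(dst_path, "w") as dst:
        json.dump(restrict_to_gt(query), dst, indent=4)


# Hypothetical usage:
#   rewrite_query_file("query.json", "query_gt_only.json")
```

The original dictionary is left unmodified, so the full query can still be reused for other filters.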

Or to select variants with BaseQRankSum > 5

gt_mpi_gather -j query.json --produce-Broad-GVCF | bcftools view -e "INFO/BaseQRankSum <= 5"

With this you get the scalability and performance of TileDB/GenomicsDB and all the nice filters provided by bcftools. However, it's not without drawbacks:

  • The VCF format is not the best format for performance
  • You still read a lot of data from disk and probably discard a large percentage of the data read. To fix this problem, we need persistent secondary indexes.

Another alternative is the GenomicsDB Spark API. Its wiki documentation could use more work, but there are a couple of examples in Java and Scala.

We were hoping to support integration with Hail to get the scalability of Spark and the rich functionality of Hail analysis.
