
Commit a60583b

Fix typos, formatting and links in import-etl docs and update README.md (#291)

* Fix typos, formatting and links in import-etl docs. Also update README.md
* Move Styling from README to Style.md
nalinigans committed May 19, 2023
1 parent 7e2ac04 commit a60583b
Showing 4 changed files with 41 additions and 32 deletions.
README.md (23 changes: 3 additions & 20 deletions)
@@ -1,12 +1,13 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![readthedocs](https://readthedocs.org/projects/genomicsdb/badge/?version=latest)](https://genomicsdb.readthedocs.io/en/latest/)
[![Maven Central](https://img.shields.io/maven-central/v/org.genomicsdb/genomicsdb.svg)](https://mvnrepository.com/artifact/org.genomicsdb)

| Master | Develop |
| --- | --- |
| [![actions](https://github.com/GenomicsDB/GenomicsDB/workflows/build/badge.svg?branch=master)](https://github.com/GenomicsDB/GenomicsDB/actions?query=branch%3Amaster) | [![actions](https://github.com/GenomicsDB/GenomicsDB/workflows/build/badge.svg?branch=develop)](https://github.com/GenomicsDB/GenomicsDB/actions?query=branch%3Adevelop) |
| [![codecov](https://codecov.io/gh/GenomicsDB/GenomicsDB/branch/master/graph/badge.svg)](https://codecov.io/gh/GenomicsDB/GenomicsDB) | [![codecov](https://codecov.io/gh/GenomicsDB/GenomicsDB/branch/develop/graph/badge.svg)](https://codecov.io/gh/GenomicsDB/GenomicsDB/branch/develop) |

-GenomicsDB, originally from [Intel Health and Lifesciences](https://github.com/Intel-HLS/GenomicsDB), is built on top of a fork of [htslib](https://github.com/samtools/htslib) and a tile-based array storage system for importing, querying and transforming variant data. Variant data is sparse by nature (sparse relative to the whole genome) and using sparse array data stores is a perfect fit for storing such data. GenomicsDB is a highly performant scalable data storage written in C++ for importing, querying and transforming genomic variant data.
+GenomicsDB is built on top of a fork of [htslib](https://github.com/samtools/htslib) and a tile-based array storage system for importing, querying and transforming variant data. Variant data is sparse by nature (sparse relative to the whole genome) and using sparse array data stores is a perfect fit for storing such data. GenomicsDB is a highly performant scalable data storage written in C++ for importing, querying and transforming genomic variant data. See [genomicsdb.readthedocs.io](https://genomicsdb.readthedocs.io/en/latest/) for documentation and usage.
* Supported platforms : Linux and MacOS.
* Supported filesystems : POSIX, HDFS, EMRFS(S3), GCS and Azure Blob.

@@ -17,29 +18,11 @@ Included are

GenomicsDB is packaged into [gatk4](https://software.broadinstitute.org/gatk/documentation/article?id=11091) and benefits qualitatively from a large user base.

-The GenomicsDB documentation for users is hosted as a Github wiki:
-https://github.com/GenomicsDB/GenomicsDB/wiki
-
## External Contributions
GenomicsDB is open source and all participation is welcome.
GenomicsDB is released under the MIT License and all external
contributors are expected to grant an MIT License for their contributions.

### Checklist before creating Pull Request
-Please ensure that the code is well documented in Javadoc style for Java/Scala. For C/C++ code, roughly adhere to [Google C++ Style](https://google.github.io/styleguide/cppguide.html) for consistency/readabilty.
+Please ensure that the code is well documented in Javadoc style for Java/Scala. For Java/C/C++ code formatting, roughly adhere to the Google Style Guides. See [GenomicsDB Style Guide](Style.md).

-```
-Use spaces instead of tabs.
-Use 2 spaces for indenting.
-Add brackets even for one line blocks e.g.
-if (x>0)
-do_foo();
-should ideally be
-if (x>0) {
-do_foo();
-}
-Pad header e.g.
-if(x>0) should be if (x>0)
-while(x>0) should be while (x>0)
-One half indent for class modifiers.
-```
Style.md (21 changes: 21 additions & 0 deletions)
@@ -0,0 +1,21 @@
## GenomicsDB Style Guide

For Java code, roughly adhere to [Google Java Style](https://google.github.io/styleguide/javaguide.html) and for C/C++ roughly adhere to [Google C++ Style](https://google.github.io/styleguide/cppguide.html) for consistency/readability.

GenomicsDB Example Rules:

```
Use spaces instead of tabs.
Use 2 spaces for indenting.
Add brackets even for one-line blocks, e.g.
if (x>0)
  do_foo();
should ideally be
if (x>0) {
  do_foo();
}
Pad keywords with a space, e.g.
if(x>0) should be if (x>0)
while(x>0) should be while (x>0)
One half indent for class modifiers.
```
docs/examples/gatk.rst (2 changes: 1 addition & 1 deletion)
@@ -1,5 +1,5 @@
###############################
-Using with GATK
+Using GenomicsDB with GATK
###############################

GenomicsDB is packaged into
docs/import-etl.rst (27 changes: 16 additions & 11 deletions)
@@ -5,16 +5,16 @@ Import / ETL
###############################
GenomicsDB supports importing genomics data in several common formats, and with a variety of methods.

-GenomicsDB can ingest data in VCF, gVCF. and CSV formats. When importing VCFs or gVCFs,
+GenomicsDB can ingest data in VCF, BCF, gVCF and CSV formats. When importing VCF/BCFs or gVCFs,
you may need to run a few preprocessing steps before importing.

The primary and suggested method of importing is to use the native vcf2genomicsdb importer tool (see :ref:`CLI Tools <CLI Tools>`).
-There is also a Java API for importing (see here).
+There is also a Java API for importing (see :ref:`here <APIs/Java/API Reference/Java Import Package>`).
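As a rough sketch of the suggested flow: the native importer is driven by a JSON loader configuration (the file name below is illustrative, not fixed; see the CLI Tools reference for the exact options and config schema):

```
# Hypothetical invocation of the native importer. It reads a JSON
# loader configuration describing the workspace, partitions and the
# input VCFs; "loader.json" is an illustrative name only.
vcf2genomicsdb loader.json
```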


VCF/gVCF
*******************************
-The import program can handle block compressed and indexed VCFs, gVCFs, BCFs and gBCFs.
+The import program can handle block compressed and indexed VCFs, BCFs and gVCFs.
For brevity, we will only use the term VCF.
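As a minimal preprocessing sketch, assuming htslib's bgzip and tabix are installed and the input is a plain-text sample.vcf (file names are illustrative):

```
# Block-compress the VCF; bgzip compression is required for indexing
bgzip sample.vcf              # writes sample.vcf.gz

# Create a tabix index so the importer can fetch regions efficiently
tabix -p vcf sample.vcf.gz    # writes sample.vcf.gz.tbi
```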


@@ -31,7 +31,7 @@ Organizing your data
bcftools norm -m +any [-O <output_format> -o <output>] <input_file>
* In a multi-node environment, you must decide:
-  * How to `partition your data in GenomicsDB <#multi-node-setup>`
+  * How to `partition your data in GenomicsDB <#Multi-node-setup>`
  * How your VCF files are accessed by the import program:
    * On a shared filesystem (NFS, Lustre etc) accessible from all nodes.
    * If you are partitioning data by rows, your files can be scattered across local filesystems on multiple machines; each filesystem accessible only by the node on which it is mounted. Currently, such a setup isn't supported for column partitioning.
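To make the bcftools norm template above concrete, a hypothetical run that merges records for the same sample and position into multi-allelic records and writes block-compressed output (file names are illustrative):

```
# Merge biallelic records at the same position into multi-allelic ones,
# writing block-compressed VCF (-O z) ready for indexing and import
bcftools norm -m +any -O z -o sample.merged.vcf.gz sample.vcf.gz
tabix -p vcf sample.merged.vcf.gz
```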
@@ -127,19 +127,24 @@ The user must decide how to partition data across multiple nodes in a cluster:
* How many nodes should be used to store the data?
* How many partitions should reside on each node? A single node can hold multiple partitions (assuming the node has enough disk space).
* What mode should be used for partitioning the data? Two modes of partitioning are supported by various import/query tools.

  * Row partitioning: In this mode, for a given sample/CallSet (row), all the variant data resides in a single partition. Data belonging to different samples/CallSets may be scattered across different partitions.
  * Column partitioning: In this mode, for a given genomic position (column), all the variant data across all samples/CallSets resides in a single partition. Data is partitioned by genomic positions.

Which partitioning scheme is better to use is dependent on the queries/analysis performed by downstream tools. Here are some example queries for which the 'best' partitioning schemes are suggested.

* Query: fetch attribute X from all samples/CallSets for position Y (or small interval [Y1-Y2])

  * Row-based partitioning

    * For single position queries (or small intervals), partitioning the data by rows would likely provide higher performance. By accessing data across multiple partitions that may be located in multiple nodes in parallel, the system will be able to utilize higher aggregate disk and memory bandwidth. In a column based partitioning, only a single partition would service the request.
    * Simple data import step if the original data is organized as a file per sample/CallSet (for example VCFs). Just import data from the required subset of files to the correct partition.
    * Con(s). A final aggregator may be needed since the data for a given position is scattered across machines. Some of the query tools we provide use MPI to collect the final output into a single node.

* Query: run analysis tool T on all variants (grouped by column position) found in a large column interval [Z1-Z2] (or scan across the whole array)

  * Column-based partitioning

    * The user is running a query/analysis for every position in the queried interval. Hence, for each position, the system must fetch data from all samples/CallSets and run T. Partitioning by column reduces/eliminates any communication between partitions. For a sufficiently large query interval, the aggregate disk and memory bandwidth across multiple nodes can still be utilized.
    * No/minimal data aggregation step as all the data for a given column is located within a single partition.
    * Con(s). Importing data into GenomicsDB may become complex, especially if the initial data is organized as a file per sample/CallSet.
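As an illustrative sketch of how a partitioning choice might be expressed: the native importer's loader JSON can describe column partitions roughly along these lines. The field names and values here are assumptions for illustration, not the authoritative schema; consult the import documentation and CLI Tools reference for the actual format.

```
# Hypothetical loader fragment: two column partitions, one per node.
# Field names ("column_partitions", "begin", "workspace", "array")
# are illustrative assumptions, not the authoritative schema.
cat > loader.json <<'EOF'
{
  "column_partitions": [
    { "begin": 0,          "workspace": "/data/node1/ws", "array": "part0" },
    { "begin": 1000000000, "workspace": "/data/node2/ws", "array": "part1" }
  ]
}
EOF
```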

