JSeqArray: data manipulation of whole-genome sequencing variants with SeqArray files in Julia


License: GNU General Public License, GPLv3

pre-release version: v0.1.0

Features

Data management of whole-genome sequence variant calls with thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations in SeqArray files are stored in an array-oriented and compressed manner, with efficient data access using the Julia programming language.

The SeqArray format is built on top of the Genomic Data Structure (GDS) data format and defines the required data structure. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable array-oriented data sets. It is suited to large-scale datasets, especially data much larger than the available random-access memory. It also offers efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access.
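
To make the point about sub-byte integers concrete, here is a minimal sketch in plain Julia of packing 2-bit values four to a byte. The helper names `pack2bit` and `unpack2bit` are hypothetical and are not part of JSeqArray or jugds. The bit order is chosen only for illustration and is not claimed to match the actual GDS on-disk layout.

```julia
# Pack 2-bit values (0 to 3) four per byte, lowest bits first.
# Illustration only: this is not claimed to be the GDS storage layout.
function pack2bit(vals::Vector{UInt8})
    nbytes = div(length(vals) + 3, 4)      # four values per byte, rounded up
    packed = zeros(UInt8, nbytes)
    for i in 1:length(vals)
        byte = div(i - 1, 4) + 1           # index of the byte holding value i
        shift = 2 * rem(i - 1, 4)          # bit offset of value i in that byte
        packed[byte] |= (vals[i] & 0x03) << shift
    end
    return packed
end

# Recover the first n 2-bit values from a packed byte vector.
function unpack2bit(packed::Vector{UInt8}, n::Int)
    vals = zeros(UInt8, n)
    for i in 1:n
        byte = div(i - 1, 4) + 1
        shift = 2 * rem(i - 1, 4)
        vals[i] = (packed[byte] >> shift) & 0x03
    end
    return vals
end

codes = UInt8[0, 1, 1, 0, 3, 2]            # six toy 2-bit codes
packed = pack2bit(codes)                   # stored in two bytes instead of six
restored = unpack2bit(packed, length(codes))
```

Six one-byte codes fit in two bytes here, which is the kind of saving a 2-bit genotype encoding allows before any further compression is applied.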

Installation

  • Development version from GitHub, requiring Julia >= v0.5
  • Requires the jugds package
Pkg.status()
# install package dependencies
Pkg.clone("https://github.com/CoreArray/jugds.jl.git")
Pkg.build("jugds")

Pkg.clone("https://github.com/CoreArray/JSeqArray.jl.git")
Pkg.build("JSeqArray")

Package Maintainer

Dr. Xiuwen Zheng (zhengxwen@gmail.com)

Documentation

JSeqArray.jl documentation: docs/index.md

Citation

Original paper (implemented in an R/Bioconductor package):

SeqArray

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Python package

PySeqArray

SeqArray File Download

Examples

using JSeqArray

fn = seqExample(:kg)
f = seqOpen(fn)
SeqArray File: JSeqArray/demo/data/1KG_phase1_release_v3_chr22.gds (1.1M)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 1092 LZMA_ra(10.5%), 914B } *
|--+ variant.id   { Int32 19773 LZMA_ra(8.39%), 6.6K } *
|--+ position   { Int32 19773 LZMA_ra(52.0%), 41.1K } *
|--+ chromosome   { Str8 19773 LZMA_ra(0.28%), 166B } *
|--+ allele   { Str8 19773 LZMA_ra(22.7%), 111.9K } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x1092x19773 LZMA_ra(8.17%), 882.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1092x19773 LZMA_ra(0.02%), 550B } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 19773 LZMA_ra(35.2%), 77.0K } *
|  |--+ qual   { Float32 19773 LZMA_ra(3.62%), 2.9K } *
|  |--+ filter   { Int32,factor 19773 LZMA_ra(0.21%), 170B } *
|  |--+ info   [  ]
|  \--+ format   [  ]
\--+ sample.annotation   [  ]
   |--+ Family.ID   { Str8 1092 LZMA_ra(15.3%), 1.1K }
   |--+ Population   { Str8 1092 LZMA_ra(5.08%), 222B }
   |--+ Gender   { Str8 1092 LZMA_ra(5.85%), 386B }
   \--+ Ancestry   { Str8 1092 LZMA_ra(2.43%), 233B }
# get genotype data (ploidy × sample × variant), 0xFF is missing value
seqGetData(f, "genotype")

seqClose(f)
2×1092×19773 Array{UInt8,3}:
[:, :, 1] =
 0x00  0x01  0x01  0x00  0x01  …  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x01  0x00     0x01  0x01  0x01  0x01  0x01
[:, :, 2] =
 0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00
 0x00  0x01  0x01  0x00  0x01     0x00  0x00  0x00  0x00  0x00
[:, :, 3] =
 0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00
...
[:, :, 19771] =
 0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00     0x01  0x00  0x00  0x00  0x00
[:, :, 19772] =
 0x00  0x01  0x01  0x00  0x01  …  0x01  0x01  0x00  0x01  0x00
 0x01  0x00  0x00  0x00  0x00     0x01  0x01  0x00  0x00  0x00
[:, :, 19773] =
 0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x01
 0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x01  0x00
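
As a sketch of what can be done with an array of this shape, the snippet below computes a per-variant reference allele frequency on a small hand-made array. It does not call JSeqArray; the toy data and the helper name `ref_allele_freq` are hypothetical. It assumes that code 0x00 denotes the reference allele, and it skips the 0xFF missing code mentioned above.

```julia
# Toy genotype array laid out as ploidy × sample × variant, with UInt8
# allele codes and 0xFF for missing. Values are listed in column-major
# order: both alleles of sample 1, then sample 2, then sample 3.
geno = reshape(UInt8[
    0x00, 0x01,  0x01, 0x01,  0x00, 0xFF,   # variant 1
    0x00, 0x00,  0x00, 0x00,  0x00, 0x01    # variant 2
], 2, 3, 2)

# Fraction of non-missing alleles equal to 0x00 (assumed to be the
# reference allele), computed separately for each variant.
function ref_allele_freq(g::Array{UInt8,3})
    nvar = size(g, 3)
    freq = zeros(Float64, nvar)
    for v in 1:nvar
        nref = 0
        ncalled = 0
        for s in 1:size(g, 2), p in 1:size(g, 1)
            a = g[p, s, v]
            if a != 0xFF                   # skip missing alleles
                ncalled += 1
                if a == 0x00
                    nref += 1
                end
            end
        end
        # NaN marks a variant with no called alleles at all
        freq[v] = ncalled > 0 ? nref / ncalled : NaN
    end
    return freq
end

af = ref_allele_freq(geno)
```

The same function should accept the full 2×1092×19773 array, since it only relies on the element type and the dimension order.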

Key Functions

Function        Description
seqFilterSet    Define a data subset of samples or variants.
seqGetData      Get data from a SeqArray file with a defined filter.
seqApply        Apply a user-defined function over array margins.
seqParallel     Apply functions in parallel.
...
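
The filter-then-apply pattern these functions describe can be sketched on an in-memory array without the package. The code below is not the JSeqArray API: `apply_by_block` and `missing_rate` are hypothetical helpers, and the real signatures of `seqFilterSet` and `seqApply` are given in docs/index.md. The sketch selects a subset of samples with a Boolean mask, then runs a user-defined function over consecutive blocks of variants.

```julia
# Hypothetical helper (not part of JSeqArray): call `fun` on consecutive
# blocks of at most `bsize` variants and concatenate the per-block results.
function apply_by_block(fun, g::Array{UInt8,3}, bsize::Int)
    nvar = size(g, 3)
    out = Float64[]
    for lo in 1:bsize:nvar
        hi = min(lo + bsize - 1, nvar)
        append!(out, fun(g[:, :, lo:hi]))
    end
    return out
end

# User-defined function: per-variant fraction of missing alleles (0xFF).
function missing_rate(block::Array{UInt8,3})
    nvar = size(block, 3)
    rate = zeros(Float64, nvar)
    for v in 1:nvar
        nmiss = 0
        for a in block[:, :, v]
            if a == 0xFF
                nmiss += 1
            end
        end
        rate[v] = nmiss / (size(block, 1) * size(block, 2))
    end
    return rate
end

# Toy data: ploidy × sample × variant, listed in column-major order.
geno = reshape(UInt8[
    0x00, 0x00,  0xFF, 0xFF,  0x00, 0x01,   # variant 1
    0x00, 0x01,  0x00, 0x00,  0xFF, 0x00,   # variant 2
    0x01, 0x01,  0x00, 0xFF,  0x00, 0x00    # variant 3
], 2, 3, 3)

keep = [true, false, true]                 # sample filter: keep samples 1 and 3
subset = geno[:, keep, :]
rates = apply_by_block(missing_rate, subset, 2)
```

Working block by block keeps only a slice of the variants in memory at a time, which is the reason such an interface suits files larger than the available memory.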

More Examples

Julia tutorial with SeqArray files: demo/tutorial.ipynb

Julia tutorial with SeqArray files and parallel programming: demo/tutorial_parallel.ipynb

JSeqArray.jl documentation: docs/index.md

Other Resources

Learn X in Y minutes (where X=Julia): http://learnxinyminutes.com/docs/julia/
