Skip to content

timpalpant/java-genomics-io

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Java Genomics IO API

This library provides a simple and performant API for reading and writing many genomic file formats in a consistent fashion. File formats currently supported include: Bed, BigBed, BedGraph, GFF, GeneTrack, SAM, BAM, Wiggle, BigWig, and Tabix.

The design seeks to accomplish two major tasks:

  1. Allow many different formats to be accessed and manipulated in a consistent way without needing to worry about parsing individual file types. For example, all line-based interval formats (Bed, BedGraph, BigBed, GFF, GeneTrack, SAM, BAM) inherit from the Interval class so that applications may be agnostic to the specific format of input files. In a similar fashion, text Wiggle files and BigWig files share the same interface and may be used interchangeably.

  2. Make random-access to information in large genomic datasets efficient by using indexing schemes. In this way, your tools can randomly pull information from specific regions of the genome without the major performance cost incurred by seeking through ASCII text files line-by-line. Line-based interval files are indexed as needed using a pure-Java reimplementation of Tabix. Built-in indexes are used for BAM, BigWig, and BigBed files. ASCII Wiggle files are indexed using a custom implementation.

With all readers, data is buffered and lazy-parsed to minimize memory requirements and maximize disk performance.

Writers are also available to create line-based interval files and Wiggle files.

Complete JavaDocs are available at: palpant.us/java-genomics-io and may also be generated with the ant task “javadoc”

Compilation

An ant build script is provided for easy compilation. To see available ant tasks, call

$ ant -p

To build the distribution as a jar file, call

$ ant dist

To run the unit tests, call

$ ant test

and to analyze the unit test code coverage, call

$ ant coverage

Reports for the unit test results and code coverage are output in the reports subdirectory.

Examples

There are two major categories of readers/writers: for line-based interval files and for (Big)Wig files.

Line-based interval files

Entries in line-based interval files can be iterated over, or the file can be queried for entries that overlap a given genomic coordinate.

The format of the interval file can also be auto-detected so that your application can be format-agnostic:

$ try (IntervalFile<? extends Interval> loci = IntervalFile.autodetect(lociFile)) {
$   for (Interval interval : loci) {
$     System.out.println(interval.toBed());
$   }
$ } catch (IntervalFileSnifferException e) {
$  log.fatal("Error autodetecting interval file format");
$  e.printStackTrace();
$ }

This is particularly useful for writing Galaxy tools (getgalaxy.org).

Alternatively, if the file format is known and type-specific data is needed, it can be opened directly:

$ try (BAMFileReader bam = new BAMFileReader(bamFile)) {
$   for (SAMEntry entry : bam) {
$     System.out.println(entry.getQuality());
$   }
$ }

Interval files can also be queried to retrieve only entries that overlap a given locus. Queries are performed efficiently by first indexing the files with a pure-Java reimplementation of Samtools/Tabix. For example, to retrieve all BAM alignments overlapping the region chr1:1,000,000-2,000,000:

$ Iterator<SAMEntry> result = bam.query("chr1", 1_000_000, 2_000_000);
$ while(result.hasNext()) {
$   SAMEntry entry = result.next();
$   // process the alignment
$ }

This type of querying is available for all supported file types.

Interval files can be written in any desired format using IntervalFileWriters. For example, to convert a file from GFF format to Bed format, open the input file with a GFFFileReader, iterate over each interval, and write it to a BedFileWriter:

$ try (GFFFileReader reader = new GFFFileReader(gffFile);
$      BedFileWriter<GFFEntry> writer = new BedFileWriter<>(bedFile)) {
$   for (GFFEntry entry : reader) {
$     // do some processing
$     writer.write(entry);
$   }
$ }

Generics can be used to restrict the types of Interval that may be written to a given file.

Wiggle files

Similar abstraction is provided for Wig/BigWig files. Contiguous blocks of values can be retrieved from either kind of Wig file:

$ try (WigFileReader wig = WigFileReader.autodetect(wigOrBigWigFile)) {
$   Contig result = wig.query("chr23", 10_000, 1_000_000);
$   float[] values = result.getValues();
$   System.out.println("Mean of interval chr23:10,000-1,000,000 = " + result.mean());
$   System.out.println("StDev = " + result.stdev());
$   System.out.println("Coverage = " + result.coverage());
$   System.out.println("Total = " + result.total());
$ }

Indexing is used to execute the queries efficiently. If the query is too large to load into memory all at once, you can ask for only statistics about the query:

$ SummaryStatistics stats = wig.queryStats("chr23", 1, 100_000_000);
$ System.out.println("Mean of interval chr23:1-100,000,000 = "+stats.getMean());

Wig files can be written using a WigFileWriter. Contigs can be written in fixedStep or variableStep format, or the most dense format can automatically be chosen based on the sparsity and layout of the data:

$ try (WigFileWriter writer = new WigFileWriter(wigFile)) {
$   writer.writeFixedStepContig(contig);
$   writer.writeVariableStepContig(contig);
$
$   // Automatically decide which format is denser
$   writer.write(contig);
$ }

Full application examples

See github.com/timpalpant/java-genomics-toolkit for some full implementations using this library. A good starting point is: github.com/timpalpant/java-genomics-toolkit/blob/master/src/edu/unc/genomics/ngs/IntervalStats.java which demonstrates iterating over a list of intervals in a Bed/BedGraph/GFF file, querying for data from each interval in a Wig file, and calculating the mean value of the data in each interval.

License

Copyright 2017 Timothy Palpant. This work is licensed under the LGPLv3, subject to the terms and conditions in LICENSE and LICENSE.LESSER.

About

A simple and performant API for working with many genomic file formats in a consistent fashion

Resources

License

GPL-3.0, LGPL-3.0 licenses found

Licenses found

GPL-3.0
LICENSE
LGPL-3.0
LICENSE.LESSER

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages