Skip to content

Commit

Permalink
README update for release, tune dubfile
Browse files Browse the repository at this point in the history
  • Loading branch information
jblachly committed Aug 23, 2020
1 parent 706eaf1 commit 1695573
Show file tree
Hide file tree
Showing 2 changed files with 64 additions and 57 deletions.
111 changes: 56 additions & 55 deletions README.md
Expand Up @@ -7,37 +7,23 @@ See our preprint here: (pending)
## Background and Motivation

Our initial goal was to be at least as fast as Jim Kent's seminal "liftOver" tool.
Since then, Swiftover has turned into a vehicle for exploration of interval data structures
as well as a vehicle to get us VCF liftover, not available in `liftOver`, and unfortunately
slow in CrossMap, the current standard.

We hypothesize that specifically for sorted genome intervals,
the implicit predictive caching of splay trees outperforms other
tree structures in linear/sequential search workloads as often found in genomics.
Indeed, splay trees outperform the well-balanced AVL tree,
which outperformed a slightly-less-well-balanced Red-Black tree.

With the recent invention of Iplicit Interval Trees (IITrees)
by Heng Li [1] and similar structures by others, we've tested these and found them
to be even faster than splay trees in some linear-scan liftover workloads.
For strict, simple liftover applications as for example hg19 -> GRCh38, IITrees may
slightly outperform the Splay Tree. However, for some complex liftovers, like
hg19 -> CanFam4, Splay Trees outperform the competition.

Thus for a balance of performance in all tested cases thus far, we recommend
using the Splay Trees version (see compilation, below) of Swiftover.
Since then, Swiftover has turned into a vehicle for exploration of interval data structures as well as a vehicle to get us VCF liftover, not available in `liftOver`, and unfortunately slow in CrossMap, the current standard.

We hypothesize that specifically for sorted genome intervals, the implicit predictive caching of splay trees outperforms other tree structures in linear/sequential search workloads as often found in genomics.
Indeed, splay trees outperform the well-balanced AVL tree, which outperformed a slightly-less-well-balanced Red-Black tree.

With the recent invention of Iplicit Interval Trees (IITrees) by Heng Li [1] and similar structures by others, we've tested these and found them to be even faster than play trees in some linear-scan liftover workloads.
For strict, simple liftover applications as for example hg19 -> GRCh38, IITrees may slightly outperform the Splay Tree. However, for some complex liftovers, like hg19 -> CanFam4, Splay Trees outperform the competition. (**NB:** This may be related to `GC.addRange` calls making IITree safe for GC collected memory, more tests in progress)

Thus for a balance of performance in all tested cases thus far, we recommend using the Splay Trees version (see compilation, below) of Swiftover.

Interestingly, when nodes need to be added and removed (as we might need to when
constructing graph genome structures) a traditional tree structure
may be preferred due to the IITree's need to be entirely reindexed after each
insert/delete operation. Future studies (i.e., benchmarks) are needed.
Interestingly, when nodes need to be added and removed (as we might need to when constructing graph genome structures) a traditional tree structure may be preferred due to the IITree's need to be entirely reindexed after each insert/delete operation. Future studies (i.e., benchmarks) are needed.

## Installation

Precompiled binaries are available on the releases page.
These binaries are statically linked against htslib; a system installation of htslib is not required.
You may still need system installation of htslib's dynamically-linked dependencies,
typically libcurl, libbz2, and liblzma.
Precompiled linux binaries are available on the releases page. These binaries are statically linked against htslib; a system installation of htslib is not required. This is tested and known to work on Ubuntu and Fedora; for Alpine linux and others using musl _swiftover_ must be compiled from source.

See also [Compiling from source](#compiling-from-source)

## Quickstart

Expand All @@ -47,28 +33,31 @@ If lifting over VCF, `-g <destination genome.fa>` is also required`

## Requirements

Chain file: Obtain from UCSC or Ensembl:
### Chain file

If you are working with human data, you can quickly grab hg19 to hg38 (UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38 (Ensembl-style contig naming: 1, MT) by issuing `make chains`.
Chainfiles will be placed in `resources/`, and for the time being must be un-gzipped.

Obtain chain files from UCSC or Ensembl:

[http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/](http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/)

[ftp://ftp.ensembl.org/pub/assembly_mapping/](ftp://ftp.ensembl.org/pub/assembly_mapping/)

If you are working with human data, you can quickly grab hg19 to hg38
(UCSC-style contig naming: chr1, chrM) and GRCh37 to GRCh38
(Ensembl-style contig naming: 1, MT) by issuing `make chains`.
Chainfiles will be placed in `resources/`, and for the time being need to be un-gzipped.
Swiftover needs uncompressed chain files. In the future we will add gzip reader.

### Reference genome

Swiftover needs uncompressed chain files. TODO: Will add gzip reader.
In VCF mode, a destination (post-liftover) reference genome FASTA file is required. If a FASTA index does not exist, one will be created the first time the genome is used.

## File Formats

It is critical that the contigs appearing in the _source_ file have an entry in the chain;
otherwise the program will terminate with `range violation`. Adding error checking/handling
otherwise the program may terminate with `range violation`. Adding error checking/handling
for this is possible, but as the check would be run once for every row of input, it could
unnecessarily slow the liftover.

Likewise, in VCF mode, contig naming must be consistent across input VCF, chain file, and
destination genome.
Likewise, in VCF mode, contig naming must be consistent across input VCF, chain file, and destination genome.

### BED

Expand All @@ -87,43 +76,54 @@ Keep in mind that contig names must be consistent among the chainfile, the genom

Lifting a VCF file to a new genome build additionally requres a FASTA file of the new/destination genome. If it is not already faidx indexed, an index will be created automatically. If no .fai index already exists and swiftover does not have write permission to the directory containing the genome, execution will fail.

**Reference allele change:** Occasionally, the reference allele may differ even at equivalent coordinates in different genome builds. When swiftover detects this, it will update the REF column of the VCF record and add the tag **refchg** to the INFO column. These records can then be filtered by downstream tools if necessary (e.g., `bcftools view -i 'INFO/refchg=1'`)
**Reference allele change:** Occasionally, the reference allele may differ even at equivalent coordinates in different genome builds. When swiftover detects this, it will update the `REF` column of the VCF record and add the tag **refchg** to the `INFO` column. These records can then be filtered by downstream tools if necessary (e.g., `bcftools view -i 'INFO/refchg=1'`)

*CAVEATS:* INFO and FORMAT column tags related to allele frequencies and calculations may
no longer be accurate in the destination geneome build (due to subtle mapping differences),
but _especially_ if the reference allele has changed. We will likely add cmdline flag to strip
all INFO/FORMAT tags, followed later by a plugin to recalculate select values (e.g. when refchg),
or scripting (e.g. Lua) capability.
**CAVEATS:** `INFO` and `FORMAT` column tags related to allele frequencies and calculations may no longer be accurate in the destination geneome build (due to subtle mapping differences), but _especially_ if the reference allele has changed. We will likely add cmdline flag to strip all INFO/FORMAT tags, followed later by a plugin to recalculate select values (e.g. when refchg), or scripting (e.g. Lua) capability.

### GFF3

TBD
_TBD_

## Compiling from source

### Install D compiler

Download for your platform here: https://github.com/ldc-developers/ldc/releases/tag/v1.23.0

Uncompress and include the `bin/` directory in your $PATH.

### Step by Step

1. Install D compiler
2. Clone swiftover repository: `git clone https://github.com/blachlylab/swiftover.git`
3. `dub build -b=release-nobounds -c=<intervaltreetype>`
* Where `<intervaltreetype>` in { `avltree`, `splaytree`, `iitree` }

### Selection of interval tree type

Swiftover uses the [intervaltree](https://github.com/blachlylab/intervaltree) library.

With dub, the configurations `avltree`, `splaytree`, and `iitree` are available.
Strictly ordering these according to execution speed is not possible, as they vary for different workloads.
See background and discussion, above. We suggest splaytree for best overall performance.
See background and discussion, above. We suggest splaytree for best overall performance at this time.

### Selection of compiler

DMD codegen can be poor compared to LDC and GDC, with execution too slow to compete with `liftover`.
Use LDC2 and `dub -b=release` for > 100% speedup. Additionally, as of dklib 0.1.1, DMD cannot inline
(at least) one of the functions in khash, which means compilation of swiftover with LDC2 or GDC is required.
Use LDC2 and `dub -b=release` for > 100% speedup. Additionally, as of dklib 0.1.2, DMD cannot inline
one of the functions in khash (`kh_hash_func`), which means compilation of swiftover with LDC2 or GDC is required for best performance.

**htslib linking:**
when using LDC2, or when using the GOLD linker (instead of traditional GNU ld), you'll need to make sure
that the linker can find libhts, which is often installed in `/usr/local/lib`. GOLD does not search there
by default, nor does it examine `LD_LIBRARY_PATH`. It does, however, search `LIBRARY_PATH`, so add
`export LIBRARY_PATH=/usr/local/lib` (or wherever you have installed htslib) to build scripts or run before dub build.
### linking to htslib

when using LDC2, or when using the GOLD linker (instead of GNU ld), you'll need to make sure that the linker can find libhts, which is often installed in `/usr/local/lib`. GOLD does not search there by default, nor does it examine `LD_LIBRARY_PATH`. It does, however, search `LIBRARY_PATH`, so add `export LIBRARY_PATH=/usr/local/lib` (or wherever you have installed htslib) to build scripts or run before dub build.

thanks to [http://jbohren.com/articles/2013-10-28-gold/](http://jbohren.com/articles/2013-10-28-gold/)

**htslib version:**
### htslib version

Swiftover uses our htslib binding [dhtslib](https://github.com/blachlylab/dhtslib).
dhtslib currently works only with htslib-1.9, so if compiling Swiftover from source, make sure that you have
a system installation of htslib-1.9. When dhtslib finalizes htslib-1.10 (breaking ABI changes and new API)
support, we will also update Swiftover to use the new dhtslib.

dhtslib currently works only with htslib-1.9, so if compiling Swiftover from source, make sure that you have a system installation of htslib-1.9. When dhtslib finalizes htslib-1.10 (breaking ABI changes and new API) support, we will also update Swiftover to use the new dhtslib.

## BUGS

Expand All @@ -136,4 +136,5 @@ but _especially_ if the reference allele has changed. Look for the `refchg` tag
## References

[1] https://github.com/lh3/cgranges

[2] https://github.com/blachlylab/intervaltree
10 changes: 8 additions & 2 deletions dub.json
Expand Up @@ -10,8 +10,6 @@
"intervaltree": "~>0.20.0"
},
"buildRequirements": [ "allowWarnings" ],
"dflags-ldc": [ "" ],
"lflags": ["-lcurl", "-lbz2", "-llzma" ],
"subConfigurations": {
"dhtslib": "source"
},
Expand Down Expand Up @@ -45,6 +43,14 @@
"name": "instrument-iitree",
"versions": [ "instrument", "iitree" ],
"targetType": "executable"
},
{
"name": "dist-static-splaytree",
"versions": [ "splay" ],
"targetType": "executable",
"subConfigurations": {
"dhtslib": "source-static"
}
}
]
}

0 comments on commit 1695573

Please sign in to comment.