Official home of genome aligner based upon notion of Cactus graphs
Clone or download
joelarmstrong Record cactus commit and config used in each alignment produced
Should help with reproducibility, or really just helps me remember
what alignment was produced with what settings.
Latest commit 4f567fd Sep 25, 2018
Permalink
Failed to load latest commit information.
api Update to fix build on Travis Trusty image Jul 16, 2017
bar EventTree now belongs to CactusDisk, not Flower Jul 5, 2017
bin Fixed build for module Feb 2, 2011
blast Merge remote-tracking branch 'origin/master' into mapQAlignments Aug 27, 2018
blastLib Merging in the development branch to the phylogeny branch, bringing in Jun 7, 2014
caf Small cleanup. More to do. Sep 12, 2018
check Remove cactusDisk_construct2 now that it's the same as _construct Apr 12, 2017
dbTest Make doctest unit tests optional Aug 21, 2013
doc Add some basic documentation on autoscaling in AWS Jul 6, 2018
examples Update evolverMammals example to point to S3 rather than hgwdev Jul 3, 2018
faces Allow "caching" of DB responses / strings to be disabled Apr 6, 2017
hal Re-enable caching everywhere Jun 1, 2017
lib Fixed build for module Feb 2, 2011
normalisation Remove cactusDisk_construct2 now that it's the same as _construct Apr 12, 2017
phylogeny Fix making an insert instead of update request when writing disk params Jul 6, 2017
pipeline Remove outdated 'resetProjectPaths' script Sep 14, 2017
preprocessor Avoid pip-installing cactus in container [ci skip] Jan 26, 2017
reference Crash early if reference event is wrong, to avoid causing trouble later Sep 30, 2017
runtime Use multi-stage Docker builds instead of manually copying files Jul 24, 2018
setup EventTree now belongs to CactusDisk, not Flower Jul 5, 2017
src/cactus Record cactus commit and config used in each alignment produced Sep 26, 2018
submodules Track master branch in hal rather than revision Mar 14, 2018
.gitignore Write cactus commit into src/ tree when installing through setup.py, … Nov 17, 2016
.gitmodules Refer to --remote option which is apparently needed for submodule branch Mar 15, 2018
.travis.yml Fix up more SON_TRACE_DATASETS reliant tests that were bitrotted Jun 28, 2018
Dockerfile Use multi-stage Docker builds instead of manually copying files Jul 24, 2018
LICENSE.txt Chsnged copyright for peter Feb 6, 2011
Makefile Use multi-stage Docker builds instead of manually copying files Jul 24, 2018
README.md Note Py2.7 restriction Sep 25, 2018
allTests.py Fix up more SON_TRACE_DATASETS reliant tests that were bitrotted Jun 28, 2018
include.mk Reduce the amount of make output and speed up the build Feb 11, 2017
setup.py Merge pull request #34 from ComparativeGenomicsToolkit/mapQAlignments Sep 25, 2018

README.md

Cactus

Build Status

Cactus is a reference-free whole-genome multiple alignment program. The principal algorithms are described here: https://doi.org/10.1101/gr.123356.111

Acknowledgements

Cactus uses many different algorithms and individual code contributions, principally from Joel Armstrong, Glenn Hickey, Mark Diekhans and Benedict Paten. We are particularly grateful to:

  • Yung H. Tsin and Nima Norouzi for contributing their 3-edge connected components program code, which is crucial in constructing the cactus graph structure, see: Tsin,Y.H., "A simple 3-edge-connected component algorithm," Theory of Computing Systems, vol.40, No.2, 2007, pp.125-142.
  • Bob Harris for providing endless support for his LastZ pairwise, blast-like genome alignment tool.

Setup

System requirements

Cactus uses substantial resources. For primate-sized genomes (3 gigabases each), you should expect Cactus to use approximately 120 CPU-days of compute per genome, with about 120 GB of RAM used at peak. The requirements scale roughly quadratically, so aligning two 1-megabase bacterial genomes takes only 1.5 CPU-hours and 14 GB RAM.

Note that to run even the very small evolverMammals example, you will need 2 CPUs and 12 GB RAM. The actual resource requirements are much less, but the individual jobs have resource estimates based on much larger alignments, so the jobs will refuse to run unless there are enough resources to meet their estimates.

Virtual environment

To avoid problems with conflicting versions of dependencies on your system, we strongly recommend installing Cactus inside a Python virtual environment. Note that Cactus will only currently work with Python 2.7, until some of our dependencies become Python 3 compatible.

To install the virtualenv command, if you don't have it already, run:

pip install virtualenv

To set up a virtual environment in the directory cactus_env, run:

virtualenv cactus_env

Then, to enter the virtualenv, run:

source cactus_env/bin/activate

You can always exit out of the virtualenv by running deactivate. The rest of the README assumes you're running inside a virtual environment.

Install Cactus and its dependencies

Cactus uses Toil to coordinate its jobs. To install Toil into your environment, run:

pip install --upgrade toil

Finally, to install Cactus, from the root of the cactus repository, run:

pip install --upgrade .

Compile Cactus executables (if not using Docker/Singularity)

By default Cactus uses containers to distribute its binaries, because compiling its dependencies can sometimes be a pain. If you can use Docker or Singularity, you can skip this section. However, in some environments (e.g. HPC clusters) you won't be able to use Docker or Singularity, so you will have to compile the binaries and install a few dependencies.

First, ensure you have KyotoTycoon installed. If you have root access, it is available through most package managers under kyototycoon or kyoto-tycoon. To compile it manually, you are best off using the unofficial Altice Labs repository. If you've installed KyotoTycoon (and its library, KyotoCabinet) from a package manager, you should be OK to go. If you've installed it in a non-standard location, however, (because you don't have root access, for example) you will need to set the following environment variables:

ttPrefix=<path of the PREFIX where you installed Kyoto>
export kyotoTycoonIncl="-I${ttPrefix}/include -DHAVE_KYOTO_TYCOON=1"
export kyotoTycoonLib="-L${ttPrefix}/lib -Wl,-rpath,${ttPrefix}/lib -lkyototycoon -lkyotocabinet -lz -lbz2 -lpthread -lm -lstdc++"

and copy the ktserver binary to somewhere on your PATH, and depending on your install directory, you may also need to add ${ttPrefix}/lib to your LD_LIBRARY_PATH. (This can be a bit of a pain--we have an updated scons-based build system in the works that will automate most of this, but it's not ready yet.)

Once you have KyotoTycoon installed, you should be able to compile Cactus and its dependencies by running:

git submodule update --init
make

To run using these local executables, you will need to provide the --binariesMode local option to all cactus commands and add the bin directory to your PATH.

System/cluster requirements

Cactus will take about 20 CPU-hours per bacterial-sized (~4 megabase) genome, about 20 CPU-days per nematode-sized (~100 megabase) genome, and about 120 CPU-days per mammalian-sized (~3 gigabase) genome. You will need at least one machine with very large amounts of RAM (150+ GB) to run mammalian-sized genomes. The requirements will vary a bit depending on how closely related your genomes are, so these are only rough estimates.

Running

To run Cactus, the basic format is:

cactus <jobStorePath> <seqFile> <outputHal>

The jobStorePath is where intermediate files, as well as job metadata, will be stored. It must be accessible to all worker systems.

When first testing out Cactus on a new system or cluster, before running anything too large, try running the small (5 600kb genomes) simulated example in examples/evolverMammals.txt. It should take less than an hour to run on a modern 4-core system. That example, even though it's small, should be enough to expose any major problems Cactus may have with your setup.

Choosing how to run the Cactus binaries (Docker/Singularity/local)

By default, Cactus uses Docker to run its compiled components (to avoid making you install dependencies). It can instead use Singularity to run its binaries, or use a locally installed copy. To select a different way of running the binaries, you can use the --binariesMode singularity or --binariesMode local options. (If running using local binaries, you will need to make sure cactus's bin directory is in your PATH.)

seqFile: the input file

The input file, called a "seqFile", is just a text file containing the locations of the input sequences as well as their phylogenetic tree. The tree will be used to progressively decompose the alignment by iteratively aligning sibling genomes to estimate their parents in a bottom-up fashion. Polytomies in the tree are allowed, though the amount of computation required for a sub-alignment rises quadratically with the degree of the polytomy. The file is formatted as follows:

NEWICK tree (optional)
name1 path1
name2 path2
...
nameN pathN

An optional * can be placed at the beginning of a name to specify that its assembly is of reference quality. This implies that it can be used as an outgroup for sub-alignments. If no genomes are marked in this way, all genomes are assumed to be of reference quality. The star should only be placed on the name-path lines and not inside the tree.

  • The tree must be on a single line. All leaves must be labeled and these labels must be unique. Ancestors may be named, or left blank (in which case the ancestors in the final output will automatically be labeled Anc0, Anc1, etc.) Labels must not contain any spaces.
  • Branch lengths that are not specified are assumed to be 1.
  • Lines beginning with # are ignored.
  • Sequence paths must point to either a FASTA file or a directory containing 1 or more FASTA files.
  • Sequence paths must not contain spaces.
  • Each name / path pair must be on its own line
  • http://, s3://, etc. URLs may be used.

Please ensure your genomes are soft-masked with RepeatMasker. We do some basic masking as a preprocessing step to ensure highly repetitive elements are masked when repeat libraries are incomplete, but still genomes that aren't properly masked can take tens of times longer to align that those that are masked. Hard-masking (totally replacing repeats with stretches of Ns) isn't necessary, and is strongly discouraged (you will miss a lot of alignments!).

Example:

  # Sequence data for progressive alignment of 4 genomes
  # human, chimp and gorilla are flagged as good assemblies.
  # since orang isn't, it will not be used as an outgroup species.
 (((human:0.006,chimp:0.006667):0.0022,gorilla:0.008825):0.0096,orang:0.01831);
 *human /data/genomes/human/human.fa
 *chimp /data/genomes/chimp/
 *gorilla /data/genomes/gorilla/gorilla.fa
 orang /cluster/home/data/orang/

Running locally

There isn't much to configure if running locally. Most importantly, if on a shared system, you can adjust the maximum number of processors used with --maxCores <N> (by default, Cactus will use all cores).

Running on a cluster

Cactus (through Toil) supports many batch systems, including LSF, SLURM, GridEngine, Parasol, and Torque. To run on a cluster, simply add --batchSystem <batchSystem>, e.g. --batchSystem gridEngine. If your batch system needs additional configuration, Toil exposes some environment variables that can help.

Running on the cloud

Cactus supports running on AWS, Azure, and Google Cloud Platform using Toil's autoscaling features. For more details on running in AWS, check out these instructions (other clouds are similar).

Using the output

Cactus outputs its alignments in the HAL format. This format represents the alignment in a reference-free, indexed way, but isn't readable by many tools. To export a MAF (which by its nature is usually reference-based), you can use the hal2maf tool to export the alignment from any particular genome: hal2maf <hal> --refGenome <reference> <maf>.

You can use the alignment to generate gene annotatations for your assemblies, using the Comparative Annotation Toolkit.