Home

ogcat is a set of fast, scalable utilities for working with large-scale Newick and FASTA files. Here "scalable" means good asymptotic running time, low memory usage, and reasonable running time in practice. The goal is to complement the existing ecosystem (RAxML, gotree, etc.) in features while still scaling to inputs that existing tools cannot handle, such as TB-size FASTA alignments.

Why?

On moderate-sized inputs, ogcat provides a 5x-20x speed-up over existing implementations on some tasks.

For example, here is the timing for computing sum-of-pairs error rates on alignments of 1000 sequences with an average sequence length of around 1000 (1000M2, replicate 0, pasta_4_align.txt from the MAGUS data):

Benchmark 1: ogcat sp -r true_align.txt -e pasta_4_align.txt
  Time (mean ± σ):      49.4 ms ±   1.1 ms    [User: 41.8 ms, System: 18.2 ms]
  Range (min … max):    47.6 ms …  53.6 ms    56 runs

Benchmark 2: java -jar FastSP.jar -r true_align.txt -e pasta_4_align.txt
  Time (mean ± σ):     463.6 ms ±  20.1 ms    [User: 742.3 ms, System: 87.1 ms]
  Range (min … max):   450.3 ms … 518.1 ms    10 runs

Summary
  'ogcat sp -r true_align.txt -e pasta_4_align.txt' ran
    9.38 ± 0.45 times faster than 'java -jar FastSP.jar -r true_align.txt -e pasta_4_align.txt'

On large inputs, ogcat can finish when other methods might not. For example, on very large and extremely gappy FASTA alignments (~1GB of original sequences, ~1TB aligned), FastSP.jar needs to slurp the entire reference alignment file (including all gaps), whereas the space ogcat uses is dominated by the total size of the sequences. This is not a criticism of FastSP.jar: it was written a decade ago, and ogcat internally uses its algorithm.
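To illustrate why memory can scale with the sequences rather than with the aligned width, here is a minimal Python sketch. It is not ogcat's or FastSP's code; the function names (`residue_columns`, `shared_pairs`, `sp_error`) are hypothetical, and the sketch assumes `-` is the only gap character. Each row is reduced to the column index of its residues, so gaps are never stored, and homologous pairs are counted per column instead of being enumerated.

```python
from collections import Counter


def residue_columns(aligned_row):
    """Column index of every non-gap character in one aligned row."""
    return [col for col, ch in enumerate(aligned_row) if ch != "-"]


def count_pairs(cols_by_seq):
    """Number of residue pairs that share a column in one alignment."""
    per_column = Counter()
    for cols in cols_by_seq.values():
        per_column.update(cols)
    return sum(n * (n - 1) // 2 for n in per_column.values())


def shared_pairs(ref_cols, est_cols):
    """Number of residue pairs aligned together in both alignments."""
    joint = Counter()
    for name, rcols in ref_cols.items():
        ecols = est_cols[name]
        if len(rcols) != len(ecols):
            raise ValueError(f"{name}: sequences differ between alignments")
        # Two residues are paired in both alignments exactly when they
        # have the same (reference column, estimated column) key.
        joint.update(zip(rcols, ecols))
    return sum(n * (n - 1) // 2 for n in joint.values())


def sp_error(ref, est):
    """Return (SPFN, SPFP) for two alignments given as {name: aligned row}."""
    ref_cols = {name: residue_columns(row) for name, row in ref.items()}
    est_cols = {name: residue_columns(row) for name, row in est.items()}
    shared = shared_pairs(ref_cols, est_cols)
    ref_total = count_pairs(ref_cols)
    est_total = count_pairs(est_cols)
    spfn = 1 - shared / ref_total if ref_total else 0.0
    spfp = 1 - shared / est_total if est_total else 0.0
    return spfn, spfp


ref = {"a": "AC-G", "b": "A-TG", "c": "ACTG"}
est = {"a": "ACG-", "b": "AT-G", "c": "ACTG"}
print(sp_error(ref, est))
```

The sketch takes whole rows as strings for brevity; a streaming reader could call `residue_columns` on chunks of each row so that the gapped rows themselves are never held in memory.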

Input formats

For phylogenetic trees, ogcat takes Newick trees as input. Depending on the command, a single file may contain multiple newline-separated Newick trees.
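As a rough illustration of the multi-tree layout, the following Python sketch reads one Newick string per non-empty line. The helper name `iter_newick_trees` is hypothetical and this is not how ogcat itself parses trees; it only checks that each line ends with the terminating semicolon.

```python
import io


def iter_newick_trees(handle):
    """Yield one Newick string per non-empty line of a text stream."""
    for lineno, line in enumerate(handle, start=1):
        tree = line.strip()
        if not tree:
            continue  # tolerate blank lines between trees
        if not tree.endswith(";"):
            raise ValueError(f"line {lineno}: Newick tree must end with ';'")
        yield tree


# Two trees on the same four taxa, separated by a blank line.
example = io.StringIO("((a,b),(c,d));\n\n(a,(b,(c,d)));\n")
trees = list(iter_newick_trees(example))
print(len(trees))
```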

For alignment files, ogcat takes FASTA alignments. Commands also work on alignments in which singleton insertion sites are marked with lowercase letters.

To facilitate working with large data, ogcat can read compressed files (using the autocompress crate). Inputs can be in the gz, lz, or zst formats; zst is the safe default, as it is usually better all-around than gz and compresses better than lz. Outputs can be written as gz or zst. The compression format is auto-detected, at least from the file extension. For example, aln.fa.zst is read transparently, just like aln.fa.
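The idea of extension-based detection can be sketched in Python as follows. This is only an analogy to what the autocompress crate does for ogcat, not its implementation; `open_text` is a hypothetical helper, and the zst branch assumes the third-party `zstandard` package is installed (the demo below exercises only the plain and gz paths).

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a possibly compressed text file, choosing the codec by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".zst"):
        # Assumption: the third-party `zstandard` package is available.
        import zstandard
        return zstandard.open(path, "rt")
    return open(path, "r")


fasta = ">a\nAC-G\n>b\nA-TG\n"
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "aln.fa")
    packed = os.path.join(tmp, "aln.fa.gz")
    with open(plain, "w") as fh:
        fh.write(fasta)
    with gzip.open(packed, "wt") as fh:
        fh.write(fasta)
    # Both files are read through the same call; only the suffix differs.
    with open_text(plain) as fh:
        from_plain = fh.read()
    with open_text(packed) as fh:
        from_packed = fh.read()

print(from_plain == from_packed)
```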
