Home
ogcat is a set of fast and scalable utilities (good asymptotic running time, low memory usage, and reasonable wall-clock time) for working with large-scale Newick and FASTA files. The goal is to complement the features of the existing ecosystem (think RAxML, gotree, etc.) while staying scalable to large-scale data where existing tools do not (think TB-size FASTA alignments).
For moderately sized inputs, ogcat provides a 5x-20x speed-up over existing implementations on some tasks.
For example, when computing sum-of-pairs error rates for alignments of 1000 sequences with an average sequence length of around 1000 (1000M2, replicate 0, pasta_4_align.txt from the MAGUS data):
Benchmark 1: ogcat sp -r true_align.txt -e pasta_4_align.txt
  Time (mean ± σ):      49.4 ms ±   1.1 ms    [User: 41.8 ms, System: 18.2 ms]
  Range (min … max):    47.6 ms …  53.6 ms    56 runs

Benchmark 2: java -jar FastSP.jar -r true_align.txt -e pasta_4_align.txt
  Time (mean ± σ):     463.6 ms ±  20.1 ms    [User: 742.3 ms, System: 87.1 ms]
  Range (min … max):   450.3 ms … 518.1 ms    10 runs

Summary
  'ogcat sp -r true_align.txt -e pasta_4_align.txt' ran
    9.38 ± 0.45 times faster than 'java -jar FastSP.jar -r true_align.txt -e pasta_4_align.txt'
On large inputs, ogcat can finish when other methods might not. For example, on very large FASTA alignments that are extremely gappy (~1GB in original sequences, ~1TB aligned), FastSP.jar needs to slurp the entire reference alignment file (including all gaps), whereas the space used by ogcat is dominated by the total size of the unaligned sequences. This is not a criticism of FastSP.jar, however: it was invented a decade ago, and ogcat internally just uses its algorithm.
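As a rough illustration of why memory can scale with the unaligned sequences rather than with the aligned file, the sketch below keeps only the column index of each residue and counts shared homologies from those indices. It is a minimal Python sketch under stated assumptions, not the actual code of ogcat or FastSP.jar: the function names (read_fasta, residue_columns, sp_counts) are hypothetical, '-' is assumed to be the only gap character, both alignments are assumed to contain the same sequences, and lowercase insertion columns are not treated specially.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, aligned_row) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def residue_columns(lines, gap="-"):
    """Map each sequence name to the column index of each of its residues.

    Only one aligned row is held at a time, and gaps are never stored,
    so the result grows with the unaligned sequence lengths.
    """
    return {
        name: [col for col, ch in enumerate(row) if ch != gap]
        for name, row in read_fasta(lines)
    }


def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def sp_counts(ref_cols, est_cols):
    """Return (shared, reference, estimated) homology counts."""
    shared = Counter()
    for name, rcols in ref_cols.items():
        ecols = est_cols[name]
        if len(rcols) != len(ecols):
            raise ValueError(f"sequence {name!r} differs between alignments")
        # The i-th residue sits in column rcols[i] of the reference and
        # column ecols[i] of the estimate.
        shared.update(zip(rcols, ecols))
    ref = Counter(c for cols in ref_cols.values() for c in cols)
    est = Counter(c for cols in est_cols.values() for c in cols)
    return (
        sum(pairs(n) for n in shared.values()),
        sum(pairs(n) for n in ref.values()),
        sum(pairs(n) for n in est.values()),
    )


reference = ">a\nAC-G\n>b\nA-TG\n>c\nACTG\n".splitlines()
estimated = ">a\nACG-\n>b\nA-TG\n>c\nACTG\n".splitlines()
shared, n_ref, n_est = sp_counts(residue_columns(reference), residue_columns(estimated))
spfn = 1 - shared / n_ref  # fraction of reference homologies that were missed
spfp = 1 - shared / n_est  # fraction of estimated homologies not in the reference
```

In this toy case 6 of the 8 reference homologies are recovered, so both error rates are 0.25.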
For phylogenetic trees, ogcat takes in Newick trees. Depending on the command, a single file may be allowed to contain multiple newline-separated Newick trees.
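To make the multi-tree layout concrete, here is a minimal Python sketch that reads one Newick tree per line. The helper name iter_newick_trees is hypothetical, and the check that each tree ends in ';' is an assumption about well-formed input rather than a description of what ogcat itself validates.

```python
import io


def iter_newick_trees(lines):
    """Yield one Newick string per non-empty line."""
    for line in lines:
        tree = line.strip()
        if not tree:
            continue  # tolerate blank lines between trees
        if not tree.endswith(";"):
            raise ValueError(f"expected a ';'-terminated tree, got {tree[:40]!r}")
        yield tree


# Two trees in one newline-separated "file".
trees = list(iter_newick_trees(io.StringIO("(A,B);\n\n((A,B),C);\n")))
```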
For alignment files, ogcat takes in FASTA alignments. Commands also work on alignments that have singleton insertion sites marked in lowercase letters.
To facilitate working with large data, ogcat can read from compressed files (using the autocompress crate). Input can be in the gz, lz, or zst formats (zst is the safe default, as it is usually better all-around than gz and better-compressed than lz), and output can be in the gz and zst formats. The compression format can be auto-detected, at least by file extension. For example, aln.fa.zst will be read transparently, just like aln.fa.
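Extension-based detection of this kind can be sketched in a few lines of Python. This is only an illustration, not how ogcat or the autocompress crate is implemented: the names KNOWN_SUFFIXES, detect_compression, and open_text are hypothetical, and the suffix table simply mirrors the format names listed above. Only gz is actually decoded here, because it is the only one of the three covered by the Python standard library module used.

```python
import gzip
from pathlib import Path

# Assumed mapping from file suffix to the format names used in the text.
KNOWN_SUFFIXES = {".gz": "gz", ".lz": "lz", ".zst": "zst"}


def detect_compression(path):
    """Return the compression format implied by the last suffix, or None."""
    return KNOWN_SUFFIXES.get(Path(path).suffix.lower())


def open_text(path):
    """Open a plain or gz-compressed file for reading text."""
    kind = detect_compression(path)
    if kind is None:
        return open(path, "rt")
    if kind == "gz":
        return gzip.open(path, "rt")
    # Other formats would need a decoder that this sketch does not include.
    raise NotImplementedError(f"no decoder for {kind!r} in this sketch")
```

With this table, detect_compression("aln.fa.zst") gives "zst", while detect_compression("aln.fa") gives None and the file is opened as plain text.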