Normalized compression distance for alignment-free biological sequence comparison


snacc: compress and compare pathogen genomes without sequence alignment

snacc is a pipeline that implements the normalized compression distance (NCD) for biological data. The workflow consists of three stages: compression, clustering, and visualization. The goal of this project is to compare large-scale pathogen genomes faster than conventional alignment-based methods such as BLAST by exploiting the inherent redundancies of the genetic code.
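To make the idea concrete, the NCD of two sequences x and y is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s and xy is the concatenation. The following is a minimal sketch of that formula using Python's standard lzma module; it is not snacc's own code, and the helpers `random_dna` and `mutate` are hypothetical names that exist only to build demo data.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed input."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, rng: random.Random) -> bytes:
    """Hypothetical helper: random nucleotide string for the demo."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode()


def mutate(seq: bytes, n_subs: int, rng: random.Random) -> bytes:
    """Hypothetical helper: overwrite n_subs positions with randomly
    chosen bases (a position may receive the base it already had)."""
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_subs):
        out[pos] = ord(rng.choice("ACGT"))
    return bytes(out)


rng = random.Random(0)
genome_a = random_dna(5000, rng)
genome_b = mutate(genome_a, 50, rng)  # close relative of genome_a
genome_c = random_dna(5000, rng)      # unrelated sequence

print("a vs b:", round(ncd(genome_a, genome_b), 3))
print("a vs c:", round(ncd(genome_a, genome_c), 3))
```

Related sequences share substrings, so their concatenation compresses almost as well as either one alone and the distance is small; unrelated sequences give a value close to 1.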



Installation and Dependencies

Set up a virtual env (optional)

To install virtualenv use the following command in your terminal:

pip install virtualenv

Then in the directory you want to use, create a virtualenv named env:

virtualenv -p python3.6 env

And then activate the environment with:

source env/bin/activate

You can leave the virtualenv at any time with the command:

deactivate

Install the dependencies

To install the dependencies:

pip install git+

Optional: install BWT disk functionality

To use snacc with the BWT-Disk function you can run the following commands:

cd /path/to/snacc/bin/bwt_disk
make clean


Usage

  1. Before calling snacc
source activate my_env
  2. Most basic usage
snacc -d [folder with sequences] -o [output name]
  3. Intermediate: customize number of threads and compression algorithm
snacc -d [folder with sequences] -o [output name] -n 24 -c gzip
  4. Full control
snacc \
--directory [folder with sequences] \
--output [output name] \
--num-threads 24 \
--compression zlib \
--save-compression [folder to store zipped files] \
--show-progress False

Using bwt-disk functionality

The following command runs a Burrows-Wheeler transform on disk using the default amount of memory (256 MB) and then compresses the result using range encoding (the rle-range-encoding option).

snacc \
--directory [folder with sequences] \
--output [output name] \
--num-threads 24 \
--compression bwt-disk \
--bwte-compress rle-range-encoding \
--save-compression [folder to store zipped files] \
--show-progress False
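For readers unfamiliar with the Burrows-Wheeler transform, the sketch below shows why it is a useful step before run-length style compression. It is a naive in-memory version built from sorted rotations, written only for illustration; the bwt-disk tool works on disk with bounded memory and is not implemented this way, and the range-encoding stage is not shown. The function names `bwt` and `run_length_encode` are illustrative and are not part of snacc.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    The sentinel marks the end of the input and must sort before every
    other symbol ("$" sorts before the letters A, C, G, T).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(text: str) -> List[Tuple[str, int]]:
    """Collapse consecutive repeats into (symbol, count) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(text)]


sequence = "ACGT" * 8
transformed = bwt(sequence)
print(transformed)
print(len(run_length_encode(sequence)), "runs before BWT")
print(len(run_length_encode(transformed)), "runs after BWT")
```

The transform groups symbols that share the same following context, so a repetitive input turns into a few long runs that a run-length or entropy coder can store compactly.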


snacc analysis

  • Analysis time: 2018-10-14 15:18:17.257619
  • Analysis duration: 0:00:26.383997
  • Compression method: lzma
  • Reverse complement: False
  • Burrows-Wheeler transform: False
  • Output filepath: /Users/BenjaminLee/Desktop/Python/Research/hackseq18/bioncd-hackseq/test.csv
Analyzed Files
  • /Users/BenjaminLee/Desktop/Python/Research/hackseq18/bioncd-hackseq/test_dataset/mysteryGenome_1.fasta
  • /Users/BenjaminLee/Desktop/Python/Research/hackseq18/bioncd-hackseq/test_dataset/mysteryGenome_2.fasta
Distance Matrix
file                   mysteryGenome_1.fasta  mysteryGenome_2.fasta
mysteryGenome_1.fasta  0.000644               0.003542
mysteryGenome_2.fasta  0.003542               0.000641
Version Information
  • Python: 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
  • snacc: 0.0.1
  • scikit-learn: 0.20.0
  • py-lz4framed: 0.12.0
  • umap-learn: 0.3.5
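The distance matrix in the report above is symmetric, and its diagonal entries are small but not exactly zero. That is expected with real compressors: compressing a sequence concatenated with itself costs slightly more than compressing it once. The sketch below builds such a matrix for a handful of named sequences. It is an assumption about how a matrix of this shape could be produced, not snacc's implementation: it uses zlib, computes each unordered pair once and mirrors the value, and the function name `distance_matrix` is hypothetical.

```python
import itertools
import random
import zlib
from typing import Callable, Dict


def distance_matrix(
    sequences: Dict[str, bytes],
    compress: Callable[[bytes], bytes] = zlib.compress,
) -> Dict[str, Dict[str, float]]:
    """Pairwise NCD for named sequences, returned as a nested dict."""
    # Compress each sequence once and reuse its size for every pair.
    sizes = {name: len(compress(seq)) for name, seq in sequences.items()}
    matrix: Dict[str, Dict[str, float]] = {name: {} for name in sequences}
    # Visit each unordered pair once, including (x, x) for the diagonal.
    for a, b in itertools.combinations_with_replacement(sequences, 2):
        joint = len(compress(sequences[a] + sequences[b]))
        dist = (joint - min(sizes[a], sizes[b])) / max(sizes[a], sizes[b])
        matrix[a][b] = matrix[b][a] = dist
    return matrix


# Demo data: one random sequence, a lightly mutated copy, and an
# unrelated random sequence.
rng = random.Random(1)
g1 = bytearray(rng.choice(b"ACGT") for _ in range(3000))
g2 = bytearray(g1)
for pos in rng.sample(range(len(g2)), 30):
    g2[pos] = rng.choice(b"ACGT")
g3 = bytearray(rng.choice(b"ACGT") for _ in range(3000))

matrix = distance_matrix({"g1": bytes(g1), "g2": bytes(g2), "g3": bytes(g3)})
for name, row in matrix.items():
    print(name, {other: round(value, 4) for other, value in row.items()})
```

Another compressor can be swapped in by passing, for example, `lzma.compress` or `bz2.compress` as the `compress` argument.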