ESSCompress v3.1

A tool to compress a set of k-mers represented in FASTA/FASTQ/KFF file(s).

Installation

There are two ways to install ESS-Compress: from source or from pre-compiled binaries.

1. Installation from source

Pre-requisites

  • Linux operating system (64 bit)

  • Git

  • GCC >= 4.8 or a C++11 capable compiler

  • CMake 3.1+

Steps

Download source and install:

git clone https://github.com/medvedevgroup/ESSCompress
cd ESSCompress
./INSTALL

Upon successful execution of this script, you will see Linux binaries for kff-tools (essAuxKffTools), Blight (essAuxBlight), BCALM (essAuxBcalm), DSK (essAuxDsk and essAuxDsk2ascii) and MFCompress (essAuxMFCompressC and essAuxMFCompressD) in the aux folder, along with essAuxValidate, essAuxCompress, essAuxDecompress and getMaxLen.

2. Installation from pre-compiled binaries

Requirements

  • Linux operating system (64 bit)

Steps

  1. Download the latest Linux 64-bit binaries:
    wget https://github.com/medvedevgroup/ESSCompress/releases/download/v3.1/essCompress-v3.1-linux-64.tar.gz

  2. Extract the .tar.gz file and change into the extracted directory:
    tar xvzf essCompress-v3.1-linux-64.tar.gz
    cd essCompress-v3.1/

  3. You will see two executables in the directory named essCompress and essDecompress.

    • You can either refer to these two executables directly when compressing/decompressing (using the command ./essCompress and ./essDecompress),

    • Or, you can move/copy ALL the executables in essCompress-v3.1/bin to a bin directory that is already in your PATH. For instance, if /usr/bin is already in your PATH, run the command mv ess* /usr/bin to move all the ESS-Compress executables. Alternatively, instead of moving/copying the executables, add the location of essCompress-v3.1/bin to your PATH.
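The PATH alternative can be sketched as follows. This is a minimal example, not part of the official instructions: ESS_HOME is a hypothetical variable introduced here, and its default assumes the release was extracted into the current directory.

```shell
# ESS_HOME is an assumed name; point it at the directory you extracted.
ESS_HOME="${ESS_HOME:-$PWD/essCompress-v3.1}"

# Prepend the bin directory so the ess* executables are found first.
export PATH="$ESS_HOME/bin:$PATH"

# Check whether the shell can now resolve the tool.
command -v essCompress || echo "essCompress not found under $ESS_HOME/bin" >&2
```

The export only affects the current shell session; adding the same line to your shell start-up file would make it persistent.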

Quick start with a step-by-step example

This example assumes that you are currently inside the base directory essCompress-v3.1 after you have completed installing the tool as per the instructions.

Let's say you have a small FASTA file of sequences, examples/smallExample.fa, and
cat examples/smallExample.fa returns

>
AAAAAAACCCCCCCCCC
>
CCCCCCCCCCA

We can compress it using k=11 as follows:

./bin/essCompress -k 11 -i examples/smallExample.fa

Now ls examples will show both the original input file and the compressed file in the same directory:

smallExample.fa
smallExample.fa.essc
...

smallExample.fa.essc is a compressed binary file generated by MFCompress, so it is not in a readable format.

To decompress into a readable format, you can run

./bin/essDecompress examples/smallExample.fa.essc   

You'll now see the decompressed file smallExample.fa.essd in the same directory.
cat examples/smallExample.fa.essd will return:

>
AAAAAAACCCCCCCCCCA

Notice that the decompressed fasta file is not the same as the original file, but it contains the same k-mers as smallExample.fa. You can double check this using the command
./bin/essAuxValidate 11 examples/smallExample.fa examples/smallExample.fa.essd
If they contain the same k-mers (i.e. 11-mers), you will see an output like this:

### SUCCESS: The two files contain same k-mers! ###
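To see why the two files count as equivalent, the comparison can be approximated with standard tools. The sketch below is only an illustration and is not how essAuxValidate is implemented. kmers is a hypothetical helper, it assumes every sequence occupies a single line, and it treats a k-mer and its reverse complement as equal by keeping the lexicographically smaller of the two.

```shell
# kmers K FILE: print the distinct canonical K-mers of a FASTA file.
# Hypothetical helper; assumes every sequence occupies a single line.
kmers() {
  awk -v k="$1" '
    function revcomp(s,    i, c, r) {
      r = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        if (c == "A") c = "T"; else if (c == "T") c = "A"
        else if (c == "C") c = "G"; else if (c == "G") c = "C"
        r = r c
      }
      return r
    }
    /^>/ { next }                        # skip header lines
    {
      for (i = 1; i + k - 1 <= length($0); i++) {
        f = substr($0, i, k)
        r = revcomp(f)
        print (f < r ? f : r)            # canonical form: the smaller of the two
      }
    }' "$2" | sort -u
}

# Recreate the original and decompressed sequences from the example above.
work=$(mktemp -d)
printf '>\nAAAAAAACCCCCCCCCC\n>\nCCCCCCCCCCA\n' > "$work/original.fa"
printf '>\nAAAAAAACCCCCCCCCCA\n' > "$work/decompressed.fa"

kmers 11 "$work/original.fa" > "$work/a.txt"
kmers 11 "$work/decompressed.fa" > "$work/b.txt"
if cmp -s "$work/a.txt" "$work/b.txt"; then
  echo "same k-mers"
else
  echo "k-mer sets differ"
fi
```

Both files yield the same eight canonical 11-mers: seven from the first input sequence plus CCCCCCCCCCA from the second, all of which fit into the single 18-character decompressed string.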

Usage details

essCompress: compression of a k-mer set

Syntax: ./essCompress [parameters] 

mandatory arguments:
-k [int]          k-mer size (must be >=4)
-i [input-file]   Path to input file. Input file can be either of these 3 formats:
                     1. a single fasta/fastq file (either gzipped or not)   
                     2. a single text file containing the list of multiple fasta/fastq files (one file per line)
                     3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.

optional arguments:
-a [int]          Default=1. Sets a threshold X, such that k-mers that appear less than X times in the input dataset are filtered out. 
-o [output-dir]   Specify output directory
-t [int]          Default=1. Number of threads (used by bcalm, dsk and blight). 
-x [int]          Default=1. Bytes allocated for associated abundance data per k-mer in kff. For highest compression with kff, by default the program limits 1 byte per k-mer (max value 255).   
-f                Fast compression mode: uses less memory, but achieves smaller compression ratio.
-u                UST mode (output an SPSS, which does not contain any duplicate k-mers and the k-mers it contains are exactly the distinct k-mers in the input. A k-mer and its reverse complement are treated as equal.)   
-d                DEBUG mode. If debug mode is enabled, no intermediate files are removed.
-v                Enable verbose mode: print more useful information.
-c                Verify correctness: check that all the distinct k-mers in the input file appears exactly once in compressed file.
-h                Print this Help
-V                Print version number
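The effect of the -a threshold can be pictured with standard tools. The following is a rough sketch of the semantics only, not of how the counting is done internally. The reads and the values of k and x are made up for illustration, and it assumes one sequence per line and that the rev utility is available.

```shell
k=4; x=2                      # k-mer size and abundance threshold (cf. -k and -a)
work=$(mktemp -d)
# Three made-up reads; the second is the reverse complement of the first.
printf '>\nAAAAC\n>\nGTTTT\n>\nAAACG\n' > "$work/reads.fa"

# Forward k-mers, one per line.
awk -v k="$k" '!/^>/ { for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }' \
    "$work/reads.fa" > "$work/fwd"

# Reverse complement of every k-mer, line by line.
rev "$work/fwd" | tr ACGT TGCA > "$work/rc"

# Canonical form is the smaller of each pair; count occurrences, keep those >= x.
kept=$(paste "$work/fwd" "$work/rc" | awk '{ print ($1 < $2 ? $1 : $2) }' \
       | sort | uniq -c | awk -v x="$x" '$1 >= x { print $2 }')
echo "$kept"
```

Here AAAA occurs twice and AAAC three times once reverse complements are merged, so both survive, while AACG occurs only once and is dropped. With the default threshold of 1, nothing is filtered.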

Input for essCompress

Two important input parameters are

  • input [-i]
  • k-mer size [-k]

If the input is a .kff file, the -k parameter is ignored.

The input file can be in one of three formats:
1. a single fasta or fastq file (either gzipped or not)
2. a single text file containing the list of multiple fasta/fastq files (one file per line)
3. a single .kff file. In this case, output is a .kff file after compressing in UST mode.

To pass a single FASTA file as input and compress: ./bin/essCompress -i examples/11mers.fa -k 11

To pass a single KFF file as input and compress: ./bin/essCompress -i examples/kmc_k15.kff

To pass several files as input, generate the list of files (one file per line) as follows:

ls -1 examples/*.fa > list_reads   
./bin/essCompress -i list_reads -k 5

ESS-Compress uses BCALM 2 under the hood, which ignores paired-end information: all given reads contribute k-mers to the graph, as long as those k-mers pass the abundance threshold.

Output for essCompress

If using fast mode/normal mode: the compressed output is in a file with .essc extension.

If using UST mode without kff: the compressed output is in a file with .fa.essd extension.

If compressing a kff file: the compressed output is in a file with .compressed.kff extension.

essDecompress: decompression of .essc file

    Syntax: ./essDecompress [file_to_decompress]

Input: a .essc file generated by essCompress
Output: a fasta file with .essd extension, where all the distinct k-mers represented by the input .essc file appear exactly once. In other words, output is a spectrum-preserving string set.
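Compression, decompression and validation can be chained into one round trip. The sketch below uses only the commands shown in this README. ess_roundtrip is a hypothetical helper, and it assumes that the .essc and .essd files are written next to the input (no -o given) and that essCompress and essDecompress report failure through their exit status. The exit status of essAuxValidate is not documented here, so rely on its printed message.

```shell
# ess_roundtrip BIN_DIR K INPUT  (hypothetical helper)
# Compress INPUT, decompress the result, then compare the k-mer sets.
ess_roundtrip() {
  bin=$1; k=$2; input=$3
  "$bin/essCompress" -k "$k" -i "$input" || return 1     # writes INPUT.essc
  "$bin/essDecompress" "$input.essc" || return 1         # writes INPUT.essd
  "$bin/essAuxValidate" "$k" "$input" "$input.essd"      # prints the verdict
}

# Run only when the tool is actually present in ./bin.
if [ -x ./bin/essCompress ]; then
  ess_roundtrip ./bin 11 examples/smallExample.fa
fi
```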

Citation

If you use ESS-Compress in your research, please cite

  • Amatur Rahman, Rayan Chikhi and Paul Medvedev, Disk compression of k-mer sets, WABI 2020.