LightAssembler

LightAssembler is a lightweight-resources assembly algorithm for high-throughput sequencing reads. It uses a pair of cache-oblivious Bloom filters: one holds a uniform sample of g-spaced sequenced kmers, and the other holds kmers classified as likely correct by a simple statistical test. LightAssembler also contains a light implementation of the graph traversal and simplification modules that achieves assembly accuracy and contiguity comparable to competing tools. More details about LightAssembler can be found in:

El-Metwally, S., Zakaria, M. and Hamza, T.; LightAssembler: fast and memory-efficient assembly algorithm for high-throughput sequencing reads. Bioinformatics 2016; 32 (21): 3215-3223. doi: 10.1093/bioinformatics/btw470.

Copyright (C) 2015-2016, and GNU GPL, by Sara El-Metwally, Magdi Zakaria and Taher Hamza.

System requirements

A 64-bit machine with the g++ compiler (or GCC in general) and the pthreads and zlib libraries.

Installation

  1. Clone the GitHub repo, e.g. with git clone https://github.com/SaraEl-Metwally/LightAssembler.git
  2. Run make in the repo directory for k <= 31 or make k=kmersize for k > 31, e.g. make k=49.
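The k <= 31 threshold is consistent with the common practice of packing each nucleotide into 2 bits, so that a kmer of up to 31 bases fits in a single 64-bit word and larger kmers need a wider integer type. The sketch below only illustrates that packing. The function name `encode_kmer` and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example, not LightAssembler's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of LightAssembler): packs a kmer of at most
// 31 bases into one 64-bit word, 2 bits per base.
// Returns false for an empty kmer, a kmer longer than 31, or a non-ACGT base.
bool encode_kmer(const std::string& kmer, std::uint64_t& code) {
    if (kmer.empty() || kmer.size() > 31) return false;
    std::uint64_t packed = 0;
    for (char base : kmer) {
        std::uint64_t bits = 0;
        switch (base) {
            case 'A': bits = 0; break;
            case 'C': bits = 1; break;
            case 'G': bits = 2; break;
            case 'T': bits = 3; break;
            default: return false;  // ambiguous bases such as 'N' are rejected
        }
        packed = (packed << 2) | bits;  // append the 2-bit code of this base
    }
    // 31 bases occupy at most 62 bits, so the top two bits stay clear.
    assert(packed < (std::uint64_t{1} << 62));
    code = packed;
    return true;
}
```

Under this mapping, calling `encode_kmer("ACGT", code)` stores the bit pattern 00 01 10 11 in `code`, while a 32-base kmer is rejected and would need the wider build selected with make k=kmersize.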

Quick usage guide

./LightAssembler -k [kmer size] -g [gap size] -e [error rate] -G [genome size] -t [threads] -o [output prefix] [input files] --verbose

* [-k] kmer size                [default: 31]
* [-g] gap size                 [default: 25X:3 35X:4 75X:8 140X:15 280X:25]
* [-e] error rate               [default: 0.01]
* [-G] genome size              [default: 0]
* [-t] number of threads        [default: 1]
* [-o] output prefix file name  [default: LightAssembler]

Notes

  • If the gap size parameter is omitted, LightAssembler invokes its parameter extrapolation module to compute the starting gap from the sequencing coverage and the error rate of the dataset.
  • The maximum read length for this version is 1024 bp.
  • The maximum number of read files supported by this version is 100.
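To make the role of the gap size more concrete, here is a minimal sketch of sampling g-spaced kmers from a single read. It assumes that "g-spaced" means keeping the kmers that start at positions 0, g, 2g, and so on. That stride, and the function name `sample_spaced_kmers`, are assumptions for illustration and may not match LightAssembler's sampling module.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of LightAssembler): returns the kmers that
// start at read positions 0, g, 2g, ... The stride of exactly g is an
// assumption made for this sketch.
std::vector<std::string> sample_spaced_kmers(const std::string& read,
                                             std::size_t k, std::size_t g) {
    assert(k > 0 && g > 0);  // a zero stride would never advance
    std::vector<std::string> sample;
    for (std::size_t pos = 0; pos + k <= read.size(); pos += g) {
        sample.push_back(read.substr(pos, k));  // kmer starting at pos
    }
    return sample;
}
```

Under this assumption a larger g keeps fewer kmers per read. That matches the direction seen in the verbose examples, where BloomA holds fewer kmers with g = 15 than with g = 4.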

Read files

LightAssembler assembles sequencing reads from multiple input files in fasta/fastq format. It can also read gzip-compressed input files (fasta.gz/fastq.gz) directly.

Outputs

The output of LightAssembler is the set of assembled contigs in fasta format, in the file:

[output prefix].contigs.fasta

LightAssembler also reports the following on the screen:

  • Number of resulting contigs.
  • Maximum contig length.
  • Total assembly size.
  • Total genome coverage.
  • Total assembly time as well as the total time for each step.

With the --verbose option, LightAssembler also reports additional details for each step, such as the number of kmers, the false positive rate of the Bloom filters, the number of branching kmers in the dataset, the average read length and the average sequencing coverage.
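For readers unfamiliar with the false positive rate of a Bloom filter, the textbook estimate for a filter with m bits, h hash functions and n inserted items is (1 - exp(-h*n/m))^h. The sketch below computes that estimate. The function name and parameters are illustrative assumptions, and LightAssembler's cache-oblivious filters may compute or report the rate differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Textbook estimate of a standard Bloom filter's false positive rate:
// (1 - exp(-h * n / m))^h for n inserted items, m bits and h hash functions.
// Hypothetical helper, not LightAssembler's own computation.
double bloom_false_positive_rate(std::size_t n_items, std::size_t m_bits,
                                 unsigned n_hashes) {
    assert(m_bits > 0 && n_hashes > 0);
    const double load = static_cast<double>(n_hashes) *
                        static_cast<double>(n_items) /
                        static_cast<double>(m_bits);
    // Probability that all h probed bits are already set by other items.
    return std::pow(1.0 - std::exp(-load), static_cast<double>(n_hashes));
}
```

For a fixed filter size, the estimate grows with the number of inserted kmers, which is one reason a denser sample (smaller gap) can raise the reported rate.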

Example 1

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Uniform kmers sampling. 

--- h(0):m(0):s(5) elapsed time.
--- total number of kmers in BloomA = 7791111
--- BloomA false positive rate = 0.00193375
--- average read length = 101
--- average sequencing coverage = 35
--- probability of an incorrect kmer appears in the sample : 0.0249524

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(24) elapsed time.
--- total number of kmers in BloomB = 4548112
--- BloomB false positive rate = 7.7715e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(5) elapsed time.
--- number of branching kmers = 54644

--- Graph traversal. 

--- h(0):m(0):s(16) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time. 

Example 2 (missing g)

./LightAssembler -k 31 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq --verbose

--- Parameters extrapolation.

--- h(0):m(0):s(1) elapsed time.
--- start with gap size g = 4
--- average read length = 101
--- average sequencing coverage = 35

--- Uniform kmers sampling. 

--- h(0):m(0):s(8) elapsed time.
--- total number of kmers in BloomA = 27604568
--- BloomA false positive rate = 0.0375047
--- probability of an incorrect kmer appears in the sample : 0.118144

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(9) elapsed time.
--- total number of kmers in BloomB = 4655530
--- BloomB false positive rate = 9.1219e-05

--- Branching-kmers computation. 

--- h(0):m(0):s(2) elapsed time.
--- number of branching kmers = 57242

--- Graph traversal. 

--- h(0):m(0):s(22) elapsed time.
--- number of contigs     = 747
--- maximum contig length = 127975
--- assembly size         = 4474072
--- genome coverage       = 95.4746%

--- The assembly session is finished. 

--- h(0):m(0):s(42) elapsed time.

Example 3 (without --verbose)

./LightAssembler -k 31 -g 15 -e 0.01 -G 4686137 -o ecoli_contigs -t 3 ecoli_reads_1.fq ecoli_reads_2.fq

--- Uniform kmers sampling. 

--- h(0):m(0):s(2) elapsed time.

--- Trusted/untrusted kmers filtering. 

--- h(0):m(0):s(11) elapsed time.

--- Branching-kmers computation. 

--- h(0):m(0):s(1) elapsed time.

--- Graph traversal. 

--- h(0):m(0):s(17) elapsed time.
--- number of contigs     = 731
--- maximum contig length = 120924
--- assembly size         = 4473869
--- genome coverage       = 95.4703%

--- The assembly session is finished. 

--- h(0):m(0):s(31) elapsed time.
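
The genome coverage values in the examples above are consistent with the assembly size divided by the genome size passed with -G, expressed as a percentage. The sketch below reproduces that arithmetic. The function name is hypothetical, and the formula is inferred from the reported numbers rather than taken from LightAssembler's source.

```cpp
#include <cassert>

// Hypothetical helper: assembly size as a percentage of the genome size (-G).
// The formula is inferred from the example output, not from the source code.
double genome_coverage_percent(unsigned long long assembly_size,
                               unsigned long long genome_size) {
    assert(genome_size > 0);  // -G defaults to 0; the ratio is undefined then
    return 100.0 * static_cast<double>(assembly_size) /
           static_cast<double>(genome_size);
}
```

With the values from Example 1, genome_coverage_percent(4473869, 4686137) gives approximately 95.4703, matching the reported genome coverage.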
