gmiclotte edited this page Apr 20, 2016 · 19 revisions

## Jabba: Hybrid Error Correction for Long Sequencing Reads

Jabba is a hybrid error correction tool that corrects third-generation (PacBio / ONT) sequencing data using second-generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

  • the de Bruijn graph should be in fasta format with one entry per node; the header of each entry should have the format:
    >NODE <node number> <size of node> <number of in edges> <in edges represented by node number of origin, separated by tabs> <number of out edges> <out edges represented by node number of target, separated by tabs>
  • the set of sequences should be in fasta or fastq format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba-<input filename>.fasta.
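
The header layout described above can be checked mechanically before a graph is handed to Jabba. The following is a minimal sketch and not part of Jabba: the function name check_graph_headers and the two-node sample graph are invented for illustration, and the script assumes that header fields are separated by whitespace (tabs or spaces). It only verifies that the declared edge counts match the number of edge entries.

```shell
# Sketch only: check_graph_headers is a hypothetical helper, not shipped with Jabba.
# It verifies that every header starts with >NODE and that the declared numbers of
# in and out edges match the number of edge entries that follow them.
check_graph_headers() {
    awk '
        /^>/ {
            n_in = $4
            if ($1 != ">NODE" || n_in !~ /^[0-9]+$/) {
                print "line " NR ": malformed header"; bad = 1; next
            }
            n_out = $(5 + n_in)    # the out-edge count follows the in-edge list
            if (n_out !~ /^[0-9]+$/ || NF != 5 + n_in + n_out) {
                print "line " NR ": edge counts do not match edge lists"; bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Made-up two-node graph: node 1 has one out edge to node 2,
# node 2 has one in edge from node 1.
sample=$(mktemp)
printf '>NODE\t1\t5\t0\t1\t2\nACGTA\n>NODE\t2\t4\t1\t1\t0\nCGTT\n' > "$sample"
check_graph_headers "$sample" && echo "headers look consistent"
rm -f "$sample"
```

On a real graph the call would be, for example, check_graph_headers brownie_data/DBGraph.fasta; a non-zero exit status means at least one header did not match the layout.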

The output is a file with corrections of the long reads.

de Bruijn graph

To build a de Bruijn graph from sequencing reads, one can use brownie or any other suitable tool. Errors in the de Bruijn graph have to be corrected, and linear paths have to be concatenated. Correction can be achieved by building the graph from corrected second-generation data, by correcting the graph directly, or preferably by a combination of the two. For read correction of the second-generation data, one can use brownie or Karect. This read correction should be performed with a small k-mer size, after which a larger k-mer size can be used to build the graph.

brownie will concatenate linear nodes and can output the graph in the desired graph format.
To build a graph with brownie from a fastq file short_reads.fastq containing short reads:

./brownie graphCorrection -p brownie_data -k 75 short_reads.fastq

To build a graph with brownie from a fasta file genome.fasta containing a reference genome (in this case the graph is not corrected):

./brownie graphConstruction -p brownie_data -k 75 genome.fasta

In both cases the graph file brownie_data/DBGraph.fasta will be created.

Installation

At the moment Jabba is available for Linux. It requires CMake 2.6 and GCC 4.7. Jabba can be compiled as follows:

mkdir -p build
cd build
cmake ../
make
cd ..
mkdir -p bin
cp -b ./build/src/Jabba ./bin/Jabba

This code is also available in the compile.sh script in the main Jabba directory.
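
Before compiling, it can help to confirm that the toolchain meets the stated versions. The sketch below assumes that "requires CMake 2.6 and GCC 4.7" means those versions or newer, which the text does not state explicitly. version_ge is a hypothetical helper, and it relies on the -V (version sort) option of GNU sort.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the stated versions are minimums, not exact requirements.
if command -v cmake >/dev/null 2>&1; then
    cmake_version=$(cmake --version | awk 'NR == 1 { print $3 }')
    version_ge "$cmake_version" 2.6 || echo "CMake $cmake_version is older than 2.6"
fi
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpversion)
    version_ge "$gcc_version" 4.7 || echo "GCC $gcc_version is older than 4.7"
fi
```

The checks are skipped silently when cmake or gcc is not on the PATH, so the snippet reports only versions that are present and too old.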

Usage

jabba [options] [file_options] file1 [[file_options] file2]...
[options]
  -h  --help        display help page
  -i  --info        display information page
[options arg]
  -l  --length      minimal seed size [default = 20]
  -k  --dbgk        de Bruijn graph k-mer size
  -e  --essak       sparseness factor of the essa [default = 1]
  -t  --threads     number of threads [default = available cores]
  -p  --passes      maximal number of passes per read [default = 2]
  -m  --outputmode  short (do not extend the reads) or long (maximally extend reads) [default = short]
[file_options file_name]
  -o  --output      output directory [default = Jabba_output]
  -fastq            fastq input files
  -fasta            fasta input files
  -g  --graph       graph input file [default = DBGraph.fasta]

examples:

./jabba --dbgk 31 --graph DBGraph.txt -fastq reads.fastq
./jabba -o Jabba -l 20 -k 31 -p 2 -e 1 -g DBGraph.fasta -fastq reads1.fastq reads2.fastq -fasta reads3.fasta

Example

Given an Illumina dataset short_reads.fastq and a PacBio dataset long_reads.fastq, the following pipeline can be used.
First, download and compile Karect, brownie and Jabba:

git clone https://github.com/aminallam/karect.git
cd karect
make
cd ..
git clone https://github.com/jfostier/brownie.git
cd brownie
mkdir build
mkdir bin
cd build
cmake -DMAXKMERLENGTH=75 ..
make brownie
cp src/brownie ../bin/brownie
cd ../..
git clone https://github.com/gmiclotte/Jabba.git
cd Jabba
./compile.sh
cd ..

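Now we are ready to run the tools. Karect corrects the short reads, brownie builds a corrected de Bruijn graph from them, and Jabba uses that graph to correct the long reads: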

mkdir karect_output
./karect/karect -correct -matchtype=hamming -celltype=haploid -inputfile=short_reads.fastq -resultdir=karect_output -tempdir=karect_output
mkdir brownie_output
./brownie/bin/brownie graphCorrection -p brownie_output -k 75 karect_output/karect_short_reads.fastq
./Jabba/bin/jabba -o jabba_output -k 75 -g brownie_output/DBGraph.fasta -fastq long_reads.fastq
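
Each stage of this pipeline consumes a file written by the previous one, so a failed stage can go unnoticed until a later tool complains. One option is to check for each intermediate file before continuing. In this sketch require_file is a hypothetical helper; the paths in the comments are the ones used in the commands above.

```shell
# Hypothetical helper: fail with a message when a file is missing or empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "missing or empty file: $1" >&2
        return 1
    fi
}

# Intended placement between the pipeline stages (left as comments here):
#   require_file karect_output/karect_short_reads.fastq || exit 1   # after karect
#   require_file brownie_output/DBGraph.fasta || exit 1             # after brownie
```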