The de Bruijn graph (DBG) is one of the most commonly used data structures for assembling sequencing data. Reads from the sequencer are chopped into short words of length k (k-mers), which form the nodes of the DBG. Two nodes are connected by an edge if their k-mers overlap by k-1 bases. Each edge can be labelled with the (k+1)-mer formed by merging the k-mers of its two nodes. For instance, an edge connecting the nodes ATCG and TCGT is labelled ATCGT. An assembly is generated by traversing paths in this graph. With advances in deep sequencing technologies, assembling high-coverage datasets has become challenging in terms of memory and runtime requirements. Hence read normalization, a lossy read-filtering approach, is gaining a lot of attention. Although current normalization algorithms are efficient, they provide no guarantee of preserving the important k-mers that form connections between different genomic regions in the graph, so the resulting assembly may be fragmented. In this work, normalization is phrased as a set multicover problem on reads, and a linear-time heuristic algorithm named ORNA (Optimized Read Normalization Algorithm) is proposed. ORNA normalizes to the minimum number of reads required to retain all labels ((k+1)-mers), and in turn all k-mers and the relative label abundances, from the original dataset. Hence no connections from the original graph are lost and coverage information is preserved.
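The node and edge-label definitions above can be illustrated with a minimal Python sketch. It is not ORNA's implementation (ORNA relies on GATB): the function names `kmers` and `build_dbg` are hypothetical, edges are taken only from (k+1)-mers actually observed in reads, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dbg(reads, k):
    """Map each edge (source k-mer, target k-mer, label) to its abundance.

    Assumption: an edge exists only if its (k+1)-mer label occurs in a read.
    """
    edges = defaultdict(int)
    for read in reads:
        for label in kmers(read, k + 1):
            u, v = label[:k], label[1:]  # the two nodes overlap by k-1 bases
            edges[(u, v, label)] += 1
    return dict(edges)

edges = build_dbg(["ATCGT", "TCGTA"], k=4)
for (u, v, label), count in sorted(edges.items()):
    print(u, "->", v, "label", label, "abundance", count)
```

With k = 4, the read ATCGT yields the nodes ATCG and TCGT joined by the edge labelled ATCGT, matching the example in the text.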
When to use ORNA
ORNA is a read normalization software developed in the spirit of Diginorm. ORNA is computationally inexpensive and guarantees the preservation of all kmers from the original dataset. It can be used when the user has a high-coverage dataset but not enough computational resources (memory in particular, but also time) to conduct a de novo assembly, because it removes the redundancy in the data. It can also be used to merge many sequencing datasets. The user must be aware that using ORNA (or, for that matter, any normalization software) might have a significant impact on the assemblies produced, and that this impact is highly dependent on the dataset.
Enhancements to ORNA
We have implemented two additional options in ORNA to improve the reduction performance using either abundance values of kmers in reads or base quality scores.
ORNA-Q (parameter: -sorting 1):
In this mode, ORNA not only preserves all labels from the original dataset but also maximizes the total read quality score of the normalized dataset. The read quality score of a read is defined as the sum of the Phred qualities of its bases. ORNA-Q sorts the input dataset by read quality score, using a counting sort, before reduction.
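As a rough illustration of the read quality score and the sorting step, here is a minimal Python sketch. It assumes Phred+33 ASCII-encoded quality strings and a descending sort order (highest-quality reads first); neither detail is stated above, and the function names are hypothetical.

```python
def read_quality_score(quality_string, offset=33):
    """Sum of Phred qualities of a read (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in quality_string)

def counting_sort_by_quality(reads):
    """Sort (sequence, quality_string) pairs by descending quality score.

    Counting-sort-style bucket pass: one bucket per possible score, so the
    cost is linear in the number of reads plus the maximum score.
    """
    scores = [read_quality_score(q) for _, q in reads]
    buckets = [[] for _ in range(max(scores, default=0) + 1)]
    for read, score in zip(reads, scores):
        buckets[score].append(read)
    # Assumption: highest-scoring reads are visited first during reduction.
    return [read for bucket in reversed(buckets) for read in bucket]

reads = [("ACGT", "!!!!"), ("ACGA", "IIII"), ("ACGC", "5555")]
sorted_reads = counting_sort_by_quality(reads)
print([seq for seq, _ in sorted_reads])
```

Because scores are bounded non-negative integers, the bucket pass avoids a comparison sort, which is presumably why a counting sort is a natural fit here.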
ORNA-K (parameter: -ksorting 1)
In this mode, the normalization algorithm maximizes the total read abundance score of the normalized dataset (apart from preserving all labels from the original dataset). The read abundance score of a read is defined as the median of abundances of kmers present in the read. ORNA-K sorts the input dataset using the median kmer abundances of the reads in the dataset and then uses ORNA for reduction.
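The read abundance score can be sketched in Python as follows. This is an illustration only: it counts plain k-mers without reverse complements, assumes every read is at least k bases long, and sorts in descending order, all of which are assumptions rather than documented ORNA behaviour. The function names are hypothetical.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def read_abundance_scores(reads, k):
    """Median abundance of the k-mers of each read, counted over all reads."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [median([counts[km] for km in kmers(read, k)]) for read in reads]

def sort_by_median_abundance(reads, k):
    """Order reads by descending median k-mer abundance (assumed direction)."""
    scores = read_abundance_scores(reads, k)
    order = sorted(range(len(reads)), key=lambda i: scores[i], reverse=True)
    return [reads[i] for i in order]

reads = ["ACGTA", "ACGTC", "TTTTT"]
scores = read_abundance_scores(reads, k=3)
print(scores)
print(sort_by_median_abundance(reads, k=3))
```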
Input: read set R, log base b, kmer size k
Initialization:
    k' = k + 1
    n = NumberOfDistinctK'mers(R)
    counter(0,...,n) = 0
    Rout = null
Steps:
    for r in R:
        flag = 0
        V' = ObtainK'mers(r)
        for v in V':
            if counter(v) < min(abundance(v), log_b(abundance(v))) then:
                counter(v)++
                flag = 1
            end if
        end for
        if flag != 0 then:
            Rout = Rout U r
        end if
    end for
Output: Rout
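The pseudocode can be turned into a small runnable Python sketch. It follows the steps literally, with two labelled assumptions: the threshold is floored at 1 so that labels seen only once are still retained (log_b(1) = 0 would otherwise drop them, contradicting the stated guarantee), and reverse complements are not merged. It uses an in-memory `Counter` rather than GATB, so it is only suitable for toy inputs.

```python
import math
from collections import Counter

def kmers(seq, k):
    """Return every k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, base, k):
    """Greedy read normalization following the pseudocode above."""
    k1 = k + 1  # labels are (k+1)-mers
    abundance = Counter(km for read in reads for km in kmers(read, k1))
    counter = Counter()
    kept = []
    for read in reads:
        keep = False
        for v in kmers(read, k1):
            target = min(abundance[v], math.log(abundance[v], base))
            target = max(1, target)  # assumption: keep singleton labels
            if counter[v] < target:
                counter[v] += 1
                keep = True
        if keep:
            kept.append(read)
    return kept

reads = ["ACGT"] * 10 + ["CGTA"]
kept = normalize(reads, base=2, k=2)
print(len(reads), "reads reduced to", len(kept))
```

In this toy input the ten identical reads are cut down once their labels reach the logarithmic target, while the single read carrying the rare label GTA is kept.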
- ORNA uses GATB version 1.2.2 to store the kmer information.
- It reduces the abundance of a kmer to a value equal to the logarithmic transformation of the abundance. The base b of the logarithm is provided by the user.
- ORNA was tested with two de Bruijn graph based assemblers, Oases and TransABySS, and also worked for the assembly of metagenomics data.
Points to be noted
- Because ORNA currently retains all kmers from the original dataset, it also retains erroneous kmers. Like any other read reduction tool, ORNA therefore removes more reads when the data has been error corrected. For RNA-seq or other non-uniform data, we suggest using the SEECER algorithm, which has proved to work well with ORNA.
- ORNA-Q, ORNA-K and ORNA's paired-end mode currently do not support multithreading. Work on this is in progress, and support will be included in future versions of ORNA.
For questions or suggestions regarding ORNA contact
- Dilip A Durai (ddurai_at_mmci.uni-saarland.de)
- Marcel H Schulz (mschulz_at_mmci.uni-saarland.de)
The software can be downloaded by using the following command
git clone https://github.com/SchulzLab/ORNA
The downloaded folder should contain the following files and folders:
- gatb-core (initially empty; files are copied in once the install script is run)
- src(folder) (contains the source code for ORNA)
Linux operating system with gcc version >= 4.7
All analyses for the manuscript were performed on the Debian 8 operating system
- Run the following command for installation
- The above command should create a build folder. The executable of ORNA will be in build/bin
| Parameter | Description | Default |
|---|---|---|
| -help | shows the help message | |
| -sorting | (0 or 1) quality based sorting of input data | 0 |
| -ksorting | (0 or 1) kmer abundance based sorting of input data | 0 |
| -base | Base value for the logarithmic function | 1.7 |
| -kmer | the value of k for kmer size | 21 |
| -input | Input fasta file (for single end mode) | |
| -pair1 | First mate of the pair (for paired-end mode) | |
| -pair2 | Second mate of the pair (for paired-end mode) | |
| -output | Output fasta file | "Normalized.fa" |
| -nb-cores | number of cores (does not work for paired end mode) | 1 |
The -kmer parameter sets the kmer size to be used for reduction. Because ORNA aims to preserve all edge labels ((k+1)-mers) from the original dataset, the kmer size given by the user is internally incremented by 1. For instance, if the user provides a kmer size of 21, ORNA uses a kmer size of 22 for all its calculations. All analyses in the paper were done using a kmer size of 21 for reads of length 50 bp and 76 bp. If you are running a DBG assembly afterwards, we recommend using the smallest kmer size used by the assembler. Memory and runtime requirements depend on both the dataset and k.
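The effect of the internal increment can be checked with a little arithmetic. The sketch below uses hypothetical helper names and relies only on the fact that a read of length L contains L - k' + 1 words of length k'.

```python
def internal_kmer_size(user_k):
    """Kmer size used internally: the user's k plus 1 (edge labels)."""
    return user_k + 1

def labels_per_read(read_length, user_k):
    """Number of (k+1)-mer labels in one read; 0 if the read is too short."""
    return max(0, read_length - internal_kmer_size(user_k) + 1)

for length in (50, 76):
    print(length, "bp read, -kmer 21:", labels_per_read(length, 21), "labels")
```

Reads shorter than k + 1 bases contribute no labels at all, which is one reason to keep k small relative to the read length.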
The -base parameter sets the base of the logarithm used to decide the new abundance of a kmer. For instance, if the original abundance of a kmer is 1000 and a base of 10 is selected, the new abundance is set to log_10(1000) = 3. According to the analysis in the ORNA paper, a base of 1.7 seems to be a good compromise between data reduction and a small loss in assembly quality.
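To get a feel for how the base changes the target abundance, the following Python sketch evaluates the logarithm for a kmer of abundance 1000. The helper name `log_target` is hypothetical, and how ORNA rounds the value internally is not specified here, so the raw logarithm is shown.

```python
import math

def log_target(abundance, base):
    """Logarithm of a kmer abundance in the given base."""
    return math.log(abundance) / math.log(base)

for base in (1.7, 2, 10):
    print("base", base, "->", round(log_target(1000, base), 2))
```

Smaller bases give larger targets, so less data is removed; larger bases reduce the dataset more aggressively.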
- To run ORNA, execute the following command from the installation directory
./build/bin/ORNA -input Dataset_name -output Output -base LogBase -kmer kmerSize -nb-cores NumberOfThreads
- Run ORNA in paired-end mode from the installation directory
./build/bin/ORNA -pair1 first_pair -pair2 second_pair -output Output -base LogBase -kmer kmerSize
- For instance, if the dataset to be normalized is named input.fa, the following command normalizes it using a log base of 1.7 and a kmer size of 21
./build/bin/ORNA -input input.fa -output output.fa -base 1.7 -kmer 21 -nb-cores 1
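When ORNA is called from a pipeline, the command lines above can be assembled programmatically. The following Python sketch uses a hypothetical helper, `orna_command`, that only builds the argument list from the documented options; it assumes the binary lives at ./build/bin/ORNA and leaves the actual invocation to the caller.

```python
def orna_command(output, base=1.7, kmer=21, input_file=None,
                 pair1=None, pair2=None, nb_cores=1,
                 binary="./build/bin/ORNA"):
    """Build the ORNA argument list for single-end or paired-end mode."""
    cmd = [binary]
    if input_file is not None:
        # Single-end mode; -nb-cores is only passed here because it does
        # not work in paired-end mode.
        cmd += ["-input", input_file, "-nb-cores", str(nb_cores)]
    elif pair1 is not None and pair2 is not None:
        cmd += ["-pair1", pair1, "-pair2", pair2]
    else:
        raise ValueError("give either input_file or both pair1 and pair2")
    cmd += ["-output", output, "-base", str(base), "-kmer", str(kmer)]
    return cmd

cmd = orna_command("output.fa", input_file="input.fa")
print(" ".join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```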
If you use ORNA in your work please cite:
Durai DA, Schulz MH. In-silico read normalization with set multicover optimization. Bioinformatics 2018.
ORNA uses the GATB library for graph building and k-mer counting. We are thankful for their support.