
The de Bruijn graph (DBG) is one of the most commonly used data structures for the assembly of sequencing data. Reads from the sequencer are chopped into small words of size k (k-mers), which form the nodes of the DBG. Two nodes are connected by an edge if they overlap by k-1 characters. Each edge can be labelled with the (k+1)-mer formed by merging the k-mers of its two nodes. For instance, if an edge connects the nodes ATCG and TCGT, the edge is labelled ATCGT. An assembly is generated by traversing paths in this graph.

With the advances in deep sequencing technologies, assembling high-coverage datasets has become a challenge in terms of memory and runtime requirements. Hence read normalization, a lossy read filtering approach, is gaining a lot of attention. Although current normalization algorithms are efficient, they provide no guarantee of preserving important k-mers that form connections between different genomic regions in the graph, so the resulting assembly may be fragmented. In this work, normalization is phrased as a set multicover problem on reads, and a linear-time heuristic algorithm named ORNA (Optimized Read Normalization Algorithm) is proposed. ORNA normalizes to the minimum number of reads required to retain all labels ((k+1)-mers), and in turn all k-mers and relative label abundances, from the original dataset. Hence, no connections from the original graph are lost and coverage information is preserved.
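To make the node and edge-label definitions above concrete, here is a minimal Python sketch that extracts k-mers and (k+1)-mer labels from a read. The function names `kmers` and `edge_labels` are illustrative and are not part of ORNA.

```python
def kmers(read, k):
    """Return the k-mers of a read in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def edge_labels(read, k):
    """Return the (k+1)-mers of a read. Each one labels the edge between
    two consecutive k-mers that overlap by k-1 characters."""
    return kmers(read, k + 1)


read = "ATCGT"
nodes = kmers(read, 4)         # the two nodes ATCG and TCGT
labels = edge_labels(read, 4)  # the single edge label ATCGT

# The first k and last k characters of a label are the nodes it connects.
assert (labels[0][:4], labels[0][1:]) == (nodes[0], nodes[1])
```

Because every k-mer is a prefix or suffix of some label, retaining all labels also retains all k-mers.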

When to use ORNA

ORNA is read normalization software developed in the spirit of Diginorm. ORNA is computationally inexpensive and guarantees the preservation of all kmers from the original dataset. Because it removes redundancy from the data, it is useful when the user has a high-coverage dataset but not enough computational resources (memory in particular, but also time) to conduct a de novo assembly. It can also be used to merge many sequencing datasets. The user must be aware that using ORNA (or, for that matter, any normalization software) might have a significant impact on the assemblies produced, and that this impact is highly dependent on the dataset.

Enhancements to ORNA

We have implemented two additional options in ORNA to improve the reduction performance using either abundance values of kmers in reads or base quality scores.

ORNA-Q (parameter: -sorting 1):

In this mode, ORNA, apart from preserving all the labels from the original dataset, also maximizes the total read quality score of the normalized dataset. The read quality score of a read is defined as the sum of the Phred qualities of its bases. ORNA-Q sorts the input dataset by read quality score with a counting sort before reduction.
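The scoring and sorting step can be sketched in Python as follows. This is an illustration, not ORNA's code: the function names are hypothetical, the Phred+33 quality encoding is an assumption, and so is placing the highest-scoring reads first.

```python
def read_quality_score(quality_string, offset=33):
    """Sum of the Phred qualities of a read.
    offset=33 assumes Phred+33 (Sanger / Illumina 1.8+) encoding."""
    return sum(ord(c) - offset for c in quality_string)


def sort_reads_by_quality(reads):
    """Counting sort of (sequence, quality_string) pairs.
    Assumption of this sketch: highest read quality score comes first."""
    scores = [read_quality_score(q) for _, q in reads]
    buckets = [[] for _ in range(max(scores, default=0) + 1)]
    for read, score in zip(reads, scores):
        buckets[score].append(read)  # input order is kept within a bucket
    return [read for bucket in reversed(buckets) for read in bucket]


reads = [("ACGA", "!!!!"), ("ACGT", "IIII"), ("ACGC", "5555")]
ordered = sort_reads_by_quality(reads)
```

Counting sort fits here because scores are small non-negative integers, so the sort runs in time linear in the number of reads plus the maximum score.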

ORNA-K (parameter: -ksorting 1)

In this mode, the normalization algorithm maximizes the total read abundance score of the normalized dataset (apart from preserving all labels from the original dataset). The read abundance score of a read is defined as the median of abundances of kmers present in the read. ORNA-K sorts the input dataset using the median kmer abundances of the reads in the dataset and then uses ORNA for reduction.
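A minimal Python sketch of the read abundance score and the sorting step is shown below. The function names are hypothetical, the descending order is an assumption, and Python's built-in sort stands in for whatever sorting procedure ORNA-K uses internally.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return the k-mers of a read in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def sort_reads_by_median_abundance(reads, k):
    """Sort reads by the median abundance of their k-mers.
    Assumption of this sketch: highest score first."""
    abundance = Counter(km for read in reads for km in kmers(read, k))

    def score(read):
        counts = [abundance[km] for km in kmers(read, k)]
        return median(counts) if counts else 0  # reads shorter than k score 0

    return sorted(reads, key=score, reverse=True)


reads = ["CGTAC", "AAAAC", "AAAAA"]
ordered = sort_reads_by_median_abundance(reads, 3)
```

Here AAA occurs five times across the dataset, so both reads dominated by AAA get a median of 5 and move ahead of CGTAC, whose k-mers each occur once.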

ORNA Algorithm

Input : Read set R, LogBase b, kmer size k
Initialization: k'=k+1
                n = NumberOfDistinctK'mers(R)
                counter(0,...,n)=0
                Rout=null
Steps:
        for r in R:
            flag=0
            V'=ObtainK'mers(r)
            for v in V':
               if(counter(v) < min(abundance(v), log_b(abundance(v)))) then:
                 counter(v)++
                 flag=1
               end if
            end for
            if flag!=0 then:
              Rout = Rout U r
            end if
        end for
Output: Rout
  • ORNA uses the GATB version 1.2.2 to store the kmer information
  • It reduces the abundance of a kmer to a value which is equal to the logarithmic transformation of the abundance. The base b of the logarithm is provided by the user.
  • ORNA was tested with two de Bruijn graph based assemblers, namely Oases and TransABySS, and also worked for the assembly of metagenomics data.
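The pseudocode above can be sketched in Python as follows. This is an illustrative in-memory version, not the GATB-based implementation. Rounding the logarithm up and enforcing a minimum threshold of 1 (so that every label is kept at least once) are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def labels(read, k):
    """(k+1)-mers of a read, i.e. the edge labels that are tracked."""
    return [read[i:i + k + 1] for i in range(len(read) - k)]


def threshold(abundance, base):
    """Target abundance of a label.
    Assumptions: the logarithm is rounded up and never drops below 1."""
    return max(1, min(abundance, math.ceil(math.log(abundance, base))))


def normalize(reads, k, base):
    """Keep a read if it still contributes to at least one label
    whose counter is below its threshold."""
    abundance = Counter(l for read in reads for l in labels(read, k))
    counter = Counter()
    kept = []
    for read in reads:
        keep = False
        for label in labels(read, k):
            if counter[label] < threshold(abundance[label], base):
                counter[label] += 1
                keep = True
        if keep:
            kept.append(read)
    return kept


reads = ["ACGTA"] * 10 + ["TTGCA"]
kept = normalize(reads, k=3, base=2)
```

With base 2, the ten copies of ACGTA are reduced to four, since log_2(10) rounds up to 4, while the single TTGCA read is retained, so no label from the input is lost.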

Points to be noted

  • Currently, as ORNA retains all the kmers from the original dataset, it also retains erroneous kmers. Thus ORNA, like any other read reduction tool, removes more reads when the data is error corrected. For RNA-seq or other non-uniform data we suggest the SEECER algorithm, which proved to work well with ORNA.
  • ORNA-Q, ORNA-K and ORNA's paired-end mode currently do not support multithreading. Work on this is in progress and will be included in future versions of ORNA.


Version 0.4


For questions or suggestions regarding ORNA contact

  • Dilip A Durai (
  • Marcel H Schulz (


The software can be downloaded by using the following command

	git clone

The downloaded folder should contain the following files and folders:

  • gatb-core (it will be empty. Files would be copied in once the install script is run)
  • src(folder) (contains the source code for ORNA)


Linux operating system with gcc version >=4.7
All the analysis for the manuscript was performed on the Debian 8 operating system


  • Run the following command for installation
  • The above command should create a build folder. The executable of ORNA will be in build/bin

ORNA parameters

./bin/ORNA -help

short       explanation                                            note
-help       shows the help message
-sorting    (0 or 1) quality based sorting of input data           Default 0
-ksorting   (0 or 1) kmer abundance based sorting of input data    Default 0
-base       Base value for the logarithmic function                Default 1.7
-kmer       the value of k for kmer size                           Default 21
-input      Input fasta file (for single end mode)
-pair1      First mate of the pair (for paired-end mode)
-pair2      Second mate of the pair (for paired-end mode)
-output     Prefix of the output file                              Default "Normalized"
-nb-cores   number of cores (does not work for paired end mode)    Default 1
-type       type of the output file (fasta/fastq)                  Default fasta

kmer value:
This parameter represents the kmer size to be used for reduction. As we aim at preserving all the edge labels ((k+1)-mers) from the original dataset, the kmer size given by the user is internally incremented by 1. For instance, if the user provides a kmer size of 21, then ORNA uses a kmer size of 22 for all its calculations. All the analyses in the paper were done using a kmer size of 21 for reads with lengths of 50bps and 76bps. If you are running a DBG assembly afterwards, we recommend using the smallest k-mer size used in the assembler. Memory and runtime requirements depend on both the dataset and k.

base value:
This parameter represents the base of the logarithm function used to decide the new abundance of a kmer. For instance, if the original abundance of a kmer is 1000 and a base of 10 is selected, then the new abundance is set to log_10(1000) = 3. The higher the base, the greater the reduction in reads. According to the analysis done in the ORNA paper, a base of 1.7 seems to be a good compromise between data reduction and little loss in assembly quality. More examples can be found in this answer.
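The effect of the base can be explored with a few lines of Python. The helper `target_abundance` is hypothetical, and rounding the logarithm up to the next integer is an assumption of this sketch. The text above only states that the new abundance equals the logarithm of the original one.

```python
import math


def target_abundance(abundance, base):
    """New abundance of a kmer for a given log base.
    Assumption: the logarithm is rounded up to the next integer."""
    return math.ceil(math.log(abundance, base))


# Compare how strongly different bases reduce the same abundances.
for base in (1.7, 2, 10):
    row = [target_abundance(a, base) for a in (10, 100, 1000)]
    print(f"base {base}: {row}")
```

Under this rounding assumption, a kmer with abundance 1000 is reduced to 14 with base 1.7 and to 3 with base 10, which shows why a larger base removes more reads.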

Running ORNA

  • To run ORNA, execute the following command from the installation directory
  ./build/bin/ORNA -input Dataset_name -output Output -base LogBase -kmer kmerSize -nb-cores NumberOfThreads -type fasta
  • Run ORNA in paired-end mode from the installation directory
  ./build/bin/ORNA -pair1 first_pair -pair2 second_pair -output Output -base LogBase -kmer kmerSize -type fasta
  • For instance, if the dataset to be normalized is named as input.fa, the following command would normalize the dataset using a log base of 1.7 and a kmer size of 21
  ./build/bin/ORNA -input input.fa -output output.fa -base 1.7 -kmer 21 -nb-cores 1
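When ORNA is called from a script, the single-end command line above can be assembled programmatically. The following Python sketch uses only the flags listed under ORNA parameters. The helper name `orna_command` is hypothetical, and the binary path assumes the script runs from the installation directory.

```python
import subprocess


def orna_command(input_path, output_prefix, base=1.7, kmer=21, cores=1,
                 out_type="fasta", binary="./build/bin/ORNA"):
    """Build the argument list for a single-end ORNA run.
    Defaults mirror the documented parameter defaults."""
    return [binary,
            "-input", input_path,
            "-output", output_prefix,
            "-base", str(base),
            "-kmer", str(kmer),
            "-nb-cores", str(cores),
            "-type", out_type]


cmd = orna_command("input.fa", "output")
print(" ".join(cmd))
# Uncomment on a machine where ORNA has been built:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.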


If you use ORNA in the normal mode (without quality or kmer abundance based sorting) in your work, please cite:

Durai DA, Schulz MH. In-silico read normalization with set multicover optimization. Bioinformatics 2018.

If you use ORNA-Q/K (with quality or kmer abundance based sorting), please cite:

Durai DA, Schulz MH. Improving in-silico normalization using read weights. Scientific Reports 2019.


ORNA uses the GATB library for graph building and k-mer counting. We are thankful for their support.
