De novo transcriptome assembly of short read sequences is an important ingredient of many RNA-seq analyses. Common applications include sequencing of non-model organisms, cancer transcriptomes, and metatranscriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure. A fundamental parameter with a large influence on the quality of DBG-based assemblies is the exact word length k. No single kmer value leads to optimal results, which has led to the wide acceptance of multi-kmer transcriptome assemblers. For these, DBGs are built over different kmer values and the resulting assemblies are merged to improve sensitivity. In most cases, practitioners use a suboptimal selection of kmer values, which results in a suboptimal assembly.
We introduce the KREATION (Kmer Range EstimATION) algorithm. Given a minimum k value to start from, KREATION calculates the contribution of each single-k assembly to the merged assembly. It stops at the kmer value beyond which no further assemblies are required, thereby removing the kmer selection problem from the user and potentially saving hours of runtime on kmers that do not contribute to the final merged assembly.
    Input: read_length l, step_size s, minimum_k km, threshold t
    Initializations:
        k = km
        i = 1
        last = 0
        Tp = null
    Steps:
    repeat
        Tk = Assembly(k)                        # Perform assembly for a single value k
        C = Cluster(Tp, Tk)                     # Cluster the assembly with previous assemblies
        ci = log(extended(C, Tk))               # Calculate (log) number of extended clusters
        M0 = lm((k1,c1),...,(k(i-1),c(i-1)))    # Fit a linear model to the values up to iteration i-1
        p = d_score(M0, last)                   # Compute the d_score for the current iteration
        if (p > t)                              # Check against the cut-off threshold t
            break
        else                                    # Update the variables
            k = k + s
            i++
            Tp = Tp U Tk
            last = last + p
        end if
    until k > l
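The control flow of the pseudocode above can be sketched in Python as follows. This is a minimal illustration and not KREATION's actual implementation: `extended_clusters` is a hypothetical callback standing in for the assembly, clustering and counting steps, and `residual_score` is only a placeholder for the d_score, whose exact definition is not given in this README.

```python
import math

def fit_line(points):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residual_score(history, k, c, last):
    """Placeholder for d_score (ASSUMPTION): distance between the observed
    value c and the value predicted by a line fitted to earlier iterations.
    `last` is accepted to mirror the pseudocode but is not used here."""
    if len(history) < 2:  # a line needs at least two points
        return 0.0
    intercept, slope = fit_line(history)
    return abs(c - (intercept + slope * k))

def select_kmers(read_length, step, min_k, threshold, extended_clusters,
                 d_score=residual_score):
    """Return the k values whose assemblies would be merged before stopping."""
    history = []  # (k, c) pairs from earlier iterations
    merged = []   # k values merged so far (Tp in the pseudocode)
    last = 0.0
    k = min_k
    while k <= read_length:
        c = math.log(extended_clusters(k))  # assumes a positive count
        p = d_score(history, k, c, last)
        if p > threshold:                   # cut-off reached: stop here
            break
        history.append((k, c))
        merged.append(k)
        last += p
        k += step
    return merged

# Made-up counts: a steady log-linear decline, then a sharp drop at k=29.
counts = {21: 1000, 23: 500, 25: 250, 27: 125, 29: 1}
print(select_kmers(35, 2, 21, 0.5, counts.__getitem__))  # → [21, 23, 25, 27]
```

In the toy run the counts halve at every step, so the fitted line predicts each new value almost exactly until k=29, where the observed value falls far below the prediction and the loop stops.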
KREATION has been tested on the following assemblers (see below for configuration)
For questions or suggestions regarding KREATION, please check out the FAQ or contact:
- Dilip A Durai (ddurai_at_contact.mmci.uni-saarland.de)
- Marcel H Schulz (mschulz_at_mmci.uni-saarland.de)
- R (version >=2.14.1)
The software can be downloaded using the following command:
git clone https://github.com/SchulzLab/KREATION
The downloaded folder should contain the following files/folder:
python KREATION.py --help
| Short | Long | Description | Required / default |
|-------|------|-------------|--------------------|
| -h | --help | show the help on screen | |
| -c | --config | path to the config file (only text file) | required parameter |
| -r | --read | read length | required parameter |
| -s | --step | kmer step size for the assembly process | default=2 |
| -o | --out | path to the output directory, directory will be created if non-existent | default=KREATION folder |
| -t | --threshold | Threshold value for d_score | default=0.01 |
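To show how the required options and defaults listed above combine, here is a small `argparse` sketch that mirrors the documented options. It is an illustration only and not taken from KREATION.py: the argument types (`int`, `float`) and the use of `None` for the output directory default are assumptions.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented options (-h/--help is added by argparse)."""
    parser = argparse.ArgumentParser(prog="KREATION.py")
    parser.add_argument("-c", "--config", required=True,
                        help="path to the config file (text file)")
    parser.add_argument("-r", "--read", type=int, required=True,
                        help="read length")
    parser.add_argument("-s", "--step", type=int, default=2,
                        help="kmer step size for the assembly process")
    # ASSUMPTION: None stands in for the documented default "KREATION folder".
    parser.add_argument("-o", "--out", default=None,
                        help="path to the output directory")
    parser.add_argument("-t", "--threshold", type=float, default=0.01,
                        help="threshold value for d_score")
    return parser

# Only the two required options are given; the rest fall back to defaults.
args = build_parser().parse_args(["-c", "config_file.txt", "-r", "35"])
print(args.config, args.read, args.step, args.threshold)
```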
Config file structure
- Line 1: Name of the program to be run
- Line 2: Output file name from the assembly
- Line 3: parameter name and value of the min kmer
- Line 4: Rest of the command
- Line 5: parameter name for the max kmer (leave blank if there is none)
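As an illustration of this structure, the following Python sketch reads such a config and assembles a command line for a given k. It is not KREATION's own parser. It assumes that lines starting with `#` are comments (which is how the example configs in this README appear to be laid out), and the way the command is pieced together in `build_command`, including setting the max kmer parameter to the same k, is an assumption as well.

```python
def parse_config(text):
    """Parse a KREATION-style config, skipping blank lines and '#' comment lines."""
    values = [line.strip() for line in text.splitlines()
              if line.strip() and not line.lstrip().startswith("#")]
    if len(values) < 4:
        raise ValueError("expected program, output file, min kmer and rest of command")
    min_flag, min_k = values[2].split()
    return {
        "program": values[0],
        "output_file": values[1],
        "min_k_flag": min_flag,
        "min_k": int(min_k),
        "rest": values[3],
        "max_k_flag": values[4] if len(values) > 4 else None,
    }

def build_command(cfg, k):
    """Assemble a single-k command line (hypothetical helper)."""
    parts = [cfg["program"], cfg["min_k_flag"], str(k)]
    if cfg["max_k_flag"]:
        # ASSUMPTION: a single-k run pins the max kmer to the same k.
        parts += [cfg["max_k_flag"], str(k)]
    parts.append(cfg["rest"])
    return " ".join(parts)

EXAMPLE = """\
#Program Name
transabyss
#Output file name
transabyss-final.fa
#Minimum K
-k 21
#Rest of the command
--se "/path-to-the-fasta-file/MAQC_Combined.fasta_corrected.fa" --length 100 --threads 10
"""

cfg = parse_config(EXAMPLE)
print(build_command(cfg, 23))
```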
We use the MAQC UHR dataset (SRX016367), downloaded from the SRA run database (http://www.ncbi.nlm.nih.gov/sra/SRX016367[accn]), for this test run. The dataset has been error corrected using the SEECER error correction algorithm.
NOTE: Below we give example config files for some of the assemblers we tested KREATION with. Please consult the manuals of the respective assemblers for the exact parametrization.
Config file for oases assembler
    #Program Name
    oases_pipeline_2.py
    #Output file name
    transcripts.fa
    #Minimum K
    -m 21
    #Rest of the command
    -d "/path-to-the-fasta-file/MAQC_Combined.fasta_corrected.fa" -p ""
If the input file is a fastq file:
    #Program Name
    oases_pipeline_2.py
    #Output file name
    transcripts.fa
    #Minimum K
    -m 21
    #Rest of the command
    -d "-fastq /path-to-the-fasta-file/MAQC_Combined.fq" -p ""
Note: The current version of the Oases pipeline has an inbuilt merge function, and its default value for the max kmer is 31. To avoid both, KREATION requires the modified version of the Oases pipeline (supplied with this package). To use it, type the following command in your terminal.
The modified version does not require a max kmer value and also does not implement the oases merge function.
Config file for SOAPdenovo-Trans assembler
    #Program Name
    SOAPdenovo-Trans-127mer all
    #Output file name
    transcripts.contig
    #Minimum K
    -K 21
    #Rest of the command
    -s /path-to-config-file/example.config -p 4 -o transcripts
Config file for Trans-ABySS
    #Program Name
    transabyss
    #Output file name
    transabyss-final.fa
    #minimum K
    -k 21
    #rest of the command
    --se "/path-to-the-fasta-file/MAQC_Combined.fasta_corrected.fa" --length 100 --threads 10
export PATH=/path-to-cd-hit/:$PATH
python KREATION.py -c config_file.txt -o complete/path/outputDirectory -s 2 -r 35
The output folder should contain three subfolders with the following names:
- Assembly (contains the assembly generated from each kmer)
- Cluster (contains the clustering results)
- Final (contains the final assembly and a report file)
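A quick way to confirm that a run produced this layout is to check for the three subfolders. The helper below is a hypothetical convenience sketch and not part of KREATION. The demo builds a temporary directory with one folder missing instead of pointing at a real output directory.

```python
import tempfile
from pathlib import Path

EXPECTED_SUBFOLDERS = ("Assembly", "Cluster", "Final")

def missing_subfolders(out_dir):
    """Return the expected subfolder names that are absent from out_dir."""
    out = Path(out_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (out / name).is_dir()]

# Demo on a temporary directory in which "Final" has not been created.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Assembly", "Cluster"):
        (Path(tmp) / name).mkdir()
    print(missing_subfolders(tmp))  # → ['Final']
```

An empty list means all three subfolders are present.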
Please cite the paper as:
Durai DA, Schulz MH. (Apr 2016) Informed kmer selection for de novo transcriptome assembly. Bioinformatics doi:10.1093/bioinformatics/btw217