This pipeline assembles organelle genomes from genome skimming data.
Citation: Jian-Jun Jin*, Wen-Bin Yu*, Jun-Bo Yang, Yu Song, Ting-Shuang Yi, De-Zhu Li. 2018. GetOrganelle: a simple and fast pipeline for de novo assembly of a complete circular chloroplast genome using genome skimming data. bioRxiv, 256479. http://doi.org/10.1101/256479
License: GPL https://www.gnu.org/licenses/gpl-3.0.html
Please cite the dependencies if they are used:
SPAdes: Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, A. V. Pyshkin, A. V. Sirotkin, N. Vyahhi, G. Tesler, M. A. Alekseyev and P. A. Pevzner. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455-477.
This pipeline was written in Python 3.5.1, but is also compatible with 2.7.11.
Execute the following git commands to download the latest version (suggested), or find older stable versions here:
# Supposing you are going to install it at ~/Applications/bin
GetOrganellePATH=~/Applications/bin
cd $GetOrganellePATH
git clone https://github.com/Kinggerm/GetOrganelle
then add GetOrganelle to the path:
# for MacOS
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bash_profile
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bash_profile
echo "export PATH" >> ~/.bash_profile
# for Linux
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bashrc
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bashrc
echo "export PATH" >> ~/.bashrc
and make them writable/executable if they are not:
chmod +x $GetOrganellePATH/GetOrganelle/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Utilities/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Library/*.py
chmod +w $GetOrganellePATH/GetOrganelle/Library/*Reference
It is also very IMPORTANT to keep updated (if you find your version is out of date!):
cd $GetOrganellePATH/GetOrganelle
git pull
You can run the main script (get_organelle_reads.py) to get organelle reads (*.fastq) without any third-party libraries or software.
However, to get a complete organelle genome (such as a plastome) rather than organelle reads, the other files in GetOrganelle must be kept in their original relative paths. The following software/libraries must also be installed and added to the PATH, since they are called automatically:
The Python libraries numpy, scipy and sympy are used to solve the assembly graph, and can be installed by typing in:
pip install numpy scipy sympy
SPAdes is the assembler.
Bowtie2 is used to speed up initial recruitment of target-like reads.
BLAST+ is used to filter target-like contigs and simplify the final assembly graph.
Bandage is suggested to view the final contig graph (.fastg/.gfa).
In addition, if you have installed the Python library psutil (pip install psutil), the memory usage of get_organelle_reads.py will be logged automatically.
All you actually need to do is type in one simple command, as suggested in Example. You are still invited to read the following introduction:
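Before a full run, it can help to check that these dependencies are actually visible from your shell. The following is a minimal sketch, not part of GetOrganelle: the executable names (spades.py, bowtie2, blastn, makeblastdb) are assumptions based on the names these tools usually install under, and the script targets Python 3.

```python
import importlib.util
import shutil

# Assumed executable names; adjust them if your installation uses other names.
EXECUTABLES = ["spades.py", "bowtie2", "blastn", "makeblastdb"]
# psutil is optional (only used for memory logging).
PYTHON_MODULES = ["numpy", "scipy", "sympy", "psutil"]


def find_missing(executables=EXECUTABLES, modules=PYTHON_MODULES):
    """Return (missing_executables, missing_modules) for this environment."""
    # shutil.which returns None when a name cannot be found on the PATH.
    missing_exe = [name for name in executables if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be imported.
    missing_mod = [name for name in modules
                   if importlib.util.find_spec(name) is None]
    return missing_exe, missing_mod


if __name__ == "__main__":
    missing_exe, missing_mod = find_missing()
    print("Executables not found on PATH:", ", ".join(missing_exe) or "none")
    print("Python modules not found:", ", ".join(missing_mod) or "none")
```

An empty report only means the names were found; it does not verify versions.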
Currently, this script is written for Illumina paired-end/single-end data (fastq or fastq.gz). 1G per end is enough for the chloroplast genome of most normal angiosperm samples, and 5G per end is enough for the mitochondrial genome. You can simply assign a maximum number of reads (number of sequences, not number of bases) for GetOrganelle to use with the flag --max-reads, or manually cut the raw data to a certain size before running GetOrganelle using a Linux or Mac OS built-in command (e.g. head -n 20000000 large.fq > small.fq).
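Note that head counts lines, not reads: with four lines per FASTQ record, 20000000 lines correspond to 5000000 reads. The same truncation can be sketched in Python, which also handles .gz files directly. This is an illustrative helper, not part of GetOrganelle; the function names are hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
import itertools


def open_maybe_gzip(path, mode):
    """Open a text file, using gzip when the name ends with .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def take_first_reads(src, dst, max_reads):
    """Copy the first max_reads FASTQ records (4 lines each) from src to dst.

    Returns the number of complete records written.
    """
    written_lines = 0
    with open_maybe_gzip(src, "r") as fin, open_maybe_gzip(dst, "w") as fout:
        # islice stops after 4 * max_reads lines or at end of file.
        for line in itertools.islice(fin, 4 * max_reads):
            fout.write(line)
            written_lines += 1
    return written_lines // 4
```

For paired-end data, call it with the same max_reads on both files so that the pairs stay in sync, for example take_first_reads("sample_1.fq.gz", "small_1.fq.gz", 5000000) and likewise for the second file (file names here are placeholders).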
Filtering and Assembly
Taking your input reference (fasta or bowtie index; the default is Library/SeqReference/*.fasta) as the probe, the script recruits target reads in successive rounds (the extending process). You can also use a more closely related reference, which is safer if the sequence quality is bad (say, degraded DNA samples). The word size (set with "-w"), like the kmer in assembly, is crucial to the feasibility and efficiency of this process. The best word size changes from data to data and is affected by read length, read quality, base coverage, organelle DNA percentage and other factors. Since version 1.4.0, if no word size is assigned by the user, GetOrganelle automatically estimates the initial word size based on the data characteristics and adjusts the value ("--auto-wss") according to the behaviour of the extending process. Although the automatically estimated word size ensures neither the best performance nor the best result, you do not need to adjust the value if a complete/circular organelle result is produced, because the circular result by GetOrganelle is generally consistent under different options.
After extending, this script automatically calls SPAdes to assemble the target reads produced by the former step. The best kmer also depends on a wide variety of factors.
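To give an intuition for how recruitment in successive rounds depends on the word size, here is a deliberately simplified toy sketch. It is not GetOrganelle's implementation: it ignores reverse complements, quality, pairing and all optimizations, and the function names are hypothetical. A read is accepted when it shares at least one word of length w with the growing word pool, and accepted reads contribute their own words for the next round.

```python
def words(seq, w):
    """Return the set of all substrings of length w in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def recruit(reads, seed, w, max_rounds=10):
    """Toy recruitment: return sorted indices of reads reached from the seed."""
    pool = words(seed, w)          # words known so far (starts from the probe)
    accepted = set()
    for _ in range(max_rounds):
        # Reads not yet accepted that share at least one word with the pool.
        new = [i for i, read in enumerate(reads)
               if i not in accepted and words(read, w) & pool]
        if not new:                # nothing new recruited: extension has converged
            break
        for i in new:
            accepted.add(i)
            pool |= words(reads[i], w)
    return sorted(accepted)


if __name__ == "__main__":
    seed = "ACGTTGCA"
    reads = ["TTGCAGGC", "AGGCTTAA", "CCCCCCCC"]
    print(recruit(reads, seed, w=4))                # [0, 1]
    print(recruit(reads, seed, w=4, max_rounds=1))  # [0]
    print(recruit(reads, seed, w=6))                # []
```

In this toy, the second read is only reachable through the first one, so it needs a second round, and raising w from 4 to 6 recruits nothing because no read shares a 6-base word with the seed. Real data behave in a far more complex way, but the trade-off has the same direction: a larger word size is stricter, a smaller one is more permissive.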
By default, SPAdes is automatically called and produces the assembly graph file filtered_spades/assembly_graph.fastg. Then, Utilities/slim_fastg.py is called to modify the filtered_spades/assembly_graph.fastg file and produce a new fastg file (assembly_graph.fastg.extend+cp-mt.fastg if -F cp is used) along with a tab-format annotation file (assembly_graph.fastg.extend+cp-mt.csv). The assembly_graph.fastg.extend+cp-mt.fastg file, along with the assembly_graph.fastg.extend+cp-mt.csv file, is then parsed by disentangle_organelle_assembly.py, and your target sequence file(s) *path_sequence.fasta are produced as the final result if disentangle_organelle_assembly.py successfully solves the path. Otherwise, if disentangle_organelle_assembly.py fails to solve the path, you have to manually view assembly_graph.fastg.extend+cp-mt.fastg and load assembly_graph.fastg.extend+cp-mt.csv in Bandage, then choose the best path(s) as the final result.
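A quick way to tell which of the two outcomes you got is to look for *path_sequence.fasta files in the output directory. The sketch below is a hypothetical helper, not part of GetOrganelle. Since the exact subdirectory of the result files is not stated here, it searches the whole output directory recursively, and it assumes Python 3.5 or later for recursive glob.

```python
import glob
import os


def find_outputs(output_dir, suffix):
    """Return sorted files ending with suffix anywhere under output_dir."""
    pattern = os.path.join(output_dir, "**", "*" + suffix)
    return sorted(glob.glob(pattern, recursive=True))


def summarize(output_dir):
    """Return a short message on whether the path was solved automatically."""
    solved = find_outputs(output_dir, "path_sequence.fasta")
    if solved:
        return "Solved path(s):\n" + "\n".join(solved)
    # No final sequence: list graph files that can be opened in Bandage.
    graphs = find_outputs(output_dir, ".fastg")
    return ("No *path_sequence.fasta found; inspect in Bandage:\n"
            + "\n".join(graphs))


if __name__ == "__main__":
    print(summarize("chloroplast_output"))
```

When nothing is solved, the listed .fastg files include the slimmed graph to load in Bandage together with its .csv annotation.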
Here (or here) is a short video showing a standard way to extract the plastome from the assembly graph with Bandage. See here or here for examples of more complicated situations (do not miss 3m01s - 5m53s).
For 2G raw data with 150 bp reads, to assemble the chloroplast genome, I typically use:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_reference.fasta -o chloroplast_output -R 15 -k 75,85,95,105
or, for a quick draft:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o chloroplast_output --fast -k 75,85,95,105
or, in a slower but memory-saving way:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_reference.fasta -o chloroplast_output -R 10 -k 75,85,95,105 --memory-save -a mitochondria.fasta
For 2G raw data with 150 bp reads, to assemble the mitochondrial genome:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s mt_reference.fasta -o mitochondria_output -R 50 -k 65,75,85,95,105 -P 1000000 -F mt
For 2G raw data with 150 bp reads, to assemble nuclear ribosomal RNA (18S-ITS1-5.8S-ITS2-26S):
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s nr_reference.fasta -o nr_output -R 7 -k 95,105,115 -P 0 -F nr
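If you have many samples, the commands above can be generated from a small script. The following is a minimal sketch under stated assumptions: it only uses flags that appear in the examples above (-1, -2, -s, -o, -R, -k, plus any extras you pass through), the file naming scheme <sample>_1.fq / <sample>_2.fq is hypothetical, and get_organelle_reads.py must already be on the PATH for run_sample to work.

```python
import subprocess


def build_command(sample, reference, out_dir, rounds=15,
                  kmers=(75, 85, 95, 105), extra=()):
    """Build the argument list for one paired-end sample."""
    return [
        "get_organelle_reads.py",
        "-1", "{}_1.fq".format(sample),    # assumed naming of forward reads
        "-2", "{}_2.fq".format(sample),    # assumed naming of reverse reads
        "-s", reference,
        "-o", out_dir,
        "-R", str(rounds),
        "-k", ",".join(str(k) for k in kmers),
    ] + list(extra)


def run_sample(sample, reference, out_dir, **kwargs):
    """Run GetOrganelle for one sample; raises CalledProcessError on failure."""
    subprocess.run(build_command(sample, reference, out_dir, **kwargs),
                   check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, as a dry run.
    for name in ["sample"]:
        cmd = build_command(name, "cp_reference.fasta", "chloroplast_output")
        print(" ".join(cmd))
```

With the defaults, the dry run prints the first chloroplast command shown above. For the mitochondrial example, pass rounds=50, kmers=(65, 75, 85, 95, 105) and extra=("-P", "1000000", "-F", "mt").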
See illustrations of those arguments by typing in:
Published Works Using GetOrganelle
Yu Song, Wen-Bin Yu, Yun-Bong Tan, Bing Liu, Xin Yao, Jian-Jun Jin, Michael Padmanaba, Jun-Bo Yang, Richard T. Corlett. 2017. Evolutionary comparisons of the chloroplast genome in Lauraceae and insights into loss events in the Magnoliids. Genome Biology and Evolution 9(9): 2354-2364. doi: https://doi.org/10.1093/gbe/evx180
Twyford AD, Ness RW. 2017. Strategies for complete plastid genome sequencing. Molecular Ecology Resources 17(5): 858-868. doi: https://doi.org/10.1111/1755-0998.12626
Guan-Song Yang, Yin-Huan Wang, Yue-Hua Wang, Shi-Kang Shen. 2017. The complete chloroplast genome of a vulnerable species Champereia manillana (Opiliaceae). Conservation Genetics Resources 9(3): 415-418. doi: https://doi.org/10.1007/s12686-017-0697-1
Thanks to Chao-Nan Fu, Han-Tao Qin, Xiao-Jian Qu, Shuo Wang, and Rong Zhang for testing and suggestions.