Allio, R., Schomaker-Bastos, A., Romiguier, J., Prosdocimi, F., Nabholz, B., & Delsuc, F. (2020) Mol Ecol Resour. 20, 892-905. (publication link)
MitoFinder is a pipeline to assemble mitochondrial genomes and annotate mitochondrial genes from trimmed read sequencing data.
MitoFinder is also designed to find and annotate mitochondrial sequences in existing genomic assemblies (generated from Hifi/PacBio/Nanopore/Illumina sequencing data...)
MitoFinder is distributed under the license.
This container is suitable for all systems with Singularity version >= 3.0 installed.
- Installation guide for MitoFinder
- How to use MitoFinder's container
- Detailed options
- INPUT FILES
- OUTPUT FILES
- Particular cases
- UCE annotation
- How to cite MitoFinder
- How to get reference mitochondrial genomes from ncbi
- How to submit your annotated mitochondrial genome(s) to GenBank NCBI
To run the container, you just need to have Singularity version >= 3.0 installed in your system.
To check if a right version of Singularity is actually installed:
singularity version
System dependencies:
sudo apt-get update && sudo apt-get install -y \
build-essential \
libssl-dev \
uuid-dev \
libgpgme11-dev \
squashfs-tools \
libseccomp-dev \
wget \
pkg-config \
git
Go (v>=1.13):
export VERSION=1.17 OS=linux ARCH=amd64
cd /tmp
wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz
sudo tar -C /usr/local -xzf go$VERSION.$OS-$ARCH.tar.gz
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc
echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc
source ~/.bashrc
Get and install singularity:
export VERSION=3.10.1 && # adjust this as necessary \
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz && \
tar -xzf singularity-ce-${VERSION}.tar.gz && \
cd singularity-ce-3.10.1 && ./mconfig
cd builddir && make
sudo make install
Check singularity version:
singularity version
The singularity container is available here. Since you have Singularity version >= 3.0 installed, you can clone MitoFinder's container directly using singularity with the following command:
singularity pull --arch amd64 library://remiallio/default/mitofinder:v1.4.2
and then run it as follows:
singularity run mitofinder_v1.4.2.sif -h
or:
mitofinder_v1.4.2.sif -h
cd PATH/TO/MITOFINDER/
p=$(pwd)
echo -e "\n#Path to MitoFinder \nexport PATH=\$PATH:$p" >> ~/.bashrc
source ~/.bashrc
WARNING: If you previously installed MitoFinder on your system and want to install a new version, you should replace the old MitoFinder PATH by the updated one in your ~/.bashrc file. To do so, you need to edit your ~/.bashrc file, remove the lines that add MitoFinder to the PATH, and close your terminal. Then, you should open a new terminal and re-execute the command lines from above.
TIP: If you are connected to cluster, you can use either nano
or vi
to edit the ~/.bashrc file.
To check if the right version of MitoFinder is actually in your PATH:
mitofinder_v1.4.2.sif -v
First, you can choose the assembler using the following options:
-- megahit (default: faster)
-- metaspades (recommended: a bit slower but more efficient (see associated paper). WARNING: Not compatible with single-end reads)
-- idba
Second, you can choose the tool for the tRNA annotation step of MitoFinder using the -t option:
-t mitfi (default, MiTFi: slower but really efficient)
-t arwen (ARWEN: faster)
-t trnascan (tRNAscan-SE)
TIP: use MitoFinder_v1.4.2 --example to print basic usage examples
mitofinder_v1.4.2.sif -j [seqid] -1 [left_reads.fastq.gz] -2 [right_reads.fastq.gz] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
mitofinder_v1.4.2.sif -j [seqid] -s [SE_reads.fastq.gz] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
MitoFinder can also be run directly on a previously computed assembly (one or several contig.s in fasta format)
mitofinder_v1.4.2.sif -j [seqid] -a [assembly.fasta] -r [genbank_reference.gb] -o [genetic_code] -p [threads] -m [memory]
Use the same command line.
WARNING: If you want to compute the assembly again (for example because it failed) you have to remove the assembly results' directory (--override option). If not, MitoFinder will skip the assembly step.
Depending on the proximity of your reference, you can play with the following parameters : nWalk; --blast-eval; --blast-identity-nucl; --blast-identity-prot; --blast-size
If you want to run a test with assembly and annotation steps:
cd PATH/TO/MITOFINDER_CONTAINER/test_case/
../mitofinder_v1.4.2.sif -j Aphaenogaster_megommata_SRR1303315 -1 Aphaenogaster_megommata_SRR1303315_R1_cleaned.fastq.gz -2 Aphaenogaster_megommata_SRR1303315_R2_cleaned.fastq.gz -r reference.gb -o 5 -p 5 -m 10
If you want to run a test with only the annotation step:
cd PATH/TO/MITOFINDER_CONTAINER/test_case/
../mitofinder_v1.4.2.sif -j Hospitalitermes_medioflavus_NCBI -a Hospitalitermes_medioflavus_NCBI.fasta -r Hospitalitermes_medioflavus_NCBI.gb -o 5
usage: mitofinder_v1.4.2.sif [-h] [--megahit] [--idba] [--metaspades] [-t TRNAANNOTATION]
[-j PROCESSNAME] [-1 PE1] [-2 PE2] [-s SE] [-c CONFIG]
[-a ASSEMBLY] [-m MEM] [-l SHORTESTCONTIG]
[-p PROCESSORSTOUSE] [-r REFSEQFILE] [-e BLASTEVAL]
[-n NWALK] [--override] [--adjust-direction] [--ignore]
[--new-genes] [--allow-intron] [--numt]
[--intron-size INTRONSIZE] [--max-contig MAXCONTIG]
[--cds-merge] [--out-gb] [--min-contig-size MINCONTIGSIZE]
[--max-contig-size MAXCONTIGSIZE] [--rename-contig RENAME]
[--blast-identity-nucl BLASTIDENTITYNUCL]
[--blast-identity-prot BLASTIDENTITYPROT]
[--blast-size ALIGNCUTOFF] [--circular-size CIRCULARSIZE]
[--circular-offset CIRCULAROFFSET] [-o ORGANISMTYPE] [-v]
[--example] [--citation]
Mitofinder is a pipeline to assemble and annotate mitochondrial DNA from
trimmed sequencing reads.
optional arguments:
-h, --help show this help message and exit
--megahit Use Megahit for assembly. (Default)
--idba Use IDBA-UD for assembly.
--metaspades Use MetaSPAdes for assembly.
-t TRNAANNOTATION, --tRNA-annotation TRNAANNOTATION
"arwen"/"mitfi"/"trnascan" tRNA annotater to use.
Default = mitfi
-j PROCESSNAME, --seqid PROCESSNAME
Sequence ID to be used throughout the process
-1 PE1, --Paired-end1 PE1
File with forward paired-end reads
-2 PE2, --Paired-end2 PE2
File with reverse paired-end reads
-s SE, --Single-end SE
File with single-end reads
-c CONFIG, --config CONFIG
Use this option to specify another Mitofinder.config
file.
-a ASSEMBLY, --assembly ASSEMBLY
File with your own assembly
-m MEM, --max-memory MEM
max memory to use in Go (MEGAHIT or MetaSPAdes)
-l SHORTESTCONTIG, --length SHORTESTCONTIG
Shortest contig length to be used (MEGAHIT). Default =
100
-p PROCESSORSTOUSE, --processors PROCESSORSTOUSE
Number of threads Mitofinder will use at most.
-r REFSEQFILE, --refseq REFSEQFILE
Reference mitochondrial genome in GenBank format
(.gb).
-e BLASTEVAL, --blast-eval BLASTEVAL
e-value of blast program used for contig
identification and annotation. Default = 0.00001
-n NWALK, --nwalk NWALK
Maximum number of codon steps to be tested on each
size of the gene to find the start and stop codon
during the annotation step. Default = 5 (30 bases)
--override This option forces MitoFinder to override the previous
output directory for the selected assembler.
--adjust-direction This option tells MitoFinder to adjust the direction
of selected contig(s) (given the reference).
--ignore This option tells MitoFinder to ignore the non-
standart mitochondrial genes.
--new-genes This option tells MitoFinder to try to annotate the
non-standard animal mitochondrial genes (e.g. rps3 in
fungi). If several references are used, make sure the
non-standard genes have the same names in the several
references
--allow-intron This option tells MitoFinder to search for genes with
introns. Recommendation : Use it on mitochondrial
contigs previously found with MitoFinder without this
option.
--numt This option tells MitoFinder to search for both
mitochondrial genes and NUMTs. Recommendation : Use it
on nuclear contigs previously found with MitoFinder
without this option.
--intron-size INTRONSIZE
Size of intron allowed. Default = 5000 bp
--max-contig MAXCONTIG
Maximum number of contigs matching to the reference to
keep. Default = 0 (unlimited)
--cds-merge This option tells MitoFinder to not merge the exons in
the NT and AA fasta files.
--out-gb Do not create annotation output file in GenBank
format.
--min-contig-size MINCONTIGSIZE
Minimum size of a contig to be considered. Default =
1000
--max-contig-size MAXCONTIGSIZE
Maximum size of a contig to be considered. Default =
25000
--rename-contig RENAME
"yes/no" If "yes", the contigs matching the
reference(s) are renamed. Default is "yes" for de novo
assembly and "no" for existing assembly (-a option)
--blast-identity-nucl BLASTIDENTITYNUCL
Nucleotide identity percentage for a hit to be
retained. Default = 50
--blast-identity-prot BLASTIDENTITYPROT
Amino acid identity percentage for a hit to be
retained. Default = 40
--blast-size ALIGNCUTOFF
Percentage of overlap in blast best hit to be
retained. Default = 30
--circular-size CIRCULARSIZE
Size to consider when checking for circularization.
Default = 45
--circular-offset CIRCULAROFFSET
Offset from start and finish to consider when looking
for circularization. Default = 200
-o ORGANISMTYPE, --organism ORGANISMTYPE
Organism genetic code following NCBI table (integer):
1. The Standard Code 2. The Vertebrate Mitochondrial
Code 3. The Yeast Mitochondrial Code 4. The Mold,
Protozoan, and Coelenterate Mitochondrial Code and the
Mycoplasma/Spiroplasma Code 5. The Invertebrate
Mitochondrial Code 6. The Ciliate, Dasycladacean and
Hexamita Nuclear Code 9. The Echinoderm and Flatworm
Mitochondrial Code 10. The Euplotid Nuclear Code 11.
The Bacterial, Archaeal and Plant Plastid Code 12. The
Alternative Yeast Nuclear Code 13. The Ascidian
Mitochondrial Code 14. The Alternative Flatworm
Mitochondrial Code 16. Chlorophycean Mitochondrial
Code 21. Trematode Mitochondrial Code 22. Scenedesmus
obliquus Mitochondrial Code 23. Thraustochytrium
Mitochondrial Code 24. Pterobranchia Mitochondrial
Code 25. Candidate Division SR1 and Gracilibacteria
Code
-v, --version Version 1.4.2
--example Print getting started examples
--citation How to cite MitoFinder
MitoFinder needs several files to run depending on the method you have choosen (see above):
- Reference_file.gb containing at least one mitochondrial genome of reference extracted from NCBI
- left_reads.fastq.gz containing the left reads of paired-end sequencing
- right_reads.fastq.gz containing the right reads of paired-end sequencing
- SE_reads.fastq.gz containing the reads of single-end sequencing
- assembly.fasta containing the assembly on which MitoFinder have to find and annotate mitochondrial contig.s
MitoFinder returns several files for each mitochondrial contig found:
- [Seq_ID]_final_genes_NT.fasta containing the nucleotides sequences of the final genes selected from all contigs found by MitoFinder
- [Seq_ID]_final_genes_AA.fasta containing the amino acids sequences of the final genes selected from all contigs found by MitoFinder
- [Seq_ID]_mtDNA_contig.fasta containing a mitochondrial contig
- [Seq_ID]_mtDNA_contig.gff containing the final annotation for a given contig (GFF3 format)
- [Seq_ID]_mtDNA_contig.tbl containing the final annotation for a given contig (Genbank submission format)
- [Seq_ID]_mtDNA_contig.gb containing the final annotation for a given contig (Genbank format for visualization)
- [Seq_ID]_mtDNA_contig_genes_NT.fasta containing the nucleotide sequences of annotated genes for a given contig
- [Seq_ID]_mtDNA_contig_genes_AA.fasta containing the amino acids sequences of annotated genes for a given contig
- [Seq_ID]_mtDNA_contig.png schematic representation of the annotation of the mtDNA contig
- [Seq_ID]_mtDNA_contig.infos containing the initial contig name, the length of the contig and the GC content
/!\ Close reference required /!\
For the particular cases below, we recommend using MitoFinder in two different steps. First, you can use it to assemble and/or identify mitochondrial-like contigs, then use it in a second step to annotate these particular contigs (option -a) with the corresponding additional options.
Also, these options are recommended for cases in which a (really) close reference is available.
/!\ Close reference required /!\
In some taxa (e.g. fungi), it's possible to find mitochondrial genes containing intron(s). In these cases, we add the --allow-intron option (combined with --intron-size and --cds-merge). However, it is important to note that, despite the search for start and stop codons is functional for this option, there is no search for intronic boundaries. The exon annotation is based only on the similarity with the reference. That's why a close reference is necessary and even with a good reference, we recommend to double check the exon annotation.
/!\ Close reference required /!\
Once you have identified nuclear contigs that may contain NUMTs, you can use MitoFinder to find the NUMTs using the --numt option. Basically, this option allows MitoFinder to find the same gene several times in a contig. Given that the NUMTs can be full of stop codons, we recommand to limit the number of walks (--nwalk 0) that MitoFinder can do to improve the annotation (looking for start and stop codons).
MitoFinder starts by assembling both mitochondrial and nuclear reads using de novo metagenomic assemblers. It is only in a second step that mitochondrial contigs are identified and extracted. MitoFinder thus provides UCE contigs that are already assembled and the annotation can be done from the following file:
- [Seq_ID]_link_[assembler].scafSeq containing all assembled contigs from raw reads.
To do so, we recommend the use of the PHYLUCE pipeline which is specifically designed to annotate ultraconserved elements (Faircloth 2015; Tutorial: https://phyluce.readthedocs.io/en/latest/tutorial-one.html#finding-uce-loci).
You can thus use the file [Seq_ID]_link_[assembler].scafSeq and start the Phyluce pipeline at the "Finding UCE" step.
If you use MitoFinder, please cite:
- Allio, R, Schomaker‐Bastos, A, Romiguier, J, Prosdocimi, F, Nabholz, B, Delsuc, F. (2020). MitoFinder: Efficient automated large‐scale extraction of mitogenomic data in target enrichment phylogenomics. Mol Ecol Resour. 20, 892-905. https://doi.org/10.1111/1755-0998.13160
If you use this container, please also cite:
- Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017). Singularity: Scientific containers for mobility of compute. PloS one, 12(5), e0177459.
Please also cite the following references depending on the option chosen for the assembly step in MitoFinder:
- Li, D., Luo, R., Liu, C. M., Leung, C. M., Ting, H. F., Sadakane, K., Yamashita, H. & Lam, T. W. (2016). MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102(6), 3-11.
- Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834.
- Peng, Y., Leung, H. C., Yiu, S. M., & Chin, F. Y. (2012). IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428.
For tRNAs annotation, depending on the option chosen:
- MiTFi: Juhling, F., Putz, J., Bernt, M., Donath, A., Middendorf, M., Florentz, C., & Stadler, P. F. (2012). Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements. Nucleic acids research, 40(7), 2833-2845.
- Laslett, D., & Canback, B. (2008). ARWEN: a program to detect tRNA genes in metazoan mitochondrial nucleotide sequences. Bioinformatics, 24(2), 172-175.
- Chan, P. P., & Lowe, T. M. (2019). tRNAscan-SE: searching for tRNA genes in genomic sequences. In Gene Prediction (pp. 1-14). Humana, New York, NY.
For UCEs extraction:
- Faircloth, B. C. (2016). PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics, 32(5), 786-788.
- Go to NCBI
- Select "Nucleotide" in the search bar
- Search for mitochondrion genomes:
- RefSeq (if available)
- Sequence length from 12000 to 20000
- Download complete record in GenBank format
Depending on the proximity of your reference, you can play with the following parameters : nWalk; --blast-eval; --blast-identity-nucl; --blast-identity-prot; --blast-size
- Install entrez-direct utilities (instructions here, or using
conda -c bioconda entrez-direct
) - Use the following in a shell to batch download all mitochondrial genomes associated with Carnivora:
taxa=Carnivora
esearch -db nuccore -query "\"mitochondrion\"[All Fields] AND (\"${taxa}\"[Organism]) AND (refseq[filter] AND mitochondrion[filter] AND (\"12000\"[SLEN] : \"20000\"[SLEN]))" | efetch -format gbwithparts > reference.gb
If you have few mitochondrial genomes to submit, you should be able to do it with BankIt through the NCBI submission portal.
If you want to submit several complete or partial mitogenomes, we designed MitoFinder to strealine the submission process using tbl2asn.
tbl2asn requires:
- Template file containing a text ASN.1 Submit-block object (suffix .sbt). Create submission template.
- Nucleotide sequence data containing the mitochondrial sequence(s) and associated information (suffix .fsa).
- Feature Table containing annotation information for the mitochondrial sequence(s).
- Comment file containing assembly and annotation method information (assembly.cmt). Create comment template
Because tbl2asn requires the FASTA file to contain information associated with the data, we wrote a script to create a FASTA file containing the mitochondrial contig(s) found by MitoFinder for each species (Seq_ID) with the associated information. This script and the associated example files can be found in the MitoFinder directory named "NCBI_submission".
- index_file.csv A CSV file (comma-delimited table) containing the metadata information.
The headers of the index file are as follows: Directory path, Seq ID, organism, location, mgcode, SRA, keywords ...
The first two columns are mandatory and the names cannot be changed but you can complete the index file with the different source modifiers of NCBI by adding columns in the index file.
The directory path correponds to the path where the [Seq_ID]_mtDNA_contig.fasta file, or [Seq_ID]mtDNA_contig*.fasta files if you have several contigs for the same individual, could be found. If left blank, the script will search for the contig in the directory where you run the script from (./).
/PATH/TO/MITOFINDER_CONTAINER/NCBI_submission/create_tbl2asn_files.py -i index_file.csv
TIPS:
(1) You can copy or link (symbolic links) all your FASTA and TBL contig files in the same directory and run the script from this directory.
(2) You can leave blanks in the index file if some species do not need a given source modifier.
- [Seq_ID].fsa new FASTA file containing all mtDNA contigs and the information for a given [Seq_ID]
- [Seq_ID].tbl new TBL file containing all mtDNA contigs and the information for a given [Seq_ID]
Once your FASTA and TBL files have been created, you can run tbl2asn (download here: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/) as follows:
tbl2asn -t template.sbt -i [Seq_ID].fsa -V v -w assembly.cmt -a s
This command will create several files:
- [Seq_ID].sqn Submission file (.sqn) to be sent by e-mail to gb-sub@ncbi.nlm.nih.gov
- [Seq_ID].val Containing ERROR and WARNING values associated with tbl2asn. (ERROR explanations here)
If you don't have any error and you are happy with the annotation, you can submit your mitochondrial contig(s) by sending the .sqn files to gb-sub@ncbi.nlm.nih.gov