
What is GToTree?

Mike Lee edited this page Mar 15, 2024 · 58 revisions

Phylogenomics is the practice of inferring evolutionary relationships at something closer to the genome level than individual-gene trees (see this page for a little more on this, or this video for a bit on phylogenetics vs phylogenomics). It is becoming an increasingly essential step in many biologists’ work.

GToTree is a program that aims to give more researchers the capability to generate phylogenomic trees to help guide their work. At its heart it just takes in genomes and outputs an alignment and phylogenomic tree based on the specified single-copy gene set. But I think a part of its value comes from:

  1. its flexibility with regard to input-genome format (taking nucleotide/amino-acid fasta files, GenBank files, and/or NCBI accessions), and how it will retrieve and handle each reference genome we want to include for us (rather than us having to find and download each one ourselves)
  2. its automation of required between-tool tasks such as filtering hits by gene length, filtering out genomes with too few hits to the target genes, aligning and trimming all targets individually before concatenating them together, and swapping genome labels for something more useful (e.g. lineage info at desired ranks rather than just accessions, and/or appended identifiers of characteristics we care about) so we can more easily navigate and explore the final alignment and tree
  3. its scalability – GToTree can turn 200 genomes into a tree in 5 minutes on a standard laptop, making iterating phylogenomic trees a cinch :)

GToTree also comes packaged with 15 single-copy gene sets suitable for phylogenomic analysis of different major taxa.

The open-access Bioinformatics Journal publication is available here.


A quick conda installation can be run like so:

# installing mamba if needed first (for faster conda installs)
conda install -n base -c conda-forge mamba
mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree

Overview

Presented above is an overview image of the processing GToTree does, with more details below. For practical ways GToTree can be helpful, check out the Example-usage page. And for detailed information on using GToTree, see the User guide.

Input files - any combination of fasta files (nucleotide or amino acid), GenBank files, and/or NCBI assembly accessions

  • fasta files - will identify coding sequences (CDSs) with prodigal if nucleotide
  • GenBank files - will extract CDSs if they are annotated in the GenBank file; if not, will identify them with prodigal
  • NCBI assembly accessions - downloads NCBI assembly summary files and builds ftp links to the appropriate assembly; attempts to download just the amino acid (AA) sequences of CDSs if annotations exist for it, and if not, downloads the assembly in fasta format and identifies CDSs with prodigal. Examples of generating this accessions file from NCBI (via either the site or the command line), as well as examples of searching and using the stellar Genome Taxonomy Database with helper GToTree programs, are shown on the examples page
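To make the link-building step for NCBI accessions a little more concrete, here is a minimal sketch of deriving download links from the base path listed for an assembly in the `ftp_path` column of an NCBI assembly summary file. The function names are hypothetical and this is not necessarily how GToTree does it internally; it only relies on the NCBI convention that files in an assembly directory are prefixed with the directory's own name.

```shell
# Hypothetical helpers (not GToTree code): build download links from an
# assembly's base path, as found in the "ftp_path" column of an NCBI
# assembly summary file.

# Link to the amino-acid sequences of annotated CDSs.
build_protein_link() {
    local base_path="${1%/}"          # drop a trailing slash if present
    local prefix="${base_path##*/}"   # last path component, e.g. GCF_..._ASM584v2
    printf '%s/%s_protein.faa.gz\n' "$base_path" "$prefix"
}

# Link to the nucleotide assembly, for when no annotations are available.
build_genomic_link() {
    local base_path="${1%/}"
    local prefix="${base_path##*/}"
    printf '%s/%s_genomic.fna.gz\n' "$base_path" "$prefix"
}

example_path="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
build_protein_link "$example_path"
build_genomic_link "$example_path"
```

If the protein file does not exist for an assembly, the genomic link would be the fallback, with CDSs then called by prodigal as described above.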

Identify target genes

  • GToTree then uses HMMER3 to search each genome for the target genes specified by the provided HMM file
    • 15 of these are provided with the software, listed on the SCG-sets page
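As a rough illustration of what comes out of this search, the sketch below counts hits per target gene from a table in the format written by `hmmsearch --tblout`. The helper name and the tiny demo table are made up for illustration; the only things assumed about the format are that comment lines start with `#`, column 1 holds the matched sequence, and column 3 holds the name of the HMM that was searched.

```shell
# Hypothetical helper (not GToTree code): count hits per target gene in a
# table written by `hmmsearch --tblout`.
count_hits_per_target() {
    awk '!/^#/ && NF > 0 { hits[$3]++ }
         END { for (target in hits) print target "\t" hits[target] }' "$1" | LC_ALL=C sort
}

# Made-up, truncated table for demonstration (real tables have more columns).
demo_tbl=$(mktemp)
cat > "$demo_tbl" <<'EOF'
# target name  accession  query name  accession  E-value
gene_0007      -          SCG_A       -          1.2e-50
gene_0112      -          SCG_B       -          3.4e-61
gene_0954      -          SCG_B       -          8.0e-12
EOF

# Prints one line per target gene: its name, a tab, and its number of hits.
count_hits_per_target "$demo_tbl"
rm -f "$demo_tbl"
```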

Estimate genome completeness/redundancy

  • using the information from the HMM search, reports estimates of % completeness and redundancy for each genome, and outputs a table of hits per target gene per genome (this is only a rough overview; there are much better methods now for estimating genome quality, like CheckM2)
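Here is a minimal sketch of how such estimates could be computed from per-target hit counts. The definitions used are assumptions made for this sketch (completeness as the percent of target genes with at least one hit, redundancy as the percent with more than one hit), and the helper name is hypothetical; see the User guide for how GToTree itself reports these values.

```shell
# Hypothetical helper (not GToTree code). Input: a two-column, tab-delimited
# table of target gene and number of hits in one genome, with one row per
# target gene (zeros included).
# Assumed definitions: completeness = % of targets with >= 1 hit,
#                      redundancy   = % of targets with >= 2 hits.
estimate_completeness_redundancy() {
    awk -F '\t' '
        { total++; if ($2 >= 1) found++; if ($2 >= 2) multi++ }
        END {
            if (total == 0) exit 1
            printf "%.2f\t%.2f\n", 100 * found / total, 100 * multi / total
        }' "$1"
}

# Made-up counts for four target genes in one genome.
demo_counts=$(mktemp)
printf 'SCG_A\t1\nSCG_B\t2\nSCG_C\t0\nSCG_D\t1\n' > "$demo_counts"

# Prints completeness, then redundancy, separated by a tab.
estimate_completeness_redundancy "$demo_counts"
rm -f "$demo_counts"
```

With the made-up counts above, three of the four targets are found and one has multiple hits, so this gives 75% completeness and 25% redundancy.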

Optionally identify additional target genes of interest

Filter gene hits and genomes

  • filter out gene hits based on length - for each target gene, get the median length of all hits in that set, and filter out those whose length is not within a certain range of that median (20% by default)
  • filter out genomes if they do not have hits to at least a certain fraction of the total genes searched (50% by default)
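The two filters above can be sketched roughly as follows. This is an illustration under stated assumptions rather than GToTree's actual code: the function names and input layout are hypothetical, the length window is taken as inclusive on both sides of the median, and the defaults simply mirror the 20% and 50% values mentioned above.

```shell
# Hypothetical helper: keep hits whose length is within a fraction of the
# median length for one target gene.
# Input: tab-delimited file of sequence ID and length. Output: IDs retained.
filter_by_length() {
    local lengths_tab="$1" cutoff="${2:-0.2}"
    local median
    median=$(cut -f 2 "$lengths_tab" | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }')
    awk -F '\t' -v med="$median" -v frac="$cutoff" '
        $2 >= med * (1 - frac) && $2 <= med * (1 + frac) { print $1 }' "$lengths_tab"
}

# Hypothetical helper: succeed (exit 0) if a genome has hits to at least the
# given fraction of the target genes searched.
keep_genome() {
    awk -v h="$1" -v t="$2" -v f="${3:-0.5}" 'BEGIN { exit !(t > 0 && h / t >= f) }'
}

# Made-up lengths for one target gene: the median is 100, so 300 and 40 are dropped.
demo_lengths=$(mktemp)
printf 'g1\t100\ng2\t110\ng3\t95\ng4\t300\ng5\t40\n' > "$demo_lengths"
filter_by_length "$demo_lengths"
rm -f "$demo_lengths"

# A genome with hits to 40 of 76 target genes passes the 50% default.
keep_genome 40 76 && echo "genome retained"
```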

Add needed gap-sequences

  • adds the appropriate-sized gap-sequences for target genes that are missing from genomes being retained in the analysis
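For illustration, a gap-only entry for a genome missing a given target gene could be generated like this. The helper name is hypothetical, and the sketch assumes the caller already knows the width needed for that gene so that all rows line up once everything is concatenated.

```shell
# Hypothetical helper (not GToTree code): print a fasta entry made up entirely
# of gap characters, for a genome that is missing a given target gene.
# Arguments: genome ID, number of gap characters.
make_gap_entry() {
    awk -v id="$1" -v n="$2" 'BEGIN {
        printf ">%s\n", id
        for (i = 0; i < n; i++) printf "-"
        printf "\n"
    }'
}

# Prints a header line for genome_X followed by a line of 12 gap characters.
make_gap_entry genome_X 12
```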

Align, trim, concatenate

  • align each gene set with Muscle
  • perform automated trimming with Trimal
  • concatenate all together into full alignment
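The concatenation step can be pictured with the minimal sketch below; alignment and trimming themselves are left to Muscle and TrimAl. The function name is hypothetical, and the sketch assumes every per-gene alignment file uses the same genome IDs as headers (which is what adding gap-sequences for missing genes makes possible).

```shell
# Hypothetical helper (not GToTree code): concatenate per-gene alignments
# (fasta files given as arguments, in order) into one entry per genome.
# Assumes each file contains the same set of genome IDs as headers.
concat_alignments() {
    awk '
        /^>/ { id = substr($0, 2); if (!(id in seq)) order[++n] = id; next }
        { seq[id] = seq[id] $0 }
        END { for (i = 1; i <= n; i++) printf ">%s\n%s\n", order[i], seq[order[i]] }
    ' "$@"
}

# Two tiny made-up alignments; genome B is all gaps for the second gene.
gene1=$(mktemp); gene2=$(mktemp)
printf '>A\nMK-L\n>B\nMKAL\n' > "$gene1"
printf '>A\nGG\n>B\n--\n' > "$gene2"

# Prints one entry per genome, with the rows from each gene joined end to end.
concat_alignments "$gene1" "$gene2"
rm -f "$gene1" "$gene2"
```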

Add taxonomic information

  • Add either NCBI or GTDB taxonomic information to the labels that will be displayed on the tree

Add additional information to the labels

  • a two- or three-column tab-delimited mapping file can be provided with either the NCBI accession or input file name in column 1 (depending on input source), and the desired genome label in column 2, and/or text to append to the label in column 3 (not all input genomes need to be provided)
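As a small example of what such a mapping file could look like, the sketch below writes one and runs a basic sanity check on it. The file contents, labels, and the checking function are all made up for illustration (the check is not a GToTree feature); it only verifies that each line has two or three tab-delimited columns and a non-empty first column.

```shell
# Hypothetical helper (not a GToTree feature): check that each line of a
# mapping file has 2 or 3 tab-delimited columns and a non-empty column 1.
check_mapping_file() {
    awk -F '\t' '
        NF < 2 || NF > 3 || $1 == "" { printf "line %d is malformed\n", NR; bad = 1 }
        END { exit bad + 0 }' "$1"
}

# Made-up example covering the three cases: a new label only, appended text
# only (column 2 left empty), and both.
demo_map=$(mktemp)
printf 'GCF_000005845.2\tE_coli_K12_MG1655\n'   >  "$demo_map"
printf 'my_mag_01.fa\t\tfrom_hot_spring\n'       >> "$demo_map"
printf 'my_mag_02.fa\tMAG_02\tfrom_hot_spring\n' >> "$demo_map"

check_mapping_file "$demo_map" && echo "mapping file looks OK"
rm -f "$demo_map"
```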

Tree

Outputs

Primary outputs include:

  • the tree file and alignment file
  • a genome summary table mapping all modified labels to original genome IDs, estimates of completion/redundancy, and any available taxonomy information
  • a table showing number of hits per single-copy gene per genome, and a table of any additional genes searched if any
  • reports on what, if anything, was filtered out at which steps
  • a citations file that includes citation information specific to the run for all the programs GToTree utilized to help with proper citation 👍

Citation information

GToTree relies on many great programs. Along with all other outputs, it will generate a citations.txt file with citation information specific for every run that accounts for all programs it relies upon. Please be sure to cite the developers appropriately :)

Here is an example output citations.txt file from a run, and how I'd cite it in the methods:

GToTree v1.6.31
Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi:10.1093/bioinformatics/btz188

Prodigal v2.6.3
Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223–2230. doi.org/10.1186/1471-2105-11-119

HMMER3 v3.3.2
Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi:10.1371/journal.pcbi.1002195

Muscle v5.1
Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021. doi.org/10.1101/2021.06.20.449169

TrimAl v1.4.rev15
Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972–1973. doi:10.1093/bioinformatics/btp348

TaxonKit v0.9.0
Shen W and Ren H. TaxonKit: a practical and efficient NCBI Taxonomy toolkit. Journal of Genetics and Genomics. 2021. doi.org/10.1016/j.jgg.2021.03.006

FastTree 2 v2.1.11
Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi:10.1371/journal.pone.0009490

Example methods text based on above citation output (be sure to modify as appropriate for your run)

The archaeal phylogenomic tree was produced with GToTree v1.6.31 (Lee 2019), using the prepackaged single-copy gene-set for archaea (76 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes on input genomes provided as fasta files. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). TaxonKit v0.9.0 (Shen and Ren 2021) was used to connect full lineages to taxonomic IDs.