Infer and decorate tree for new release

Donovan H. Parks edited this page Oct 30, 2023 · 95 revisions

Tree inference

Trees should be placed in:

/srv/projects/gtdb/{release}/{domain}/pre_curation/bac120/{date}

The u_cluster_de_novo method of the GTDB Species Cluster Toolkit creates the files gtdb_reps_bac.lst and gtdb_reps_ar.lst which list the NCBI accession numbers for all GTDB bacterial and archaeal reference genomes.

Two bacterial reference trees are inferred with FastTree as follows:

  • Bacterial 120 (bac120) marker set:

    • MSA: gtdb -t 32 tree create --genome_batchfile gtdb_reps_bac.lst --guaranteed_batchfile gtdb_reps_bac.lst --marker_set_ids 1 --output ./msa --no_tree --individual --prefix gtdb_r<cur>_bac120
    • Tree inference: FastTreeMP -nosupport -wag -log gtdb_r<cur>_bac120_fasttree.log ./msa/gtdb_r<cur>_bac120_concatenated.faa > gtdb_r<cur>_bac120_fasttree.tree 2> gtdb_r<cur>_bac120_fasttree.out
    • Non-parametric bootstraps are calculated with GenomeTreeTk. GTDB R214 required ~75 GB per bac120 bootstrap tree, so 6 trees can be inferred at a time on a 500 GB machine (e.g. Page). Calculating bootstrap trees across several machines is possible and recommended: the code skips to the next replicate if intermediate files for a given replicate already exist (requires genometreetk >= 0.1.8). Bootstraps must be done with the WAG model and without the GAMMA model. For GenomeTreeTk v0.1.8 the command is as follows, but be warned that older versions of GenomeTreeTk are loaded on some machines: genometreetk bootstrap gtdb_r<cur>_bac120_fasttree.tree ./msa/gtdb_r<cur>_bac120_concatenated.faa bootstraps -c 6
  • Ribosomal protein set 2 (23 universal ribosomal proteins; modified from Rinke et al. 2013):

    • MSA: gtdb -t 32 tree create --genome_batchfile gtdb_reps_bac.lst --marker_set_ids 11 --output ./msa_rp2 --custom_msa_filters --quality_threshold 0 --completeness_threshold 0 --contamination_threshold 100 --cols_per_gene 10000 --min_perc_aa 40 --min_rep_perc_aa 40 --no_tree --individual --prefix gtdb_r<cur>_bac_rp2
    • Tree inference and non-parametric bootstraps are calculated in an analogous manner to the bac120 marker set.
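
The memory figures quoted for bootstrap trees translate directly into how many trees can be inferred concurrently on one machine. The following is a minimal sketch of that arithmetic and is not part of any GTDB toolkit; the function name and the optional headroom_gb parameter are assumptions made for illustration.

```python
def max_parallel_trees(machine_mem_gb: float,
                       mem_per_tree_gb: float,
                       headroom_gb: float = 0.0) -> int:
    """Return how many bootstrap trees fit in memory at the same time.

    headroom_gb is a hypothetical safety margin reserved for other
    processes on the machine; it is not mentioned in the wiki.
    """
    if mem_per_tree_gb <= 0:
        raise ValueError("mem_per_tree_gb must be positive")
    usable_gb = machine_mem_gb - headroom_gb
    if usable_gb <= 0:
        return 0
    # Floor division: a partially fitting tree does not count.
    return int(usable_gb // mem_per_tree_gb)


# bac120 with FastTree on a 500 GB machine (~75 GB per tree for R214).
print(max_parallel_trees(500, 75))
```

The result for the R214 bac120 figures matches the value passed to -c in the genometreetk bootstrap command. Re-running the calculation with the memory use observed for the current release gives a safe value before the jobs are launched.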

Three archaeal reference trees are inferred with IQ-Tree using initial guide trees inferred with FastTree as follows:

  • Archaeal 122 (ar122) marker set:

    • MSA: gtdb -t 32 tree create --genome_batchfile gtdb_reps_ar.lst --guaranteed_batchfile gtdb_reps_ar.lst --marker_set_ids 2 --output ./msa --no_tree --individual --prefix gtdb_r<cur>_ar122
    • Guide Tree inference: FastTreeMP -nosupport -wag -gamma -log gtdb_r<cur>_ar122_fasttree.log ./msa/gtdb_r<cur>_ar122_concatenated.faa > gtdb_r<cur>_ar122_fasttree.tree 2> gtdb_r<cur>_ar122_fasttree.out
    • Infer IQ-Tree (required 24 hours with 96 CPUs for R207): iqtree -nt 64 -s ./msa/gtdb_r<cur>_ar122_concatenated.faa -m LG+C10+F+G -ft gtdb_r<cur>_ar122_fasttree.tree -pre gtdb_r<cur>_ar122_iqtree
    • Non-parametric bootstraps: ar53 IQ-Tree inference required ~310 GB of memory for GTDB R214, which allows 1 tree to be inferred at a time on a machine with 500 GB of memory. A simple script, run_iqtree_bootstraps.py, was used to infer the bootstrap trees. This script can currently be obtained from the previous GTDB release and needs to be customized for each release and reference tree. Bootstrap trees can be inferred across multiple servers, and a simple canary file scheme protects against processes on different servers working on the same bootstrap tree (not a guarantee, but a collision is highly unlikely). Ultimately, this script should be built into one of the toolkits. Once all bootstrap trees have been inferred, support values can be determined on the original tree with:
      • genometreetk bootstrap gtdb_r<cur>_ar53_iqtree.treefile NONE . --boot_dir ./bootstrap_trees
  • Archaeal 53 marker set (ar53; modified from Dombrowski et al. 2020):

    • MSA: gtdb -t 16 tree create --genome_batchfile gtdb_reps_ar.lst --guaranteed_batchfile gtdb_reps_ar.lst --marker_set_ids 19 --output ./msa --custom_msa_filters --cols_per_gene 10000 --no_tree --individual --prefix gtdb_r<cur>_ar53
    • Trees with support values are inferred in a manner analogous to the ar122 marker set.
    • IQ-Tree inference took 32 hours and 315 GB of RAM for R214.
  • Ribosomal protein set 2 (23 universal ribosomal proteins; modified from Rinke et al. 2013):

    • MSA: gtdb -t 16 tree create --genome_batchfile gtdb_reps_ar.lst --marker_set_ids 11 --output ./msa --custom_msa_filters --quality_threshold 0 --completeness_threshold 0 --contamination_threshold 100 --cols_per_gene 10000 --min_perc_aa 40 --min_rep_perc_aa 40 --no_tree --individual --prefix gtdb_r<cur>_ar_rp2
    • Trees with support values are inferred in a manner analogous to the ar122 marker set.
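
The canary file scheme used for the IQ-Tree bootstraps can be illustrated with a short sketch. This is not the code in run_iqtree_bootstraps.py, which is not reproduced on this page; it only shows one common way to implement such a scheme, assuming all servers see the same shared directory. The function names, the canary file naming, and the directory layout are hypothetical.

```python
import os
import socket
import tempfile
from pathlib import Path


def try_claim(replicate: int, canary_dir: Path) -> bool:
    """Try to claim a bootstrap replicate by creating its canary file.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so
    only one process should succeed. How strictly this holds on a
    network filesystem depends on that filesystem.
    """
    canary_dir.mkdir(parents=True, exist_ok=True)
    canary = canary_dir / f"bootstrap_{replicate}.canary"  # hypothetical name
    try:
        fd = os.open(canary, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another process already claimed this replicate
    with os.fdopen(fd, "w") as fh:
        # Record the owner so stale canaries can be traced by hand.
        fh.write(f"{socket.gethostname()} {os.getpid()}\n")
    return True


def unclaimed_replicates(num_replicates: int, canary_dir: Path):
    """Yield each replicate index this process manages to claim."""
    for replicate in range(num_replicates):
        if try_claim(replicate, canary_dir):
            yield replicate


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    canaries = Path(tmp)
    print(try_claim(0, canaries))                   # True: first claim wins
    print(try_claim(0, canaries))                   # False: already claimed
    print(list(unclaimed_replicates(3, canaries)))  # [1, 2]
```

In a real run, the loop body would launch IQ-Tree for each claimed replicate. A canary left behind by a crashed job has to be deleted manually before that replicate can be retried.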

Producing decorated and RED scaled trees

Note: These instructions are for producing the final curation trees. Typically, we also provide curators with initial snapshot trees, which lack non-parametric bootstraps, so they can do an initial inspection. The following instructions apply equally to decorating and scaling snapshot trees. The u_pmc_species_names method of the GTDB Species Cluster Toolkit produces initial bacterial and archaeal taxonomy files, which should be used to root and decorate the snapshot trees.

Trees can be rooted using:

  • Bacteria: genometreetk outgroup ./bootstraps/gtdb_r<cur>_bac120_fasttree.bootstrap.tree ./u_pmc_species_names_bac/final_taxonomy.tsv p__Spirochaetota gtdb_r<cur>_bac120.rooted.tree

  • Archaea: genometreetk outgroup gtdb_r<cur>_ar53_iqtree.bootstrap.tree ./u_pmc_species_names_ar/final_taxonomy.tsv p__Undinarchaeota gtdb_r<cur>_ar53.rooted.tree

    • It is important to verify the selected outgroup phylum is monophyletic in the tree. A warning will be produced if this is not the case. Other recommended outgroup phyla are p__Thermotogota, p__Patescibacteria, or p__Chloroflexota for the bacterial trees and p__QMZS01 for the archaeal trees.
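
When checking an outgroup by eye, it helps to know exactly which genomes belong to the chosen phylum. The sketch below lists them from a taxonomy file. It assumes the usual two-column layout (genome ID, a tab, then a semicolon-separated lineage such as d__...;p__...;...), which should be confirmed against final_taxonomy.tsv before use. The function name and the toy records are made up for illustration, and the sketch does not test monophyly itself.

```python
from typing import Iterable, List


def genomes_in_taxon(taxonomy_lines: Iterable[str], taxon: str) -> List[str]:
    """Return IDs of genomes whose lineage contains `taxon`.

    Each line is assumed to look like: genome_id<TAB>d__X;p__Y;...;s__Z
    """
    genome_ids = []
    for line in taxonomy_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        genome_id, lineage = line.split("\t", 1)
        taxa = [t.strip() for t in lineage.split(";")]
        if taxon in taxa:
            genome_ids.append(genome_id)
    return genome_ids


# Toy records with invented genome IDs, for illustration only.
example = [
    "genome_A\td__Bacteria;p__Spirochaetota;c__Spirochaetia",
    "genome_B\td__Bacteria;p__Chloroflexota;c__Anaerolineae",
    "genome_C\td__Bacteria;p__Spirochaetota;c__Leptospiria",
]
print(genomes_in_taxon(example, "p__Spirochaetota"))
```

Because the function accepts any iterable of lines, an open file handle for the taxonomy file can be passed in directly. The returned genomes can then be located in a tree viewer to confirm that they form a single clade.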

Dummy leaf nodes, which provide an internal node for species labels that is useful during manual curation, can be added to reference trees and their corresponding taxonomy files using:

  • gtdb_validation_tk add_dummy gtdb_r<cur>_bac120.rooted.tree --taxonomy_file ./u_pmc_species_names_bac/final_taxonomy.tsv
  • This produces the output files gtdb_r<cur>_bac120.rooted.dummy_nodes.tree and final_taxonomy.dummy_nodes.tsv.
  • Note: perhaps this functionality should be moved to genometreetk or gtdb_migration_tk

Trees can be decorated with PhyloRank >= v0.1.11 as follows:

  • phylorank decorate gtdb_r<cur>_bac120.rooted.dummy_nodes.tree final_taxonomy.dummy_nodes.tsv gtdb_r<cur>_bac120.decorated.tree --skip_rd_refine --gtdb_metadata gtdb_r<cur>_metadata.updated_reps.<date>.tsv

PhyloRank is used to calculate RED values, the RED plot, and the RED scaled tree:

  • phylorank outliers gtdb_r<cur>_bac120.decorated.tree final_taxonomy.dummy_nodes.tsv phylorank_outliers --fmeasure_table gtdb_r<cur>_bac120.decorated.tree-table --highlight_polyphyly --dpi 300
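
The rooting, dummy node, decoration, and RED commands all share the <cur> placeholder, and the decorate step also needs <date>. The sketch below fills those placeholders in the bacterial commands listed on this page and prints the result as a dry run; it does not execute anything. The helper name fill and the example values for cur and date are assumptions, and the date string in particular is only a stand-in for the real metadata file suffix.

```python
import re
import shlex
from typing import List

# Bacterial command templates copied from this page.
TEMPLATES = [
    "genometreetk outgroup ./bootstraps/gtdb_r<cur>_bac120_fasttree.bootstrap.tree "
    "./u_pmc_species_names_bac/final_taxonomy.tsv p__Spirochaetota "
    "gtdb_r<cur>_bac120.rooted.tree",
    "gtdb_validation_tk add_dummy gtdb_r<cur>_bac120.rooted.tree "
    "--taxonomy_file ./u_pmc_species_names_bac/final_taxonomy.tsv",
    "phylorank decorate gtdb_r<cur>_bac120.rooted.dummy_nodes.tree "
    "final_taxonomy.dummy_nodes.tsv gtdb_r<cur>_bac120.decorated.tree "
    "--skip_rd_refine --gtdb_metadata gtdb_r<cur>_metadata.updated_reps.<date>.tsv",
    "phylorank outliers gtdb_r<cur>_bac120.decorated.tree "
    "final_taxonomy.dummy_nodes.tsv phylorank_outliers "
    "--fmeasure_table gtdb_r<cur>_bac120.decorated.tree-table "
    "--highlight_polyphyly --dpi 300",
]


def fill(template: str, cur: str, date: str) -> List[str]:
    """Substitute <cur> and <date>, then split into an argument list."""
    command = template.replace("<cur>", cur).replace("<date>", date)
    leftover = re.findall(r"<[^<>\s]+>", command)
    if leftover:
        # Fail loudly rather than run with an unfilled placeholder.
        raise ValueError(f"unfilled placeholders: {leftover}")
    return shlex.split(command)


# Dry run: print each command in order (placeholder values are hypothetical).
for template in TEMPLATES:
    print(shlex.join(fill(template, cur="214", date="DATE")))
```

Printing the commands first makes it easy to check that each output file name matches the input expected by the next step. The argument lists returned by fill could later be handed to subprocess.run once the names have been verified.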