Skip to content

Example usage

Mike Lee edited this page Jul 12, 2022 · 99 revisions

This page demonstrates a few examples from start-to-finish. The data files, run log, and results for most examples (except the large gene/presence visualization one) are packaged with GToTree.


Examples

  • Here we imagine we have a newly recovered genome of interest and we want to see where it fits into its evolutionary landscape alongside its closest known relatives.
  • We can add GTDB taxonomic info instead of NCBI taxonomic info by providing the -D flag.
  • We can also search for the genomes we'd like to include in our phylogenomic tree based on GTDB taxonomy using the program gtt-get-accessions-from-GTDB as demonstrated with this example.
  • It can sometimes be useful to visualize the presence/absence of a gene or trait across a clade of interest or entire domain, as this may reveal interesting patterns about its evolutionary distribution.
  • This demonstrates one way we can utilize the alignment and partitions file generated by GToTree to create a mixed-model tree with the glorious IQ-TREE program.


Alteromonas example

Here we imagine we have a newly recovered genome of interest and we want to see where it fits into its evolutionary landscape alongside its closest known relatives. The exact files for when this example was initially done are available here, but it can also be just followed along with below (though things will be slightly different due to more genomes available.

It takes about 6 minutes to put our new genome (provided as fasta) into a phylogenomic tree with Alteromonas references (provided as NCBI accessions).

The setup

During my PhD I had the privilege of working on a really cool, nitrogen-fixing, marine cyanobacterium called Trichodesmium. This bug seems to only live in perpetual association with a consortium of other microorganisms – there are no pure (axenic) cultures of Trichodesmium by itself, it just doesn't seem to be happy without its buddies. One of the highly conserved organisms (here meaning consistently present across all Trichodesmium samples, but not in non-Trichodesmium "controls") that came out of this work was an Alteromonas metagenome-assembled genome (MAG) – some of that fun work is presented in this paper here.

Here we are going to use GToTree to generate a de novo phylogenomic tree with this "new" Alteromonas MAG and all of the complete Alteromonas genomes available in NCBI's RefSeq database.

Generating inputs

Genomes

The inputs for this example will be: 1) the fasta file of our "new" MAG; 2) a list of accessions of Alteromonas genomes from NCBI's RefSeq; and 3) an Alphaproteobacteria to serve as a root.

1) We can download the MAG fasta file from NCBI and decompress it like such:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/271/865/GCA_002271865.1_ASM227186v1/GCA_002271865.1_ASM227186v1_genomic.fna.gz | gunzip - > GCA_002271865.1.fa

2) We can search NCBI's assembly database on their website to get the accessions for all RefSeq Alteromonas genomes with the following search string: Alteromonas[ORGN] AND "latest refseq"[filter] AND "complete genome"[filter] (when this was put together on 1-Jan-2019, this returned 31 hits). You can download a summary file by selecting "Send to:" at the top right, and setting the options as shown here:

Clicking create file will download these as "assembly_result.txt". Here we are using the RefSeq assembly accessions (those that start with "GCF_...", but GToTree also handles GenBank assembly accessions (those that start with "GCA_"). For us here we are going to take the RefSeq accessions from the 3rd column, but if you are following this and want to work with genomes available in GenBank but not in RefSeq, you would want to take the first column.

In our case here, we can take just the RefSeq accessions with the following:

tail -n +2 assembly_result.txt | cut -f 3 > alteromonas_refseq_accessions.txt

If you're unfamiliar with this line of code, and want to get to know working at the command line better, a good place to start is here :)


NOTE: This part of the process (generating the accessions list of the reference genomes we want) can actually be done completely at the command line. EDirect is a command-line tool for accessing NCBI's databases. It can have a bit of a learning curve, but is definitely worthwhile if you use information from NCBI databases frequently. EDirect comes installed with GToTree if installed with conda. With EDirect, this accessions file could be created at the command line as such: esearch -query 'Alteromonas[ORGN] AND "latest refseq"[filter] AND "complete genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > alteromonas_refseq_accessions.txt. I have more examples of using EDirect at this page here if interested πŸ™‚


3) And to have an outgroup to root the tree with, and to incorporate a GenBank file which GToTree can handle as input as well, we'll take an alphaproteobacterium:

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/365/GCF_000011365.1_ASM1136v1/GCF_000011365.1_ASM1136v1_genomic.gbff.gz | gunzip - > GCF_000011365.1.gbff

Mapping file for labeling specific genomes

Often it is helpful to have specific labels for specific genomes in a tree. In this case we may want to have our "new" MAG labeled as "Our_Alteromonas_MAG", instead of just "GCA_002271865.1", and we may want to label our root as "Alpha_root" instead of "GCF_000011365.1". GToTree uses TaxonKit to add lineage information to any genomes that have such information associated with them (whether provided as NCBI accessions or GenBank files), but we can also swap labels of specific genomes we know we care about and want to be able to find more easily.

To do that we just need to provide a 2-column, tab-delimited file that has the initial genome ID in the first column (this will be either the NCBI accession or the file name (depending on how the genome was provided). This can be made anywhere, but here is quickly making one for this example:

printf "GCA_002271865.1.fa\tOur_Alteromonas_MAG\nGCF_000011365.1.gbff\tGCF_000011365.1_Alpha_Outgroup\n" > genome_to_id_map.tsv

Which looks like this:

cat genome_to_id_map.tsv
GCA_002271865.1.fa	Our_Alteromonas_MAG
GCF_000011365.1.gbff	GCF_000011365.1_Alpha_Outgroup

NOTE: User-provided labels given to genomes listed in this mapping file (passed to the program with the -m flag) will always take precedence over any automated lineage swapping.


Running GToTree

The accessions file can be provided as-is, but to tell GToTree which fasta and genbank files to work on, we need to put their names (or paths) into files. In the case here, this will get the job done:

ls *.fa > fasta_files.txt
ls *.gbff > genbank_files.txt

Now we are set to run GToTree :)

We can check which single-copy gene HMM profiles are available by running gtt-hmms, and since Alteromonas is within the Gammaproteobacteria, we will use that one:

Note: If you're not familiar with seeing things like the codeblock just below, you can ignore the backslashes (\) in here. They are only there to "escape" the return characters so this can all be seen at once (instead of one long line you'd have to scroll), and so it can be copy-and-pasted if wanted 😊

GToTree -a alteromonas_refseq_accessions.txt -g genbank_files.txt \
        -f fasta_files.txt -H Gammaproteobacteria \
        -t -L Species,Strain -m genome_to_id_map.tsv -j 4 \
        -o Alteromonas_example

Code Breakdown:

  • -a – the file with the list of accessions
  • -g – the file holding the genbank paths
  • -f – the file holding the fasta paths
  • -H – the desired HMM profiles to use (can view all default available with gtt-hmms)
  • -t – specifies to use TaxonKit to add labels with lineage information to the tree
  • -L – specifies the ranks to add when adding lineage information to the tree, since here we are working with all Alteromonas, we don't really need more than the Species (or "specific name", which in NCBI includes the Genus) and Strain (if available)
  • -m – the mapping file specifying specific labels for specific input genomes
  • -j – the number of jobs to run in parallel when possible
  • -o – the output prefix for primary output files

Viewing/Editing the tree

The output tree file "Alteromonas_example.tre" is in newick format and can be viewed/edited with any general tree program. A good one for large trees that I have installed on my laptop and use regularly is Dendroscope. And a good web-based one is the Interactive Tree of Life.

If we go to the Interactive Tree of Life upload page, we can upload the tree file we just created. After rooting at our alpha outgroup and coloring our "new" MAG blue, we can see it's among the Alteromonas macleodii references:

And taking a closer look, based on these 172 gammaproteobacterial SCGs it is most closely related to reference strain Te101:

Beyond taxonomy and knowing the most similar reference genome available, having some idea of this layout can help guide further decisions. For instance, if comparative genomics were the goal, it would help to know at the start that the species A. macleodii and A. mediterranea are much more closely related to each other than either are to A. stellipolaris or A. naphthalenivorans. Or if the goal is pursuing pangenomics with an analysis and visualization program like anvi’o, knowing the evolutionary landscape like this can also help in choosing which reference genomes to include.



Using GToTree with the Genome Taxonomy Database (GTDB)

There are two primary ways we can use the stellar GTDB with GToTree (from v1.5+):

  1. We can add GTDB taxonomy to our tree labels instead of NCBI taxonomy if we'd like. This is done by adding the -D flag in the main GToTree call.
  2. We can do a search by GTDB taxonomy to gather the accessions of genomes we want to include in a GToTree run. This is handled with the program gtt-get-accessions-from-GTDB.

For example, say we have a few new metagenome-assembled genomes (MAGs) that we just classified with the wonderful gtdb-tk program as part of the Nitrososphaerales order. If we'd now like to make a de novo phylogenomic tree with our new MAGs placed within the context of all the Nitrososphaerales in the GTDB, we can search the GTDB based on taxonomy using the gtt-get-acessions-from-GTDB program:

This will report how many Nitrososphaerales there are in the GTDB:

gtt-get-accessions-from-GTDB -t Nitrososphaerales --get-taxon-counts

Output

  Reading in the GTDB info table...
    Using GTDB v95: Released July 17, 2020


    The rank 'order' has 168 Nitrososphaerales entries.

Seems there are 168 in the database. We can see how many there are that represent GTDB species clusters by adding the --GTDB-representatives-only flag (this is very useful for covering diversity without having a ton of closely related genomes, similar to NCBI RefSeq's representative genomes):

gtt-get-accessions-from-GTDB -t Nitrososphaerales --get-taxon-counts --GTDB-representatives-only

Output

  Reading in the GTDB info table...
    Using GTDB v95: Released July 17, 2020


    The rank 'order' has 168 Nitrososphaerales entries.

  In considering only GTDB representative genomes:

    The rank 'order' has 100 Nitrososphaerales representative genome entries.

Which tells us there are 100. To download those accessions and a summary table of them, we can just remove the --get-taxon-counts flag:

gtt-get-accessions-from-GTDB -t Nitrososphaerales --GTDB-representatives-only

Output

  Reading in the GTDB info table...
    Using GTDB v95: Released July 17, 2020


    The rank 'order' has 168 Nitrososphaerales entries.

  In considering only GTDB representative genomes:

    The rank 'order' has 100 Nitrososphaerales representative genome entries.

  The targeted NCBI accessions were written to:
    GTDB-Nitrososphaerales-order-GTDB-rep-accs.txt

  A subset GTDB table of these targets was written to:
    GTDB-Nitrososphaerales-order-GTDB-rep-metadata.tsv

Then we are ready to provide those accessions to GToTree as usual to build a de novo tree with our new MAGs:

GToTree -f our-MAG-fasta-files.txt -a GTDB-Nitrososphaerales-order-GTDB-rep-accs.txt -H Archaea -D -j 4

Where the -D flag will also swap out the input accessions for their GTDB taxonomy on the final tree 😊

Random notes on gtt-get-accessions-from-GTDB

For other usage info, see gtt-get-accessions-from-GTDB -h. Like if we wanted to download all genomes, we could provide "all" to the -t target taxon argument.

And if we wanted to just download the genomes, rather than use them with GToTree, we could get the accessions as shown above, and then use bit-dl-ncbi-assemblies from the bit package, e.g.:

  # search GTDB based on taxonomy to get corresponding NCBI accessions
gtt-get-accessions-from-GTDB -t Nitrososphaerales --GTDB-representatives-only

  # download accessions in fasta format 
bit-dl-ncbi-assemblies -w GTDB-Nitrososphaerales-order-GTDB-rep-accs.txt -f fasta -j 4


Visualization of gene presence/absence across the bacterial domain


The setup

Depending on the scope of the question being pursued, it can sometimes be useful to visualize a trait or characteristic across an entire domain or large clade, as this may reveal patterns about its evolutionary distribution. The presence or absence of a particular trait in a given genome can be acquired in any fashion – e.g., using annotations already available, using a program to identify a specific antibiotic resistance marker if that's your thing, or searching for genes of interest based on hidden Markov Models (HMMs) with HMMER3. GToTree already uses these to find single-copy genes, but now supports the added functionality of searching for any additional target PFam accessions. While downloading all genomes to find single-copy genes for building the tree, if provided a file of additional PFam targets, GToTree will also search for these in each genome and: 1) pull out and store the hit sequences; 2) generate a count table of hits per genome; and 3) generate an Interactive Tree of Life-compatible file for each target so all can be easily visualized at their upload site!

Here we will demonstrate this searching for the presence of flagellar protein FliT (PF05400) in 5,550 bacterial genomes pulled from NCBI’s RefSeq.

Note: This is not a particularly quick example because it is done at a large scale, so while it may not be great for following along executing the code, it shows the process. Things can be done the same way on any smaller subset of organisms with any PFam targets (or with any other metric for a characteristic, as discussed below). If you'd like to run a quick example, you can in the "test_data" directory that comes with GToTree. The data, results, and log files for this example are stored here on figshare.

Getting accessions

The accessions from NCBI's RefSeq could be attained with the following search string on NCBI's website "bacteria"[filter] AND "latest refseq"[filter] AND ("reference genome"[filter] OR "representative genome"[filter]), as shown above in the Alteromonas example. These can also be retrieved entirely from the command line like so:

esearch -query '"bacteria"[filter] AND "latest refseq"[filter] AND ("reference genome"[filter] OR "representative genome"[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > accessions.txt

When performed on 3-June-2019, this returned 5,550 genomes.

Specifying additional target PFams

For this example we're just looking for the flagellar protein FliT (PF05400), but we could search for as many as we'd like. So for good measure let's add another one to the list, a ribosomal protein (PF00238). Here's just throwing them into a file:

echo "PF00238\nPF05400" > pfam-targets.txt

NOTE
The latest HMM model version will always be pulled from PFam.

Running GToTree

With our NCBI assembly accessions stored in the file "accessions.txt":

head accessions.txt
GCF_000765805.2
GCF_000763135.2
GCF_000762875.2
GCF_000765745.2
GCF_000765905.2
GCF_002075285.3
GCF_000158095.2
GCF_000974685.2
GCF_001687475.2
GCF_000953855.2

Note: As when running GToTree without additional PFam targets, we could provide any combination of NCBI accessions, GenBank, fasta, or amino acid files here, though we're just using accessions now.

And our additional target PFam accessions stored in a file, here called "pfam-targets.txt":

head pfam-targets.txt
PF00238.19
PF05400.13

We're all set to run GToTree:

GToTree -a accessions.txt -H Universal -p pfam-targets.txt -t -j 50 -o FliT_bacterial_tree

Note: I used the Universal SCG-set with just 15 targets to save time just for this example. But in general, it's best to use the SCG-set that is most specific to the breadth of organisms you are treeing, and here that would be the Bacteria SCG-set which has 74 targets. You can see all available SGC-sets that come with GToTree by running gtt-hmms.

Visualizing the tree

When that finishes, among other things, our output FliT_bacterial_tree/ directory holds our final tree, FliT_bacterial_tree.tre, and a subdirectory holding information about our additional PFam targets called additional_pfam_search_results/ that contains our FliT iToL-compatible file for coloring those branches on the tree 😊

To get the figure pictured at the start of this example, we just need to drag-and-drop the FliT_bacterial_tree.tre file onto the Interactive Tree of Life upload page, and then after that loads we can drag-and-drop our additional_pfam_search_results/PF05400.13-iToL.txt file over the page to color the branches. Then I clicked "Unrooted", Labels "Off", and set the color of the regular branches to gray. That gets us to this point:


Then I exported the tree from iToL and added the lines grouping major taxa and their names in a figure-editing program 😊 – I use Affinity Designer, which has a one-time purchase fee of $50, but Inkscape is a free and open-source alternative.

And that's how we end up with this:

NOTE: If the characteristic you'd like to visualize does not come from PFams, GToTree has a helper script to create the iToL mapping file following a typical GToTree run. GToTree-gen-iToL-map takes as input the All_genomes_summary_info.tsv output file from a standard run, and a single-column file of the genomes that have the characteristic to be colored.



ToL Example

The code and original files from when this example was first put together are available here. It can be done solely following the code below, but things will be slightly different largely due to available genomes having changed.

Run log:

# getting accessions for all refseq, complete, representative genomes (search originally performed on 20-Dec-2018)
esearch -query '"latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > GToTree_ToL_accessions

# running GToTree:
GToTree -a GToTree_ToL_accessions -H Universal -t -j 4 -o ToL -G 0.4

NOTE: This took about 60 minutes on my late 2013 MacBook Pro.

Taking the output tree file "ToL_aligned_SCGs_mod_names.tre" from that and loading it into a tree viewer/editor such as the web-hosted Interactive Tree of Life gives us this view:

The iToL is a pretty stellar program for viewing/editing trees. I also use Dendroscope pretty regularly.



Using the alignment and partitions file with another program

Some tree-building programs allow for the incorporation of mixed models – meaning they will treat each gene differently rather than all the same. GToTree generates a partitions file that is compatible with most of these. If you wanted to do this, you might want to run GToTree in alignment-only mode (by adding the -N flag when running), and then take the alignment and partitions file into a stellar program like iqtree – which has excellent documentation here. While iqtree is installed with GToTree by default, running it within GToTree is doing things in a very standard, generic way. As discussed on the Things to Consider page, you have to sacrifice some level of control as you broaden out to an entire workflow. So GToTree gives you the alignment and partitions file so if you'd like you can take them into something like iqtree and take advantage of all their functionality. If you're looking specifically for their information on mixed-model analysis, a good place to start is their advanced documentation here.

Here is an example of one way you might do this with iqtree using the alignment and partitions outputs from the Alteromonas example above.

iqtree -s Alteromonas_example/Aligned_SCGs_mod_names.faa \
       -spp Alteromonas_example/run_files/Partitions.txt \
       -m MFP -bb 1000 -nt 4 -pre iqtree_out

Code Breakdown:

  • -s – the alignment file we got from GToTree
  • -spp – the partitions file we got from GToTree
  • -m – telling iqtree which model-finder settings to use
  • -bb – number of "ultrafast" bootstraps we want
  • -nt – number of threads we want to use
  • -pre – the prefix of the output files generated