This page demonstrates two examples from start to finish. The data files, run log, and results for all examples are packaged with GToTree in the noted sub-directories.
The code and files for this example are in the
During my PhD I had the privilege of working on a really cool, nitrogen-fixing, marine cyanobacterium called Trichodesmium. This bug seems to only live in perpetual association with a consortium of other microorganisms – there are no pure (axenic) cultures of Trichodesmium by itself; it just doesn't seem to be happy without its buddies. One of the highly conserved organisms (here meaning consistently present to some extent across all Trichodesmium samples, but not in non-Trichodesmium "controls") that came out of this work was an Alteromonas metagenome-assembled genome (MAG) – work on this is presented in this paper (https://www.nature.com/articles/ismej201749).
Here we are going to use GToTree to place this "new" Alteromonas MAG into a phylogenomic tree of all the Alteromonas genomes available in NCBI's RefSeq database.
The inputs will be: 1) the fasta file of our "new" MAG; 2) a list of accessions of Alteromonas genomes from NCBI's RefSeq; and 3) the genome of an alphaproteobacterium to serve as an outgroup for rooting the tree.
1) We can download the MAG fasta file from NCBI and decompress it like so:
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/271/865/GCA_002271865.1_ASM227186v1/GCA_002271865.1_ASM227186v1_genomic.fna.gz | gunzip - > GCA_002271865.1.fa
2) We can search NCBI's assembly database on their website to get the accessions for all RefSeq Alteromonas genomes with the following search string:
Alteromonas[ORGN] AND "latest refseq"[filter] AND "complete genome"[filter] (when this was put together on 1-Jan-2019, this returned 31 hits). You can download a summary file by selecting "Send to:" at the top right, and setting the options as shown here:
Clicking create file will download these as "assembly_result.txt". Here we are using the RefSeq assembly accessions (those that start with "GCF_"), but GToTree also handles GenBank assembly accessions (those that start with "GCA_"). We are going to take the RefSeq accessions from the 3rd column, but if you are following along and want to work with genomes that are available in GenBank but not in RefSeq, you would want to take the first column instead.
In our case here, we can take just the RefSeq accessions with the following:
tail -n +2 assembly_result.txt | cut -f3 > alteromonas_refseq_accessions.txt
If you're unfamiliar with this line of code, and want to get to know working at the command line better, a good place to start is here :)
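Since picking the wrong column is an easy mistake to make, a quick sanity check on the resulting file can be worthwhile. The following is only a sketch and is not part of the packaged run log: `check_refseq_accessions` is a hypothetical helper name, and it assumes nothing more than that every line of the file should start with "GCF_".

```shell
# Hypothetical helper (not part of GToTree): succeed only if the file is
# non-empty and every line starts with the RefSeq assembly prefix "GCF_".
check_refseq_accessions() {
    local file="$1"
    local total bad
    total=$(awk 'END { print NR }' "$file")
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    bad=$(grep -vc '^GCF_' "$file" || true)
    echo "$file: $total accessions, $bad not starting with GCF_"
    [ "$total" -gt 0 ] && [ "$bad" -eq 0 ]
}

# Only run the check if the accessions file from the step above exists.
if [ -f alteromonas_refseq_accessions.txt ]; then
    check_refseq_accessions alteromonas_refseq_accessions.txt \
        || echo "check which column was passed to cut -f" >&2
fi
```

If the second number reported is anything other than 0, the file likely holds GenBank accessions, blank lines, or header text rather than RefSeq accessions.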
NOTE: This part of the process (generating the accessions list of the reference genomes we want) can actually be done completely at the command line. Entrez-Direct is a command-line tool for accessing NCBI's databases. It can have a bit of a learning curve, but is definitely worthwhile if you use information from NCBI databases frequently. It is also installable through conda with:
conda install -c bioconda entrez-direct
With Entrez-Direct (as shown in the "run_log.txt" file in the example_run sub-directory), this accessions file could be created at the command line like so:
esearch -query 'Alteromonas[ORGN] AND "latest refseq"[filter] AND "complete genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > alteromonas_refseq_accessions.txt
3) And to have an outgroup to root the tree with, and to incorporate a GenBank file which GToTree can handle as input as well, we'll take an alphaproteobacterium:
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/365/GCF_000011365.1_ASM1136v1/GCF_000011365.1_ASM1136v1_genomic.gbff.gz | gunzip - > GCF_000011365.1.gbff
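Because both downloads are piped straight through gunzip, a failed transfer can leave behind an empty or truncated file without much fuss. A minimal check, sketched below, is to look at the first line of each file: fasta records begin with ">" and GenBank flat files begin with "LOCUS". The function name `first_line_starts_with` is made up for this illustration.

```shell
# Hypothetical helper: succeed if the first line of file $1 starts with
# the literal prefix $2.
first_line_starts_with() {
    case "$(head -n 1 "$1")" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Check the two files downloaded above, if they are present.
if [ -f GCA_002271865.1.fa ]; then
    first_line_starts_with GCA_002271865.1.fa ">" \
        || echo "GCA_002271865.1.fa does not look like a fasta file" >&2
fi
if [ -f GCF_000011365.1.gbff ]; then
    first_line_starts_with GCF_000011365.1.gbff "LOCUS" \
        || echo "GCF_000011365.1.gbff does not look like a GenBank file" >&2
fi
```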
Mapping file for labeling specific genomes
Often it is helpful to have specific labels for specific genomes in a tree. In this case we may want to have our "new" MAG labeled as "Our_Alteromonas_MAG" instead of just "GCA_002271865.1", and we may want to label our root as "GCF_000011365.1_Alpha_Outgroup" instead of just "GCF_000011365.1". GToTree uses TaxonKit to add lineage information to any genomes that have such information associated with them (whether provided as NCBI accessions or GenBank files), but we can also swap labels of specific genomes we know we care about and want to be able to find more easily.
To do that we just need to provide a 2-column, tab-delimited file that has the initial genome ID in the first column (this will be either the NCBI accession or the file name, depending on how the genome was provided) and the label we want in the second column. Here's how ours looks in this case:
cat genome_to_id_map.tsv
GCA_002271865.1.fa	Our_Alteromonas_MAG
GCF_000011365.1.gbff	GCF_000011365.1_Alpha_Outgroup
NOTE: User-provided labels given to genomes listed in this mapping file (passed to the program with the -m flag) will always take precedence over any automated lineage swapping.
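The mapping file needs real tab characters between its two columns, and these are easy to lose when typing or pasting in an editor. One way to write the file shown above, sketched here with `printf` and followed by a check that each line really has two tab-separated fields:

```shell
# Write the two-column mapping file with literal tab separators.
# printf reuses the format string for each pair of arguments.
printf '%s\t%s\n' \
    "GCA_002271865.1.fa" "Our_Alteromonas_MAG" \
    "GCF_000011365.1.gbff" "GCF_000011365.1_Alpha_Outgroup" \
    > genome_to_id_map.tsv

# Confirm every line has exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad++ } END { exit (bad > 0) }' genome_to_id_map.tsv \
    && echo "genome_to_id_map.tsv looks well-formed"
```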
The accessions file can be provided as-is, but to tell GToTree which fasta and genbank files to work on, we need to put their names (or paths) into files. In the case here, this will get the job done:
ls *.fa > fasta_files.txt
ls *.gbff > genbank_files.txt
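Before starting a run, it can help to confirm that every path in these list files points at a file that exists and is not empty. The sketch below is one way to do that; `check_listed_files` is a hypothetical helper name and is not something GToTree provides.

```shell
# Hypothetical helper: read one path per line from the list file $1 and
# report any that are missing or empty. Succeeds only if all are present.
check_listed_files() {
    local list_file="$1"
    local path
    local missing=0
    while IFS= read -r path; do
        # skip blank lines
        [ -z "$path" ] && continue
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$list_file"
    [ "$missing" -eq 0 ]
}

# Check the two list files created above, if they are present.
for list in fasta_files.txt genbank_files.txt; do
    if [ -f "$list" ]; then
        if check_listed_files "$list"; then
            echo "$list: all listed files present"
        fi
    fi
done
```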
Now we are set to run GToTree:
GToTree -a alteromonas_refseq_accessions.txt -g genbank_files.txt -f fasta_files.txt -H Gammaproteobacteria.hmm -t -L Species,Strain -m genome_to_id_map.tsv -j 4 -o Alteromonas_example
- -a – the file with the list of accessions
- -g – the file holding the genbank paths
- -f – the file holding the fasta paths
- -H – the desired HMM profiles to use (can view all default available with
- -t – specifies to use TaxonKit to add labels with lineage information to the tree
- -L – specifies the ranks to add when adding lineage information to the tree. Since we are working only with Alteromonas here, we don't really need more than the Species (or "specific name", which in NCBI includes the Genus) and Strain (if available)
- -m – the mapping file specifying specific labels for specific input genomes
- -j – the number of jobs to run in parallel when possible
- -o – the output prefix for primary output files
Viewing/Editing the tree
The output tree file "Alteromonas_aligned_SCGs_mod_names.tre" is in newick format and can be viewed/edited with any general tree program. A good one for large trees that I have installed on my laptop and use regularly is Dendroscope. And a good web-based one is the Interactive Tree of Life.
If we go to the Interactive Tree of Life upload page, we can upload the tree file we just created. After rooting at our alpha outgroup and coloring our "new" MAG blue, we can see it's among the Alteromonas macleodii references:
And taking a closer look, based on these 172 gammaproteobacterial SCGs it is most closely related to reference strain Te101:
The code and files for this example are in the
# getting accessions for all refseq, complete, representative genomes (search originally performed on 20-Dec-2018)
esearch -query '"latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > GToTree_ToL_accessions
# running GToTree:
time GToTree -a GToTree_ToL_accessions -H Universal_Hug_et_al.hmm -t -j 4 -o ToL -G 0.4
NOTE: This took about 60 minutes on my late 2013 MacBook Pro.
Taking the output tree file "ToL_aligned_SCGs_mod_names.tre" from that and loading it into a tree viewer/editor such as the web-hosted Interactive Tree of Life gives us this view: