This page serves as the general user guide for GToTree, which may be helpful if you're looking for something specific. To jump right into practical ways GToTree can be helpful it may be more useful to start with the Example-usage page :)
- Required inputs
- Optional arguments and parameters
- Options set for programs run
Running GToTree with no arguments will print the help menu.
The minimum required inputs to GToTree are specifying the genomes you want to incorporate (provided via any combination of NCBI Accessions, GenBank files, and/or fasta files) and specifying which single-copy gene-set to use.
Input genomes can be specified as any combination of NCBI assembly accessions, GenBank files, and/or fasta files.
You can specify which NCBI-archived genomes you'd like to incorporate by providing a single-column file holding NCBI assembly accessions to the -a argument. This file can be created "manually" by searching NCBI's website and downloading a results table, or it can be generated at the command line by using Entrez-Direct. Examples of doing both are presented on the examples page here.
- Those provided can have version numbers (what comes after the "." in the accession, e.g. GCF_000153765.1), or they can be version-less (e.g. GCF_000153765). In the case where no version is provided, GToTree will automatically take the newest released version of that accession.
- If any of the provided accessions cannot be found at NCBI, they will be printed to the screen at the start of the run and will be reported in the output directory in the file "NCBI_accessions_not_found.txt".
- An example input accessions file can be found in the GToTree sub-directory here:
To specify which GenBank files to include, you need to provide a single-column file that holds the file names (or paths) to each of the GenBank files you'd like incorporated. This is passed to the -g argument.
- An example file can be found in the GToTree sub-directory here:
Fasta files are provided similarly to the GenBank files, but passed to the -f argument. You need to provide a single-column file that holds the file names (or paths) to each of the fasta files you'd like incorporated.
- An example file can be found in the GToTree sub-directory here:
Specifying which single-copy gene-set to use
GToTree also needs to know which SCG-set to use, passed with the -H flag. There are 14 provided with the program that are stored in the hmm_sets sub-directory (discussed in some more detail here). If you followed the conda quick-start installation instructions, or set up the appropriate environment variable yourself as detailed here, you can view which HMM files are available by running gtt-hmms by itself (and you don't need to specify the full path to the HMM file, just the name as printed by gtt-hmms).
Each GToTree run creates an output directory to hold all of the output files. This defaults to "GToTree_output", but can be specified with the -o argument, in which case the names of the files below that include "GToTree_output" change accordingly.
Primary output files
- The final tree file in newick format.
- FastTree reports "local support values" that appear as labels on internal nodes to estimate the reliability of each split in the tree. You can find more information about this at their user page here.
- IQ-TREE reports ultrafast bootstrap (UFBoot) support values. Their help pages state that a support value of 95% corresponds roughly to a 95% probability that the clade is true.
- Alignment file in fasta format.
Aligned_SCGs_mod_names.faa (if TaxonKit was used to add lineage info to labels, specified with the -t flag)
Genomes summary info
- A tab-delimited table of summary information for each genome including the following columns:
| Column | Name | Description |
|---|---|---|
| 1 | assembly_id | the input assembly ID (either the accession or base file name depending on input source) |
| 2 | label | the label assigned to the genome in the output tree file |
| 3 | taxid | the NCBI taxid if genome was provided by NCBI accession or GenBank with taxid information |
| 4 | uniq_SCG_hits | number of unique gene hits to the target HMMs |
| 5 | perc_comp | estimated percent completion based on the target HMMs |
| 6 | perc_redund | estimated percent redundancy based on the target HMMs |
| 7 | in_final_tree | Yes or No, did this genome end up in the final tree |
| 8 | num_genes_contributed_to_alignmend | the total number of genes from this genome that contributed to the final alignment |
| 9-15 | lineage info | domain, phylum, class, order, family, genus, specific_name (if taxid info available and TaxonKit was specified) |
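Since the summary table is tab-delimited, it is easy to check programmatically, for instance to list which genomes did not make it into the final tree. The sketch below is only an illustration: it assumes the file starts with a header row using the column names listed above, and the rows in `SUMMARY_TSV` are made-up values rather than real GToTree output.

```python
import csv
import io

# Hypothetical excerpt of a genomes summary table (tab-delimited).
# The header names follow the column list above; all values are made up.
SUMMARY_TSV = (
    "assembly_id\tlabel\ttaxid\tuniq_SCG_hits\tperc_comp\tperc_redund\tin_final_tree\n"
    "GCF_000153765.1\tGenome_A\t12345\t70\t94.6\t1.4\tYes\n"
    "my_bin_7\tmy_bin_7\tNA\t20\t27.0\t0.0\tNo\n"
)


def genomes_dropped(tsv_text):
    """Return the assembly IDs whose in_final_tree column is 'No'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["assembly_id"] for row in reader if row["in_final_tree"] == "No"]


dropped = genomes_dropped(SUMMARY_TSV)
print(dropped)
```

With a real run you would read the summary file from the output directory instead of the in-memory string.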
SCG-hit counts per genome
- A tab-delimited file where the first column holds each genome ID, and the rest of the columns hold counts for how many hits there were to each target gene for each genome.
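As a rough illustration of how this counts table can be used, the sketch below tallies, per genome, how many target genes had no hit, exactly one hit, or more than one hit. It assumes a header row holding the gene names, which is an assumption rather than something stated above, and the gene names and counts in `COUNTS_TSV` are hypothetical.

```python
import csv
import io

# Hypothetical counts table: first column is the genome ID, the remaining
# columns are hit counts per target gene. The header row is an assumption.
COUNTS_TSV = (
    "assembly_id\tgeneA\tgeneB\tgeneC\n"
    "genome_1\t1\t1\t0\n"
    "genome_2\t1\t2\t1\n"
)


def summarize_hits(tsv_text):
    """Count target genes with zero, one, or multiple hits for each genome."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(rows)  # skip the (assumed) header row
    summary = {}
    for row in rows:
        counts = [int(value) for value in row[1:]]
        summary[row[0]] = {
            "missing": sum(1 for c in counts if c == 0),
            "single": sum(1 for c in counts if c == 1),
            "multiple": sum(1 for c in counts if c > 1),
        }
    return summary


summary = summarize_hits(COUNTS_TSV)
print(summary)
```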
Report output files
Report files will only be written if they are needed. For instance if a genome is dropped from analysis due to having too few hits to the target genes, the file "Genomes_removed_for_too_few_hits.tsv" will be created. But if no genomes were removed for this reason, the file will not be generated. So you should not expect to find all of these files after any particular run.
- If there were duplicate accessions in the input NCBI accessions file, they will be reported here.
- If any of the provided NCBI accessions were not found at NCBI, they will be reported here.
- If any NCBI accessions were found at NCBI but neither their genes nor genome could be downloaded, they will be reported here.
- If any genomes were removed from analysis due to having too few hits to the target genes (set with the -G argument), they will be listed here along with how many hits they had.
- If any genes didn't have hits in any of the input genomes, they will be reported here.
- If any GenBank files were provided that didn't have genes annotated, their genes are called with Prodigal and the genomes are retained in the analysis. They are still reported here in case this is a red flag for you (for example, if you intended to use only fully annotated GenBank files).
Optional arguments and parameters
The GToTree help menu can be viewed by running GToTree with no arguments.
- [-o <str>] default: GToTree_output
GToTree writes all output files to an output directory. By default this is set to "GToTree_output", but you can specify it by passing an argument to the
-o flag. (E.g.:
Specify desired genome labels
- [-m <file>] specify desired genome labels
Often it is helpful to have specific labels for specific genomes in a tree (as exemplified in the Alteromonas example). GToTree uses TaxonKit to add lineage information to any genomes that have such information associated with them (whether provided as NCBI accessions or GenBank files), but we can also swap in labels for specific genomes we care about and want to find more easily. We may also want to append certain information to a label. For example, we may want the lineage information added by TaxonKit, but also know something about specific genomes that we want marked on the tree (say they all possess a gene cluster we are interested in, and we want to be able to quickly search for and highlight them).
Either or both of these can be done by providing a mapping file to the -m argument. It should be a 2- or 3-column tab-delimited file:
- Column 1 holds the initial genome ID, which is either the NCBI accession or the file name, depending on how the input genome was provided.
- Column 2 holds the complete new label for that genome, if you want to specify it yourself. Otherwise leave it empty.
- Column 3 holds any text you'd like appended to the label (whether that's the initial label, the modified lineage label, or the label you specified in column 2). If there is nothing you want to append, leave it empty.
NOTE: Not all input genomes need to be provided in the file being passed to -m.
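To make the mapping-file layout concrete, here is a minimal sketch that writes a three-column tab-delimited file covering the three cases described above: replace the label only, append only, and both. The file name `genome_labels.tsv`, the input file names, and all labels are hypothetical, chosen purely for illustration.

```python
import csv

# (genome ID, full replacement label, text to append).
# An empty string leaves that column empty. All entries are hypothetical.
rows = [
    ("GCF_000153765.1", "My_favorite_genome", ""),    # replace the label only
    ("my_bin_7.fa", "", "has_gene_cluster"),          # append to the label only
    ("isolate_1.gbff", "Isolate_1", "has_gene_cluster"),  # replace and append
]

# Write one genome per line, with tabs between the three columns.
with open("genome_labels.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Read the file back to confirm every line has exactly three columns.
with open("genome_labels.tsv") as handle:
    written = [line.rstrip("\n").split("\t") for line in handle]

print(written)
```

The resulting file is what would be passed to the -m argument. Genomes not listed in it are simply left out of the file.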
Specify to add lineage info to genome labels
- [-t ] default: false
By setting the -t flag, GToTree will: get strain information if it is available for genomes provided by NCBI accession; get the NCBI taxids for any genomes that possess them (either from the NCBI accessions provided or from any GenBank files provided) and use TaxonKit to convert them into lineage information; and add this information to the genome labels, making the output tree much more useful than just a collection of odd identifiers. Which specific taxonomic ranks get added can be specified with the -L argument.
Specify which taxonomic ranks to add to genome labels
- [-L <str>] default: Domain,Phylum,Class,Species,Strain
Provide the -t flag with no arguments in order to add lineage info to the genome labels. By default this will add Domain, Phylum, Class, Species, and Strain info, where available. This may be suitable when making a tree across multiple domains, but may be unnecessarily cumbersome when just making a tree of one genus, for instance, as shown in the Alteromonas example. You can specify which ranks you'd like added to the labels with the -L argument as a comma-separated list. For instance, adding all ranks would look like this: "-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain".
Filtering gene-hits by length
- [-c <float>] default: 0.2
When scanning many genomes for many genes, it becomes harder or completely impractical to visually inspect alignments of everything. One way to try to filter out potentially spurious gene hits is to filter by expected length. The -c parameter uses the median length of each particular gene-set to calculate upper and lower length thresholds. It takes a float between 0 and 1 specifying the range about the median within which sequences are retained. The default is 0.2. For example, under the default setting, if the median length of a set of sequences is 100 AAs, sequences longer than 120 or shorter than 80 will be filtered out before alignment of that gene set. The reliability of this approach can depend on how closely related your input genomes are, and it becomes less useful when using very few genomes (see note here).
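The length filter described above can be sketched in a few lines. This is an illustration of the arithmetic only, not GToTree's own code. It assumes the thresholds are the median scaled by (1 - c) and (1 + c), and that sequences exactly at a threshold are kept, which matches the worked example but is otherwise an assumption.

```python
from statistics import median


def length_bounds(lengths, cutoff=0.2):
    """Return the (lower, upper) length thresholds around the median."""
    mid = median(lengths)
    return mid * (1 - cutoff), mid * (1 + cutoff)


def filter_by_length(lengths, cutoff=0.2):
    """Keep only lengths within the thresholds (bounds assumed inclusive)."""
    lower, upper = length_bounds(lengths, cutoff)
    return [n for n in lengths if lower <= n <= upper]


# Hypothetical amino-acid lengths for one gene-set; the median is 100.
lengths = [60, 95, 100, 105, 130]
kept = filter_by_length(lengths)
print(kept)  # [95, 100, 105]
```

With only a handful of sequences the median itself is unstable, which is one way to see why the filter becomes less useful for very small genome sets.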
Filtering genomes based on hits to target genes
- [-G <float>] default: 0.5
The -G parameter allows you to filter out genomes that have too few hits to the target genes. It takes a float between 0 and 1 specifying the minimum fraction of the SCG-set that a genome must have hits to. The default is 0.5. For example, under the default setting, if there are 100 target genes in the HMM profile and genome X only has hits to 49 of them, it will be removed from analysis. How you want this set may depend on the breadth of diversity of the tree you are making (see note here).
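The genome filter comes down to a single comparison, sketched below. The function name is hypothetical, and treating a genome that sits exactly at the minimum fraction as retained is an assumption based on the word "minimum".

```python
def passes_genome_filter(num_hits, num_targets, min_fraction=0.5):
    """Return True if the genome has hits to enough of the target genes.

    num_hits is the number of target genes with a hit in the genome, and
    num_targets is the number of genes in the SCG-set.
    """
    return num_hits / num_targets >= min_fraction


# The example from the text: 49 of 100 target genes is below the 0.5 default.
print(passes_genome_filter(49, 100))  # False
print(passes_genome_filter(75, 100))  # True
```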
Number of cpus to use during HMM searches
- [-n <int>] default: 2
The number of cpus you'd like to use during the HMM searches.
Number of jobs to run in parallel where possible
- [-j <int>] default: 1
This determines how many jobs to run in parallel during steps that are parallelizable – such as the processing/searching of each individual genome, the filtering of genes and genomes, and alignment of each individual gene-set.
- [-B ] default: false
Provide the -B flag with no arguments if you'd like to run GToTree in "best-hit" mode. By default, if a target gene has more than one hit in a given genome, GToTree won't include a sequence for that target gene from that genome in the final alignment. With this flag provided, GToTree will take the best hit and incorporate it into the alignment, even if that genome has more than one hit to the target gene. See here for more discussion on this.
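The difference between the default behavior and best-hit mode can be sketched as follows. This is a toy model rather than GToTree's implementation. The hit records are hypothetical, and ranking hits by a numeric score, with the highest score counting as "best", is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical hits for one genome: (target gene, protein ID, score).
hits = [
    ("geneA", "prot_1", 210.5),
    ("geneB", "prot_2", 180.0),
    ("geneB", "prot_9", 95.3),
]


def select_hits(hits, best_hit=False):
    """Pick which protein represents each target gene for one genome."""
    by_gene = defaultdict(list)
    for gene, protein, score in hits:
        by_gene[gene].append((score, protein))

    selected = {}
    for gene, candidates in by_gene.items():
        if len(candidates) == 1:
            # A single hit is always used.
            selected[gene] = candidates[0][1]
        elif best_hit:
            # Best-hit mode: keep the highest-scoring candidate.
            selected[gene] = max(candidates)[1]
        # Default mode: genes with multiple hits are left out for this genome.
    return selected


print(select_hits(hits))                 # {'geneA': 'prot_1'}
print(select_hits(hits, best_hit=True))  # {'geneA': 'prot_1', 'geneB': 'prot_2'}
```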
Keep temporary directory
- [-d ] default: false
Provide the -d flag with no arguments if you'd like to keep the temporary directory that is used during the run. This is mostly useful for debugging purposes.
If you run GToTree with no arguments you can see the help menu:

                                  GToTree v1.1.12
                        (github.com/AstrobioMike/GToTree)

     -------------------------------- HELP INFO ---------------------------------

      This program takes input genomes from various sources and ultimately produces
      a phylogenomic tree. You can find detailed usage information at:
                                  github.com/AstrobioMike/GToTree/wiki

      Required inputs include:

          1) Input genomes in one or any combination of the following formats:
            - [-a <file>] single-column file of NCBI assembly accessions
            - [-g <file>] single-column file with the paths to each GenBank file
            - [-f <file>] single-column file with the paths to each fasta file
            - [-A <file>] single-column file with the paths to each amino acid file,
                          each file should hold the coding sequences for just one genome

          2)  [-H <file>] location of the uncompressed HMM file being used, or just the
                          HMM name if you've set the environment variable 'GToTree_HMM_dir'
                          to the appropriate location (run 'gtt-hmms' by itself to view
                          the available gene-sets)

      Optional arguments include:

          - [-o <str>] default: GToTree_output
                    Specify the desired output directory.

          - [-m <file>] specify desired genome labels
                    A two- or three-column tab-delimited file where column 1 holds either
                    the file name or NCBI accession of the genome to name (depending
                    on the input source), column 2 holds the desired new genome label,
                    and column 3 holds something to be appended to either initial or
                    modified labels (e.g. useful for "tagging" genomes in the tree based
                    on some characteristic). Columns 2 or 3 can be empty, and the file
                    does not need to include all input genomes.

          - [-t ] default: false
                    Provide this flag with no arguments if you'd like to add lineage info
                    to the sequence headers for any genomes with NCBI taxids.

          - [-L <str>] default: Domain,Phylum,Class,Species,Strain
                    A comma-separated list of the taxonomic ranks you'd like added to
                    the labels if using TaxonKit (-t flag specified). E.g., all would be
                    "-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain"

          - [-T <str>] default: FastTree
                    Which program to use for tree generation. Currently supported are
                    "FastTree" and "IQ-TREE". As of now, these run with default settings
                    only (and IQ-TREE includes "-mset WAG,LG"). To run either with more
                    specific options (and there is a lot of room for variation here), you
                    can use the output alignment file from GToTree as input.

          - [-c <float>] default: 0.2
                    A float between 0-1 specifying the range about the median of
                    sequences to be retained. For example, if the median length of a
                    set of sequences is 100 AAs, those seqs longer than 120 or shorter
                    than 80 will be filtered out before alignment of that gene set

          - [-G <float>] default: 0.5
                    A float between 0-1 specifying the minimum fraction of hits a
                    genome must have of the SCG-set. For example, if there are 100
                    target genes in the HMM profile, and Genome X only has hits to 49
                    of them, it will be removed from analysis.

          - [-n <int> ] default: 2
                    The number of cpus you'd like to use during the HMM search.

          - [-j <int> ] default: 1
                    The number of jobs you'd like to run in parallel during steps
                    that are parallelizable.

          - [-B ] default: false
                    Provide this flag with no arguments if you'd like to run GToTree
                    in "best-hit" mode. By default, if a SCG has more than one hit
                    in a given genome, GToTree won't include a sequence for that target
                    from that genome in the final alignment. With this flag provided,
                    GToTree will use the best hit. See here for more discussion:
                    github.com/AstrobioMike/GToTree/wiki/things-to-consider

          - [-d ] default: false
                    Provide this flag with no arguments if you'd like to keep the
                    temporary directory. (mostly useful for debugging)

      Example usage:

          GToTree -a ncbi_accessions.txt -g genbank_files.txt -f fasta_files.txt -H Bacteria -t -j 4
Options set for programs run
Prodigal is run with default settings other than setting the -c flag, which restricts gene calls to complete genes (genes are not allowed to run off the edges of a sequence).
Hmmsearch is run with default settings other than setting the --cut_ga flag, which uses the gathering (GA) score cutoffs stored in the HMM profile being used.
Muscle is run with default settings other than setting the -diags flag, which is faster when aligning similar sequences.
Trimal is run with default settings other than setting the -automated1 flag, which performs "a heuristic selection of the automatic method based on similarity statistics. (Optimized for Maximum Likelihood phylogenetic tree reconstruction)."
FastTree is run with default settings.
IQ-TREE is currently run with default settings other than
-bb 1000, and