
User Guide

Mike Lee edited this page May 19, 2019 · 41 revisions

This page serves as the general user guide for GToTree, which may be helpful if you're looking for something specific. To jump right into practical ways GToTree can be helpful, it may be more useful to start with the Example-usage page :)




NOTE: Running GToTree with no arguments will provide the help menu.


Required Inputs

At minimum, GToTree needs to know which genomes to incorporate (provided via any combination of NCBI assembly accessions, GenBank files, and/or fasta files) and which single-copy gene-set (SCG-set) to use.

Input Genomes

Input genomes can be specified as any combination of NCBI assembly accessions, GenBank files, and/or fasta files.

NCBI Accessions

You can specify which NCBI-archived genomes you'd like to incorporate by providing a single-column file holding NCBI assembly accessions to the -a argument. This file can be created "manually" by searching NCBI's website and downloading a results table, or it can be generated at the command line by using Entrez-Direct – examples of doing both are presented on the examples page here.

  • Those provided can have version numbers (what comes after the "." in the accession, e.g. GCF_000153765.1), or they can be version-less (e.g. GCF_000153765). In the case where no version is provided, GToTree will automatically take the newest released version of that accession.
  • If any of the provided accessions cannot be found at NCBI, they will be printed to the screen at the start of the run and will be reported in the output directory in the file "NCBI_accessions_not_found.txt".
  • An example input accessions file can be found in the GToTree sub-directory here: GToTree/test_data/ncbi_accessions.txt.
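As a rough illustration of preparing the file passed to -a, the Python sketch below writes a single-column accessions file after dropping exact duplicates and malformed entries. The helper name write_accessions and the regular expression are assumptions made for this example, not part of GToTree. The pattern reflects the usual GCA_/GCF_ assembly accession layout with an optional version suffix.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern for NCBI assembly accessions: GCA_ or GCF_, nine digits,
# and an optional ".version" suffix (e.g. GCF_000153765.1 or GCF_000153765).
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")

def write_accessions(accessions, out_path):
    """Write unique, well-formed accessions one per line; return the rejects."""
    seen, kept, rejected = set(), [], []
    for acc in (a.strip() for a in accessions):
        if not ACCESSION_RE.match(acc):
            rejected.append(acc)
        elif acc not in seen:
            seen.add(acc)
            kept.append(acc)
    Path(out_path).write_text("".join(f"{acc}\n" for acc in kept))
    return rejected

# Written to a throwaway directory here; any writable path would do.
acc_file = Path(tempfile.mkdtemp()) / "ncbi_accessions.txt"
rejected = write_accessions(
    ["GCF_000153765.1", "GCF_000153765", "GCF_000153765.1", "not_an_accession"],
    acc_file,
)
```

The versioned and version-less forms are kept as separate entries because the sketch only removes exact duplicates.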

GenBank files

To specify which GenBank files to include, you need to provide a single-column file that holds the file names (or paths) to each of the GenBank files you'd like incorporated. This is passed to the -g argument.

  • An example file can be found in the GToTree sub-directory here: GToTree/test_data/genbank_files.txt.

Fasta files

Fasta files are provided similarly to the GenBank files, but passed to the -f argument. You need to provide a single-column file that holds the file names (or paths) to each of the fasta files you'd like incorporated.

  • An example file can be found in the GToTree sub-directory here: GToTree/test_data/fasta_files.txt.
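The single-column file for -f (or, equivalently, -g) can be put together in many ways. One possible approach is sketched below in Python: it collects every file matching a pattern in a directory and writes one path per line. The helper name write_path_list, the directory, and the file names are all hypothetical.

```python
import tempfile
from pathlib import Path

def write_path_list(genome_dir, pattern, out_path):
    """Write the sorted paths matching `pattern` under `genome_dir`, one per line."""
    paths = sorted(str(p) for p in Path(genome_dir).glob(pattern))
    Path(out_path).write_text("".join(f"{p}\n" for p in paths))
    return paths

# Demo with throwaway files; the directory and file names are hypothetical.
demo_dir = Path(tempfile.mkdtemp())
for name in ("genome_B.fa", "genome_A.fa", "notes.txt"):
    (demo_dir / name).write_text(">seq\nACGT\n")

listed = write_path_list(demo_dir, "*.fa", demo_dir / "fasta_files.txt")
```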

Specifying which single-copy gene-set to use

GToTree also needs to know which single-copy gene-set (SCG-set) to use, passed with the -H flag. There are 14 provided with the program, stored in the hmm_sets sub-directory (discussed in some more detail here). If you followed the conda quick-start installation instructions, or set up the appropriate environment variable yourself as detailed here, you can view which HMM files are available by running gtt-hmms by itself. In that case you don't need to specify the full path to the HMM file, just the name as printed by gtt-hmms (e.g. -H Bacterial.hmm).
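To show how the required inputs fit together with a few of the optional flags described later on this page, here is a small Python sketch that assembles a GToTree argument list without running it. The function build_gtotree_command is a hypothetical helper written for this example. It only uses flags documented on this page (-a, -g, -f, -H, -o, -t, -j), and the input file names are placeholders.

```python
import shlex

def build_gtotree_command(hmm, accessions=None, genbank=None, fasta=None,
                          out_dir=None, add_lineage=False, jobs=None):
    """Assemble a GToTree argument list from flags described in this guide."""
    if not (accessions or genbank or fasta):
        raise ValueError("provide at least one of accessions, genbank, or fasta")
    cmd = ["GToTree"]
    for flag, value in (("-a", accessions), ("-g", genbank), ("-f", fasta)):
        if value:
            cmd += [flag, value]
    cmd += ["-H", hmm]
    if out_dir:
        cmd += ["-o", out_dir]
    if add_lineage:
        cmd.append("-t")
    if jobs:
        cmd += ["-j", str(jobs)]
    return cmd

cmd = build_gtotree_command("Bacterial.hmm", accessions="ncbi_accessions.txt",
                            fasta="fasta_files.txt", add_lineage=True, jobs=4)
print(shlex.join(cmd))  # the command as it would be typed in a shell
```

The returned list could be handed to subprocess.run if you prefer launching runs from a script.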

Outputs

Each GToTree run creates an output directory to hold all of the output files. This defaults to "GToTree_output", but can be specified with the -o argument, and the names of the files below that include "GToTree_output" would be changed accordingly.

Primary output files

Tree

  • Aligned_SCGs.tre
    • The final tree file in newick format.
    • FastTree reports "local support values" that appear as labels on internal nodes to estimate the reliability of each split in the tree. You can find more information about this at their user page here.
    • IQ-TREE reports ultrafast bootstrap (UFBoot) support values. Their help pages state that a value of 95% indicates roughly a 95% probability that the clade is true.

Alignment files

  • Aligned_SCGs.faa
    • Alignment file in fasta format.
  • Aligned_SCGs_mod_names.faa (if TaxonKit was used to add lineage info to labels – specified with the -t flag)

Genomes summary info

  • All_genomes_summary_info.tsv
    • A tab-delimited table of summary information for each genome including the following columns:
Column | Name                               | Contents
-------|------------------------------------|---------
1      | assembly_id                        | the input assembly ID (either the accession or base file name depending on input source)
2      | label                              | the label assigned to the genome in the output tree file
3      | taxid                              | the NCBI taxid if genome was provided by NCBI accession or GenBank with taxid information
4      | uniq_SCG_hits                      | number of unique gene hits to the target HMMs
5      | perc_comp                          | estimated percent completion based on the target HMMs
6      | perc_redund                        | estimated percent redundancy based on the target HMMs
7      | in_final_tree                      | Yes or No, did this genome end up in the final tree
8      | num_genes_contributed_to_alignmend | the total number of genes from this genome that contributed to the final alignment
9-15   | lineage info                       | domain, phylum, class, order, family, genus, specific_name (if taxid info available and TaxonKit was specified)
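Because the summary table is tab-delimited, it is easy to inspect with a short script. The Python sketch below lists genomes that did not make it into the final tree. It assumes the file has a header row using the column names listed above, and the sample rows are made up purely for illustration. A real table would come from your own output directory.

```python
import csv
import io

# Made-up rows for illustration only; a real file comes from a GToTree run.
sample = (
    "assembly_id\tlabel\ttaxid\tuniq_SCG_hits\tperc_comp\tperc_redund\tin_final_tree\n"
    "GCF_000153765.1\tgenome_one\tNA\t70\t94.6\t1.4\tYes\n"
    "my_bin.fa\tmy_bin\tNA\t20\t27.0\t0.0\tNo\n"
)

def genomes_not_in_tree(handle):
    """Return assembly IDs whose in_final_tree column is not 'Yes'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["assembly_id"] for row in reader if row["in_final_tree"] != "Yes"]

dropped = genomes_not_in_tree(io.StringIO(sample))
```

To read a real table, pass an open file handle in place of the io.StringIO object.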

SCG-hit counts per genome

  • All_genomes_SCG_hit_counts.tsv
    • A tab-delimited file where the first column holds each genome ID, and the rest of the columns hold counts for how many hits there were to each target gene for each genome.

Report output files

Report files are only written if they are needed. For instance, if a genome is dropped from analysis due to having too few hits to the target genes, the file "Genomes_removed_for_too_few_hits.tsv" will be created. If no genomes were removed for this reason, the file will not be generated. So you should not expect to find all of these files after any particular run.

Redundant_input_accessions.txt

  • If there were duplicate accessions in the input NCBI accessions file, they will be reported here.

NCBI_accessions_not_found.txt

  • If any of the provided NCBI accessions were not found at NCBI, they will be reported here.

NCBI_accessions_not_downloaded.txt

  • If any NCBI accessions were found at NCBI but neither their genes nor genome could be downloaded, they will be reported here.

Genomes_removed_for_too_few_hits.tsv

  • If any genomes were removed from analysis due to having too few hits to the target genes (set with -G argument), they will be listed here along with how many hits they had.

Genes_with_no_hits_to_any_genomes.txt

  • If any genes didn't have hits in any of the input genomes, they will be reported here.

Genbank_files_with_no_CDSs.txt

  • If any GenBank files were provided that didn't have genes annotated, their genes are called with Prodigal and the genomes are retained in the analysis. They are still reported here in case this is a red flag for you (for example, if you intended to use only fully annotated GenBank files).

Optional arguments and parameters

The GToTree help menu can be viewed by running GToTree with no arguments.


Output directory

  • [-o <str>] default: GToTree_output

GToTree writes all output files to an output directory. By default this is set to "GToTree_output", but you can specify it by passing an argument to the -o flag. (E.g.: -o Alteromonas_output)


Specify desired genome labels

  • [-m <file>] specify desired genome labels

Often it is helpful to have specific labels for specific genomes in a tree (as exemplified in the Alteromonas example). GToTree uses TaxonKit to add lineage information to any genomes that have such information associated with them (whether provided as NCBI accessions or GenBank files), but we can also swap labels of specific genomes we know we care about and want to be able to find more easily. We may also want to just append certain information to the label. For example, maybe we want the lineage information added from TaxonKit, but we also know something about specific genomes that we want marked on the tree (like they all possess a certain gene cluster we are interested in, and we want to be able to quickly search for and highlight them on the tree).

Either or both of these can be done by providing a mapping file to the -m argument. It should be a 2- or 3-column tab-delimited file with the initial genome ID in the first column (either the NCBI accession or the file name, depending on how the input genome was provided). Columns 2 and 3 may each be left empty. To specify the complete label for a genome yourself, put the new label in column 2; otherwise leave column 2 empty. To append something to the label (whether that's the initial label, the modified lineage label, or the label you specified in column 2), put that text in column 3; otherwise leave column 3 empty.

NOTE: Not all input genomes need to be provided in the file being passed to -m.
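A mapping file with this layout could be written by hand in a text editor or spreadsheet. As one possible alternative, the Python sketch below writes a 3-column tab-delimited file in which one genome gets a full replacement label and another only gets a tag appended. The genome IDs, labels, and output file name are hypothetical.

```python
import tempfile
from pathlib import Path

# Each tuple is (genome ID, replacement label, text to append).
# An empty string leaves that column blank. IDs and labels are hypothetical.
rows = [
    ("GCF_000153765.1", "My_reference_strain", ""),  # replace the whole label
    ("my_bin.fa", "", "has_gene_cluster"),           # keep label, append a tag
]

map_path = Path(tempfile.mkdtemp()) / "genome_to_id_map.tsv"
map_path.write_text("".join("\t".join(row) + "\n" for row in rows))
```

The resulting file is what would be passed to -m. Genomes not listed in it keep whatever label they would otherwise receive.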


Specify to add lineage info to genome labels

  • [-t ] default: false

By setting the -t flag, GToTree will: get strain information if it is available for those provided by NCBI accession; get the NCBI taxids for any genomes that possess them (either from the NCBI accessions provided or if they are present in any GenBank files provided) and use TaxonKit to convert them into lineage information; and add this information to the genome labels – making the output tree much more useful than just a collection of odd identifiers. Which specific taxonomic ranks get added can be specified with the -L argument.


Specify which taxonomic ranks to add to genome labels

  • [-L <str>] default: Domain,Phylum,Class,Species,Strain

Provide the -t flag with no arguments in order to add lineage info to the genome labels. By default this adds Domain, Phylum, Class, Species, and Strain info, where available. This may be suitable when making a tree across multiple domains, but may be unnecessarily cumbersome when making a tree of just one genus, as shown in the Alteromonas example. You can specify which ranks you'd like added to the labels by passing a comma-separated list to the -L argument. For instance, adding all ranks would look like this: -L Domain,Phylum,Class,Order,Family,Genus,Species,Strain.


Filtering gene-hits by length

  • [-c <float>] default: 0.2

When scanning many genomes for many genes, it becomes harder or completely impractical to visually inspect alignments of everything. One way to filter out potentially spurious gene hits is to filter by expected length. The -c parameter uses the median length of each particular gene-set to calculate upper and lower length thresholds. It takes a float between 0 and 1 specifying the range about the median of sequences to be retained. The default is 0.2. For example, under the default setting, if the median length of a set of sequences is 100 AAs, those genes with sequences longer than 120 or shorter than 80 will be filtered out before alignment of that gene set. The reliability of this approach can depend on how closely related your input genomes are, and it becomes less useful when using very few genomes (see note here).
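The arithmetic behind the length thresholds can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not GToTree's own code. The function names are made up, and treating sequences exactly at a threshold as retained is an assumption based on the "longer than 120 or shorter than 80" wording.

```python
from statistics import median

def length_bounds(lengths, cutoff=0.2):
    """Return (lower, upper) length thresholds around the median."""
    mid = median(lengths)
    return mid * (1 - cutoff), mid * (1 + cutoff)

def filter_by_length(seqs, cutoff=0.2):
    """Keep sequences whose length falls within the thresholds (inclusive)."""
    lower, upper = length_bounds([len(s) for s in seqs.values()], cutoff)
    return {name: s for name, s in seqs.items() if lower <= len(s) <= upper}

# Toy sequences of lengths 100, 100, 100, 79, and 121 (median 100).
seqs = {"g1": "M" * 100, "g2": "M" * 100, "g3": "M" * 100,
        "g4": "M" * 79, "g5": "M" * 121}
kept = filter_by_length(seqs)  # g4 and g5 fall outside the 80-120 window
```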


Filtering genomes based on hits to target genes

  • [-G <float>] default: 0.5

The -G parameter allows you to filter out genomes that have too few hits to the target genes. It takes a float between 0 and 1 specifying the minimum fraction of the SCG-set that a genome must have hits to. The default is 0.5. For example, under the default setting, if there are 100 target genes in the HMM profile and genome X only has hits to 49 of them, it will be removed from analysis. How you want this set may depend on the breadth of diversity of the tree you are making (see note here).
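The rule can be written out as a one-line check. The Python sketch below is illustrative only, with a hypothetical function name, and it assumes a genome sitting exactly at the minimum fraction is retained, which follows from the "minimum fraction" wording but is not confirmed here.

```python
def passes_hit_filter(unique_hits, total_targets, min_fraction=0.5):
    """Return True if the genome has hits to enough of the target genes."""
    return unique_hits / total_targets >= min_fraction

# With 100 target genes and the default of 0.5:
removed = not passes_hit_filter(49, 100)   # 0.49 is below the minimum
retained = passes_hit_filter(50, 100)      # 0.50 meets the minimum (assumed inclusive)
```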


Number of cpus to use during HMM searches

  • [-n <int>] default: 2

The number of cpus you'd like to use during the HMM searches.


Number of jobs to run in parallel where possible

  • [-j <int>] default: 1

This determines how many jobs to run in parallel during steps that are parallelizable – such as the processing/searching of each individual genome, the filtering of genes and genomes, and alignment of each individual gene-set.


Best-hit mode

  • [-B ] default: false

Provide the -B flag with no arguments if you'd like to run GToTree in "best-hit" mode. By default, if a target gene has more than one hit in a given genome, GToTree won't include a sequence for that target gene from that genome in the final alignment. With this flag provided, GToTree will take the best hit and incorporate it into the alignment, even if that genome has more than one hit to the target gene. See here for more discussion on this.
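The difference between the two modes can be pictured with a small Python sketch. It is not GToTree's implementation: the function name and gene and sequence IDs are hypothetical, and picking the "best" hit by highest score is an assumption made for the example.

```python
def select_hits(hits_by_gene, best_hit=False):
    """Pick one sequence per target gene for a single genome.

    hits_by_gene maps a target gene to a list of (sequence_id, score) hits.
    """
    selected = {}
    for gene, hits in hits_by_gene.items():
        if len(hits) == 1:
            selected[gene] = hits[0][0]
        elif len(hits) > 1 and best_hit:
            # Assumption: the best hit is the one with the highest score.
            selected[gene] = max(hits, key=lambda h: h[1])[0]
        # Otherwise (no hits, or multiple hits in default mode) the gene
        # contributes nothing from this genome.
    return selected

hits = {
    "geneA": [("seq1", 250.0)],
    "geneB": [("seq2", 120.5), ("seq3", 310.2)],
    "geneC": [],
}
default_mode = select_hits(hits)
best_hit_mode = select_hits(hits, best_hit=True)
```

In default mode only geneA contributes a sequence, while in best-hit mode geneB contributes seq3 as well.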


Keep temporary directory

  • [-d ] default: false

Provide the -d flag with no arguments if you'd like to keep the temporary directory that is used during the run. This is mostly useful for debugging purposes.


If you run GToTree with no arguments you can see the help menu:

               GToTree v1.1.12 (github.com/AstrobioMike/GToTree)


 --------------------------------  HELP INFO  ---------------------------------

  This program takes input genomes from various sources and ultimately produces
  a phylogenomic tree. You can find detailed usage information at:
                                  github.com/AstrobioMike/GToTree/wiki

    Required inputs include:

      1) Input genomes in one or any combination of the following formats:
        - [-a <file>] single-column file of NCBI assembly accessions
        - [-g <file>] single-column file with the paths to each GenBank file
        - [-f <file>] single-column file with the paths to each fasta file
        - [-A <file>] single-column file with the paths to each amino acid file,
                      each file should hold the coding sequences for just one genome

      2)  [-H <file>] location of the uncompressed HMM file being used, or just the
                      HMM name if you've set the environment variable 'GToTree_HMM_dir'
                      to the appropriate location (run 'gtt-hmms' by itself to view
                      the available gene-sets)

    Optional arguments include:

        - [-o <str>] default: GToTree_output
                  Specify the desired output directory.

        - [-m <file>] specify desired genome labels
                  A two- or three-column tab-delimited file where column 1 holds either
                  the file name or NCBI accession of the genome to name (depending
                  on the input source), column 2 holds the desired new genome label,
                  and column 3 holds something to be appended to either initial or
                  modified labels (e.g. useful for "tagging" genomes in the tree based
                  on some characteristic). Columns 2 or 3 can be empty, and the file does
                  not need to include all input genomes.

        - [-t ] default: false
                  Provide this flag with no arguments if you'd like to add lineage
                  info to the sequence headers for any genomes with NCBI taxids.

        - [-L <str>] default: Domain,Phylum,Class,Species,Strain
                  A comma-separated list of the taxonomic ranks you'd like added to
                  the labels if using TaxonKit (-t flag specified). E.g., all would be
                  "-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain"

        - [-T <str>] default: FastTree
                  Which program to use for tree generation. Currently supported are
                  "FastTree" and "IQ-TREE". As of now, these run with default settings
                  only (and QT-TREE includes "-mset WAG,LG". To run either with more
                  specific options (and there is a lot of room for variation here), you
                  can use the output alignment file from GToTree as input.

        - [-c <float>] default: 0.2
                  A float between 0-1 specifying the range about the median of
                  sequences to be retained. For example, if the median length of a
                  set of sequences is 100 AAs, those seqs longer than 120 or shorter
                  than 80 will be filtered out before alignment of that gene set

        - [-G <float>] default: 0.5
                  A float between 0-1 specifying the minimum fraction of hits a
                  genome must have of the SCG-set. For example, if there are 100
                  target genes in the HMM profile, and Genome X only has hits to 49
                  of them, it will be removed from analysis.

        - [-n <int> ] default: 2
                  The number of cpus you'd like to use during the HMM search.

        - [-j <int> ] default: 1
                  The number of jobs you'd like to run in parallel during steps
                  that are parallelizable.

        - [-B ] default: false
                  Provide this flag with no arguments if you'd like to run GToTree
                  in "best-hit" mode. By default, if a SCG has more than one hit
                  in a given genome, GToTree won't include a sequence for that target
                  from that genome in the final alignment. With this flag provided,
                  GToTree will use the best hit. See here for more discussion:
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider

        - [-d ] default: false
                  Provide this flag with no arguments if you'd like to keep the
                  temporary directory. (mostly useful for debugging)

    Example usage:

	 GToTree -a ncbi_accessions.txt -g genbank_files.txt -f fasta_files.txt -H Bacteria -t -j 4

Options set for programs run

prodigal v2.6.3

Prodigal is run with default settings other than setting the -c flag, which means only include complete genes.

hmmsearch v3.2.1

Hmmsearch is run with default settings other than setting the --cut_ga flag, which uses the gathering score stored in the HMM profile being used for cutoff values.

muscle v3.8.1551

Muscle is run with default settings other than setting the -diags flag, which is faster when aligning similar sequences.

trimal v1.4.rev15

Trimal is run with default settings other than setting the -automated1 flag, which performs "a heuristic selection of the automatic method based on similarity statistics. (Optimized for Maximum Likelihood phylogenetic tree reconstruction)."

FastTree v2.1.10

FastTree is run with default settings.

iq-tree v1.6.9

IQ-TREE is currently run with default settings other than -nt, -bb 1000, and -mset WAG,LG.
