Installation

Mike Lee edited this page Nov 8, 2023 · 136 revisions

GToTree runs in a Unix-like command-line environment, so it works on Mac and Linux computers in the standard terminal programs available with them. To use GToTree on a Windows computer, I recommend installing the Windows Subsystem for Linux (WSL) and then, in the WSL terminal, installing a Linux version of miniconda. Installing with conda as shown below will then work in the WSL environment 👍

Conda quickstart!

If you don't already have the glorious package manager conda, I highly recommend you get it. This really isn't the venue to go into why it's so helpful, but it really is, I promise 🙂

To get conda up and running (which is very quick), you can follow the instructions to install miniconda (a light-weight version) for your appropriate system starting from here. You will want a Python 3.x version, and more than likely a 64-bit version. And if you'd like to learn more about conda sometime, I have an introduction page here 🙂


The following lines will create a gtotree conda environment and install GToTree. Run them from the base conda environment:

# installing mamba if needed first (for faster conda installs)
conda install -n base -c conda-forge mamba
mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree

DONE!

Now you should be able to enter the environment with conda activate gtotree and exit it with conda deactivate (which takes no environment name). If you enter the environment and run the following:

gtt-hmms

It will print out where the GToTree default HMMs directory is located and list the available pre-built HMMs. And if you run GToTree with no arguments, you can see the help menu.
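
If you'd like a quick scripted check that both commands are reachable after activating the environment, something like the following should do it. This is only a minimal sketch: `check_tools` is a hypothetical helper name of my own, not something GToTree provides.

```shell
# Hypothetical helper (not part of GToTree): report whether each named
# command can be found on the PATH; returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from inside the activated gtotree environment:
#   check_tools GToTree gtt-hmms
```

If either one is reported as missing, the environment most likely isn't activated in the current terminal.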

Test run

You can run a test that takes about 3 minutes like so:

gtt-test.sh

For which the end of the standard output should look like this:

#################################################################################
####                                 Done!!                                  ####
#################################################################################

  Overall, 12 genomes of the input 14 were retained (see notes below).

    Tree written to:
        GToTree-test-output/GToTree-test-output.tre

    Alignment written to:
        GToTree-test-output/Aligned_SCGs_mod_names.faa

    Main genomes summary table written to:
        GToTree-test-output/Genomes_summary_info.tsv

    Summary table with hits per target gene per genome written to:
        GToTree-test-output/SCG_hit_counts.tsv

    Outputs from Pfam searching written to:
        GToTree-test-output/Pfam_search_results/

    Partitions file (for downstream use with mixed-model treeing) written to:
        GToTree-test-output/run_files/Partitions.txt

 _______________________________________________________________________________

  Notes:

        1 accession(s) not successfully found at NCBI.
        1 genome(s) removed due to having too few hits to the targeted SCGs.
        2 gene(s) either had no hits or only multiple hits in each genome.

    Reported along with additional informative run files in:
        GToTree-test-output/run_files/

 _______________________________________________________________________________

    Log file written to:
        GToTree-test-output/gtotree-runlog.txt
 _______________________________________________________________________________

    Programs used and their citations have been written to:
        GToTree-test-output/citations.txt

 _______________________________________________________________________________


                                         Total process runtime: 0 hours and 2 minutes.
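
If you'd rather check that result in a script than by eye, the "Overall, ..." line can be pulled apart with sed. This is a sketch that assumes the line is worded exactly as in the output above; `retained_counts` is a hypothetical helper name, and it reads whatever text you pipe into it (for example, standard output you saved to a file yourself).

```shell
# Hypothetical helper: read GToTree's printed summary on standard input and
# print "<retained> <input>" taken from the "Overall, ..." line.
retained_counts() {
    sed -n 's/^ *Overall, \([0-9][0-9]*\) genomes of the input \([0-9][0-9]*\) were retained.*/\1 \2/p'
}

# Demonstration using the summary line shown above
echo "  Overall, 12 genomes of the input 14 were retained (see notes below)." | retained_counts
```

For the test run shown here, that prints the two numbers 12 and 14.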

If you load that output tree file "GToTree-test-output/GToTree-test-output.tre" into a tree viewer (for example, by uploading it to the Interactive Tree of Life site), root it at the included archaeal sequence, and drag and drop in the "GToTree-test-output/Pfam_search_results/iToL_files/PF05400.*-iToL.txt" file, it will look something like this:

The blue branches lead to those genomes in which the FliT protein, involved in flagellar biosynthesis, was detected (searched for by its Pfam, PF05400, being specified in the "pfam_targets.txt" input file).

You can clean out the test data and results by running:

gtt-clean-after-test.sh

Updating to a newer version

To update to the latest GToTree version, it is best to remove the previous conda environment and install fresh. This can be done as follows:

# from outside the gtotree conda environment (assuming that's what it was named like the install above)
conda env remove -n gtotree

# then re-installing in a new environment same as above
# installing mamba if needed first (for faster conda installs)
conda install -n base -c conda-forge mamba
mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree

Then the new environment can be activated with conda activate gtotree.
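
If you aren't sure whether an older gtotree environment is still around, you can look for its name in the output of conda env list before removing anything. A minimal sketch, where `has_env` is a hypothetical helper of my own that reads that listing on standard input:

```shell
# Hypothetical helper: succeed if the named environment appears in the
# first column of `conda env list` output supplied on standard input.
has_env() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (requires conda):
#   if conda env list | has_env gtotree; then
#       conda env remove -n gtotree
#   fi
```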


Installation without conda (not recommended)

Again, the conda installation is highly recommended as it is more robust across different systems. But to try installing without conda, download and unpack/decompress GToTree wherever you'd like it to live on your system (be sure to change the versions below to the latest found here):

curl -L https://github.com/AstrobioMike/GToTree/archive/v1.5.22.tar.gz -o GToTree-v1.5.22.tar.gz
tar -xzvf GToTree-v1.5.22.tar.gz
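
Because the version number shows up in both the URL and the tarball name, one option is to set it once in a variable so it only needs changing in one place. This is just a sketch following the naming pattern of the two commands above; the `version` variable and the `gtotree_url` function are names I've made up for illustration.

```shell
# Set this to the latest release version (1.5.22 is just the one used above)
version="1.5.22"

# Hypothetical helper: build the release tarball URL for a given version,
# following the pattern of the curl command above
gtotree_url() {
    echo "https://github.com/AstrobioMike/GToTree/archive/v${1}.tar.gz"
}

# Print the URL so it can be checked before downloading
gtotree_url "$version"

# The download and unpack steps would then be:
#   curl -L "$(gtotree_url "$version")" -o "GToTree-v${version}.tar.gz"
#   tar -xzvf "GToTree-v${version}.tar.gz"
```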

Add the bin to your PATH

Now we need to add the "bin" directory to our PATH (see here if you are unfamiliar with what the PATH is and you'd like to know more). One way we can do this is to change directories into the bin, and use pwd inside an echo command to put the full path into our PATH:

cd GToTree-1.5.22/bin # make sure you are in this bin directory
echo "export PATH=\"$(pwd):\$PATH\"" >> ~/.bash_profile

Add path to included HMM files

If you'd like to more easily be able to use the included single-copy gene HMM profiles, you can also add a variable to your bash profile so that you don't need to provide the full path to them whenever you use them. If you change directories into the "hmm_sets" directory, this can be done in a similar way as above:

cd ../hmm_sets/ # from where we were above
echo "export GToTree_HMM_dir=\"$(pwd)/\"" >> ~/.bash_profile

The last thing to do is source the ~/.bash_profile we just modified so those changes take effect in our current session:

source ~/.bash_profile
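
One thing to be aware of: each run of the echo commands above appends another line to ~/.bash_profile, so repeating them leaves duplicate entries. If that bothers you, a small guard can skip the append when the exact line is already there. This is a sketch only, and `append_once` is a hypothetical helper name:

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present (so re-running does not create duplicates).
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the line as a fixed string
    if ! grep -qxF -e "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage, from inside the bin directory:
#   append_once "export PATH=\"$(pwd):\$PATH\"" ~/.bash_profile
```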

You can run gtt-hmms with no arguments to make sure the default HMM directory is set, and see what taxa the currently available HMM files can more specifically target.

And now if you type GToTree with no arguments, you should see the help menu (but note that you still need to take care of the dependencies presented below before you're ready to rock):

                                  GToTree v1.6.20
                         (github.com/AstrobioMike/GToTree)


 ----------------------------------  HELP INFO  ----------------------------------

  This program takes input genomes from various sources and ultimately produces
  a phylogenomic tree. You can find detailed usage information at:
                                  github.com/AstrobioMike/GToTree/wiki


 -------------------------------  REQUIRED INPUTS  -------------------------------

      1) Input genomes in one or any combination of the following formats:
        - [-a <file>] single-column file of NCBI assembly accessions
        - [-g <file>] single-column file with the paths to each GenBank file
        - [-f <file>] single-column file with the paths to each fasta file
        - [-A <file>] single-column file with the paths to each amino acid file,
                      each file should hold the coding sequences for just one genome

      2)  [-H <file>] location of the uncompressed HMM file being used, or just the
                      HMM name if you've set the environment variable 'GToTree_HMM_dir'
                      to the appropriate location or installed via conda (run 'gtt-hmms'
                      by itself to view the available gene-sets)


 -------------------------------  OPTIONAL INPUTS  -------------------------------


      Output directory specification:

        - [-o <str>] default: GToTree_output
                  Specify the desired output directory.


      User-specified modification of genome labels:

        - [-m <file>] specify desired genome labels
                  A two- or three-column tab-delimited file where column 1 holds either
                  the file name or NCBI accession of the genome to name (depending
                  on the input source), column 2 holds the desired new genome label,
                  and column 3 holds something to be appended to either initial or
                  modified labels (e.g. useful for "tagging" genomes in the tree based
                  on some characteristic). Columns 2 or 3 can be empty, and the file does
                  not need to include all input genomes.


      Options for adding taxonomy information:

        - [-t ] default: false
                  Provide this flag with no arguments if you'd like to add NCBI taxonomy
                  info to the sequence headers for any genomes with NCBI taxids. This will
                  will largely be effective for input genomes provided as NCBI accessions
                  (provided to the `-a` argument), but any input GenBank files will also
                  be searched for an NCBI taxid. See `-L` argument for specifying desired
                  ranks.

        - [-D ] default: false
                  Provide this flag with no arguments if you'd like to add taxonomy from the
                  Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
                  effective for input genomes provided as NCBI accessions (provided to the
                  `-a` argument). This can be used in combination with the `-t` flag, in
                  which case any input accessions not represented in the GTDB will have NCBI
                  taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
                  specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
                  for help getting input accessions based on GTDB taxonomy searches.

        - [-L <str>] default: Domain,Phylum,Class,Species,Strain
                  A comma-separated list of the taxonomic ranks you'd like added to
                  the labels if adding taxonomic information. E.g., all would be
                  "-L Domain,Phylum,Class,Order,Family,Genus,Species". Note that
                  strain-level information is available through NCBI, but not GTDB.


      Filtering settings:

        - [-c <float>] default: 0.2
                  A float between 0-1 specifying the range about the median of
                  sequences to be retained. For example, if the median length of a
                  set of sequences is 100 AAs, those seqs longer than 120 or shorter
                  than 80 will be filtered out before alignment of that gene set
                  with the default 0.2 setting.

        - [-G <float>] default: 0.5
                  A float between 0-1 specifying the minimum fraction of hits a
                  genome must have of the SCG-set. For example, if there are 100
                  target genes in the HMM profile, and Genome X only has hits to 49
                  of them, it will be removed from analysis with default value 0.5.

        - [-B ] default: false
                  Provide this flag with no arguments if you'd like to run GToTree
                  in "best-hit" mode. By default, if a SCG has more than one hit
                  in a given genome, GToTree won't include a sequence for that target
                  from that genome in the final alignment. With this flag provided,
                  GToTree will use the best hit. See here for more discussion:
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider


      Additional PFam searching:

        - [-p <file>] single-column file of additional PFam targets to search for.
                  Table of hit counts, fasta of hit sequences, and files compatible
                  with the iToL web-based tree-viewer will be generated for each
                  target. See visualization of gene presence/absence example at
                  github.com/AstrobioMike/GToTree/wiki/example-usage for example.


      General run settings:

        - [-N ] default: false
                  No tree. Generate alignment only.

        - [-k ] default: false
                  Keep individual protein alignment files.

        - [-T <str>] default: FastTreeMP if available, FastTree if not
                  Which program to use for tree generation. Currently supported are
                  "FastTree", "FastTreeMP", and "IQ-TREE". As of now, these run
                  with default settings only (and IQ-TREE includes "-mset WAG,LG"). To
                  run either with more specific options (and there is a lot of room for
                  variation here), you can use the output alignment file from GToTree (and
                  partitions file if wanted for mixed-model specification) as input into
                  a dedicated treeing program.
                  Note on FastTreeMP (http://www.microbesonline.org/fasttree/#OpenMP). FastTreeMP
                  parallelizes some steps of the treeing step. Currently, conda installs
                  FastTreeMP with FastTree on linux systems, but not on Mac OSX systems.
                  So if using the conda installation, you may not have FastTreeMP if on a Mac,
                  in which case FastTree will be used instead – this will be reported when the
                  program starts, and be in the log file.

        - [-n <int> ] default: 2
                  The number of cpus you'd like to use during the HMM search. (Given
                  these are individual small searches on single genomes, 2 is probably
                  always sufficient.)

        - [-j <int> ] default: 1
                  The number of jobs you'd like to run in parallel during steps
                  that are parallelizable. This includes things like downloading input
                  accession genomes and running parallel alignments, and portions of the
                  tree step if using FastTree on a Linux system (e.g. see FastTree docs
                  here: http://www.microbesonline.org/fasttree/#OpenMP).

                  Note that I've occassionally noticed NCBI not being happy with over ~50
                  downloads being attempted concurrently. So if using a `-j` setting around
                  there or higher, and GToTree is saying a lot of input accessions were not
                  successfully downloaded, consider trying with fewer.

        - [-X ] default: false
                  If working with greater than 1,000 target genomes, GToTree will by default
                  use the 'super5' muscle alignment algorithm to increase the speed of the alignments (see
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider#working-with-many-genomes
                  for more details and the note just above there on using representative genomes).
                  Anyway, provide this flag with no arguments if you don't want to speed up
                  the alignments.

        - [-P ] default: false
                  Provide this flag with no arguments if your system can't use ftp,
                  and you'd like to try using http.

        - [-F ] default: false
                  Provide this flag with no arguments if you'd like to force
                  overwriting the output directory if it exists.

        - [-d ] default: false
                  Provide this flag with no arguments if you'd like to keep the
                  temporary directory. (Mostly useful for debugging.)


 --------------------------------  EXAMPLE USAGE  --------------------------------

	GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4
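
To make the two filtering settings in the help menu above a little more concrete, here is the arithmetic behind its examples. This is an illustration only, not GToTree's own code: both function names are hypothetical, and treating a genome that sits exactly at the -G threshold as retained is my assumption.

```shell
# Illustration only (not GToTree's own code): the length window implied by
# a median length and a -c value, printed as "<shortest> <longest>".
length_window() {
    awk -v m="$1" -v c="$2" 'BEGIN { printf "%.1f %.1f\n", m * (1 - c), m * (1 + c) }'
}

# Illustration only: succeed if hits/targets meets the -G minimum fraction.
# Assumption: a genome exactly at the threshold is kept.
meets_hit_fraction() {
    awk -v h="$1" -v t="$2" -v g="$3" 'BEGIN { exit !(h / t >= g) }'
}

# The help menu's -c example: median of 100 AAs with the default 0.2
length_window 100 0.2

# The help menu's -G example: hits to 49 of 100 targets with the default 0.5
meets_hit_fraction 49 100 0.5 || echo "49 of 100 hits: removed at -G 0.5"
```

The first call gives the 80 to 120 window from the -c description, and the second reproduces the case of Genome X being removed.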

Installing dependencies without conda

By far, the easiest way to get all the dependencies up and running is with conda as done above. But if you don't want to use conda, here are links to installing all the dependencies (be sure to install Easel along with HMMER3 as well if you are doing things the non-conda way).

Essential dependencies

If you use GToTree, please be sure to cite these folks – a citations.txt file including used programs is produced with each run to help 🙂

Note on versions: The versions listed below were used specifically at one point in GToTree's history, and are left here as a reference if someone is trying to install without conda. But with the conda installation, it can sometimes be better to be more flexible with regard to versions. We can check specific versions in our conda installation manually, and/or the citations.txt file produced by a GToTree run will list the versions of programs used for that run.

Optional dependencies depending on use

If you use GToTree in a manner that uses these tools, please cite these folks – a citations.txt file including used programs is produced with each run to help 🙂

  • Prodigal v2.6.3 - citation
    • if providing input genomes in fasta format, or GenBank format with no CDS annotations, or NCBI accessions to genomes with no gene calls
    • if providing input genomes as NCBI assembly accessions
  • TaxonKit v0.6.0 - citation
    • if adding NCBI taxonomy information to input genomes
  • Genome Taxonomy Database Release R05-RS95 - citation
    • if adding GTDB taxonomy information to input genomes
  • GNU Parallel v20161122 - citation info
    • if running things in parallel (specifically set with the -j argument)
  • IQ-TREE v2.0.3 - citation

Note on versions: The versions listed above were used specifically at one point in GToTree's history, and are left here as a reference if someone is trying to install without conda. But with the conda installation, it can sometimes be better to be more flexible with regard to versions. We can check specific versions in our conda installation manually, and/or the citations.txt file produced by a GToTree run will list the versions of programs used for that run.


NOTE: If doing a non-conda installation, you may also need to temporarily change your terminal's localization settings if you're not in the United States or Australia, as GToTree expects things to be encoded a certain way. If you run locale in the terminal, you will get a list of these settings. If any do not say "en_US.UTF-8", you can run these two commands to change them for the current terminal session: export LC_ALL="en_US.UTF-8" and export LANG="en_US.UTF-8". GToTree will then run appropriately in that terminal window. When you open a new terminal, your settings will be back to the way they were.
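
To spot the settings that differ without reading through the whole list, the locale output can be filtered. A minimal sketch, where `non_en_us_settings` is a hypothetical helper name:

```shell
# Hypothetical helper: from `locale` output on standard input, print the
# settings whose value is not en_US.UTF-8 (prints nothing if all match).
non_en_us_settings() {
    grep -v 'en_US\.UTF-8' || true
}

# Usage:
#   locale | non_en_us_settings
# If anything is printed, the two exports from the note above apply:
#   export LC_ALL="en_US.UTF-8"
#   export LANG="en_US.UTF-8"
```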