Skip to content

molbiodiv/bcgTree

Repository files navigation

bcgTree

Automatized phylogenetic tree building from bacterial core genomes.

doc/bcgTree.png

An article describing bcgTree is published in Genome. If you use bcgTree for your research please cite: https://img.shields.io/badge/DOI-10.1139%2Fgen--2015--0175-blue.svg

Also please cite the external programs and the source of the HMMs (see Background section) if you use the default essential.hmm

If you like the tool consider voting for it on LabWorm.

See this file for instructions on how to reproduce results from our article.

Dependencies

Please note that some of the dependencies have their own Licenses.

Perl

To execute bcgTree perl5 is required along with the following modules (should be part of the core installation, but also available via cpan):

  • Getopt::Long
  • Pod::Usage
  • FindBin
  • File::Path
  • File::Spec

The following modules are included in this repo:

Java

To use the graphical user interface a Java Runtime Environment (JRE, tested with version 8) is required. The following extra frameworks are used (included):

External programs

bcgTree is a wrapper around multiple existing tools. The following external programs are called by bcgTree and have to be installed (prodigal is optional, it is only needed if you provide nucleotide sequences using --genome). The specified versions are the ones we used for testing (older versions might or might not work). Newer versions should work (otherwise feel free to open an issue).

  • hmmsearch (HMMER version 3.3) - Eddy et al, 2010. HMMER3: a new generation of sequence homology search software.
  • muscle (v3.8.1551) - Edgar et al, 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797. NOTE: please use 3.8. Version 5+ is not supported
  • Gblocks (version 0.91b) - Castresana et al, 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552.
  • RAxML (version 8.2.12) - Stamatakis et al, 2014. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313.
  • prodigal (version 2.6.3) - Hyatt et al, 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119

Additionally SeqFilter (Hackl et al, 2014. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004-3011) is needed but it is included in this repo.

Data

There are two hmm files included in the data directory. Both contain HMMs from TIGRFAM (LGPL) and PFAM (CC0). The ubcg.hmm file is reproduced from UBCG as described in Na, SI., Kim, Y.O., Yoon, SH. et al. J Microbiol. (2018) 56: 280.

Installation

You can download the latest release from here. After extraction you can start the graphical interface by double clicking the bcgTree.jar file in the bcgTreeGUI folder. Please remember to install the required external programms (see above). Alternatively you can start the perl script bcgTree.pl in the bin folder from the command line. Using the command line interface is the recommended way. You can also clone the bcgTree git repository by executing the following commands:

git clone https://github.com/molbiodiv/bcgTree.git
# Now you can run bcgTree.pl
bcgTree/bin/bcgTree.pl --help
# Or start the java GUI
cd bcgTree/bcgTreeGUI
java -jar bcgTree.jar

Usage

GUI

The graphical user interface is just a convenient way to call bcgTree. Usage is meant to be self-explanatory, any suggestions for improvement are welcome (please open an issue). In fact it adds no extra functionality it just collects the parameters and calls the perl script on the command line. The output is written to a text field as well as a log file (bcgTree.log in the output folder). In order to preserve replicability all parameters are written in a file called options.txt in the output folder. The internal call of bcgTree is then just:

bcgTree.pl @outdir/options.txt

If you feel comfortable using the command line, calling the perl script directly is still the recommendet way to use bcgTree. Nevertheless here are some screenshots of the GUI:

doc/screenshot0.png

doc/screenshot1.png

doc/screenshot2.png

Command line

Usage:
      $ bcgTree.pl [@configfile] --proteome bac1=bacterium1.pep.fa --proteome bac2=bacterium2.faa [options]

Options:
    [@configfile]            Optional path to a configfile with @ as prefix.
                             Config files consist of command line parameters
                             and arguments just as passed on the command
                             line. Space and comment lines are allowed (and
                             ignored). Spreading over multiple lines is
                             supported.

    --proteome <ORGANISM>=<FASTA> [--proteome <ORGANISM>=<FASTA> ..]
                             Multiple pairs of organism and proteomes as
                             peptide fasta file paths Attention: If you
                             provide a proteome and genome with the same
                             name, only the genome will be used.

    --genome <ORGANISM>=<FASTA> [--genome <ORGANISM>=<FASTA> ..]
                             Multiple pairs of organism and genomes as
                             nucleotide fasta file paths. Attention: If you
                             provide a proteome and genome with the same
                             name, only the genome will be used.

    [--outdir <STRING>]      output directory for the generated output files
                             (default: bcgTree)

    [--help]                 show help

    [--version]              show version number of bcgTree and exit

    [--check-external-programs]
                             Check if all of the required external programs
                             can be found and are executable, then exit.
                             Report table with program, status (ok or
                             !fail!) and path. If all external programs are
                             found exit code is 0 otherwise 1. Note that
                             this parameter does not check that the paths
                             belong to the actual programs, it only checks
                             that the given locations are executable files.

    [--hmmsearch-bin=<FILE>] Path to hmmsearch binary file. Default tries if
                             hmmsearch is in PATH;

    [--muscle-bin=<FILE>]    Path to muscle binary file. Default tries if
                             muscle is in PATH;

    [--gblocks-bin=<FILE>]   Path to the Gblocks binary file. Default tries
                             if Gblocks is in PATH;

    [--raxml-bin=<FILE>]     Path to the raxml binary file. Default tries if
                             raxmlHPC is in PATH;

    [--prodigal-bin=<FILE>]
        Path to the prodigal binary file. Default tries if prodigal is in
        PATH;

    [--threads=<INT>]
        Number of threads to be used (currently only relevant for raxml).
        Default: 2 From the raxml man page: PTHREADS VERSION ONLY! Specify
        the number of threads you want to run. Make sure to set "-T" to at
        most the number of CPUs you have on your machine, otherwise, there
        will be a huge performance decrease!

    [--bootstraps=<INT>]
        Number of bootstraps to be used (passed to raxml). Default: 100

    [--min-proteomes=<INT>]
        Minimum number of proteomes in which a gene must occur in order to
        be kept. Default: 2 All genes with less hits are discarded prior to
        the alignment step. This option is ignored if --all-proteomes is
        set.

    [--all-proteomes]
        Sets --min-proteomes to the total number of proteomes supplied.
        Default: not set All genes that do not hit all of the proteomes are
        discarded prior to the alignment step. If set --min-proteomes is
        ignored.

    [--hmmfile=<PATH>]
        Path to HMM file to be used for hmmsearch. Default:
        <bcgTreeDir>/data/essential.hmm

    [--raxml-x-rapidBootstrapRandomNumberSeed=<INT>]
        Random number seed for raxml (passed through as -x option to raxml).
        Default: Random number in range 1..1000000 (see raxml command in log
        file to find out the actual value). Note: you can abbreviate options
        (as long as they stay unique) so --raxml-x=12345 is equivalent to
        --raxml-x-rapidBootstrapRandomNumberSeed=12345

    [--raxml-p-parsimonyRandomSeed=<INT>]
        Random number seed for raxml (passed through as -p option to raxml).
        Default: Random number in range 1..1000000 (see raxml command in log
        file to find out the actual value). Note: you can abbreviate options
        (as long as they stay unique) so --raxml-p=12345 is equivalent to
        --raxml-p-parsimonyRandomSeed=12345

    [--raxml-aa-substitiution-model "<MODEL>"]
        The aminoacid substitution model used for the partitions by RAxML.
        Valid options for RAxML 8.x are: DAYHOFF, DCMUT, JTT, MTREV, WAG,
        RTREV, CPREV, VT, BLOSUM62, MTMAM, LG, MTART, MTZOA, PMB, HIVB,
        HIVW, JTTDCMUT, FLU, STMTREV, DUMMY, DUMMY2, AUTO, LG4M, LG4X,
        PROT_FILE, GTR_UNLINKED, GTR bcgTree will not check whether the
        provided option is valid but rather pass it to RAxML literally.
        Default: AUTO

    [--raxml-args "<ARGS>"]
        Arbitrary options to pass through to RAxML. The ARGS part should be
        in quotes and is appended to the RAxML command as given.

Results

The results all end up in the directory specified via –outdir (or bcgTree if none is specified). This folder contains lots of intermediate files from all steps. If the run was successful the most interesting files will be the RAxML files:

  • <outdir>/RAxML_bestTree.final
  • <outdir>/RAxML_bipartitionsBranchLabels.final
  • <outdir>/RAxML_bipartitions.final
  • <outdir>/RAxML_bootstrap.final
  • <outdir>/RAxML_info.final

Further the log file (<outdir>/bcgTree.log) contains all executed commands and their output. This is useful as a reference, for re-executing steps manually and for debugging in case something went wrong. All other files are the outputs of different steps of the pipeline. Their names should be self-explanatory.

Background

107 essential genes as described in: Dupont CL, Rusch DB, Yooseph S, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. The ISME Journal. 2012;6(6):1186-1199. doi:10.1038/ismej.2011.189. Supplementary Table S1 (which is actually an image) contains a list of the used genes and HMMs with cut-offs.

From the manuscript: “Genome completeness estimates Using the Comprehensive Microbial Resource as a database, 107 hidden Markov models (HMMs) that hit only one gene in greater than 95% of bacterial genomes were identified (Supplementary Table S1). Trusted cutoff scores for the TIGRFAMs and Pfam HMMs were those supplied by the TIGRFAMs and Pfam libraries (Haft et al., 2003; Finn et al., 2010).”

In the publication: M Albertsen, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW and Nielsen PH, Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology 31, 533–538 (2013) doi:10.1038/nbt.2579 the authors use the same list of 107 genes (111 HMMs, glyS, pheT, proS and rpoC have two HMMs each) as above and provide a readily created hmm file via GitHub. This file has been used as a starting point but an error had to be fixed.

Logo

The logo has been designed by Markus J. Ankenbrand and Alexander Keller. Cliparts from openclipart.org have been used:

The font is from fontlibrary.org:

Related Tools

These are some tools with similar goals and approaches. If you know another one, please open a pull request to add it to the list.

  • PO_2_MLSA - by @jvollme - Enables the calculation of phylogenies based on single copy core genome gene products, based on bidirectional BLAST results obtained with proteinortho.
  • UBCG - by Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol 56. Paper

Changes

https://github.com/molbiodiv/bcgTree/actions/workflows/test_perl.yml/badge.svg?branch=master

v1.2.1 <2024-01-03>

  • Fix issue in GUI if no proteome is provided (#50)

v1.2.0 <2021-10-27>

  • Add parameter --genome with translation via prodigal (#15)
  • Update and improve documentation (#19, #35, #38, #44)
  • Make GUI independent of working directory (#33)
  • Switch CI from Travis to GitHub Actions (#45)

v1.1.0 <2018-07-19>

  • Breaking: the default aa substitution model for RAxML changed from WAG to AUTO. This has an impact on performance (it is faster to set this parameter to a fixed value). To get the same behaviour as in earlier versions pass --raxml-aa-substitution-model=WAG
  • Add parameter --raxml-aa-substitution-model (#29)
  • Add HMMs of UBCG (#25)

v1.0.10 <2017-03-07>

  • Fix GUI, add scrollbar (#23)
  • Add parameter –raxml-args (#22)

v1.0.9 <2017-03-03>

  • Add parameters –min-proteomes and –all-proteomes (#21)

v1.0.8 <2016-09-07>

  • Set default bootstraps to 100
  • Add description for reproduction of results in paper

v1.0.7 <2016-06-16>

  • Add logo to GUI

v1.0.6 <2016-03-17>

  • Improve layout (avoid errors with large text fields)
  • Update jar file

v1.0.5 <2016-03-17>

  • Add advanced settings and external programs to GUI
  • Add GUI screenshots to README
  • Finish GUI layout
  • Fix outdir bug (manually entered text was ignored)
  • Update documentation in README
  • Improve layout of GUI (proteomes panel)

v1.0.4 <2016-02-23>

  • Add parameter to check external programs
  • Fix SeqFilter dependencies
  • Add swingx and own accordion element for GUI
  • Improve GUI design (GridBagLayout)

v1.0.3 <2016-02-23>

  • Add log4perl and Getopt::ArgvFile to package (simplify installation)

v1.0.2 <2016-02-22>

  • Remove Bioperl dependency
  • Add submodules directly (SeqFilter)
  • Update documentation

v1.0.1 <2016-02-22>

  • Add java GUI