-
Notifications
You must be signed in to change notification settings - Fork 13
Tutorial
Open a terminal and install GeneRax:
git clone --recursive https://github.com/BenoitMorel/GeneRax.git
cd GeneRax
./install.sh
You should obtain a generax executable under build/bin.
Then add GeneRax to your path: either copy the executable to a directory that is already in your path or append the absolute path to build/bin to your path. For instance, I had to add the following line to my ~/.bashrc file and to reopen a terminal:
# WARNING!! You need to edit the following path
export PATH="$PATH:/home/benoit/github/GeneRax/build/bin"
Check that GeneRax was correctly installed:
generax --help
This should display a short help message. To get more help, check the wiki.
Thirdkind is a tool that takes as input a reconciliation file and that outputs a SVG file to visualize the reconciliation. They usually have a web server, but unfortunately it seems to be unavailable, so you will have to install the tool on your machine. Please follow the instructions on their github page
If for some reason you can't install Thirdkind (it's a very recent tool and it has several dependencies), don't waste too much time and just continue the tutorial. I provide the expected output SVG files in the tutorial.
First of all, go to the root of the GeneRax repository, and make sure that you have all the data that we will use in the tutorial:
git pull
We will now reconcile two gene trees with a species tree. We will us gene trees that were inferred from gene MSAs with RAxML-NG. From the root of the GeneRax repository, type:
generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix no_correction --strategy EVAL
- The file families_plants.txt contains information about the gene families (the gene trees, the gene-species mappings, and the substitution model to use).
- The file speciesTree.newick is a plant species tree. Note that the species tree should be rooted and binary.
-
UndatedDLis the model of gene tree evolution, that allows duplications and losses (no HGT). To allow HGT, you need to replace it withUndatedDTL(which is the default model). Here, we don't expect HGT so we disable it. -
EVALis the gene tree search strategy. Here, we do not want to optimize the gene tree topology. We only want to reconcile the gene tree with the species tree. -
no_correctionis the output directory. This directory will be created. Do not forget to remove it if you want to rerun GeneRax from scratch.
We can now check the number of inferred events in the file no_correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt.
- S is the number of speciations for which both lineages survived
- SL is the number of speciations of which one of the two lineages went extinct
- D is the number of duplications
- T is the number of transfers for which both child lineages survived
- TL is the number of transfers for which the gene went extinct in the origin species.
- Leaf is the number of leaves in the gene tree
That's quite a lot of duplication and losses... We can also have a closer look by visualizing the reconciliation. Open the reconciliation file with either ThirdKind or RecPhyloVisu. For instance:
thirdkind -f no_correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml -o no_correction.svg
This will produce an SVG file no_correction.svg that can be opened with any image viewer.
If their web servers are down (which happens quite often) and if you don't have the time to install thirdkind, you can also have a look at the thirdkind output that I generated here.
Why do we observe so many duplications and losses? What is the maximum number of ancestral gene copies? If we wanted to estimate the ancestral genome sizes with this method, what would happen?
Gene MSAs are often too short to correctly resolve the gene trees. As a result, the gene trees inferred from the MSAs are often inaccurate. Most of the time, this causes reconciliation tools to overestimate the number of duplication, loss, and transfer events. Indeed, adding "artificial" gene events is the only way for making the (wrong) gene trees and the species tree compatible with each other.
To solve this problem, we have to perform gene tree correction first. GeneRax uses a joint model of sequence and gene evolution to infer accurate gene trees from the MSAs and the species tree. For more information about the method, please read our manuscript. Run the following command (if you have MPI installed, you can also parallelize with mpiexec):
generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix correction --strategy SPR
We have changed the strategy (SPR instead of EVAL) and the output directory name. The gene tree correction is a tree search algorithm and takes more time than the reconciliation alone (40sec on my laptop). The new reconciliation file is: correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml. I obtain the following reconciliation.
What is the effect on the number of gene duplications and losses? Does it look more plausible?
We can also use the software seaview to perform gene tree reconstruction while taking into account a species tree. We first load the alignment into seaview when launching it:
seaview data/plants/families/Phy003AEDB_CUCME/alignment.msa &
In the graphical interface, we click on the "Trees" menu, then PhyML. In the popup window, we can keep the default options or decide to change them, and hit the "Run" button. When the computation is finished, we click "OK". That creates a new tree window where the resulting tree is plotted. We can then click on the "Reconcile" button. In the popup window, we select the species tree and load data/plants/species_trees/speciesTree.newick. Here there are several options to the reconciliation algorithm. One in particular is the Branch support threshold: this specifies which bipartitions in the gene tree we don't trust and are willing to improve using the information coming from the species tree. If it is set to -1.0, obviously no bipartition will be improved; if we set it to e.g. 0.9, all branches with support <=0.8 will be re-evaluated in the light of the species tree. Feel free to play around with the threshold and see how the resulting scenario of duplications and losses changes. The gene tree in a reconciled format is plotted by seaview, and can be saved as a svg picture or as a xml file for further plotting with ThirdKind.