Tutorial
Open a terminal and install GeneRax:
git clone --recursive https://github.com/BenoitMorel/GeneRax.git
cd GeneRax
./install.sh
You should obtain a generax executable under build/bin.
Then add GeneRax to your path: either copy the executable to a directory that is already in your path, or append the absolute path of build/bin to your path. For instance, I had to add the following line to my ~/.bashrc file and reopen a terminal:
# WARNING!! You need to edit the following path
export PATH="$PATH:/home/benoit/github/GeneRax/build/bin"
Check that GeneRax was correctly installed:
generax --help
This should display a short help message. To get more help, check the wiki.
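If the shell answers "command not found" instead, the executable is most likely not reachable through your path. The snippet below is a minimal sketch of how to check this; check_tool is a hypothetical helper name (it is not part of GeneRax), and it only relies on the standard `command -v` lookup.

```shell
# check_tool NAME
# Hypothetical helper (not shipped with GeneRax): reports whether the
# command NAME can be found through PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found in PATH" >&2
        return 1
    fi
}
```

For instance, `check_tool generax` should print the location of the executable once your path is set correctly.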
Thirdkind is a tool that takes a reconciliation file as input and outputs an SVG file to visualize the reconciliation. You can either use its webserver or install it on your machine. In this tutorial, I assume that you installed it on your machine, but feel free to use the webserver instead.
First of all, go to the root of the GeneRax repository, and make sure that you have all the data that we will use in the tutorial:
git pull
We will now reconcile two gene trees with a species tree. We will use gene trees that were inferred from gene MSAs with RAxML-NG. From the root of the GeneRax repository, type:
generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix no_correction --strategy EVAL
- The file families_plants.txt contains information about the gene families (the gene trees, the gene-species mappings, and the substitution model to use).
- The file speciesTree.newick is a plant species tree. Note that the species tree should be rooted and binary.
- UndatedDL is the model of gene tree evolution; it allows duplications and losses (no HGT). To allow HGT, you need to replace it with UndatedDTL (which is the default model). Here, we don't expect HGT, so we disable it.
- EVAL is the gene tree search strategy. Here, we do not want to optimize the gene tree topology. We only want to reconcile the gene tree with the species tree.
- no_correction is the output directory. This directory will be created. Do not forget to remove it if you want to rerun GeneRax from scratch.
We can now check the number of inferred events in the file no_correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt.
- S is the number of speciations for which both lineages survived
- SL is the number of speciations for which one of the two lineages went extinct
- D is the number of duplications
- T is the number of transfers for which both child lineages survived
- TL is the number of transfers for which the gene went extinct in the origin species
- Leaf is the number of leaves in the gene tree
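To read a single value from the event-count file without opening it, something like the following sketch may help. event_count is a hypothetical helper, and the file layout (one "LABEL:COUNT" pair per line) is an assumption: have a look at your own file first and adapt the separator if it differs.

```shell
# event_count FILE LABEL
# Hypothetical helper: prints the count stored for LABEL
# (S, SL, D, T, TL or Leaf) in FILE, and fails if LABEL is absent.
# ASSUMPTION: each line holds one "LABEL:COUNT" pair, for example "D:12".
event_count() {
    awk -F':' -v label="$2" '
        NF < 2 { next }                        # skip blank or malformed lines
        { gsub(/[[:space:]]/, "", $1); gsub(/[[:space:]]/, "", $2) }
        $1 == label { print $2; found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

For instance, `event_count no_correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt D` would print the number of duplications.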
That's quite a lot of duplications and losses... We can also have a closer look by visualizing the reconciliation. Open the reconciliation file with either Thirdkind or RecPhyloVisu. For instance:
thirdkind -f no_correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml -o no_correction.svg
This will produce an SVG file no_correction.svg that can be opened with any image viewer.
You can also have a look at the Thirdkind output that I generated here.
Why do we observe so many duplications and losses? What is the maximum number of ancestral gene copies? Do you think that it is overestimated or underestimated?
Gene MSAs are often too short to correctly resolve the gene trees. As a result, the gene trees inferred from the MSAs are often inaccurate. Most of the time, this causes reconciliation tools to overestimate the number of duplication, loss, and transfer events, as well as the ancestral gene content size. Indeed, adding "artificial" gene events is the only way to make the (wrong) gene trees and the species tree compatible with each other.
To solve this problem, we have to perform gene tree correction first. GeneRax uses a joint model of sequence and gene evolution to infer accurate gene trees from the MSAs and the species tree. For more information about the method, please read our manuscript. Run the following command (if you have MPI installed, you can also parallelize by prefixing the command with mpiexec -np 4, replacing 4 with the number of cores that you have on your machine):
generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix correction --strategy SPR
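For the MPI variant, the launcher simply goes in front of the same command. The sketch below only prints the command line so that you can inspect it before running it; correction_command is a hypothetical helper name, and the only MPI option it uses is `mpiexec -np`.

```shell
# correction_command [CORES]
# Hypothetical helper: prints the gene tree correction command, prefixed
# with "mpiexec -np CORES" when CORES is greater than 1.
correction_command() {
    cores="${1:-1}"
    prefix=""
    if [ "$cores" -gt 1 ]; then
        prefix="mpiexec -np $cores "
    fi
    echo "${prefix}generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix correction --strategy SPR"
}
```

Running `correction_command 4` prints the command for four cores; you can then copy it into your terminal from the root of the GeneRax repository.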
We have changed the strategy (SPR instead of EVAL) and the output directory name. Gene tree correction is a tree search algorithm and takes more time than the reconciliation alone (40 seconds on my laptop). The new reconciliation file is: correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml. I obtain the following reconciliation.
What is the effect on the number of gene duplications and losses? Does it look more plausible?
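To put numbers on this, the event counts of the two runs can be printed side by side. The following is a sketch under two assumptions: that the corrected run writes an event-count file with the same name under correction/reconciliations, and that both files hold one "LABEL:COUNT" pair per line. compare_events is a hypothetical helper name.

```shell
# compare_events BEFORE AFTER
# Hypothetical helper: prints one tab-separated line per event label,
# with the count found in BEFORE followed by the count found in AFTER.
# ASSUMPTION: both files hold one "LABEL:COUNT" pair per line.
compare_events() {
    awk -F':' '
        NF < 2 { next }                        # skip blank or malformed lines
        { gsub(/[[:space:]]/, "", $1); gsub(/[[:space:]]/, "", $2) }
        NR == FNR { before[$1] = $2; order[++n] = $1; next }
        { after[$1] = $2 }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%s\t%s\n", order[i], before[order[i]], after[order[i]]
        }
    ' "$1" "$2"
}
```

For instance: `compare_events no_correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt`.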
You can also try to run GeneRax on the cyanobacteria dataset. You will need to create a family file describing the single gene family in this dataset and to provide the corresponding species tree. Do not forget to replace the UndatedDL model with the UndatedDTL model.
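A family file for a single family could be written with a small helper like the one sketched below. The layout (a [FAMILIES] header, then one "- name" line followed by key = value lines) is my assumption of the format; compare the result with examples/gene_tree_correction/families_plants.txt, which is the authoritative example. write_family_file is a hypothetical helper name.

```shell
# write_family_file OUT NAME TREE ALIGNMENT MAPPING MODEL
# Hypothetical helper: writes a one-family file to OUT.
# ASSUMPTION: the keys and layout mirror families_plants.txt; check that
# file and adjust the keys if they differ.
write_family_file() {
    {
        echo "[FAMILIES]"
        echo "- $2"
        echo "starting_gene_tree = $3"
        echo "alignment = $4"
        echo "mapping = $5"
        echo "subst_model = $6"
    } > "$1"
}
```

A call would look like `write_family_file families_cyano.txt my_family path/to/gene_tree.newick path/to/alignment.fasta path/to/mapping.link GTR+G`, where every argument is a placeholder to replace with the actual files and substitution model of the cyanobacteria dataset.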