Tutorial

BenoitMorel edited this page Jul 23, 2022 · 11 revisions

Installing GeneRax

Open a terminal and install GeneRax:

git clone --recursive  https://github.com/BenoitMorel/GeneRax.git
cd GeneRax
./install.sh

You should obtain a generax executable under build/bin. Then add GeneRax to your path: either copy the executable to a directory that is already in your path, or append the absolute path to build/bin to your path. For instance, I had to add the following line to my ~/.bashrc file and reopen a terminal:

# WARNING!! You need to edit the following path
export PATH="$PATH:/home/benoit/github/GeneRax/build/bin"

Check that GeneRax was correctly installed:

generax --help

This should display a short help message. To get more help, check the wiki.
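If generax --help fails with a "command not found" error, the executable is probably not in your path. Here is a minimal sketch of a check, assuming a bash shell; check_on_path is a helper name I made up for this example, it is not part of GeneRax:

```shell
# Report whether a command is reachable through PATH.
# check_on_path is a hypothetical helper name, not a GeneRax tool.
check_on_path() {
  local exe="$1"
  if command -v "$exe" >/dev/null 2>&1; then
    echo "$exe found at: $(command -v "$exe")"
  else
    echo "$exe not found; check that build/bin is in your PATH" >&2
    return 1
  fi
}

# Does not abort the shell if generax is missing.
check_on_path generax || true
```

If the check reports that generax is missing, go back to the export line above and make sure the path points to your own build/bin directory.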

Reconciliation with GeneRax (without gene tree correction)

We will now reconcile two gene trees with a species tree. We will use gene trees that were inferred from gene MSAs with RAxML-NG. From the root of the GeneRax repository, type:

generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix no_correction --strategy EVAL
  • The file families_plants.txt contains information about the gene families (the gene trees, the gene-species mappings, and the substitution model to use).
  • The file speciesTree.newick is a plant species tree. Note that the species tree should be rooted and binary.
  • UndatedDL is the model of gene tree evolution. It allows duplications and losses, but no HGT. To allow HGT, replace it with UndatedDTL (which is the default model).
  • EVAL is the gene tree search strategy. Here, we do not want to optimize the gene tree topology. We only want to reconcile the gene tree with the species tree.
  • no_correction is the output directory. This directory will be created. Do not forget to remove it if you want to rerun GeneRax from scratch.
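Since the output directory has to be removed before a fresh run, it can be convenient to wrap both steps in a small script. This is only a sketch, assuming bash and that you run it from the root of the GeneRax repository; reset_prefix is a hypothetical helper name and the PREFIX variable is introduced for illustration:

```shell
# Output directory of the run (same value as --prefix in the command above).
PREFIX="no_correction"

# Remove a previous output directory so that GeneRax starts from scratch.
# reset_prefix is a hypothetical helper name, not a GeneRax tool.
reset_prefix() {
  local dir="$1"
  if [ -d "$dir" ]; then
    rm -r -- "$dir"
  fi
}

# Only run when generax is actually installed.
if command -v generax >/dev/null 2>&1; then
  reset_prefix "$PREFIX"
  generax --families examples/gene_tree_correction/families_plants.txt \
          --species-tree data/plants/species_trees/speciesTree.newick \
          --rec-model UndatedDL \
          --prefix "$PREFIX" \
          --strategy EVAL
fi
```

Be careful with the value of PREFIX: the whole directory is deleted before the run.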

We can now check the number of inferred events in the file no_correction/reconciliations/Phy003AED5_CUCME_eventCounts.txt.

  • S is the number of speciations for which both lineages survived
  • SL is the number of speciations for which one of the two lineages went extinct
  • D is the number of duplications
  • T is the number of transfers for which both child lineages survived
  • TL is the number of transfers for which the gene went extinct in the origin species
  • Leaf is the number of leaves in the gene tree
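To look at the event counts of all families at once, a small loop is enough. The sketch below assumes bash and only relies on the file naming shown above (one <family>_eventCounts.txt file per family under the reconciliations directory); it prints the files as they are and makes no assumption about their internal layout. show_event_counts is a hypothetical helper name:

```shell
# Print every *_eventCounts.txt file found under <prefix>/reconciliations.
# show_event_counts is a hypothetical helper name, not a GeneRax tool.
show_event_counts() {
  local prefix="$1"
  local f
  for f in "$prefix"/reconciliations/*_eventCounts.txt; do
    [ -e "$f" ] || continue   # the glob matched nothing
    echo "== $(basename "$f" _eventCounts.txt) =="
    cat "$f"
  done
}

show_event_counts no_correction
```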

That's quite a lot of duplications and losses... We can also take a closer look by visualizing the reconciliation. Open the reconciliation file with either ThirdKind or RecPhyloVisu: no_correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml

If their web servers are down (which happens quite often) and you don't have the time to install ThirdKind, you can also have a look at the ThirdKind output that I generated here.

Why do we observe so many duplications and losses?

Gene tree correction and reconciliation

Gene MSAs are often too short to correctly resolve the gene trees. As a result, the gene trees inferred from the MSAs are often inaccurate. Most of the time, this causes reconciliation tools to overestimate the number of duplication, loss, and transfer events. Indeed, adding "artificial" gene events is the only way to make the (wrong) gene trees and the species tree compatible with each other.

To solve this problem, we have to perform gene tree correction first. GeneRax uses a joint model of sequence and gene evolution to infer accurate gene trees from the MSAs and the species tree. For more information about the method, please read our manuscript. Run the following command (if you have MPI installed, you can also parallelize with mpiexec):

generax --families examples/gene_tree_correction/families_plants.txt --species-tree data/plants/species_trees/speciesTree.newick --rec-model UndatedDL --prefix correction --strategy SPR
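For the MPI variant, the same command is simply prefixed with mpiexec. The sketch below assumes bash, that your generax executable was built with MPI support, and that 4 processes is a sensible number for your machine (CORES is a value I picked for illustration). launcher_prefix is a hypothetical helper name that falls back to a plain sequential run when mpiexec is not available:

```shell
# Number of MPI processes: an assumed value, adapt it to your machine.
CORES=4

# Print "mpiexec -np N" when mpiexec is available, and nothing otherwise.
# launcher_prefix is a hypothetical helper name, not a GeneRax tool.
launcher_prefix() {
  if command -v mpiexec >/dev/null 2>&1; then
    echo "mpiexec -np $1"
  fi
}

# Only run when generax is actually installed.
if command -v generax >/dev/null 2>&1; then
  # The unquoted substitution is intentional: it expands to the launcher
  # words, or to nothing for a sequential run.
  $(launcher_prefix "$CORES") generax \
      --families examples/gene_tree_correction/families_plants.txt \
      --species-tree data/plants/species_trees/speciesTree.newick \
      --rec-model UndatedDL \
      --prefix correction \
      --strategy SPR
fi
```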

We have changed the strategy (SPR instead of EVAL) and the output directory name. The gene tree correction is a tree search algorithm and takes more time than the reconciliation alone (40 seconds on my laptop). The new reconciliation file is: correction/reconciliations/Phy003AED5_CUCME_reconciliated.xml. I obtain the following reconciliation: https://github.com/BenoitMorel/GeneRax/blob/master/data/pictures/rec_correction.svg. Note that the number of duplications and losses is substantially smaller.
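To compare the event counts of the two runs without opening the files one by one, you can print them side by side. This is a sketch, assuming bash and that both runs wrote a <family>_eventCounts.txt file under their reconciliations directory, as in the first run above; compare_event_counts is a hypothetical helper name:

```shell
# Print the event counts of one family for two runs, side by side.
# compare_event_counts is a hypothetical helper name, not a GeneRax tool.
compare_event_counts() {
  local family="$1" before="$2" after="$3"
  local a="$before/reconciliations/${family}_eventCounts.txt"
  local b="$after/reconciliations/${family}_eventCounts.txt"
  if [ ! -f "$a" ] || [ ! -f "$b" ]; then
    echo "missing event count file for $family" >&2
    return 1
  fi
  echo "$before | $after"
  # Line i of the first file is joined with line i of the second file.
  paste -d '|' "$a" "$b"
}

# Does not abort the shell if one of the runs is missing.
compare_event_counts Phy003AED5_CUCME no_correction correction || true
```

The left column is the run without correction and the right column is the run with correction.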
