Large datasets

WARNING: THIS PAGE IS NOT FINISHED YET! (23.05.2023)

This page discusses the options you have when dealing with a dataset that is too large to be analyzed in a reasonable amount of time with GeneRax. In short, you can:

Reduce the number of search steps with the --max-spr-radius parameter
Filter out gene families
Reduce the number of species
Run GeneRax on a large cluster

Your choice will depend on your dataset and on which compromises you can accept. This page will help you make that choice.

Runtime estimation

Don't waste too much time analyzing a dataset that is far too large! There is no way to accurately predict the runtime, because GeneRax implements a search heuristic (which may or may not converge quickly). The best way to get a rough estimate is to drastically down-sample your dataset and run the analysis on this smaller dataset. If that analysis is already too slow, GeneRax won't be able to handle the whole dataset. If GeneRax runs fast, try again with a larger subset of your dataset.
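The down-sampling step can be scripted. The following is a minimal sketch, assuming that the families file starts with a header line and then lists each family as a block whose first line has the form "- name" (check this against your own file). The function names and the example file content are hypothetical and are not part of GeneRax.

```python
import random


def split_family_blocks(lines):
    """Split families-file lines into a header and one block of lines per family."""
    header, blocks = [], []
    for line in lines:
        if line.startswith("- "):
            blocks.append([line])      # assumption: "- name" starts a new family
        elif blocks:
            blocks[-1].append(line)    # line belongs to the current family
        else:
            header.append(line)        # anything before the first family
    return header, blocks


def subsample_families(lines, fraction, seed=0):
    """Keep a random fraction of the families, preserving their original order."""
    header, blocks = split_family_blocks(lines)
    n_keep = min(len(blocks), max(1, round(len(blocks) * fraction)))
    rng = random.Random(seed)          # fixed seed so the subset is reproducible
    kept = sorted(rng.sample(range(len(blocks)), n_keep))
    out = list(header)
    for i in kept:
        out.extend(blocks[i])
    return out


# Hypothetical families file with four families.
example = [
    "[FAMILIES]",
    "- family_1",
    "alignment = family_1.fasta",
    "- family_2",
    "alignment = family_2.fasta",
    "- family_3",
    "alignment = family_3.fasta",
    "- family_4",
    "alignment = family_4.fasta",
]
small = subsample_families(example, fraction=0.5)
print("\n".join(small))
```

Start with a small fraction, time the run, and increase the fraction only if the run finishes quickly.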

Dataset dimensions and runtime

The runtime depends on:

The number of species
The number of gene families
The number of sequences in each family (= size of the gene trees)
The number of sites (columns) in your alignment
The models

For most steps of the algorithm, GeneRax treats each gene family independently (and totally independently if you use the option --per-family-rates).

For a given gene family, GeneRax spends most of its runtime evaluating the joint likelihood score. The runtime depends on how many times the likelihood has to be computed, and on the time required for one likelihood evaluation.

The number of times the likelihood is evaluated should be roughly linear in the number of sequences in the alignment (note that this is a very rough approximation!)
The time spent in one likelihood evaluation is split between the phylogenetic and the reconciliation likelihood scores.
- Phylogenetic likelihood: its evaluation is linear in the number of sites (columns) times the number of sequences. It also depends on the substitution model. For instance, GTR+G should be 4 times slower than GTR. PROTGTR is really not recommended (it is very slow and does not make much sense for a gene alignment).
- Reconciliation likelihood: its evaluation is linear in the number of sequences (the size of the gene tree) times the size of the species tree. The reconciliation model also has an impact (UndatedDL should be faster than UndatedDTL).
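The scaling rules above can be combined into a toy cost model for comparing dataset sizes. This is only a sketch: it returns a unitless number, not a runtime, and the function name, the model_factor parameter and the reconciliation_weight constant are assumptions made for illustration, not values taken from GeneRax.

```python
def relative_family_cost(n_sequences, n_sites, n_species,
                         model_factor=1.0, reconciliation_weight=1.0):
    """Unitless cost estimate for one gene family (for comparisons only).

    model_factor: relative cost of the substitution model, e.g. 1 for GTR
        and 4 for GTR+G, following the "4 times slower" rule of thumb.
    reconciliation_weight: hypothetical constant weighting one reconciliation
        likelihood evaluation against one phylogenetic likelihood evaluation.
    """
    # Number of likelihood evaluations: roughly linear in the number of sequences.
    n_evaluations = n_sequences
    # Phylogenetic likelihood: sites x sequences, scaled by the model.
    phylogenetic = model_factor * n_sites * n_sequences
    # Reconciliation likelihood: gene tree size x species tree size.
    reconciliation = reconciliation_weight * n_sequences * n_species
    return n_evaluations * (phylogenetic + reconciliation)


# Doubling the number of sequences quadruples the estimated cost.
ratio = relative_family_cost(200, 500, 50) / relative_family_cost(100, 500, 50)
print(ratio)
```

Only ratios between two calls are meaningful, and even those inherit the "very rough approximation" caveat stated above.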

Changing the search radius

TODO WRITE

Reducing the number of species

TODO: FINISH WRITING

The number of species has a large impact on the runtime, because it affects both the size of the species tree and the size of the gene trees (if you remove a species from the analysis, you also have to remove its sequences). In theory, the runtime could grow cubically with the number of species (O(species^3)).
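To get a feeling for what cubic growth means in practice, here is a small sketch that computes the theoretical speedup from removing species. It assumes the O(species^3) worst case holds exactly, so treat the result as an optimistic bound rather than a prediction; the function name is hypothetical.

```python
def max_speedup_from_species_reduction(n_species_before, n_species_after):
    """Speedup factor if the runtime scaled exactly as species^3."""
    return (n_species_before / n_species_after) ** 3


# Keeping 50 species out of 100 gives at most an 8x speedup under this assumption.
speedup = max_speedup_from_species_reduction(100, 50)
print(speedup)
```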

Reducing the number of gene families

TODO: FINISH WRITING

The effect of the number of gene families is easier to predict because, in the slowest step (the tree search itself), the gene families are analyzed independently. So if you randomly remove half of the gene families, the run should be about twice as fast. The speedup depends on the size of the

Parallelize computations

TODO FINISH WRITING
