Large datasets
WARNING: I HAVEN'T FINISHED WRITING THIS PAGE! (23.05.2023)
This page discusses the options you have when dealing with a dataset that is too large to be analyzed in a reasonable amount of time with GeneRax. In short, you can:
- Reduce the number of search steps with the --max-spr-radius parameter
- Filter out gene families
- Reduce the number of species
- Run GeneRax on a large cluster
Your choice will depend on your dataset and on which compromises you can make. This page will help you make this choice...
Don't waste too much time analyzing a dataset that is way too large! There is no way to accurately predict the runtime, because GeneRax implements a search heuristic (which might or might not converge quickly). The best way to get a rough estimate is to drastically down-sample your dataset and run the analysis on this smaller dataset. If such an analysis is already too slow, then GeneRax won't be able to handle the whole dataset. If GeneRax runs fast, try again with a larger subset of your dataset.
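One way to down-sample by gene families is sketched below. This is only an illustration: it assumes a families file that starts with a `[FAMILIES]` header, where each family is introduced by a line such as `- family_name` followed by `key = value` lines. The helper names `parse_families` and `subsample_families` are hypothetical and not part of GeneRax.

```python
import random


def parse_families(text):
    """Split a families file into (name, lines) blocks, one per family."""
    families = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "[FAMILIES]":
            continue
        if stripped.startswith("- "):
            # A line such as "- family_1" opens a new family block.
            families.append((stripped[2:].strip(), []))
        elif families:
            # Any other line ("alignment = ...") belongs to the current family.
            families[-1][1].append(stripped)
    return families


def subsample_families(text, fraction, seed=42):
    """Return a families file that keeps a random fraction of the families."""
    families = parse_families(text)
    # Keep at least one family, but never more than are available.
    keep = min(len(families), max(1, round(len(families) * fraction)))
    rng = random.Random(seed)
    # Sample indices and sort them so the original order is preserved.
    chosen = sorted(rng.sample(range(len(families)), keep))
    out = ["[FAMILIES]"]
    for index in chosen:
        name, lines = families[index]
        out.append(f"- {name}")
        out.extend(lines)
    return "\n".join(out) + "\n"
```

Start with a small fraction (for instance 0.01), time the run, and increase the fraction only if the run finishes quickly.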
The runtime depends on:
- The number of species
- The number of gene families
- The number of sequences in each family (= size of the gene trees)
- The number of sites (columns) in your alignment
- The models
For most steps of the algorithm, GeneRax treats each gene family independently (and totally independently if you use the option --per-family-rates).
For a given gene family, GeneRax spends most of its runtime evaluating the joint likelihood score. The runtime depends on how many times the likelihood has to be computed, and on the time required for one likelihood evaluation.
- The number of times the likelihood is evaluated should be roughly linear in the number of sequences in the alignment (note that this is a very rough approximation!)
- The time spent in one likelihood evaluation is split between the phylogenetic and the reconciliation likelihood scores.
  - Phylogenetic likelihood: its evaluation is linear in the number of sites (columns) times the number of sequences. It also depends on the substitution model. For instance, GTR+G should be 4 times slower than GTR. PROTGTR is really not recommended (it is very slow and does not make much sense for a gene alignment).
  - Reconciliation likelihood: its evaluation is linear in the number of sequences (the size of the gene tree) times the size of the species tree (the number of species). The reconciliation model also has an impact (UndatedDL should be faster than UndatedDTL).
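The scaling rules above can be turned into a rough, unitless cost model for comparing configurations of one gene family. This is a sketch under strong assumptions: the function name, the `subst_factor` and `rec_factor` parameters, and the equal weighting of the two likelihood terms are all illustrative choices, not something GeneRax reports. The result is not a runtime in seconds.

```python
def relative_family_cost(n_sequences, n_sites, n_species,
                         subst_factor=1.0, rec_factor=1.0):
    """Unitless cost estimate for one gene family (not seconds)."""
    # Number of likelihood evaluations: roughly linear in the number of sequences.
    n_evaluations = n_sequences
    # Phylogenetic likelihood: sites x sequences, scaled by the substitution model.
    phylo = subst_factor * n_sites * n_sequences
    # Reconciliation likelihood: sequences x species, scaled by the reconciliation model.
    rec = rec_factor * n_sequences * n_species
    return n_evaluations * (phylo + rec)


# Compare GTR (factor 1) with GTR+G (assumed factor 4) on the same family.
gtr = relative_family_cost(n_sequences=10, n_sites=100, n_species=5)
gtr_g = relative_family_cost(n_sequences=10, n_sites=100, n_species=5,
                             subst_factor=4.0)
print(gtr_g / gtr)
```

Only ratios between two calls are meaningful, and even those are very rough, because the true constants in front of each term are unknown.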
TODO WRITE
TODO: FINISH WRITING
The number of species has a great impact on the runtime, because it affects both the size of the species tree and the size of the gene trees (if you remove a species from the analysis, you also have to remove its sequences...). In theory, the runtime could grow cubically with the number of species (O(species^3)).
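To get a feeling for what cubic growth means in practice, here is a minimal sketch that computes the expected relative runtime after keeping only a fraction of the species. It assumes the theoretical worst case (exponent 3) really applies, which may not hold for your dataset. The function name is hypothetical.

```python
def relative_runtime(kept_fraction, exponent=3):
    """Runtime relative to the full dataset under a power-law scaling."""
    return kept_fraction ** exponent


# Keeping half of the species under cubic scaling.
print(relative_runtime(0.5))
# Keeping a quarter of the species under cubic scaling.
print(relative_runtime(0.25))
```

Under this assumption, halving the number of species would cut the runtime to about one eighth, so removing species is a much stronger lever than it first appears.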
TODO: FINISH WRITING
The effect of the number of gene families is easier to predict, because in the slowest step (the tree search itself), the gene families are analyzed independently. So if you randomly remove half of the gene families, the run should be about twice as fast. The speedup depends on the size of the
TODO FINISH WRITING