Generating Genomes

AADavin edited this page Mar 16, 2018 · 35 revisions

Generating a genome (G)

To generate a genome, you must first simulate a species tree with simuLyon. It is not currently possible to input an externally computed tree, although this is planned for future versions.

To simulate genomes, simuLyon starts with an ancestral genome at the root, containing a given number of genes. In the current version, every gene family present in this ancestral genome has a single copy, so it contains no duplicated genes.

A genome is an ordered collection of genes. A genome that starts with 5 genes looks something like this:

Position Gene_family Orientation Id
0 1 + 1
1 2 - 1
2 3 + 1
3 4 + 1
4 5 - 1

The columns mean:

  • Position: The position in the genome. The genome is circular, so position 4 is adjacent to positions 3 and 0
  • Gene_family: The identifier of the gene family
  • Orientation: The orientation of that gene in the genome
  • Id: The identifier of the gene.
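As a minimal sketch of this representation, the five-gene genome from the table can be held in a Python list whose index plays the role of Position. The names `Gene` and `neighbors` are hypothetical and chosen for illustration only; they are not taken from simuLyon's code.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """One row of the genome table (hypothetical structure, for illustration)."""
    family: int       # Gene_family
    orientation: str  # "+" or "-"
    gene_id: int      # Id


# The five-gene ancestral genome from the table; the list index is the Position.
genome: List[Gene] = [
    Gene(1, "+", 1),
    Gene(2, "-", 1),
    Gene(3, "+", 1),
    Gene(4, "+", 1),
    Gene(5, "-", 1),
]


def neighbors(genome: List[Gene], position: int) -> Tuple[int, int]:
    """Return the two positions adjacent to `position` on a circular genome."""
    n = len(genome)
    return (position - 1) % n, (position + 1) % n


print(neighbors(genome, 4))  # (3, 0)
```

The modulo arithmetic is what makes the genome circular: the last position wraps around to position 0.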

Genomes evolve by undergoing a series of events:

  • D: Duplications. A segment of the genome is duplicated. The new copy is inserted next to the old one
  • L: Losses. A segment of the genome is lost
  • T: Transfers. A segment of the genome is transferred to a contemporary species. The segment is inserted at a random position. Transfers can be replacement transfers
  • C: Translocations. A segment of the genome changes its position within the genome
  • I: Inversions. A segment of the genome is inverted
  • O: Originations. A new gene family appears and is inserted at a random position
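The within-genome segment events (D, L, C, I) can be illustrated on a plain list of (gene family, orientation) pairs. This is only a sketch under several assumptions: the function names are hypothetical, segments are not allowed to wrap around the end of the circular genome, an inversion is assumed to reverse the order of the genes and flip their orientation, and gene identifiers are left out.

```python
from typing import List, Tuple

GenePair = Tuple[int, str]  # (gene family, orientation)

genome: List[GenePair] = [(1, "+"), (2, "-"), (3, "+"), (4, "+"), (5, "-")]


def duplicate(genome: List[GenePair], start: int, length: int) -> List[GenePair]:
    """D: copy a segment and insert the copy right after the original."""
    end = start + length
    return genome[:end] + genome[start:end] + genome[end:]


def lose(genome: List[GenePair], start: int, length: int) -> List[GenePair]:
    """L: remove a segment from the genome."""
    return genome[:start] + genome[start + length:]


def translocate(genome: List[GenePair], start: int, length: int,
                target: int) -> List[GenePair]:
    """C: cut a segment and reinsert it at index `target` of the remaining genes."""
    segment = genome[start:start + length]
    rest = genome[:start] + genome[start + length:]
    return rest[:target] + segment + rest[target:]


def invert(genome: List[GenePair], start: int, length: int) -> List[GenePair]:
    """I: reverse a segment and flip orientations (assumed semantics)."""
    flip = {"+": "-", "-": "+"}
    end = start + length
    segment = [(family, flip[ori]) for family, ori in reversed(genome[start:end])]
    return genome[:start] + segment + genome[end:]


print(duplicate(genome, 1, 2))
print(lose(genome, 1, 2))
print(translocate(genome, 0, 2, 3))
print(invert(genome, 1, 2))
```

Each function returns a new list instead of modifying its input, which keeps the ancestral genome available for comparison.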

The rates are genome-wide. For instance, a duplication rate of 3 means 3 duplication events per genome per unit of time.
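To give a feel for what a genome-wide rate implies, the sketch below draws event times along a branch. It assumes that events arrive as a Poisson process (exponentially distributed waiting times); this page does not state how simuLyon samples event times, so treat this as an illustration of the rate's meaning, not as the simulator's algorithm. The function name `sample_event_times` is hypothetical.

```python
import random
from typing import List


def sample_event_times(rate: float, branch_length: float,
                       rng: random.Random) -> List[float]:
    """Draw event times on [0, branch_length) for a genome-wide rate.

    Assumption: waiting times between events are exponential with mean 1/rate.
    """
    if rate <= 0:
        return []
    times = []
    t = rng.expovariate(rate)
    while t < branch_length:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(42)
counts = [len(sample_event_times(3.0, 1.0, rng)) for _ in range(5000)]
# Under this assumption the average count approaches rate * branch_length = 3.
print(sum(counts) / len(counts))
```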

Each event also has an additional rate, called the extension_rate. This number (between 0 and 1) is the p parameter of a geometric distribution that controls the length of the affected segment.

For example, if DUPLICATION_EXTENSION == 1, the duplicated segment always has length 1 (meaning that only one gene is duplicated at a time).

By changing this parameter we can fine-tune the extension associated with the different events. For example, if inversions should normally affect large chunks of the genome, it suffices to use a low p.
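The effect of p can be checked with a small sketch that samples segment lengths from a geometric distribution on 1, 2, 3, ..., whose mean is 1/p. The function name `sample_extension` is hypothetical, and whether simuLyon caps the length at the genome size is not stated here, so no cap is applied.

```python
import random


def sample_extension(p: float, rng: random.Random) -> int:
    """Draw a segment length from a geometric distribution with parameter p.

    The length is the number of Bernoulli(p) trials up to and including the
    first success, so p == 1 always gives a length of 1.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    length = 1
    while rng.random() >= p:  # failure: extend the segment by one gene
        length += 1
    return length


rng = random.Random(0)
for p in (1.0, 0.5, 0.1):
    lengths = [sample_extension(p, rng) for _ in range(20000)]
    # The sample mean should be close to 1 / p.
    print(p, sum(lengths) / len(lengths))
```

A low p, such as 0.1, gives segments of about 10 genes on average, while p close to 1 restricts events to single genes.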

Originations of new gene families are always of size 1, meaning that it is not possible for two gene families to originate in the same time step.

Once the full evolution of the genomes has been simulated, simuLyon also prints the gene trees associated with the different gene families, all the events taking place in each gene family, the events taking place in each branch, and the genomes of each node in the species tree.

There are two other events that do not depend intrinsically on genomes but on the species tree that is used to simulate genome evolution:

  • S: Speciation. When a genome arrives at a speciation node, it is inherited by both descendant branches and continues to evolve in each of them
  • E: Extinction. When a genome arrives at an extinction event, it stops evolving

Some advanced details regarding gene identifiers (you might want to skip this part if you are reading this for the first time):

Events that introduce nodes in the topology of the gene tree change the identifier of the gene. For example, let us say that at the root we have a gene whose identifier is 1. If the genome containing that gene undergoes a speciation, the two branches inherit one gene each: one with identifier 2 and the other with identifier 3. A duplication also changes the identifiers of the duplicated genes. When a gene is transferred, both the copy remaining in the donor genome and the copy in the recipient genome receive new identifiers. This makes it easy to track the events that have given rise to different tree topologies. Inversions and translocations do not introduce changes in the tree topology, and for that reason they do not change the identifiers of the affected genes.
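The relabelling rule can be sketched with a per-family counter that hands out a fresh identifier whenever an event adds a node to the gene tree. This assumes identifiers are consecutive integers within a gene family, which matches the 1 → 2, 3 example but is otherwise a guess; the names `FamilyIds` and `split_gene` are hypothetical.

```python
import itertools
from typing import Tuple


class FamilyIds:
    """Hands out fresh gene identifiers within one gene family (assumed scheme)."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def new_id(self) -> int:
        return next(self._counter)


def split_gene(ids: FamilyIds) -> Tuple[int, int]:
    """Speciation, duplication, or transfer: one gene is replaced by two
    genes, each with a new identifier."""
    return ids.new_id(), ids.new_id()


ids = FamilyIds()
root_gene = ids.new_id()       # the gene at the root gets identifier 1
left, right = split_gene(ids)  # a speciation yields identifiers 2 and 3
print(root_gene, left, right)  # 1 2 3

# Inversions and translocations would simply keep the existing identifier.
```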

