Generating Genomes
To generate a genome, you first need to simulate a Species Tree using Zombi. If you want to use a user-given tree, first run the Ti mode.
To simulate genomes, Zombi starts with an ancestral genome at the origin of time, with a given number of genes. In the current version, every gene family present in this genome has a single copy.
A genome is an ordered collection of genes. So if we begin with a genome that has 4 genes, what we see is something like:
| Position | Gene_family | Orientation | Id |
|---|---|---|---|
| 0 | 1 | + | 1 |
| 1 | 2 | - | 1 |
| 2 | 3 | + | 1 |
| 3 | 4 | + | 1 |
The meaning of this is:
- Position: The position in the genome. The genome is circular, so in the example above position 3 is adjacent to positions 2 and 0
- Gene_family: The identifier of the gene family
- Orientation: The orientation of that gene in the genome
- Id: The identifier of the gene.
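The table above can be mirrored with a small data structure. The following is only a minimal sketch in Python: the names `Gene` and `neighbours` are hypothetical and are not taken from Zombi's code. It shows how a circular genome can be handled with modular arithmetic on the positions.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_family: int
    orientation: str  # "+" or "-"
    gene_id: int


# The 4-gene genome from the table; the list index plays the role of Position.
genome = [
    Gene(1, "+", 1),
    Gene(2, "-", 1),
    Gene(3, "+", 1),
    Gene(4, "+", 1),
]


def neighbours(genome, position):
    """Return the two positions adjacent to `position` in a circular genome."""
    n = len(genome)
    return ((position - 1) % n, (position + 1) % n)


print(neighbours(genome, 3))  # (2, 0)
```

The last position wraps around to position 0, which is what makes the genome circular.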
Genomes evolve undergoing a series of events:
- D: Duplications. A segment of the genome is duplicated. The new copy is inserted next to the old one
- L: Losses. A segment of the genome is lost
- T: Transfers. A segment of the genome is transferred to a contemporary species. The segment is inserted in a random position. Transfers can be replacement transfers
- C: Translocations. A segment of the genome changes its position within the genome
- I: Inversions. A segment of the genome inverts its position
- O: Originations. A new gene family appears and it is inserted in a random position
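To illustrate how one of these events acts on the ordered list of genes, here is a minimal Python sketch of an inversion. It assumes that an inversion reverses the order of the genes in the segment and flips their orientation, which is the usual biological meaning but is not spelled out above. The function name `invert_segment` and the tuple layout are hypothetical and do not come from Zombi's code.

```python
def invert_segment(genome, start, length):
    """Return a copy of `genome` with `length` genes from `start` inverted.

    Genes are (gene_family, orientation, gene_id) tuples. Positions wrap
    around the end of the list because the genome is circular.
    """
    n = len(genome)
    positions = [(start + i) % n for i in range(length)]
    flip = {"+": "-", "-": "+"}
    # Read the segment backwards and flip the orientation of each gene.
    inverted = [
        (family, flip[orientation], gene_id)
        for family, orientation, gene_id in (genome[p] for p in reversed(positions))
    ]
    result = list(genome)
    for p, gene in zip(positions, inverted):
        result[p] = gene
    return result


genome = [(1, "+", 1), (2, "-", 1), (3, "+", 1), (4, "+", 1)]
print(invert_segment(genome, start=1, length=2))
# [(1, '+', 1), (3, '-', 1), (2, '+', 1), (4, '+', 1)]
```

The gene identifiers are left untouched, consistent with the fact that inversions do not change the identifier of the affected genes.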
The rates are genome-wise. For instance, a duplication rate of 3 means 3 duplication events per genome per unit of time.
There is also an additional rate for each event, called the extension_rate. This number (between 0 and 1) is the p parameter of a geometric distribution that controls the length of the affected segment.
For example, if DUPLICATION_EXTENSION == 1, the duplicated segment will always have length 1 (meaning that only one gene is duplicated at a time).
By changing this parameter we can fine-tune the extension associated with the different events. If we want inversions to normally affect large chunks of the genome, it suffices to use a low p.
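The effect of p can be seen with a short simulation. This is a minimal sketch, not Zombi's implementation: `sample_extension` is a hypothetical helper, and it assumes the geometric distribution starts at 1 (consistent with p = 1 always giving a segment of one gene). How Zombi handles segments longer than the genome is not covered here.

```python
import random


def sample_extension(p, rng=random):
    """Draw a segment length (>= 1) from a geometric distribution with parameter p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    length = 1
    # Keep extending the segment with probability 1 - p.
    while rng.random() >= p:
        length += 1
    return length


rng = random.Random(0)
print(sample_extension(1.0, rng))  # always 1

lengths = [sample_extension(0.1, rng) for _ in range(10_000)]
print(sum(lengths) / len(lengths))  # close to the expected value 1 / p = 10
```

The expected segment length is 1 / p, so a low p gives long segments and a p close to 1 gives mostly single-gene events.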
Origination of new gene families is always of size 1, meaning that it is not possible to have an origination of two gene families in the same time step.
Once the full evolution of genomes has been simulated, Zombi also prints the gene trees associated with the different gene families, all the events taking place in each gene family, the events taking place in each branch and the genomes of each node in the species tree.
There are two other events that do not depend intrinsically on the genomes but on the species tree that is used to simulate genome evolution:
- S: Speciation. When a genome arrives at a speciation node, the genome is divided and continues to evolve in both descendant branches
- E: Extinction. When a genome arrives at an extinction event, the genome stops its evolution
An example of how genomes evolve can be seen in the next figure:

In this figure, we can see the Original genome (Ori), the Root genome (R), the ancestral genomes (one for each inner node of the Species Tree) and the genomes in the surviving leaves. Different events modify the genome composition. The genes affected are represented next to the letter indicating the event. For example, there is a loss event in the branch leading to n4 affecting the blue gene. The transfer event takes place between the branch n8 (that goes extinct) and the branch n6 and affects the green gene. The inversion event affects a region of the genome (genes green, blue and purple).
Some advanced details regarding the gene identifiers (you might want to skip this part if you are reading this for the first time):
Events that introduce nodes in the topology of the gene tree change the identifier of the gene. For example, let us say that in the root we have a gene whose identifier is 1. If the genome where the gene is located undergoes a speciation, the two branches will inherit one a gene whose identifier is 2 and the other a gene whose identifier is 3. A duplication will also change the identifiers of the duplicated genes. When a gene has been transferred, the identifier changes both for the copy remaining in the donor genome and for the copy in the recipient genome. This way it is easy to track the events that have given rise to different tree topologies. Inversions and translocations do not introduce changes in the tree topology and for that reason they do not change the identifier of the affected genes.
The Gu mode gives the user fine control over the genome rates. Three additional files are needed:
- Transfer_rates.tsv
- Event_rates.tsv
- Extension_rates.tsv
The files can be generated with the help of the script RateCustomizer. To launch it, just type:
python RateCustomizer.py G ./Parameters/GenomeParameters.tsv ExperimentFolder
The three files will be created in the folder ExperimentFolder/G/Rates. You can modify those files to select the rates per branch for each kind of event:
Transfer_rates.tsv
The first column corresponds to the donor, the second column corresponds to the recipient, and the third column corresponds to the weight of the connection between the two. When a transfer event takes place, the donor searches for all the possible recipients and transfers the gene to one of them with a probability proportional to the weight of that candidate. This is an easy way to model transfers that are forbidden between some pairs (it suffices to give a weight of 0 to the pair) or highways of transfers.
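The weighted choice described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Zombi's code: the rows, the node names and the helper `choose_recipient` are hypothetical, and the set of contemporary branches is passed in by hand.

```python
import random

# Hypothetical rows of Transfer_rates.tsv: (donor, recipient, weight).
rows = [
    ("n1", "n2", 1.0),
    ("n1", "n3", 3.0),
    ("n1", "n4", 0.0),  # a weight of 0 forbids transfers from n1 to n4
]


def choose_recipient(rows, donor, alive, rng=random):
    """Pick a recipient for `donor` among the `alive` branches.

    The probability of each candidate is proportional to its weight.
    Returns None when no candidate has a positive weight.
    """
    candidates = [
        (recipient, weight)
        for d, recipient, weight in rows
        if d == donor and recipient in alive and weight > 0
    ]
    if not candidates:
        return None
    recipients, weights = zip(*candidates)
    return rng.choices(recipients, weights=weights, k=1)[0]


rng = random.Random(1)
picks = [choose_recipient(rows, "n1", {"n2", "n3", "n4"}, rng) for _ in range(1000)]
print(picks.count("n2"), picks.count("n3"), picks.count("n4"))
```

With these weights, n3 should be chosen roughly three times as often as n2, and n4 is never chosen.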
Event_rates.tsv
Each line corresponds to a node of the species tree. The different columns correspond to the branch-wise rates for the different events.
Extension_rates.tsv
Similar to the previous file, but with the values for the extension parameters.
To launch this mode, simply write:
python Zombi Gu GenomeParametersFile.tsv ExperimentFolder
The rates will be read from the files just generated. Watch out: if you try to run the Gu mode before having generated those files, an error message will appear.
Genomes: A folder with one file per node of the species tree. Each file contains information about the genome composition.
Gene_families: A folder with one file per gene family. Each file contains information about the events taking place in that gene family. There are 3 fields.
- 1. Time: The time at which the event takes place
- 2. Event: The type of event that takes place in a given time (S, E, D, T, L, I, C, O and F. F stands for Final, meaning that the gene arrived alive till the end of the run)
- 3. Nodes: Some more information about the kind of event:
- S, D and T: 6 fields separated by semicolons. This can be better understood by looking at the picture:

- L, I, C, O and F: 2 fields separated by semicolons. First, the species tree branch where the event takes place and second, the identifier of the gene affected
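The two-field case can be parsed with a few lines of Python. This is a minimal sketch that assumes the three fields (Time, Event, Nodes) are tab-separated; the example line and the function name `parse_simple_event` are hypothetical. The six-field layout of S, D and T events is not handled here.

```python
SIMPLE_EVENTS = {"L", "I", "C", "O", "F"}


def parse_simple_event(line):
    """Parse one line of a gene family file for an L, I, C, O or F event.

    Returns (time, event, branch, gene_id).
    """
    time, event, nodes = line.rstrip("\n").split("\t")
    if event not in SIMPLE_EVENTS:
        raise ValueError(f"event {event!r} does not use the 2-field layout")
    # First the species tree branch, then the identifier of the affected gene.
    branch, gene_id = nodes.split(";")
    return float(time), event, branch, gene_id


print(parse_simple_event("0.53\tL\tn4;3"))  # (0.53, 'L', 'n4', '3')
```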
GeneTrees: A folder containing the gene trees corresponding to the evolution of the different families and the gene trees pruned so that only surviving genes are represented.
There are two types of trees:
- _completetree.nwk: A tree showing the complete evolution of that gene family
- _prunedtree.nwk: A tree in which the genes that have not survived till the present time have been removed. Normally you want to use this tree!
It is also possible to output the reconciled trees in the format RecPhyloXML
EventsPerBranch: (Not output by default) A folder with one file per branch of the species tree. Each file contains information about the events taking place in that branch. The codes are similar to those explained previously, but not identical. There are two main differences (for the sake of clarity). The first one is that transfers are divided into:
- LT: Leaving Transfers. Transfers that leave this branch
- AT: Arriving Transfers. Transfers that arrive at this branch.
The second difference is that the identifier of the genes affected is given by:
GeneFamily_GeneIdentifier
So for example, if we go to the file n2_branchevents and we find the event L affecting 4_3, it means that the gene whose identifier is 3, belonging to family 4, was lost in that branch at the time given by the first column.
Please also notice that when an event affects several genes, it appears as several lines sharing the same time in the first column.
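As a sketch of how these labels can be handled, the following Python snippet splits GeneFamily_GeneIdentifier labels and groups lines that share the same time and event. The rows are hypothetical, the helper names are not part of Zombi, and the column order (time, event, label) is an assumption based on the description above.

```python
from collections import defaultdict


def parse_gene_label(label):
    """Split 'GeneFamily_GeneIdentifier' into (family, gene_id)."""
    family, gene_id = label.split("_")
    return family, gene_id


def group_by_time(rows):
    """Group (time, event, label) rows that share the same time and event."""
    grouped = defaultdict(list)
    for time, event, label in rows:
        grouped[(time, event)].append(parse_gene_label(label))
    return dict(grouped)


# Hypothetical rows from n2_branchevents: the first loss spans two genes.
rows = [("0.42", "L", "4_3"), ("0.42", "L", "5_1"), ("0.80", "L", "2_1")]

print(parse_gene_label("4_3"))  # ('4', '3')
print(group_by_time(rows))
```

Here the two lines at time 0.42 are reported together, as a single loss event affecting gene 3 of family 4 and gene 1 of family 5.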
Profiles: (Not output by default) This folder contains a file called Profiles.tsv, with the nodes of the species tree in the columns and the gene families in the rows. Each entry gives the number of copies that the gene family has in that node of the species tree.
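A table with this layout can be loaded with the standard csv module. The snippet below is a minimal sketch: the file content, the header label GENE_FAMILY and the function `read_profiles` are hypothetical, and it assumes a header row whose first cell labels the gene family column.

```python
import csv
import io

# Hypothetical content standing in for Profiles.tsv.
text = "GENE_FAMILY\tn1\tn2\tn3\n1\t1\t2\t0\n2\t1\t1\t1\n"


def read_profiles(handle):
    """Return {gene_family: {node: copy_number}} from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    nodes = header[1:]  # the first cell labels the gene family column
    return {row[0]: dict(zip(nodes, map(int, row[1:]))) for row in reader}


profiles = read_profiles(io.StringIO(text))
print(profiles["1"]["n2"])  # 2
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.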
DUPLICATION, TRANSFER, LOSS, INVERSION, TRANSLOCATION, ORIGINATION
The rate for each type of event.
DUPLICATION_EXTENSION, TRANSFER_EXTENSION, LOSS_EXTENSION, INVERSION_EXTENSION, TRANSLOCATION_EXTENSION
The value of the p parameter of the geometric distribution that determines the length of the genome segment (measured in number of genes) affected by an event
REPLACEMENT_TRANSFER
A number between 0 and 1 controlling the probability of replacement transfers (they only happen if there is a homologous position in the recipient genome)
STEM_FAMILIES
Number of gene families present in the ancestral genome (the Original genome)
MIN_GENOME_SIZE
The minimal size for a given genome. Smaller genomes will not be affected by loss events
EVENTS_PER_BRANCH
0 or 1, indicating whether or not to output the Events per branch
PROFILES
0 or 1, indicating whether or not to output the Profiles
GENE_TREES
0 or 1, indicating whether or not to output the Gene Trees
RECONCILED_TREES
0 or 1, indicating whether or not to output the reconciled trees in RecPhyloXML format