Skip to content

TomMakesThings/Wright-Fisher-Simulator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

46 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Haploid and Diploid Wright-Fisher Simulator

🟒🟒🟒 Code by TomMakesThings 🟒🟒🟒

April 2022


About

This repository contains a simulator, implemented in a Jupyter notebook, to test the effects of selection, recombination and linkage disequilibrium on haploid and diploid versions of the Wright-Fisher model of genetic drift (Figure A). For the haploid model, the simulator demonstrates that speed and rate of fixation of an allele in a population can be increased through positive selection. For the diploid model, the simulator demonstrates how LD can be broken down faster with higher recombination and negative selection.

Wright-Fisher Model Types

Figure (A) Diagram demonstrating allele genetic drift over time for the standard Wright-Fisher model (left) and the modified two loci model (right).

Haploid Wright-Fisher

The classic Wright-Fisher model simulates a haploid, asexual, panmictic population of size $N$ over $t$ generations for a single loci with two alleles $a$ and $A$.

  1. A population is initialised with an $a$ allele count of $n$.
  2. For each preceding generation, the number of alleles $n'$ is sampled independently with replacement from a binomial distribution. Selection can be introduced via the selection coefficient $s \neq 0$. As alleles $a$ and $A$ have relative fitness $w_{a} = 1 + s$ and $w_{A} = 1$, an $s > 0$ gives allele $a$ a fitness advantage, while $s < 0$ means the allele is deleterious and less likely to be passed to the next generation.
  3. Drift is modelled as a stochastic process in which the variant allele can be lost if its frequency in a generation reaches zero, fixed if its frequency reaches $N$ or fluctuating for any other $n$. If either extreme is encountered, the simulator will end prematurely as only one allele remains.

$n' \sim binomial(\frac{n(1 + s)}{n(1 + s) + N - n}, N)$

Diploid Wright-Fisher with Recombination

A variant of the Wright-Fisher model was created to model genetic drift between two linked loci $A$ and $B$ under both sexual and asexual reproduction. This simulator models a diploid population featuring two alleles in linkage disequilibrium (LD) with partial recombination.

  1. The number of individuals with the $A$ allele is initialised through randomly sampling a distribution $S(n)$, while the rest of the population is set to have the $a$ allele.
  2. Then a singleton variant allele is introduced at random at site $B$ which can occur on either a chromosome with the $A$ allele or the $a$ allele. This is equivalent to introducing a new mutation at $B$.
  3. For each generation, a new population is filled through selecting haplotypes from the previous generation with probability $1 - r$, and creating recombinant haplotypes with probability $r$. In the case of recombination, two alleles are selected at random with probability proportional to the allele frequencies of the previous generation. All generations are kept at a constant size $N$ and the simulator iterates until an allele at one loci become fixed.

Linkage Disequilibrium

Linkage disequilibrium (LD) is the non-random association of alleles at different loci. One method to quantify LD between two alleles $A$ and $B$ at different loci is the coefficient of linkage disequilibrium $D$. This is the difference between the frequency of a haplotype for a pair of alleles $A$ and $B$, $p_{AB}$, and the product of the allele frequencies $p_{A} p_{B}$.

$D_{AB} = p_{AB} - p_{A} p_{B}$

The range of potential values for $D_{AB}$ relies on allele frequencies $A$ and $B$, and so it is difficult to compare the level of LD between other pairs of alleles. Therefore it can be normalised to range between -1 to 1, denoted $D'$, by dividing by the maximum difference $D_{max}$ between the observed and expected haplotype frequencies.

$D' = \frac{D}{D_{max}}$

$\text{if D < 0, } D_{max} = max{p_{A}p_{B}, (1 - p_{A})(1 - p_{B})}$

$\text{else } D_{max} = min{p_{A}(1 - p_{A}), (1 - p_{A})p_{B}}$

Another normalised measure of LD is the genetic correlation $r^{2}$ between pairs of loci. This metric has ranges between 0 and 1.

$r^{2} = \frac{D^{2}}{p_{A}(1 - p_{A}) p_{B}(1 - p_{B})}$

Results

Haploid Wright-Fisher

For the single loci model, the effects of introducing positive/negative selection on genetic drift of an allele with initial frequency $n = 1$ were investigated by running the simulator for 1,000 iterations with selection coefficient $s = 0.02$, $s = 0.005$, $s = 0$, $s = -0.005$ and $s = -0.02$. The number of times the allele fixed or was lost was recorded in Table A, with results showing that higher than average occurrences of fixation occur with positive selection coefficients. This is to be expected as these correspond to situations in which the allele is beneficial and thus confers a positive fitness advantage. By contrast for negative selection coefficients, fixation rate is lower as the allele is deleterious. When comparing fixation and loss times, fixation on average occurs faster for a higher positive $s$, while loss is slightly quicker for a lower negative $s$.

Selection Coefficient (s) Selection Effect Number Fixed Number Lost Mean Fixation Time Mean Loss Time
0.02 Strongly beneficial 40 960 171.475 7.554
0.005 Weakly beneficial 29 970 198.931 8.737
0 Alleles have equal fitness 12 988 171.0 8.694
-0.005 Weakly deleterious 7 993 194.143 9.986
-0.02 Strongly deleterious 1 999 114.0 8.269
Table (A) Summary statistics of the classic Wright-Fisher model with and without selection for 1,000 simulations and population size N = 100. Each time the population is initialised with a singleton allele, n = 1.

Diploid Wright-Fisher with Recombination

Comparing LD at Initialisation

For the linked loci model, the initial values for three common linkage disequilibrium metrics $D$, $D'$ and $r^{2}$ were evaluated by repeatedly generating an initial population via random sampling 1,000 times. The haplotype frequencies for the first five runs are recorded in Table B, while the distribution of $r^{2}_{1}$ is plotted in Figure B. This demonstrates that $D$ and $r^{2}$ are highly dependent on the haplotype frequencies in the initial population as these measures are calculated based on the frequency of the background allele $A$. Only $D'$ is consistently initialised as 1 suggesting the population is in complete LD.

Run ID Singleton Haplotype AB Ab aB ab $D_{1}$ $D'_{1}$ $r^{2}_{1}$
1 Ab 4 1 95 0 0.0095 1 0.1919
2 ab 27 0 72 1 0.0027 1 0.0037
3 AB 1 1 0 98 0.0098 1 0.4949
4 aB 0 3 1 96 0.0003 1 0.0003
5 Ab 22 1 77 0 0.0077 1 0.0338

Table (B) Measuring linkage disequilibrium for the initial population across 5 runs.

Figure (B) Initial distribution of $r^{2}$ for 1000 two-loci Wright-Fisher simulations without selection.

Testing Recombination Rates and Selection

Recombination

For each generation in the diploid simulator, recombinant haplotypes are created between two loci with probability $r$. To test the effects of recombination rate on LD, an initial population was randomly generated giving haplotypes frequencies: ${AB: 1, Ab: 10, aB: 0, ab: 89}$. In this case, the singleton $B$ allele occurred on a chromosome with the $A$ allele, and so the haplotype $aB$ initially has frequency zero. These initial haplotype frequencies were consistently used to seed the experiments so that results of changing parameters were comparable and averages could be calculated.

The simulator was run 1,000 times for $r = 0.05$, $r = 0.01$ and $r = 0.02$ and the LD measures over time plotted in Figure C. Here the results for each simulation per time point, as well as the average over time are depicted. The average across all simulations demonstrate that for higher $r$, fewer generations are required to break down LD. In real populations, higher recombination rate is expected when two loci are further apart. Therefore this simulation implies that a greater distance between loci would speed up LD decay.

Figure (C) Three measures of linkage disequilibrium for different recombination rates. On the left $D$, $D'$ and $r^{2}$ measured per generation are plotted for each simulation, while on the right each LD measured was averaged over all 1,000 simulations.

Selection

The distribution of average allele frequency after a fixation event occurred for different $s$ is plotted in Figure D. As expected with positive selection towards allele $A$, both its frequency and the frequency of the linked allele $B$ in the final generation is higher than chance. $A$ therefore acts as a driver mutation. Even though the linked allele has no effect on a chromosome's fitness, with a low recombination rate it often hitchhikes when $A$ increases in frequency. This has the effect of increasing LD.

When negative selection is stronger, the $B$ allele frequency is higher than for a weaker negative selection. This can be explained by the occurrence of recombination events forming the haplotype $ab$. As selection is very strong against $AB$ and $Ab$, their frequencies are reduced which has the effect of increasing haplotypes $aB$ and $ab$. The beneficial effect of recombination therefore increases population diversity which explains the reduction of LD with negative selection on allele $A$.

Figure (D) Bar chart of allele frequency in final generation averaged across 1,000 simulation for different selection coefficients.

Releases

No releases published

Packages

No packages published