-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Pi and Tajima's D calculations as Eidos Functions #28
Closed
Closed
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
052a9b1
Add Eidos functions for Pi and Tajima's D
npb596 2f55662
Adding author credits
npb596 37390f4
Add documentation and fix minor bugs
npb596 93a46f6
Add intervals within pi and wt calcs in Tajima's D
npb596 6e9df15
Added comment to Tajima's D
npb596 cd0187c
Use SLiM mutation counts instead of frequencies for simpler code
npb596 7a1a887
Correct minor error in comment
npb596 f6c4646
Remove and edit credits as needed
npb596 95b8311
Merge branch 'master' of https://github.com/npb596/SLiM-Extras
npb596 13465b6
Add a little more more description on pi equation
npb596 18b9510
Calculate using defined length, not always entire chromosome
npb596 1b8126c
Change names, remove combinationTwo function
npb596 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
function (float)calcPi(object<Genome> genomes, [No<Mutation> muts = NULL], [Ni$ start = NULL], [Ni$ end = NULL]) | ||
|
||
{ if (genomes.length() == 0) | ||
stop("ERROR (calcPi()): genomes must be non-empty."); | ||
if (community.allSpecies.length() > 1) | ||
{ | ||
species = unique(genomes.individual.subpopulation.species, preserveOrder=F); | ||
if (species.length() != 1) | ||
stop("ERROR (calcPi()): genomes must all belong to the same species."); | ||
if (!isNULL(muts)) | ||
if (!all(muts.mutationType.species == species)) | ||
stop("ERROR (calcPi()): muts must all belong to the same species as genomes."); | ||
} | ||
else | ||
{ | ||
species = community.allSpecies; | ||
} | ||
|
||
length = species.chromosome.lastPosition + 1; | ||
|
||
if (isNULL(muts)) | ||
muts = species.mutations; | ||
|
||
// handle windowing | ||
if (!isNULL(start) & !isNULL(end)) | ||
{ | ||
if (start > end) | ||
stop("ERROR (calcPi()): start must be less than or equal to end."); | ||
mpos = muts.position; | ||
muts = muts[(mpos >= start) & (mpos <= end)]; | ||
length = end - start + 1; | ||
} | ||
else if (!isNULL(start) | !isNULL(end)) | ||
{ | ||
stop("ERROR (calcPi()): start and end must both be NULL or both be non-NULL."); | ||
} | ||
|
||
// narrow down to the mutations that are actually present in the genomes and aren't fixed | ||
p = genomes.mutationFrequenciesInGenomes(muts); | ||
muts = muts[(p != 0.0) & (p != 1.0)]; | ||
|
||
// do the calculation | ||
// obtain counts of variant sequences for all segregating sites | ||
varCount = genomes.mutationCountsInGenomes(muts); | ||
// total count of sequences subtracted by count of variant sequences equals count of invariant sequences | ||
invarCount = genomes.size() - varCount; | ||
// count of pairwise differences per site is the product of counts of both alleles (equation 1 in Korunes and Samuk 2021), this is then summed for all sites | ||
diffs = sum(varCount * invarCount); | ||
// pi is the ratio of pairwise differences to number of possible combinations of the given sequences | ||
// the latter is calculated by a standard formula defined in combinationTwo function (not by default in SLiM) | ||
pi = sum(varCount * invarCount) / ((genomes.size() * (genomes.size() - 1)) / 2);; | ||
// pi is conventionally averaged per site and this is consistent with SLiM's calculation of Watterson's theta | ||
avg_pi = pi / length; | ||
return avg_pi; | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
Calculates pi (a metric of genetic diversity based on pairwise sequence differences) for a vector of genomes, based upon the mutations in the genomes. The mathematical formulation (as an estimator of the population parameter theta) is based on work in (Nei and Li 1979; Nei and Tajima 1981; Tajima 1983 (eq. A3)) and the exact formula used here is common in textbooks (e.g. equation 3.3 in Hahn 2018 or 2.2 in Coop 2020). This value is averaged by the number of sites. | ||
Often genomes will be all of the genomes in a subpopulation, or in the entire population, but any genome vector may be used. By default, with muts=NULL, the calculation is based upon all mutations in the simulation; the calculation can instead be based upon a subset of mutations, such as mutations of a specific mutation type, by passing the desired vector of mutations for muts. | ||
The calculation can be narrowed to apply to only a window – a subrange of the full chromosome – by passing the interval bounds [start, end] for the desired window. In this case, the vector of mutations used for the calculation will be subset to include only mutations within the specified window. The default behavior, with start and end of NULL, provides the genome-wide pi. | ||
The implementation of calcPiTheta(), viewable with functionSource(), treats every mutation as independent in the heterozygosity calculations. One could regard this choice as embodying an infinite-sites interpretation of the segregating mutations, as with calcHeterozygosity(). Indeed, finite-sites models of pi have been derived (Tajima 1996) though are not used here. In most biologically realistic models, such genetic states will be quite rare, and so the impact of this assumption will be negligible; however, in some models this distinction may be important. See calcPairHeterozygosity() for further discussion. This function was written by Nick Bailey (currently affiliated with the CNRS and Laboratory of Biometry and Evolutionary Biology at University Lyon 1), based on code in calcWattersonsTheta, and with helpful input from Peter Ralph. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
function (float)calcTajimaD(object<Genome> genomes, [No<Mutation> muts = NULL], [Ni$ start = NULL], [Ni$ end = NULL]) | ||
|
||
{ if (genomes.length() == 0) | ||
stop("ERROR (calcTajimaD()): genomes must be non-empty."); | ||
if (community.allSpecies.length() > 1) | ||
{ | ||
species = unique(genomes.individual.subpopulation.species, preserveOrder=F); | ||
if (species.length() != 1) | ||
stop("ERROR (calcTajimaD()): genomes must all belong to the same species."); | ||
if (!isNULL(muts)) | ||
if (!all(muts.mutationType.species == species)) | ||
stop("ERROR (calcTajimaD()): muts must all belong to the same species as genomes."); | ||
} | ||
else | ||
{ | ||
species = community.allSpecies; | ||
} | ||
|
||
length = species.chromosome.lastPosition + 1; | ||
|
||
if (isNULL(muts)) | ||
muts = species.mutations; | ||
|
||
// handle windowing | ||
if (!isNULL(start) & !isNULL(end)) | ||
{ | ||
if (start > end) | ||
stop("ERROR (calcTajimaD()): start must be less than or equal to end."); | ||
mpos = muts.position; | ||
muts = muts[(mpos >= start) & (mpos <= end)]; | ||
length = end - start + 1; | ||
} | ||
else if (!isNULL(start) | !isNULL(end)) | ||
{ | ||
stop("ERROR (calcTajimaD()): start and end must both be NULL or both be non-NULL."); | ||
} | ||
|
||
// narrow down to the mutations that are actually present in the genomes and aren't fixed | ||
p = genomes.mutationFrequenciesInGenomes(muts); | ||
muts = muts[(p != 0.0) & (p != 1.0)]; | ||
|
||
// do the calculation | ||
// Pi and Watterson's theta functions divide by sequence length so this must be undone in Tajima's D | ||
// Sequence length is constant (i.e. no missing data or indels) so this can be applied equally over both metrics | ||
diff = (calcPi(genomes, muts, start, end) - calcWattersonsTheta(genomes, muts, start, end)) * length; | ||
// calculate standard deviation of covariance of pi and Watterson's theta | ||
// note that first 3 variables defined below are sufficient for Watterson's theta calculation as well, though the function is used above for proper interval handling and clarity | ||
k = size(muts); | ||
n = genomes.size(); | ||
a_1 = sum(1 / 1:(n - 1)); | ||
a_2 = sum(1 / (1:(n - 1)) ^ 2); | ||
b_1 = (n + 1) / (3 * (n - 1)); | ||
b_2 = 2 * (n ^ 2 + n + 3) / (9 * n * (n - 1)); | ||
c_1 = b_1 - 1 / a_1; | ||
c_2 = b_2 - (n + 2) / (a_1 * n) + a_2 / a_1 ^ 2; | ||
e_1 = c_1 / a_1; | ||
e_2 = c_2 / (a_1 ^ 2 + a_2); | ||
covar = e_1 * k + e_2 * k * (k - 1); | ||
stdev = sqrt(covar); | ||
tajima_d = diff / stdev; | ||
return tajima_d; | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
Calculates Tajima's D (a test of neutrality based on the allele frequency spectrum) for a vector of genomes, based upon the mutations in the genomes. The mathematical formulation is given in Tajima 1989 (equation 38) and remains unchanged (e.g. equations 2.30 in Durrett 2008, 8.4 in Hahn 2018, and 4.44 in Coop 2020). Often genomes will be all of the genomes in a subpopulation, or in the entire population, but any genome vector may be used. By default, with muts=NULL, the calculation is based upon all mutations in the simulation; the calculation can instead be based upon a subset of mutations, such as mutations of a specific mutation type, by passing the desired vector of mutations for muts. | ||
The calculation can be narrowed to apply to only a window – a subrange of the full chromosome – by passing the interval bounds [start, end] for the desired window. In this case, the vector of mutations used for the calculation will be subset to include only mutations within the specified window. The default behavior, with start and end of NULL, provides the genome-wide Tajima's D. | ||
The implementation of calcTajimaD(), viewable with functionSource(), treats every mutation as independent in the heterozygosity calculations. One could regard this choice as embodying an infinite-sites interpretation of the segregating mutations, as with calcHeterozygosity(). Indeed, Tajima's D can be modified with finite-sites models of pi and theta (Misawa and Tajima 1997) though these are not used here. In most biologically realistic models, such genetic states will be quite rare, and so the impact of this assumption will be negligible; however, in some models this distinction may be important. See calcPairHeterozygosity() for further discussion. This function was written by Nick Bailey (currently affiliated with the CNRS and Laboratory of Biometry and Evolutionary Biology at University Lyon 1), based on code in calcWattersonsTheta, and with helpful input from Peter Ralph. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it the case that
calcWattersonsTheta( )
is giving usk/a_1
? It seems more transparent to either (a) calculate watterson's theta directly here like that, or (b) usecalcWattersonsTheta( )
to get the value ofk
here. (This is not obvious because of various possible definitions like "number of mutations" or "number of sites with a mutation", etc.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k here should represent number of sites with a mutation. Not sure what to do here because I see how it's simpler to just calculate k/a_1 and divide by sites (as is done in calcWattersonsTheta) since that information is present already but as in the above comment I need to change this so intervals work properly over calcPiTheta and calcWattersonTheta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see - perhaps a comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I kept the above interval handling, I have now just left a comment that the variables given also define Watterson's theta.