Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Pi and Tajima's D calculations as Eidos Functions #28

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

npb596
Copy link

@npb596 npb596 commented Jun 13, 2024

Hi Ben!

I've used the calcWattersonTheta function in my work and figured it was worth putting in Pi and Tajima's D as well. Feel free to check my calculations and code quality, I've bitten as much as possible off of the calcWattersonTheta function. Aside from rechecking the code itself I have run simple neutral simulations and they seem to fall in line with expectations. For example, at the beginning Watterson's theta is much higher than Pi, presumably due to high numbers of new low-frequency mutations, and Tajima's D is consequently very negative. Then sometime around 10N - 20N ticks burn-in Watterson's theta and pi seem about even and Tajima's D fluctuates around 0. The simple combinationTwo equation is necessary for pi's calculation.

Cheers,
Nick

@bhaller
Copy link
Contributor

bhaller commented Jun 13, 2024

Wow, great! I'll tag @philippmesser and @petrelharp here; guys, it'd be great if you could have a look at what Nick has done here, since you probably understand it, and I don't. :-> Once I've got a thumbs-up from one or both of you, I'll process this PR.

I'll probably add these to SLiM itself, rather than using the PR here, since this seems generally useful and polished. :-> Thanks a bunch!

@bhaller bhaller requested review from philippmesser and removed request for philippmesser June 13, 2024 14:13
@npb596
Copy link
Author

npb596 commented Jun 13, 2024

Thanks to you! I wasn't sure how best to submit new functions other than here, that would be awesome to see them in SLiM itself! I will eagerly await Philipp and/or Peter's opinions.

@bhaller
Copy link
Contributor

bhaller commented Jun 13, 2024

Thanks to you! I wasn't sure how best to submit new functions other than here, that would be awesome to see them in SLiM itself! I will eagerly await Philipp and/or Peter's opinions.

The way you did it is great; merging it into SLiM requires extra glue, which I will provide. :->

varCount = genomes.mutationFrequenciesInGenomes(species.mutations) * size(genomes);
// total count of sequences subtracted by count of variant sequences equals count of invariant sequences
invarCount = size(genomes) - varCount;
// count of pairwise differences per site (assuming a biallelic site) is the product of counts of both alleles, this is then summed for all sites
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this phrasing in the comment is confusing, since sites are not necessarily biallelic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed though not sure my new wording is any better. I mention this a bit now in my bigger comment and within the docstrings. I also changed the comment within the script now so it states where I got the idea of just multiplying ref and var alleles, since I don't see that notation often.

k = size(muts);
n = genomes.size();
a_1 = sum(1 / 1:(n - 1));
a_2 = sum(1 / 1:(n - 1 ^ 2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops -

Suggested change
a_2 = sum(1 / 1:(n - 1 ^ 2));
a_2 = sum(1 / ((1:(n - 1)) ^ 2));

... right? (please check this!)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue here, thanks! But I now believe this should be the correct one and I should have it in my fork now:

a_2 = sum(1 / (1:(n - 1)) ^ 2);

I'm using Durrett 2008 section 2.4.1 for the exact formula. 1 / (n - 1) is a vector, all values in this vector are squared before the division from 1. I decided to test it more closely this time around with a population of only 10 individuals (20 genomes). The output should be equivalent to this python code:

1/192 + 1/182 + 1/172 + 1/162 + 1/152 + 1/142 + 1/132 + 1/122 + 1/112 +1/102 + 1/92 + +1.82 + 1/72 + 1/62 + 1/52 + 1/42 + 1/32 + 1/22 + 1/1**2

which equals 1.5780382439130234 and this is what I get with the above a_2. Both my original and your suggestion deviated from this. I guess that's the trickiest part to code with the vector stuff but I ran the calcTajimaD outputting every variable with this same small population to recheck and I'm more confident it's all working now. I guess I should upload a small script actually using these functions.

// do the calculation
// Pi and Watterson's theta functions divide by sequence length so this must be undone in Tajima's D
// Sequence length is constant (i.e. no missing data or indels) so this can be applied equally over both metrics
diff = (calcPiTheta(genomes) - calcWattersonsTheta(genomes)) * (species.chromosome.lastPosition + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't calcPiTheta( ) have start, end, and muts arguments passed in as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense but I don't know how best to do this, as those should be optional. I think I need to set up a condition if the user does enter that info? As it is now, it works fine genome-wide but entering starts and ends to Tajima's D does not change the pi and WT calculations so I do need to fix this somehow, or just disallow the extra options.

// Sequence length is constant (i.e. no missing data or indels) so this can be applied equally over both metrics
diff = (calcPiTheta(genomes) - calcWattersonsTheta(genomes)) * (species.chromosome.lastPosition + 1);
// calculate standard deviation of covariance of pi and Watterson's theta
k = size(muts);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it the case that calcWattersonsTheta( ) is giving us k/a_1? It seems more transparent to either (a) calculate watterson's theta directly here like that, or (b) use calcWattersonsTheta( ) to get the value of k here. (This is not obvious because of various possible definitions like "number of mutations" or "number of sites with a mutation", etc.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k here should represent number of sites with a mutation. Not sure what to do here because I see how it's simpler to just calculate k/a_1 and divide by sites (as is done in calcWattersonsTheta) since that information is present already but as in the above comment I need to change this so intervals work properly over calcPiTheta and calcWattersonTheta.

@petrelharp
Copy link
Contributor

I can take a careful look. But, to do that I need a docstring, so it's unambigious what's being calculated? I can tell what the code does, but it would be helpful if you wrote down what you'd like it to be calculating, so we can make sure that it's doing what you want.

For instance:

  • what is the endpoint behavior of start and end is (both are inclusive; hopefully this agrees with other methods?)
  • how is muts used (I think by ignoring all other mutations, i.e., calculating the value you'd get if those mutations simply did not exist/were silently not genotypable)
  • the returned value of pi is "per site" (not "per segregating site")

Perhaps most confusingly: stacked mutations. The code is treating each distinct mutation as a distinctly genotypable locus; so suppose there is one position with two mutations that we'll call A and B. So, there's four possible genotypes: - (no mutations), A, B, and AB. One approach to calculating pi would be to treat these as four distinct alleles at the one locus, getting (except for a factor of 1/(N-1) to account for sampling-without-replacement):

p- * (1 - p-) + pA * (1 - pA) + pB * (1 - pB) + pAB * (1 - pAB)

Instead, the approach taken here is to treat these as two separate loci, getting

2 * pA * (1 - pA) + 2 * pB * (1 - pB)

I think this is a good way to go - I agree with the code as written. However, it needs explaining; for instance, a surprising consequence is that it is possible for pi to be greater than 1. (There's a thread in slim-discuss I think where someone ran into this and it was very confusing.)

I've looked fairly carefully comparing the implementation of Tajima's D here to the one in tskit. And I don't think they are the same - for instance, I think the first term in the covariance in the two implmentations differ by a factor of a_1 (in your notation). It's entirely possible I messed that one up. Could you have a careful look?

@bhaller
Copy link
Contributor

bhaller commented Jun 22, 2024

Thanks @petrelharp! So yes, would be great to have a docstring for these functions, @npb596. It should be written mirroring the style of calls like calcFST(), calcHeterozygosity(), etc., in the manual (section 25.19.2), and state precisely what it is that you intend these functions to do, with attention to the considerations Peter has pointed out. Of course please state something like "This function was written by Nick Bailey" at the end to credit yourself. Sound good? Thanks! :->

@petrelharp
Copy link
Contributor

(Note: I'd also be happy to help with the docstring; but we need to nail down the questions above.)

@npb596
Copy link
Author

npb596 commented Jun 24, 2024

Thanks for the look-through, @petrelharp!

For you main comment, I've updated my fork of this repo https://github.com/npb596/SLiM-Extras/tree/master/functions but have not pushed that here yet because I assume that will start up a new request and thread (I haven't pushed a lot of forks...). The above link has new docstrings as text files. Since I copied a lot of the code from calcWattersonsTheta I also copied a lot of the docstring from that function but I would appreciate a look there as well. I mainly added some references to papers for the calculations and otherwise I'm pretty sure it's still applicable word-for-word. When I first wrote these I really just wanted quick genome-wide whole-population calculations of standard pi and Tajima's D as a quick way to check when my simulations reached equilibrium. Otherwise, they aren't really crucial for any work I'm doing now. Once I wrote them I figured it may be nice to make them more "production" quality. I originally didn't have any bug checks or options to take gene intervals, so I took a lot of that from calcWattersonsTheta. I don't know how much of that is helpful as to intent. I guess it's also worth pointing out this would be an example of a command used (only p1.genomes is required):

print(calcPiTheta(p1.genomes, muts = sim.mutationsOfType(m1), start = 10000, end = 20000));

To answer the first of three points you mention. I actually didn't test start and end options until today so thank you for mentioning it. It works in calcPiTheta as in calcWattersonsTheta where you can define some interval in the simulated genome for the calculation, instead of the whole genome. There actually was a small bug in my calcPiTheta that didn't allow this to happen, a holdover from my initial tests. I've fixed that now (a change from species.mutations to muts within the pi calc) and it can be seen in the above link.

Regarding the second point, the above bug maybe caused some confusion on the use of muts because it wasn't being used in the calculation since I left species.mutations, which was largely equivalent, in there. It's used to specify a type of mutation, if desired.

Regarding the third point, yes the pi returned is per site (averaged). Another thing I decided on by matching it to calcWattersonsTheta.

I did not think about stacking at all, so that's helpful to know. I did consider the fact that the calculation assumes an infinite-sites model whereas a SLiM population can have multi-allelic sites, that was why I put the unclear comment on bi-allelic sites, so the calculation won't always be exact. I didn't want to bother with a multi-allelic implementation and this matches calcWattersonsTheta better anyways. The docstring there mentions this issue and I kept it in my docstrings, though also with citations for multi-allelic models to add to the point.

Lastly, I did look at your tskit link and I agree it's different. My code is similar to other implementations I've seen (https://github.com/cggh/scikit-allel/blob/master/allel/stats/diversity.py#L875) and I based it on a textbook https://services.math.duke.edu/~rtd/Gbook/Gbook.html (but still messed up the a2, thanks for checking yourself!). My notation also is based on those sources. Given that, I think the tskit one is flawed. It looks like tskit a and b are equivalent to c_1 and c_2 in my notation (it skips the steps of defining b_1 and b_2, which is fine). The problem is c_1 and c_2 are plugged into the final D equation where e_1 and e_2 should be. Per Durrett 2008 (pg. 66),

e_1 = c_1 / a_1

e_2 = c1 / (a_1 ^ 2 + a_2)

I also was curious what the necessity of making h and g arrays is. It's probably something having to do with how the function is used that I'm not aware of. It stood out just cause I'm used to the version as in scikit-allel linked above. Hope that helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants