Low level pair coalescence counts #2932

Draft
wants to merge 1 commit into
base: main
from

Conversation

nspope
Contributor

@nspope nspope commented Apr 18, 2024

Low level extension of #2915

codecov bot commented Apr 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.62%. Comparing base (e6483fc) to head (62069ec).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2932   +/-   ##
=======================================
  Coverage   89.62%   89.62%           
=======================================
  Files          29       29           
  Lines       30176    30176           
  Branches     5874     5874           
=======================================
  Hits        27044    27044           
  Misses       1793     1793           
  Partials     1339     1339           
Flag             Coverage Δ
c-tests          86.21% <ø> (ø)
lwt-tests        80.78% <ø> (ø)
python-c-tests   88.72% <ø> (ø)
python-tests     98.97% <ø> (ø)

@nspope
Contributor Author

nspope commented Apr 18, 2024

I'd like to generalize this algorithm slightly:

  • Currently, the output is either a num_windows by num_nodes array (which is very large), or a num_windows by num_time_windows array where the counts are summed within time windows.
  • Conceptually, the "nodes" output gives the empirical distribution of pair coalescence times in windows across the genome. That is, for each window we have a vector of RVs (node times) and a vector of weights (pair coalescence counts).
  • From this distributional viewpoint, there are many useful quantities that could be calculated: the empirical CDF, quantiles, moments, coalescence rates, etc. The "sum in time windows" option of the current implementation is one special case.
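
To make the distributional view concrete, here is a minimal sketch for a single window, assuming the node times and pair coalescence counts are already available as NumPy arrays with a positive total count. The function name `weighted_summaries` and its arguments are hypothetical and are not part of the tskit API.

```python
import numpy as np


def weighted_summaries(node_times, pair_counts, time_breaks, quantiles):
    """
    Summaries of one window's empirical distribution of pair coalescence
    times. Hypothetical helper: node_times are the values and pair_counts
    are the (non-negative) weights; the total weight is assumed positive.
    """
    node_times = np.asarray(node_times, dtype=float)
    pair_counts = np.asarray(pair_counts, dtype=float)

    # Sort nodes by time so that cumulative weights form the empirical CDF.
    order = np.argsort(node_times)
    t, w = node_times[order], pair_counts[order]
    total = w.sum()
    cdf = np.cumsum(w) / total

    # First moment of the weighted distribution.
    mean = np.sum(t * w) / total

    # Quantile: the smallest node time at which the ECDF reaches q.
    idx = np.searchsorted(cdf, quantiles, side="left")
    quants = t[np.minimum(idx, t.size - 1)]

    # "Sum in time windows": bins are half-open, except the last one,
    # which np.histogram closes on the right.
    binned, _ = np.histogram(t, bins=time_breaks, weights=w)

    return {"cdf": cdf, "mean": mean, "quantiles": quants, "binned": binned}


# Toy data: four nodes, most pairs coalescing in the node at time 2.
result = weighted_summaries(
    node_times=[0.0, 1.0, 2.0, 3.0],
    pair_counts=[0, 1, 2, 1],
    time_breaks=[0.0, 1.5, 4.0],
    quantiles=[0.5],
)
```

The binned counts are just one entry in the returned dictionary, which is the sense in which summing within time windows is a special case of summarising the weighted distribution.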

So, I think it'd be useful to take the current algorithm, and have it apply a summary function at the end of each window. This would let us calculate any useful summary statistic without having to create a potentially humongous array (windows by nodes by indexes) as an intermediate.
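
A rough sketch of that idea, under the assumption that the per-window count vectors can be produced one at a time (here by a toy generator standing in for the real algorithm). All names below are hypothetical and do not come from the PR; the point is only that each window's counts are reduced immediately, so the full windows-by-nodes array is never stored.

```python
import numpy as np


def summarise_by_window(window_counts, node_times, summary_func):
    """
    Apply summary_func(node_times, counts) at the end of each window and
    keep only the (small) result. window_counts is any iterable yielding
    one length-num_nodes vector of pair coalescence counts per window.
    """
    node_times = np.asarray(node_times, dtype=float)
    rows = []
    for counts in window_counts:
        counts = np.asarray(counts, dtype=float)
        rows.append(np.atleast_1d(summary_func(node_times, counts)))
    return np.vstack(rows)  # shape: (num_windows, summary_size)


def mean_time(node_times, counts):
    # Mean pair coalescence time; assumes the counts sum to a positive value.
    return np.sum(node_times * counts) / np.sum(counts)


def make_time_window_sum(time_breaks):
    # Builds a summary function that sums counts within time windows.
    def summary(node_times, counts):
        binned, _ = np.histogram(node_times, bins=time_breaks, weights=counts)
        return binned

    return summary


def toy_window_counts():
    # Stand-in for the real algorithm: one count vector per genomic window.
    yield [0, 1, 2, 1]
    yield [0, 4, 0, 0]


node_times = [0.0, 1.0, 2.0, 3.0]
means = summarise_by_window(toy_window_counts(), node_times, mean_time)
binned = summarise_by_window(
    toy_window_counts(), node_times, make_time_window_sum([0.0, 1.5, 4.0])
)
```

Passing the summary as a plain callable keeps the windowing loop unchanged whether the output is a mean, a set of quantiles, or binned counts.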

The API would stay the same. Later on, we could add named methods for various summary statistics, and eventually expose a "general summary stat" interface, as is done for the other statistics.

Member

@jeromekelleher jeromekelleher left a comment

Looks good to me. Do you want to add the summary function support now, before we start porting to C? Probably a good idea, if you want to do this in the short term.

python/tests/test_coalrate.py (review thread resolved)
2 participants