Multichrom example #2665

Open
wants to merge 5 commits into main

Conversation

jeromekelleher
Member

@jeromekelleher jeromekelleher commented Dec 15, 2022

Stacked on #2663 (and #2619)

@petrelharp @bhaller @molpopgen You might be interested in this: the new file c/examples/multichrom_wright_fisher.c gives a simple example of multi-chromosome simulations with parallel simplify (most of the other stuff is from other in-flight PRs)

This adds a simple example of running a multi-chromosome forwards simulation and running the sort/simplify steps in parallel using pthreads. This was really just to make sure that the basic idea of sharing the node table across the different table collections would work, and that we were getting the details right with the new TSK_SIMPLIFY_NO_FILTER_NODES and TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS.
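
For concreteness, here is a minimal sketch of what a per-chromosome worker might look like (this is illustrative, not the code in c/examples/multichrom_wright_fisher.c; worker_arg_t and simplify_all are made-up names, and it assumes each chromosome's edges live in their own tsk_table_collection_t while all collections carry the same node table):

#include <pthread.h>
#include <stdlib.h>
#include <tskit/tables.h>

typedef struct {
    tsk_table_collection_t *tables; /* this chromosome's tables */
    const tsk_id_t *samples;        /* shared sample list */
    tsk_size_t num_samples;
} worker_arg_t;

/* Sort and simplify one chromosome's tables. NO_FILTER_NODES and
 * NO_UPDATE_SAMPLE_FLAGS keep node IDs and sample flags stable, so the
 * logically shared node table stays consistent across chromosomes
 * without any cross-thread coordination. */
static void *
worker(void *arg)
{
    worker_arg_t *w = (worker_arg_t *) arg;

    if (tsk_table_collection_sort(w->tables, NULL, 0) != 0) {
        exit(EXIT_FAILURE);
    }
    if (tsk_table_collection_simplify(w->tables, w->samples, w->num_samples,
            TSK_SIMPLIFY_NO_FILTER_NODES | TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS,
            NULL) != 0) {
        exit(EXIT_FAILURE);
    }
    return NULL;
}

/* Launch one worker thread per chromosome and wait for them all. */
static void
simplify_all(worker_arg_t *args, tsk_size_t num_chroms)
{
    pthread_t *threads = calloc(num_chroms, sizeof(*threads));
    tsk_size_t j;

    for (j = 0; j < num_chroms; j++) {
        pthread_create(&threads[j], NULL, worker, &args[j]);
    }
    for (j = 0; j < num_chroms; j++) {
        pthread_join(threads[j], NULL);
    }
    free(threads);
}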

Running this through helgrind gives no errors (I think it's pretty good at picking up data races):

$ valgrind --tool=helgrind ./build/multichrom_wright_fisher 1000 10 1 tmp/sdf 1234

So, hooray, @molpopgen's ingenious idea works and with the infrastructure in #2663 and #2619 we can do it pretty easily!

Another thing I ended up doing here in 31c47e4 is adding a basic implementation of "delete rows", which is an operation I think we badly need in the API (mentioned here: #1034 (comment)). I think it's worth talking about that separately, so I'll open an issue for it.
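
(For context, the kind of operation meant by "delete rows" - shown here as a hypothetical helper that rebuilds the table from a boolean keep mask; this is not the implementation in 31c47e4, and metadata handling is omitted for brevity:)

#include <stdbool.h>
#include <tskit/tables.h>

/* Keep only the edge rows where keep[j] is true, by copying and rebuilding. */
static int
edge_table_delete_rows(tsk_edge_table_t *edges, const bool *keep)
{
    tsk_edge_table_t copy;
    tsk_id_t row;
    tsk_size_t j;
    int ret = tsk_edge_table_copy(edges, &copy, 0);

    if (ret != 0) {
        return ret;
    }
    tsk_edge_table_clear(edges);
    for (j = 0; j < copy.num_rows; j++) {
        if (keep[j]) {
            row = tsk_edge_table_add_row(edges, copy.left[j], copy.right[j],
                copy.parent[j], copy.child[j], NULL, 0);
            if (row < 0) {
                ret = (int) row;
                break;
            }
        }
    }
    tsk_edge_table_free(&copy);
    return ret;
}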

We can probably let this sit for a while now anyway, and should try to clear up some other stuff first.

@molpopgen
Member

Great! I'll take a closer look in a bit.

@bhaller

bhaller commented Dec 15, 2022

Super cool. Good to know about helgrind, too; I will look into that! :->

@hyanwong
Member

I guess this is relevant to #176, and might inform how we deal with multiple chromosomes.

@bhaller

bhaller commented Dec 16, 2022

> I guess this is relevant to #176, and might inform how we deal with multiple chromosomes.

As I understand it, we could choose to tie the two things together - making the separate tree sequences begin/end at chromosome boundaries - but we don't have to do that, and there are perhaps pretty strong reasons not to. For example: you want to be able to keep multiple tree sequences to speed up even single-chromosome models; chromosomes will be of very different lengths, whereas you want to divide the work of simplification as evenly as you can across your multiple tree sequences; and the number of chromosomes you are simulating will often not match the number of threads you have available for simplification. So I would tentatively lean towards not connecting these two things.

@hyanwong
Member

hyanwong commented Dec 16, 2022

> I would tentatively lean towards not connecting these two things.

I think the two aren't mutually exclusive. If you want to represent multiple chromosomes, then this gives a nice way to do it. But there's no reason why you couldn't also do it on a single chromosome, or indeed on part of a chromosome within a multi-chromosome set.

But I do take your point that there's an elegance to keeping the two things completely independent.

@jeromekelleher
Member Author

This is really about low-level C API stuff for now - how we might (or might not) surface this to higher-level things like representing multiple chromosomes is something we should wait and see about (which I think is what we're all saying!).

@codecov

codecov bot commented Jan 10, 2023

Codecov Report

Merging #2665 (e51c6c2) into main (ff06945) will decrease coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##             main    #2665      +/-   ##
==========================================
- Coverage   89.76%   89.76%   -0.01%     
==========================================
  Files          30       29       -1     
  Lines       29892    29886       -6     
  Branches     5803     5803              
==========================================
- Hits        26832    26826       -6     
  Misses       1753     1753              
  Partials     1307     1307              
Flag Coverage Δ
c-tests 86.15% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.87% <ø> (-0.02%) ⬇️
python-tests 99.00% <ø> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jeromekelleher
Member Author

@petrelharp @bhaller - I realised there was an easy way to join the tables: append all the edges to the first table, sort, and then squash adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified).
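
Roughly, the join looks like this (an illustrative sketch, not the code in this PR; join_tables is a made-up name, and it assumes the per-chromosome table collections all share the same node table, so no node remapping is needed):

/* Append every other chromosome's edges onto chrom_tables[0], then sort,
 * squash adjacent edges and index the result. */
static int
join_tables(tsk_table_collection_t *chrom_tables, tsk_size_t num_chroms)
{
    tsk_edge_table_t *dest = &chrom_tables[0].edges;
    tsk_size_t c, j;
    tsk_id_t row;
    int ret;

    for (c = 1; c < num_chroms; c++) {
        tsk_edge_table_t *src = &chrom_tables[c].edges;
        for (j = 0; j < src->num_rows; j++) {
            row = tsk_edge_table_add_row(dest, src->left[j], src->right[j],
                src->parent[j], src->child[j], NULL, 0);
            if (row < 0) {
                return (int) row;
            }
        }
    }
    ret = tsk_table_collection_sort(&chrom_tables[0], NULL, 0);
    if (ret != 0) {
        return ret;
    }
    ret = tsk_edge_table_squash(dest);
    if (ret != 0) {
        return ret;
    }
    /* squash can leave the edges out of the canonical order, so sort again
     * before building the index. */
    ret = tsk_table_collection_sort(&chrom_tables[0], NULL, 0);
    if (ret != 0) {
        return ret;
    }
    return tsk_table_collection_build_index(&chrom_tables[0], 0);
}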

@jeromekelleher
Member Author

Ah, probably need to set the sample flags for the final generation.

@jeromekelleher
Member Author

One more update - I've marked the sample flags and updated the original WF example so that it also indexes the output, and I think the outputs are now identical up to node reordering:

tree 0:
  num_sites: 0
  interval:  (0.000000, 0.006008)
  [ASCII tree drawing omitted: truncated at the terminal width]

tree 1:
  num_sites: 0
  interval:  (0.006008, 0.008218)
  [further output omitted]

and the second:

tree 0:
  num_sites: 0
  interval:  (0.000000, 0.006008)
  [ASCII tree drawing omitted: truncated at the terminal width]

We don't reorder the samples to 0, ..., n-1 in the new version. If you do want that, you can run simplify on the output again and it will renumber the nodes. (You could also do this more cheaply if you wanted it for compatibility purposes.)

@jeromekelleher
Member Author

I think we can meaningfully compare the timings of this with haploid_wright_fisher now, @petrelharp and @bhaller.

@petrelharp
Contributor

I think I know what's going on with the disappointing scaling here. Here's an example from running the code:

$ N=10000;  T=5000; S=1000
$ ./haploid_wright_fisher $N $T $S n10k.trees 42
Simplify at generation 4000: (10010000 nodes 20000000 edges) -> (84542 nodes 439275 edges)
Simplify at generation 3000: (10084542 nodes 20439275 edges) -> (92436 nodes 494357 edges)
Simplify at generation 2000: (10092436 nodes 20494357 edges) -> (97097 nodes 527279 edges)
Simplify at generation 1000: (10097097 nodes 20527279 edges) -> (100073 nodes 547226 edges)
Simplify at generation 0: (10100073 nodes 20547226 edges) -> (102133 nodes 560584 edges)

$ ./multichrom_wright_fisher $N $T $S n10k-multichrom.trees 42
Simplify 10010000 nodes
	chunk 1: 11250775 -> 70861
	chunk 5: 11249822 -> 68845
	chunk 0: 11250882 -> 69585
	chunk 6: 11248979 -> 71402
	chunk 3: 11250703 -> 71395
	chunk 7: 11247502 -> 71564
	chunk 4: 11251640 -> 71959
	chunk 2: 11249697 -> 71370

The numbers for the multichromosome example are the numbers of edges before and after simplification. Notice that each worker has about one-half of the total number of edges: there were 20M edges for the whole simulation (2 edges * 10K individuals * 1K generations), and each of the 8 workers has roughly 11M edges to deal with.

This is because: suppose that a chromosome has been inherited in k pieces, i.e., there are k edges that partition that chromosome. If we divide the chromosome among n workers, then we will get n + k - 1 edges, since we're breaking the chromosome at (a) the n - 1 chunk boundaries and (b) the k - 1 edge breakpoints - unless we get lucky and some of the breakpoints fall exactly on a division between adjacent chunks. (In this simulation, every chromosome is inherited in exactly k = 2 pieces.)

Now, what does simplify have to do under the hood? It's got a data structure that records the collection of ancestral segments that it's tracking, and then it goes through the edges one at a time and deals with them. This means that if the cost of dealing with each edge is O(1), then each worker ends up with a fraction (n + k - 1)/(nk) ≈ 1/k of the total edges, so our maximal speedup is about nk/(n + k - 1) ≈ k. (This is because we've increased the number of edges by a factor of (n + k - 1)/k but then divided that work among n workers.)
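
Plugging in the run above (k = 2 pieces per chromosome, n = 8 workers): each worker sees a fraction (n + k - 1)/(nk) = 9/16 = 0.5625 of the 20M original edges, i.e. about 11.25M edges, which matches the per-chunk counts printed above; the corresponding upper bound on the speedup from parallel simplify is nk/(n + k - 1) = 16/9 ≈ 1.8x.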

However, the work required by simplify's data structure scales with the number of segments it is tracking, to some degree: although simplify has to look at all of those edges, it only has to actually do something when it encounters an edge that some of its segments move along. If this were a substantial contribution to the runtime, then we might expect better speedups. The things we'd need to know are (a) what portion of simplify's work is "do something to the segments" versus "look at the next edge"; and (b) once we divide a chromosome up into chunks, how much does that reduce the proportion of its edges through which a segment of ancestry passes? I'm not very clear on this point, but I suspect the answer to (b) is "not much"; if so, then there's no hope for a speedup.

However, this approach would work just fine if each worker got literally one chromosome, since then edge breakpoints would usually (well, half the time) fall exactly on the divisions between workers. I may modify the code here to have a look at that case, and see if we get wins there.

ping @jeromekelleher @molpopgen

@petrelharp
Contributor

I did some very simple benchmarking - comparing runtimes for k chromosomes, done (a) in parallel and (b) not. (To do this I added a new file, multichrom_wright_fisher_singlethreaded.c, which we probably don't want to keep if we merge this example in.)
Here's the script:

#!/bin/bash

N=1000
T=5000
S=200

for k in $(seq 12)
do
    /usr/bin/time ./multichrom_wright_fisher_singlethreaded $N $T $S n10k.trees 42 $k &> single${k}.log
    /usr/bin/time ./multichrom_wright_fisher $N $T $S n10k.trees 42 $k &> multi${k}.log
done

echo "type num_chroms user system elapsed percent_cpu N T S" > results.txt
for t in single multi
do
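    # Pull the timing summary line from the end of each ${t}*.log, strip the
    # user/system/elapsed/%CPU labels, and emit one row per run.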
    (for x in ${t}*.log; do echo ${x%.log} $(tail -n 2 $x | head -n 1 | cut -f 1-4 -d ' ') $N $T $S; done | sed -e 's/user//' | sed -e 's/chroms//' | sed -e 's/system//' | sed -e 's/elapsed//' | sed -e 's/.CPU//' | sort -n ) | tr ' ' '\t' >> results.txt
done

With that, I get this: black points are single-threaded, red are multi-threaded. Note that the simulations are still doing a fair bit of non-parallel work that is included in these timings:
[Plot omitted (screenshot from 2023-09-11): elapsed time against number of chromosomes; black points single-threaded, red points multi-threaded.]

This is looking very good! The max CPU utilization percent reported is 675%. Here's the rest of the results:

     type num_chroms  user system elapsed percent_cpu
1   multi          1  4.22   0.48    4.70          99
2   multi          2  8.50   1.02    5.36         177
3   multi          3 13.21   1.47    5.92         247
4   multi          4 18.17   2.01    6.52         309
5   multi          5 24.05   2.86    8.27         325
6   multi          6 30.71   3.70    9.02         381
7   multi          7 38.30   4.45    9.79         436
8   multi          8 46.23   5.26   10.56         487
9   multi          9 54.43   6.11   11.41         530
10  multi         10 63.40   7.06   12.16         579
11  multi         11 73.02   8.35   12.95         628
12  multi         12 83.72   9.45   13.79         675
13 single          1  3.97   0.05    4.02         100
14 single          2  7.91   0.16    8.08         100
15 single          3 12.31   0.61   12.93          99
16 single          4 16.69   0.82   17.52          99
17 single          5 21.33   1.14   22.47          99
18 single          6 25.93   1.39   27.33          99
19 single          7 30.78   1.79   32.57          99
20 single          8 35.85   2.18   38.04          99
21 single          9 40.84   1.96   42.81          99
22 single         10 45.17   2.69   47.87          99
23 single         11 50.77   3.11   53.89          99
24 single         12 55.90   3.41   59.32          99

@jeromekelleher
Member Author

Aha! That's great @petrelharp, it really clarifies things 👍

@petrelharp
Contributor

I'm thinking it's worth fixing up the multichrom example to merge in? @bhaller is enthusiastic about getting this into SLiM at some point. I think this will just require switching over from delete_rows to keep_rows, and removing the extra file I put in?

@jeromekelleher
Member Author

OK, I'll fix it up when I get a chance.

@jeromekelleher
Member Author

I've updated this so that it's a clean build against upstream now @petrelharp. I think it needs a bit of documentation though - I don't really follow what the algorithm is doing. Could you update it with some comments to help clarify what the model is, etc.?

@jeromekelleher
Member Author

I think the docs build failure is because of a stale cache. The simplest thing is probably to create a new PR. I've squashed my original commits down to one.

@molpopgen
Member

I'm away on vacation for a couple of weeks, but:

  1. The speedup should be 75-80% at most if one doesn't do one "chromosome" per thread. That's what I showed w/my prototype when I presented this method.
  2. @petrelharp -- if I understand your figure, 12 chromosomes means 12 threads and a 6x speedup? Is that correct? If you run perf, you should be able to quantify the % of time that is purely single-threaded, which will give a better sense of how efficient the threading is.
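
(For example - an illustrative invocation only, reusing the arguments from the benchmark script above:

$ perf record -g ./multichrom_wright_fisher 1000 5000 200 n10k.trees 42 12
$ perf report

The resulting profile should give a rough picture of how much of the total time is spent outside the parallel sort/simplify workers.)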
