[BBT#559] Add llp scheduler: local lifo with priorities #325

Merged: bosilca merged 2 commits into ICLDisco:master from devreal:scheduler_llp on Oct 31, 2022

Conversation

@devreal (Contributor) commented on Mar 2, 2022

LLP requires an extension to the LIFO to be able to insert tasks ordered by priority and by distance. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching the merged chain.

The LLP scheduler builds on top of PaRSEC's LIFO and uses its data structures to implement a sorted stack.
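
A minimal sketch of the detach-merge-reattach idea (illustrative types and names, not the actual parsec_lifo API; a production lock-free LIFO also needs ABA protection, which PaRSEC handles internally):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct item {
    struct item *next;
    int          priority;   /* higher value = more urgent */
} item_t;

typedef struct {
    _Atomic(item_t *) head;
} lifo_t;

/* Merge two chains that are each already sorted by descending priority. */
static item_t *merge_sorted(item_t *a, item_t *b)
{
    item_t dummy = { .next = NULL }, *tail = &dummy;
    while (NULL != a && NULL != b) {
        if (a->priority >= b->priority) { tail->next = a; a = a->next; }
        else                            { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = (NULL != a) ? a : b;
    return dummy.next;
}

/* Insert a pre-sorted chain of new tasks, keeping the whole stack sorted. */
void lifo_chain_sorted(lifo_t *lifo, item_t *chain)
{
    /* Detach the entire current stack in one atomic exchange, then merge. */
    item_t *merged = merge_sorted(atomic_exchange(&lifo->head, NULL), chain);
    /* Reattach; if other threads pushed in the meantime, detach their
     * items too and merge again until the CAS succeeds on an empty head. */
    for (;;) {
        item_t *expected = NULL;
        if (atomic_compare_exchange_weak(&lifo->head, &expected, merged))
            return;
        merged = merge_sorted(merged, atomic_exchange(&lifo->head, NULL));
    }
}
```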

Compared to LFQ, LLP does not require threads to walk over an array of pointers (4 * 8 * 64 bytes = 2 KiB on 64 threads) and does not have a global lock on a system queue; it's all distributed LIFOs.

Compared to LL, LLP supports priorities and distances, which is needed for some frontends and generally enables critical path execution.

Compared to AP, LLP has no global queue and thus no global absolute priorities; priorities are local only. Stealing does not follow the highest priority across the local queues, as that would require another global array.

This is still WIP and needs some more testing and evaluation at scale. So far, it performs well in synthetic benchmarks.

Original PR on Bitbucket: https://bitbucket.org/icldistcomp/parsec/pull-requests/559

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

@devreal added the enhancement (New feature or request) label on Mar 2, 2022
@devreal requested a review from bosilca as a code owner on March 2, 2022
@devreal marked this pull request as draft on March 2, 2022
@devreal (Contributor, Author) commented on Mar 2, 2022

The implementation of parsec_lifo_chain_sorted should be moved out of the LIFO into the scheduler.

@devreal force-pushed the scheduler_llp branch 2 times, most recently from d9df9ae to ccded67 on April 18, 2022
@devreal marked this pull request as ready for review on June 20, 2022
@devreal requested a review from a team as a code owner on June 20, 2022
@devreal force-pushed the scheduler_llp branch 3 times, most recently from a719bdd to 4f9accb on June 20, 2022
@devreal (Contributor, Author) commented on Jun 20, 2022

I squashed and moved this out of draft. @bosilca @therault please review if you have a minute

@bosilca (Contributor) left a review comment


It is lacking the documentation that describes the scheduling strategy.

Two review threads on parsec/mca/sched/llp/sched_llp_module.c (outdated)
@devreal (Contributor, Author) commented on Jun 21, 2022

I added a comment in sched_lfq.h outlining the scheduling strategy.

@devreal requested reviews from abouteiller and therault on June 23, 2022
@omor1 (Contributor) commented on Jul 8, 2022

Does stealing try to steal from threads pinned to hierarchically nearby cores, as it does in e.g. LFQ? It seems to steal in order of execution stream id, which I think is not stable between executions and depends on the order in which threads were started.

@devreal (Contributor, Author) commented on Jul 9, 2022

I haven't tried copying over the hierarchical handling from LFQ; it shouldn't be too hard, though.

@omor1 (Contributor) commented on Jul 11, 2022

Misc thought relevant to this scheduler in particular: since all tasks with remote data will go to execution stream 0, if that is bound far from the communication thread, it would increase inter-NUMA traffic. That's a bit less relevant for some of the other schedulers, since they either have a single shared queue anyway or overflow into a global queue—under high pressure I expect many of these tasks to end up in the global queue anyway. With LLP there is no global queue, so all these tasks are pushed far from the communication thread.

It could be useful to, instead of always pushing to 0, to push to an execution stream close to the comm thread. This could be detected automatically via hwloc or put under user control via an mca var—though I'd prefer the mca var to specify the hardware core to prefer, not the execution stream, since those aren't necessarily the same.
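
A minimal sketch of the hwloc-based detection suggested here (hypothetical helper, not existing PaRSEC code; it assumes the topology is already loaded and the core the comm thread is bound to is known):

```c
#include <hwloc.h>

/* Given the core the comm thread is bound to, return the logical index
 * of the closest other core, or -1 if none can be determined.
 * Hypothetical helper, not part of PaRSEC. */
static int core_near_comm_thread(hwloc_topology_t topo, unsigned comm_core)
{
    hwloc_obj_t src = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, comm_core);
    if (NULL == src) return -1;
    hwloc_obj_t closest[1];
    /* hwloc returns objects of the same type as src, closest first. */
    unsigned n = hwloc_get_closest_objs(topo, src, closest, 1);
    return (n > 0) ? (int)closest[0]->logical_index : -1;
}
```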

@devreal (Contributor, Author) commented on Jul 11, 2022

> Misc thought relevant to this scheduler in particular: since all tasks with remote data will go to execution stream 0, if that is bound far from the communication thread, it would increase inter-NUMA traffic.

Sure, but that's only half the truth: if the tasks were pushed to a separate queue far from all the worker threads, you'd get the same amount of NUMA traffic, just in the other direction. I'm not sure which way is better...

> It could be useful to, instead of always pushing to 0, to push to an execution stream close to the comm thread. This could be detected automatically via hwloc or put under user control via an mca var—though I'd prefer the mca var to specify the hardware core to prefer, not the execution stream, since those aren't necessarily the same.

That is a broader change, since the comm thread assumes the personality of thread 0 by copying its execution context during initialization. In LLP, this means we need to protect the queue of thread 0 more than the others, because two threads can push into that same queue. There is nothing we can do in the scheduler to differentiate between thread 0 and the comm thread. I would love to get rid of that behavior, but that needs some more work.
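
A sketch of the single-writer distinction at play here, reusing the lifo_t, item_t, and merge_sorted definitions from the sketch in the PR description (illustrative only, not the actual PaRSEC code): if only one thread ever pushes into a queue, the reattach step can be a plain release store, because after the detach the head stays NULL until the owner writes it back (concurrent pops on an empty LIFO are no-ops). Once the comm thread may also push into thread 0's queue, the reattach has to fall back to a CAS loop.

```c
#include <stdbool.h>

/* Variant of lifo_chain_sorted with a single-writer fast path
 * (illustrative; reuses lifo_t/item_t/merge_sorted from above). */
void lifo_chain_sorted_sw(lifo_t *lifo, item_t *chain, bool single_writer)
{
    item_t *merged = merge_sorted(atomic_exchange(&lifo->head, NULL), chain);
    if (single_writer) {
        /* Nobody else pushes here: after the detach the head stays NULL
         * (pops on an empty LIFO cannot change it), so a release store
         * is enough to reattach. */
        atomic_store_explicit(&lifo->head, merged, memory_order_release);
        return;
    }
    /* Thread 0's queue: the comm thread may push concurrently, so any
     * items that appeared must be merged in and the CAS retried. */
    for (;;) {
        item_t *expected = NULL;
        if (atomic_compare_exchange_weak(&lifo->head, &expected, merged))
            return;
        merged = merge_sorted(merged, atomic_exchange(&lifo->head, NULL));
    }
}
```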

@omor1 (Contributor) commented on Jul 11, 2022

> That is a broader change, since the comm thread assumes the personality of thread 0 by copying its execution context during initialization. In LLP, this means we need to protect the queue of thread 0 more than the others, because two threads can push into that same queue. There is nothing we can do in the scheduler to differentiate between thread 0 and the comm thread. I would love to get rid of that behavior, but that needs some more work.

Right—I'm not suggesting to get rid of that behavior per se, but let the comm thread assume the personality of a thread other than 0. This should amount to changing a hardcoded 0 to e.g. an index determined by an mca var.

@omor1 (Contributor) commented on Jul 12, 2022

Hmmm, so I tried pulling this into my branch and then fixing it up so it'll work (I don't have some of the recent changes to the LIFO, e.g. it still uses lifo_ghost), but I seem to have a mistake somewhere, since a run with LCI never completed.

It's not completely broken, though, since it succeeded in generating the matrix. So I suspect that I either have a mistake somewhere in the code with single_writer == false, or that LCI communication being faster overloads the thread 0 LIFO.

devreal added 2 commits on August 8, 2022:

LLP requires an extension to the LIFO to be able to insert tasks ordered by priorities. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching the merged chain.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@devreal (Contributor, Author) commented on Aug 8, 2022

@omor1 Indeed, there was a problem in the slow code path (single_writer == false). I will test some more but so far it seems stable...

@omor1 (Contributor) commented on Aug 31, 2022

> @omor1 Indeed, there was a problem in the slow code path (single_writer == false). I will test some more but so far it seems stable...

I pulled in b8dc487 to my branch and HiCMA over PaRSEC/LCI ran correctly to completion.

It also seems to have (somehow?!) fixed an issue where, despite LCI being much faster than MPI, the reported "total critical path time" was longer. I suspect some tasks might have been getting stuck in the global queue if LCI generally has more tasks available, which shouldn't occur with LLP. I haven't tested LLP multiple times yet or with MPI, but with LFQ I saw high performance variability under LCI and very little under MPI; my hypothesis is, similar to the above, that LCI is faster, so sometimes important tasks get stuck in the global queue. LCI was still much faster than MPI regardless, but could itself vary as much as 10% in performance (anywhere between 130–144s). If this helps stabilize that, that's good :).

@bosilca added this to the v4.0 milestone on Oct 14, 2022
@bosilca merged commit 0902724 into ICLDisco:master on Oct 31, 2022