[BBT#559] Add llp scheduler: local lifo with priorities #325

Merged: bosilca merged 2 commits into ICLDisco:master from devreal:scheduler_llp on Oct 31, 2022

Conversation

@devreal (Contributor) commented on Mar 2, 2022

LLP requires an extension to the LIFO to be able to insert tasks ordered by priority and by distance. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching the merged chain.

The LLP scheduler builds on top of PaRSEC's LIFO and uses its data structures to implement a sorted stack.
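
A minimal sketch of the detach-merge-reattach idea (illustrative types and names, not the actual parsec_lifo API; a production lock-free LIFO also needs ABA protection, which PaRSEC handles internally):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct item {
    struct item *next;
    int          priority;   /* higher value = more urgent */
} item_t;

typedef struct {
    _Atomic(item_t *) head;
} lifo_t;

/* Merge two chains that are each already sorted by descending priority. */
static item_t *merge_sorted(item_t *a, item_t *b)
{
    item_t dummy = { .next = NULL }, *tail = &dummy;
    while (NULL != a && NULL != b) {
        if (a->priority >= b->priority) { tail->next = a; a = a->next; }
        else                            { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = (NULL != a) ? a : b;
    return dummy.next;
}

/* Insert a pre-sorted chain of new tasks, keeping the whole stack sorted. */
void lifo_chain_sorted(lifo_t *lifo, item_t *chain)
{
    /* Detach the entire current stack in one atomic exchange, then merge. */
    item_t *merged = merge_sorted(atomic_exchange(&lifo->head, NULL), chain);
    /* Reattach; if other threads pushed in the meantime, detach their
     * items too and merge again until the CAS succeeds on an empty head. */
    for (;;) {
        item_t *expected = NULL;
        if (atomic_compare_exchange_weak(&lifo->head, &expected, merged))
            return;
        merged = merge_sorted(merged, atomic_exchange(&lifo->head, NULL));
    }
}
```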

Compared to LFQ, LLP does not require threads to walk over an array of pointers (4 * 8 * 64 bytes = 2 KiB on 64 threads) and does not have a global lock on a system queue; it's all distributed LIFOs.

Compared to LL, LLP supports priorities and distances, which is needed for some frontends and generally enables critical path execution.

Compared to AP, LLP has no global queue and thus no global absolute priorities; priorities are local only. Stealing does not follow the highest priority across the local queues, as that would require another global array.

This is still WIP and needs some more testing and evaluation at scale. So far, it performs well in synthetic benchmarks.

Original PR on Bitbucket: https://bitbucket.org/icldistcomp/parsec/pull-requests/559

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

@devreal added the enhancement (New feature or request) label on Mar 2, 2022
@devreal requested a review from bosilca as a code owner on March 2, 2022
@devreal marked this pull request as draft on March 2, 2022
@devreal (Contributor, Author) commented on Mar 2, 2022

The implementation of parsec_lifo_chain_sorted should be moved out of the LIFO into the scheduler.

@devreal force-pushed the scheduler_llp branch 2 times, most recently from d9df9ae to ccded67 on April 18, 2022
@devreal marked this pull request as ready for review on June 20, 2022
@devreal requested a review from a team as a code owner on June 20, 2022
@devreal force-pushed the scheduler_llp branch 3 times, most recently from a719bdd to 4f9accb on June 20, 2022
@devreal (Contributor, Author) commented on Jun 20, 2022

I squashed and moved this out of draft. @bosilca @therault please review if you have a minute

@bosilca (Contributor) left a review comment


It is lacking the documentation that describes the scheduling strategy.

Two review threads on parsec/mca/sched/llp/sched_llp_module.c (outdated)
@devreal (Contributor, Author) commented on Jun 21, 2022

I added a comment in sched_lfq.h outlining the scheduling strategy.

@devreal requested reviews from abouteiller and therault on June 23, 2022
@omor1 (Contributor) commented on Jul 8, 2022

Does stealing try to steal from threads pinned to hierarchically nearby cores, as it does in e.g. LFQ? It seems to steal in order of execution stream id, which I think is not stable between executions and depends on the order in which threads were started.

@devreal (Contributor, Author) commented on Jul 9, 2022

I haven't tried copying over the hierarchical handling from LFQ; it shouldn't be too hard, though.

@omor1 (Contributor) commented on Jul 11, 2022

Misc thought relevant to this scheduler in particular: since all tasks with remote data will go to execution stream 0, if that is bound far from the communication thread, it would increase inter-NUMA traffic. That's a bit less relevant for some of the other schedulers, since they either have a single shared queue anyway or overflow into a global queue—under high pressure I expect many of these tasks to end up in the global queue anyway. With LLP there is no global queue, so all these tasks are pushed far from the communication thread.

It could be useful to, instead of always pushing to 0, to push to an execution stream close to the comm thread. This could be detected automatically via hwloc or put under user control via an mca var—though I'd prefer the mca var to specify the hardware core to prefer, not the execution stream, since those aren't necessarily the same.
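
A minimal sketch of the hwloc-based detection suggested here (hypothetical helper, not existing PaRSEC code; it assumes the topology is already loaded and the core the comm thread is bound to is known):

```c
#include <hwloc.h>

/* Given the core the comm thread is bound to, return the logical index
 * of the closest other core, or -1 if none can be determined.
 * Hypothetical helper, not part of PaRSEC. */
static int core_near_comm_thread(hwloc_topology_t topo, unsigned comm_core)
{
    hwloc_obj_t src = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, comm_core);
    if (NULL == src) return -1;
    hwloc_obj_t closest[1];
    /* hwloc returns objects of the same type as src, closest first. */
    unsigned n = hwloc_get_closest_objs(topo, src, closest, 1);
    return (n > 0) ? (int)closest[0]->logical_index : -1;
}
```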

@devreal (Contributor, Author) commented on Jul 11, 2022

> Misc thought relevant to this scheduler in particular: since all tasks with remote data will go to execution stream 0, if that is bound far from the communication thread, it would increase inter-NUMA traffic.

Sure, but that's only half the truth: if the tasks were pushed to a separate queue far from all the worker threads, you'd get the same amount of NUMA traffic, just in the other direction. I'm not sure which way is better...

> It could be useful to, instead of always pushing to 0, to push to an execution stream close to the comm thread. This could be detected automatically via hwloc or put under user control via an mca var—though I'd prefer the mca var to specify the hardware core to prefer, not the execution stream, since those aren't necessarily the same.

That is a broader change, since the comm thread assumes the personality of thread 0 by copying its execution context during initialization. In LLP, this means we need to protect the queue of thread 0 more than the others, because two threads can push into that same queue. There is nothing we can do in the scheduler to differentiate between thread 0 and the comm thread. I would love to get rid of that behavior, but that needs some more work.
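
A sketch of the single-writer distinction at play here, reusing the lifo_t, item_t, and merge_sorted definitions from the sketch in the PR description (illustrative only, not the actual PaRSEC code): if only one thread ever pushes into a queue, the reattach step can be a plain release store, because after the detach the head stays NULL until the owner writes it back (concurrent pops on an empty LIFO are no-ops). Once the comm thread may also push into thread 0's queue, the reattach has to fall back to a CAS loop.

```c
#include <stdbool.h>

/* Variant of lifo_chain_sorted with a single-writer fast path
 * (illustrative; reuses lifo_t/item_t/merge_sorted from above). */
void lifo_chain_sorted_sw(lifo_t *lifo, item_t *chain, bool single_writer)
{
    item_t *merged = merge_sorted(atomic_exchange(&lifo->head, NULL), chain);
    if (single_writer) {
        /* Nobody else pushes here: after the detach the head stays NULL
         * (pops on an empty LIFO cannot change it), so a release store
         * is enough to reattach. */
        atomic_store_explicit(&lifo->head, merged, memory_order_release);
        return;
    }
    /* Thread 0's queue: the comm thread may push concurrently, so any
     * items that appeared must be merged in and the CAS retried. */
    for (;;) {
        item_t *expected = NULL;
        if (atomic_compare_exchange_weak(&lifo->head, &expected, merged))
            return;
        merged = merge_sorted(merged, atomic_exchange(&lifo->head, NULL));
    }
}
```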

@omor1 (Contributor) commented on Jul 11, 2022

> That is a broader change, since the comm thread assumes the personality of thread 0 by copying its execution context during initialization. In LLP, this means we need to protect the queue of thread 0 more than the others, because two threads can push into that same queue. There is nothing we can do in the scheduler to differentiate between thread 0 and the comm thread. I would love to get rid of that behavior, but that needs some more work.

Right—I'm not suggesting to get rid of that behavior per se, but let the comm thread assume the personality of a thread other than 0. This should amount to changing a hardcoded 0 to e.g. an index determined by an mca var.

@omor1 (Contributor) commented on Jul 12, 2022

Hmmm, so I tried pulling this into my branch and then fixing it up so it'll work (I don't have some of the recent changes to the LIFO, e.g. it still uses lifo_ghost), but I seem to have a mistake somewhere, since a run with LCI never completed.

It's not completely broken, though, since it succeeded in generating the matrix. So I suspect that I either have a mistake somewhere in the code with single_writer == false, or that LCI communication being faster overloads the thread 0 LIFO.

devreal added 2 commits on August 8, 2022:

LLP requires an extension to the LIFO to be able to insert tasks ordered by priorities. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching the merged chain.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@devreal (Contributor, Author) commented on Aug 8, 2022

@omor1 Indeed, there was a problem in the slow code path (single_writer == false). I will test some more but so far it seems stable...

@omor1 (Contributor) commented on Aug 31, 2022

> @omor1 Indeed, there was a problem in the slow code path (single_writer == false). I will test some more but so far it seems stable...

I pulled in b8dc487 to my branch and HiCMA over PaRSEC/LCI ran correctly to completion.

It also seems to have (somehow?!) fixed an issue where, despite LCI being much faster than MPI, the reported "total critical path time" was longer. I suspect some tasks might have been getting stuck in the global queue if LCI generally has more tasks available, which shouldn't occur with LLP. I haven't tested LLP multiple times yet or with MPI, but with LFQ I saw high performance variability under LCI and very little under MPI; my hypothesis is, similar to the above, that LCI is faster, so sometimes important tasks get stuck in the global queue. LCI was still much faster than MPI regardless, but could itself vary as much as 10% in performance (anywhere between 130–144s). If this helps stabilize that, that's good :).

@bosilca added this to the v4.0 milestone on Oct 14, 2022
@bosilca merged commit 0902724 into ICLDisco:master on Oct 31, 2022