[BBT#559] Add llp scheduler: local lifo with priorities #325
bosilca merged 2 commits into ICLDisco:master
Conversation
The implementation of
Force-pushed from d9df9ae to ccded67
Force-pushed from a719bdd to 4f9accb
bosilca left a comment

It is lacking the documentation that describes the scheduling strategy.
I added a comment in
Does stealing try to steal from threads pinned to hierarchically nearby cores, as it does in e.g. LFQ? It seems to steal in order of execution stream id, which I think is not stable between executions and depends on the order in which the threads were started.
I haven't tried copying over the hierarchical handling from LFQ; it shouldn't be too hard, though.
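For illustration only, a hierarchy-aware victim order along the lines of what LFQ does could look roughly like this sketch. `build_steal_order` and the `socket_of` array are hypothetical names, and the real LFQ code derives locality from hwloc rather than from a flat socket array:

```c
#include <assert.h>

/* Hypothetical sketch of a hierarchy-aware steal order, assuming each
 * execution stream knows the socket (or NUMA node) its thread is bound
 * to.  Victims on the same socket as the thief come first, then the
 * rest; within each group the order is by stream id, so the order no
 * longer depends on thread start order across sockets. */
int build_steal_order(int self, const int *socket_of, int nstreams, int *order)
{
    int n = 0;
    /* nearby victims first: streams on the same socket as the thief */
    for (int i = 0; i < nstreams; i++)
        if (i != self && socket_of[i] == socket_of[self])
            order[n++] = i;
    /* then everyone else */
    for (int i = 0; i < nstreams; i++)
        if (i != self && socket_of[i] != socket_of[self])
            order[n++] = i;
    return n;  /* number of potential victims */
}
```

With 4 streams split across 2 sockets, a thief on socket 1 would visit its socket-mate before crossing to socket 0.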
Misc thought relevant to this scheduler in particular: since all tasks with remote data will go to execution stream 0, if that is bound far from the communication thread, it would increase inter-NUMA traffic. That's a bit less relevant for some of the other schedulers, since they either have a single shared queue anyway or overflow into a global queue—under high pressure I expect many of these tasks to end up in the global queue anyway. With LLP there is no global queue, so all these tasks are pushed far from the communication thread. It could be useful, instead of always pushing to 0, to push to an execution stream close to the comm thread. This could be detected automatically via hwloc or put under user control via an mca var—though I'd prefer the mca var to specify the hardware core to prefer, not the execution stream, since those aren't necessarily the same.
Sure, but that's only half the truth: if the tasks were to be pushed to a separate queue far from all the worker threads, you'd get the same amount of NUMA traffic, just in the other direction. I'm not sure which way is better...
That is a broader change, since the comm thread assumes the personality of thread 0 by copying its execution context during initialization. In LLP, this means we need to protect the queue of thread 0 more than the others, because two threads can push into that same queue. There is nothing we can do in the scheduler to differentiate between thread 0 and the comm thread. I would love to get rid of that behavior, but that needs some more work.
Right—I'm not suggesting to get rid of that behavior per se, but to let the comm thread assume the personality of a thread other than 0. This should amount to changing a hardcoded
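As a rough illustration of that mca-var idea, mapping a user-preferred hardware core to the nearest execution stream might look like the following. The names here (`stream_near_core`, `stream_core`) are hypothetical, and a real implementation would use hwloc topology distances rather than raw core-id differences:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: choose the execution stream bound closest to a
 * user-preferred hardware core (e.g. taken from an mca var), instead
 * of hardcoding stream 0 for tasks with remote data.  stream_core[i]
 * is the core that stream i's thread is pinned to; core-id distance
 * is a crude stand-in for real hwloc topology distances. */
int stream_near_core(const int *stream_core, int nstreams, int preferred_core)
{
    int best = 0;
    int best_dist = abs(stream_core[0] - preferred_core);
    for (int i = 1; i < nstreams; i++) {
        int d = abs(stream_core[i] - preferred_core);
        if (d < best_dist) { best_dist = d; best = i; }
    }
    return best;
}
```

Note this keys on the core, not the stream id, matching the point above that the two aren't necessarily the same.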
Hmmm, so I tried pulling this in to my branch and then fixing it up so it'll work—I don't have some of the recent changes to the LIFO, e.g. it still uses [...]. Though it's not completely broken, since it succeeded in generating the matrix, so I suspect that I either have a mistake somewhere in the code with
LLP requires an extension to the LIFO to be able to insert tasks ordered by priorities. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching it. Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
@omor1 Indeed, there was a problem in the slow code path (
I pulled in b8dc487 to my branch and HiCMA over PaRSEC/LCI ran correctly to completion. It also seems to have (somehow?!) fixed an issue where, despite LCI being much faster than MPI, the reported "total critical path time" was longer—I suspect some tasks might have been getting stuck in the global queue if LCI generally has more tasks available, which shouldn't occur with LLP. I haven't tested LLP multiple times yet or with MPI, but with LFQ I saw high performance variability under LCI and very little under MPI—my hypothesis is that, similar to the above, LCI is faster, so sometimes important tasks get stuck in the global queue. LCI was still much faster than MPI regardless, but could itself vary as much as 10% in performance (anywhere between 130–144s). If this helps stabilize that, that's good :).
LLP requires an extension to the LIFO to be able to insert tasks ordered by priorities and with distances. The sorted LIFO insertion works by detaching all elements, merging in the new elements, and reattaching the result.

The LLP scheduler builds on top of PaRSEC's LIFO and uses its data structures to implement a sorted stack.
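A minimal, non-atomic sketch of that detach/merge/reattach insertion follows. `task_t` and `llp_lifo_t` are stand-ins invented for this example; the actual PaRSEC LIFO uses its own list-item types and detaches the list atomically:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for this sketch; higher priority value == more urgent. */
typedef struct task_s {
    struct task_s *next;
    int priority;
} task_t;

typedef struct { task_t *head; } llp_lifo_t;

/* Merge two chains that are each sorted by descending priority. */
static task_t *merge_chains(task_t *a, task_t *b)
{
    task_t dummy, *tail = &dummy;
    while (a != NULL && b != NULL) {
        if (a->priority >= b->priority) { tail->next = a; a = a->next; }
        else                            { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = (a != NULL) ? a : b;
    return dummy.next;
}

/* Sorted insertion: detach the whole stack (atomically in the real
 * code), merge in the new sorted chain, and reattach the result. */
void llp_lifo_push_sorted(llp_lifo_t *lifo, task_t *chain)
{
    task_t *old = lifo->head;               /* detach */
    lifo->head = merge_chains(old, chain);  /* merge + reattach */
}

task_t *llp_lifo_pop(llp_lifo_t *lifo)
{
    task_t *t = lifo->head;
    if (t != NULL) lifo->head = t->next;
    return t;  /* highest-priority task first */
}
```

Because the stack stays sorted, a plain LIFO pop always yields the locally highest-priority task, which is what lets LLP honor priorities without a global queue.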
Compared to LFQ, LLP does not require threads to walk over an array of pointers (4*8*64 = 2k on 64 threads) and does not have a global lock in a system queue. It’s all distributed LIFOs.
Compared to LL, LLP supports priorities and distances, which is needed for some frontends and generally enables critical path execution.
Compared to AP, LLP has no global queue so no global absolute priorities. Priorities are local only. Stealing does not follow the highest priorities of the local queues as that would require another global array.
This is still WIP; it needs some more testing and evaluation at scale. So far, it performs well in synthetic benchmarks.
Original PR on Bitbucket: https://bitbucket.org/icldistcomp/parsec/pull-requests/559
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>