Load balancing for tasks with long tailed distributions #1343

yadudoc · 2019-10-09T22:18:47Z

Is your feature request related to a problem? Please describe.
HTEX and our current strategy try to distribute tasks evenly across all online managers, which can result in underutilization at the tail end of a large run. Since tasks are not packed block by block, and then manager by manager within a block, we end up in a situation where we cannot relinquish blocks that are severely underfilled.

Related to #172

Describe the solution you'd like
Currently we use a randomized scheme that works well for short-duration tasks, but poorly in the situation described above. We need a spill-over algorithm that fills each manager, and each block, before moving on to the next. Once this feature is added, we can add a new strategy that shuts down empty blocks as they appear, rather than waiting for the entire executor to be idle before starting scale-down events. We could leave it to the user to select the manager-task mapping algorithm via the Config.
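To make the spill-over idea concrete, here is a minimal sketch. It is not Parsl's actual interchange code: the `Manager` class, `pick_manager_spillover`, and `empty_blocks` are hypothetical names invented for illustration, and the sketch assumes each manager reports a fixed slot capacity and a running-task count.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Manager:
    # Hypothetical stand-in for an HTEX manager; not Parsl's real class.
    name: str
    block_id: str
    capacity: int      # worker slots on this manager
    running: int = 0   # tasks currently assigned

    @property
    def free(self) -> int:
        return self.capacity - self.running


def pick_manager_spillover(managers: List[Manager]) -> Optional[Manager]:
    """Pick the manager for the next task, or None if every manager is full.

    Prefers the busiest block that still has a free slot, then the busiest
    manager inside it, so load spills into a new block only when needed.
    """
    candidates = [m for m in managers if m.free > 0]
    if not candidates:
        return None
    block_load: Dict[str, int] = {}
    for m in managers:
        block_load[m.block_id] = block_load.get(m.block_id, 0) + m.running
    # Negated loads sort busiest first; ids break ties deterministically.
    return min(
        candidates,
        key=lambda m: (-block_load[m.block_id], m.block_id, -m.running, m.name),
    )


def empty_blocks(managers: List[Manager]) -> List[str]:
    """Blocks with no running tasks, i.e. candidates for early scale-down."""
    load: Dict[str, int] = {}
    for m in managers:
        load[m.block_id] = load.get(m.block_id, 0) + m.running
    return sorted(b for b, n in load.items() if n == 0)


if __name__ == "__main__":
    managers = [
        Manager("m0", "b0", 2), Manager("m1", "b0", 2),
        Manager("m2", "b1", 2), Manager("m3", "b1", 2),
    ]
    for _ in range(3):
        pick_manager_spillover(managers).running += 1
    print([m.running for m in managers])  # [2, 1, 0, 0]
    print(empty_blocks(managers))         # ['b1']
```

With three tasks on two blocks, everything lands on block `b0`, and `b1` stays empty and could be released. An even or randomized spread would likely leave both blocks partly occupied.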

Describe alternatives you've considered
Once a block drops below a utilization threshold, we could terminate it and reschedule its tasks.
This crude method would probably give better utilization, at the cost of some wasted compute.
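The threshold alternative could be sketched roughly as follows. The function name `plan_block_termination` and its inputs are hypothetical, and the sketch assumes we know each block's slot count and which task ids are running on it. It only produces a plan. It does not cancel anything, and it does not check that the remaining blocks have room for the rescheduled tasks.

```python
from typing import Dict, List, Tuple


def plan_block_termination(
    running: Dict[str, List[str]],   # block id -> ids of tasks running there
    capacity: Dict[str, int],        # block id -> total worker slots
    threshold: float,                # terminate blocks below this utilization
) -> Tuple[List[str], List[str]]:
    """Return (blocks to terminate, task ids that would need rescheduling)."""
    to_terminate: List[str] = []
    to_reschedule: List[str] = []
    for block_id in sorted(running):
        utilization = len(running[block_id]) / capacity[block_id]
        if utilization < threshold:
            to_terminate.append(block_id)
            # Work already done on these tasks is lost: the "wasted compute".
            to_reschedule.extend(running[block_id])
    return to_terminate, to_reschedule
```

Every task id in the second list restarts from scratch, so the cost grows with how long those tasks have already been running.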

Additional context
Requested by @dgasmith during Parslfest

dgasmith · 2019-10-10T12:38:46Z

We would be happy to specify this as an optional distribution strategy, given that in our use case tasks run for minutes to hours. However, I would be a bit hesitant to simply drop tasks and reschedule them without an explicit option.

@mattwelborn

danielskatz · 2019-10-10T12:42:19Z

I feel compelled to point to https://doi.org/10.1109/MTAGS.2010.5699433 for some previous thinking on this problem

dgasmith · 2019-10-10T18:22:05Z

Cool!

> The ability to downsize existing allocations (by releasing some fraction of the current processors without releasing all of them) would address problem 1, as no new allocation would need to be started. This would also partially address problem 2, as not all tasks would need to be canceled. This would require support from the supercomputer’s scheduler to implement.

You solved your own problem without changing supercomputer schedulers for non-leadership platforms!

yadudoc added enhancement parslfest2019 labels Oct 9, 2019

yadudoc self-assigned this Oct 9, 2019

danielskatz changed the title ~~Load balancing for task's with long tailed distributions~~ Load balancing for tasks with long tailed distributions Oct 10, 2019
