Load balancing for tasks with long tailed distributions #1343

Open
yadudoc opened this issue Oct 9, 2019 · 3 comments

Comments

@yadudoc
Member

yadudoc commented Oct 9, 2019

Is your feature request related to a problem? Please describe.
The HTEX executor and our current strategy try to distribute tasks evenly across all online managers, which can result in underutilization at the tail end of a large run. Because tasks are not packed to the least filled block and then to the least filled manager, we end up unable to relinquish blocks that are severely underfilled.

Related to #172

Describe the solution you'd like
Currently we use a randomized scheme that works great for short-duration tasks but poorly for the situation described above. We'd need a spill-over algorithm that attempts to fill a block, and each manager within it, before moving to the next. Once this feature is added, we can add a new strategy that shuts down any empty blocks rather than waiting for the entire executor to be idle before starting scale-down events. We could leave it to the user to select the manager-task mapping algorithm via the Config.
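To make the spill-over idea concrete, here is a minimal sketch in plain Python. It is not Parsl code: the `Manager` class, `spill_over_assign`, and `empty_blocks` are hypothetical names invented for illustration, and each manager is assumed to have a fixed number of worker slots.

```python
from dataclasses import dataclass, field


@dataclass
class Manager:
    """Hypothetical stand-in for a manager with a fixed number of worker slots."""
    name: str
    block: str
    capacity: int
    tasks: list = field(default_factory=list)

    @property
    def free_slots(self) -> int:
        return self.capacity - len(self.tasks)


def spill_over_assign(task_ids, managers):
    """Fill managers one at a time, block by block, before spilling to the next.

    Returns the task ids that could not be placed because every slot is full.
    """
    # Sorting by (block, manager name) means one block fills up completely
    # before any task lands on a manager in the next block.
    ordered = sorted(managers, key=lambda m: (m.block, m.name))
    pending = list(task_ids)
    for manager in ordered:
        while pending and manager.free_slots > 0:
            manager.tasks.append(pending.pop(0))
    return pending


def empty_blocks(managers):
    """Blocks in which no manager holds a task: candidates for early scale-down."""
    busy = {m.block for m in managers if m.tasks}
    return sorted({m.block for m in managers} - busy)


managers = [
    Manager("m0", "block-0", capacity=2),
    Manager("m1", "block-0", capacity=2),
    Manager("m2", "block-1", capacity=2),
    Manager("m3", "block-1", capacity=2),
]
leftover = spill_over_assign(["t0", "t1", "t2"], managers)
print({m.name: m.tasks for m in managers})
print(empty_blocks(managers))
```

With three tasks and four two-slot managers, everything lands in `block-0`, so `block-1` stays empty and could be released without waiting for the whole executor to go idle. A randomized scheme would likely have touched both blocks.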

Describe alternatives you've considered
Once blocks drop below a utilization threshold, you could terminate them and reschedule their tasks.
This crude method would probably give better utilization, at the cost of some wasted compute.
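A rough sketch of this alternative, again in plain Python rather than Parsl's API. The function name `plan_block_termination`, the `threshold` parameter, and its default of 0.25 are all assumptions made for illustration. The function only produces a plan; any partial progress on the rescheduled tasks would be lost, which is the wasted compute mentioned above.

```python
def plan_block_termination(block_load, block_capacity, threshold=0.25):
    """Pick blocks whose utilization is below `threshold`.

    block_load maps block name -> list of running task ids.
    block_capacity maps block name -> total worker slots in that block.
    Returns (blocks_to_terminate, tasks_to_reschedule).
    """
    to_terminate = []
    to_reschedule = []
    for block, tasks in sorted(block_load.items()):
        utilization = len(tasks) / block_capacity[block]
        if utilization < threshold:
            to_terminate.append(block)
            # These tasks would be restarted elsewhere, losing their progress.
            to_reschedule.extend(tasks)
    return to_terminate, to_reschedule


block_load = {"block-0": ["t0", "t1", "t2", "t3"], "block-1": ["t4"]}
block_capacity = {"block-0": 8, "block-1": 8}
plan = plan_block_termination(block_load, block_capacity)
print(plan)
```

Here `block-1` runs at 1/8 utilization and is selected, while `block-0` at 4/8 is kept. A real strategy would also need to confirm that the surviving blocks have free slots for the rescheduled tasks.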

Additional context
Requested by @dgasmith during Parslfest

@dgasmith
Contributor

dgasmith commented Oct 10, 2019

We would be happy to specify this as an optional distribution strategy, given our use case assumption that tasks take minutes to hours. I would be a bit hesitant, however, to simply drop tasks and reschedule them without an explicit option.

@mattwelborn

@danielskatz danielskatz changed the title Load balancing for task's with long tailed distributions Load balancing for tasks with long tailed distributions Oct 10, 2019
@danielskatz
Member

I feel compelled to point to https://doi.org/10.1109/MTAGS.2010.5699433 for some previous thinking on this problem.

@dgasmith
Contributor

dgasmith commented Oct 10, 2019

Cool!

The ability to downsize existing allocations (by releasing some fraction of the current processors without releasing all of them) would address problem 1, as no new allocation would need to be started. This would also partially address problem 2, as not all tasks would need to be canceled. This would require support from the supercomputer’s scheduler to implement.

You solved your own problem without changing supercomputer schedulers for non-leadership platforms!
