Is your feature request related to a problem? Please describe.
The HTEX executor and our current strategy try to distribute tasks evenly across all online managers, which can result in underutilization at the tail end of a large run. Since tasks are not packed to the least filled block and then to the least filled manager, we end up in a situation where we cannot relinquish blocks that are severely underfilled.
Describe the solution you'd like
Currently we use a randomized scheme that works great for short-duration tasks, but poorly for the situation described above. We'd need a spill-over algorithm that attempts to fill one block, and each manager within it, before moving on to the next. Once this feature is added, we can add a new strategy that shuts down any empty blocks rather than waiting for the entire executor to be idle before starting scale-down events. We could leave it to the user to select the manager-task mapping algorithm via the Config.
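To make the proposal concrete, here is a minimal sketch of the spill-over selection and the empty-block check it enables. This is not Parsl's actual code: `Manager`, `spill_over_pick`, `assign`, and `empty_blocks` are hypothetical names, and the sketch assumes the scheduler can see each manager's block, slot capacity, and running task count.

```python
from dataclasses import dataclass


@dataclass
class Manager:
    """Hypothetical stand-in for an HTEX manager (not Parsl's real class)."""
    manager_id: str
    block_id: str
    capacity: int      # worker slots on this manager (assumed known)
    running: int = 0   # tasks currently assigned to this manager

    @property
    def free_slots(self) -> int:
        return self.capacity - self.running


def spill_over_pick(managers):
    """Pick the busiest manager with a free slot inside the busiest block."""
    candidates = [m for m in managers if m.free_slots > 0]
    if not candidates:
        return None
    # Total running tasks per block, so partially filled blocks are topped up first.
    block_load = {}
    for m in managers:
        block_load[m.block_id] = block_load.get(m.block_id, 0) + m.running
    # Busiest block first, then busiest manager; ids break ties deterministically.
    return min(
        candidates,
        key=lambda m: (-block_load[m.block_id], -m.running, m.block_id, m.manager_id),
    )


def assign(managers):
    """Assign one task using the spill-over rule; return the chosen manager."""
    chosen = spill_over_pick(managers)
    if chosen is not None:
        chosen.running += 1
    return chosen


def empty_blocks(managers):
    """Blocks whose managers run no tasks; candidates for early scale-down."""
    load = {}
    for m in managers:
        load[m.block_id] = load.get(m.block_id, 0) + m.running
    return sorted(b for b, n in load.items() if n == 0)


managers = [
    Manager("m0", "b0", capacity=2),
    Manager("m1", "b0", capacity=2),
    Manager("m0", "b1", capacity=2),
    Manager("m1", "b1", capacity=2),
]
placements = [(m.block_id, m.manager_id) for m in (assign(managers) for _ in range(3))]
idle = empty_blocks(managers)
```

With three tasks and two-slot managers, the first manager in block `b0` fills up before the second one receives work, and block `b1` stays empty, so a scale-down strategy could release it without waiting for the whole executor to go idle.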
Describe alternatives you've considered
Once blocks drop below a utilization threshold, you could terminate them and reschedule their tasks.
This crude method would probably give better utilization, at the cost of some wasted compute.
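A rough sketch of this alternative, assuming per-block running-task and slot counts are available to the strategy. The function name `blocks_to_terminate`, the `(running, capacity)` layout, and the threshold value are all illustrative and not part of Parsl.

```python
def blocks_to_terminate(block_usage, threshold):
    """Return (block_ids, tasks_to_reschedule) for blocks under the threshold.

    block_usage maps a block id to (running_tasks, slot_capacity).
    Tasks on a terminated block would have to be rerun elsewhere,
    which is the wasted compute mentioned above.
    """
    victims = []
    rescheduled = 0
    for block_id, (running, capacity) in sorted(block_usage.items()):
        if capacity > 0 and running / capacity < threshold:
            victims.append(block_id)
            rescheduled += running
    return victims, rescheduled


usage = {"b0": (7, 8), "b1": (1, 8), "b2": (0, 8)}
victims, rescheduled = blocks_to_terminate(usage, threshold=0.25)
```

In this example the nearly full block `b0` is kept, while `b1` and `b2` fall below 25% utilization; terminating them frees two blocks at the price of restarting the single task that was running on `b1`.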
Additional context
Requested by @dgasmith during Parslfest
We would be happy to specify this as an optional distribution strategy, given our use-case assumption that tasks run for minutes to hours. I would be a bit hesitant, however, to simply drop tasks and reschedule them without an explicit option.
danielskatz changed the title from "Load balancing for task's with long tailed distributions" to "Load balancing for tasks with long tailed distributions" on Oct 10, 2019
The ability to downsize existing allocations (by releasing some fraction of the current processors without releasing all of them) would address problem 1, as no new allocation would need to be started. This would also partially address problem 2, as not all tasks would need to be canceled. This would require support from the supercomputer’s scheduler to implement.
You solved your own problem without changing supercomputer schedulers for non-leadership platforms!
Related to #172