
Low capacity jobs blocking high capacity jobs with higher prio #540

Open
ultra-sonic opened this issue Apr 20, 2022 · 6 comments

@ultra-sonic

Hi Timur,

Today I am reporting an issue that has been bugging us for a while and is becoming increasingly important right now.

The scenario is simple:
All render nodes have a total capacity of 1100.
Job 1 has priority 50 and 1000 tasks that each need a capacity of 500.
Job 1 has already started, and its tasks finish asynchronously, so at most 600 capacity is ever free, which prevents higher-capacity tasks from starting.
Job 2 has priority 200 and a single task that needs 1000 capacity, but it can never start because there is never enough capacity left.
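
To make the dilemma concrete, here is a tiny simulation of a dispatcher that always starts the highest-priority task that fits into the free capacity (a simplification for illustration, not Afanasy's actual scheduling code; the numbers are the ones from the scenario above):

```python
import random

NODE_CAPACITY = 1100
free = NODE_CAPACITY
running = []   # capacities of the tasks currently running on the node

# Numbers from the scenario above (priority, per-task capacity, task count).
job1 = {"prio": 50,  "cap": 500,  "tasks": 1000}
job2 = {"prio": 200, "cap": 1000, "tasks": 1}

def schedule(jobs):
    """Start the highest-priority tasks that still fit into the free capacity."""
    global free
    for job in sorted(jobs, key=lambda j: -j["prio"]):
        while job["tasks"] > 0 and job["cap"] <= free:
            job["tasks"] -= 1
            free -= job["cap"]
            running.append(job["cap"])

schedule([job1])                 # Job 1 is already started: two 500c tasks run, 100c free

for _ in range(100):             # tasks finish one at a time, "asynchronously"
    free += running.pop(random.randrange(len(running)))
    schedule([job1, job2])       # at most 600c is ever free, so the 1000c task never fits

print("Job 2 tasks still waiting:", job2["tasks"])   # always prints 1
```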

I think this is something you already know about, and I can imagine that you already have a solution for it. Do you?

Cheers
Oli
@sebastianelsner

@timurhai
Member

Hi Oli (sorry for the delay),

Unfortunately, there is no general solution for such situations.
If your renders have 1100 capacity, 500c tasks will never allow 1000c tasks to start. And for now I do not see a simple solution in which the 500c tasks would "go on pause".
But we do use low-capacity tasks at work. Our common render capacity is 1500 and the common task capacity is 1000. Tasks with less than 500 capacity should be very lightweight; such tasks will not take over the entire farm, or if they do, only for a short period of time.
Sometimes a user has "very heavy" tasks that cannot run in parallel even with light tasks. In that case the user can set the capacity to 1500 to take all of a render's capacity.
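
For illustration, the arithmetic behind that convention (the numbers are taken from this comment; the helper function is only for the example):

```python
HOST_CAPACITY = 1500   # the common render capacity mentioned above

def fits(free_capacity, task_capacity):
    """A task can only start if its capacity fits into the host's free capacity."""
    return task_capacity <= free_capacity

# A common 1000c task leaves 500c free, so only genuinely light tasks fit next to it:
print(fits(HOST_CAPACITY - 1000, 400))    # True  -> a light task can share the host
print(fits(HOST_CAPACITY - 1000, 1000))   # False -> a second common task cannot
# A "very heavy" task set to 1500 takes the whole host and shares with nothing:
print(fits(HOST_CAPACITY - 1500, 400))    # False
```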

@ultra-sonic
Author

ultra-sonic commented Mar 2, 2023

Hello again... this issue is becoming increasingly important at RISE at the moment because we are about to run more than one task per host by default soon. The scenario will look like this:
Host capacity is equal to the number of cores on the host. Our render farm consists of a wild mix of 8, 12, 32, 40, 64, 128 and 256 core machines, roughly 800 nodes in total.

I have the following render jobs in the farm:
easy - capacity 8
medium - capacity 64
heavy - capacity 256

As described earlier, the problem is that if the heavy job is submitted after the easy and medium jobs, it will not start, because the 256-core render nodes will be busy working on, let's say, 4 tasks of the medium job. Since those 4 tasks will not all finish at the same time, there will never be enough free capacity until all easy and medium jobs are finished.

My temporary fix for this would be to limit the max tasks on all 256-core nodes (that match the job's hostmask) as long as there are heavy jobs with status RDY. This dynamic limiting could be done via a cron job that runs every minute; a rough sketch of such a script follows below.
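
Just to make the idea concrete, a minimal sketch of such a cron script. The helpers list_jobs(), list_renders() and set_render_max_tasks() are stand-ins for whatever the farm's real API or command-line tools provide, and the dictionary keys are made up for the example, not Afanasy's actual field names:

```python
#!/usr/bin/env python3
"""Cron sketch (run every minute): while any heavy job still has ready tasks,
cap the number of concurrent tasks on the 256-core nodes so they can drain."""

HEAVY_CAPACITY = 256     # per-task capacity of a "heavy" job in this example
BIG_NODE_CORES = 256     # the nodes that should drain for heavy tasks
LIMITED_MAX_TASKS = 1    # cap while heavy work is waiting (0 would drain a node completely)
DEFAULT_MAX_TASKS = 4    # the normal cap on the big nodes

def list_jobs():
    """Stand-in: return all jobs with their per-task capacity and ready-task count."""
    return [{"name": "medium", "capacity": 64, "ready_tasks": 120},
            {"name": "heavy", "capacity": 256, "ready_tasks": 3}]

def list_renders():
    """Stand-in: return all render nodes with their core count and current cap."""
    return [{"name": "render01", "cores": 256, "max_tasks": 4}]

def set_render_max_tasks(render_name, max_tasks):
    """Stand-in: apply the new cap through the farm's API or CLI."""
    print(f"{render_name}: max tasks -> {max_tasks}")

def limit_big_nodes():
    heavy_waiting = any(job["capacity"] >= HEAVY_CAPACITY and job["ready_tasks"] > 0
                        for job in list_jobs())
    wanted = LIMITED_MAX_TASKS if heavy_waiting else DEFAULT_MAX_TASKS
    for render in list_renders():
        if render["cores"] == BIG_NODE_CORES and render["max_tasks"] != wanted:
            set_render_max_tasks(render["name"], wanted)

if __name__ == "__main__":
    limit_big_nodes()
```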

I can imagine that the above temp fix could be integrated into afserver much more elegantly, but I realize that this takes some time, and maybe you can come up with a much smarter solution for this issue. Can you? 😉

One thing that afserver can do which is not that easy to re-implement in a cron job is limiting the max tasks only on a specific number of hosts, based on the "need" of the heavy job, because I do want medium jobs with higher priority to be scheduled on the 256-core nodes if their priority is a lot higher. If we don't take the priority into account, then low-priority heavy jobs would take away resources from high-priority medium jobs. Does that make sense?
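
A sketch of that refinement, again with made-up field names: reserve only as many big nodes as the heavy job can actually use, and only if no medium job outranks it by a large margin:

```python
def big_nodes_to_limit(heavy_job, medium_jobs, priority_margin=50):
    """Sketch of the priority-aware variant; 'priority' and 'ready_tasks' are
    illustrative keys and priority_margin is a made-up tuning knob."""
    # If a medium job outranks the heavy job by a lot, leave the big nodes to it.
    if any(m["priority"] >= heavy_job["priority"] + priority_margin for m in medium_jobs):
        return 0
    # One 256-capacity task fills one 256-core node, so reserve one node per ready heavy task.
    return heavy_job["ready_tasks"]

# Example: a heavy job with 3 ready tasks and no much-higher-priority medium job -> limit 3 nodes
print(big_nodes_to_limit({"priority": 100, "ready_tasks": 3},
                         [{"priority": 90, "ready_tasks": 40}]))   # -> 3
```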

cheers
Oli

@ultra-sonic
Author

Hi Timur,
Sorry to bother you again... could you think of a way to implement this?
cheers
Oli

@timurhai
Member

Hi Oliver!
Sorry, I did not write an answer yet.
But I am smoking this!

@ultra-sonic
Author

Hi Timur,
by "smoking this" you mean you are thinking of a solution or is this impossible to implement?
We already have a name for it: "The capacity dilemma" 😉

@timurhai
Member

I am thinking about the solution.
