
Add ATC/worker flags to limit max build containers for workers #2928

Closed
mhuangpivotal opened this issue Dec 10, 2018 · 27 comments
@mhuangpivotal
Contributor

Note: we want to refactor runtime code in #2926 before implementing this.

With the least-build-containers container placement strategy implemented in #2577, we also want a way to limit the max number of build containers on the workers.

The proposal in #2577 suggests adding --max-tasks-per-worker to the ATC:

If max-tasks-per-worker is set:

  • If the current worker has fewer than max-tasks-per-worker tasks running, dispatch the new task to it (and directly or indirectly update the priority queue).
  • If, on the other hand, the current worker has max-tasks-per-worker or more tasks running, then, since the list is a priority queue, all other workers are in the same situation and there is no need to keep traversing the list. Do not dispatch the runnable task and wait for the next event to wake up the scheduler.

If max-tasks-per-worker is not set:

  • Take the current worker from the priority queue and dispatch the runnable task to it. This is still better than random placement because it dispatches to the matching worker with the fewest running tasks.

This option requires least-build-containers to be set; otherwise it will error.

Additionally,

Add an option max-tasks-for-worker to concourse worker and modify the scheduling algorithm on the ATC to account for differences between workers, while max-tasks-per-worker on the ATC would still be a maximum cap.

We may want to change the flag names to say build-containers instead of tasks, since the least-build-containers strategy includes get, task and put containers.
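
For illustration, here is a minimal sketch of the placement rule described above. The `Worker` interface, `ActiveTasks()` method, and `chooseWorker` function are hypothetical names invented for this sketch, not actual ATC code.

```go
// Illustrative sketch only: Worker, ActiveTasks(), and chooseWorker are
// hypothetical names, not part of the ATC codebase.
package placement

import "sort"

type Worker interface {
    Name() string
    ActiveTasks() int // number of build containers currently running
}

// chooseWorker picks the worker with the fewest running tasks, honouring an
// optional cap. A nil result means "do not dispatch; wait for the next
// scheduling event", since the least-loaded worker is already at the limit.
func chooseWorker(workers []Worker, maxTasksPerWorker int) Worker {
    if len(workers) == 0 {
        return nil
    }

    // Stand-in for the priority queue: least-loaded worker first.
    sort.Slice(workers, func(i, j int) bool {
        return workers[i].ActiveTasks() < workers[j].ActiveTasks()
    })

    candidate := workers[0]
    if maxTasksPerWorker > 0 && candidate.ActiveTasks() >= maxTasksPerWorker {
        return nil
    }
    return candidate
}
```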

@cirocosta
Member

Hey @mhuangpivotal ,

Do you think this is something that could be advertised by the worker at registration time?

E.g., you could have one worker "here" that keeps the default max-containers for Garden (250), but another "there" that sets it to 10, in which case, by having a per-worker setting, the ATC would schedule containers based on those values.
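
As a purely hypothetical sketch of what that could look like (this struct and its field names are invented for illustration, not the real registration payload):

```go
// Hypothetical sketch: this struct and the MaxBuildContainers field are
// illustrative, not the actual worker registration payload.
package worker

type WorkerRegistration struct {
    Name     string   `json:"name"`
    Platform string   `json:"platform"`
    Tags     []string `json:"tags"`

    // Advertised by the worker itself: e.g. 250 for a default Garden
    // install "here", 10 for a smaller worker "there". The ATC would
    // schedule containers against this value.
    MaxBuildContainers int `json:"max_build_containers"`
}
```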

Wdyt?

thx!

@jchesterpivotal
Contributor

Should I close PR #2707 in favour of this? I prefer this approach overall, especially since I don't have to write it.

I would still set a default value below the default Garden limit (250), since there will be a lag between hitting a capacity limit and detecting it.

@vito vito added the triage label Jan 9, 2019
@ddadlani
Contributor

@marco-m what exactly is the use case for having a varying maximum for build containers on each worker? We don't currently set container maximums, those are enforced directly by Garden. Would the fix in #3251 along with the fewest-build-containers strategy not satisfy your use case?

@marco-m
Contributor

marco-m commented Feb 12, 2019

@ddadlani here is my understanding:

The fix in #3251, along with fewest-build-containers, would allow the task containers to be spread evenly across the available workers. This would make fewest-build-containers usable in prod for us and would be a great improvement :-)

On the other hand, I think that this ticket, which stems from #2577, is about controlling the load. In more detail: it would allow controlling the max number of task containers on a given worker. For example: as an operator, if I know that more than, say, 2 task containers kill my workers, then I can set max-tasks-per-worker to 2. If the total number of runnable tasks is more than number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and wait for the next scheduler run / next event. This would provide a rough queue.

If it is possible to obtain the equivalent behavior with Garden (after the fix for #3251), even better!
But to be acceptable to the end users of a CI system (the developers), the following should happen:
if I set a limit on Garden and the limit is hit, Concourse should skip and retry; it should not mark a given build as failed due to the Garden limit. If the build is marked failed, developers will get even more frustrated.

Does it make sense?

@jchesterpivotal
Contributor

jchesterpivotal commented Feb 12, 2019

If the total number of runnable tasks is more than number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and wait for the next scheduler run / next event. This would provide a rough queue.

I submitted a PR for the same idea, expressed with the inverse: a global max-in-flight. It was deliberately simple to enable quick adoption.

But to be acceptable to the end users of a CI system (the developers), the following should happen:
if I set a limit on Garden and the limit is hit, Concourse should skip and retry; it should not mark a given build as failed due to the Garden limit. If the build is marked failed, developers will get even more frustrated.

I strongly agree. Because Concourse does not maintain a safe work buffer, it becomes necessary as a safety measure to retain a capacity buffer. The CF buildpacks team, for example, retains enough workers to handle the possibility of around 40 pipelines operating simultaneously. But this is not the common case, so average utilisation is very low.

Similarly, this behaves badly in disaster-recovery scenarios. It's not uncommon for many pipelines to fire up simultaneously when a Concourse is restored or rebuilt from scratch. This is doubly problematic because the ATC will begin loading workers as soon as they begin to report, leading to a flapping restart when workers are added progressively (as BOSH does). In DR scenarios I have found that it becomes necessary to manually disable all pipelines and then unpause them one by one, waiting for each added load to stabilise before proceeding.

I shouldn't have to do this.

@cirocosta
Member

Hey @marco-m,

It sounds to me a lot like #2577, with the idea of having an extra constraint in scheduling, in the sense that a task would "reserve" a container from the number of "available containers" in the worker's "pool" (like #2577 (comment), but with containers instead of CPU / RAM).

With a task "reserving" 1 container, when scheduling the task, the scheduler could take that into account and if it finds a worker that can accomodate the container:1 reservation, then it could subtract that reservation from available_containers of that worker (thus, reserving the resource) until it's running there; otherwise, leave it as pending.

By keeping track of how much work is being reserved and how much capacity we have, one could also have a better metric for autoscaling, without needing to keep a large buffer of resources, as @jchesterpivotal described.

As mentioned in the comment in #2577, this could scale to other resources too, not only number of containers, but cpu & mem too.

Does that make sense?

@marco-m
Contributor

marco-m commented Feb 12, 2019

hello @cirocosta, yes, I agree 100%.

This task is the same as #2577, as you mention. My understanding is that this task, #2928, was created by @mhuangpivotal to track a specific activity, while #2577 is considered more of a "discussion" ticket.

As you mention:

this could scale to other resources too, not only number of containers, but cpu & mem too.

Exactly!

@topherbullock
Member

Do you think this is something that could be advertised by the worker at registration time?
@cirocosta
I think this really hints at a more general solution to the problem of overloading workers, as it means the worker will tell Concourse what it supports (number of containers, but eventually other resources?), and Concourse just ensures it doesn't over-exert the worker.

Maybe it makes sense for the ContainerPlacementStrategy's Choose() to express that there isn't a good choice available

```go
type ContainerPlacementStrategy interface {
    // TODO: Don't pass around container metadata since it's not guaranteed to be deterministic.
    // Change this after check containers stop being reused
    Choose(lager.Logger, []Worker, ContainerSpec, db.ContainerMetadata) (Worker, error)
}
```

And thread that all the way up to the caller of FindWorkerForContainer on the pool (#3373 is going to be merged soon)

```go
FindWorkerForContainer(
    logger lager.Logger,
    teamID int,
    handle string,
) (Worker, bool, error)
```
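
For example, a sentinel error could be threaded up from Choose so that "no acceptable worker right now" is not treated as a build-failing error. A sketch with simplified, hypothetical types (not actual Concourse code):

```go
// Hypothetical sketch: ErrNoSuitableWorker and these simplified types are
// illustrative only, not actual Concourse code.
package placement

import "errors"

var ErrNoSuitableWorker = errors.New("no worker can accept another build container")

type Worker interface{ Name() string }

type Strategy interface {
    Choose(workers []Worker) (Worker, error)
}

// findWorkerForContainer shows how a "no good choice" result from Choose
// could be surfaced as (nil, false, nil), letting the caller retry on the
// next scheduling tick instead of failing the build.
func findWorkerForContainer(strategy Strategy, workers []Worker) (Worker, bool, error) {
    chosen, err := strategy.Choose(workers)
    if errors.Is(err, ErrNoSuitableWorker) {
        return nil, false, nil
    }
    if err != nil {
        return nil, false, err
    }
    return chosen, true, nil
}
```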

@ddadlani @vito @jchesterpivotal @marco-m
I'm considering changing the sentiment of this issue to reflect the more general idea of the runtime factoring in workers' resource consumption when deciding whether to create containers for steps, and where to create them. There's a lot of good conversation here, but I'm not sure we want to implement this in the manner originally suggested.

What do y'all think?

@jchesterpivotal
Contributor

Basically I'm still worried about continuing to drift towards a full-blown orchestrator; I want to do the least possible in that direction while allowing a way to limit work-in-progress.

@topherbullock
Member

topherbullock commented Mar 13, 2019

Basically I'm still worried about continuing to drift towards a full-blown orchestrator

I have the same worry, but am constantly balancing it against a competing, related worry: not pushing the choices we need to make based on the current Runtime implementation down into that implementation, and abstracting it as far away from Concourse's Core objects for pipelines as possible.

@marco-m
Contributor

marco-m commented Mar 19, 2019

@topherbullock I am fine with either approach. What I can say is this: as a Concourse operator, I have seen many times workers being overloaded to the point of being unreachable / unresponsive.
From that point of view, anything with the following characteristics is super welcome by me:

  1. incremental improvement (better something partial soon than the whole solution later, exactly like fewest-build-containers, which has been a GREAT improvement for us ❤️ ).
  2. give the operator a way to limit the number of tasks landing on a worker (a queue, a limit, anything), together with some type of feedback in the UI for the developer watching their build and wondering why it is not advancing. Something similar to the existing "checking max in flight", but if possible with more detail.

@marco-m
Contributor

marco-m commented Apr 5, 2019

Hello colleagues (@topherbullock @ddadlani @mhuangpivotal @cirocosta @jchesterpivotal), since #2926 (refactor runtime code) has been marked closed as done, it looks like this ticket is ready to be worked on :-)

As an additional data point, we have had Concourse 5 in production for more than a month, with fewest-build-containers. Since there is no way to limit the number of task containers landing on a worker, we are still witnessing, daily, multiple tasks (up to 4) landing on the same worker, causing the build duration to go from 8m to 50m, a 10x increase. This happens even though we are autoscaling aggressively on average CPU utilization. We have a fleet of more than 40 workers (Linux, Windows, Mac).

Really, the next missing step to make Concourse withstand heavy load (C++ build and test) is at least a super basic "queue", which is what this ticket is about. This would make our lives shiny and bright :-)

@jwntrs
Contributor

jwntrs commented Apr 5, 2019

@marco-m it sounds like you need a task anti-affinity more than a max-containers per worker?

@ddadlani is that easier or harder?

@ddadlani
Contributor

ddadlani commented Apr 5, 2019

@marco-m just to echo my comment from Discord, changes to the fewest-build-containers strategy are a little difficult right now because of some recent changes that were made to the way workers are chosen. The ongoing refactoring toward #3079 will make this easier in the future, which is why this issue and #3301 are on hold for now.

Can we get a bit more info about the jobs causing these problems? For example, are they all triggered at the same time? In that case they may also be susceptible to the timing issue in #3301, which currently will also affect any queuing logic for this strategy.

If there are certain jobs that should not run on the same worker, adding job-to-job anti-affinity, as @pivotal-jwinters suggests, might help by ensuring that "repelling" builds are not scheduled on the same worker. We could add a field in the job config to identify other jobs which should be "repelled".

Task anti-affinity (or rather, build step anti-affinity) is harder because we do not track any step-identifying information for build containers. The containers are tied to a specific build.

On the other hand, if the offending job builds are not triggered at the same time, would it be possible in your environment to tag them to specific workers to ensure that they are not placed on the same worker?

Thoughts? @topherbullock @vito @cirocosta

@marco-m
Contributor

marco-m commented Apr 5, 2019

Hello @ddadlani :-), just knowing that this ticket is blocked by #3079 helps me understand the situation better, thanks! Regarding #3301, I am well aware of it; it was my team who opened it after our workers collapsed ;-)

The jobs that cause problems are triggered close to one another, which is why our autoscaling cannot react fast enough (we autoscale Linux and Windows VMs). On the other hand, they are not related to #3301: since we discovered that bug, we have removed all the aggregate steps from our pipelines.

Our builds generate so much load that answering the question "If there are certain jobs that should not run on the same worker" is actually simple: all of them. Every build job (containing a single task) runs via ninja (similar to make), which simply eats EVERY core it can find on the host. Using the few Concourse tunables to limit the container's resource usage is ineffective, because we also run on Windows, and if we statically limit the number of cores, then a task landing on an idle worker runs inefficiently. To summarize: our builds perform optimally ONLY when there is a SINGLE task on a worker. If just two tasks land on the same worker, the build suffers. With 4 tasks, the build takes 10x longer, so the degradation is not even linear.

This is why having a job queue is vital for us. If Concourse (like any distributed system, really) had a queue, it could enforce backpressure, and we could use the queue length as a parameter for autoscaling. I think that the majority of this context is also present in #2577, which gave birth to fewest-build-containers.

While waiting for a real queue, having a settable limit on the number of tasks on a worker (this ticket) would keep our workers from collapsing, or at least that is what I think.

So I don't think that the "repelling" approach would work (although I can appreciate it).

Regarding the last suggestion, tagging the jobs to specific workers: it is impractical. It would mean no autoscaling and always having the full number of workers available "just in case" a build is triggered...

Thanks again for everything the Concourse team is doing!

@jchesterpivotal
Contributor

@ddadlani just piling on to @marco-m's comment a bit.

What we're looking at here is how to limit work-in-progress vs. how to distribute work-in-progress. It looks similar on its face but, for queueing-theory reasons that are approximately magical IMHO, the outcomes are very different.

Suppose I run a bank with 5 teller windows. If 20 customers arrive and all go to the same teller window at the same time, that's bad. Each will get very slow service. The other 4 tellers are idle, which is a dead loss to me.

Then I introduce some manner of load-balancing like fewest-build-containers, possibly with some additional anti-affinity magic to ensure that depositors and withdrawers don't all wind up at the same teller window. Now each teller has 5 customers to serve.

Result? Still bad. Each teller is dealing with 5 customers at once. Service is faster than the 20-customer : 1-teller case, and there are no idle tellers, but the experience overall still sucks.

What's needed here is a queue at the front of the bank. Each teller deals with one customer at a time, who gets much speedier service. The rest of the customers wait in the queue for a teller to become free.

What's happened is that we capped the work-in-progress limit at 5 customers and added a queue ahead of the limited pool. When total demand goes above 5 customers, a queue forms. If that queue gets very long I have a signal to add another teller. If the queue vanishes and I have idle tellers, that's a signal to remove tellers.
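
As a toy illustration of the WIP-cap-plus-queue idea (not Concourse code, just the bank analogy rendered in Go with a semaphore):

```go
// Toy model of the bank: a semaphore caps work-in-progress at 5 "tellers";
// everyone else waits in the queue. Not Concourse code.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const tellers = 5
    const customers = 20

    sem := make(chan struct{}, tellers) // at most 5 customers served at once
    var wg sync.WaitGroup

    for c := 1; c <= customers; c++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            sem <- struct{}{}        // wait in the queue for a free teller
            defer func() { <-sem }() // free the teller when done
            fmt.Printf("serving customer %d\n", id)
            time.Sleep(100 * time.Millisecond)
        }(c)
    }
    wg.Wait()
}
```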

Does that help make the distinction?

@ddadlani
Contributor

ddadlani commented Apr 5, 2019

@marco-m Thanks for the clarification. Yeah, the "repelling" solution doesn't really change anything if all tasks are CPU intensive. Just a minor detail, but #3301 could still affect you if multiple jobs are triggered at almost exactly the same time (e.g. using the time resource), though it's less likely.

What we're looking at here is how to limit work-in-progress vs. how to distribute work-in-progress.

@jchesterpivotal this makes sense. We were suggesting anti-affinity as a stop-gap solution, with queueing being the eventual goal to mitigate the underlying distribution problem. Of course, given @marco-m's workload, anti-affinity doesn't help.

Ideally we would like to move to some combination of queueing containers and scheduling containers based on actual worker load (CPU load, for example), but this is a large chunk of work. I opened #3695 to discuss problems/solutions with regard to scheduling work. @marco-m please do let us know if you still feel strongly about max-build-containers as well.

@marco-m
Contributor

marco-m commented Jun 7, 2019

Hello @ddadlani, in our team we were wondering the following, related to the #3695 epic.

We see that the Concourse team is active in many directions around that topic, and we are very grateful.

On the other hand, it is unclear to us when Concourse will be able to benefit from all that work, since it is clearly non-trivial :-)

So the question: would you be willing to consider my team providing a stop-gap implementation for this specific issue (#2928)? This stop-gap could be temporary, and could be thrown away as soon as a proper fix coming from #3695 is implemented.

@ddadlani
Contributor

ddadlani commented Jun 7, 2019

Hi @marco-m, yes I'd be okay with --max-build-containers-per-worker as a temporary solution since #3695 is clearly going to take a while. And it'd be great if you could PR that in too ;)

The only gotcha I foresee is that it would be affected by the same bug as #3301, but we are working on a fix for that too so it may work out that both are done around the same time.

@vito any concerns?

@vito
Member

vito commented Jun 7, 2019

What will happen when a worker reaches the max? In the absence of queueing, will we just act as if the worker isn't in the pool and fail with 'no workers'? Or sit in a retry loop until one has capacity?

Either way starting on a naive implementation that's opt-in sounds good. 👍 We can mull over the details as we go.

(Sorry if that was answered above, just giving a quick go-ahead.)

@marco-m
Contributor

marco-m commented Jun 8, 2019

Good question. To me it is very important not to add user-visible failures, so it should do something similar to "sit in a retry loop until one has capacity". This notion of retrying was also in my original proposal for #2577.
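
A sketch of that retry behaviour, with made-up names and polling interval (the eventual implementation may look quite different):

```go
// Sketch only: errAllWorkersBusy, waitForWorker, and the 5s poll interval are
// invented for illustration, not the eventual implementation.
package placement

import (
    "context"
    "errors"
    "time"
)

var errAllWorkersBusy = errors.New("all workers are at their task limit")

// waitForWorker keeps retrying until choose finds a worker with spare
// capacity, instead of failing the build when every worker is at its limit.
func waitForWorker(ctx context.Context, choose func() (string, error)) (string, error) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        worker, err := choose()
        if err == nil {
            return worker, nil
        }
        if !errors.Is(err, errAllWorkersBusy) {
            return "", err // a real error: surface it
        }
        select {
        case <-ctx.Done():
            return "", ctx.Err()
        case <-ticker.C:
            // every worker was busy; try again on the next tick
        }
    }
}
```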

@ddadlani
Contributor

@marco-m is this in flight right now? Do you have any updates for us?

@marco-m
Contributor

marco-m commented Jun 17, 2019

@ddadlani This is scheduled for us. Most likely we will pick it up this week.

@aledegano
Contributor

@ddadlani I've picked up this task from the team's board. (I'm on the same team as @marco-m)

@ddadlani ddadlani moved this from Backlog to In Flight in Runtime Jun 24, 2019
aledegano pushed a commit to Pix4D/concourse that referenced this issue Jun 26, 2019
This is a POC to attempt fixing concourse#2928.
Add "running_task" to "containers" in DB.
Use running_task to determine if a worker is busy.
@ddadlani
Contributor

ddadlani commented Jul 15, 2019

The direction this is now taking is that there will be a new experimental placement strategy called limit-active-tasks, which is being implemented by @aledeganopix4d in #4118. The discussion leading up to this point is in #4069.

@ddadlani ddadlani moved this from In Flight to Done in Runtime Jul 23, 2019
ddadlani pushed a commit that referenced this issue Jul 24, 2019
the add_active_tasks_to_workers migration had a random number as the
version, which would cause future migrations to get skipped until their
versions passed this number.

#2928

Signed-off-by: Divya Dadlani <ddadlani@pivotal.io>
Co-authored-by: Rui Yang <ryang@pivotal.io>
@ddadlani ddadlani moved this from Done to Accepted in Runtime Jul 29, 2019
Runtime automation moved this from Accepted to Done Jul 30, 2019
@ddadlani ddadlani moved this from Done to Accepted in Runtime Jul 30, 2019
@hdiass

hdiass commented Aug 20, 2019

How can this feature be used? (v5.4.1)
For me this is one of the biggest flaws of Concourse.
In the morning (for example), all the developers trigger a build and all the tasks are sent to the workers. Because the workers are flooded with tasks, we get a lot of timeout errors (from docker-image-resource, for example): it can't spin up a new daemon, or it can't download a Docker image or a resource, and then you need to re-schedule the build... The way that Concourse handles workloads is still a major problem (and dangerous), and what this ticket tries to implement could perhaps reduce this pain point. Cheers

@aledegano
Contributor

This feature should be shipped with v5.5.
The docs already explain how it will be used, here: https://concourse-ci.org/container-placement.html#limit-active-tasks-strategy

@hdiass

hdiass commented Aug 20, 2019

Thanks 👍
