
Add ATC/worker flags to limit max build containers for workers #2928

Closed
mhuangpivotal opened this issue Dec 10, 2018 · 27 comments
@mhuangpivotal
Contributor

Note: we want to refactor runtime code in #2926 before implementing this.

With the least-build-containers container placement strategy implemented in #2577, we also want a way to limit the max number of build containers on the workers.

The proposal in #2577 suggests adding --max-tasks-per-worker to the ATC:

If max-tasks-per-worker is set:

  • If the current worker has fewer than max-tasks-per-worker tasks running, dispatch the new task to it (and directly or indirectly update the priority queue).
  • If, on the other hand, the current worker has max-tasks-per-worker or more tasks running, then, since the list is a priority queue, all other workers are in the same situation and there is no need to keep traversing the list. Do not dispatch the runnable task and wait for the next event to wake up the scheduler.

If max-tasks-per-worker is not set:

  • Take the current worker from the priority queue and dispatch the runnable task to it. This is still better than random placement because it dispatches to the matching worker with the fewest running tasks.

This option requires least-build-containers to be set; otherwise it will error.

Additionally,

Add an option max-tasks-for-worker to concourse worker and modify the scheduling algorithm on the ATC to account for differences between workers, while max-tasks-per-worker on the ATC would still be a maximum cap.

We may want to change the flag names to say build-containers instead of tasks, since the least-build-containers strategy includes get, task and put containers.
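
For illustration, here is a minimal sketch of the placement rule described above. The `Worker` interface, `ActiveTasks()` method, and `chooseWorker` function are hypothetical names invented for this sketch, not actual ATC code.

```go
// Illustrative sketch only: Worker, ActiveTasks(), and chooseWorker are
// hypothetical names, not part of the ATC codebase.
package placement

import "sort"

type Worker interface {
    Name() string
    ActiveTasks() int // number of build containers currently running
}

// chooseWorker picks the worker with the fewest running tasks, honouring an
// optional cap. A nil result means "do not dispatch; wait for the next
// scheduling event", since the least-loaded worker is already at the limit.
func chooseWorker(workers []Worker, maxTasksPerWorker int) Worker {
    if len(workers) == 0 {
        return nil
    }

    // Stand-in for the priority queue: least-loaded worker first.
    sort.Slice(workers, func(i, j int) bool {
        return workers[i].ActiveTasks() < workers[j].ActiveTasks()
    })

    candidate := workers[0]
    if maxTasksPerWorker > 0 && candidate.ActiveTasks() >= maxTasksPerWorker {
        return nil
    }
    return candidate
}
```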

@cirocosta
Member

Hey @mhuangpivotal ,

Do you think this is something that could be advertised by the worker at registration time?

E.g., you could have one worker "here" that keeps the default max-containers for Garden (250), but another "there" that sets it to 10, in which case, by having a per-worker setting, the ATC would schedule containers based on those values.
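
As a purely hypothetical sketch of what that could look like (this struct and its field names are invented for illustration, not the real registration payload):

```go
// Hypothetical sketch: this struct and the MaxBuildContainers field are
// illustrative, not the actual worker registration payload.
package worker

type WorkerRegistration struct {
    Name     string   `json:"name"`
    Platform string   `json:"platform"`
    Tags     []string `json:"tags"`

    // Advertised by the worker itself: e.g. 250 for a default Garden
    // install "here", 10 for a smaller worker "there". The ATC would
    // schedule containers against this value.
    MaxBuildContainers int `json:"max_build_containers"`
}
```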

Wdyt?

thx!

@jchesterpivotal
Contributor

Should I close PR #2707 in favour of this? I prefer this approach overall, especially since I don't have to write it.

I would still set a default value below the default Garden limit (250), since there will be a lag between hitting a capacity limit and detecting it.

@vito vito added the triage label Jan 9, 2019
@ddadlani
Contributor

@marco-m what exactly is the use case for having a varying maximum for build containers on each worker? We don't currently set container maximums, those are enforced directly by Garden. Would the fix in #3251 along with the fewest-build-containers strategy not satisfy your use case?

@marco-m
Contributor

marco-m commented Feb 12, 2019

@ddadlani here is my understanding:

The fix in #3251, along with fewest-build-containers, would allow the task containers to be spread evenly across the available workers. This would make fewest-build-containers usable in prod for us and would be a great improvement :-)

On the other hand, I think that this ticket, which stems from #2577, is about controlling the load. In more detail: it would allow controlling the max number of task containers on a given worker. For example: as an operator, if I know that more than, say, 2 task containers kill my workers, then I can set max-tasks-per-worker to 2. If the total number of runnable tasks is more than number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and wait for the next scheduler run / next event. This would provide a rough queue.

If it is possible to obtain the equivalent behavior with Garden (after the fix for #3251), even better!
But to be acceptable to the end users of a CI system (the developers), the following should happen:
if I set a limit on Garden and the limit is hit, Concourse should skip and retry; it should not mark a given build as failed due to the Garden limit. If the build is marked failed, developers will get even more frustrated.

Does it make sense?

@jchesterpivotal
Contributor

jchesterpivotal commented Feb 12, 2019

If the total number of runnable tasks is more than number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and wait for the next scheduler run / next event. This would provide a rough queue.

I submitted a PR for the same idea, expressed with the inverse: a global max-in-flight. It was deliberately simple to enable quick adoption.

But to be acceptable to the end users of a CI system (the developers), the following should happen:
if I set a limit on Garden and the limit is hit, Concourse should skip and retry; it should not mark a given build as failed due to the Garden limit. If the build is marked failed, developers will get even more frustrated.

I strongly agree. Because Concourse does not maintain a safe work buffer, it becomes necessary as a safety measure to retain a capacity buffer. The CF buildpacks team, for example, retains enough workers to handle the possibility of around 40 pipelines operating simultaneously. But this is not the common case, so average utilisation is very low.

Similarly, this behaves badly in disaster-recovery scenarios. It's not uncommon for many pipelines to fire up simultaneously when a Concourse is restored or rebuilt from scratch. This is doubly problematic because the ATC will begin loading workers as soon as they begin to report, leading to a flapping restart when workers are added progressively (as BOSH does). In DR scenarios I have found that it becomes necessary to manually disable all pipelines and then unpause them one by one, waiting for each added load to stabilise before proceeding.

I shouldn't have to do this.

@cirocosta
Member

Hey @marco-m,

It sounds to me a lot like #2577, with the idea of having an extra constraint in scheduling, in the sense that a task would "reserve" a container from the number of "available containers" in the worker's "pool" (like #2577 (comment), but with containers instead of CPU / RAM).

With a task "reserving" 1 container, when scheduling the task, the scheduler could take that into account and if it finds a worker that can accomodate the container:1 reservation, then it could subtract that reservation from available_containers of that worker (thus, reserving the resource) until it's running there; otherwise, leave it as pending.

By keeping track of how much work is being reserved and how much capacity we have, one could also have a better metric for autoscaling, without needing to keep a large buffer of resources, as @jchesterpivotal described.

As mentioned in the comment in #2577, this could scale to other resources too, not only number of containers, but cpu & mem too.

Does that make sense?

@marco-m
Contributor

marco-m commented Feb 12, 2019

hello @cirocosta, yes, I agree 100%.

This task is the same as #2577, as you mention. My understanding is that this task, #2928, was created by @mhuangpivotal to track a specific activity, while #2577 is considered more of a "discussion" ticket.

As you mention:

this could scale to other resources too, not only number of containers, but cpu & mem too.

Exactly!

@topherbullock
Member

Do you think this is something that could be advertised by the worker at registration time?
@cirocosta
I think this really hints at a more general solution to the problem of overloading workers, as it means the worker will tell Concourse what it supports (number of containers, but eventually other resources?), and Concourse just ensures it doesn't over-exert the worker.

Maybe it makes sense for the ContainerPlacementStrategy's Choose() to express that there isn't a good choice available

```go
type ContainerPlacementStrategy interface {
    // TODO: Don't pass around container metadata since it's not guaranteed to be deterministic.
    // Change this after check containers stop being reused
    Choose(lager.Logger, []Worker, ContainerSpec, db.ContainerMetadata) (Worker, error)
}
```

And thread that all the way up to the caller of FindWorkerForContainer on the pool (#3373 is going to be merged soon)

```go
FindWorkerForContainer(
    logger lager.Logger,
    teamID int,
    handle string,
) (Worker, bool, error)
```
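
For example, a sentinel error could be threaded up from Choose so that "no acceptable worker right now" is not treated as a build-failing error. A sketch with simplified, hypothetical types (not actual Concourse code):

```go
// Hypothetical sketch: ErrNoSuitableWorker and these simplified types are
// illustrative only, not actual Concourse code.
package placement

import "errors"

var ErrNoSuitableWorker = errors.New("no worker can accept another build container")

type Worker interface{ Name() string }

type Strategy interface {
    Choose(workers []Worker) (Worker, error)
}

// findWorkerForContainer shows how a "no good choice" result from Choose
// could be surfaced as (nil, false, nil), letting the caller retry on the
// next scheduling tick instead of failing the build.
func findWorkerForContainer(strategy Strategy, workers []Worker) (Worker, bool, error) {
    chosen, err := strategy.Choose(workers)
    if errors.Is(err, ErrNoSuitableWorker) {
        return nil, false, nil
    }
    if err != nil {
        return nil, false, err
    }
    return chosen, true, nil
}
```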

@ddadlani @vito @jchesterpivotal @marco-m
I'm considering changing the sentiment of this issue to reflect the more general idea of the runtime factoring in workers' resource consumption when deciding whether to create containers for steps, and where to create them. There's a lot of good conversation here, but I'm not sure we want to implement this in the manner originally suggested.

What do y'all think?

@jchesterpivotal
Contributor

Basically I'm still worried about continuing to drift towards a full-blown orchestrator; I want to do the least possible in that direction while allowing a way to limit work-in-progress.

@topherbullock
Member

topherbullock commented Mar 13, 2019

Basically I'm still worried about continuing to drift towards a full-blown orchestrator

I have the same worry, but am constantly balancing it against a competing, related worry: not pushing the choices we need to make based on the current Runtime implementation down into that implementation, and abstracting it as far away from Concourse's Core objects for pipelines as possible.

@marco-m
Contributor

marco-m commented Mar 19, 2019

@topherbullock I am fine with either approach. What I can say is this: as a Concourse operator, I have seen many times workers being overloaded to the point of being unreachable / unresponsive.
From that point of view, anything with the following characteristics is super welcome by me:

  1. incremental improvement (better something partial soon than the whole solution later, exactly like fewest-build-containers, which has been a GREAT improvement for us ❤️ ).
  2. give the operator a way to limit the number of tasks landing on a worker (a queue, a limit, anything), together with some type of feedback in the UI for the developer watching their build and wondering why it is not advancing. Something similar to the existing "checking max in flight", but if possible with more detail.

@marco-m
Contributor

marco-m commented Apr 5, 2019

Hello colleagues (@topherbullock @ddadlani @mhuangpivotal @cirocosta @jchesterpivotal), since #2926 (refactor runtime code) has been marked closed as done, it looks like this ticket is ready to be worked on :-)

As an additional data point, we have had Concourse 5 in production for more than a month, with fewest-build-containers. Since there is no way to limit the number of task containers landing on a worker, we are still witnessing, daily, multiple tasks (up to 4) landing on the same worker, causing the build duration to go from 8m to 50m, a 10x increase. This happens even though we are autoscaling aggressively on average CPU utilization. We have a fleet of more than 40 workers (Linux, Windows, Mac).

Really, the next missing step to make Concourse withstand heavy load (C++ build and test) is at least a super basic "queue", which is what this ticket is about. This would make our lives shiny and bright :-)

@jwntrs
Contributor

jwntrs commented Apr 5, 2019

@marco-m it sounds like you need a task anti-affinity more than a max-containers per worker?

@ddadlani is that easier or harder?

@ddadlani
Contributor

ddadlani commented Apr 5, 2019

@marco-m just to echo my comment from Discord, changes to the fewest-build-containers strategy are a little difficult right now because of some recent changes that were made to the way workers are chosen. The ongoing refactoring toward #3079 will make this easier in the future, which is why this issue and #3301 are on hold for now.

Can we get a bit more info about the jobs causing these problems? For example, are they all triggered at the same time? In that case they may also be susceptible to the timing issue in #3301, which currently will also affect any queuing logic for this strategy.

If there are certain jobs that should not run on the same worker, adding job-to-job anti-affinity, as @pivotal-jwinters suggests, might help by ensuring that "repelling" builds are not scheduled on the same worker. We could add a field in the job config to identify other jobs which should be "repelled".

Task anti-affinity (or rather, build step anti-affinity) is harder because we do not track any step-identifying information for build containers. The containers are tied to a specific build.

On the other hand, if the offending job builds are not triggered at the same time, would it be possible in your environment to tag them to specific workers to ensure that they are not placed on the same worker?

Thoughts? @topherbullock @vito @cirocosta

@marco-m
Contributor

marco-m commented Apr 5, 2019

Hello @ddadlani :-), just knowing that this ticket is blocked by #3079 helps me understand the situation better, thanks! Regarding #3301, I am well aware of it; it was my team who opened it after our workers collapsed ;-)

The jobs that cause problems are triggered close to one another, which is why our autoscaling cannot react fast enough (we autoscale Linux and Windows VMs). On the other hand, they are not related to #3301: since we discovered that bug, we have removed all the aggregate steps from our pipelines.

Our builds generate so much load that answering the question "If there are certain jobs that should not run on the same worker" is actually simple: all of them. Every build job (containing a single task) runs via ninja (similar to make), which simply eats EVERY core it can find on the host. Using the few Concourse tunables to limit the container's resource usage is ineffective, because we also run on Windows, and if we statically limit the number of cores, then a task landing on an idle worker runs inefficiently. To summarize: our builds perform optimally ONLY when there is a SINGLE task on a worker. If just two tasks land on the same worker, the build suffers. With 4 tasks, the build takes 10x longer, so the degradation is not even linear.

This is why having a job queue is vital for us. If Concourse (like any distributed system, really) had a queue, it could enforce backpressure, and we could use the queue length as a parameter for autoscaling. I think that the majority of this context is also present in #2577, which gave birth to fewest-build-containers.

While waiting for a real queue, having a settable limit on the number of tasks on a worker (this ticket) would keep our workers from collapsing, or at least that is what I think.

So I don't think that the "repelling" approach would work (although I can appreciate it).

Regarding the last suggestion, tagging the jobs to specific workers: it is impractical. It would mean no autoscaling and always having the full number of workers available "just in case" a build is triggered...

Thanks again for everything the Concourse team is doing!

@jchesterpivotal
Contributor

@ddadlani just piling on to @marco-m's comment a bit.

What we're looking at here is how to limit work-in-progress vs. how to distribute work-in-progress. It looks similar on its face but, for queueing-theory reasons that are approximately magical IMHO, the outcomes are very different.

Suppose I run a bank with 5 teller windows. If 20 customers arrive and all go to the same teller window at the same time, that's bad. Each will get very slow service. The other 4 tellers are idle, which is a dead loss to me.

Then I introduce some manner of load-balancing like fewest-build-containers, possibly with some additional anti-affinity magic to ensure that depositors and withdrawers don't all wind up at the same teller window. Now each teller has 5 customers to serve.

Result? Still bad. Each teller is dealing with 5 customers at once. Service is faster than the 20-customer : 1-teller case, and there are no idle tellers, but the experience overall still sucks.

What's needed here is a queue at the front of the bank. Each teller deals with one customer at a time, who gets much speedier service. The rest of the customers wait in the queue for a teller to become free.

What's happened is that we capped the work-in-progress limit at 5 customers and added a queue ahead of the limited pool. When total demand goes above 5 customers, a queue forms. If that queue gets very long I have a signal to add another teller. If the queue vanishes and I have idle tellers, that's a signal to remove tellers.
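
As a toy illustration of the WIP-cap-plus-queue idea (not Concourse code, just the bank analogy rendered in Go with a semaphore):

```go
// Toy model of the bank: a semaphore caps work-in-progress at 5 "tellers";
// everyone else waits in the queue. Not Concourse code.
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const tellers = 5
    const customers = 20

    sem := make(chan struct{}, tellers) // at most 5 customers served at once
    var wg sync.WaitGroup

    for c := 1; c <= customers; c++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            sem <- struct{}{}        // wait in the queue for a free teller
            defer func() { <-sem }() // free the teller when done
            fmt.Printf("serving customer %d\n", id)
            time.Sleep(100 * time.Millisecond)
        }(c)
    }
    wg.Wait()
}
```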

Does that help make the distinction?

@ddadlani
Contributor

ddadlani commented Apr 5, 2019

@marco-m Thanks for the clarification. Yeah, the "repelling" solution doesn't really change anything if all tasks are CPU intensive. Just a minor detail, but #3301 could still affect you if multiple jobs are triggered at almost exactly the same time (e.g. using the time resource), though it's less likely.

What we're looking at here is how to limit work-in-progress vs. how to distribute work-in-progress.

@jchesterpivotal this makes sense. We were suggesting anti-affinity as a stop-gap solution, with queueing being the eventual goal to mitigate the underlying distribution problem. Of course, given @marco-m's workload, anti-affinity doesn't help.

Ideally we would like to move to some combination of queueing containers and scheduling containers based on actual worker load (CPU load, for example), but this is a large chunk of work. I opened #3695 to discuss problems/solutions with regard to scheduling work. @marco-m please do let us know if you still feel strongly about max-build-containers as well.

@marco-m
Contributor

marco-m commented Jun 7, 2019

Hello @ddadlani, in our team we were wondering the following, related to the #3695 epic.

We see that the Concourse team is active in many directions around that topic, and we are very grateful.

On the other hand, it is unclear to us when Concourse will be able to benefit from all that work, since it is clearly non-trivial :-)

So the question: would you be willing to consider my team providing a stop-gap implementation for this specific issue (#2928)? This stop-gap could be temporary, and could be thrown away as soon as a proper fix coming from #3695 is implemented.

@ddadlani
Contributor

ddadlani commented Jun 7, 2019

Hi @marco-m, yes I'd be okay with --max-build-containers-per-worker as a temporary solution since #3695 is clearly going to take a while. And it'd be great if you could PR that in too ;)

The only gotcha I foresee is that it would be affected by the same bug as #3301, but we are working on a fix for that too so it may work out that both are done around the same time.

@vito any concerns?

@vito
Member

vito commented Jun 7, 2019

What will happen when a worker reaches the max? In the absence of queueing, will we just act as if the worker isn't in the pool and fail with 'no workers'? Or sit in a retry loop until one has capacity?

Either way starting on a naive implementation that's opt-in sounds good. 👍 We can mull over the details as we go.

(Sorry if that was answered above, just giving a quick go-ahead.)

@marco-m
Contributor

marco-m commented Jun 8, 2019

Good question. To me it is very important not to add user-visible failures, so it should do something similar to "sit in a retry loop until one has capacity". This notion of retrying was also in my original proposal for #2577.
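
A sketch of that retry behaviour, with made-up names and polling interval (the eventual implementation may look quite different):

```go
// Sketch only: errAllWorkersBusy, waitForWorker, and the 5s poll interval are
// invented for illustration, not the eventual implementation.
package placement

import (
    "context"
    "errors"
    "time"
)

var errAllWorkersBusy = errors.New("all workers are at their task limit")

// waitForWorker keeps retrying until choose finds a worker with spare
// capacity, instead of failing the build when every worker is at its limit.
func waitForWorker(ctx context.Context, choose func() (string, error)) (string, error) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        worker, err := choose()
        if err == nil {
            return worker, nil
        }
        if !errors.Is(err, errAllWorkersBusy) {
            return "", err // a real error: surface it
        }
        select {
        case <-ctx.Done():
            return "", ctx.Err()
        case <-ticker.C:
            // every worker was busy; try again on the next tick
        }
    }
}
```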

@ddadlani
Contributor

@marco-m is this in flight right now? Do you have any updates for us?

@marco-m
Contributor

marco-m commented Jun 17, 2019

@ddadlani This is scheduled for us. Most likely we will pick it up this week.

@aledegano
Contributor

@ddadlani I've picked up this task from the team's board. (I'm on the same team as @marco-m)

@ddadlani ddadlani moved this from Backlog to In Flight in Runtime Jun 24, 2019
aledegano pushed a commit to Pix4D/concourse that referenced this issue Jun 26, 2019
This is a POC to attempt fixing concourse#2928.
Add "running_task" to "containers" in DB.
Use running_task to determine if a worker is busy.
@ddadlani
Contributor

ddadlani commented Jul 15, 2019

The direction this is now taking is that there will be a new experimental placement strategy called limit-active-tasks, which is being implemented by @aledeganopix4d in #4118. The discussion leading up to this point is in #4069.

@ddadlani ddadlani moved this from In Flight to Done in Runtime Jul 23, 2019
ddadlani pushed a commit that referenced this issue Jul 24, 2019
the add_active_tasks_to_workers migration had a random number as the
version, which would cause future migrations to get skipped until their
versions passed this number.

#2928

Signed-off-by: Divya Dadlani <ddadlani@pivotal.io>
Co-authored-by: Rui Yang <ryang@pivotal.io>
@ddadlani ddadlani moved this from Done to Accepted in Runtime Jul 29, 2019
Runtime automation moved this from Accepted to Done Jul 30, 2019
@ddadlani ddadlani moved this from Done to Accepted in Runtime Jul 30, 2019
@hdiass

hdiass commented Aug 20, 2019

How can this feature be used? (v5.4.1)
For me this is one of the biggest flaws of Concourse.
In the morning (for example), all the developers trigger a build and all the tasks are sent to the workers. Because the workers are flooded with tasks, we get a lot of timeout errors (from docker-image-resource, for example): it can't spin up a new daemon, or it can't download a Docker image or a resource, and then you need to re-schedule the build... The way that Concourse handles workloads is still a major problem (and dangerous), and what this ticket tries to implement could perhaps reduce this pain point. Cheers

@aledegano
Contributor

This feature should be shipped with v5.5.
The docs already explain how it will be used, here: https://concourse-ci.org/container-placement.html#limit-active-tasks-strategy

@hdiass

hdiass commented Aug 20, 2019

Thanks 👍
