Add ATC/worker flags to limit max build containers for workers #2928
Comments
Hey @mhuangpivotal, do you think this is something that could be advertised by the worker at registration time? E.g., you could have a worker "here" that keeps Garden's default max-containers (250), but another "there" that sets it to 10; by having a per-worker setting, each worker's own limit could be respected. Wdyt? thx!
Should I close PR #2707 in favour of this? I prefer this approach overall, especially since I don't have to write it. I would still set a default value below the default Garden limit (250), since there will be a lag between hitting a capacity limit and detecting it.
@ddadlani here is my understanding: the fix in #3251 addresses one side of the problem. On the other hand, I think that this ticket, which stems from #2577, is about controlling the load. In more detail: it would allow controlling the max number of task containers on a given worker. For example: as an operator, if I know that more than, say, 2 task containers kill my workers, then I can set max-tasks-per-worker to 2. If the total number of runnable tasks is more than number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and will wait for the next scheduler run / next event. This would provide a rough queue. If it is possible to obtain the equivalent behavior with Garden (after the fix for #3251), even better! Does that make sense?
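To make the behaviour described above concrete, here is a tiny sketch of the check it implies (a sketch only, with made-up names; not Concourse code):

```go
// Toy illustration of the "rough queue" described above. All names here
// (worker, maxTasksPerWorker, pickWorker) are hypothetical stand-ins, not
// real Concourse types or flags.
package example

type worker struct {
	name        string
	activeTasks int
}

// pickWorker returns a worker with spare task capacity, or false when every
// worker is saturated; in that case the scheduler would simply dispatch
// nothing and wait for the next scheduling run / event.
func pickWorker(workers []worker, maxTasksPerWorker int) (*worker, bool) {
	for i := range workers {
		if workers[i].activeTasks < maxTasksPerWorker {
			return &workers[i], true
		}
	}
	return nil, false
}
```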
I submitted a PR for the same idea, expressed with the inverse: a global max-in-flight. It was deliberately simple to enable quick adoption.
I strongly agree. Because Concourse does not maintain a safe work buffer, it becomes necessary, as a safety measure, to retain a capacity buffer. The CF buildpacks team, for example, retains enough workers to handle the possibility of around 40 pipelines operating simultaneously. But this is not the common case, so average utilisation is very low. Similarly, this behaves badly in disaster-recovery scenarios. It's not uncommon for many pipelines to fire up simultaneously when a Concourse is restored or rebuilt from scratch. This is doubly problematic because the ATC will begin loading workers as soon as they begin to report, leading to a flapping restart when workers are added progressively (as BOSH does). In DR scenarios I have found it necessary to manually pause all pipelines and then unpause them one by one, waiting for each added load to stabilise before proceeding. I shouldn't have to do this.
Hey @marco-m, it sounds to me a lot like #2577: the idea of having an extra constraint in scheduling, in the sense that a task would "reserve" a container from the pool of "available containers" that the worker has (like #2577 (comment), but with containers instead of CPU / RAM). With each task "reserving" capacity this way, by keeping track of how much work is reserved and how much capacity we have, one could also get a better metric for autoscaling, without needing to keep a large buffer of resources as @jchesterpivotal expressed. As mentioned in the comment in #2577, this could scale to other resources too: not only the number of containers, but CPU & memory as well. Does that make sense?
Hello @cirocosta, yes, I agree 100%. This ticket is the same idea as #2577, as you mention. My understanding is that this ticket, #2928, was created by @mhuangpivotal to track a specific activity, while #2577 has been considered more of a "discussion" ticket.
Exactly!
Maybe it makes sense for the placement strategy to handle this (see concourse/atc/worker/placement.go, lines 11 to 15 at 74de717)…
…and thread that all the way up to the caller of the code at lines 19 to 23 (cd8c6cf).
@ddadlani @vito @jchesterpivotal @marco-m What do y'all think?
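For what it's worth, a rough, self-contained sketch of what a limit-aware `least-build-containers` strategy could look like if the limit were threaded through the placement strategy as suggested above (the types here are simplified stand-ins, not the real `placement.go` interface):

```go
package placement

import "errors"

// Worker is a simplified stand-in for the real worker interface; only the
// bits this sketch needs.
type Worker interface {
	Name() string
	BuildContainerCount() int
}

// ErrNoCapacity lets the caller decide whether to fail or to wait and retry.
var ErrNoCapacity = errors.New("all workers are at max build containers")

// LimitedLeastBuildContainers is a hypothetical strategy: pick the worker
// with the fewest build containers, but never exceed a per-worker cap.
type LimitedLeastBuildContainers struct {
	MaxBuildContainersPerWorker int // 0 = unlimited
}

func (s LimitedLeastBuildContainers) Choose(workers []Worker) (Worker, error) {
	var best Worker
	for _, w := range workers {
		if s.MaxBuildContainersPerWorker > 0 && w.BuildContainerCount() >= s.MaxBuildContainersPerWorker {
			continue // this worker is "full"; leave it out of the candidates
		}
		if best == nil || w.BuildContainerCount() < best.BuildContainerCount() {
			best = w
		}
	}
	if best == nil {
		return nil, ErrNoCapacity
	}
	return best, nil
}
```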
Basically I'm still worried about continuing to drift towards a full-blown orchestrator; I want to do the least possible in that direction while allowing a way to limit work-in-progress.
I have the same worry, but am constantly balancing it against a competing, related worry: failing to push the choices we need to make (based on the current Runtime implementation) down into that implementation, and to abstract them as far away from Concourse's Core objects for pipelines as possible.
@topherbullock I am fine with either approach. What I can say is this: as a Concourse operator, I have seen, many times, workers overloaded to the point of being unreachable / unresponsive.
Hello colleagues (@topherbullock @ddadlani @mhuangpivotal @cirocosta @jchesterpivotal), since #2926 (refactor runtime code) has been marked closed as done, it looks like this ticket is ready to be worked on :-) As an additional data point, we have had Concourse 5 in production for more than a month. Really, the next missing step to make Concourse withstand huge load (C++ build and test) is at least a super basic "queue" like this ticket describes. This would make our lives shiny and bright :-)
@marco-m just to echo my comment from Discord, changes to this area are currently blocked by #3079. Can we get a bit more info about the jobs causing these problems? For example, are they all triggered at the same time? In that case they may also be susceptible to the timing issue in #3301, which currently will also affect any queuing logic for this strategy. If there are certain jobs that should not run on the same worker, adding in job-to-job anti-affinity, as @pivotal-jwinters suggests, might help by ensuring that "repelling" builds are not scheduled on the same worker. We could add a field in the job config to identify other jobs which should be "repelled". Task anti-affinity (or rather, build step anti-affinity) is harder because we do not track any step-identifying information for build containers; the containers are tied to a specific build. On the other hand, if the offending job builds are not triggered at the same time, would it be possible in your environment to tag them to specific workers to ensure that they are not placed on the same worker? Thoughts? @topherbullock @vito @cirocosta
Hello @ddadlani :-), just knowing that this ticket is blocked by #3079 makes me understand the situation better, thanks! Regarding #3301, I am well aware of it; it was my team who opened it after our workers collapsed ;-) The jobs that cause problems are triggered close to one another, which is why our autoscaling cannot react fast enough (we autoscale Linux and Windows VMs). On the other hand, they are not related to #3301: since we discovered that bug, we removed the offending triggers. Our builds generate so much load that answering the question "are there certain jobs that should not run on the same worker" is actually simple: all of them. Any build job (containing a single task) is enough to saturate a worker on its own. This is why, for us, having a job queue is vital. If Concourse (as any distributed system, actually) had a queue, it could enforce backpressure, and we could use the queue length as a parameter for the autoscaling. I think that the majority of this context is also present in #2577, which gave birth to this ticket. While we wait for a real queue, already having a settable limit on the number of tasks on a worker (this ticket) would keep our workers from collapsing, at least that is what I think. So I don't think that the "repelling" approach would work (although I can appreciate it). Regarding the last suggestion, tagging the jobs to specific workers: it is impractical. It would mean no autoscaling and always having the full number of workers available "just in case" a build is triggered... Thanks again for everything the Concourse team is doing!
@ddadlani just piling on to @marco-m's comment a bit. What we're looking at here is how to limit work-in-progress, vs how to distribute work-in-progress. It looks similar on its face but, for queueing theory reasons that are approximately magical IMHO, the outcomes are very different. Suppose I run a bank with 5 teller windows. If 20 customers arrive and all go to the same teller window at the same time, that's bad. Each will get very slow service. The other 4 tellers are idle, which is a dead loss to me. Then I introduce some manner of load-balancing, so the customers are spread evenly across the tellers. Result? Still bad. Each teller is dealing with 4 customers at once. Service is faster than the 20-customer : 1-teller case, and there are no idle tellers, but the experience overall still sucks. What's needed here is a queue at the front of the bank. Each teller deals with one customer at a time, who gets much speedier service. The rest of the customers wait in the queue for a teller to become free. What's happened is that we capped the work-in-progress limit at 5 customers and added a queue ahead of the limited pool. When total demand goes above 5 customers, a queue forms. If that queue gets very long, I have a signal to add another teller. If the queue vanishes and I have idle tellers, that's a signal to remove tellers. Does that help make the distinction?
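To put the analogy in code, a toy Go program (not Concourse code) showing a fixed work-in-progress cap with a queue in front of it; a buffered channel plays the role of the five teller windows, and everyone else waits in line as blocked goroutines:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const tellers = 5 // work-in-progress cap: one customer per teller
	const customers = 20

	slots := make(chan struct{}, tellers) // counting semaphore = teller windows

	var wg sync.WaitGroup
	for c := 1; c <= customers; c++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			slots <- struct{}{}               // wait in the queue for a free teller
			defer func() { <-slots }()        // free the teller when done
			fmt.Printf("customer %d served\n", id)
			time.Sleep(10 * time.Millisecond) // simulated service time
		}(c)
	}
	wg.Wait() // wait for all customers to be served
}
```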
@marco-m Thanks for the clarification. Yeah, the "repelling" solution doesn't really change anything if all tasks are CPU intensive. Just a minor detail, but #3301 could still affect you if multiple jobs are triggered at almost exactly the same time (e.g. using the time resource), though it's less likely.
@jchesterpivotal this makes sense. We were suggesting anti-affinity as a stop-gap solution, with queueing being the eventual goal to mitigate the underlying distribution problem. Of course, given @marco-m's workload, anti-affinity doesn't help. Ideally we would like to move to a scenario with some combination of queueing containers + scheduling containers based on actual worker load (CPU load, for example), but this is a large chunk of work. I opened #3695 to discuss problems/solutions with regard to scheduling work. @marco-m please do let us know if you still feel strongly about this issue.
Hello @ddadlani, in our team we were wondering the following, related to the #3695 epic. We see that the Concourse team is active in many directions around that topic, and we are very grateful. On the other hand, it is unclear to us when Concourse will be able to benefit from all that work, since it is clearly non-trivial :-) So the question: would you be willing to consider my team providing a stop-gap, an implementation for this specific issue (#2928)? This stop-gap could be temporary, and could be thrown away as soon as a proper fix coming from #3695 is implemented.
Hi @marco-m, yes, I'd be okay with that. The only gotcha I foresee is that it would be affected by the same bug as #3301, but we are working on a fix for that too, so it may work out that both are done around the same time. @vito any concerns?
What will happen when a worker reaches the max? In the absence of queueing, will we just act as if the worker isn't in the pool and fail with 'no workers'? Or sit in a retry loop until one has capacity? Either way, starting on a naive implementation that's opt-in sounds good. 👍 We can mull over the details as we go. (Sorry if that was answered above, just giving a quick go-ahead.)
Good question. To me it is very important not to add user-visible failures, so it should do something similar to "sit in a retry loop until one has capacity". This notion of retrying was also in my original proposal for #2577.
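A minimal sketch of that retry behaviour, assuming a hypothetical ErrNoCapacity error from the placement step (names are made up, not actual Concourse code): keep retrying while every worker is at its limit, and only give up if the build is aborted or a real error occurs.

```go
package example

import (
	"context"
	"errors"
	"time"
)

// ErrNoCapacity is the hypothetical error returned when every worker is at
// its configured container limit.
var ErrNoCapacity = errors.New("all workers are at their container limit")

// chooseFunc stands in for a single placement attempt.
type chooseFunc func() (workerName string, err error)

// chooseWithRetry polls until a placement succeeds, so the build effectively
// waits in a crude queue instead of failing with "no workers".
func chooseWithRetry(ctx context.Context, choose chooseFunc, interval time.Duration) (string, error) {
	for {
		name, err := choose()
		if err == nil {
			return name, nil
		}
		if !errors.Is(err, ErrNoCapacity) {
			return "", err // a real failure: surface it to the user
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err() // build aborted/cancelled while waiting
		case <-time.After(interval):
			// capacity might have freed up; try again
		}
	}
}
```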
@marco-m is this in flight right now? Do you have any updates for us?
@ddadlani This is scheduled for us. Very probably we will pick it up this week.
This is a POC attempting to fix concourse#2928. Add "running_task" to "containers" in the DB. Use running_task to determine whether a worker is busy.
the add_active_tasks_to_workers migration had a random number as the version, which would cause future migrations to get skipped until their versions passed this number. #2928 Signed-off-by: Divya Dadlani <ddadlani@pivotal.io> Co-authored-by: Rui Yang <ryang@pivotal.io>
How can this feature be used? (v5.4.1)
This feature should be shipped with v5.5.
Thanks 👍
Note: we want to refactor runtime code in #2926 before implementing this.
With the `least-build-containers` container placement strategy implemented in #2577, we also want a way to limit the max number of build containers on the workers. The proposal in #2577 wants to add `--max-tasks-per-worker` to the ATC. This option requires `least-build-containers` to be set, otherwise it will error.

Additionally, we may want to change the flag names to say `build-containers` instead of `tasks`, since the `least-build-containers` strategy includes `get`, `task` and `put` containers.
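Purely for illustration, the proposed options might look something like this as go-flags fields on the ATC command. The flag names follow this issue's proposal (including the `build-containers` rename); the names that actually ship may differ.

```go
package atccmd

// LimitFlags is a hypothetical sketch of the flags proposed in this issue,
// expressed in the go-flags struct-tag style the ATC uses for its options.
type LimitFlags struct {
	// The limit only makes sense together with the least-build-containers
	// placement strategy.
	ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" description:"Method by which a worker is selected during container placement."`

	// 0 means "no limit"; any positive value caps build (get/task/put)
	// containers on a single worker.
	MaxBuildContainersPerWorker int `long:"max-build-containers-per-worker" default:"0" description:"Maximum number of build containers on a single worker; requires the least-build-containers strategy."`
}
```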