Check containers accumulate on a single worker with the 'fewest-build-containers' strategy #3251
Comments
Thanks for the detailed report; we've noticed this at times as well.
Hello @kcmannem, although I am not familiar with that part of the code, I don't think that the fact that the ATC doesn't know in real time the number of containers on a worker is related to this problem. My reasoning is the following: it takes tens of minutes to go from 200 to 600 containers, so I would assume that a rough approximation would be enough for the ATC not to dispatch containers to a given worker. Also, if the scheduling (only for check containers) were random (as we propose in the ticket), then knowing the number of containers on each worker would not be needed. Anything would be better than what we are currently seeing :-)
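For illustration, here is a minimal sketch of what a random placement strategy could look like, using a simplified, hypothetical `PlacementStrategy` interface rather than Concourse's real one in atc/worker; the point is just that it needs no per-worker container counts:

```go
package placement

import (
	"errors"
	"math/rand"
)

// Worker is a stand-in for Concourse's worker abstraction.
type Worker interface {
	Name() string
}

// PlacementStrategy is a simplified, hypothetical version of the
// interface a placement strategy would implement.
type PlacementStrategy interface {
	Choose(workers []Worker) (Worker, error)
}

// RandomPlacementStrategy picks any worker with equal probability,
// so it never needs to know how many containers each worker holds.
type RandomPlacementStrategy struct{}

func (RandomPlacementStrategy) Choose(workers []Worker) (Worker, error) {
	if len(workers) == 0 {
		return nil, errors.New("no workers available")
	}
	return workers[rand.Intn(len(workers))], nil
}
```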
I was able to repro the [snippet omitted]
checking in on the [snippet omitted]
also verified through the API [snippet omitted]
trying to create one now returns: [snippet omitted]
if the check built into Guardian was working as intended, we should have been seeing: [snippet omitted]
As an additional data point, even if we retire the worker with runaway containers, that worker still gets new containers from the ATC for a while, before we start seeing the container count decreasing.
We added a small fix to randomly place check containers on workers when using the fewest-build-containers strategy.
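A hedged sketch of how such a fix might be wired up, reusing the hypothetical `PlacementStrategy` interface from the earlier sketch (the actual change may look quite different); `ContainerSpec` and `ContainerType` here are illustrative stand-ins, not Concourse's real types:

```go
// ContainerType and ContainerSpec are illustrative stand-ins for the
// real container spec types.
type ContainerType string

const TypeCheck ContainerType = "check"

type ContainerSpec struct {
	Type ContainerType
}

// chooseStrategy routes check containers to the random strategy while
// build containers keep the operator-configured strategy.
func chooseStrategy(spec ContainerSpec, configured, random PlacementStrategy) PlacementStrategy {
	if spec.Type == TypeCheck {
		return random
	}
	return configured
}
```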
I managed to reproduce the container accumulation. My hypothesis is that a momentary loss of connection between the ATC and the worker triggers this failure mode. I made a branch where I added some code and instructions: master...Pix4D:reproduce-check-container-accumulation. In summary: [steps omitted]
@wagdav That appears to be a different known issue where, because of the worker restart, new check containers are created for the same resource config check session. This continues until the resource config check session expires. That issue has been around for a while, but was probably exacerbated by all the check containers being placed on one worker. We don't have plans to fix it because we plan on doing away with long-lived check containers in the near future. The current issue is to randomize the placement of check containers, which should reduce the impact of the issue you found.
@wagdav that test was actually caused by another, unrelated bug (see concourse/atc/db/container_owner.go, line 154 at c44d1f8).
Since your worker was just spun up in your test, its uptime is very recent and therefore the check container that gets placed onto it expires really quickly (see concourse/atc/db/container_owner.go, line 198 at c44d1f8).
The short expiration, combined with the grace time, probably resulted in the existing check containers not being found.
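To make that failure mode concrete, here is a sketch of uptime-capped expiry as described above; the bounds and names are illustrative assumptions, not the actual values in container_owner.go:

```go
package expiry

import "time"

// Illustrative bounds; the real values live in atc/db/container_owner.go.
const (
	minExpiry = 5 * time.Minute
	maxExpiry = 1 * time.Hour
)

// checkContainerExpiry clamps the worker's uptime into
// [minExpiry, maxExpiry]. A check container placed on a freshly
// started worker therefore expires after only minExpiry; together
// with the grace time, this can make existing check containers
// unfindable, so new ones get created.
func checkContainerExpiry(workerUptime time.Duration) time.Duration {
	if workerUptime < minExpiry {
		return minExpiry
	}
	if workerUptime > maxExpiry {
		return maxExpiry
	}
	return workerUptime
}
```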
it's causing a ruckus because of #3251 Signed-off-by: Alex Suraci <suraci.alex@gmail.com>
This should be fixed by #3288
Bug Report
We updated to Concourse 5.0 RC54 and we see an accumulation of check containers. The containers are placed on a single worker until they reach 256 and pipelines start to fail with the error message:

[error message omitted]
Following a proposed workaround in #847 we increased the number of available subnets by providing the --network-pool=172.16.0.0/23 argument to Garden. This made the number of containers reach as high as ~350 on a single worker. After a certain amount of time a new error started to appear:

[error message omitted]
The number of containers on a single worker went up to ~600. This error message was also reported in #3127.
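For context on these pool sizes, here is a rough capacity calculation, under the assumption that Garden carves one /30 subnet (four addresses) out of the pool per container; that matches our understanding of its defaults, but your deployment may differ:

```go
package main

import "fmt"

// subnetCapacity returns how many /30 container subnets fit in a
// network pool with the given prefix length (assuming one /30 per
// container, which is an assumption about Garden's defaults).
func subnetCapacity(poolPrefixLen uint) int {
	return 1 << (30 - poolPrefixLen)
}

func main() {
	fmt.Println(subnetCapacity(23)) // 172.16.0.0/23 -> 128 subnets
	fmt.Println(subnetCapacity(22)) // a /22 pool    -> 256 subnets
}
```

If that assumption holds, container counts far above the pool's capacity would mean the ATC keeps creating containers that Garden can no longer give a subnet, which would be consistent with the subnet exhaustion that #847's workaround addresses.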
Steps to Reproduce
Enable the "fewest-build-containers" strategy and set up pipelines with a lot of resource checks. Neither the number of build tasks nor the load generated by the tasks seems to directly influence the issue.
The number of pipelines required to trigger this issue depends on the number of available Linux workers. We have ~100 pipelines and ~20 Linux workers.
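For completeness, the placement strategy is configured on the web node; if memory serves, enabling it in a 5.0 binary deployment looks roughly like this (it can also be set via the CONCOURSE_CONTAINER_PLACEMENT_STRATEGY environment variable):

```
# hypothetical invocation; combine with your existing web-node flags
concourse web \
  --container-placement-strategy=fewest-build-containers
```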
Expected Results
Check containers distributed across workers such that no single worker is driven to resource exhaustion.
Actual Results
Check containers gravitate toward a specific worker until all its resources are exhausted.
Possible Explanation
The new "fewest-build-containers" strategy only applies to task containers, as expected. It seems that check containers fall back to Concourse's default placement behavior, "volume locality", so they accumulate without limit on one or two workers.
A possible solution would be to use the random placement strategy for check containers. We saw this work well on 4.x with the random placement strategy: the check containers were evenly distributed among all the workers.
Additional Context
This graph shows the number of containers for all of our workers as a function of time. [Graph and period descriptions omitted.] We can see three distinct time periods; one of them follows providing the --network-pool=172.16.0.0/23 argument to Garden, after which even more containers are placed on the worker and eventually it runs out of memory. We can also see that the number of containers increases and decreases almost instantaneously.
Impact
Currently we are out of ideas on how to control this behavior and keep the system operational. We would be happy to provide more information that could help fix this issue.
Version Info