
Check containers accumulate on a single worker with the 'fewest-build-containers' strategy #3251

Closed
wagdav opened this issue Feb 8, 2019 · 11 comments
Labels: bug, release/no-impact (This is an issue that never affected released versions, i.e. a regression caught prior to shipping.)
@wagdav
Contributor

wagdav commented Feb 8, 2019

Bug Report

We updated to Concourse 5.0 RC54 and now see an accumulation of check containers. The containers are placed on a single worker until they reach 256 and pipelines start to fail with the error message:

insufficient subnets remaining in the pool

Following a proposed workaround in #847 we increased the number of available subnets by providing the --network-pool=172.16.0.0/23 argument to Garden. This allowed the number of containers on a single worker to reach as high as ~350.

After a certain amount of time a new error started to appear:

* iptables: create-instance-chains: iptables: Memory allocation problem.

The number of containers on a single worker went up to ~600. This error message was also reported in #3127 .

Steps to Reproduce

Enable the "fewest-build-containers" strategy and set up pipelines with a lot of resource checks. Neither the number of build tasks nor the load generated by the tasks seems to directly influence the issue.

The number of pipelines required to trigger this issue depends on the number of available Linux workers. We have ~100 pipelines and about 20 Linux workers.

Expected Results

Check containers distributed such that they don't kill workers.

Actual Results

Check containers gravitate toward a specific worker until all its resources are exhausted.

Possible explanation

The new "fewest-build-containers" strategy only applies to task containers, as expected. It seems that check containers fall back to Concourse's default behavior, "volume locality", so they accumulate without limit on one or two workers.

A possible solution would be to use the random placement strategy for check containers. We saw this work well in 4.x with the random placement strategy: the check containers were evenly distributed among all the workers.
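
As a rough illustration only (the Worker type and function below are made up for this sketch, not Concourse's actual placement interface), random placement for check containers could be as simple as:

package placement

import (
	"errors"
	"math/rand"
)

// Worker is a stand-in for whatever descriptor the scheduler uses to pick
// a placement target; it is not Concourse's real worker type.
type Worker struct {
	Name string
}

// chooseWorkerForCheck ignores per-worker state entirely and picks a worker
// uniformly at random, so check containers spread out instead of piling up
// on whichever worker "volume locality" keeps selecting.
func chooseWorkerForCheck(workers []Worker) (Worker, error) {
	if len(workers) == 0 {
		return Worker{}, errors.New("no workers available")
	}
	return workers[rand.Intn(len(workers))], nil
}

The point is simply that the choice does not depend on any per-worker state, so no single worker can attract all the check containers.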

Additional Context

[Graph: number of containers per worker over time]

This graph shows the number of containers for all of our workers as a function of time. We can see three distinct time periods:

  • until 8:30 we run Concourse 4.2.1 with the random container placement strategy
  • between 8:30 and 12:00 we run Concourse 5.0 RC54 with the fewest-build-containers strategy. Garden limits the number of containers to 256 and "insufficient subnets remaining in the pool" error messages are generated
  • from 12:00 we still run Concourse 5.0 but allow more subnets by providing --network-pool=172.16.0.0/23. This way even more containers are placed on the worker and eventually it runs out of memory.

We can also see that the number of containers increases and decreases almost instantaneously.

Impact

Currently we are out of ideas on how to control this behavior and keep the system operational. We would be happy to provide more information that could help fix this issue.

Version Info

  • Concourse version: 5.0 RC 54
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: AWS
  • Browser (if applicable):
  • Did this used to work? possibly not
@kcmannem
Member

kcmannem commented Feb 8, 2019

Thanks for the detailed report; we've noticed this at times as well.
The number of active containers is reported by each worker on a heartbeat. The information doesn't arrive just in time for the decision on where to place containers (build or check). However, the ATC does know, from the database, how many containers should be in the system. We should consider balancing based on the database's per-worker container counts (in any state). We've had some discussion about better usage of check containers in #3079.
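
As a rough sketch of that idea (the connection string, table, and column names below are assumptions, not necessarily the ATC's actual schema), the per-worker counts could be read straight from the database:

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connection string, table and column names are placeholders for this sketch.
	db, err := sql.Open("postgres", "postgres://atc:atc@localhost/atc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count containers per worker, in any state, as recorded in the database.
	rows, err := db.Query("SELECT worker_name, COUNT(*) FROM containers GROUP BY worker_name ORDER BY COUNT(*)")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var worker string
		var count int
		if err := rows.Scan(&worker, &count); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d containers\n", worker, count)
	}
}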

@marco-m
Contributor

marco-m commented Feb 8, 2019

Hello @kcmannem, although I am not familiar with that part of the code, I don't think the fact that the ATC doesn't know the number of containers on a worker in real time is related to this problem. My reasoning is the following: it takes tens of minutes to go from 200 to 600 containers, so I would assume a rough approximation would be enough for the ATC to stop dispatching containers to a given worker. Also, if the scheduling (only for check containers) were random (as we propose in the ticket), then knowing the number of containers on each worker would not be needed. Anything would be better than what we are currently seeing :-)

@kcmannem
Member

kcmannem commented Feb 8, 2019

I was able to repro the "insufficient subnets remaining in the pool" message from Garden. Guardian is supposed to block us from continuing the request and surface this error, as shown here:
https://github.com/cloudfoundry/guardian/blob/master/gardener/gardener.go#L191-L193
But that check is done by reading the dirs listed in the Garden depot dir. When sending create requests in parallel, there's a race where some containers might not have been written to the depot yet but were already created by runc.
Here's the script to repro:

package main

import (
	"fmt"
	"os"

	"code.cloudfoundry.org/garden"
	"code.cloudfoundry.org/garden/client"
	"code.cloudfoundry.org/garden/client/connection"
)

func main() {
	gardenClient := client.New(connection.New("tcp", "127.0.0.1:7777"))
	// Fire container create requests in parallel forever to hit the race
	// in Guardian's depot-based container limit check.
	for {
		go func() {
			container, err := gardenClient.Create(garden.ContainerSpec{RootFSPath: "raw:///worker-state/volumes/live/2b1ade31-5e37-4c4a-5720-efd371a0f964/volume"})
			if err != nil {
				fmt.Println(err)
				os.Exit(1)
			}
			fmt.Println(container)
		}()
	}
}

Checking the depot dir, we can see there are more entries than the 250 limit.

root@797ab6de2967:/src# ls -l /worker-state/depot/ | wc -l
257

Also verified through the API:

~/w/concourse (issue/3052) $ ~/Downloads/gaol_darwin list | wc -l
     256

Trying to create one now returns:

~/w/concourse (issue/3052) $ ~/Downloads/gaol_darwin create -r "raw:///worker-state/volumes/live/2b1ade31-5e37-4c4a-5720-efd371a0f964/volume"
error: insufficient subnets remaining in the pool

If the check built into Guardian were working as intended, we should have been seeing:
https://github.com/cloudfoundry/guardian/blob/master/gardener/gardener.go#L460
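
For context, a rough sketch of the kind of depot-count check described above and of where the race slips in; the package, function, and constant names are illustrative, not Guardian's actual code:

package depot

import (
	"errors"
	"io/ioutil"
)

const maxContainers = 250

// checkMaxContainers mimics the idea of counting depot entries before
// creating a container. The race: a container that runc has already created
// may not have its depot directory written yet, so several parallel create
// requests can all pass this check and push the total past the limit.
func checkMaxContainers(depotDir string) error {
	entries, err := ioutil.ReadDir(depotDir)
	if err != nil {
		return err
	}
	if len(entries) >= maxContainers {
		return errors.New("max containers reached")
	}
	return nil
}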

@kcmannem
Member

kcmannem commented Feb 8, 2019

@marco-m your point about using the random strategy for check container placement is totally valid, and it's what @ddadlani and I think is best as a short-term fix. But we still need to protect ourselves from over-creating containers when Garden isn't doing it for us.

@marco-m
Contributor

marco-m commented Feb 12, 2019

As an additional data point, even if we retire the worker with the runaway containers, that worker still receives new containers from the ATC for a while before we start seeing the container count decrease.

@kcmannem kcmannem moved this from Backlog to In Flight in Runtime Feb 12, 2019
wagdav pushed a commit to Pix4D/concourse that referenced this issue Feb 13, 2019
@ddadlani
Contributor

We added a small fix to randomly place check containers on workers when using the fewest-build-containers placement strategy. Our next step will be to ensure it is integration tested, and then it should be good to go.

@wagdav
Contributor Author

wagdav commented Feb 13, 2019

I managed to reproduce the container accumulation. My hypothesis is that a momentary loss of connection between the ATC and the worker triggers this failure mode.

I made a branch where I added some code and instructions: master...Pix4D:reproduce-check-container-accumulation

In summary:

  • Create a small pipeline with a mock resource
  • Run fly check-resource in an infinite loop
  • Stop the worker, wait a bit, then start it again
  • In a short while fly check-resource returns successfully, but the number of containers increases after each check-resource request

@kcmannem
Member

@wagdav That appears to be a different, known issue: because of the worker restart, new check containers are created for the same resource config check session. This continues until the resource config check session expires. The issue has been around for a while, but was probably exacerbated by all the check containers being placed on one worker. We don't plan to fix it because we intend to do away with long-lived check containers in the near future.

The current issue is to randomize the placement of check containers, which should reduce the impact of the issue you found.

@clarafu
Contributor

clarafu commented Feb 13, 2019

@wagdav what you saw in that test was actually caused by another bug that is unrelated to fewest-build-containers. The number of containers increases after each check-resource in your test because it is unable to find any of the check containers it previously created, due to a grace time before the container's expiration time that we use when looking for an existing check container:

sq.Expr(fmt.Sprintf("expires_at > NOW() + interval '%d seconds'", int(c.expiries.GraceTime.Seconds()))),

Since your worker was just spun up in your test, its uptime is very recent and therefore the check container that gets placed onto it expires really quickly:

"NOW() + LEAST(GREATEST('%d seconds'::interval, NOW() - to_timestamp(max(start_time))), '%d seconds'::interval)",

The short expiration, combined with the grace time, probably resulted in the existing check containers not being found.
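
To make the interaction concrete, here is a small sketch of that arithmetic; the minimum, maximum, and grace-time values are placeholders for illustration, not necessarily Concourse's actual settings:

package main

import (
	"fmt"
	"time"
)

// Placeholder values for this sketch.
const (
	minExpiry = 5 * time.Minute
	maxExpiry = 1 * time.Hour
	graceTime = 5 * time.Minute
)

// expiryFor mirrors the SQL above: the expiry interval is the worker's
// uptime, clamped between the minimum and maximum.
func expiryFor(workerUptime time.Duration) time.Duration {
	d := workerUptime
	if d < minExpiry {
		d = minExpiry
	}
	if d > maxExpiry {
		d = maxExpiry
	}
	return d
}

// foundByFinder mirrors the finder condition "expires_at > NOW() + grace":
// a container created sinceCreation ago is only reused if its expiry window
// extends beyond the elapsed time plus the grace time.
func foundByFinder(workerUptimeAtCreation, sinceCreation time.Duration) bool {
	return expiryFor(workerUptimeAtCreation) > sinceCreation+graceTime
}

func main() {
	// Worker had been up for only 2 minutes when the container was created,
	// and the next check comes 4 minutes later: the container is not found,
	// so a new one gets created every time.
	fmt.Println(foundByFinder(2*time.Minute, 4*time.Minute)) // false
	// On a long-running worker the same container would still be reusable.
	fmt.Println(foundByFinder(3*time.Hour, 4*time.Minute)) // true
}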

vito added a commit that referenced this issue Feb 13, 2019
it's causing a ruckus because of #3251

Signed-off-by: Alex Suraci <suraci.alex@gmail.com>
@vito vito added the release/no-impact This is an issue that never affected released versions (i.e. a regression caught prior to shipping). label Feb 13, 2019
@vito
Member

vito commented Feb 13, 2019

This should be fixed by #3288

@vito vito closed this as completed Feb 13, 2019
Runtime automation moved this from In Flight to Done Feb 13, 2019
@wagdav
Contributor Author

wagdav commented Feb 14, 2019

Thank you all for the feedback! We will try out the fix in #3288 and we are looking forward to the short-lived check containers in #3079!

@topherbullock topherbullock moved this from Done to Accepted in Runtime Mar 12, 2019