Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed resource check container creates can lead to snowballing of check containers #2454

Closed
vito opened this issue Aug 1, 2018 · 3 comments

Comments

@vito
Copy link
Member

vito commented Aug 1, 2018

This is pretty complicated to describe.

Here's the scenario but you may want to jump around the code and use this as a guide instead of trying to just read it verbatim:

  1. Radar scans for a given resource on an interval. It creates a Resource Config Check Session (RCCS).
  2. Radar calls (ResourceFactory).NewResource with a ResourceConfigCheckSessionContainerOwner.
  3. This then goes through to (worker.Pool).FindOrCreateContainer, which first does a (worker.WorkerProvider).FindWorkerForContainerByOwner. This will first find the container owner (in this case via (resourceConfigCheckSessionContainerOwner).Find which will search across all workers). But as part of finding the worker, it tries to find a container.
  • This is dangerous, because if we somehow have the owner already but don't have any container, we'll go down the "create" code path which will try to create another owner, and the worker_resource_config_check_sessions don't have a UNIQUE constraint.
  1. Say the first time this runs, there's no container. We'll then pick a worker and then call (worker.Worker).FindOrCreateContainer.
  2. This passes through to the worker.ContainerProvider, and it'll then create the container in the database.
  3. This will then create the owner and the container in the same transaction, which is what you want, as we saw in point 3 if you end up with an owner with no container, you'll end up with dupes.
  4. And here's things break: if we fail to create the real container, for whatever reason, it will instead transition to FAILED state, and then be removed by the garbage collector. This will leave us with a WRCCS owner without a corresponding container.

Ok, so what happens now that we have an owner without any container?

Well, we'll find the owner. But then we won't find a container corresponding to it (point 3). So we'll create another owner, and thus another check container. As long as the resource config check session is active, this will result in duplicate WRCCS rows, each corresponding to a duplicate check container.

@wagdav
Copy link
Contributor

wagdav commented Aug 2, 2018

@vito this behavior is very much what we saw related to #2346 . It was described there with less technical insight.

@vito
Copy link
Member Author

vito commented Aug 2, 2018

@wagdav That issue makes no mention of container count going up. You sure it's related? That was just about the only observable metric for us. CPU and everything else was fine.

mhuangpivotal pushed a commit to vmware-archive/atc that referenced this issue Aug 2, 2018
concourse/concourse#2454

Signed-off-by: Mark Huang <mhuang@pivotal.io>
mhuangpivotal pushed a commit that referenced this issue Aug 2, 2018
#2454

Submodule src/github.com/concourse/atc 4b1839f..dad9ab3:
  > Merge pull request #294 from concourse/upsert
  > Add a failing test for duplicated WRCCS

Signed-off-by: Mark Huang <mhuang@pivotal.io>
@wagdav
Copy link
Contributor

wagdav commented Aug 2, 2018

@vito You're right I didn't mention this in the referenced issue. However, I remember when we were diagnosing the problem of #2346 we saw thousands of check containers on the workers while only a handful of 'task' containers. This seemed to be less important, probably because we were too much focused on the CPU consumption of the ATC back then.

@vito vito added the accepted label Aug 21, 2018
@vito vito closed this as completed Aug 21, 2018
@vito vito modified the milestone: v4.1.0 Aug 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants