
Distribute container/volume garbage-collection across workers #1959

Closed
vito opened this issue Jan 15, 2018 · 7 comments

Comments

@vito
Member

vito commented Jan 15, 2018

Feature Request

What challenge are you facing?

Currently the ATC is responsible for destroying containers/volumes across workers. This is network-intensive, error-prone, and difficult to parallelize while keeping resource consumption reasonable (it's easy for a swarm of connections to lead to a 'too many open files' error). This happened on our large-scale Concourse instance, Wings, leaving the whole server dead and volumes leaking forever.

A Modest Proposal

Here's one idea:

  1. Don't have the ATC talk to workers to destroy containers/volumes - have its GC only mark them as 'destroying'.
  2. Add an API endpoint, /api/v1/workers/<name>/sync (or something). Details are described below.
  3. Add a TSA command, sync (or whatever we call this), which does a POST to the above endpoint with the worker's list of container/volume handles.
  4. Any container/volume handles not included in the submitted list will be removed from the ATC's database.
  5. The API endpoint then returns the list of container/volume handles in DESTROYING state. This is passed on to the caller of the sync command.
  6. Add a process on the worker that periodically invokes this sync command (using the worker's private key as authorization), and destroys the returned containers/volumes (see the sketch below).
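
For illustration, here's a rough sketch of what the worker-side loop in step 6 might look like. Everything here (the package, the type names, and the shape of the sync response) is an assumption for the sake of the example, not an actual implementation:

```go
// Hypothetical sketch of the worker-side loop from step 6; names, types,
// and the response shape are illustrative assumptions.
package sweeper

import (
	"log"
	"time"
)

// SyncResponse is an assumed reply shape: the handles the ATC has marked
// 'destroying' for this worker.
type SyncResponse struct {
	DestroyingContainers []string
	DestroyingVolumes    []string
}

// TSAClient stands in for the TSA 'sync' command: it POSTs the worker's
// current handles to /api/v1/workers/<name>/sync and returns the response.
type TSAClient interface {
	Sync(containerHandles, volumeHandles []string) (SyncResponse, error)
}

// Store stands in for garden (containers) or baggageclaim (volumes).
type Store interface {
	ListHandles() ([]string, error)
	Destroy(handle string) error
}

// Run periodically reports the worker's handles and destroys whatever the
// ATC says is garbage.
func Run(tsa TSAClient, containers, volumes Store, interval time.Duration) {
	for range time.Tick(interval) {
		cs, err := containers.ListHandles()
		if err != nil {
			log.Printf("list containers: %v", err)
			continue
		}
		vs, err := volumes.ListHandles()
		if err != nil {
			log.Printf("list volumes: %v", err)
			continue
		}

		resp, err := tsa.Sync(cs, vs)
		if err != nil {
			log.Printf("sync with ATC: %v", err)
			continue
		}

		for _, h := range resp.DestroyingContainers {
			if err := containers.Destroy(h); err != nil {
				log.Printf("destroy container %s: %v", h, err)
			}
		}
		for _, h := range resp.DestroyingVolumes {
			if err := volumes.Destroy(h); err != nil {
				log.Printf("destroy volume %s: %v", h, err)
			}
		}
	}
}
```

The important property is that the same round trip both reports the worker's real state and hands back the list of things to sweep, so no ATC-to-worker connections are needed.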

This has quite a few benefits:

  1. Far fewer moving parts in the GC - it's all just database work now, making it much less prone to locking up.
  2. Easier to reason about worker parallelism, now that each worker handles its own slice. We'd just need a max-in-flight, rather than a fancy per-worker job queue.
  3. More effectively distributes work across the cluster; the ATC is no longer a bottleneck for removing all containers/volumes.
  4. This also fixes the unrecoverable cases where volumes/containers are removed out-of-band from the workers, leading to unknown handle errors - the initial POST will clear them out. Ref. Lots of unknown handle errors #1255, unknown handle ... after cleaning of the workers volumes #1305, Unknown handle makes pipeline unusable #1322, failed to find created volume in baggageclaim #1550, "unknown handle" errors after upgrading Concourse 2.6.0 to 3.3.4 #1721, unknown handle on repository. #1821. As a result this should help out with Investigation: non-BOSH worker operation lifecycle #1457.
@vito vito added this to Icebox in Runtime via automation Jan 15, 2018
@vito vito added the incident label Jan 15, 2018
@vito vito moved this from Icebox to Backlog in Runtime Jan 15, 2018
@marco-m
Contributor

marco-m commented Jan 17, 2018

Regarding the step "Add a process on the worker that periodically invokes this sync command": I would suggest adding a randomization factor (jitter) on each worker when calculating the time of the next "sync", to avoid a synchronized "sync" storm from the workers to the ATC.
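
For example, something like this (a minimal Go sketch; the helper and its placement are made up, and it assumes the configured interval is positive):

```go
// Hypothetical jitter helper: spread each worker's next sync over
// [interval, 1.5*interval) so workers started at the same time don't all
// hit the ATC at once.
package sweeper

import (
	"math/rand"
	"time"
)

func nextSyncIn(interval time.Duration) time.Duration {
	// Assumes interval > 0; add up to 50% extra delay at random.
	jitter := time.Duration(rand.Int63n(int64(interval / 2)))
	return interval + jitter
}
```

The worker would then sleep for nextSyncIn(interval) between syncs instead of using a fixed ticker.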

@william-tran
Contributor

On k8s, when a worker dies (for whatever reason), I would like it to re-register with the same name and be in a state to take on work. To do this, I'd like it to synchronize with the ATC on startup, and if the simplest and most foolproof state to agree on is the empty state (as if this worker never existed before), that works for me. I don't want to have to clear the concourse-work-dir before I start the concourse process to ensure a clean startup, or need to call another command (e.g. retire-worker or fly prune-worker) to get the ATC to agree on the empty state.

@vito
Member Author

vito commented Jan 26, 2018

There's one thing to be very careful about here. Say this order of operations happens:

  1. Worker collects its set of containers/volumes to sync with the ATC.
  2. The ATC instructs the worker to create a container or volume.
  3. The container/volume finishes being created.
  4. The ATC receives the sync request. The original set of containers/volumes won't include the newly-created one, and it will be immediately reaped from the database. Uh oh!

Not sure what to do about this yet. One way would be to linearize things so that the ATC can tell when a sync request is old and cannot possibly contain the newly created container/volume (as opposed to it having disappeared), and in that case not remove it from the database. This could be done e.g. with a timestamp included in the sync request; the ATC would then only remove containers that were created prior to that timestamp. (To make this more foolproof we wouldn't use a timestamp; a monotonically increasing number would do, but then we need to define a source of truth for it so the worker knows what to send...)
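
For illustration, the timestamp variant could boil down to something like the following on the ATC side. The table, column names, and query are made up for the example, not the real schema:

```go
package gc

import (
	"database/sql"
	"time"

	"github.com/lib/pq"
)

// removeMissingContainers reaps only containers that (a) the worker did
// not report and (b) were created before the worker collected its report,
// so a container created while the sync request was in flight survives.
// Table and column names are assumptions.
func removeMissingContainers(tx *sql.Tx, worker string, reported []string, collectedAt time.Time) error {
	_, err := tx.Exec(`
		DELETE FROM containers
		WHERE worker_name = $1
		  AND NOT (handle = ANY($2))
		  AND created_at < $3
	`, worker, pq.Array(reported), collectedAt)
	return err
}
```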

@marco-m
Contributor

marco-m commented Jan 27, 2018

Nice distributed system race condition :-)

One possibility would be to make the ATC the source of truth for the sequence number (the "monotonically increasing number") as follows:

  1. If the worker already has a sequence number from the ATC, it uses it, otherwise it uses 0.
  2. Worker collects its set of containers/volumes to sync with the ATC, with the current sequence number.
  3. The ATC instructs the worker to create a container or volume. Along with that instruction, it sends a new sequence number.
  4. The container/volume finishes being created.
  5. The ATC receives the sync request. The original set of containers/volumes won't include the newly-created one, but the request contains the old sequence number, so the ATC has enough information to decide what to remove (see the sketch below).
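
Roughly, the ATC-side bookkeeping could look like this (a sketch only; the gc_sequence column, the table names, and the queries are invented for the example):

```go
package gc

import (
	"database/sql"

	"github.com/lib/pq"
)

// nextSequence is called when the ATC creates a container on a worker: it
// bumps that worker's counter and stamps the new container row with the
// returned value, which is also handed to the worker.
func nextSequence(tx *sql.Tx, worker string) (int64, error) {
	var seq int64
	err := tx.QueryRow(`
		UPDATE workers SET gc_sequence = gc_sequence + 1
		WHERE name = $1
		RETURNING gc_sequence
	`, worker).Scan(&seq)
	return seq, err
}

// reconcile removes only rows stamped at or before the sequence number the
// worker echoed back with its report, so in-flight creations survive.
func reconcile(tx *sql.Tx, worker string, reported []string, workerSeq int64) error {
	_, err := tx.Exec(`
		DELETE FROM containers
		WHERE worker_name = $1
		  AND NOT (handle = ANY($2))
		  AND gc_sequence <= $3
	`, worker, pq.Array(reported), workerSeq)
	return err
}
```

The effect is the same as with a timestamp, but the ATC owns the counter, so clock skew between the ATC and workers doesn't matter.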

@topherbullock
Member

topherbullock commented Feb 5, 2018

Breaking this down into some bite-sized issues:

  • Move the common worker registration code in concourse/bin and the BOSH release's groundcrew job into concourse/worker
  • Add batch volume and container deletion capabilities to concourse/worker Add batch volume and container deletion capabilities to concourse/worker #2109
  • Add ATC "marked garbage" API for workers to query (via TSA) which volumes and containers they should remove (see the sketch after this list)
  • Add ATC "reconciliation" API to allow workers to report what resources they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.
  • Modify the GC of containers and volumes on the ATC to only mark containers for deletion, and have the workers "sweep" the list of things to delete from their own baggageclaim and garden
  • Create a component of the worker which runs on some interval (configurable with a sane default) to hit the "marked garbage" API and start the sweep phase to delete volumes and containers.
  • Report back to the ATC "reconciliation" API from the worker after the "sweep" phase finishes with the current state of the world on the worker.
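
As a rough illustration of the "marked garbage" API item above, the ATC-side handler could be shaped something like this (route, names, and response format are assumptions, not the final design):

```go
package api

import (
	"encoding/json"
	"net/http"
)

// DestroyingStore stands in for the ATC database layer.
type DestroyingStore interface {
	// FindDestroyingHandles returns the handles of containers (or volumes)
	// in the 'destroying' state on the given worker.
	FindDestroyingHandles(workerName string) ([]string, error)
}

// ListDestroying sketches the "marked garbage" endpoint: a worker (via the
// TSA) asks which of its containers/volumes it should remove.
func ListDestroying(store DestroyingStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		worker := r.URL.Query().Get("worker_name")
		if worker == "" {
			http.Error(w, "worker_name is required", http.StatusBadRequest)
			return
		}

		handles, err := store.FindDestroyingHandles(worker)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(handles)
	}
}
```

The TSA command would call this on the worker's behalf and hand the returned handles to the worker's sweep phase.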

@xtremerui
Contributor

xtremerui commented Apr 24, 2018

Breaking out the tasks further

  • Add ATC "marked garbage" API to ATC for workers to query which volumes they should remove
  • Add ATC "marked garbage" API to ATC for workers to query which containers they should remove
  • Add TSA command for fetching destroying containers from ATC
  • Add ATC "reconciliation" API to allow workers to report what containers they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.
  • Cleanup reaper client on ATC heartbeat msg
  • Add TSA command for fetching destroying volumes from ATC
  • Add ATC "reconciliation" API to allow workers to report what volumes they still have after the initial tick of GC has finished; ATC should reconcile its database with the real state of the worker.

shashwathi pushed a commit that referenced this issue May 3, 2018
#1959

Submodule src/github.com/concourse/atc 310a311..97c9af1:
  > Add APIs to list destroying containers
  > Update marked destroy containers api without team context
  > Add API to report volumes to destroyed for a given worker
  > Add mark API
  > Merge pull request #271 from timrchavez/timrchavez/issue_1717
  > Merge pull request #265 from SHyx0rmZ/set-content-type
Submodule src/github.com/concourse/bin 6cd414a80..baac81ca9:
  > Add new runner for sweeping containers
  > Merge pull request #42 from osis/releasethequickstart
Submodule src/github.com/concourse/topgun 7706c3b..e9251aa:
  > Enable debug to see more logs for Topgun
Submodule src/github.com/concourse/tsa 62eb12d..5576ee1:
  > Add command to sync work status with ATC
  > Add tsa cmd to sweep containers
Submodule src/github.com/concourse/worker 907ef55..85f0974:
  > Add sweeper runner to worker start command

Signed-off-by: Rui Yang <ryang@pivotal.io>
shashwathi pushed a commit that referenced this issue May 8, 2018
#1959

Submodule src/github.com/concourse/atc f498067..4561536:
  > cleanup of reaper URL and reaper client references
Submodule src/github.com/concourse/bin 11139b93e..505ce5502:
  > Remove reaper URL for heartbeat msg
Submodule src/github.com/concourse/tsa 51a5503..88b5b1f:
  > Remove reaper addr from forward connection of workers
Submodule src/github.com/concourse/worker 13192ce..05f8abe:
  > Cleanup http client defaults
  > Remove reaper addr from heartbeat msg

Signed-off-by: Shash Reddy <sreddy@pivotal.io>
shashwathi pushed a commit that referenced this issue May 8, 2018
#1959

Submodule src/github.com/concourse/atc 4561536..f64cb7f:
  > Fix down migration

Signed-off-by: Shash Reddy <sreddy@pivotal.io>
shashwathi pushed a commit that referenced this issue May 8, 2018
#1959

Submodule src/github.com/concourse/atc f64cb7f..c92ce5b:
  > Update bindata

Signed-off-by: Shash Reddy <sreddy@pivotal.io>
xtremerui pushed a commit that referenced this issue May 15, 2018
#1959

Submodule src/github.com/concourse/atc c92ce5b..448ebf3:
  > Cleanup volume GC
Submodule src/github.com/concourse/tsa 88b5b1f..0d97c09:
  >  Add sweep and report functionality for volumes
Submodule src/github.com/concourse/worker 05f8abe..f388956:
  > Run volume GC after container GC

Signed-off-by: Shash Reddy <sreddy@pivotal.io>
@xtremerui xtremerui moved this from In Flight to Done in Runtime May 16, 2018
@topherbullock
Member

Looks good!
One caveat here: if a container or volume disappears from a worker but hasn't been marked as destroying in the DB, the ATC will never delete that container or volume, as it hasn't been marked as destroying. So, as it exists now, this works well at ensuring the state machine marches on, but it doesn't handle the case of the disappearing container.

See :
https://github.com/concourse/atc/blob/master/api/containerserver/report.go#L11:18
https://github.com/concourse/atc/blob/master/db/container_repository.go#L62

These calls are where the worker reports the list of containers it knows about, and the ATC destroys any Destroying containers which aren't in that list. Ideally this should also perform some diff action to determine the set of things which should stay and what should go. This is where that race condition comes in, but we can make that a separate issue (and maybe use the worker's last heartbeat time in the DB for sequencing?).
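
For illustration, the "diff action" could be something like the following (purely a sketch, not the actual code the links above point to; the state names are assumptions):

```go
package gc

// Diff partitions the handles so the ATC can decide what to keep, what to
// reap, and what disappeared out-of-band.
type Diff struct {
	Keep              []string // in the DB and still on the worker
	Reap              []string // 'destroying' in the DB and gone from the worker
	MissingFromWorker []string // in the DB (not destroying) but gone from the worker
	UnknownToDB       []string // on the worker but unknown to the DB
}

// ComputeDiff compares the ATC's view of a worker (handle -> state, e.g.
// "created" or "destroying") with the handles the worker reported.
func ComputeDiff(dbState map[string]string, reported []string) Diff {
	onWorker := make(map[string]bool, len(reported))
	for _, h := range reported {
		onWorker[h] = true
	}

	var d Diff
	for handle, state := range dbState {
		switch {
		case onWorker[handle]:
			d.Keep = append(d.Keep, handle)
		case state == "destroying":
			d.Reap = append(d.Reap, handle)
		default:
			d.MissingFromWorker = append(d.MissingFromWorker, handle)
		}
	}
	for _, h := range reported {
		if _, known := dbState[h]; !known {
			d.UnknownToDB = append(d.UnknownToDB, h)
		}
	}
	return d
}
```

The MissingFromWorker bucket is exactly the disappearing-container case described above; today it's left alone, but surfacing it would let the ATC decide how to recover.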

@vito vito added this to the v3.14.0 milestone Jul 25, 2018
@topherbullock topherbullock moved this from Done to Accepted in Runtime Mar 12, 2019