
Containers leaky as a vegetable #1669

Closed
tiredpixel opened this issue Sep 27, 2017 · 5 comments

@tiredpixel

tiredpixel commented Sep 27, 2017

Containers are not cleaned up, so more and more containers accumulate over time, eventually leading to subnet exhaustion (not directly experienced yet in this phase of testing, but it will clearly happen).

Kubernetes 1.7.0, Ubuntu 16.04.3 LTS, non-default kernel 4.10.0-28-generic.

This is basically a refiling of #1424, but against Concourse 3.5.0. I'm raising a new ticket since related #1297 and #1413 have been closed and released, yet the problem persists.

{"timestamp":"1506512429.404951572","source":"guardian","message":"guardian.destroy.destroy.delete-failed","log_level":2,"data":{"error":"runc create: exit status 1: container init still running\n","handle":"dcea3bf8-93b5-4e42-5c9d-a69db690604e","session":"3613.1"}}
{"timestamp":"1506512429.405071259","source":"guardian","message":"guardian.api.garden-server.destroy.failed","log_level":2,"data":{"error":"runc create: exit status 1: container init still running\n","handle":"dcea3bf8-93b5-4e42-5c9d-a69db690604e","session":"3.1.4513"}}
@topherbullock
Member

Uh oh, leeks!

Those log lines look like containers that are still in the process of being created getting GC'd, or at least the GC trying and failing. 3.5.0 fixed a leak related to multiple teams using the same resource config (#1579), and our metrics show that particular issue is fixed.

I suspect this may be related to the original issue you filed about errors coming from the networking component of Garden (#1640). We're also currently looking into some possible issues around cleaning up container rows in the database which don't map to a successfully initialized container (#1576), so the leak may be related to that. The root cause of the 'failed' containers (there isn't a failed state, so they appear as 'creating' until they are garbage collected) does, however, seem to be how Garden runs on Kubernetes and your specific kernel version.
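
In case it helps with triage: one way to watch whether rows are piling up in 'creating' is to count container rows by state in the ATC database. This is only a rough sketch; it assumes the containers table has a state column and that the database is named atc, both of which vary by Concourse version and deployment, so adjust to your schema.

$ psql -h <db-host> -U <db-user> atc -c "SELECT state, COUNT(*) FROM containers GROUP BY state ORDER BY 2 DESC;"

A steadily growing 'creating' count alongside a roughly constant count from Garden itself would point at the DB-side leak described above.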

@tiredpixel
Author

tiredpixel commented Sep 27, 2017

@topherbullock

Leeks indeed! :D Thanks for your reply. Your explanation makes some sense to me. All the clusters I maintain, both Docker Swarm and Kubernetes, run as a single main team, so I doubt it's related to that. Whilst a failure to clean up rows in the database could well be an issue, I'm not sure it would lead to the behaviour I was seeing before (which appears to be the same as this scenario), where zombie processes are left on the workers. It's worth noting, however, that I've experienced these leaks on multiple kernel versions, though always on Kubernetes clusters (my Docker Swarm clusters have other problems, filed separately, but container garbage collection appears to work fine on those). I've also experienced this on multiple versions of Kubernetes.

What is to be done? :) Can I be of assistance in providing some extra information? I've already previously debugged it down to runc level (documented in my original ticket). Is there something else I can do?

topherbullock added this to Icebox in Runtime on Nov 3, 2017
@gerhard

gerhard commented Dec 13, 2017

We've just hit this after upgrading from 3.6.0 to 3.8.0. We have 6 workers, and all pipelines are stuck:

$ bosh -d concourse deployment
Using environment 'https://10.0.0.6:25555' as client 'admin'

Name       Release(s)             Stemcell(s)                                     Team(s)  Cloud Config
concourse  datadog-agent/5.8.5.5  bosh-google-kvm-ubuntu-trusty-go_agent/3363.15  -        latest
           garden-runc/1.9.0
           ulimit/0.1.0
           concourse/3.8.0
           postgres/23

$ bosh -d concourse is

Instance                                     Process State  AZ  IPs
db/d02f1a90-7a49-4cbf-8816-55d039b80d76      running        z2  10.0.32.6
web/1d13b97c-f964-4802-beac-ea6483c84385     running        z2  10.0.32.5
web/236cbaf7-3e80-4a1d-93c6-377411faff52     running        z2  10.0.32.4
worker/0a08291c-1f6b-4d14-974e-7ba446c8f24d  running        z2  10.0.32.8
worker/22385291-93e9-4630-8a6a-8265831486f2  running        z2  10.0.32.9
worker/3d3982b3-0adc-4d30-b16a-6c762975bd55  running        z2  10.0.32.11
worker/8a328a4a-7d24-412c-83f4-a026c35b2029  running        z2  10.0.32.10
worker/aae20483-ed0e-41fb-a8db-d9b108d8504b  running        z2  10.0.32.7
worker/ea8e2613-8584-416f-99d8-fe25761fc37e  running        z2  10.0.32.12

$ bosh -d concourse ssh worker -c "ps -efH | grep -c /proc/self/init"

worker/ea8e2613-8584-416f-99d8-fe25761fc37e: stdout | 2
worker/3d3982b3-0adc-4d30-b16a-6c762975bd55: stdout | 2
worker/22385291-93e9-4630-8a6a-8265831486f2: stdout | 2
worker/0a08291c-1f6b-4d14-974e-7ba446c8f24d: stdout | 2
worker/aae20483-ed0e-41fb-a8db-d9b108d8504b: stdout | 5
worker/8a328a4a-7d24-412c-83f4-a026c35b2029: stdout | 12

$ fly -t rmq ws

name                                  containers  platform  tags  team  state    version
0a08291c-c029-432d-83a4-5b28b615a984  250         linux     none  none  running  1.2
22385291-f793-405f-84db-46eb51782947  250         linux     none  none  running  1.2
3d3982b3-2f05-4b6c-802d-426c98b0bf9a  250         linux     none  none  running  1.2
8a328a4a-721e-4014-978d-3dfa83a18254  250         linux     none  none  running  1.2
aae20483-b731-4688-850b-677502f5575b  250         linux     none  none  running  1.2
ea8e2613-3f1c-4d5a-9732-d46077652a9a  250         linux     none  none  running  1.2

$ fly -t rmq bs

id    pipeline/job                                                  build  status   start                     end                       duration
7421  server-release:v3.7.x/test-rabbitmq-server-scripts            28     pending  n/a                       n/a                       n/a
7420  server-release:v3.8.x/test-rabbitmq-server-scripts            22     pending  n/a                       n/a                       n/a
7419  jms-client/test-rabbitmq-jms-client-pr-master                 16     errored  2017-12-12@20:19:26+0000  2017-12-12@20:19:38+0000  12s
7418  jms-client/rabbitmq-jms-client-1-x-x-stable-against-master    35     errored  2017-12-12@20:19:26+0000  2017-12-12@20:19:27+0000  1s
7417  jms-client/rabbitmq-jms-client-master-against-master          42     errored  2017-12-12@20:19:26+0000  2017-12-12@20:19:31+0000  5s
7416  jms-client/rabbitmq-jms-cts                                   49     errored  2017-12-12@20:19:25+0000  2017-12-12@20:19:27+0000  2s
7415  java-client/test-rabbitmq-java-client-pr-4-x-x                33     errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:33+0000  8s
7414  java-client/rabbitmq-java-client-stable                       9      errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:26+0000  1s
7413  java-client/rabbitmq-java-client-4-3-x                        23     errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:29+0000  4s
7412  jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable    13     errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:28+0000  3s
7411  java-client/rabbitmq-java-client-4-x-x                        32     errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:31+0000  6s
7410  jms-client/rabbitmq-jms-client-master-against-stable          20     errored  2017-12-12@20:18:24+0000  2017-12-12@20:18:29+0000  5s
7409  java-client/rabbitmq-java-client-master                       31     errored  2017-12-12@20:18:25+0000  2017-12-12@20:18:26+0000  1s
7408  java-client/test-rabbitmq-java-client-pr-4-x-x                32     errored  2017-12-12@20:04:18+0000  2017-12-12@20:04:24+0000  6s
7407  java-client/rabbitmq-java-client-4-3-x                        22     errored  2017-12-12@20:04:17+0000  2017-12-12@20:04:18+0000  1s
7406  java-client/rabbitmq-java-client-4-x-x                        31     errored  2017-12-12@20:04:17+0000  2017-12-12@20:04:19+0000  2s
7405  java-client/rabbitmq-java-client-master                       30     errored  2017-12-12@20:04:17+0000  2017-12-12@20:04:19+0000  2s
7404  jms-client/test-rabbitmq-jms-client-pr-master                 15     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:01+0000  2s
7403  jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable    12     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:03+0000  4s
7402  jms-client/rabbitmq-jms-client-master-against-stable          19     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:11+0000  12s
7401  jms-client/rabbitmq-jms-client-1-x-x-stable-against-master    34     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:00+0000  1s
7400  jms-client/rabbitmq-jms-client-master-against-master          41     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:05+0000  6s
7399  jms-client/rabbitmq-jms-cts                                   48     errored  2017-12-12@20:03:59+0000  2017-12-12@20:04:00+0000  1s
7398  java-client/rabbitmq-java-client-stable                       8      errored  2017-12-12@20:03:47+0000  2017-12-12@20:03:48+0000  1s
7397  java-client/test-rabbitmq-java-client-pr-4-x-x                31     errored  2017-12-12@20:01:15+0000  2017-12-12@20:01:16+0000  1s
7396  java-client/rabbitmq-java-client-4-3-x                        21     errored  2017-12-12@20:01:15+0000  2017-12-12@20:01:17+0000  2s
7395  java-client/rabbitmq-java-client-4-x-x                        30     errored  2017-12-12@20:01:15+0000  2017-12-12@20:01:16+0000  1s
7394  java-client/rabbitmq-java-client-master                       29     errored  2017-12-12@20:01:15+0000  2017-12-12@20:01:16+0000  1s
7393  jms-client/test-rabbitmq-jms-client-pr-master                 14     errored  2017-12-12@20:00:58+0000  2017-12-12@20:01:08+0000  10s
7392  jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable    11     errored  2017-12-12@20:00:57+0000  2017-12-12@20:00:59+0000  2s
7391  jms-client/rabbitmq-jms-client-master-against-stable          18     errored  2017-12-12@20:00:57+0000  2017-12-12@20:00:59+0000  2s
7390  jms-client/rabbitmq-jms-client-1-x-x-stable-against-master    33     errored  2017-12-12@20:00:57+0000  2017-12-12@20:00:59+0000  2s
7389  jms-client/rabbitmq-jms-client-master-against-master          40     errored  2017-12-12@20:00:57+0000  2017-12-12@20:00:59+0000  2s
7388  jms-client/rabbitmq-jms-cts                                   47     errored  2017-12-12@20:00:57+0000  2017-12-12@20:00:58+0000  1s
7387  java-client/rabbitmq-java-client-stable                       7      errored  2017-12-12@20:00:44+0000  2017-12-12@20:00:45+0000  1s
7386  jms-client/test-rabbitmq-jms-client-pr-master                 13     errored  2017-12-12@19:36:53+0000  2017-12-12@19:37:04+0000  11s
7385  jms-client/rabbitmq-jms-client-1-x-x-stable-against-master    32     errored  2017-12-12@19:36:53+0000  2017-12-12@19:36:54+0000  1s
7384  jms-client/rabbitmq-jms-client-master-against-master          39     errored  2017-12-12@19:36:53+0000  2017-12-12@19:36:59+0000  6s
7383  jms-client/rabbitmq-jms-cts                                   46     errored  2017-12-12@19:36:52+0000  2017-12-12@19:36:54+0000  2s
7382  java-client/test-rabbitmq-java-client-pr-4-x-x                30     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:12+0000  10s
7381  java-client/rabbitmq-java-client-stable                       6      errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:09+0000  7s
7380  java-client/rabbitmq-java-client-4-3-x                        20     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:09+0000  7s
7379  java-client/rabbitmq-java-client-4-x-x                        29     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:03+0000  1s
7378  java-client/rabbitmq-java-client-master                       28     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:14+0000  12s
7377  jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable    10     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:14+0000  12s
7376  jms-client/rabbitmq-jms-client-master-against-stable          17     errored  2017-12-12@19:36:02+0000  2017-12-12@19:36:11+0000  9s
7375  server-release:v3.7.x/test-rabbitmq-server-scripts            27     errored  2017-12-12@19:09:19+0000  2017-12-12@19:11:11+0000  1m52s
7374  server-release:v3.7.x/test-with-bunny:master                  18     errored  2017-12-12@19:12:39+0000  2017-12-12@19:13:07+0000  28s
7373  server-release:v3.7.x/test-with-bunny:release                 16     errored  2017-12-12@19:09:27+0000  2017-12-12@19:12:17+0000  2m50s
7372  server-release:v3.7.x/test-with-rabbitmq-java-client:release  18     errored  2017-12-12@19:00:27+0000  2017-12-12@19:00:46+0000  19s

Re-creating the entire deployment with bosh --deployment concourse recreate --fix fixed our issue. We've also switched to overlay; we were previously using btrfs (see #1045).
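
For anyone else stuck in the same state, the workaround boils down to recreating the worker VMs so they register fresh, then checking that the container counts drop back to something sane. Sketch only, using the deployment and fly target names from above:

$ bosh --deployment concourse recreate --fix
$ fly -t rmq workers
$ fly -t rmq builds | head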

@michaelklishin

This happens to the RabbitMQ Concourse deployment every 2-3 days, whenever there's a spike in build activity. What kind of information should we provide to help the Concourse maintainers make progress on resolving this?

@topherbullock
Member

This issue is old enough, and there have been enough changes to the core runtime's GC, that I'm going to close it off.
I suspect a couple of things will help with containers piling up:

#1959 - Workers now report the resource (container and volume) handles they have when heartbeating; the ATC marks the ones still in use, and a sweep phase cleans up any garbage on the worker and in the DB (a rough sketch of this idea follows below the list)
#1637 - The "Failed" state for Containers; containers which are never initialized on the worker host will be marked in the DB as failed, and we'll GC them from the DB on the next pass
