Containers leaky as a vegetable #1669
Those examples look like containers that are still in the process of being created being GC'd, or at least the GC trying to collect them and failing. 3.5.0 fixed a leak related to multiple teams using the same resource config (#1579), and our metrics show that that particular issue is resolved. I suspect this may be related to the original issue you filed about errors coming from the networking component of Garden (#1640). We're also currently looking into some possible issues around cleaning up container rows in the database which don't map to a successfully initialized container (#1576), so the leak may be related to that. The root cause of the 'failed' containers (there isn't a failed state, so they appear as 'creating' until they are garbage collected) does, however, seem to be how Garden runs on Kubernetes with your specific kernel version.
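If it helps to confirm what the GC is seeing, one way to watch this from the outside is to count container rows per lifecycle state in the ATC database. A minimal sketch, assuming the containers table carries a state column (creating/created/destroying) and that the database name, user, and connection details match your deployment:
$ # run wherever you can reach the ATC's Postgres; the names below are assumptions, not defaults
$ psql -U atc -d atc -c "SELECT state, COUNT(*) FROM containers GROUP BY state;"
A steadily growing 'creating' count with no matching processes inside Garden would line up with the failed-creation theory above.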
Leeks indeed! :D Thanks for your reply. Your explanation makes some sense to me. All the clusters I maintain, both Docker Swarm and Kubernetes, are running as a single What is to be done? :) Can I be of assistance in providing some extra information? I've already debugged it down to the runc level (documented in my original ticket). Is there something else I can do?
We've just hit this after upgrading from 3.6.0 to 3.8.0. We have 6 workers and all pipelines are stuck:
$ bosh -d concourse deployment
Using environment 'https://10.0.0.6:25555' as client 'admin'
Name Release(s) Stemcell(s) Team(s) Cloud Config
concourse datadog-agent/5.8.5.5 bosh-google-kvm-ubuntu-trusty-go_agent/3363.15 - latest
garden-runc/1.9.0
ulimit/0.1.0
concourse/3.8.0
postgres/23
$ bosh -d concourse is
Instance Process State AZ IPs
db/d02f1a90-7a49-4cbf-8816-55d039b80d76 running z2 10.0.32.6
web/1d13b97c-f964-4802-beac-ea6483c84385 running z2 10.0.32.5
web/236cbaf7-3e80-4a1d-93c6-377411faff52 running z2 10.0.32.4
worker/0a08291c-1f6b-4d14-974e-7ba446c8f24d running z2 10.0.32.8
worker/22385291-93e9-4630-8a6a-8265831486f2 running z2 10.0.32.9
worker/3d3982b3-0adc-4d30-b16a-6c762975bd55 running z2 10.0.32.11
worker/8a328a4a-7d24-412c-83f4-a026c35b2029 running z2 10.0.32.10
worker/aae20483-ed0e-41fb-a8db-d9b108d8504b running z2 10.0.32.7
worker/ea8e2613-8584-416f-99d8-fe25761fc37e running z2 10.0.32.12
$ bosh -d concourse ssh worker -c "ps -efH | grep -c /proc/self/init"
worker/ea8e2613-8584-416f-99d8-fe25761fc37e: stdout | 2
worker/3d3982b3-0adc-4d30-b16a-6c762975bd55: stdout | 2
worker/22385291-93e9-4630-8a6a-8265831486f2: stdout | 2
worker/0a08291c-1f6b-4d14-974e-7ba446c8f24d: stdout | 2
worker/aae20483-ed0e-41fb-a8db-d9b108d8504b: stdout | 5
worker/8a328a4a-7d24-412c-83f4-a026c35b2029: stdout | 12
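The grep above is an indirect proxy (it assumes one matching init process per live Garden container), so it can help to cross-check against what Garden has on disk. A sketch, assuming a BOSH-deployed garden-runc worker with its depot at the usual /var/vcap/data path (adjust if your job spec overrides it):
$ bosh -d concourse ssh worker -c "ls /var/vcap/data/garden/depot | wc -l"
A large gap between the two counts on the same worker usually points at containers that were never fully created or never fully destroyed.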
$ fly -t rmq ws
name containers platform tags team state version
0a08291c-c029-432d-83a4-5b28b615a984 250 linux none none running 1.2
22385291-f793-405f-84db-46eb51782947 250 linux none none running 1.2
3d3982b3-2f05-4b6c-802d-426c98b0bf9a 250 linux none none running 1.2
8a328a4a-721e-4014-978d-3dfa83a18254 250 linux none none running 1.2
aae20483-b731-4688-850b-677502f5575b 250 linux none none running 1.2
ea8e2613-3f1c-4d5a-9732-d46077652a9a 250 linux none none running 1.2
$ fly -t rmq bs
id pipeline/job build status start end duration
7421 server-release:v3.7.x/test-rabbitmq-server-scripts 28 pending n/a n/a n/a
7420 server-release:v3.8.x/test-rabbitmq-server-scripts 22 pending n/a n/a n/a
7419 jms-client/test-rabbitmq-jms-client-pr-master 16 errored 2017-12-12@20:19:26+0000 2017-12-12@20:19:38+0000 12s
7418 jms-client/rabbitmq-jms-client-1-x-x-stable-against-master 35 errored 2017-12-12@20:19:26+0000 2017-12-12@20:19:27+0000 1s
7417 jms-client/rabbitmq-jms-client-master-against-master 42 errored 2017-12-12@20:19:26+0000 2017-12-12@20:19:31+0000 5s
7416 jms-client/rabbitmq-jms-cts 49 errored 2017-12-12@20:19:25+0000 2017-12-12@20:19:27+0000 2s
7415 java-client/test-rabbitmq-java-client-pr-4-x-x 33 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:33+0000 8s
7414 java-client/rabbitmq-java-client-stable 9 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:26+0000 1s
7413 java-client/rabbitmq-java-client-4-3-x 23 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:29+0000 4s
7412 jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable 13 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:28+0000 3s
7411 java-client/rabbitmq-java-client-4-x-x 32 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:31+0000 6s
7410 jms-client/rabbitmq-jms-client-master-against-stable 20 errored 2017-12-12@20:18:24+0000 2017-12-12@20:18:29+0000 5s
7409 java-client/rabbitmq-java-client-master 31 errored 2017-12-12@20:18:25+0000 2017-12-12@20:18:26+0000 1s
7408 java-client/test-rabbitmq-java-client-pr-4-x-x 32 errored 2017-12-12@20:04:18+0000 2017-12-12@20:04:24+0000 6s
7407 java-client/rabbitmq-java-client-4-3-x 22 errored 2017-12-12@20:04:17+0000 2017-12-12@20:04:18+0000 1s
7406 java-client/rabbitmq-java-client-4-x-x 31 errored 2017-12-12@20:04:17+0000 2017-12-12@20:04:19+0000 2s
7405 java-client/rabbitmq-java-client-master 30 errored 2017-12-12@20:04:17+0000 2017-12-12@20:04:19+0000 2s
7404 jms-client/test-rabbitmq-jms-client-pr-master 15 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:01+0000 2s
7403 jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable 12 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:03+0000 4s
7402 jms-client/rabbitmq-jms-client-master-against-stable 19 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:11+0000 12s
7401 jms-client/rabbitmq-jms-client-1-x-x-stable-against-master 34 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:00+0000 1s
7400 jms-client/rabbitmq-jms-client-master-against-master 41 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:05+0000 6s
7399 jms-client/rabbitmq-jms-cts 48 errored 2017-12-12@20:03:59+0000 2017-12-12@20:04:00+0000 1s
7398 java-client/rabbitmq-java-client-stable 8 errored 2017-12-12@20:03:47+0000 2017-12-12@20:03:48+0000 1s
7397 java-client/test-rabbitmq-java-client-pr-4-x-x 31 errored 2017-12-12@20:01:15+0000 2017-12-12@20:01:16+0000 1s
7396 java-client/rabbitmq-java-client-4-3-x 21 errored 2017-12-12@20:01:15+0000 2017-12-12@20:01:17+0000 2s
7395 java-client/rabbitmq-java-client-4-x-x 30 errored 2017-12-12@20:01:15+0000 2017-12-12@20:01:16+0000 1s
7394 java-client/rabbitmq-java-client-master 29 errored 2017-12-12@20:01:15+0000 2017-12-12@20:01:16+0000 1s
7393 jms-client/test-rabbitmq-jms-client-pr-master 14 errored 2017-12-12@20:00:58+0000 2017-12-12@20:01:08+0000 10s
7392 jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable 11 errored 2017-12-12@20:00:57+0000 2017-12-12@20:00:59+0000 2s
7391 jms-client/rabbitmq-jms-client-master-against-stable 18 errored 2017-12-12@20:00:57+0000 2017-12-12@20:00:59+0000 2s
7390 jms-client/rabbitmq-jms-client-1-x-x-stable-against-master 33 errored 2017-12-12@20:00:57+0000 2017-12-12@20:00:59+0000 2s
7389 jms-client/rabbitmq-jms-client-master-against-master 40 errored 2017-12-12@20:00:57+0000 2017-12-12@20:00:59+0000 2s
7388 jms-client/rabbitmq-jms-cts 47 errored 2017-12-12@20:00:57+0000 2017-12-12@20:00:58+0000 1s
7387 java-client/rabbitmq-java-client-stable 7 errored 2017-12-12@20:00:44+0000 2017-12-12@20:00:45+0000 1s
7386 jms-client/test-rabbitmq-jms-client-pr-master 13 errored 2017-12-12@19:36:53+0000 2017-12-12@19:37:04+0000 11s
7385 jms-client/rabbitmq-jms-client-1-x-x-stable-against-master 32 errored 2017-12-12@19:36:53+0000 2017-12-12@19:36:54+0000 1s
7384 jms-client/rabbitmq-jms-client-master-against-master 39 errored 2017-12-12@19:36:53+0000 2017-12-12@19:36:59+0000 6s
7383 jms-client/rabbitmq-jms-cts 46 errored 2017-12-12@19:36:52+0000 2017-12-12@19:36:54+0000 2s
7382 java-client/test-rabbitmq-java-client-pr-4-x-x 30 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:12+0000 10s
7381 java-client/rabbitmq-java-client-stable 6 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:09+0000 7s
7380 java-client/rabbitmq-java-client-4-3-x 20 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:09+0000 7s
7379 java-client/rabbitmq-java-client-4-x-x 29 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:03+0000 1s
7378 java-client/rabbitmq-java-client-master 28 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:14+0000 12s
7377 jms-client/rabbitmq-jms-client-1-x-x-stable-against-stable 10 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:14+0000 12s
7376 jms-client/rabbitmq-jms-client-master-against-stable 17 errored 2017-12-12@19:36:02+0000 2017-12-12@19:36:11+0000 9s
7375 server-release:v3.7.x/test-rabbitmq-server-scripts 27 errored 2017-12-12@19:09:19+0000 2017-12-12@19:11:11+0000 1m52s
7374 server-release:v3.7.x/test-with-bunny:master 18 errored 2017-12-12@19:12:39+0000 2017-12-12@19:13:07+0000 28s
7373 server-release:v3.7.x/test-with-bunny:release 16 errored 2017-12-12@19:09:27+0000 2017-12-12@19:12:17+0000 2m50s
7372 server-release:v3.7.x/test-with-rabbitmq-java-client:release 18 errored 2017-12-12@19:00:27+0000 2017-12-12@19:00:46+0000 19s
Re-creating the entire deployment with
This happens to the RabbitMQ Concourse deployment every 2-3 days, whenever there's a spike in build activity. What kind of information should we provide to help the Concourse maintainers make progress on resolving this?
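For what it's worth, the artifacts that tend to be most useful for GC problems like this are the ATC's view of containers and volumes at the time of a spike, plus the worker logs. A rough capture sketch, reusing the target and deployment names from the report above:
$ fly -t rmq workers > workers.txt          # per-worker container counts
$ fly -t rmq containers > containers.txt    # container handles the ATC is tracking
$ fly -t rmq volumes > volumes.txt          # same for volumes
$ bosh -d concourse logs worker             # pulls /var/vcap/sys/log (including the Garden logs) from each worker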
This issue is old enough, and there have been enough changes to the core runtime's GC, that I'm going to close it off. #1959: workers now report the resource (container and volume) handles they have when heartbeating, the ATC marks the ones in use, and a sweep phase cleans up any garbage on the worker and in the DB.
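In practice that mark-and-sweep model means the handles present on a worker should converge to the handles the ATC still considers in use, so a lingering leak shows up as handles that exist on the worker but are unknown to the ATC. A rough manual spot-check, assuming fly's first column is the handle and that garden-runc's depot directories are named after container handles (both are assumptions about this particular deployment, not guaranteed interfaces):
$ fly -t rmq containers | awk 'NR>1 {print $1}' | sort > atc-handles.txt
$ bosh -d concourse ssh worker/0a08291c-1f6b-4d14-974e-7ba446c8f24d \
    -c "ls /var/vcap/data/garden/depot" | awk '/stdout/ {print $NF}' | sort > worker-handles.txt
$ # handles present on the worker but no longer tracked by the ATC -- i.e. garbage the sweep should collect
$ comm -13 atc-handles.txt worker-handles.txt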
Containers are not cleaned up, so more and more of them keep running over time, eventually leading to subnet exhaustion (not directly experienced yet in this phase of testing, but it will clearly happen).
Kubernetes 1.7.0, Ubuntu 16.04.3 LTS, non-default kernel 4.10.0-28-generic.
This is basically a refiling of #1424, but against Concourse 3.5.0. I'm raising a new ticket since related #1297 and #1413 have been closed and released, yet the problem persists.