Runners keep throwing docker Daemon running #3223

Open
4 tasks done
sravula84 opened this issue Jan 12, 2024 · 13 comments
Labels: bug (Something isn't working), community (Community contribution), needs triage (Requires review from the maintainers)

Comments

@sravula84

Checks

Controller Version

actions-runner-controller-0.22.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1) Installed ARC with a GitHub token.
2) Configured a RunnerDeployment with 10 replicas.
3) Configured a horizontal autoscaler with min 10 and max 50.
4) The runners reach the Running state, but after some time new runners fail with the "Is the docker daemon running?" error, and jobs wait in the queue for a runner to pick them up.

This happens for all new runners, and then the pods go into the Error state.
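
The setup described in the steps above might look roughly like the manifests below. This is a sketch, not the reporter's actual configuration: the resource names, namespace, and organization are taken from the controller logs in this issue, the metric type is the one the autoscaler log mentions, and the threshold values are illustrative assumptions.

```yaml
# Sketch only; threshold values are assumptions, not the reporter's settings.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-np              # name as it appears in the controller logs
  namespace: actions-runner-systems   # namespace as it appears in the controller logs
spec:
  replicas: 10                        # kept here because the report mentions 10 replicas
  template:
    spec:
      organization: prosperllc        # organization as it appears in the controller logs
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
  namespace: actions-runner-systems
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: github-action-np
  minReplicas: 10
  maxReplicas: 50                     # the report says 50; the pasted log shows "max": 20
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'        # assumed value
      scaleDownThreshold: '0.3'       # assumed value
      scaleUpFactor: '1.4'            # assumed value
      scaleDownFactor: '0.7'          # assumed value
```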

Describe the bug

The runners reach the Running state, but after some time new runners fail with the "Is the docker daemon running?" error, and jobs wait in the queue for a runner to pick them up.

This happens for all new runners, and then the pods go into the Error state.

Describe the expected behavior

Based on the load, the horizontal autoscaler should scale the runners, but instead new runners fail with the "Is the docker daemon running?" error.

Additional Context

❯ helm get values actions-runner-controller
USER-SUPPLIED VALUES:
authSecret:
  create: true
  github_token: ""  # supplied GitHub token (redacted)

Controller Logs

a2844b0833", "allowed": true}
2024-01-12T21:31:30Z	INFO	runner	Failed to create pod due to AlreadyExists error. Probably this pod has been already created in previous reconcilation but is still not in the informer cache. Will retry on pod created. If it doesn't repeat, there's no problem	{"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg"}
2024-01-12T21:31:31Z	DEBUG	runner	Runner appears to have been registered and running.	{"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg", "podCreationTimestamp": "2024-01-12 21:31:30 +0000 UTC"}
2024-01-12T21:31:36Z	INFO	runnerpod	Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck.	{"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "podDeletionTimestamp": "2024-01-12 21:30:25 +0000 UTC", "currentTime": "2024-01-12T21:31:36Z", "configuredDeletionTimeout": "1m0s"}
2024-01-12T21:31:36Z	INFO	runnerpod	Forcefully deleted runner pod	{"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "repository": ""}
2024-01-12T21:31:36Z	DEBUG	events	Forcefully deleted pod 'github-action-np-h6z4z-qjc6z'	{"type": "Normal", "object": {"kind":"Pod","namespace":"actions-runner-systems","name":"github-action-np-h6z4z-qjc6z","uid":"3be68602-3db3-4803-a2b7-9ae0ec52df94","apiVersion":"v1","resourceVersion":"605720"}, "reason": "PodDeleted"}
2024-01-12T21:31:39Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 10 by PercentageRunnersBusy	{"replicas_desired_before": 10, "replicas_desired": 10, "num_runners": 10, "num_runners_registered": 9, "num_runners_busy": 6, "num_terminating_busy": 0, "namespace": "actions-runner-systems", "kind": "runnerdeployment", "name": "github-action-np", "horizontal_runner_autoscaler": "example-runner-deployment-autoscaler", "enterprise": "", "organization": "prosperllc", "repository": ""}
2024-01-12T21:31:39Z	DEBUG	horizontalrunnerautoscaler	Calculated desired replicas of 10	{"horizontalrunnerautoscaler": "actions-runner-systems/example-runner-deployment-autoscaler", "suggested": 10, "reserved": 0, "min": 10, "max": 20}
2024-01-12T21:32:19Z	DEBUG	runner	Runner appears to have been registered and running.	{"runner": "actions-runner-systems/github-action-np-h6z4z-9f5cp", "podCreationTimestamp": "2024-01-12 21:25:58 +0000 UTC"}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "allowed": true}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "allowed": true}

Runner Pod Logs

"https://pipelinesghubeus21.actions.githubusercontent.com/tMTkzAKYleoidiHAI9FjPaHPkEkp2s7TIoUW3BW1740YmeFlFo/",
  "gitHubUrl": "https://github.com/prosperllc",
  "workFolder": "/runner/_work"
2024-01-12 21:34:47.510  DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 21:34:47.512  DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?

@sravula84 added the bug, gha-runner-scale-set, and needs triage labels on Jan 12, 2024
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@sravula84
Author

This is the message we are seeing:
2024-01-12 22:19:43.778 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 22:19:43.780 DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?

@sravula84
Author

Hi Team,
Can anyone advise on this issue?

Thanks
Sridhar

@sravula84
Author

Hi Team,

Any suggestions on the above?

Thanks
Sridhar

@nikola-jokic added the community label and removed the gha-runner-scale-set label on Jan 22, 2024

@sravula84
Author

Hi @nikola-jokic, can you please advise on the above ticket?

@emilwangaa

We just had a similar issue in one of our clusters today. We tracked the root cause to the latest v25.0.0 release of docker:dind and ended up setting the Helm value for image.dindSidecarRepositoryAndTag to docker:24.0.7-dind, which solved the issue. Maybe give that a try.
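
As a values-file fragment, the pin described above might look like the sketch below. The key and the image tag come from this comment; the file name is only an example, and whether a given chart version honors this key is an assumption worth checking against that chart's own values.

```yaml
# custom-values.yaml (example file name)
# Pins the Docker-in-Docker sidecar image instead of relying on the chart default.
image:
  dindSidecarRepositoryAndTag: "docker:24.0.7-dind"
```

A file like this would be passed to Helm with -f, alongside any other user-supplied values such as authSecret.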

@prosper-sre
Copy link

@emilwangaa so i need to upgrade my actions runner controller chat to 25.0.0 and update docker dind verison ?

@emilwangaa

> @emilwangaa So I need to upgrade my actions-runner-controller chart to 25.0.0 and update the docker dind version?

@prosper-sre I don't think you need to update your ARC chart unless it doesn't support setting the dind version.
The default setting for the ARC chart is to pull the latest version of dind, which caused issues for us.

@gera-aldama

gera-aldama commented Feb 3, 2024

Hello @emilwangaa, could you share how you changed image.dindSidecarRepositoryAndTag: docker:24.0.7-dind? I tried

helm upgrade --install -f custom-values.yaml --namespace actions-runner-system --create-namespace --wait actions-runner-controller actions-runner-controller/actions-runner-controller --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind --version ${CHART_VERSION}

but I still see the image dind:dind configured on the re-deployed runners

@emilwangaa

> Hello @emilwangaa, could you share how you changed image.dindSidecarRepositoryAndTag: docker:24.0.7-dind? I tried
>
> helm upgrade --install -f custom-values.yaml --namespace actions-runner-system --create-namespace --wait actions-runner-controller actions-runner-controller/actions-runner-controller --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind --version ${CHART_VERSION}
>
> but I still see the image dind:dind configured on the re-deployed runners

We use Terraform to install the chart, but your method looks right. Which version of the chart are you using? And have you tried setting it in the custom-values.yaml file that you specify instead?

@sravula84
Author

I have done the upgrade but am still seeing the same issue. It looks like one of the workflows is causing this problem; it runs in a monorepo, and something is wrong with it. I need to figure it out.

I think it is purely a resource issue. Are there any recommended specifications for CPU and memory?
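
No recommended sizing is given in this thread. If resources are the suspect, one way to test the idea is to set explicit requests and limits on both the runner container and the dind sidecar in the RunnerDeployment. The fragment below is a sketch under assumptions: the numbers are placeholders chosen for illustration, not recommendations, and the resource name is taken from the controller logs.

```yaml
# Sketch only; the CPU and memory figures are placeholders, not recommended values.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-np
  namespace: actions-runner-systems
spec:
  template:
    spec:
      organization: prosperllc
      # Resources for the runner container.
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
      # Resources for the dind sidecar container.
      dockerdContainerResources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
```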

@gera-aldama

Thanks for the response, @emilwangaa. I can now see docker:24.0.7-dind, but the issue still persists. I'm using version 0.23.7.

@95jinhong

@emilwangaa I also did a version update and am experiencing the same issue, maybe it was fixed after updating to that version?
