
Ephemeral Pods not cleaning up which disrupts scaling of newer pods #4048


Closed
shivansh-ptr opened this issue Apr 21, 2025 · 2 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@shivansh-ptr

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy runner scale sets version v2.323.0 on GKE.
2. Deploy a runner configured with 8 vCPU and 16 GB memory (full values are under Additional Context below).
3. When multiple jobs come in, the pods scale up, but sometimes they are not cleaned up afterwards, resulting in scaling issues.

Describe the bug

I have observed that at larger scale, pods are not being cleaned up, which hinders the scaling of newer pods and has resulted in higher queue times.
Once I delete these pods manually, new pods immediately scale up.
I am assuming that, since these stuck pods are still present, the controller counts them as existing runners and does not scale up further. Please correct me if I am wrong here.

Describe the expected behavior

The pods should be cleaned up automatically and should not cause higher queue times for workflows.

Additional Context

gha-runner-scale-set:
  githubConfigUrl: "https://github.com/<orgname>"

  githubConfigSecret: gha-runner-scale-sets-github-config-secret

  maxRunners: 100
  minRunners: 1
  runnerGroup: "k8s-dom-prod-runnergroup"
  runnerScaleSetName: "k8s-xl"

  containerMode:
    type: "dind"

  listenerTemplate:
    spec:
      containers:
        - name: listener
          image: "<image_name>:0.9.3"
          securityContext:
            runAsUser: 1000
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi

      nodeSelector:
        cloud.google.com/compute-class: od-static-class

  template:
    spec:
      containers:
        - name: runner
          image: "<image_name>:v2.323.0"
          command: ["/home/runner/run.sh"]
          resources:
            limits:
              cpu: 8000m
              memory: 16Gi
              ephemeral-storage: "20Gi"
            requests:
              cpu: 8000m
              memory: 16Gi
              ephemeral-storage: "20Gi"

      nodeSelector:
        cloud.google.com/compute-class: gar-extra-large-spot-compute-class

  controllerServiceAccount:
    namespace: arc-systems 
    name: gha-runner-scale-sets-controller-gha-rs-controller
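
Since the deployment method is Argo CD, a minimal sketch of how values like these are typically wired into an Argo CD Application follows. The Application name, repoURL, destination namespace, and sync policy are assumptions for illustration only, not the manifest actually used here (the top-level gha-runner-scale-set: key in the values above suggests the chart may instead be a dependency of an umbrella chart, in which case the nesting would be kept):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gha-runner-scale-set-k8s-xl      # assumption: hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    # assumption: OCI Helm registry hosting the gha-runner-scale-set chart
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    chart: gha-runner-scale-set
    targetRevision: "0.9.3"
    helm:
      valuesObject:
        # the full values shown above would go here
        githubConfigUrl: "https://github.com/<orgname>"
        githubConfigSecret: gha-runner-scale-sets-github-config-secret
        maxRunners: 100
        minRunners: 1
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-runners                # assumption: runner namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true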

Controller Logs

The controller logs did not show any errors.

However, I did see the following error in the Argo CD events for the pod:

error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "02d63aba-19b0-4e84-ae5c-419e73b69f6f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = failed to stop container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\": an error occurs during waiting for container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\" to be killed: wait container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\": context deadline exceeded"]
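
For reference, the error above indicates that the container runtime did not stop the runner container within its deadline. Kubernetes pods accept a terminationGracePeriodSeconds field, and in the gha-runner-scale-set values it sits on the runner pod template, roughly as sketched below. The 300-second value is only an assumption to illustrate where the field goes; it is not a confirmed fix for the cleanup issue:

  template:
    spec:
      # Assumption: give the runner and dind containers longer than the
      # 30-second Kubernetes default to stop before they are force-killed.
      terminationGracePeriodSeconds: 300
      containers:
        - name: runner
          image: "<image_name>:v2.323.0"
          command: ["/home/runner/run.sh"]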

Runner Pod Logs

https://gist.github.com/shivansh-ptr/71575d8deb902825b13e4c591eb29dd6
shivansh-ptr added the bug, gha-runner-scale-set, and needs triage labels on Apr 21, 2025.
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Collaborator

Hey @shivansh-ptr, this is by design so that we can avoid scaling indefinitely when something bad happens to the cluster. However, we intend to revisit this design and come up with a better strategy for self-healing. I'll close this issue here since we will be tracking it in issue #2721.
