
Ephemeral Pods not cleaning up which disrupts scaling of newer pods #4048


Closed
shivansh-ptr opened this issue Apr 21, 2025 · 2 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@shivansh-ptr

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy runner scale sets version v2.323.0 on GKE.
2. Deploy a runner configured with 8 vCPU and 16 GB memory (full values are under Additional Context below).
3. When multiple jobs come in, the pods scale up, but sometimes they are not cleaned up afterwards, resulting in scaling issues.

Describe the bug

I have observed that at larger scale, pods are not being cleaned up, which hinders the scaling of newer pods and has resulted in higher queue times.
Once I delete these pods manually, new pods immediately scale up.
I am assuming that, since these stuck pods are still present, the controller counts them as existing runners and does not scale up further. Please correct me if I am wrong here.

Describe the expected behavior

The pods should be cleaned up automatically and should not cause higher queue times for workflows.

Additional Context

gha-runner-scale-set:
  githubConfigUrl: "https://github.com/<orgname>"

  githubConfigSecret: gha-runner-scale-sets-github-config-secret

  maxRunners: 100
  minRunners: 1
  runnerGroup: "k8s-dom-prod-runnergroup"
  runnerScaleSetName: "k8s-xl"

  containerMode:
    type: "dind"

  listenerTemplate:
    spec:
      containers:
        - name: listener
          image: "<image_name>:0.9.3"
          securityContext:
            runAsUser: 1000
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi

      nodeSelector:
        cloud.google.com/compute-class: od-static-class

  template:
    spec:
      containers:
        - name: runner
          image: "<image_name>:v2.323.0"
          command: ["/home/runner/run.sh"]
          resources:
            limits:
              cpu: 8000m
              memory: 16Gi
              ephemeral-storage: "20Gi"
            requests:
              cpu: 8000m
              memory: 16Gi
              ephemeral-storage: "20Gi"

      nodeSelector:
        cloud.google.com/compute-class: gar-extra-large-spot-compute-class

  controllerServiceAccount:
    namespace: arc-systems 
    name: gha-runner-scale-sets-controller-gha-rs-controller
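
Since the deployment method is Argo CD, a minimal sketch of how values like these are typically wired into an Argo CD Application follows. The Application name, repoURL, destination namespace, and sync policy are assumptions for illustration only, not the manifest actually used here (the top-level gha-runner-scale-set: key in the values above suggests the chart may instead be a dependency of an umbrella chart, in which case the nesting would be kept):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gha-runner-scale-set-k8s-xl      # assumption: hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    # assumption: OCI Helm registry hosting the gha-runner-scale-set chart
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    chart: gha-runner-scale-set
    targetRevision: "0.9.3"
    helm:
      valuesObject:
        # the full values shown above would go here
        githubConfigUrl: "https://github.com/<orgname>"
        githubConfigSecret: gha-runner-scale-sets-github-config-secret
        maxRunners: 100
        minRunners: 1
  destination:
    server: https://kubernetes.default.svc
    namespace: arc-runners                # assumption: runner namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true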

Controller Logs

The controller logs did not show any errors.

However, I did see the following error in the Argo CD events for the pod:

error killing pod: [failed to "KillContainer" for "runner" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "02d63aba-19b0-4e84-ae5c-419e73b69f6f" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = failed to stop container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\": an error occurs during waiting for container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\" to be killed: wait container \"c3d0a59b7625ef2fcc9bb9471beb28605811883d89f811ba86426f3df2ee0cdd\": context deadline exceeded"]
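
For reference, the error above indicates that the container runtime did not stop the runner container within its deadline. Kubernetes pods accept a terminationGracePeriodSeconds field, and in the gha-runner-scale-set values it sits on the runner pod template, roughly as sketched below. The 300-second value is only an assumption to illustrate where the field goes; it is not a confirmed fix for the cleanup issue:

  template:
    spec:
      # Assumption: give the runner and dind containers longer than the
      # 30-second Kubernetes default to stop before they are force-killed.
      terminationGracePeriodSeconds: 300
      containers:
        - name: runner
          image: "<image_name>:v2.323.0"
          command: ["/home/runner/run.sh"]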

Runner Pod Logs

https://gist.github.com/shivansh-ptr/71575d8deb902825b13e4c591eb29dd6
shivansh-ptr added the bug, gha-runner-scale-set, and needs triage labels on Apr 21, 2025.
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Collaborator

Hey @shivansh-ptr, this is by design so that we can avoid scaling indefinitely when something bad happens to the cluster. However, we intend to revisit this design and come up with a better strategy for self-healing. I'll close this issue here since we will be tracking it in issue #2721.
