Sometimes, a job needs to wait 30/60 minutes before getting a runner #3953


Open
julien-michaud opened this issue Feb 28, 2025 · 11 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@julien-michaud

julien-michaud commented Feb 28, 2025

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start workflows
2. The first two jobs will get a runner very quickly
3. The third one will sometimes stay pending for 30/40 minutes before getting a runner

Describe the bug

Let's say that I have a workflow with 3 jobs running in parallel.

Sometimes, jobs 1 and 2 will get a runner right away, but the third one will have to wait 30 minutes to an hour before getting a runner.
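
For illustration, the workflow shape is roughly the sketch below. The runs-on label is a placeholder; in gha-runner-scale-set mode it has to match the runnerScaleSetName of the scale set, which isn't shown in this issue.

# .github/workflows/parallel-jobs.yml (illustrative sketch, not the actual workflow)
name: parallel-jobs
on: push
jobs:
  job-1:
    runs-on: company-runners   # placeholder scale set name
    steps:
      - run: echo "job 1"
  job-2:
    runs-on: company-runners
    steps:
      - run: echo "job 2"
  job-3:
    runs-on: company-runners   # the job that sometimes waits 30-60 minutes for a runner
    steps:
      - run: echo "job 3"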

Describe the expected behavior

All the jobs should start right away.

Note that I have two runner scale sets with the same runnerScaleSetName; I don't know if it's bad practice or not, but it's working fine 🤷‍♂

I did that to ease the upgrade process: when a new chart is available, I update the gha-runner-scale-sets one by one to avoid service interruptions (see the sketch below).
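
Concretely, the doubled deployment looks roughly like this sketch (release names and the scale-set name are placeholders, not my actual values; the chart reference is omitted):

# Two Helm releases of gha-runner-scale-set sharing one runnerScaleSetName, so jobs
# targeting that label can land on either set while the releases are upgraded one at a time.
# Values shared by both releases (umbrella-chart style, as in the config under Additional Context below):
gha-runner-scale-set:
  runnerScaleSetName: company-runners   # identical in both releases
  minRunners: 1
  maxRunners: 100

# helm upgrade --install arc-runners-a <chart> -f values.yaml   # upgrade this release first
# helm upgrade --install arc-runners-b <chart> -f values.yaml   # then the second, once the first is healthy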

Thanks

Additional Context

gha-runner-scale-set-controller:
  enabled: true
  flags:
    logLevel: "warn"
  podLabels:
    finops.company.net/cloud_provider: gcp
    finops.company.net/cost_center: compute
    finops.company.net/product: tools
    finops.company.net/service: actions-runner-controller
    finops.company.net/region: europe-west1
  replicaCount: 3
  podAnnotations:
    ad.datadoghq.com/manager.checks: |
      {
        "openmetrics": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "histogram_buckets_as_distributions": true,
              "namespace": "actions-runner-system",
              "metrics": [".*"]
            }
          ]
        }
      }
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

gha-runner-scale-set:
  enabled: true
  githubConfigUrl: https://github.com/company
  githubConfigSecret:
    github_token: <path:secret/github_token/actions_runner_controller#token>

  maxRunners: 100
  minRunners: 1

  containerMode:
    type: "dind"  ## type can be set to dind or kubernetes

  listenerTemplate:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
    spec:
      containers:
      - name: listener
        securityContext:
          runAsUser: 1000
  template:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
    spec:
      restartPolicy: OnFailure
      imagePullSecrets:
        - name: company-prod-registry
      containers:
        - name: runner
          image: eu.gcr.io/company-production/devex/gha-runners:v1.0.0-snapshot5
          command: ["/home/runner/run.sh"]

  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller
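
Side note: the listener check above excludes gha_job_startup_duration_seconds, which is exactly the metric that would quantify this queueing delay. A minimal openmetrics instance that keeps just that metric (reusing the same keys as the annotation above) would be:

  "openmetrics": {
    "instances": [
      {
        "openmetrics_endpoint": "http://%%host%%:8080/metrics",
        "namespace": "actions-runner-system",
        "histogram_buckets_as_distributions": true,
        "metrics": ["gha_job_startup_duration_seconds"]
      }
    ]
  }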

Controller Logs

https://gist.github.com/julien-michaud/dce55b9320fb236b622cbb00919277ce

Runner Pod Logs

/
julien-michaud added the bug, gha-runner-scale-set, and needs triage (Requires review from the maintainers) labels on Feb 28, 2025
julien-michaud changed the title from "Sometimes, no runners are spawned" to "Sometimes, a job needs to wait 30/60 minutes before getting a runner" on Feb 28, 2025
@avadhanij

We are seeing the same issue, and we have a similar setup. We are unsure whether the two runner scale sets (for upgrade ease) are actually causing problems.

@emmahsax

emmahsax commented Apr 4, 2025

We are on 0.8.2 and seem to be encountering a similar issue. We recently upgraded Karpenter to 1.3.3, and that's when we began seeing this issue. But it may have existed before that.

@marcusisnard

I’m observing similar behavior, even when not running in a high availability setup (single cluster on Azure). Unfortunately, the logs offer no insight, and the latency is unpredictable.

@marcusisnard

Our organization is experiencing job queue delays exceeding 12 hours, severely impacting production workloads. No error logs are observed on our side. What steps can we take to troubleshoot this issue? @nikola-jokic

@nikola-jokic
Collaborator

Hey everyone,

Could you please submit these logs without obfuscation through support? We cannot investigate without knowing which workflow runs are stuck. If you have failed runners, they count toward the total number of runners; this is so we avoid creating an indefinite number of runners if something goes wrong with the cluster.
But if the delay is caused on the back-end side, please submit the stuck workflow run and the unobfuscated log so we can troubleshoot it. Thanks!

@nikola-jokic
Collaborator

Hey everyone,

We found the root cause of the issue, and it should be fixed now. Please let us know if you are still experiencing this issue. I will leave this issue open for now for visibility. Thank you all for reporting it!

@marcusisnard

> Hey everyone,
>
> We found the root cause of the issue, and it should be fixed now. Please let us know if you are still experiencing this issue. I will leave this issue open for now for visibility. Thank you all for reporting it!

Do we need to uninstall and re-deploy ARC?

@nikola-jokic
Collaborator

No, the issue was on the back-end side, so it should start working properly without touching the ARC installation.

@marcusisnard

[Screenshots of the GitHub runner scale set view showing pending jobs]

We are still seeing this issue: lots of jobs are still pending, and we do not have a cap on the maximum number of runners. Please let me know how I can send the appropriate logs and the Helm chart values used for our deployment.

@nikola-jokic
Collaborator

Do you have failed ephemeral runners? If you don't have failed ones, please send the listener log, the controller log and workflow URLs of the pending jobs. You can submit them in the support issue if you don't want to share them publicly.
If you do have failed runners, please remove all failed ephemeral runner instances, which would free up the slots to scale up.
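
For reference, a rough way to check and clean those up with kubectl (a sketch; the namespace and resource names are placeholders for wherever your runner scale set is installed):

kubectl get ephemeralrunners -n <runner-namespace>                 # EphemeralRunner resources created by the scale set
kubectl describe ephemeralrunner <name> -n <runner-namespace>      # inspect the status to confirm it is Failed
kubectl delete ephemeralrunner <failed-name> -n <runner-namespace> # removing it frees the slot so the set can scale up again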

@niodice

niodice commented Apr 22, 2025

@marcusisnard unfortunately I can't offer any help, but I wanted to ask how you're viewing that particular UI. It looks like a GitHub view showing the scale sets directly in the UI. I have no such view, but it would be nice to see it.
