
Occasionally jobs are stuck in "waiting for a runner to come online" #3649


Closed
4 tasks done
victor-smg opened this issue Jul 8, 2024 · 15 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@victor-smg

victor-smg commented Jul 8, 2024

Checks

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

There are no clear reproduction steps for the bug, as it is occasional.
The team has noticed it occurs most often in the morning.

Describe the bug

In the example, on the CI pipeline (see the attached screenshot):

  • At 06:49 a job was supposed to be scheduled on the gm2.dind.large-b4j7w runner after the job "get-ecr-login-password"
  • For 40 minutes the job "unit-tests" was stuck waiting for a runner
  • At 07:40:43 the controller logged "Created new ephemeral runner" for a runner named gm2.dind.large-b4j7w-runner-bsk5p. That runner was then used to run the job "unit-tests"

We have multiple cases of jobs waiting a long time for a runner, without any clear explanation.

Describe the expected behavior

  • The job is triggered
  • The ARC controller creates the runner
  • The runner picks up the job and starts running it

Additional Context

There have been attempts to tune the 'minWorkers' configuration of the runner set to prevent the waiting issue.

The runner sets are deployed with Helm using the following values.yaml file:

githubConfigUrl: ${github_config_url}

githubConfigSecret:
  github_app_id: "${github_app_id}"
  github_app_installation_id: "${github_app_installation_id}"
  github_app_private_key: |
    ${github_app_private_key}

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "github-arc"
    resources:
      requests:
        storage: 10Gi

minWorkers: 2

listenerTemplate:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    containers:
    - name: listener
      securityContext:
        runAsUser: 1000
      resources:
        limits:
          memory: 64Mi
        requests:
          cpu: 100m
          memory: 64Mi
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"

template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccount: "${sa_name}"
    containers:
    - name: runner
      image: REDACTED/github-arc-runner:0.7
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"
      resources:
        limits:
          cpu: "${resources.cpu}"
          memory: "${resources.memory}"
        requests:
          cpu: "${resources.cpu}"
          memory: "${resources.memory}"
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"

Controller Logs

Logs would be best read when filtered on "gm2.dind.large-b4j7w"

"https://gist.github.com/victor-smg/1c7ed78e9702dd2389e98725c9ac37f5.js"

Runner Pod Logs

No relevant runner logs, as the lack of a runner is the issue.
victor-smg added the bug, gha-runner-scale-set, and needs triage labels on Jul 8, 2024
Contributor

github-actions bot commented Jul 8, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@andresrsanchez

We are now facing this issue also, same version.

@krzyzakp

A bit more info that may help.
So far we thought it was mostly a problem in the morning, but we are impacted by it right now as well. In theory there was only a GitHub Copilot outage in the last hour, nothing about Actions or the API, but one of our runners is still NOT fetching jobs. It seems like the autoscalingrunnerset is not doing its job correctly.

krzyzakp@X1Carbon:/home/krzyzakp $ k describe -n github-arc autoscalingrunnersets.actions.github.com gm2.small
Name:         gm2.small
Namespace:    github-arc
Labels:       actions.github.com/organization=tutti-ch
              actions.github.com/scale-set-name=gm2.small
              actions.github.com/scale-set-namespace=github-arc
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=gm2.small
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=gm2.small
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-github-secret-name: gm2.small-gha-rs-github-secret
              actions.github.com/cleanup-kubernetes-mode-role-binding-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: gm2.small-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: gm2.small-gha-rs-manager
              actions.github.com/runner-group-name: Default
              actions.github.com/runner-scale-set-name: gm2.small
              actions.github.com/values-hash: <REDACTED in comment>
              meta.helm.sh/release-name: gm2.small
              meta.helm.sh/release-namespace: github-arc
              runner-scale-set-id: 2
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-05-22T12:57:11Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        32
  Resource Version:  326465633
  UID:               b6f45bbd-9c25-475c-b633-ebd4290365e3
Spec:
  Github Config Secret:  gm2.small-gha-rs-github-secret
  Github Config URL:     https://github.com/tutti-ch
  Listener Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Memory:  64Mi
          Requests:
            Cpu:     100m
            Memory:  64Mi
        Security Context:
          Run As User:  1000
      Node Selector:
        karpenter.sh/nodepool:  runner
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
  Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:             668273420038.dkr.ecr.eu-central-1.amazonaws.com/github-arc-runner:0.7
        Name:              runner
        Resources:
          Limits:
            Cpu:     500m
            Memory:  1Gi
          Requests:
            Cpu:     500m
            Memory:  1Gi
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        karpenter.sh/nodepool:  runner
      Restart Policy:           Never
      Security Context:
        Fs Group:            1001
      Service Account:       gm.small
      Service Account Name:  gm2.small-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         10Gi
              Storage Class Name:  github-arc
        Name:                      work
Status:
  Current Runners:            23
  Failed Ephemeral Runners:   22
  Pending Ephemeral Runners:  1
  Running Ephemeral Runners:  0
Events:                       <none>

You can clearly see that there are 23 runners, 22 of which have failed, but even if I restart the job and try to get hold of a pod, I don't see any.
What is more surprising, other runner sets using exactly the same config (except mem/cpu limits) are working fine, even though they are bigger, so I don't think it's a lack of node capacity.

On top of that, in the logs I see (read from bottom to top):

2024-07-11 17:20:19.366	
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Deleting the ephemeral runner pod	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "podId": "61019e6a-6221-4d3d-8fa3-689cad9a2396"}
2024-07-11 17:20:19.366	
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Ephemeral runner pod has finished, but the runner still exists in the service. Deleting the pod to restart it.	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:19.366	
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Runner exists in GitHub service	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11 17:20:19.271	
2024-07-11T15:20:19Z	INFO	EphemeralRunner	Checking if runner exists in GitHub service	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11 17:20:15.287	
2024-07-11T15:20:15Z	INFO	AutoscalingRunnerSet	Find existing ephemeral runner set	{"version": "0.9.3", "autoscalingrunnerset": {"name":"gm2.small","namespace":"github-arc"}, "name": "gm2.small-zh6h4", "specHash": "75f975979"}
2024-07-11 17:20:15.285	
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:15.284	
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:15.284	
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:15.272	
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Updating status with current runners count	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "count": 23}
2024-07-11 17:20:15.272	
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:15.272	
2024-07-11T15:20:15Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:15.271	
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:15.257	
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Running", "statusReason": "", "statusMessage": ""}
2024-07-11 17:20:15.257	
2024-07-11T15:20:15Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.099	
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.096	
2024-07-11T15:20:09Z	INFO	EphemeralRunnerSet	Scaling comparison	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:09.096	
2024-07-11T15:20:09Z	INFO	EphemeralRunnerSet	Ephemeral runner counts	{"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 1, "running": 0, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:09.096	
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Updated ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.089	
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Updating ephemeral runner status with pod phase	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": ""}
2024-07-11 17:20:09.089	
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Ephemeral runner container is still running	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.070	
2024-07-11T15:20:09Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.156	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.153	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Waiting for runner container status to be available	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.145	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Created ephemeral runner pod	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerScaleSetId": 2, "runnerName": "gm2.small-zh6h4-runner-v9sch", "runnerId": 604044, "configUrl": "https://github.com/tutti-ch", "podName": "gm2.small-zh6h4-runner-v9sch"}
2024-07-11 17:20:02.116	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Created new pod spec for ephemeral runner	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.116	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Creating new pod for ephemeral runner	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.116	
2024-07-11T15:20:02Z	INFO	EphemeralRunner	Creating new EphemeralRunner pod.	{"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}

It's very strange that the pod was running for just a few seconds and was then deleted.

@mohsen0

mohsen0 commented Jul 12, 2024

Is there a workaround for this? The issue effectively causes a very poor experience.

@keisari-ch

We are experiencing this issue as well with a new deployment on AKS.
The runner job is never deployed. The only custom thing is that there is a nodeSelector.

controller logs

2024-07-22T09:51:19Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:51:19Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:51:19Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:09Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:09Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:09Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:59Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:59Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:59Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}

listener logs

2024-07-22T09:51:19Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":0,\"replicas\":1}}"}
2024-07-22T09:51:19Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:51:19Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-07-22T09:52:09Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:09Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":2,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":2,\"replicas\":1}}"}
2024-07-22T09:52:59Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:59Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":3,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":3,\"replicas\":1}}"}
2024-07-22T09:53:49Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:53:49Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":4,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":4,\"replicas\":1}}"}
2024-07-22T09:54:40Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:54:40Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Calculated target runner count  {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":5,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Preparing EphemeralRunnerSet update     {"json": "{\"spec\":{\"patchID\":5,\"replicas\":1}}"}
2024-07-22T09:55:30Z    INFO    listener-app.worker.kubernetesworker    Ephemeral runner set scaled.    {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:55:30Z    INFO    listener-app.listener   Getting next message    {"lastMessageID": 0}

Totally blocked.

@gwynforthewyn
Contributor

We've seen the same issue. There are other issues where this behaviour was reported in older versions of actions-runner-controller, and one of them has a comment saying that the problem is fixed by restarting the listener.

This doesn't fix the underlying issue, but as a triage measure you can deploy a CronJob with something like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: actions-listener-restart-cron
  namespace: actions
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: your-service-account
          containers:
          - name: actions-listener-restart-pod
            image: bitnami/kubectl:v1.26.14
            imagePullPolicy: Always
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
              runAsUser: 999
            command:
            - /bin/sh
            - -c
            - kubectl delete pod -n actions -l app.kubernetes.io/component=runner-scale-set-listener
          restartPolicy: Never

You'll also need to define the ServiceAccount and its Role/RoleBinding, but the above should get you started.
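For reference, here is a minimal sketch of what that RBAC could look like, assuming the actions namespace and the placeholder your-service-account from the example above; the Role and RoleBinding names are placeholders as well:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: your-service-account
  namespace: actions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: listener-pod-deleter
  namespace: actions
rules:
  # Just enough for "kubectl delete pod -l ..." to list and delete the listener pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: listener-pod-deleter
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: listener-pod-deleter
subjects:
  - kind: ServiceAccount
    name: your-service-account
    namespace: actions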

@krzyzakp

On our side, it somehow magically stopped being an issue over two weeks ago. Since then, not a single job has gone unpicked (except for about an hour when we messed up our networking in AWS and the Kubernetes cluster could not be reached, but that was purely a network problem we caused ourselves).
So for now I can't really debug it further or add the recommended restart, as it simply works.

@ricardojdsilva87

Hello, we have a similar situation on our runners.
We have basically the same template applied to all the different labels; the difference between them is essentially the resources set.
We noticed that even deleting all the resources and reinstalling everything results in the same behaviour. Here are the logs for a working listener triggering new pods:

{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Processing message","messageId":10391,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Job available message received","jobId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[7848922]"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Jobs are acquired","count":1,"requestIds":"[7848922]"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10391}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":null}}"}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":0,"decision":0,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":0}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10391}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Processing message","messageId":10392,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Job assigned message received","jobId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10392}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":1}}"}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":1,"decision":1,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":1}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10392}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Processing message","messageId":10393,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":1,"totalRegisteredRunners":1,"totalBusyRunners":1,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Job started message received.","RequestId":7848922,"RunnerId":8678330}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Updating job info for the runner","runnerName":"RUNNER-NAME-runner-x6xq4","ownerName":"ORGANIZATION","repoName":"REPOSITORY","workflowRef":"ORGANIZATION/REPOSITORY/.github/workflows/workflow.yml@refs/heads/main","workflowRunId":9332229,"jobDisplayName":"JOBNAME","requestId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Updating ephemeral runner with merge patch","json":"{\"status\":{\"jobDisplayName\":\"JOB NAME\",\"jobRepositoryName\":\"ORGANIZATION/REPOSITORY\",\"jobRequestId\":7848922,\"jobWorkflowRef\":\"ORGANIZATION/REPOSITORY/.github/workflows/workflow.yml@refs/heads/main\",\"workflowRunId\":9332229}}"}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner status updated with the merge patch successfully."}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":1}}"}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":1,"decision":1,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":1}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:15:36Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:16:26Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:17:16Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}

As you can see, there are three entries of "Created merge patch json for EphemeralRunnerSet update": the first with replicas:null and the other two patching the number of replicas into the ephemeralrunnerset.

The only difference for a non-working job is that the second scale never happens. After the first patch with replicas:null, the listener keeps listening for new events but does not scale up any pods.

We are using runner scale set v0.8.3 with the ghcr.io/actions/actions-runner:2.316.1 image.

I even tried manually patching the ephemeralrunnerset, with no success. Even when increasing the minimum number of runners from 0 to 2 on the autoscalingrunnerset, the pods appear but are not assigned any jobs.

Any idea where we should look? It seems there is an issue between the listener pod and the ephemeralrunnerset, but as mentioned, even after deleting everything and recreating it, the same behaviour happens.

Thanks

@ricardojdsilva87

Hello everyone,
in our case we create a separate runner group on GHE to register the listeners. For some reason, even after deleting the entire deployment, the listeners could not fetch any pending jobs until we deleted this runner group and recreated it.
After that, all the existing labels registered successfully and the jobs started right away.
Thanks
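For context, the runner group a scale set registers into is chosen in the scale set's values.yaml; below is a minimal sketch with placeholder values, assuming the chart's runnerGroup field and that the group already exists on the GitHub/GHE side:

githubConfigUrl: https://github.example.com/your-org   # placeholder
githubConfigSecret: your-github-config-secret          # placeholder
# Name of the pre-existing runner group the listener registers into.
# In our case, deleting and recreating this group on GHE is what got
# the listeners fetching jobs again.
runnerGroup: "your-runner-group"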

@ricardojdsilva87

ricardojdsilva87 commented Oct 24, 2024

Just an update: we kept seeing the listeners go offline on our GitHub instance for no apparent reason. Deleting the autoscalingrunnerset caused the listener to be recreated and able to register again.
As soon as we do this, the jobs are picked up and new runners are created.
We are also checking on our GitHub instance side what might be wrong, since we do not see any errors on the listener/controller side. For the moment we've implemented a cronjob that deletes the autoscalingrunnerset every hour, checking first whether there are any active runners for that label: if so, the autoscalingrunnerset is not deleted; if no runners are present, we delete the autoscalingrunnerset so it can register again.
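Not our exact job, but a rough sketch of that kind of hourly CronJob, assuming the runner pods carry the actions.github.com/scale-set-name label (it appears on the AutoscalingRunnerSet in the describe output above; verify it is also set on your runner pods) and a ServiceAccount allowed to list pods and delete autoscalingrunnersets; all names and namespaces are placeholders:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: arc-scale-set-recycle
  namespace: arc-runners
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: arc-scale-set-recycler
          restartPolicy: Never
          containers:
          - name: recycle
            image: bitnami/kubectl:v1.26.14
            command:
            - /bin/sh
            - -c
            - |
              # Delete the scale set only when no runner pods are still running for it
              active=$(kubectl get pods -n arc-runners \
                -l actions.github.com/scale-set-name=your-scale-set \
                --field-selector=status.phase=Running --no-headers | wc -l)
              if [ "$active" -eq 0 ]; then
                kubectl delete autoscalingrunnersets.actions.github.com -n arc-runners your-scale-set
              fi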

@Y0RI0

Y0RI0 commented Nov 18, 2024

I have the same issue, and @ricardojdsilva87's comment solves it. The listener does not report any error logs. The last thing it says (repeatedly) before suddenly dying is:

Preparing EphemeralRunnerSet update

I'm wondering if this has something to do with it, but it shouldn't matter if no jobs are running. The controller should be able to "see" that the listener is missing, kill the set, and recreate it. So perhaps, if a job is not reported as exited, the controller hangs waiting to delete the ephemeral runner set (which also controls whether the listener exists or not).

Referencing related issue: #3704

Additional details if it helps:

  • running with Argo CD
  • controller and runners are in separate namespaces
  • minimum runners is set to 2, so two runner pods seem to exist, but the listener pod has died

@ganchesp

ganchesp commented Jan 7, 2025

We have experienced the same issue here:
we have two runner scale sets for redundancy, both with the same runner scale set name but each in a different runner group.
It used to work, but now workflows hang with "Waiting for a runner to come online", even though both runner scale sets show as online with a green light.
We could not see any errors in the arc-system namespace, and there seems to be no way to determine the cause.
For troubleshooting, we uninstalled both Helm releases (ARC and the runner scale set) and reinstalled both of them, but the problem persists.

Any tips will be much appreciated!

@nikola-jokic
Collaborator

Hey everyone, this should be resolved now. Can you please confirm you are not getting this error anymore?

@nikola-jokic
Collaborator

I'll close this issue here since we are tracking it on the new issue #3953
Thank you for reporting it and please let us know if you are still experiencing the issue.

@forgondolin

forgondolin commented May 14, 2025

Hello @nikola-jokic, is this fix only for 0.11.0?
