0.9.1 does not work with on-demand provisioned nodes #3450

Closed
wasd171 opened this issue Apr 17, 2024 · 13 comments · Fixed by #3453

Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

wasd171 commented Apr 17, 2024

Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Configure Karpenter to schedule the workload on a separate node pool (a minimal NodePool sketch is shown below)
2. Run the pipeline
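
For reference, a minimal Karpenter NodePool sketch that would provide the node label and taint used by the runner spec below (the NodePool name, the karpenter.sh/v1beta1 API version, and the EC2NodeClass reference are illustrative assumptions; adjust them to your Karpenter version):

# nodepool.yml (illustrative sketch)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: arc-runners-v1
spec:
  template:
    metadata:
      labels:
        # Matches the nodeSelector "workload: arc_runners_v1" in the runner template below
        workload: arc_runners_v1
    spec:
      taints:
        # Matches the toleration in the runner template below
        - key: workload
          value: arc_runners_v1
          effect: NoSchedule
      nodeClassRef:
        # Hypothetical EC2NodeClass reference
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        # Instance type and capacity requirements are setup-specific; a generic example:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]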

Describe the bug

We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision the node and start the runner pod on it. However, the controller seems to attempt to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending state. This leaves the pipeline stuck.

Might be related to #3420 and #3426

Describe the expected behavior

The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.

Additional Context

# values.yaml

arcOperator:
  values:
    replicaCount: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: gha-rs-controller
          topologyKey: kubernetes.io/hostname

arcRunners:
  values:
    githubConfigUrl: "https://github.com/CareerLunch/career-lunch"
    githubConfigSecret:
      github_token: {{ requiredEnv "ARC_CHART_GITHUB_TOKEN" | quote }}
    containerMode:
      type: kubernetes
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "career-lunch-bootstrap-aws-ebs-gp3"
        resources:
          requests:
            storage: 10Gi
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        securityContext:
          fsGroup: 123
        nodeSelector:
          workload: arc_runners_v1
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  actions-ephemeral-runner: "True"
              topologyKey: kubernetes.io/hostname
        tolerations:
          - key: workload
            operator: Equal
            value: arc_runners_v1
            effect: NoSchedule
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
                value: "false"
              - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
                value: /mnt/pod-template/pod-template.yml
            volumeMounts:
              - name: pod-template
                mountPath: /mnt/pod-template
                readOnly: true
            resources:
              requests:
                memory: "250Mi"
                cpu: "100m"
              limits:
                memory: "250Mi"
        volumes:
          - name: pod-template
            configMap:
              name: career-lunch-bootstrap-pod-template

---
# pod-template.yml

{{- $name := ( include "career-lunch-library.fullname" . ) }}
{{- $labels := ( include "career-lunch-library.labels" . ) }}

{{- /* See https://github.com/actions/runner-container-hooks/blob/main/docs/adrs/0096-hook-extensions.md */}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $name }}-pod-template
  namespace: {{ .Values.arcRunners.namespace }}
  labels:
{{ $labels | indent 4 }}
data:
  pod-template.yml: |
    {{- /*
      Technically, we do not need a PodTemplate, see https://github.com/actions/runner-container-hooks/blob/main/examples/extension.yaml
      But this communicates the intent better
    */}}
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      serviceAccountName: {{ include "career-lunch-library.fullname" . }}-arc-runners-sa
      securityContext:
        {{- /* Provides access to /home/runner/_work directory in ephemeral volume */}}
        fsGroup: 123
      nodeSelector:
        workload: arc_runners_v1
      tolerations:
        - key: workload
          operator: Equal
          value: arc_runners_v1
          effect: NoSchedule
      containers:
        {{- /* Overrides job container */}}
        - name: $job
          resources:
            {{- /* Memory values are rather high, but we need them for webpack / Cypress jobs */}}
            requests:
              memory: 4Gi
              cpu: 1
            {{- /* 
              Best practice is to add limits too
              However, we run these jobs in isolated instances and want them to use all resources available
              Instance resources are configured via Karpenter
            */}}

Controller Logs

https://gist.github.com/wasd171/eae06234a06a26592a264cd617711a85
Look at the logs starting with 2024-04-17T17:32:32Z

Listener logs:
2024-04-17T17:32:24Z	INFO	listener-app.listener	Job available message received	{"jobId": 25672}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Acquiring jobs	{"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Jobs are acquired	{"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 10}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:32:25Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 10}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Processing message	{"messageId": 11, "messageType": "RunnerScaleSetJobMessages"}
2024-04-17T17:32:31Z	INFO	listener-app.listener	New runner scale set statistics.	{"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Job assigned message received	{"jobId": 25672}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 11}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 1}
2024-04-17T17:32:32Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:33:22Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}

Runner Pod Logs

Unable to provide, since the pod is evicted by the controller.
wasd171 added the bug, gha-runner-scale-set, and needs triage labels on Apr 17, 2024
nikola-jokic removed the needs triage label on Apr 18, 2024
@nikola-jokic (Member)

Hey @wasd171,

Thank you for reporting this! The root cause of the problem is that in 0.9.1 we assumed that, on an empty message batch, we can self-correct the runner count. My assumption was that 50s is enough time for the cluster to be ready, but in this case it was obviously wrong.

An important thing to point out is that it should self-correct on the next run of the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener that doesn't propagate the patch ID; it may occasionally create more pods than necessary, but it handles this case appropriately.

@koreyGambill

@nikola-jokic, I want to check whether this is the same issue I am having.

I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though.

Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), so the jobs were lost, never to be scheduled.

Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?

@nikola-jokic (Member)

Hey @koreyGambill,

I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that, in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller can create and delete more ephemeral runners than 0.9.0, but it will ensure that a runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick everything up right away and decrease the number of created pods.

@ywang-psee

If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)

@nikola-jokic (Member)

Hey @ywang-psee,

We don't have a waiting time; this was a bug introduced in 0.9.1 😞
This PR (#3453) should fix it.

@ohookins

I think we are having the same problem. We're also using Karpenter here, and I actually saw the listener create the appropriate EphemeralRunnerSet settings, then the EphemeralRunners started being created, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up.

We have reverted to 0.9.0 and things seem to be working again.

@OneideLuizSchneider

Yep, same here, we reverted to 0.9.0 as well!
Keeping an eye on PR #3204.

@chenghuang-mdsol

Hi @nikola-jokic, do you know when release 0.9.2 will happen?

@nikola-jokic (Member)

Hey @chenghuang-mdsol,

I cannot promise anything, but hopefully later this week or next week.

@casey-robertson-paypal

I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the ClusterRole and the namespace it's in. Chart behavior must have changed with regard to which namespace the accounts are created in. I set the namespaces aligned with the quick start docs (arc-runners and arc-systems). Do most people just use kube-system? We are still testing, so I can blow it all away, but I'm just wondering.

Also, regarding removal: will helm uninstall do this cleanly, or is there CRD cleanup required?

@nikola-jokic (Member)

Hey everyone 👋

Would anyone be interested in testing out this fix before we release it, please? 🙏

To do it, you can follow these steps:

  1. In the gha-runner-scale-set-controller values.yaml file, update the tag value to: canary-3bda9bb
  2. Update the field appVersion in the Chart.yaml file for gha-runner-scale-set to be: canary-3bda9bb
  3. Redeploy ARC using the updated Helm chart and values.yaml files (a rough sketch of these changes is shown below)
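
Roughly, the changes could look like the following sketch (field paths follow the standard charts; treat this as an example and adjust to your setup):

# gha-runner-scale-set-controller values.yaml
image:
  repository: "ghcr.io/actions/gha-runner-scale-set-controller"
  tag: "canary-3bda9bb"

# gha-runner-scale-set Chart.yaml (only appVersion changes)
appVersion: "canary-3bda9bb"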

Thank you in advance! ❤️

@sungmincs

Thanks @nikola-jokic
I just tested with the canary image tag and the Helm chart, and it looks like it fixed the issue:

NAME          NAMESPACE  	REVISION	UPDATED                             	STATUS  	CHART                                	APP VERSION
arc           arc-systems	1       	2024-05-18 00:57:23.367937 +0900 KST	deployed	gha-runner-scale-set-controller-0.9.1	0.9.1
runner-arm64  arc-runners	1       	2024-05-18 00:57:25.176608 +0900 KST	deployed	gha-runner-scale-set-0.9.1           	canary-3bda9bb
$ kubectl get pod -n arc-runners
NAME                              READY   STATUS    RESTARTS   AGE
runner-arm64-k9mpd-runner-j7g9n   2/2     Running   0          82s

It took roughly 70-80 seconds for the node to be ready and for the runner pod to start running, and the controller tolerated the start time.

@sungmincs

However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with the following missing-role error:

2024-05-17T15:44:10Z	ERROR	AutoscalingRunnerSet	Failed to remove finalizers on dependent resources	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}, "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*AutoscalingRunnerSetReconciler).Reconcile
	github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:139
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z	ERROR	Reconciler error	{"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"runner-arm64","namespace":"arc-runners"}, "namespace": "arc-runners", "name": "runner-arm64", "reconcileID": "1f35d8c0-b14f-471b-bbc0-c132a7a5196b", "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z	INFO	AutoscalingRunnerSet	Deleting resources	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
2024-05-17T15:44:10Z	INFO	AutoscalingRunnerSet	Cleaning up the listener	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}

Maybe it was just a timing issue where the role/rolebinding was created after the controller, or just a one-off blip, since I couldn't reproduce it afterwards.
I thought I would share it just in case anyway.
