0.9.1 does not work with on-demand provisioned nodes #3450

Closed
wasd171 opened this issue Apr 17, 2024 · 13 comments · Fixed by #3453

Labels: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

wasd171 commented Apr 17, 2024

Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Configure Karpenter to schedule the workload on a separate node pool (a minimal NodePool sketch is shown below)
2. Run the pipeline
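
For reference, a minimal Karpenter NodePool sketch that would provide the node label and taint used by the runner spec below (the NodePool name, the karpenter.sh/v1beta1 API version, and the EC2NodeClass reference are illustrative assumptions; adjust them to your Karpenter version):

# nodepool.yml (illustrative sketch)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: arc-runners-v1
spec:
  template:
    metadata:
      labels:
        # Matches the nodeSelector "workload: arc_runners_v1" in the runner template below
        workload: arc_runners_v1
    spec:
      taints:
        # Matches the toleration in the runner template below
        - key: workload
          value: arc_runners_v1
          effect: NoSchedule
      nodeClassRef:
        # Hypothetical EC2NodeClass reference
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        # Instance type and capacity requirements are setup-specific; a generic example:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]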

Describe the bug

We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision the node and start the runner pod on it. However, the controller seems to attempt to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending state. This leaves the pipeline stuck.

Might be related to #3420 and #3426

Describe the expected behavior

The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.

Additional Context

# values.yaml

arcOperator:
  values:
    replicaCount: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: gha-rs-controller
          topologyKey: kubernetes.io/hostname

arcRunners:
  values:
    githubConfigUrl: "https://github.com/CareerLunch/career-lunch"
    githubConfigSecret:
      github_token: {{ requiredEnv "ARC_CHART_GITHUB_TOKEN" | quote }}
    containerMode:
      type: kubernetes
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "career-lunch-bootstrap-aws-ebs-gp3"
        resources:
          requests:
            storage: 10Gi
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        securityContext:
          fsGroup: 123
        nodeSelector:
          workload: arc_runners_v1
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  actions-ephemeral-runner: "True"
              topologyKey: kubernetes.io/hostname
        tolerations:
          - key: workload
            operator: Equal
            value: arc_runners_v1
            effect: NoSchedule
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
                value: "false"
              - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
                value: /mnt/pod-template/pod-template.yml
            volumeMounts:
              - name: pod-template
                mountPath: /mnt/pod-template
                readOnly: true
            resources:
              requests:
                memory: "250Mi"
                cpu: "100m"
              limits:
                memory: "250Mi"
        volumes:
          - name: pod-template
            configMap:
              name: career-lunch-bootstrap-pod-template

---
# pod-template.yml

{{- $name := ( include "career-lunch-library.fullname" . ) }}
{{- $labels := ( include "career-lunch-library.labels" . ) }}

{{- /* See https://github.com/actions/runner-container-hooks/blob/main/docs/adrs/0096-hook-extensions.md */}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $name }}-pod-template
  namespace: {{ .Values.arcRunners.namespace }}
  labels:
{{ $labels | indent 4 }}
data:
  pod-template.yml: |
    {{- /*
      Technically, we do not need a PodTemplate, see https://github.com/actions/runner-container-hooks/blob/main/examples/extension.yaml
      But this communicates the intent better
    */}}
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      serviceAccountName: {{ include "career-lunch-library.fullname" . }}-arc-runners-sa
      securityContext:
        {{- /* Provides access to /home/runner/_work directory in ephemeral volume */}}
        fsGroup: 123
      nodeSelector:
        workload: arc_runners_v1
      tolerations:
        - key: workload
          operator: Equal
          value: arc_runners_v1
          effect: NoSchedule
      containers:
        {{- /* Overrides job container */}}
        - name: $job
          resources:
            {{- /* Memory values are rather high, but we need them for webpack / Cypress jobs */}}
            requests:
              memory: 4Gi
              cpu: 1
            {{- /* 
              Best practice is to add limits too
              However, we run these jobs in isolated instances and want them to use all resources available
              Instance resources are configured via Karpenter
            */}}

Controller Logs

https://gist.github.com/wasd171/eae06234a06a26592a264cd617711a85
Look at the logs starting with 2024-04-17T17:32:32Z

Listener logs:
2024-04-17T17:32:24Z	INFO	listener-app.listener	Job available message received	{"jobId": 25672}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Acquiring jobs	{"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Jobs are acquired	{"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 10}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:25Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:32:25Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 10}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Processing message	{"messageId": 11, "messageType": "RunnerScaleSetJobMessages"}
2024-04-17T17:32:31Z	INFO	listener-app.listener	New runner scale set statistics.	{"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Job assigned message received	{"jobId": 25672}
2024-04-17T17:32:31Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 11}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:32Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 1}
2024-04-17T17:32:32Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-04-17T17:33:22Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:33:22Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}

Runner Pod Logs

Unable to provide, since the pod is evicted by the controller.
wasd171 added the bug, gha-runner-scale-set, and needs triage labels on Apr 17, 2024
nikola-jokic removed the needs triage label on Apr 18, 2024
@nikola-jokic (Member)

Hey @wasd171,

Thank you for reporting this! The root cause of the problem is that in 0.9.1 we assumed that, on an empty message batch, we can self-correct the runner count. My assumption was that 50s is enough time for the cluster to be ready, but in this case it was obviously wrong.

An important thing to point out is that it should self-correct on the next run of the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener that doesn't propagate the patch ID; it may occasionally create more pods than necessary, but it handles this case appropriately.

@koreyGambill

@nikola-jokic, I want to check whether this is the same issue I am having.

I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though.

Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), so the jobs were lost, never to be scheduled.

Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?

@nikola-jokic (Member)

Hey @koreyGambill,

I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that, in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller can create and delete more ephemeral runners than 0.9.0, but it will ensure that a runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick everything up right away and decrease the number of created pods.

@ywang-psee

If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)

@nikola-jokic (Member)

Hey @ywang-psee,

We don't have a waiting time; this was a bug introduced in 0.9.1 😞
This PR (#3453) should fix it.

@ohookins

I think we are having the same problem. We're also using Karpenter here, and I actually saw the listener create the appropriate EphemeralRunnerSet settings, then the EphemeralRunners started being created, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up.

We have reverted to 0.9.0 and things seem to be working again.

@OneideLuizSchneider

Yep, same here, we reverted to 0.9.0 as well!
Keeping an eye on PR #3204.

@chenghuang-mdsol

Hi @nikola-jokic, do you know when release 0.9.2 will happen?

@nikola-jokic (Member)

Hey @chenghuang-mdsol,

I cannot promise anything, but hopefully later this week or next week.

@casey-robertson-paypal

I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the ClusterRole and the namespace it's in. Chart behavior must have changed with regard to which namespace the accounts are created in. I set the namespaces aligned with the quick start docs (arc-runners and arc-systems). Do most people just use kube-system? We are still testing, so I can blow it all away, but I'm just wondering.

Also, regarding removal: will helm uninstall do this cleanly, or is there CRD cleanup required?

@nikola-jokic (Member)

Hey everyone 👋

Would anyone be interested in testing out this fix before we release it, please? 🙏

To do it, you can follow these steps:

  1. In the gha-runner-scale-set-controller values.yaml file, update the tag value to: canary-3bda9bb
  2. Update the field appVersion in the Chart.yaml file for gha-runner-scale-set to be: canary-3bda9bb
  3. Redeploy ARC using the updated Helm chart and values.yaml files (a rough sketch of these changes is shown below)
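
Roughly, the changes could look like the following sketch (field paths follow the standard charts; treat this as an example and adjust to your setup):

# gha-runner-scale-set-controller values.yaml
image:
  repository: "ghcr.io/actions/gha-runner-scale-set-controller"
  tag: "canary-3bda9bb"

# gha-runner-scale-set Chart.yaml (only appVersion changes)
appVersion: "canary-3bda9bb"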

Thank you in advance! ❤️

@sungmincs

Thanks @nikola-jokic
I just tested with the canary image tag and the Helm chart, and it looks like it fixed the issue:

NAME          NAMESPACE  	REVISION	UPDATED                             	STATUS  	CHART                                	APP VERSION
arc           arc-systems	1       	2024-05-18 00:57:23.367937 +0900 KST	deployed	gha-runner-scale-set-controller-0.9.1	0.9.1
runner-arm64  arc-runners	1       	2024-05-18 00:57:25.176608 +0900 KST	deployed	gha-runner-scale-set-0.9.1           	canary-3bda9bb
$ kubectl get pod -n arc-runners
NAME                              READY   STATUS    RESTARTS   AGE
runner-arm64-k9mpd-runner-j7g9n   2/2     Running   0          82s

It took roughly 70-80 seconds for the node to be ready and for the runner pod to start running, and the controller tolerated the start time.

@sungmincs

However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with the following missing-role error:

2024-05-17T15:44:10Z	ERROR	AutoscalingRunnerSet	Failed to remove finalizers on dependent resources	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}, "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*AutoscalingRunnerSetReconciler).Reconcile
	github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:139
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z	ERROR	Reconciler error	{"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"runner-arm64","namespace":"arc-runners"}, "namespace": "arc-runners", "name": "runner-arm64", "reconcileID": "1f35d8c0-b14f-471b-bbc0-c132a7a5196b", "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z	INFO	AutoscalingRunnerSet	Deleting resources	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
2024-05-17T15:44:10Z	INFO	AutoscalingRunnerSet	Cleaning up the listener	{"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}

Maybe it was just a timing issue where the role/rolebinding was created after the controller, or just a one-off blip, since I couldn't reproduce it afterwards.
I thought I would share it just in case anyway.
