Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.8.3
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
runner-scale-set-values.yaml
githubConfigUrl: "https://github.com/my/repo"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
template:
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccountName: gke-autopilot-gha-rs-kube-mode
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
      - name: pod-templates
        configMap:
          name: pod-templates
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command:
          - /home/runner/run.sh
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
            value: actions-runner-controller/0.8.3
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
pod-template.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      serviceAccountName: gke-autopilot-gha-rs-kube-mode
      securityContext:
        fsGroup: 1001
      containers:
        - name: $job # overwrites job container
          resources:
            requests:
              cpu: "3800m"
              memory: "4500"
rbac.yaml
---
# Source: gha-runner-scale-set/templates/kube_mode_serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
---
# Source: gha-runner-scale-set/templates/kube_mode_role.yaml
# default permission for runner pod service account in kubernetes mode (container hook)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
# Source: gha-runner-scale-set/templates/kube_mode_role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gke-autopilot-gha-rs-kube-mode
subjects:
  - kind: ServiceAccount
    name: gke-autopilot-gha-rs-kube-mode
    namespace: actions
---
Describe the bug
I can see that a runner pod is created, but it fails to create the job pod with the message "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed".
Describe the expected behavior
I expected it to create a job pod.
Additional Context
It works if I don't try to customize the job pod, i.e. if I use a config like the one below. However, I want to give more resources to the pod that actually runs the job, so I need to use pod-templates to customize it.
githubConfigUrl: "https://github.com/my/org"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
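One thing that may be worth checking, offered as a guess rather than a confirmed cause: the pod template under To Reproduce requests memory: "4500" with no unit suffix, and Kubernetes reads a suffixless memory quantity as plain bytes. The sketch below is a simplified converter for a subset of the quantity suffixes (it ignores fractional values, exponent notation and the larger P/E units). The name memory_to_bytes is hypothetical and not part of any Kubernetes or ARC tooling.

```python
import re

# Multipliers for a subset of the suffixes Kubernetes accepts on memory
# quantities. An empty suffix means the number is already in bytes.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert an integer memory quantity such as "4500Mi" to bytes.

    Hypothetical helper for illustration only; it handles whole numbers
    with the suffixes listed in _SUFFIXES and rejects everything else.
    """
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


if __name__ == "__main__":
    # Value from the pod template: interpreted as 4500 bytes.
    print(memory_to_bytes("4500"))
    # What the same number would mean with a mebibyte suffix.
    print(memory_to_bytes("4500Mi"))
    # The runner container's own request, for comparison.
    print(memory_to_bytes("1Gi"))
```

If the intent was roughly 4.5 GiB for the job container, a value such as "4500Mi" would express that, but I have not verified whether this is what makes the workflow pod fail.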
Controller Logs
No errors, just regular logs. I can provide them if required.
Runner Pod Logs
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "action": "Pre Job Hook",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "type": "runner",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stage": "Pre",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stepId": "06f9adc3-e79d-405b-91eb-a7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "stepContextName": "06f9adc3e79d405b91eba7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "result": "failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "errorMessages": [
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Process completed with exit code 1.",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] ],
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "executionTimeInSeconds": 42,
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "startTime": "2024-03-27T15:18:57.1056563Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "finishTime": "2024-03-27T15:19:38.206926Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "containerHookData": "{\"hookScriptPath\":\"/home/runner/k8s/index.js\"}"
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] }.
[WORKER 2024-03-27 15:19:38Z INFO StepsRunner] Update job result with current step result 'Failed'.
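To make the failure easier to spot in longer logs, here is a minimal sketch that pulls the errorMessages array out of a telemetry block like the one above. It assumes every line carries a "[WORKER ...]" prefix and that the excerpt contains exactly one telemetry object. The function name extract_error_messages and the shortened SAMPLE are illustrative and not part of the runner.

```python
import json
import re
from typing import List

# Matches the "[WORKER <timestamp> INFO <component>]" prefix on each line.
PREFIX = re.compile(r"^\[WORKER [^\]]+\]\s?")


def extract_error_messages(log_text: str) -> List[str]:
    """Return the errorMessages list from one step-telemetry block.

    Hypothetical helper: strips the per-line prefix, then parses the text
    between the first "{" and the last "}" as JSON.
    """
    body = "\n".join(PREFIX.sub("", line) for line in log_text.splitlines())
    start = body.find("{")
    end = body.rfind("}")
    if start == -1 or end < start:
        return []
    telemetry = json.loads(body[start:end + 1])
    return telemetry.get("errorMessages", [])


# Shortened excerpt of the runner pod log, kept in the same shape.
SAMPLE = r'''
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "action": "Pre Job Hook",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "result": "failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "errorMessages": [
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "Process completed with exit code 1."
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] ],
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] "containerHookData": "{\"hookScriptPath\":\"/home/runner/k8s/index.js\"}"
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] }.
[WORKER 2024-03-27 15:19:38Z INFO StepsRunner] Update job result with current step result 'Failed'.
'''

if __name__ == "__main__":
    for message in extract_error_messages(SAMPLE):
        print(message)
```

The messages only say that the workflow pod reached phase Failed, so the pod's own events and status are probably still needed to find the underlying reason.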