Job pod failed to start on GKE Autopilot with container hooks (kubernetes mode) #152

Open
@knkarthik

Description

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

runner-scale-set-values.yaml

githubConfigUrl: "https://github.com/my/repo"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
template:
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccountName: gke-autopilot-gha-rs-kube-mode
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
      - name: pod-templates
        configMap:
          name: pod-templates
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command:
          - /home/runner/run.sh
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT
            value: actions-runner-controller/0.8.3
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true

pod-template.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-templates
data:
  default.yaml: |
    ---
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        annotated-by: "extension"
      labels:
        labeled-by: "extension"
    spec:
      serviceAccountName: gke-autopilot-gha-rs-kube-mode
      securityContext:
        fsGroup: 1001
      containers:
        - name: $job # overwrites job container
          resources:
            requests:
              cpu: "3800m"
              memory: "4500"

rbac.yaml

---
# Source: gha-runner-scale-set/templates/kube_mode_serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions

---
# Source: gha-runner-scale-set/templates/kube_mode_role.yaml
# default permission for runner pod service account in kubernetes mode (container hook)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
# Source: gha-runner-scale-set/templates/kube_mode_role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gke-autopilot-gha-rs-kube-mode
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gke-autopilot-gha-rs-kube-mode
subjects:
  - kind: ServiceAccount
    name: gke-autopilot-gha-rs-kube-mode
    namespace: actions
---

Describe the bug

I can see that a runner pod is created, but it fails to create the job pod; the hook fails with the message "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed".

Describe the expected behavior

I expected it to create a job pod.

Additional Context

It works if I don't try to customize the job pod, i.e. if I use a config like the one below. But I want to give more resources to the pod that actually runs the job, so I need to use a pod template to customize it (see the sketch after the working values below).

githubConfigUrl: "https://github.com/my/org"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
    
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller
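
A possible middle ground, assuming the chart merges user-supplied env vars and volumes with the kubernetes containerMode defaults (worth verifying for the chart version in use): keep containerMode: kubernetes from the working values above and only layer the hook template on top, instead of re-declaring the whole kubernetes-mode setup by hand. A hedged sketch:

githubConfigUrl: "https://github.com/my/org"
githubConfigSecret: github-token
runnerScaleSetName: "gke-autopilot"
maxRunners: 2
minRunners: 0
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Point the container hook at the mounted pod template; assumes the
          # chart merges this with the env vars it injects for kubernetes mode.
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
    volumes:
      - name: pod-templates
        configMap:
          name: pod-templates
controllerServiceAccount:
  namespace: actions
  name: gha-runner-scale-set-controller-gha-rs-controller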

Controller Logs

No errors, just regular logs. I can provide them if required.

Runner Pod Logs

[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] Publish step telemetry for current step {
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "action": "Pre Job Hook",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "type": "runner",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stage": "Pre",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stepId": "06f9adc3-e79d-405b-91eb-a7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "stepContextName": "06f9adc3e79d405b91eba7f72f1e56c4",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "result": "failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "errorMessages": [
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Error: pod failed to come online with error: Error: Pod gke-autopilot-4vvrh-runner-74czb-workflow is unhealthy with phase status Failed",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Process completed with exit code 1.",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]     "Executing the custom container implementation failed. Please contact your self hosted runner administrator."
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   ],
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "executionTimeInSeconds": 42,
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "startTime": "2024-03-27T15:18:57.1056563Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "finishTime": "2024-03-27T15:19:38.206926Z",
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext]   "containerHookData": "{\"hookScriptPath\":\"/home/runner/k8s/index.js\"}"
[WORKER 2024-03-27 15:19:38Z INFO ExecutionContext] }.
[WORKER 2024-03-27 15:19:38Z INFO StepsRunner] Update job result with current step result 'Failed'.
