workflow container pod fails immediately upon unscheduled status waiting for node to provision #173

Open
@jonathan-fileread

Description

GHA jobs fail instantly if the workflow pod is unschedulable because it is waiting for another node to become available (for example, when the CPU/memory resource request is high and the node autoscaler still has to provision a node).

[Screenshot 2024-07-05 at 5:13:03 PM]

Warning OutOfcpu pod/arc-runner-set-productdev-b8rwm-runner-b5m9z-workflow Node didn't have enough resource: cpu, requested: 3000, used: 6560, capacity: 7820
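The numbers in the event above can be checked with a little arithmetic. The following is a minimal sketch, assuming the three values are CPU millicores (consistent with the "3000m" request in the pod template); the function name `cpu_fits` is made up for illustration and is not part of ARC or Kubernetes.

```python
def cpu_fits(requested_m: int, used_m: int, capacity_m: int) -> bool:
    """Return True if a new request fits into the node's remaining CPU.

    All values are assumed to be millicores, as in the OutOfcpu event.
    """
    return used_m + requested_m <= capacity_m


# Values copied from the OutOfcpu event.
requested_m, used_m, capacity_m = 3000, 6560, 7820

free_m = capacity_m - used_m        # CPU still unallocated on the node
shortfall_m = requested_m - free_m  # how much more the pod would need

print(f"free: {free_m}m, shortfall: {shortfall_m}m")
print("fits:", cpu_fits(requested_m, used_m, capacity_m))
```

With only 1260m free, the 3000m request cannot fit on that node, so the pod can only run once the current load drops or a new node is provisioned.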

What should preferably happen:

There should be a timeout field, either in the runner set or in the container hooks pod template, that lets the workflow pod wait up to x minutes to be scheduled once another node becomes available.
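To make the requested behavior concrete, here is a rough sketch of "wait up to a timeout while the pod is pending" logic. It is not how the container hooks are implemented and does not call any Kubernetes API: `wait_until_running`, `get_phase`, and the default poll interval are all hypothetical, and the caller is assumed to supply a function that returns the pod's current phase ("Pending", "Running", "Failed", and so on).

```python
import time
from typing import Callable


def wait_until_running(
    get_phase: Callable[[], str],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a pod phase until it is Running or the timeout expires.

    get_phase is a caller-supplied (hypothetical) function that returns
    the current pod phase. Returns True once the pod is Running, and
    False if the timeout is reached or the pod leaves Pending for any
    phase other than Running.
    """
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase == "Running":
            return True
        if phase != "Pending":
            # Failed, Succeeded, or Unknown: waiting longer will not help.
            return False
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Example: the pod is pending for two polls, then gets scheduled and starts.
phases = iter(["Pending", "Pending", "Running"])
print(wait_until_running(lambda: next(phases), timeout_s=60, sleep=lambda s: None))
```

The `sleep` and `clock` parameters are injected only so the loop can be exercised without real waiting; a timeout field as proposed in this issue would map to `timeout_s`.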

To Reproduce

1. Install the ARC controller and runner set, version 0.9.2.
2. Define ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE pointing at the pod template, and set containerMode type to "kubernetes".
3. Define a pod template like this:

apiVersion: v1
kind: ConfigMap
metadata:
  # Must match the configMap name mounted by the runner pod
  name: runner-pod-template
data:
  default.yml: |
    "apiVersion": "v1"
    "kind": "PodTemplate"
    "metadata":
      "name": "runner-pod-template"
    "spec":
      "containers":
      - "name": "$job"
        "resources":
          "limits":
            "cpu": "3000m"
          "requests":
            "cpu": "3000m"

Additional Context

template:
  spec:
    initContainers:
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            sudo chown -R 1001:123 /home/runner/_work
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    securityContext:
      fsGroup: 123 ## needed to resolve permission issues with mounted volume. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yml
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "false"  ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
        volumeMounts:
        - name: pod-templates
          mountPath: /home/runner/pod-templates
          readOnly: true
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: "managed-csi"
              resources:
                requests:
                  storage: ${local.volume_claim_size}
      - name: pod-templates
        configMap:
          name: "runner-pod-template"
    

containerMode:
  type: "kubernetes"  ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    # For local testing, use https://github.com/openebs/dynamic-localpv-provisioner/blob/develop/docs/quickstart.md to provide dynamic provision volume with storageClassName: openebs-hostpath
    storageClassName: "managed-csi"
    resources:
      requests:
        storage: 50Gi

