
Scale-set listeners fail to start after 0.11 upgrade -- AWS CNI failed to assign an IP address to container #3998

Closed
@mattpopa

Description


Controller Version

0.11

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. On a healthy, running AWS EKS cluster with ARC 0.10.1 and scale-sets,
2. uninstall the scale-sets
3. uninstall ARC
4. manually delete the GitHub Actions CRDs
5. install ARC 0.11
6. install the scale-sets (a rough command sketch follows this list)
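
Roughly the commands used, for completeness. The release names (arc, arc-runner-set), namespaces, and values file name are assumptions here; the OCI chart locations are the ones documented for ARC:

helm uninstall arc-runner-set --namespace arc-runners
helm uninstall arc --namespace arc-systems

# the actions.github.com CRDs are not removed by helm uninstall, so drop them manually
kubectl get crd -o name | grep actions.github.com | xargs kubectl delete

helm install arc oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --namespace arc-systems --version 0.11.0

helm install arc-runner-set oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners --version 0.11.0 -f values.yaml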

Describe the bug

After upgrading, the scale-set listener pods fail to start with:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "309c3d6cd193758149c8f8354fe4e94fa8c4284938d73e7886d605134134cc6a": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

We have around 500 IPs available across the 2 subnets, and the node is a t3a.2xlarge running around 25 pods out of its 58-pod maximum.
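
(For reference, 58 is the standard AWS VPC CNI max-pods value for a t3a.2xlarge: 4 ENIs * (15 - 1) secondary IPs per ENI + 2 = 58.)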

There is no EKS issue or misconfiguration; 0.10.1 has been working well. Also, after uninstalling 0.11.0 (and removing its CRDs) and restoring 0.10.1, everything works again.

Describe the expected behavior

The scale-sets should come up healthy after upgrading to 0.11.

Additional Context

githubConfigUrl: <>
githubConfigSecret: <>

maxRunners: 25
minRunners: 0

runnerScaleSetName: "small-set"

listenerTemplate:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: /metrics
      prometheus.io/port: "8080"
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    priorityClassName: ${arc_priority_class}
    nodeSelector:
      pool: ${arc_pool}
    containers:
    - name: listener
      securityContext:
        runAsUser: 1000

# runner template
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    serviceAccountName: ${runners_sa}
    priorityClassName: ${arc_priority_class}
    nodeSelector:
      pool: ${runner_pool}
    initContainers:
    - name: init-dind-externals
      image: ${ecr_runners_repo}:runner-${custom_img}
      command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
      volumeMounts:
        - name: dind-externals
          mountPath: /home/runner/tmpDir
    containers:
    - name: runner
      image: ${ecr_runners_repo}:runner-${custom_img}
      command: ["/home/runner/run.sh"]
      env:
        - name: DOCKER_HOST
          value: unix:///run/docker/docker.sock
      resources:
        requests:
          memory: "2Gi"
          cpu: "1000m"
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
          readOnly: true
    - name: dind
      image: ${ecr_runners_repo}:dind-${custom_img}
      args:
        - dockerd
        - --host=unix:///run/docker/docker.sock
        - --group=$(DOCKER_GROUP_GID)
      env:
        - name: DOCKER_GROUP_GID
          value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
        - name: dind-externals
          mountPath: /home/runner/externals
      resources:
        requests:
          memory: "2Gi"
          cpu: "1000m"
    volumes:
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}
    - name: dind-externals
      emptyDir: {}

Controller Logs

There are no relevant logs in the controller; the only error is the event on the scale-set listener pods explaining why they fail to start:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "309c3d6cd193758149c8f8354fe4e94fa8c4284938d73e7886d605134134cc6a": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

There are no related logs in the aws-node DaemonSet pods either.
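
For reference, this is roughly how the CNI side was checked (k8s-app=aws-node is the default label for the VPC CNI DaemonSet; ipamd also writes logs on the node itself):

kubectl logs -n kube-system -l k8s-app=aws-node -c aws-node --tail=200
# more detail on the node:
#   /var/log/aws-routed-eni/ipamd.log
#   /var/log/aws-routed-eni/plugin.log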

Runner Pod Logs

This isn't about the runners; only the scale-set listeners fail to start on 0.11, on an otherwise healthy AWS EKS (1.31) cluster.


Labels: bug, gha-runner-scale-set, needs triage
