Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.11
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Start from a healthy AWS EKS cluster running ARC 0.10.1 with scale sets.
2. Uninstall the scale sets.
3. Uninstall ARC.
4. Manually delete the GitHub Actions CRDs.
5. Install ARC 0.11.
6. Install the scale sets.
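The steps above can be sketched with Helm. The release names and namespaces (`arc`, `arc-runner-set`, `arc-systems`, `arc-runners`) are assumptions for illustration, not taken from this report; the OCI chart locations are the officially provided ones.

```shell
# 2-3. Uninstall the scale set, then the controller
#      (release/namespace names below are assumed placeholders)
helm uninstall arc-runner-set -n arc-runners
helm uninstall arc -n arc-systems

# 4. Manually delete the GitHub Actions CRDs left behind
kubectl delete crd \
  autoscalinglisteners.actions.github.com \
  autoscalingrunnersets.actions.github.com \
  ephemeralrunners.actions.github.com \
  ephemeralrunnersets.actions.github.com

# 5-6. Install ARC 0.11 and the scale set from the official charts
helm install arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.11.0 -n arc-systems
helm install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.11.0 -n arc-runners -f values.yaml
```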
Describe the bug
After upgrading, the scale-set listener pods fail to start with:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "309c3d6cd193758149c8f8354fe4e94fa8c4284938d73e7886d605134134cc6a": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```
We have around 500 IPs available across 2 subnets, and the node is a t3a.2xlarge running around 25 pods out of its 58-pod capacity limit.

There is no EKS issue or misconfiguration; 0.10.1 has been working well. Also, after uninstalling 0.11.0 (removing the CRDs) and restoring 0.10.1, all is well again.
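The IP-headroom claim can be verified with commands like the following sketch; the subnet IDs and node name are hypothetical placeholders.

```shell
# Free IPs per subnet (subnet IDs are placeholders)
aws ec2 describe-subnets --subnet-ids subnet-aaa subnet-bbb \
  --query 'Subnets[].{Id:SubnetId,FreeIPs:AvailableIpAddressCount}'

# Pod capacity vs. pods actually scheduled on the node (name is a placeholder)
kubectl describe node <node-name> | grep -A6 'Allocatable'
kubectl get pods -A --field-selector spec.nodeName=<node-name> --no-headers | wc -l
```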
Describe the expected behavior
The scale sets should come up healthy after upgrading to 0.11.
Additional Context
```yaml
githubConfigUrl: <>
githubConfigSecret: <>
maxRunners: 25
minRunners: 0
runnerScaleSetName: "small-set"
listenerTemplate:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: /metrics
      prometheus.io/port: "8080"
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    priorityClassName: ${arc_priority_class}
    nodeSelector:
      pool: ${arc_pool}
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
# runner template
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    serviceAccountName: ${runners_sa}
    priorityClassName: ${arc_priority_class}
    nodeSelector:
      pool: ${runner_pool}
    initContainers:
      - name: init-dind-externals
        image: ${ecr_runners_repo}:runner-${custom_img}
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    containers:
      - name: runner
        image: ${ecr_runners_repo}:runner-${custom_img}
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: dind
        image: ${ecr_runners_repo}:dind-${custom_img}
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
```
Controller Logs
There are no relevant logs in the controller; the only error is the crash-loop reason reported for the scale-set listener pods:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "309c3d6cd193758149c8f8354fe4e94fa8c4284938d73e7886d605134134cc6a": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
```

There are no related logs in the aws-node DaemonSet pods either.
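The observations above can be reproduced with commands like the following sketch; the `arc-systems` namespace and the pod name are assumed placeholders.

```shell
# Listener pod status and the FailedCreatePodSandBox event
kubectl -n arc-systems get pods
kubectl -n arc-systems describe pod <listener-pod-name> | tail -n 20

# aws-node (Amazon VPC CNI) DaemonSet logs: nothing related to the failure
kubectl -n kube-system logs -l k8s-app=aws-node --tail=100
```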
Runner Pod Logs
This isn't about the runners: only the scale-set listener pods fail to start on 0.11, on a healthy AWS EKS (1.31) cluster.