RunnerSet does not always re-use Available PV #3221

Open
@chaosun-abnormalsecurity

Description

Checks

Controller Version

0.27.0

Helm Chart Version

0.22.0

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runner
  namespace: cicd--ci
spec:
  dockerEnabled: true
  ephemeral: true
  group: Default
  labels:
  - ci
  replicas: 3
  repository: <REPOSITORY>
  selector:
    matchLabels:
      app: ci
  serviceName: gha-runner
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-logs-container: runner
      labels:
        app: ci
    spec:
      containers:
      - env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_ALLOW_RUNASROOT
          value: "1"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "110"
        - name: STARTUP_DELAY_IN_SECONDS
          value: "30"
        name: runner
        resources:
          limits:
            cpu: "1.8"
            memory: 7Gi
          requests:
            cpu: "1.5"
            memory: 6Gi
      - name: docker
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker
      securityContext:
        fsGroup: 1001
      serviceAccountName: gha-runner
  volumeClaimTemplates:
  - metadata:
      name: docker
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi

To Reproduce

1. Deploy the RunnerSet and let it run normally
2. Observe that the number of `Available` PVs grows quickly over time (we reached 4.5k in 1 month)

Describe the bug

  1. We noticed that the number of Available PVs grows quickly over time and reached 4.5k in 1 month. This indicates the RunnerSet is not re-using PVs properly
  2. We also noticed that some PVs are indeed being re-used, e.g. a Runner that was created 10m ago is using a PV that is 18d old. But the majority of Runners just spin up new volumes
  3. We use a custom runner image built on top of docker.io/summerwind/actions-runner. The only difference is that we installed a few additional libraries and binaries (e.g. kubectl, helm, aws cli). We are not using a custom entrypoint
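
For anyone trying to confirm the same symptom, the observations above can be roughly quantified from the output of `kubectl get pv -o json`. The sketch below is only an illustration, not part of ARC: the function names and the `pvs.json` file name are hypothetical, and it assumes the standard PersistentVolume fields `status.phase` and `metadata.creationTimestamp`.

```python
import json
import sys
from collections import Counter
from datetime import datetime, timezone


def count_pv_phases(pv_list: dict) -> Counter:
    """Count PersistentVolumes per status.phase in a `kubectl get pv -o json` document."""
    return Counter(
        item.get("status", {}).get("phase", "Unknown")
        for item in pv_list.get("items", [])
    )


def pv_age_days(pv: dict, now: datetime) -> float:
    """Age of one PV in days, computed from metadata.creationTimestamp (UTC, 'Z' suffix)."""
    created = datetime.strptime(
        pv["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (now - created).total_seconds() / 86400


# Only runs when a file path is given, e.g. `python pv_report.py pvs.json`
# (both file names are hypothetical).
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        pvs = json.load(fh)
    now = datetime.now(timezone.utc)
    print("PVs per phase:", dict(count_pv_phases(pvs)))
    available = [
        p for p in pvs.get("items", [])
        if p.get("status", {}).get("phase") == "Available"
    ]
    if available:
        oldest = max(pv_age_days(p, now) for p in available)
        print(f"Oldest Available PV: {oldest:.1f} days")
```

To use it, dump the PVs with `kubectl get pv -o json > pvs.json` and pass that file as the first argument. A large and growing `Available` count next to a small `Bound` count would match what we describe here.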

Describe the expected behavior

As described in the doc and discussion, ARC should maintain a pool of persistent volumes to be re-used by Runners, instead of provisioning new ones for most of the Runners.

Whole Controller Logs

https://gist.github.com/chaosun-abnormalsecurity/4d92b87f3807fcbaa279e1099200d20e

Whole Runner Pod Logs

https://gist.github.com/chaosun-abnormalsecurity/4879c98298f992698ee6824c9a2d4bb6

Additional Context

No response

Metadata

Assignees

No one assigned

Labels

  • bug: Something isn't working
  • community: Community contribution
  • needs triage: Requires review from the maintainers
