RunnerSet does not always re-use Available PV #3221

Open

chaosun-abnormalsecurity opened this issue Jan 11, 2024 · 3 comments
Labels
bug Something isn't working community Community contribution needs triage Requires review from the maintainers

Comments

@chaosun-abnormalsecurity

Controller Version

0.27.0

Helm Chart Version

0.22.0

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions; it may also be a good idea to contact contributors or maintainers if your business is critical and you need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runner
  namespace: cicd--ci
spec:
  dockerEnabled: true
  ephemeral: true
  group: Default
  labels:
  - ci
  replicas: 3
  repository: <REPOSITORY>
  selector:
    matchLabels:
      app: ci
  serviceName: gha-runner
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-logs-container: runner
      labels:
        app: ci
    spec:
      containers:
      - env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_ALLOW_RUNASROOT
          value: "1"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "110"
        - name: STARTUP_DELAY_IN_SECONDS
          value: "30"
        name: runner
        resources:
          limits:
            cpu: "1.8"
            memory: 7Gi
          requests:
            cpu: "1.5"
            memory: 6Gi
      - name: docker
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker
      securityContext:
        fsGroup: 1001
      serviceAccountName: gha-runner
  volumeClaimTemplates:
  - metadata:
      name: docker
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi

To Reproduce

1. Deploy the RunnerSet and let it run normally
2. Watch the number of `Available` PVs grow quickly over time; we reached 4.5k in one month (see the command sketch below)
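A minimal sketch for observing the growth, assuming `kubectl` access to the cluster; the leaked volumes are the ones stuck in the `Available` phase:

```shell
# Count PVs by phase; a steadily growing "Available" count points at
# volumes that were released but never re-bound to a new runner pod.
kubectl get pv --no-headers | awk '{print $5}' | sort | uniq -c

# List the Available PVs oldest-first to see how long they have sat idle.
kubectl get pv --sort-by=.metadata.creationTimestamp | grep ' Available '
```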

Describe the bug

  1. We noticed that the number of `Available` PVs grows quickly over time and reached 4.5k in one month. This indicates the RunnerSet is not re-using PVs properly.
  2. We also noticed that some PVs are indeed being re-used, e.g. a Runner created 10m ago was using a PV that is 18d old. The majority of Runners, however, spin up new volumes (see the sketch below).
  3. We use a custom runner image built on top of docker.io/summerwind/actions-runner. The only difference is that we installed a few additional libraries and binaries (e.g. kubectl, helm, the AWS CLI); we are not using a custom entrypoint.
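To tell whether a particular Runner re-used an existing volume or triggered a fresh provision, the age of the PV behind the pod's `docker` PVC can be compared with the age of the pod itself. A minimal sketch, assuming the StatefulSet naming convention `<volumeClaimTemplate name>-<pod name>`; the pod name below is a placeholder:

```shell
POD=<runner-pod-name>        # substitute a real runner pod name
NS=cicd--ci                  # namespace from the RunnerSet above

# PVC created from the "docker" volumeClaimTemplate, and the PV bound to it.
PVC="docker-${POD}"
PV=$(kubectl -n "${NS}" get pvc "${PVC}" -o jsonpath='{.spec.volumeName}')

# A PV much older than the pod was re-used; one created at roughly the
# same time as the pod was freshly provisioned.
kubectl -n "${NS}" get pod "${POD}" -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
kubectl get pv "${PV}" -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
```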

Describe the expected behavior

As described in the documentation and prior discussion, ARC should maintain a pool of persistent volumes that Runners re-use, instead of provisioning new ones for most Runners.
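For reference, a volume that is eligible for re-use should show up as `Available` with its claim reference cleared; a quick spot-check on one of the accumulated PVs (the name is a placeholder) could look like this:

```shell
PV=<pv-name>   # substitute one of the Available PVs from `kubectl get pv`
kubectl get pv "${PV}" -o jsonpath='{.status.phase}{"  "}{.spec.claimRef}{"\n"}'
```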

Whole Controller Logs

https://gist.github.com/chaosun-abnormalsecurity/4d92b87f3807fcbaa279e1099200d20e

Whole Runner Pod Logs

https://gist.github.com/chaosun-abnormalsecurity/4879c98298f992698ee6824c9a2d4bb6

Additional Context

No response

@chaosun-abnormalsecurity chaosun-abnormalsecurity added bug Something isn't working community Community contribution needs triage Requires review from the maintainers labels Jan 11, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@waveofmymind

I ran into this problem today as well: a new PV is created even though an `Available` PV exists.
I'm wondering whether it simply takes time for the PV to be unbound and become available again, and if not, whether this is a bug.
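A quick way to check whether it is just a timing issue is to watch the PV phases after a runner pod finishes, and see whether (and how fast) its volume transitions from `Released` back to `Available`; a sketch:

```shell
# Watch PV phases live; after a runner pod is deleted its volume should go
# Bound -> Released -> Available once the claim reference is cleared.
kubectl get pv -w \
  -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,CLAIM:.spec.claimRef.name'
```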


rdepres commented Jan 19, 2024

I believe this issue is a duplicate of #2282.
