Closed
Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.11.0
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- have a scale set with a GitHub App token
- let the token expire
- the scale set listener pod crashes (correct)
- the controller deletes the listener pod
- the listener config secret is kept in place, with the old copy of the token
- a new listener pod is created referencing the new secret, but because the config secret already exists, it is not replaced
- the listener pod crashes again, because its token is still invalid
Describe the bug
So the lines after it never run either; those appear to be responsible for deleting the config secret.
Which results in the following when the token gets refreshed / replaced in the original secret (configured via `autoscalingrunnerset.spec.githubConfigSecret`):
```
$ kubectl -n github-runners get secret test-github-runners-55655b45-listener-config -o jsonpath="{.data['config\.json']}" | base64 -d | jq .token | md5sum
c4e76cd6a3e556f2348fa05590baf4a7  -
$ kubectl -n github-runners get secret test-github-runners-55655b45-listener -o jsonpath={.data.github_token} | base64 -d | md5sum
91ff6fda9734754b5fed52613bb15d13  -
```
Describe the expected behavior
I would expect that if I replace the token in the secret configured in `autoscalingrunnerset.spec.githubConfigSecret`, the listener pod will eventually run with the new token.
Additional Context
.
Controller Logs
https://gist.github.com/hsmade/9f21b38f9680b7c3a42f690ecb6ddf65
Runner Pod Logs
https://gist.github.com/hsmade/82d14b70299d878322e420c605733760
Activity
github-actions commented on Apr 9, 2025
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
hsmade commented on Apr 9, 2025
Maybe this is happening because the listener pod gets recreated by the reconciler between the return on line 292 and the next execution? So it never matches line 293 (`case kerrors.IsNotFound(err)`)?
hsmade commented on Apr 11, 2025
Hmm, no, it doesn't even get there. What I think happens is that the pod terminates because the container terminates and restartPolicy is set to Never. The Reconcile loop sees that, deletes the pod, and recreates it on the next loop. But because the config secret isn't deleted, it doesn't get recreated, so it still has the old values.
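The behavior described above can be modeled with a small sketch (illustrative Go only; the names and structure are hypothetical, not the actual ARC controller code). The point it demonstrates: the mirror secret tracks the source secret on every pass, but the config secret is only created when absent, so a stale token survives pod re-creation.

```go
package main

import "fmt"

// cluster is a toy model of the relevant state, not real Kubernetes objects.
type cluster struct {
	podExists    bool
	podFailed    bool
	configSecret string // token baked into the listener config secret
	mirrorSecret string // mirror of the githubConfigSecret token
}

// reconcile mimics one loop: a failed pod is deleted; on the next pass a
// new pod is created, but the config secret is only created if missing.
func reconcile(c *cluster, currentToken string) {
	if c.podExists && c.podFailed {
		c.podExists = false // controller deletes the crashed listener pod
		c.podFailed = false
		return
	}
	if !c.podExists {
		// mirror secret is kept in sync with the source secret...
		c.mirrorSecret = currentToken
		// ...but the config secret is only created when absent,
		// so a stale token survives pod re-creation.
		if c.configSecret == "" {
			c.configSecret = currentToken
		}
		c.podExists = true
	}
}

func main() {
	c := &cluster{}
	reconcile(c, "old-token") // initial creation
	c.podFailed = true        // token expires, listener crashes
	reconcile(c, "new-token") // controller deletes the pod
	reconcile(c, "new-token") // controller recreates the pod
	fmt.Println(c.configSecret, c.mirrorSecret) // prints: old-token new-token
}
```

This matches the md5sum evidence above: the two secrets end up holding different tokens.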
nikola-jokic commented on Apr 11, 2025
Hey @hsmade, if you want to update the token, you should probably update it for the autoscaling runner set as well. But if this secret gets updated, we should probably re-create the AutoscalingListener resource which should re-create the configuration and the listener pod, mirror secret etc.
Delete config secret when listener pod gets deleted
hsmade commented on Apr 11, 2025
Hi!
The autoscaling runner set has:
Which is the secret that is getting updated. But the changes don't get reflected in the listener.
What I find curious is that when the listener pod terminates, the pod gets deleted by the reconciler, just to get recreated with the exact same data. Including the config secret that isn't updated.
The mirror secret is actually updated automatically, and correctly.
What I suggest is that when the listener pod gets deleted, its config secret gets deleted as well.
This actually does fix the issue I'm seeing (tested this in a running environment).
PR #4033 is what I tested on my environment, and seems to fix the issue.
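The suggested change can be sketched as follows (illustrative Go, with hypothetical helper names; the real change lives in the controller's reconcile code): deleting the config secret together with the pod means both are recreated from the current source data.

```go
package main

import "fmt"

// state is a toy model of the listener pod and its config secret.
type state struct {
	podExists    bool
	configSecret string
}

// deleteListenerPod applies the suggested fix: drop the config secret
// together with the crashed pod so both get recreated.
func deleteListenerPod(s *state) {
	s.podExists = false
	s.configSecret = "" // proposed: delete the config secret as well
}

// createListenerPod recreates the config secret only when it is absent,
// mirroring the existing controller behavior.
func createListenerPod(s *state, token string) {
	if s.configSecret == "" {
		s.configSecret = token // now recreated with the current token
	}
	s.podExists = true
}

func main() {
	s := &state{}
	createListenerPod(s, "expired-token")
	deleteListenerPod(s) // listener crashed; pod and config secret removed
	createListenerPod(s, "refreshed-token")
	fmt.Println(s.configSecret) // prints: refreshed-token
}
```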
nikola-jokic commented on Apr 14, 2025
Hey @hsmade,
Thank you for raising the PR, but it doesn't fix the root of the problem. The listener could be exiting for various reasons, and re-creating the secret on each crash would put unnecessary pressure on the API server.
I think the appropriate fix is to track the spec hash, including the secret hash. Including both would re-create the entire `AutoscalingListener` instance, which would include mirroring the secret. Would you be interested in testing this change?
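The hash-tracking idea can be sketched like this (illustrative only; ARC's actual implementation and key names may differ). The combined hash covers both the listener spec and the source secret's data, so a token rotation alone is enough to trigger re-creation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// listenerHash hashes the listener spec together with the source secret's
// data. Keys are sorted so the hash is deterministic.
func listenerHash(specJSON []byte, secretData map[string][]byte) string {
	h := sha256.New()
	h.Write(specJSON)
	keys := make([]string, 0, len(secretData))
	for k := range secretData {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(secretData[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	spec := []byte(`{"runnerScaleSetName":"test-github-runners"}`)
	oldHash := listenerHash(spec, map[string][]byte{"github_token": []byte("expired")})
	newHash := listenerHash(spec, map[string][]byte{"github_token": []byte("refreshed")})
	// A token rotation alone changes the hash, triggering re-creation.
	fmt.Println(oldHash != newHash) // prints: true
}
```

Storing this hash on the resource (e.g. in an annotation) and comparing it on each reconcile would let the controller decide when to re-create the `AutoscalingListener`.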
hsmade commented on Apr 14, 2025
The code does seem to know when the mirror secret is out of sync. Maybe that should invalidate the config secrets as well:
I'm confused as to why `updateSecretsForListener` doesn't create the config secret, only the mirror secret. What's the purpose of having this mirror secret, actually?
nikola-jokic commented on Apr 23, 2025
Hey @hsmade,
I was playing around with the fix, and I was wondering: if you simply update the values.yaml file and re-apply the new secret with it, wouldn't that solve your problem? Instead of updating the secret directly, you would update it through the values.yaml file. We have mechanisms to detect when the values.yaml hash changes, and that should solve your problem.
hsmade commented on Apr 25, 2025
We're using Argo CD to deploy this. The token refresh happens because I'm using an external secrets operator that generates the token from the GitHub App, and the TTL for such a token is just 1 hour.
But in general, I'd say it would be good to support updating the primary token secret.
nikola-jokic commented on May 7, 2025
You were right, your approach was much better! I left a few comments on the PR. 😄