scaleset listener's config is not deleted on cleanup #4029

Closed
Description

@hsmade

Controller Version

0.11.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

- Have a scale set configured with a GHA token
- Let the token expire
- The scale set listener pod crashes (as expected)
- The controller deletes the listener pod
- The listener config secret is kept in place, still holding a copy of the old secret
- A new listener pod is created with the new secret as its context, but because the config secret already exists, it is not replaced
- The listener pod crashes again, because its token is still invalid

Describe the bug

The linked code is never reached (I don't see the logged line in my logs), so the lines after it also never run; they appear to be responsible for deleting the config secret.

This results in the following, once the token gets refreshed / replaced in the original secret (configured via autoscalingrunnerset.spec.githubConfigSecret):

```shell
$ kubectl -n github-runners get secret test-github-runners-55655b45-listener-config -o jsonpath="{.data['config\.json']}" | base64 -d | jq .token | md5sum
c4e76cd6a3e556f2348fa05590baf4a7  -
$ kubectl -n github-runners get secret test-github-runners-55655b45-listener -o jsonpath={.data.github_token} | base64 -d | md5sum
91ff6fda9734754b5fed52613bb15d13  -
```

Describe the expected behavior

I would expect that if I replace the token in the secret configured in autoscalingrunnerset.spec.githubConfigSecret, the listener pod will eventually run with the new token.

Additional Context

None.

Controller Logs

https://gist.github.com/hsmade/9f21b38f9680b7c3a42f690ecb6ddf65

Runner Pod Logs

https://gist.github.com/hsmade/82d14b70299d878322e420c605733760

Activity

Labels added on Apr 9, 2025: bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
github-actions commented on Apr 9, 2025


Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

hsmade commented on Apr 9, 2025


Maybe this is happening because the listener pod gets recreated by the reconciler between the return on line 292 and the next execution? So it never matches line 293 (case kerrors.IsNotFound(err))?

hsmade commented on Apr 11, 2025


Hmm, no, it doesn't even get there. What I think happens is that the pod terminates because the container terminates and restartPolicy is set to Never. The reconcile loop then sees that, deletes the pod, and recreates it on the next loop. But because the config secret isn't deleted, it doesn't get recreated, so it still holds the old values.
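The cycle above can be sketched as a toy simulation (this is illustrative only, not the controller's actual code; the types and functions here are made up): the pod is deleted and recreated, but the config secret is only created when it is absent, so it never picks up the rotated token.

```go
package main

import "fmt"

// cluster is a toy stand-in for the relevant Kubernetes state.
type cluster struct {
	pod          bool
	configSecret string // token baked into the listener config secret
}

// reconcile mimics the suspected behavior: recreate a missing pod, but
// create the config secret only if it does not already exist.
func reconcile(c *cluster, currentToken string) {
	if !c.pod {
		if c.configSecret == "" {
			// Config secret is created only when missing; an existing
			// (stale) one is left untouched.
			c.configSecret = currentToken
		}
		c.pod = true
	}
}

func main() {
	c := &cluster{}
	reconcile(c, "token-v1") // initial creation
	c.pod = false            // token expires, listener crashes, pod is deleted
	reconcile(c, "token-v2") // pod is recreated, but the config secret is not
	fmt.Println(c.configSecret) // still holds the stale token-v1
}
```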

nikola-jokic commented on Apr 11, 2025


Hey @hsmade, if you want to update the token, you should probably update it for the autoscaling runner set as well. But if this secret gets updated, we should probably re-create the AutoscalingListener resource, which would re-create the configuration, the listener pod, the mirror secret, etc.

added a commit that references this issue on Apr 11, 2025
hsmade commented on Apr 11, 2025


Hi!

The autoscaling runner set has:

```yaml
spec:
  githubConfigSecret: github-auth
```

This is the secret that is being updated, but the changes aren't reflected in the listener.
What I find curious is that when the listener pod terminates, the reconciler deletes the pod, only to recreate it with exactly the same data, including the config secret, which isn't updated.
The mirror secret, on the other hand, is updated automatically and correctly.

What I suggest is that when the listener pod gets deleted, its config secret is deleted as well.
This actually fixes the issue I'm seeing (I tested this in a running environment).

PR #4033 is what I tested in my environment, and it seems to fix the issue.
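The proposed fix can be sketched with the same kind of toy simulation (illustrative only; this is not the PR's actual diff, and all names here are made up): deleting the config secret together with the pod means the next reconcile rebuilds it from the current token.

```go
package main

import "fmt"

// state is a toy stand-in for the relevant Kubernetes objects.
type state struct {
	pod          bool
	configSecret string
}

// cleanupListenerPod models the proposed fix: when the listener pod is
// removed, drop its config secret as well.
func cleanupListenerPod(s *state) {
	s.pod = false
	s.configSecret = ""
}

// reconcile recreates a missing pod and creates the config secret only
// when it is absent, as in the suspected controller behavior.
func reconcile(s *state, currentToken string) {
	if !s.pod {
		if s.configSecret == "" {
			s.configSecret = currentToken
		}
		s.pod = true
	}
}

func main() {
	s := &state{}
	reconcile(s, "token-v1")
	cleanupListenerPod(s)    // crash handling now removes the config secret too
	reconcile(s, "token-v2") // the secret is rebuilt with the fresh token
	fmt.Println(s.configSecret)
}
```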

nikola-jokic commented on Apr 14, 2025


Hey @hsmade,

Thank you for raising the PR, but it doesn't fix the root of the problem. The listener can exit for various reasons, and re-creating the secret on every crash would put unnecessary pressure on the API server.
I think the appropriate fix is to track the spec hash, including the secret hash. Tracking both would re-create the entire AutoscalingListener instance, which would include mirroring the secret.

Would you be interested in testing this change?
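A minimal sketch of the hash-tracking idea (illustrative only: the controller uses its own helper, hash.ComputeTemplateHash, while this uses plain SHA-256, and the function and field names here are made up): combine a serialization of the spec with the secret data, and re-create the AutoscalingListener whenever the combined hash changes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// combinedHash hashes the serialized spec together with the secret data,
// so a change to either would trigger re-creation of the listener.
func combinedHash(spec string, secretData map[string][]byte) string {
	h := sha256.New()
	h.Write([]byte(spec))
	// Hash the secret keys in a stable order so the result is deterministic.
	keys := make([]string, 0, len(secretData))
	for k := range secretData {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(secretData[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	spec := `{"githubConfigUrl":"https://github.com/org/repo"}`
	oldHash := combinedHash(spec, map[string][]byte{"github_token": []byte("token-v1")})
	newHash := combinedHash(spec, map[string][]byte{"github_token": []byte("token-v2")})
	// A rotated token changes the combined hash, which would trigger re-creation.
	fmt.Println(oldHash != newHash)
}
```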

hsmade commented on Apr 14, 2025


The code does seem to know when the mirror secret is out of sync. Maybe that should invalidate the config secret as well:

```go
mirrorSecretDataHash := mirrorSecret.Labels["secret-data-hash"]
secretDataHash := hash.ComputeTemplateHash(secret.Data)
if mirrorSecretDataHash != secretDataHash {
	log.Info("Updating mirror listener secret for the listener pod", "mirrorSecretDataHash", mirrorSecretDataHash, "secretDataHash", secretDataHash)
	return r.updateSecretsForListener(ctx, secret, mirrorSecret, log)
}
```

I'm confused as to why updateSecretsForListener doesn't create the config secret, only the mirror secret.
What is the purpose of the mirror secret, actually?

nikola-jokic commented on Apr 23, 2025


Hey @hsmade,

I was playing around with the fix, and I was wondering: if you simply update the values.yaml file and re-apply the new secret with it, wouldn't that solve your problem? Instead of updating the secret directly, you would update it through the values.yaml file. We have mechanisms to detect when the values.yaml hash changes, and that should solve your problem.

hsmade commented on Apr 25, 2025


We're using Argo CD to deploy this. The token refresh happens because I'm using an external secrets operator that generates the token from the GitHub App, and the TTL for such a token is only 1 hour.
But in general, I'd say it would be good to support updating the primary token secret, no?

nikola-jokic commented on May 7, 2025


You were right, your approach was much better! I left a few comments on the PR. 😄
