
Listener not aware of pending work after restart/reinstall #4027


Open
katarzynainit opened this issue Apr 9, 2025 · 7 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

Comments

@katarzynainit

katarzynainit commented Apr 9, 2025

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

1. I have an ARC setup working fine
2. I uninstall runner set X
3. I trigger a workflow that waits for a runner from set X
4. I install runner set X again (no changes to the runner set configuration)
5. The workflow waits forever
6. After clicking Cancel/Re-run on the workflow, the listener sees new work and everything works fine again

It also happens in the following scenario (a rough command sketch of the uninstall/reinstall flow is included after this list):

1. I have an ARC setup working fine
2. I update runner set X; a new listener is created
3. I trigger a workflow that waits for a runner from set X
4. The new (updated) listener comes up
5. The workflow waits forever
6. After clicking Cancel/Re-run on the workflow, the listener sees new work and everything works fine again
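
For reference, a rough command sketch of the uninstall/reinstall flow from the first list above, assuming the runner set was installed from the official OCI chart; the release name, namespace, workflow file, and repository below are placeholders, not the actual values from this report:

# 1. Runner set X is installed and working
helm list -n arc-runners

# 2. Uninstall runner set X (its listener pod is removed with it)
helm uninstall arc-runner-set -n arc-runners

# 3. Trigger a workflow that targets the scale set while the listener is offline
gh workflow run test.yml --repo ORG/REPO

# 4. Reinstall runner set X with the same values
helm install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  -n arc-runners -f values.yaml

# 5./6. The workflow stays queued; the reinstalled listener does not request a
#       runner until the job is cancelled and re-run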

Describe the bug

The listener is not aware of the pending work and the runners needed if that work was scheduled while the listener was offline.
After cancelling/re-running the job, everything works fine.

Describe the expected behavior

The listener should pick up the work once it is online, regardless of whether the work was scheduled while the listener was up or not.

Additional Context

Listener logs:

{"severity":"info","ts":"2025-04-09T09:31:33Z","logger":"listener-app","message":"app initialized"}
{"severity":"info","ts":"2025-04-09T09:31:33Z","logger":"listener-app","message":"Starting listener"}
{"severity":"info","ts":"2025-04-09T09:31:33Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/REDACTED"}
{"severity":"info","ts":"2025-04-09T09:31:33Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/REDACTED"}
{"severity":"info","ts":"2025-04-09T09:31:33Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/REDACTED/actions/runners/registration-token"}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.listener","message":"Current runner scale set statistics.","statistics":"{\"totalAvailableJobs\":0,\"totalAcquiredJobs\":0,\"totalAssignedJobs\":0,\"totalRunningJobs\":0,\"totalRegisteredRunners\":0,\"totalBusyRunners\":0,\"totalIdleRunners\":0}"}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.worker.kubernetesworker","message":"Calculated target runner count","assigned job":0,"decision":0,"min":0,"max":3,"currentRunnerCount":0,"jobsCompleted":0}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.worker.kubernetesworker","message":"Compare","original":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}","patch":"{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"patchID\":0,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.worker.kubernetesworker","message":"Preparing EphemeralRunnerSet update","json":"{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"REDACTED","name":"REDACTED","replicas":0}
{"severity":"info","ts":"2025-04-09T09:31:34Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":0}

I use a fork with no changes to the listener code.
The newest runner image is in use.

Controller Logs

Controller logs look normal; only the listener is not picking up work.

Runner Pod Logs

N/A
katarzynainit added the bug, gha-runner-scale-set, and needs triage labels on Apr 9, 2025
@nikola-jokic
Collaborator

Hey, @katarzynainit,

If you schedule another workflow, the listener should pick up the ones already in the queue as well. It shouldn't be a problem for busy scale sets, but it is an issue on the back-end side; a fix should be live soon.

I'll keep this issue open for visibility, and I will notify you when the fix is done.

nikola-jokic removed the needs triage label on Apr 14, 2025
@katarzynainit
Author

Thank you, awaiting your update :)

@WyriHaximus
Contributor

If you schedule another workflow, the listener should pick up the ones already in the queue as well. It shouldn't be a problem for busy scale sets, but it is an issue on the back-end side; a fix should be live soon.

I'll keep this issue open for visibility, and I will notify you when the fix is done.

Just to add additional context here: I saw the same on 0.11.0 over the weekend. And indeed, most of the time scheduling another job runs the previous one but not the new one. This feels somewhat related to #4013.

@nikola-jokic
Collaborator

nikola-jokic commented Apr 15, 2025

Hey @WyriHaximus,

Could you also send the logs so we can correlate what was going on with your scale set, as well as with @katarzynainit's? That way, we can inspect both situations and see whether they share any similarities in how the job message is built up.

And @katarzynainit, in the log you provided, you redacted the information related to the org and the scale set. If this information needs to stay private, please reach out to support, because we will need it to see the activity on your scale set and debug it properly. Thanks!

@WyriHaximus
Contributor

@nikola-jokic Will do when I get home. Until then, I can give you the queue graph I've been using for all my runners (it runs the query sum(gha_assigned_jobs) - sum(gha_running_jobs)):

[Image: queue graph (assigned minus running jobs) across all runners]

And this is from the private repo wyrimaps.net-static-maps-serverless on my personal user WyriHaximus, for the chaos-ng scale set, where I am expecting available jobs to be non-zero:
[Image: available jobs graph for the chaos-ng scale set]

Jobs have been queued for 12 hours at this point.
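
For anyone without such a dashboard, a rough sketch of reading the same two listener metrics directly, assuming listener metrics are enabled and exposed on :8080/metrics and that the listener pod carries the default ARC labels; the namespace and pod name below are placeholders:

# Find the listener pod (it runs in the controller's namespace)
kubectl get pods -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener

# Forward its metrics port locally
kubectl port-forward -n arc-systems pod/<listener-pod> 8080:8080 &

# Compare assigned vs. running jobs; a persistent gap with no runner activity
# matches the behaviour described in this issue
curl -s http://localhost:8080/metrics | grep -E '^gha_(assigned|running)_jobs'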

@katarzynainit
Author

And @katarzynainit, in the log you provided, you redacted the information related to the org and the scale set. If this information needs to stay private, please reach out to support, because we will need it to see the activity on your scale set and debug it properly. Thanks!

@nikola-jokic I have provided full logs via our support channel.

Adding additional findings to the ticket (see the command sketch after this list):

  • if the runner set is fully uninstalled before the job is triggered, it seems that after reinstallation the work starts about a minute after the listener is up (this was not the case a week ago; now it works)
  • if there are running jobs, then on helm uninstall the controller removes the listener and waits for all runners to finish. After the full uninstall I run helm install, and the listener is not aware of the pending work until another job is triggered.
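
A rough way to confirm the state after the reinstall, assuming the default ARC custom resource names and placeholder namespaces (arc-runners for the runner set, arc-systems for the controller and listener):

# The scale set and its ephemeral runner set should exist again after the reinstall
kubectl get autoscalingrunnersets,ephemeralrunnersets -n arc-runners

# The listener runs in the controller's namespace; follow its log and check whether
# "Getting next message" ever returns the job that was queued while it was offline
kubectl get autoscalinglisteners -n arc-systems
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener -f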

@WyriHaximus
Contributor

Here are the logs, @nikola-jokic: chaos.log. It looks like the listener restarted two and a half hours ago but didn't process anything.
