Scale Set Listener Stops Responding #3204
Comments
Hey @jameshounshell, is it possible that there are network interruptions causing problems? In version 0.8.0 we added client timeouts to the retriable client, which could help determine whether there are any network issues. Just keep in mind that we are missing metrics on the newest listener app (they will be part of 0.8.2; to enable metrics in the meantime, you can use the old listener app as described here). |
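To illustrate what client timeouts buy you here, below is a generic Go sketch (not ARC's retriable client; the function name and all durations are illustrative assumptions): with explicit timeouts configured, a stalled or dropped connection surfaces as an error that can be logged and retried instead of hanging a long-poll indefinitely.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTimeoutClient builds an *http.Client with explicit connection and
// request timeouts. All durations below are illustrative, not ARC's values.
func newTimeoutClient() *http.Client {
	return &http.Client{
		// Upper bound on a whole request/response round trip.
		Timeout: 5 * time.Minute,
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   30 * time.Second, // TCP connect timeout
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 1 * time.Minute, // server must start responding within this window
			IdleConnTimeout:       90 * time.Second,
		},
	}
}

func main() {
	client := newTimeoutClient()
	resp, err := client.Get("https://api.github.com")
	if err != nil {
		// A network interruption shows up here instead of a silent hang.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```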
We can sit tight and wait for 0.8.2 and report back on this issue. |
I am seeing this exact same problem with our listener pod. I can also confirm it appears to be related to the token refresh. It's always the last thing logged by the pod before it stops responding. |
Hey @jasonluck, could you also please post the log so we can inspect it? It is always better to have more evidence of an issue to help us find the root cause. |
I am going to close this issue until we hear back from you. Please let us know if the issue is resolved; if not, we can always re-open it. |
Now that we've updated to the new version, here are the logs that I collected this morning:
|
@nikola-jokic Please re-open this ticket :) |
We are on 0.8.1 (with the LISTENER_ENTRYPOINT workaround), and I think we saw this today: one of our four listeners just stopped responding. The controller finalised and removed its remaining runner pods as they finished their jobs, but didn't produce any errors or otherwise appear to notice. About 45 minutes later a node scale-down evicted the controller, and it all came back to life after that. We also saw a token refresh just before this happened. Listener logs:
We have been using the current version for about a month without seeing it until today, but we are on EKS and upgraded from 1.28 to 1.29 yesterday. If it happens again I may downgrade and see if it goes away. |
@RalphSleighK This is the behavior we saw with 0.8.1 as well. If left as it is, it seems to recover after about an hour of being brain dead. Otherwise it takes us deleting the pod to make it happy again. |
@ccureau What version of K8s are you using out of curiosity? We've just upgraded to 1.29, wondering if it's a coincidence that we saw this issue post-upgrade. |
@angusjellis We're on 1.27.10 in a GKE cluster. We did an upgrade from 1.26 to 1.27 while this was happening without any change in listener behavior for what it's worth. |
We got a new error this morning... unsure what to make of it:
|
With the latest version, this happens once/twice a day (k8s v1.28.5). |
Same here with GKE |
Re-opening this one for #3204 (comment) |
Also seeing this here, with 0.9.1. The listener seems to be running, but we don't see any new jobs being created in the cluster. What's frustrating is that the log level is already debug by default, and there's no way to get any further detail from the listener to see what might be happening. I did take a quick look at the listener queue-handling code, and while I'm not positive that I'm correct, or that this possible problem is even related, I wonder if we have a goroutine leak here: when a queue message is requested and read, if any of the error paths are followed, we defer the closing of the response body. EDIT: I missed the full response body buffering in the same code path, so this may not be an issue after all. |
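For anyone unfamiliar with the pattern being discussed, here is a generic Go sketch (not the listener's actual code; the function name and message shape are invented for illustration) of draining and closing an HTTP response body on every path, which is what keeps a long-poll loop from leaking connections or leaving readers stuck:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// getMessage is a generic sketch of a long-poll read. The important part is
// that every path — success and error alike — both drains and closes the
// response body, so the transport can reuse the connection and no reader
// is left behind.
func getMessage(client *http.Client, url string) (map[string]any, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err // no body to close when the request itself fails
	}
	defer func() {
		io.Copy(io.Discard, resp.Body) // drain anything left unread
		resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}

	var msg map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&msg); err != nil {
		return nil, fmt.Errorf("decoding message: %w", err)
	}
	return msg, nil
}

func main() {
	msg, err := getMessage(http.DefaultClient, "https://example.com/messages/next")
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	fmt.Println("got message:", msg)
}
```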
@nikola-jokic yes, that sounds very much like it! Thanks :) |
I've applied the change. So far, one day has passed, and I am waiting for more statistical data to accumulate. |
Hey @verdel, when you can, please provide the log from when the listener restarts. Since this has been an issue with the older version of the listener, it may be related to the Actions server. In the 0.9.1 release, we added a trace ID, which should help us track down the issue. |
I'll send the logs a bit later. We are staying on version 0.9.0 as we are waiting for a fix that will be added in this PR. |
Lately, I've been seeing the following message in the listener.
We use runners that are rarely launched (no more than once or twice a day), and such a message appears during the first launch of the day. Today, after I added a restart policy, there have been no messages so far. I will continue to observe and provide additional information when it becomes available. |
Hey everyone, I believe we found the root cause of this issue, and it is now fixed in 0.9.2. I will leave this issue open for now until someone confirms they are no longer seeing these problems with the newest release. |
I think this issue is safe to close now since there are no reports of listener crashes in the latest release. Thank you all for providing more information, and please let us know if you experience any issues like this again! |
I am having this issue on 0.9.3 |
Seeing this in 0.10.0 |
We believe we also may be seeing this; can we re-open this issue, @nikola-jokic? In our case, the logs we see are as follows: a bunch of "job assigned" messages, followed by the listener computing that zero new runners should be deployed, and then a token refresh a few minutes later. Perhaps related? We are running on the scale sets controller.
|
Hey @niodice, I'm hesitant to reopen this issue since the one you submitted is a separate issue from the one originally reported. Could you please share the workflow run URL here or via support so we can investigate it further? |
Checks
Controller Version
0.7.0
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Symptom:
Diagnostics:
Another symptom I've noticed: When the controller pod is updated (for instance, between 0.7.0 and 0.8.0) two inconsistencies happen:
A separate instance where we observed the issue
Currently running listener gha-runner-scale-set-controller:0.7.0
No logs were being generated; we had found a workflow job stuck in a pending state (from the listener logs).
10 runners were online, but the runner group was offline when looking at our org's Actions runner group settings (URL like this: https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).
Verified listener metrics had flatlined, for example:

Then the queued job being watched was assigned with no intervention.
Below is the section of logs where the listener had stopped and then started working again with no intervention.
Describe the expected behavior
Ideally the listener would never stop responding.
Additional Context
The only additional thing we tried was using the "opt out" button on our GitHub App's advanced features page. This was kind of a Hail Mary since we saw the logs related to refreshing the token. It seems to have helped, but maybe that's just a fluke.
Controller Logs
Runner Pod Logs