Occasionally jobs are stuck in "waiting for a runner to come online" #3649
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
We are now facing this issue as well, same version.
A bit more info that may help.
You can clearly see that there are 23 runners, 22 of which have failed, but even when I restart the job and try to get hold of a pod, I don't see any. On top of that, in the logs I see (read from bottom to top):
It's very strange that the pod was running for just a few seconds and then was deleted.
Is there a workaround for this? The issue effectively causes a very poor experience.
Experiencing this issue as well with a new deployment on AKS. Controller logs:
Runner logs:
Totally blocked.
We've seen the same issue. We noticed that there are other issues where this was reported in older versions of actions-runner-controller, and one of them has a comment saying that the issue is fixed by restarting the listener. Restarting the listener doesn't fix the underlying issue for anyone, but as triage you can deploy a CronJob with something like this:
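A minimal sketch of that kind of CronJob, assuming the controller and listener run in the arc-systems namespace and that listener pods carry the app.kubernetes.io/component=runner-scale-set-listener label (check the labels on your own listener pods first); the schedule, image, and ServiceAccount name are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-arc-listeners
  namespace: arc-systems            # assumed controller/listener namespace
spec:
  schedule: "*/15 * * * *"          # every 15 minutes; tune to taste
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: arc-listener-restarter   # placeholder; RBAC sketched below
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                # Deleting the listener pod forces the controller to recreate it
                - kubectl delete pod -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener
```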
You'll also need to define the ServiceAccount and its Role/RoleBinding, but the above should get you started.
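For completeness, a companion ServiceAccount/Role/RoleBinding sketch for the hypothetical arc-listener-restarter account used above (names and namespace are again placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: arc-listener-restarter
  namespace: arc-systems
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: arc-listener-restarter
  namespace: arc-systems
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]   # just enough to find and delete the listener pod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: arc-listener-restarter
  namespace: arc-systems
subjects:
  - kind: ServiceAccount
    name: arc-listener-restarter
    namespace: arc-systems
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: arc-listener-restarter
```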
On our side, it somehow magically stopped being an issue over 2 weeks ago. Since then, not a single job has failed to be picked up (except for about an hour when we messed up our networking in AWS and the Kubernetes cluster was unreachable, but that was purely a network problem that we caused ourselves).
Hello, we have a similar situation on our runners.
As you can see there are 3 entries of …
The only difference to a non-working job is that it does not have the second scale. After the first one with …
We are using the runner scale set v0.8.3 with the …
I even tried to manually patch the …
Any idea where we should look? It seems there is an issue between the listener pod and the …
Thanks
Hello everyone,
Just an update: we kept seeing the listeners go offline on our GitHub instance for no apparent reason. Deleting the …
I have the same issue, and @ricardojdsilva87's comment solves it. The listener does not report any error logs. The last thing it says (repeatedly) before dying suddenly is:
I'm wondering if this has something to do with it, but it shouldn't matter if no jobs are running. The controller should be able to "see" that the listener is missing, kill the set, and recreate it. So perhaps if a job is not reporting as exited, the controller hangs around waiting to delete the ephemeral runner set (which also controls whether the listener exists or not).
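If you want to check that theory on your own cluster, a few quick kubectl queries against the ARC custom resources can show whether the listener pod is missing while an ephemeral runner set or runner is stuck (assuming the default arc-systems controller namespace and an arc-runners namespace for the scale sets; adjust to your installation):

```sh
# Is the listener pod still present next to the controller?
kubectl get pods -n arc-systems
kubectl get autoscalinglisteners -n arc-systems

# State of the scale set and its ephemeral runner sets / runners
kubectl get autoscalingrunnersets -n arc-runners
kubectl get ephemeralrunnersets -n arc-runners
kubectl get ephemeralrunners -n arc-runners   # look for runners stuck in a non-finished phase
```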
Referencing related issue: #3704. Additional details, if it helps:
We have experienced the same issue here. Any tips would be much appreciated!
Hey everyone, this should be resolved now. Can you please confirm you are not getting this error anymore?
I'll close this issue here since we are tracking it in the new issue #3953.
Hello @nikola-jokic, is this fix only for 0.11.0? |
Checks
Controller Version
0.9.3
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
In the example, on the CI pipeline:
We have multiple cases of jobs waiting a long time for a runner, without any clear explanation.
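(The pipeline itself isn't captured in this report; for context, a hypothetical workflow job that targets an ARC scale set simply references the scale set's installation name in runs-on, as in the placeholder arc-runner-set below.)

```yaml
name: example
on: [push]
jobs:
  build:
    runs-on: arc-runner-set   # placeholder: the gha-runner-scale-set installation name
    steps:
      - uses: actions/checkout@v4
      - run: echo "This job sits in 'waiting for a runner' until ARC provides a pod"
```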
Describe the expected behavior
Additional Context
There have been attempts to play around with the 'minWorkers' configuration of the runner set to prevent the waiting issue.
Runner sets are deployed with Helm using the values.yaml file:
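(The actual values.yaml isn't reproduced here; for illustration, a minimal gha-runner-scale-set values file with placeholder names, including the chart's min/max runner settings, looks roughly like this:)

```yaml
# Illustrative values.yaml for the gha-runner-scale-set chart; all names are placeholders.
githubConfigUrl: "https://github.com/my-org"   # org, repo, or enterprise URL
githubConfigSecret: github-app-credentials     # secret holding GitHub App or PAT credentials
minRunners: 1                                  # a warm runner reduces, but does not eliminate, waiting
maxRunners: 20
containerMode:
  type: dind
```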
Controller Logs
Runner Pod Logs
No relevant logs for the runner, as the lack of a runner is the issue.