Sometimes, a job needs to wait 30/60 minutes before getting a runner #3953
Comments
We are seeing the same issue and we have a similar setup. We are unsure if the two runner scale sets (used for upgrade ease) are actually causing problems.
We are on 0.8.2 and seem to be encountering a similar issue. We recently upgraded Karpenter to 1.3.3, and that's when we began seeing this, but it may have existed before that.
I'm observing similar behavior, even when not running in a high-availability setup (single cluster on Azure). Unfortunately, the logs offer no insight, and the latency is unpredictable.
Our organization is experiencing job queue delays exceeding 12 hours, severely impacting production workloads. No error logs are observed on our side. What steps can we take to troubleshoot this issue? @nikola-jokic
Hey everyone, could you please submit these logs without obfuscation through support? We cannot investigate without knowing which workflow runs are stuck. If you have failed runners, they count toward the number of runners, so we can avoid creating an indefinite number of runners if something goes wrong with the cluster.
Hey everyone, we found the root cause of the issue, and it should be fixed now. Please let us know if you are still experiencing it. I will leave this issue open for now for visibility. Thank you all for reporting it!
Do we need to uninstall and re-deploy ARC?
No, the issue was on the back-end side, so it should start working properly without touching the ARC installation.
Do you have failed ephemeral runners? If you don't have any failed ones, please send the listener log, the controller log, and the workflow URLs of the pending jobs. You can submit them in the support issue if you don't want to share them publicly.
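For anyone collecting those diagnostics, a minimal sketch of the kubectl commands is below, assuming the controller runs in an `arc-systems` namespace; the deployment name and pod label are placeholders that depend on your Helm release names.

```bash
# List ephemeral runners across all namespaces and look for ones stuck in a
# Failed state (EphemeralRunner is a custom resource installed by ARC).
kubectl get ephemeralrunners --all-namespaces

# Controller log. The deployment name below is a placeholder; it depends on
# the Helm release name of the gha-runner-scale-set-controller chart.
kubectl logs -n arc-systems deploy/arc-gha-rs-controller > controller.log

# Listener log. The label selector is an assumption; adjust it to whatever
# labels the listener pod actually carries in your cluster
# (check with `kubectl get pods -n arc-systems --show-labels`).
kubectl logs -n arc-systems \
  -l app.kubernetes.io/component=runner-scale-set-listener > listener.log
```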
@marcusisnard unfortunately I can offer no help, but I wanted to ask how you got that particular UI. It looks like a GitHub view showing the scale sets directly in the UI. I don't have such a view, but it would be nice to see it.
Checks
Controller Version
0.10.1
Deployment Method
Helm
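For context, a Helm-based install of that controller version might look roughly like the sketch below; the release name and namespace are assumptions, and the OCI chart location follows the public ARC quickstart docs rather than the reporter's actual configuration.

```bash
# Minimal controller install pinned to the reported version; "arc" and
# "arc-systems" are placeholder release/namespace names.
helm upgrade --install arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --namespace arc-systems --create-namespace \
  --version 0.10.1
```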
To Reproduce
Describe the bug
Let's say that I have a workflow with 3 jobs running in parallel.
Sometimes, jobs 1 and 2 will get a runner right away, but the third one will have to wait 30 minutes to an hour before getting a runner.
Describe the expected behavior
All the jobs should start right away.
Note that I have two runner scale sets with the same `runnerScaleSetName`. I don't know if it's bad practice or not, but it's working fine 🤷‍♂️. I did that to ease the upgrade process: when a new chart is available, I update the gha-runner-scale-sets one by one to avoid service interruptions.
Thanks
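For illustration, the setup described above (two releases of the runner scale set chart registering under one `runnerScaleSetName`) might look roughly like the following sketch; the release names, namespace, organization URL, and secret name are placeholders, not the reporter's actual values.

```bash
# Two Helm releases of the runner scale set chart that share a single
# runnerScaleSetName, so jobs using `runs-on: my-runners` can land on either.
helm upgrade --install my-runners-blue \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners --create-namespace \
  --set githubConfigUrl="https://github.com/my-org" \
  --set githubConfigSecret=github-app-secret \
  --set runnerScaleSetName=my-runners

helm upgrade --install my-runners-green \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --set githubConfigUrl="https://github.com/my-org" \
  --set githubConfigSecret=github-app-secret \
  --set runnerScaleSetName=my-runners

# When a new chart version ships, roll the releases one at a time to limit
# service interruption, e.g.:
#   helm upgrade my-runners-blue  ... --version <new-chart-version>
#   helm upgrade my-runners-green ... --version <new-chart-version>
```

Whether two releases sharing a `runnerScaleSetName` is a supported pattern is part of what this issue is asking, so treat this as a description of the setup rather than a recommendation.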
Additional Context
Controller Logs
Runner Pod Logs