-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners #4059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…failed ephemeral runners
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
This would be a big help to us. We're having some issues with our Kubevirt runners that we need to resolve, but this would help us out until we can nail down the precise problem. |
Could we please prioritize this PR? We are currently experiencing the same problem as described in #2721, and it is impacting our workflow. |
Hey @rr-krupesh-savaliya, I just want to say that depending on the root cause of the failures happening on your cluster, you might still want to inspect them and fix them. This is just a measure that would allow self-healing, but underlying problems should still be fixed, so you can get all the benefits of autoscaling |
} | ||
return updated.Status.Reason, nil | ||
}, ephemeralRunnerTimeout, ephemeralRunnerInterval).Should(BeEquivalentTo("TooManyPodFailures"), "Reason should be TooManyPodFailures") | ||
for i := range 5 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have coverage for partial failures? Do we want to add one more test case to cover behaviour for less than < 5
failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a test that covers it on eviction, which kind of covers the case for re-creation, and we have the case that checks the exit code 0 when the runner exists in the service. I feel like these tests are covering the less than 5 failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already discussed this at length, left a non blocking comment
Fixes #2721
Fixes #3685
Fixes #3700