Skip to content

Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners #4059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 14, 2025

Conversation

nikola-jokic
Copy link
Collaborator

@nikola-jokic nikola-jokic commented Apr 25, 2025

Fixes #2721
Fixes #3685
Fixes #3700

@nikola-jokic nikola-jokic marked this pull request as ready for review April 28, 2025 10:15
@Copilot Copilot AI review requested due to automatic review settings April 28, 2025 10:15
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@nikola-jokic nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Apr 28, 2025
@adamcharnock
Copy link

This would be a big help to us. We're having some issues with our Kubevirt runners that we need to resolve, but this would help us out until we can nail down the precise problem.

@rr-krupesh-savaliya
Copy link

rr-krupesh-savaliya commented May 13, 2025

Could we please prioritize this PR? We are currently experiencing the same problem as described in #2721, and it is impacting our workflow.

@nikola-jokic @mumoshu @rentziass @toast-gear

@nikola-jokic
Copy link
Collaborator Author

Hey @rr-krupesh-savaliya,

I just want to say that depending on the root cause of the failures happening on your cluster, you might still want to inspect them and fix them. This is just a measure that would allow self-healing, but underlying problems should still be fixed, so you can get all the benefits of autoscaling

}
return updated.Status.Reason, nil
}, ephemeralRunnerTimeout, ephemeralRunnerInterval).Should(BeEquivalentTo("TooManyPodFailures"), "Reason should be TooManyPodFailures")
for i := range 5 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have coverage for partial failures? Do we want to add one more test case to cover behaviour for less than < 5 failures?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test that covers it on eviction, which kind of covers the case for re-creation, and we have the case that checks the exit code 0 when the runner exists in the service. I feel like these tests are covering the less than 5 failures.

Copy link
Member

@Link- Link- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already discussed this at length, left a non blocking comment

@nikola-jokic nikola-jokic merged commit cae7efa into master May 14, 2025
17 checks passed
@nikola-jokic nikola-jokic deleted the nikola-jokic/retry-ephemeral-runner branch May 14, 2025 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
4 participants