Check if no. of failed pods exceeded BackOffLimit #500

Merged
merged 2 commits into master from workaroud_for_job_failure
Dec 8, 2020

Conversation

katrybacka
Contributor

I found a bug in k8s: the job-controller, which is responsible for setting the Failed condition on Jobs, does not work correctly. Sometimes the Job status is not updated at all, or it is updated with a huge delay. As a result, multiple pods are run and their number exceeds the BackOffLimit specified in Job.Spec.
This bug sometimes causes the nightly job to fail.
This PR contains a workaround: a Job is treated as failed when its Failed condition is set to true, or when the number of failed Pods is equal to or greater than BackOffLimit.
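
For readers who want to see the shape of the check, here is a minimal sketch of the rule described above. It is not the code from this PR: `Job`, `JobCondition`, and `isJobFailed` are hypothetical, simplified stand-ins for the real Kubernetes API types (where the field is `Spec.BackoffLimit`), and the caller is assumed to have already counted the failed Pods. The fallback of 6 is the Kubernetes default for an unset backoff limit. The `failedPods > 0` guard is an extra assumption, added so that a limit of 0 does not mark a Job as failed before any Pod has failed.

```go
package main

// JobCondition is a simplified, hypothetical stand-in for the Kubernetes
// job condition type; it is not the type used in this PR.
type JobCondition struct {
	Type   string // e.g. "Failed" or "Complete"
	Status string // "True", "False" or "Unknown"
}

// Job holds only the fields this sketch needs (hypothetical type).
type Job struct {
	BackoffLimit *int32 // mirrors Job.Spec.BackoffLimit; nil means unset
	Conditions   []JobCondition
}

// defaultBackoffLimit is the Kubernetes default when backoffLimit is unset.
const defaultBackoffLimit int32 = 6

// isJobFailed reports whether a Job should be treated as failed:
// either the Failed condition is "True", or the number of failed Pods
// has reached the backoff limit (the workaround for a stale status).
func isJobFailed(job Job, failedPods int32) bool {
	for _, c := range job.Conditions {
		if c.Type == "Failed" && c.Status == "True" {
			return true
		}
	}

	limit := defaultBackoffLimit
	if job.BackoffLimit != nil {
		limit = *job.BackoffLimit
	}

	// Assumption: require at least one failed Pod so that a limit of 0
	// does not report failure for a Job with no failures yet.
	return failedPods > 0 && failedPods >= limit
}
```

With a limit of 3, for example, this reports failure once three Pods have failed, even if the job-controller has not yet set the Failed condition.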

@katrybacka katrybacka self-assigned this Dec 7, 2020
@katrybacka
Contributor Author

recheck ha

@katrybacka katrybacka merged commit 52f2b27 into master Dec 8, 2020
@katrybacka katrybacka deleted the workaroud_for_job_failure branch December 8, 2020 10:04

3 participants