ci: auto-retry infra-caused script failures via exit code 75 sentinel #3812
Merged
Service startup timeouts (Kafka, Zookeeper, MySQL, etc.) exit the job with code 1, which GitLab classifies as `script_failure`. That failure type is not covered by the existing runner-level retry rules, and the team doesn't want to enable blanket `script_failure` retry because it hides real flakiness.

GitLab 16.10 and later support `retry: exit_codes:`, which fires the `script_failure` retry rule only when the exit code matches. We use `EX_TEMPFAIL` (75) as the infra sentinel: `wait-for-service-ready.sh` now exits 75 on service timeout instead of 1, and the global default retry block adds `script_failure` gated on exit code 75.

Effect: Kafka/Zookeeper/other service startup races are retried up to 2 times automatically. Real test failures (exit 1) are never retried.
Remove script_failure from the global when: list. Having both script_failure and exit_codes: [75] retries on any script failure OR exit code 75 — not exit code 75 only as intended. exit_codes: [75] alone correctly retries only jobs that exit with code 75 (the EX_TEMPFAIL sentinel from wait-for-service-ready.sh), leaving all other script failures (exit 1, real test failures) unretried.
- `test-retry-exit-75`: exits 75, no job-level retry override → inherits global default → should be retried twice (3 total attempts)
- `test-no-retry-exit-1`: exits 1, same config → should run exactly once

Both jobs are branch-scoped (`leiyks/infra-failure-retry` only) and `allow_failure: true`. Remove after verification.
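The behavior these two verification jobs are expected to show can be emulated locally. The sketch below is only an illustration of the intended semantics, not GitLab's implementation: `run_with_infra_retry` and the `ATTEMPTS` variable are hypothetical names, and the limit of 2 retries is taken from the description above.

```shell
#!/bin/sh
# Local emulation only; in CI the retry decision is made by GitLab.
# run_with_infra_retry MAX_RETRIES COMMAND...
# Reruns COMMAND only when it exits 75, at most MAX_RETRIES extra times.
# Sets ATTEMPTS to the number of runs and returns the last exit code.
run_with_infra_retry() {
  max_retries=$1
  shift
  ATTEMPTS=0
  while :; do
    ATTEMPTS=$((ATTEMPTS + 1))
    rc=0
    "$@" || rc=$?
    # Stop on any non-sentinel exit code, or once the retries are used up.
    if [ "$rc" -ne 75 ] || [ "$ATTEMPTS" -gt "$max_retries" ]; then
      return "$rc"
    fi
  done
}

# Mirrors test-retry-exit-75: always exits with the sentinel.
run_with_infra_retry 2 sh -c 'exit 75' || true
echo "exit 75 -> attempts: $ATTEMPTS"   # prints "exit 75 -> attempts: 3"

# Mirrors test-no-retry-exit-1: an ordinary failure is not retried.
run_with_infra_retry 2 sh -c 'exit 1' || true
echo "exit 1 -> attempts: $ATTEMPTS"    # prints "exit 1 -> attempts: 1"
```

The function returns the last exit code unchanged, so a job that keeps hitting the sentinel still ends as a failure after the final attempt.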
bwoebi approved these changes on Apr 22, 2026
Service startup timeouts (Kafka, Zookeeper, etc.) exit with code 1, which GitLab classifies as `script_failure`, a failure type not covered by our runner-level retry rules. We don't want blanket `script_failure` retry as it hides real test flakiness.

GitLab's `retry: exit_codes:` is a standalone OR condition: jobs are retried if they exit with one of the listed codes, independently of `when:`. This means `exit_codes: [75]` retries only on exit code 75; real test failures (exit 1) are never retried.

This PR introduces exit code 75 (`EX_TEMPFAIL`) as an infra-failure sentinel:

- `generate-common.php`: add `exit_codes: [75]` to the global retry block. Jobs exiting with the infra sentinel are retried up to 2×; all other failures are unaffected.
- `wait-for-service-ready.sh`: `exit 1` → `exit 75` on service startup timeout (Kafka, Zookeeper, MySQL, Redis, etc.)

Verified on CI: `exit 75` → 3 attempts (retried twice); `exit 1` → 1 attempt only.

Any script that fails due to transient infra can adopt the same convention (`exit 75`) and will be picked up by the global rule automatically.
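For a script adopting the convention, a minimal sketch of the pattern might look like the following. It is not the contents of `wait-for-service-ready.sh`: the function name `wait_for_service`, its arguments, and the one-second polling interval are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the exit-75 convention; names are illustrative.

EX_TEMPFAIL=75  # sysexits.h: temporary failure, a retry may succeed

# wait_for_service NAME TIMEOUT_SECONDS CHECK_COMMAND...
# Polls CHECK_COMMAND once per second until it succeeds.
# Returns 0 when the service is ready, 75 once the timeout has elapsed.
wait_for_service() {
  svc_name=$1
  svc_timeout=$2
  shift 2
  svc_waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$svc_waited" -ge "$svc_timeout" ]; then
      echo "Timed out after ${svc_timeout}s waiting for ${svc_name}" >&2
      return "$EX_TEMPFAIL"
    fi
    sleep 1
    svc_waited=$((svc_waited + 1))
  done
  echo "${svc_name} is ready"
}

# Example call site (commented out so the sketch has no side effects;
# the host, port, and use of nc are assumptions):
# wait_for_service kafka 60 nc -z kafka 9092 || exit $?
```

Propagating the code with `|| exit $?` at the call site keeps the sentinel intact, so the job itself exits 75 and the global rule applies, while failures in the actual tests keep their own exit codes.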