ci: auto-retry infra-caused script failures via exit code 75 sentinel by Leiyks · Pull Request #3812 · DataDog/dd-trace-php

Leiyks · 2026-04-22T11:49:08Z

Service startup timeouts (Kafka, Zookeeper, etc.) exit with code 1, which GitLab classifies as script_failure — not covered by our runner-level retry rules. We don't want blanket script_failure retry as it hides real test flakiness.

GitLab's `retry: exit_codes:` is a standalone OR condition: a job is retried if it exits with one of the listed codes, independently of `when:`. This means `exit_codes: [75]` retries only on exit code 75; real test failures (exit 1) are never retried.

This PR introduces exit code 75 (EX_TEMPFAIL) as an infra-failure sentinel:

- `generate-common.php`: add `exit_codes: [75]` to the global retry block, so jobs exiting with the infra sentinel are retried up to 2×; all other failures are unaffected
- `wait-for-service-ready.sh`: `exit 1` → `exit 75` on service startup timeout (Kafka, Zookeeper, MySQL, Redis, etc.)
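
For readers who have not seen the script, the timeout change can be sketched roughly as follows. This is a minimal illustration, not the contents of `wait-for-service-ready.sh`: the function name `wait_for`, its arguments, and the one-second polling interval are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of a readiness wait that signals "infra failure"
# with exit code 75 (EX_TEMPFAIL from sysexits.h) instead of 1.

EX_TEMPFAIL=75

# wait_for CHECK_CMD TIMEOUT_SECONDS
# Polls CHECK_CMD once per second until it succeeds or the timeout elapses.
# Returns 0 when the check passes, 75 when the timeout is reached.
wait_for() {
  wf_check=$1
  wf_timeout=$2
  wf_elapsed=0
  until eval "$wf_check" >/dev/null 2>&1; do
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      echo "timed out after ${wf_timeout}s waiting for: $wf_check" >&2
      return "$EX_TEMPFAIL"
    fi
    sleep 1
    wf_elapsed=$((wf_elapsed + 1))
  done
  return 0
}
```

A caller would propagate the code unchanged, for example `wait_for "nc -z kafka 9092" 60 || exit $?` (host, port, and timeout are placeholders), so that the job itself exits 75 and matches the retry rule.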

Verified on CI: exit 75 → 3 attempts (retried twice); exit 1 → 1 attempt only.
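
The verified behaviour can be reproduced locally with a small model. The sketch below is only an emulation of the observed attempt counts, not GitLab's implementation; `run_with_infra_retry` and the `ATTEMPTS` variable are hypothetical names introduced for the example.

```shell
#!/bin/sh
# Local model of "retry up to MAX_RETRIES times, but only on exit code 75".
# ATTEMPTS records how many times the command was run by the last call.

ATTEMPTS=0

# run_with_infra_retry MAX_RETRIES CMD [ARGS...]
# Returns the exit code of the final attempt.
run_with_infra_retry() {
  rr_max=$1
  shift
  ATTEMPTS=0
  while :; do
    ATTEMPTS=$((ATTEMPTS + 1))
    rr_rc=0
    "$@" || rr_rc=$?
    # Stop on success, on any non-sentinel failure, or when retries run out.
    if [ "$rr_rc" -ne 75 ] || [ "$ATTEMPTS" -gt "$rr_max" ]; then
      return "$rr_rc"
    fi
  done
}
```

With `run_with_infra_retry 2 sh -c 'exit 75'` the command runs three times before giving up, while `run_with_infra_retry 2 sh -c 'exit 1'` runs it once, which mirrors the two CI observations above.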

Any script that fails due to transient infra can adopt the same convention (exit 75) and will be picked up by the global rule automatically.

Service startup timeouts (Kafka, Zookeeper, MySQL, etc.) exit the job with code 1, which GitLab classifies as script_failure — not covered by the existing runner-level retry rules, and the team doesn't want to enable blanket script_failure retry (hides real flakiness). GitLab 14.9+ supports `retry: exit_codes:` which fires the script_failure retry rule only when the exit code matches. We use EX_TEMPFAIL (75) as the infra-sentinel: wait-for-service-ready.sh now exits 75 on service timeout instead of 1, and the global default retry block adds `script_failure` gated on exit code 75. Effect: Kafka/Zookeeper/other service startup races are retried up to 2 times automatically. Real test failures (exit 1) are never retried.

datadog-prod-us1-3 · 2026-04-22T12:01:53Z

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
• Patch Coverage: 100.00%
• Overall Coverage: 60.65% (-0.04%)

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 930acf6

Remove script_failure from the global when: list. Having both script_failure and exit_codes: [75] retries on any script failure OR exit code 75 — not exit code 75 only as intended. exit_codes: [75] alone correctly retries only jobs that exit with code 75 (the EX_TEMPFAIL sentinel from wait-for-service-ready.sh), leaving all other script failures (exit 1, real test failures) unretried.
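
The OR semantics described in this commit can be modelled in a few lines. This is a simplified sketch of the decision as the commit message describes it, not GitLab's code; `should_retry` is a hypothetical helper, and the use of `runner_system_failure` as the other `when:` entry is an assumption for illustration.

```shell
#!/bin/sh
# Model of the retry decision: a job is retried if its failure reason is in
# the when: list OR its exit code is in the exit_codes: list.

# should_retry WHEN_LIST EXIT_CODES FAILURE_REASON EXIT_CODE
# WHEN_LIST and EXIT_CODES are space-separated strings.
# Returns 0 if the job would be retried, 1 otherwise.
should_retry() {
  for sr_when in $1; do
    if [ "$sr_when" = "$3" ]; then
      return 0
    fi
  done
  for sr_code in $2; do
    if [ "$sr_code" = "$4" ]; then
      return 0
    fi
  done
  return 1
}
```

Under this model, `should_retry "runner_system_failure script_failure" "75" script_failure 1` succeeds (a real test failure is retried, which was the bug), whereas `should_retry "runner_system_failure" "75" script_failure 1` fails and `should_retry "runner_system_failure" "75" script_failure 75` succeeds.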

- test-retry-exit-75: exits 75, no job-level retry override → inherits global default → should be retried twice (3 total attempts)
- test-no-retry-exit-1: exits 1, same config → should run exactly once

Both jobs are branch-scoped (leiyks/infra-failure-retry only) and allow_failure: true. Remove after verification.

Leiyks added 2 commits April 22, 2026 13:48

ci: remove inline comment on exit 75

ebb68ce

Leiyks added 3 commits April 22, 2026 15:28

ci: remove temporary retry validation test jobs

930acf6

Leiyks marked this pull request as ready for review April 22, 2026 14:19

Leiyks requested a review from a team as a code owner April 22, 2026 14:19

bwoebi approved these changes Apr 22, 2026

Leiyks merged commit 0d88e71 into master Apr 22, 2026
1888 of 1958 checks passed

Leiyks deleted the leiyks/infra-failure-retry branch April 22, 2026 14:54

github-actions Bot added this to the 1.19.0 milestone Apr 22, 2026

Leiyks merged 5 commits into master from leiyks/infra-failure-retry

2 participants
