chore(workflows): retry transient failures in impl-generate + impl-review #5819
Merged
…view

Two minimal resilience fixes for the failures observed in the audit investigation on 2026-05-06:

1. impl-review.yml, "Extract PR info" step (5 failures in 24h): wrap `gh pr view` in a 3-attempt retry with backoff. When the GitHub API blips, the whole review job currently aborts and the PR ends up unlabeled, blocking review→repair→merge.
2. impl-generate.yml, "Create library metadata file" step (6 failures in 24h): wrap the final `git push origin "$BRANCH"` in a 3-attempt retry that does fetch + rebase between attempts. The dominant failure mode is racing against Claude's earlier push to the same branch.

Both fixes are inline bash retries, with no new action dependency. Each step still hard-fails after 3 attempts so persistent issues still surface.

Out of scope (deferred): the `daily-regen.yml` "pick" job already exits cleanly with `count=0` on no eligible specs (downstream is gated `if: count != '0'`); the 2 reported "cancellations" were unrelated scheduler-level events, not pick-step bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR improves resilience of the implementation automation pipeline by adding inline retry/backoff logic around two common transient failure points in the impl-generate → impl-review cascade.
Changes:
- Add a 3-attempt retry loop with backoff around `gh pr view` in `impl-review.yml` to reduce failures from transient GitHub API errors.
- Add a 3-attempt retry loop around `git push` in `impl-generate.yml`, rebasing between attempts to mitigate non-fast-forward races when multiple agents push to the same branch.
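Both changes follow the same shape: try a command a fixed number of times, wait a little longer after each failure, and give up with a non-zero status at the end. The sketch below shows that shape as a reusable bash function. The `retry` helper and the `flaky` demo command are hypothetical and not part of this PR, which keeps each loop inline in its workflow step.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: run a command up to MAX times,
# sleeping (attempt * UNIT) seconds between failed attempts.
# Usage: retry MAX UNIT command [args...]
retry() {
  local max=$1 unit=$2 attempt
  shift 2
  for ((attempt = 1; attempt <= max; attempt++)); do
    if "$@"; then
      return 0
    fi
    echo "::warning::'$*' failed (attempt ${attempt}/${max})" >&2
    # No sleep after the final attempt; the caller hard-fails right away.
    if ((attempt < max)); then
      sleep $((attempt * unit))
    fi
  done
  return 1
}

# Demo with a zero-second unit: the command fails twice, then succeeds.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}
retry 3 0 flaky && echo "succeeded after ${calls} calls"
```

With `UNIT` set to 5 and three attempts, the waits are 5s and 10s, which is linear rather than exponential growth.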
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/impl-review.yml | Retries PR metadata extraction (gh pr view) to avoid aborting review jobs on transient API blips. |
| .github/workflows/impl-generate.yml | Retries the metadata commit push with fetch+rebase between attempts to reduce non-fast-forward push failures. |
Comment on lines 59 to 72
  if PR_DATA=$(gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body 2>&1); then
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): ${PR_DATA}"
  PR_DATA=""
  sleep $((attempt * 5))
done
if [ -z "$PR_DATA" ]; then
  echo "::error::gh pr view failed after 3 attempts for PR #${PR_NUMBER}"
  exit 1
fi
HEAD_REF=$(echo "$PR_DATA" | jq -r '.headRefName')
HEAD_SHA=$(echo "$PR_DATA" | jq -r '.headRefOid')
BODY=$(echo "$PR_DATA" | jq -r '.body')
run: |
  PR_DATA=$(gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body)
  # Retry gh pr view — transient API blips were the dominant impl-review
  # failure mode (5x in 24h on 2026-05-06). Three attempts with backoff.
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): ${PR_DATA}"
  PR_DATA=""
  sleep $((attempt * 5))
Comment on lines +552 to +563
# Retry git push — transient races against Claude's earlier push were
# the dominant "Create metadata file" failure mode (6x in 24h on
# 2026-05-06). On a non-fast-forward, fetch+rebase before retrying.
push_ok=0
for attempt in 1 2 3; do
  if git push origin "$BRANCH"; then
    push_ok=1
    break
  fi
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  git fetch origin "$BRANCH" || true
  git pull --rebase origin "$BRANCH" || true
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  git fetch origin "$BRANCH" || true
  git pull --rebase origin "$BRANCH" || true
  sleep $((attempt * 5))
Five suggestions, all applied:

1. impl-review.yml: separate stderr from stdout when capturing `gh pr view` output. The previous `2>&1` merge would have corrupted PR_DATA on success-with-warnings (rate-limit notices etc.), causing jq to fail with confusing input.
2. impl-review.yml: comment said "exponential backoff" but the sleep is linear (5s, 10s). Updated comment to match.
3. impl-review.yml + impl-generate.yml: skip the sleep on the final attempt, since there is no point delaying before the hard-fail.
4. impl-generate.yml: fail fast when the rebase itself errors (conflicts, dirty state) instead of swallowing with `|| true` and retrying on a half-rebased repo. Includes a `rebase --abort` guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
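The first suggestion is easy to reproduce without touching the GitHub API. In the sketch below, `fake_gh` is a hypothetical stand-in for a `gh pr view --json ...` call that succeeds but also prints a notice on stderr; the notice text is invented for the demo. The temp files come from `mktemp` here, which is an assumption of this sketch, whereas the PR writes to fixed paths under `/tmp`.

```shell
#!/usr/bin/env bash
# fake_gh is a hypothetical stand-in for `gh pr view --json ...`: it exits 0,
# prints JSON on stdout, and prints a notice on stderr.
fake_gh() {
  echo 'warning: rate limit nearly exhausted' >&2
  echo '{"headRefName":"feature"}'
}

# Merged capture: the stderr notice ends up inside the value a JSON parser sees.
merged=$(fake_gh 2>&1)

# Separate capture: stdout and stderr go to distinct temp files.
out=$(mktemp)
err=$(mktemp)
fake_gh >"$out" 2>"$err"
clean=$(cat "$out")
notice=$(cat "$err")
rm -f "$out" "$err"

printf 'merged: %s\n' "$merged"
printf 'clean:  %s\n' "$clean"
printf 'notice: %s\n' "$notice"
```

Only `clean` is valid JSON; `merged` starts with the warning line, which is the input that would make `jq` fail even though the command itself succeeded.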
    PR_DATA=$(cat /tmp/pr.json)
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): $(cat /tmp/pr.err)"
Comment on lines +56 to +67
# failure mode (5x in 24h on 2026-05-06). Three attempts with linear
# backoff (5s, 10s); keep stderr separate from stdout so warnings
# (rate-limit notices, etc.) don't corrupt the JSON jq parses below.
PR_DATA=""
for attempt in 1 2 3; do
  if gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body > /tmp/pr.json 2> /tmp/pr.err; then
    PR_DATA=$(cat /tmp/pr.json)
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): $(cat /tmp/pr.err)"
  if [ "$attempt" -lt 3 ]; then
    sleep $((attempt * 5))
Comment on lines +554 to +572
# 2026-05-06). On a non-fast-forward, fetch + rebase before retrying;
# if the rebase itself fails (conflicts), abort fast — leaving a
# half-rebased repo in place would just hide the real error.
push_ok=0
for attempt in 1 2 3; do
  if git push origin "$BRANCH"; then
    push_ok=1
    break
  fi
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  if ! git fetch origin "$BRANCH"; then
    echo "::error::git fetch origin ${BRANCH} failed during retry"
    exit 1
  fi
  if ! git pull --rebase origin "$BRANCH"; then
    echo "::error::git rebase against origin/${BRANCH} failed (conflicts) — aborting"
    git rebase --abort 2>/dev/null || true
    exit 1
  fi
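
The hunk above is cut off before the end of the loop. As a rough sketch of how the whole push retry could fit together, the function below wraps the same steps. The name `push_with_rebase` and the `RETRY_DELAY_UNIT` variable are hypothetical, and the sketch deviates from the visible hunk in one respect: it skips the fetch and rebase after the final failed push, since no further push would follow.

```shell
#!/usr/bin/env bash
# Sketch only: push_with_rebase and RETRY_DELAY_UNIT are hypothetical names,
# not taken from the PR. Usage: push_with_rebase BRANCH [MAX_ATTEMPTS]
push_with_rebase() {
  local branch=$1 max=${2:-3} unit=${RETRY_DELAY_UNIT:-5} attempt
  for ((attempt = 1; attempt <= max; attempt++)); do
    if git push origin "$branch"; then
      return 0
    fi
    # Final attempt failed: nothing left to retry, so skip rebase and sleep.
    if ((attempt == max)); then
      break
    fi
    echo "::warning::git push failed (attempt ${attempt}/${max}), rebasing before retry" >&2
    if ! git fetch origin "$branch"; then
      echo "::error::git fetch origin ${branch} failed during retry" >&2
      return 1
    fi
    if ! git pull --rebase origin "$branch"; then
      echo "::error::rebase against origin/${branch} failed, aborting" >&2
      # Leave the working tree clean instead of half-rebased.
      git rebase --abort 2>/dev/null || true
      return 1
    fi
    sleep $((attempt * unit))
  done
  echo "::error::git push failed after ${max} attempts" >&2
  return 1
}
```

A workflow step could then call `push_with_rebase "$BRANCH" || exit 1`, keeping the hard-fail behaviour after the last attempt.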
Summary
Two minimal resilience patches for the dominant transient failures observed in the daily-regen audit (2026-05-06):
1. `impl-review.yml`, "Extract PR info" step (5 failures in 24h). Wrap `gh pr view` in a 3-attempt retry with linear backoff (5s, then 10s). When the GitHub API blips, the entire review job aborts and the PR ends up unlabeled, blocking the review → repair → merge cascade. This concretely caused 4 of the 14 stuck PRs we recovered today (feat(highcharts): implement swarm-basic #5696, feat(bokeh): implement heatmap-annotated #5789, feat(matplotlib): implement line-multi #5796, and feat(pygal): implement line-multi #5801 had no labels at all because review never made it past step 1).
2. `impl-generate.yml`, "Create library metadata file" step (6 failures in 24h). Wrap the final `git push origin "$BRANCH"` in a 3-attempt retry that does `fetch + rebase` between attempts. The dominant failure mode is racing against Claude's earlier push to the same branch: when the metadata commit hits a non-fast-forward, the whole generation aborts and the PR never opens.

Both fixes are inline bash retries, with no new action dependency. Each step still hard-fails after 3 attempts so persistent issues still surface (we don't want to mask real bugs).
Out of scope (deferred)
- `daily-regen.yml` "pick" job clean exit: the existing logic already writes `count=0` when no specs are eligible, and downstream is gated on `if: needs.pick.outputs.count != '0'`. The 2 reported "cancellations" in the audit period look unrelated to the pick step itself (likely scheduler-level events).
- `ai-review-failed`: would require a more invasive `if: failure()` job-level step. Holding until we see whether the Extract-PR-info retry alone reduces the rate enough.

Context
This branch is the Stage 5 follow-up to today's recovery work, which manually shepherded 13 of 14 stuck PRs through the review/repair/merge pipeline. The full investigation + recovery plan lives at `/home/tirao/.claude/plans/bitte-schaue-dir-alle-peppy-bunny.md` (local).
Test plan
🤖 Generated with Claude Code