chore(workflows): retry transient failures in impl-generate + impl-review #5819

Merged
MarkusNeusinger merged 4 commits into main from chore/regen-pipeline-hardening
May 6, 2026

Conversation

@MarkusNeusinger
Owner

Summary

Two minimal resilience patches for the dominant transient failures observed in the daily-regen audit (2026-05-06):

  1. impl-review.yml, "Extract PR info" step (5 failures in 24h). Wrap `gh pr view` in a 3-attempt retry with linear backoff (5s, then 10s). When the GitHub API blips, the entire review job aborts and the PR ends up unlabeled, blocking the review → repair → merge cascade. This caused 4 of the 14 stuck PRs we recovered today: feat(highcharts): implement swarm-basic #5696, feat(bokeh): implement heatmap-annotated #5789, feat(matplotlib): implement line-multi #5796, and feat(pygal): implement line-multi #5801 had no labels at all because review never made it past step 1.

  2. impl-generate.yml, "Create library metadata file" step (6 failures in 24h). Wrap the final `git push origin "$BRANCH"` in a 3-attempt retry that does fetch + rebase between attempts. The dominant failure mode is a race against Claude's earlier push to the same branch: when the metadata commit hits a non-fast-forward rejection, the whole generation aborts and the PR never opens.

Both fixes are inline bash retries — no new action dependency. Each step still hard-fails after 3 attempts so persistent issues still surface (we don't want to mask real bugs).
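
The inline retry pattern described above can be sketched as a small bash helper. This is a minimal illustration under stated assumptions, not the code merged in this PR: the function name `retry_linear` and the `RETRY_BASE_DELAY` variable are hypothetical, and the delay schedule (5s, then 10s) follows the linear backoff shown in the review snippets further down.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): run a command up to N times,
# sleeping attempt * base_delay seconds between attempts, and fail hard
# once every attempt has failed so persistent problems still surface.
retry_linear() {
  local max_attempts="$1"
  shift
  local base_delay="${RETRY_BASE_DELAY:-5}"  # assumed default of 5 seconds
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0
    fi
    echo "::warning::command failed (attempt ${attempt}/${max_attempts}): $*" >&2
    # No sleep after the final attempt: nothing is left to wait for.
    if (( attempt < max_attempts )); then
      sleep $(( attempt * base_delay ))
    fi
  done
  echo "::error::command failed after ${max_attempts} attempts: $*" >&2
  return 1
}
```

For example, `retry_linear 3 gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body` would try the call up to three times and return non-zero if all three fail, so the surrounding step still hard-fails.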

Out of scope (deferred)

  • daily-regen.yml "pick" job clean exit: the existing logic already writes count=0 when no specs are eligible, and downstream is gated on if: needs.pick.outputs.count != '0'. The 2 reported "cancellations" in the audit period look unrelated to the pick step itself (likely scheduler-level events).
  • Auto-retry on ai-review-failed: would require a more invasive if: failure() job-level step. Holding until we see whether the Extract-PR-info retry alone reduces the rate enough.
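
The `count=0` gating mentioned in the first bullet can be pictured with a short sketch. It is an assumption about the general shape, not the actual daily-regen.yml step: `write_count` and its argument (a newline-separated list of eligible specs) are hypothetical names, while `GITHUB_OUTPUT` is the file GitHub Actions provides for step outputs.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: not the real "pick" step from daily-regen.yml.
# Writes the number of eligible specs as a step output named "count".
write_count() {
  local specs="$1"  # newline-separated spec ids; may be empty
  local count
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  count=$(printf '%s\n' "$specs" | grep -c . || true)
  echo "count=${count}" >> "$GITHUB_OUTPUT"
}
```

If the job exposes that step output as a job output, a downstream job gated on `if: needs.pick.outputs.count != '0'` is skipped when no specs are eligible, and the pick job itself still exits successfully.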

Context

This branch is the Stage 5 follow-up to today's recovery work, which manually shepherded 13 of 14 stuck PRs through the review/repair/merge pipeline. The full investigation + recovery plan lives at `/home/tirao/.claude/plans/bitte-schaue-dir-alle-peppy-bunny.md` (local).

Test plan

  • YAML-validate both edited workflows (passes)
  • No-op for the happy path: first attempt of each retry preserves existing behavior exactly
  • CI green on this PR
  • Post-merge: watch the next 24h of impl-generate / impl-review run conclusions; expect failure rate to drop from ~10% to near 0% on these two steps

🤖 Generated with Claude Code

…view

Two minimal resilience fixes for the failures observed in the audit
investigation on 2026-05-06:

1. impl-review.yml — "Extract PR info" step (5 failures in 24h):
   wrap `gh pr view` in a 3-attempt retry with backoff. When the
   GitHub API blips, the whole review job currently aborts and the
   PR ends up unlabeled, blocking review→repair→merge.

2. impl-generate.yml — "Create library metadata file" step (6
   failures in 24h): wrap the final `git push origin "$BRANCH"`
   in a 3-attempt retry that does fetch + rebase between
   attempts. The dominant failure mode is racing against Claude's
   earlier push to the same branch.

Both fixes are inline bash retries — no new action dependency.
Each step still hard-fails after 3 attempts so persistent issues
still surface.

Out of scope (deferred): the `daily-regen.yml` "pick" job already
exits cleanly with `count=0` on no eligible specs (downstream is
gated `if: count != '0'`); the 2 reported "cancellations" were
unrelated scheduler-level events, not pick-step bugs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 22:15
Contributor

Copilot AI left a comment

Pull request overview

This PR improves resilience of the implementation automation pipeline by adding inline retry/backoff logic around two common transient failure points in the impl-generate → impl-review cascade.

Changes:

  • Add a 3-attempt retry loop with backoff around gh pr view in impl-review.yml to reduce failures from transient GitHub API errors.
  • Add a 3-attempt retry loop around git push in impl-generate.yml, rebasing between attempts to mitigate non-fast-forward races when multiple agents push to the same branch.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| .github/workflows/impl-review.yml | Retries PR metadata extraction (gh pr view) to avoid aborting review jobs on transient API blips. |
| .github/workflows/impl-generate.yml | Retries the metadata commit push with fetch+rebase between attempts to reduce non-fast-forward push failures. |

Comment thread .github/workflows/impl-review.yml Outdated
Comment on lines 59 to 72
  if PR_DATA=$(gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body 2>&1); then
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): ${PR_DATA}"
  PR_DATA=""
  sleep $((attempt * 5))
done
if [ -z "$PR_DATA" ]; then
  echo "::error::gh pr view failed after 3 attempts for PR #${PR_NUMBER}"
  exit 1
fi
HEAD_REF=$(echo "$PR_DATA" | jq -r '.headRefName')
HEAD_SHA=$(echo "$PR_DATA" | jq -r '.headRefOid')
BODY=$(echo "$PR_DATA" | jq -r '.body')
Comment thread .github/workflows/impl-review.yml Outdated
run: |
  PR_DATA=$(gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body)
  # Retry gh pr view — transient API blips were the dominant impl-review
  # failure mode (5x in 24h on 2026-05-06). Three attempts with backoff.
Comment thread .github/workflows/impl-review.yml Outdated
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): ${PR_DATA}"
  PR_DATA=""
  sleep $((attempt * 5))
Comment thread .github/workflows/impl-generate.yml Outdated
Comment on lines +552 to +563
# Retry git push — transient races against Claude's earlier push were
# the dominant "Create metadata file" failure mode (6x in 24h on
# 2026-05-06). On a non-fast-forward, fetch+rebase before retrying.
push_ok=0
for attempt in 1 2 3; do
  if git push origin "$BRANCH"; then
    push_ok=1
    break
  fi
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  git fetch origin "$BRANCH" || true
  git pull --rebase origin "$BRANCH" || true
Comment thread .github/workflows/impl-generate.yml Outdated
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  git fetch origin "$BRANCH" || true
  git pull --rebase origin "$BRANCH" || true
  sleep $((attempt * 5))
MarkusNeusinger and others added 2 commits May 7, 2026 00:20
Five suggestions, all applied (item 3 covers two of them, one per workflow):

1. impl-review.yml: separate stderr from stdout when capturing
   `gh pr view` output. The previous `2>&1` merge would have
   corrupted PR_DATA on success-with-warnings (rate-limit
   notices etc.), causing jq to fail with confusing input.

2. impl-review.yml: comment said "exponential backoff" but the
   sleep is linear (5s, 10s). Updated comment to match.

3. impl-review.yml + impl-generate.yml: skip the sleep on the
   final attempt — no point delaying before the hard-fail.

4. impl-generate.yml: fail fast when the rebase itself errors
   (conflicts, dirty state) instead of swallowing with `|| true`
   and retrying on a half-rebased repo. Includes a `rebase
   --abort` guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
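
The stderr/stdout point in item 1 can be reproduced without calling GitHub at all. The sketch below is illustrative only: `fake_gh` is a hypothetical stand-in for `gh pr view` that succeeds while printing a notice on stderr, which is the situation the commit message describes.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for `gh pr view`: succeeds, but emits a notice on
# stderr alongside the JSON on stdout.
fake_gh() {
  echo "warning: rate limit is low" >&2
  echo '{"headRefName":"feature"}'
}

# Merged capture: the stderr notice ends up inside the variable, so the
# value is no longer valid JSON.
merged=$(fake_gh 2>&1)

# Separate capture: stdout and stderr go to different files, so the JSON
# stays clean and the notice is still available for logging.
out_file=$(mktemp)
err_file=$(mktemp)
fake_gh > "$out_file" 2> "$err_file"
separate=$(cat "$out_file")

echo "merged capture:   ${merged}"
echo "separate capture: ${separate}"
echo "stderr was:       $(cat "$err_file")"
```

Only the separate capture holds a value that a JSON parser such as jq could read; the merged one starts with the warning text.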
Copilot AI review requested due to automatic review settings May 6, 2026 22:27
@MarkusNeusinger MarkusNeusinger enabled auto-merge (squash) May 6, 2026 22:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

    PR_DATA=$(cat /tmp/pr.json)
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): $(cat /tmp/pr.err)"
Comment on lines +56 to +67
# failure mode (5x in 24h on 2026-05-06). Three attempts with linear
# backoff (5s, 10s); keep stderr separate from stdout so warnings
# (rate-limit notices, etc.) don't corrupt the JSON jq parses below.
PR_DATA=""
for attempt in 1 2 3; do
  if gh pr view "$PR_NUMBER" --json headRefName,headRefOid,body > /tmp/pr.json 2> /tmp/pr.err; then
    PR_DATA=$(cat /tmp/pr.json)
    break
  fi
  echo "::warning::gh pr view failed (attempt ${attempt}/3): $(cat /tmp/pr.err)"
  if [ "$attempt" -lt 3 ]; then
    sleep $((attempt * 5))
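
The "linear backoff (5s, 10s)" wording in the comment above can be checked with a little arithmetic. The sketch below compares the `attempt * 5` schedule used in the snippet with a doubling schedule; the function names are hypothetical and exist only for this comparison.

```bash
#!/usr/bin/env bash
# Illustrative only: compare the linear schedule from the snippet
# (attempt * 5) with an exponential one (5 * 2^(attempt - 1)).
linear_delay() {
  echo $(( $1 * 5 ))
}

exponential_delay() {
  echo $(( 5 * (1 << ($1 - 1)) ))
}

for attempt in 1 2 3; do
  printf 'attempt %d: linear=%ds exponential=%ds\n' \
    "$attempt" "$(linear_delay "$attempt")" "$(exponential_delay "$attempt")"
done
```

The two schedules agree for attempts 1 and 2 (5s and 10s) and only diverge at attempt 3 (15s versus 20s). Because the loop skips the sleep after the third attempt, the waits actually performed are the same either way; the rename to "linear" is about describing `attempt * 5` accurately.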
Comment on lines +554 to +572
# 2026-05-06). On a non-fast-forward, fetch + rebase before retrying;
# if the rebase itself fails (conflicts), abort fast — leaving a
# half-rebased repo in place would just hide the real error.
push_ok=0
for attempt in 1 2 3; do
  if git push origin "$BRANCH"; then
    push_ok=1
    break
  fi
  echo "::warning::git push failed (attempt ${attempt}/3) — fetching + rebasing then retrying"
  if ! git fetch origin "$BRANCH"; then
    echo "::error::git fetch origin ${BRANCH} failed during retry"
    exit 1
  fi
  if ! git pull --rebase origin "$BRANCH"; then
    echo "::error::git rebase against origin/${BRANCH} failed (conflicts) — aborting"
    git rebase --abort 2>/dev/null || true
    exit 1
  fi
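
Since the excerpt above is cut off before the end of the loop, here is one way the whole push-with-rebase retry could look as a standalone function. It is a sketch under assumptions, not the merged workflow code: the function name `push_with_rebase` and the `RETRY_DELAY` variable are hypothetical, it uses `return` where the workflow uses `exit`, and it stops before fetching on the last attempt because no further push would follow.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a 3-attempt push that rebases onto the remote
# branch between attempts and fails fast if the rebase itself breaks.
push_with_rebase() {
  local branch="$1"
  local delay="${RETRY_DELAY:-5}"  # assumed base delay in seconds
  local attempt
  for attempt in 1 2 3; do
    if git push origin "$branch"; then
      return 0
    fi
    if [ "$attempt" -eq 3 ]; then
      break  # no push follows, so skip the fetch, rebase, and sleep
    fi
    echo "::warning::git push failed (attempt ${attempt}/3), fetching + rebasing then retrying"
    if ! git fetch origin "$branch"; then
      echo "::error::git fetch origin ${branch} failed during retry"
      return 1
    fi
    if ! git pull --rebase origin "$branch"; then
      echo "::error::rebase against origin/${branch} failed (conflicts), aborting"
      git rebase --abort 2>/dev/null || true
      return 1
    fi
    sleep $(( attempt * delay ))
  done
  echo "::error::git push failed after 3 attempts"
  return 1
}
```

In a workflow step this would be called as `push_with_rebase "$BRANCH" || exit 1`, so a conflict or a persistent rejection still fails the step instead of being masked.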
@MarkusNeusinger MarkusNeusinger disabled auto-merge May 6, 2026 22:37
@MarkusNeusinger MarkusNeusinger merged commit 6103152 into main May 6, 2026
6 checks passed
@MarkusNeusinger MarkusNeusinger deleted the chore/regen-pipeline-hardening branch May 6, 2026 22:37

2 participants