fix(workflows): make impl pipeline resilient to transient Claude failures #5410
MarkusNeusinger merged 2 commits into main
Conversation
…ures

Three resilience gaps were leaving PRs stuck after a single Claude Code Action hiccup:

- impl-generate: only allowed 2 attempts before declaring `failed`. Bumped cap to 3, aligning with the existing "max retries exhausted (3 attempts)" label description and the repair phase's 3-attempt budget.
- impl-repair: had no failure handler — if the repair workflow itself crashed, `ai-rejected` was already removed and re-review never fired, leaving the PR silently stuck. Added a handler that restores the label and auto-retries once via a marker comment, then falls back to manual.
- impl-review: both failure paths (Claude crash + score=0 from missing quality_score.txt) immediately surfaced `ai-review-failed`, requiring manual rerun. Both now auto-retry once via repository_dispatch with a shared marker comment before giving up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
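For orientation, the repair failure handler described in this commit follows the usual `if: failure()` recovery-step shape. A minimal sketch, not the PR's exact YAML: the step name, marker text, `repair-pr` event type, and `DISPATCH_TOKEN` secret are all illustrative assumptions (only `review-pr` appears verbatim later in this PR). It also assumes a PAT or app token, since events created with the default `GITHUB_TOKEN` do not trigger new workflow runs.

```yaml
# Sketch only: names, marker, and event type are assumptions, not the PR's YAML.
- name: Handle repair failure
  if: failure()
  env:
    GH_TOKEN: ${{ secrets.DISPATCH_TOKEN }}  # PAT/app token; the default
                                             # GITHUB_TOKEN can't trigger workflows
    PR_NUM: ${{ github.event.client_payload.pr_number }}
  run: |
    MARKER="<!-- impl-repair-auto-retry -->"
    # Restore the label removed when repair started, so re-review can fire again
    gh pr edit "$PR_NUM" --repo "${{ github.repository }}" --add-label "ai-rejected"
    # Paginated marker count (gh >= 2.47 for --slurp; see review comments below)
    RETRY_COUNT=$(gh api --paginate --slurp \
      "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
      --jq "[.[][] | select(.body | contains(\"$MARKER\"))] | length")
    if [ "${RETRY_COUNT:-0}" -eq 0 ]; then
      gh pr comment "$PR_NUM" --repo "${{ github.repository }}" \
        --body "$MARKER Repair crashed; auto-retrying once."
      gh api "repos/${{ github.repository }}/dispatches" \
        -f event_type=repair-pr -f "client_payload[pr_number]=$PR_NUM"
    else
      echo "::warning::Auto-retry already used for PR #$PR_NUM; manual rerun needed."
    fi
```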
Pull request overview
This PR hardens the implementation automation workflows against transient Claude Code Action failures so that generate/review/repair loops self-heal instead of leaving issues/PRs stuck.
Changes:
- Increase `impl-generate` failure retry budget to 3 attempts and update failure messaging accordingly.
- Add a failure handler to `impl-repair` to restore `ai-rejected` and auto-retry a crashed repair attempt once.
- Add one-time auto-retry behavior to `impl-review` for both Claude crashes and missing/invalid review outputs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/impl-generate.yml | Raises generation retry cap to 3 and updates attempt numbering/messages. |
| .github/workflows/impl-repair.yml | Adds failure handling to reapply `ai-rejected` and auto-retry a crashed repair attempt once. |
| .github/workflows/impl-review.yml | Adds marker-based one-time auto-retry via `repository_dispatch` for crash and score=0 failure paths. |
```bash
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```
The retry guard counts marker comments via gh api .../comments but does not paginate (GitHub API defaults to 30 comments). On PRs with >30 comments, the marker may be older than the first page, causing RETRY_COUNT to incorrectly be 0 and potentially triggering repeated “first” auto-retries.
Consider adding pagination (e.g., gh api --paginate and/or per_page=100) when counting markers, or track the auto-retry via a label/check that can’t be pushed out of the first page.
Suggested change:
```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(gh api --paginate --slurp "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
+  --jq "[.[][] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```
```bash
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```
Same pagination issue as above: RETRY_COUNT is computed from the first page of PR comments only. If the marker comment is older than the most recent 30 comments, this step will incorrectly think no auto-retry has occurred and will re-dispatch again.
Recommendation: use a paginated comments fetch when searching for the marker, or persist retry state in a label.
Suggested change:
```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments" 2>/dev/null \
+  | jq -s --arg marker "$MARKER" '[.[][] | select(.body != null and (.body | contains($marker)))] | length' \
+  || echo "0")
```
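Worth noting why both suggested variants use the double iteration `.[][]`: for an array endpoint, `--paginate` emits one JSON array per page, back to back. `--slurp` (or piping through `jq -s`) wraps those page arrays in an outer array, so `.[][]` walks every comment across every page. A quick shape check, illustrative:

```bash
# --paginate output for an array endpoint looks like: [ ...page 1... ][ ...page 2... ]
# jq -s turns that into [[...],[...]], so .[][] flattens to individual comments.
gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
  | jq -s '[.[][]] | length'   # total comments across all pages
```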
```bash
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```
RETRY_COUNT is derived from gh api .../comments without pagination (default 30). If the retry marker comment is older than the most recent page, this will miscount retries and can cause extra unintended auto-repair reruns.
Suggestion: fetch comments with pagination when searching for the marker (or store retry state in a label) to keep the retry cap reliable on long-lived PRs.
Suggested change:
```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(
+  (
+    gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
+      --jq ".[] | select(.body != null and (.body | contains(\"$MARKER\"))) | 1" 2>/dev/null || true
+  ) | wc -l | tr -d '[:space:]'
+)
```
```bash
gh api repos/${{ github.repository }}/dispatches \
  -f event_type=review-pr \
  -f "client_payload[pr_number]=$PR_NUM"
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
```
In the “first failure” path this step triggers a re-review via repository_dispatch but then exits successfully, so the workflow run can show as green even though no valid score/verdict was produced. That can be misleading and can also mask a dispatch failure (which would keep the PR stuck).
Consider exiting non-zero after dispatching the retry (without adding ai-review-failed) so the run status accurately reflects that the review did not complete, while still allowing the auto-retry to proceed.
| echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM" | |
| echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM" | |
| exit 1 |
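For context, this dispatch only self-heals because `impl-review.yml` also listens for the `review-pr` event; the receiving side is the standard `repository_dispatch` trigger. A minimal sketch (job body elided, job name illustrative):

```yaml
# Receiving side of the auto-retry dispatch (sketch).
on:
  repository_dispatch:
    types: [review-pr]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - name: Resolve PR number from the dispatch payload
        run: echo "PR_NUM=${{ github.event.client_payload.pr_number }}" >> "$GITHUB_ENV"
```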
```bash
gh api repos/${{ github.repository }}/dispatches \
  -f event_type=review-pr \
  -f "client_payload[pr_number]=$PR_NUM"
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
```
After dispatching the auto-retry, this step exits successfully, which means the overall impl-review workflow can report success even though the AI review actually crashed and no verdict was produced. If this workflow is used as a required status check, it could incorrectly unblock merges.
Consider failing the job/run after scheduling the retry (or otherwise emitting a failing conclusion) while still avoiding the ai-review-failed label on the first crash.
…etry

Address Copilot review on #5410:

- Marker counts (impl-generate / impl-repair / impl-review) used the default `gh api .../comments` page size of 30. On long-lived issues/PRs the marker could fall off the first page, causing repeated "first" auto-retries. Switched to `--paginate` with `per_page=100` so the count is reliable.
- impl-review's auto-retry paths now `exit 1` after dispatching the retry, so the run status accurately reflects that no verdict was produced and any dispatch failure stays visible. The auto-retry runs in a separate workflow run, so this doesn't break the recovery chain.
- Also paginate the failure-marker count in impl-generate (same bug class; not flagged by Copilot, but the same fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ompts (#5520)

## Summary

Three workflows (`impl-review.yml`, `spec-create.yml`, `report-validate.yml`) used shell-style `$VAR` inside `with: prompt: |` blocks of `claude-code-action`. That block is a YAML string handed to a Node/Bun action — **no shell ever runs** — so `$VAR` was sent to Claude as a literal placeholder instead of the actual value. Result: Claude couldn't reliably identify the PR / spec / library to review and silently produced no `quality_score.txt`, which the validate step turns into `ai-review-failed`.

## Symptoms observed today (2026-04-29)

5 stuck implementation PRs from 2026-04-27, all with `ai-review-failed` despite the prior fixes branch (#5410) and the audit branch (#5515) landing in between:

| PR | Branch | Pre-fix labels |
|----|--------|----------------|
| #5476 | seaborn/marimekko-basic | `ai-review-failed`, `quality:78` |
| #5480 | altair/marimekko-basic | `ai-review-failed`, `quality:82` |
| #5481 | letsplot/marimekko-basic | `ai-rejected`, `quality:76` |
| #5483 | plotnine/marimekko-basic | `ai-review-failed` |
| #5486 | plotly/line-basic | `ai-review-failed` |

Re-dispatching review on each confirmed the bug: the run log of `Run AI Quality Review` shows the prompt being passed verbatim:

```
PROMPT: Read prompts/workflow-prompts/ai-quality-review.md and follow those instructions.
Variables for this run:
- LIBRARY: $LIBRARY      # ← literal, never expanded
- SPEC_ID: $SPEC_ID
- PR_NUMBER: $PR_NUMBER
- ATTEMPT: $ATTEMPT
```

Claude's review then either ran for ~20s and exited with no `quality_score.txt` (4 PRs failed), or recovered by inferring values from cwd (1 PR succeeded with `quality:82`). The intermittent pattern is exactly what you'd expect from "the prompt is ambiguous and Claude has to guess from context."

## Root cause

Commit `252977cf3` ("chore: fix critical audit findings", 2026-04-28 22:46) routed several `${{ github.event.* }}` and step-output values through step-level `env:` and rewrote the in-prompt references as `$VAR`. That is the correct mitigation for `run:` shell steps and Python heredocs in the same workflows (and those changes stay in place). Inside `with: prompt: |` it is the wrong tool: the value is consumed by a JS action, not a shell, so there is no injection surface to mitigate and `$VAR` does not interpolate (see the contrast sketch after this section).

`spec-create.yml` and `report-validate.yml` carry the identical anti-pattern in their `prompt:` blocks. They haven't surfaced as failures yet only because no triggering issue has come in since 2026-04-28.

## The fix

Revert **only** the descriptive header lines of each `prompt:` block back to GitHub Actions expression syntax (`${{ ... }}`), which the runner substitutes into the YAML string before the action receives it. Keep:

- All `env:` blocks (harmless; lets future prompt content reference env vars if useful)
- All `$VAR` references inside **embedded bash code samples** in the prompt (e.g. `gh issue edit $ISSUE_NUMBER`). Those are executed by Claude's Bash tool, which inherits the step `env:` and expands them correctly — and rewriting them would re-enable the injection vector the audit was right to close.
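To make the distinction concrete, here is a side-by-side sketch (step layout and action reference are illustrative, not copied from the workflows): a shell expands `$VAR` in `run:` scripts, while a `with:` input reaches the action verbatim, so only `${{ ... }}` is substituted by the runner.

```yaml
steps:
  # A shell runs `run:` scripts, so $LIBRARY expands from env at runtime,
  # and keeping untrusted values in env (not inlined) avoids script injection.
  - name: Shell step
    env:
      LIBRARY: ${{ steps.pr.outputs.library }}
    run: echo "LIBRARY: $LIBRARY"

  # `with:` inputs are handed to a JS action with no shell in between, so
  # "$LIBRARY" would arrive as a literal string. Only ${{ }} is replaced,
  # by the runner, before the action receives the YAML string.
  - name: AI review step
    uses: anthropics/claude-code-action@v1   # action reference illustrative
    with:
      prompt: |
        Variables for this run:
        - LIBRARY: ${{ steps.pr.outputs.library }}
```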
The revert in `impl-review.yml`:

```diff
 Variables for this run:
- - LIBRARY: $LIBRARY
- - SPEC_ID: $SPEC_ID
- - PR_NUMBER: $PR_NUMBER
- - ATTEMPT: $ATTEMPT
+ - LIBRARY: ${{ steps.pr.outputs.library }}
+ - SPEC_ID: ${{ steps.pr.outputs.specification_id }}
+ - PR_NUMBER: ${{ steps.pr.outputs.pr_number }}
+ - ATTEMPT: ${{ steps.attempts.outputs.display }}
```

(analogous 8-line revert in `spec-create.yml` × 2 prompt blocks and 4-line revert in `report-validate.yml`). Diff total: **3 files, ±16 lines**.

## Test plan

- [ ] After merge, redispatch `impl-review.yml` for the 4 stuck PRs (`gh workflow run impl-review.yml -f pr_number=<N>` for 5476, 5483, 5486; 5480 already got an 82 in the redispatch and should now stabilize)
- [ ] Verify each run's `Run AI Quality Review` step log shows real values (e.g. `- LIBRARY: plotly`) in the PROMPT echo, not `$LIBRARY`
- [ ] Verify `quality_score.txt` is produced and the `ai-review-failed` label is removed
- [ ] On next `spec-request`-labeled issue, verify the spec-create prompt sees the issue title/body
- [ ] On next `report-pending`-labeled issue, verify the report-validate prompt sees the issue title/body

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary
The implementation pipeline was leaving PRs and issues stuck after a single Claude Code Action hiccup. Three fixes restore self-healing behavior:
- `impl-generate.yml`: cap raised from 2 → 3 generation attempts, aligning with the existing `impl:{lib}:failed` label description ("max retries exhausted (3 attempts)") and the repair phase's 3-attempt budget. Failure comments now read `Attempt N/3`.
- `impl-repair.yml`: previously had no failure handler — when the Claude Code Action itself crashed, the workflow ended with `ai-rejected` already removed and re-review never fired, leaving the PR silently stuck. Added a `Handle repair failure` step that restores `ai-rejected` and auto-retries the same attempt once via a marker comment, then falls back to manual.
- `impl-review.yml`: both failure paths (Claude crash → `Handle review failure`, and score=0 from missing `quality_score.txt` → `Validate review output`) immediately surfaced `ai-review-failed`, requiring manual rerun. Both now auto-retry once via `repository_dispatch` with a shared marker comment before giving up.

The `>=50% after 3 attempts` merge logic in `impl-review.yml` was already correct and is unchanged — these fixes only ensure PRs reach that gate instead of stalling earlier.

Concrete trigger (not added to the PR but motivated it)

Issue #5365 (`area-mountain-panorama`) had 4/9 libraries hard-failed without ever creating a PR (transient Claude crashes during generate, capped at 2 attempts), 1 PR stuck with `ai-review-failed` (plotnine #5372), and 1 PR stuck mid-repair (altair #5370 — the repair workflow itself crashed on attempt 1). Manual recovery was triggered earlier in the conversation.

Test plan

- Repair crash: the new failure handler restores `ai-rejected` and auto-retries the same attempt once via marker comment
- Review crash: re-dispatches via `repository_dispatch` once before adding `ai-review-failed`
- Review score=0 from missing `quality_score.txt`: same auto-retry behavior

🤖 Generated with Claude Code