
fix(workflows): make impl pipeline resilient to transient Claude failures#5410

Merged
MarkusNeusinger merged 2 commits into main from fix/workflow-retry-resilience on Apr 25, 2026

Conversation

@MarkusNeusinger
Owner

Summary

The implementation pipeline was leaving PRs and issues stuck after a single Claude Code Action hiccup. Three fixes restore self-healing behavior:

  • impl-generate.yml: cap raised from 2 → 3 generation attempts, aligning with the existing impl:{lib}:failed label description ("max retries exhausted (3 attempts)") and the repair phase's 3-attempt budget. Failure comments now read Attempt N/3.
  • impl-repair.yml: previously had no failure handler — when the Claude Code Action itself crashed, the workflow ended with ai-rejected already removed and re-review never fired, leaving the PR silently stuck. Added a Handle repair failure step that restores ai-rejected and auto-retries the same attempt once via a marker comment, then falls back to manual.
  • impl-review.yml: both failure paths (Claude crash → Handle review failure, and score=0 from a missing quality_score.txt → Validate review output) immediately surfaced ai-review-failed, requiring a manual rerun. Both now auto-retry once via repository_dispatch with a shared marker comment before giving up.

The >=50% after 3 attempts merge logic in impl-review.yml was already correct and is unchanged — these fixes only ensure PRs reach that gate instead of stalling earlier.
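The one-shot guard the three workflows share can be sketched roughly as below. This is an illustrative sketch only: the marker string and function names are assumptions, not the workflows' exact code, and the real steps read comment bodies via `gh api` rather than stdin.

```shell
# Illustrative sketch of the marker-comment retry guard (marker string and
# function names are assumptions, not the workflows' exact code).
MARKER="<!-- auto-retry-marker -->"

# Count marker comments from comment bodies fed on stdin. In the workflow
# the bodies would come from something like:
#   gh api --paginate "repos/$REPO/issues/$PR_NUM/comments?per_page=100" --jq '.[].body'
count_markers() {
  grep -cF "$MARKER" || true   # grep -c prints 0 (and exits 1) on no match
}

# Auto-retry only when no marker exists yet, i.e. this is the first failure;
# a second failure falls through to the manual path.
should_retry() {
  [ "$1" -eq 0 ]
}
```

Posting the marker comment before dispatching the retry is what caps the loop at one automatic attempt.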

Concrete trigger (not added to the PR but motivated it)

Issue #5365 (area-mountain-panorama) had 4/9 libraries hard-failed without ever creating a PR (transient Claude crashes during generate, capped at 2 attempts), 1 PR stuck with ai-review-failed (plotnine #5372), and 1 PR stuck mid-repair (altair #5370 — repair workflow itself crashed on attempt 1). Manual recovery was triggered earlier in the conversation.

Test plan

  • Trigger a generate that fails twice (e.g., simulate or wait for transient flake) — should auto-retry to attempt 3 instead of stopping at 2
  • Trigger a repair where Claude Code Action crashes — should restore ai-rejected and auto-retry the same attempt once via marker comment
  • Trigger a review where Claude crashes — should auto-retry via repository_dispatch once before adding ai-review-failed
  • Trigger a review where Claude runs but writes no quality_score.txt — same auto-retry behavior
  • Verify markers prevent infinite retry loops (each marker only allows one auto-retry)

🤖 Generated with Claude Code

fix(workflows): make impl pipeline resilient to transient Claude failures

Three resilience gaps were leaving PRs stuck after a single Claude Code
Action hiccup:

- impl-generate: only allowed 2 attempts before declaring `failed`. Bumped
  cap to 3, aligning with the existing "max retries exhausted (3 attempts)"
  label description and the repair phase's 3-attempt budget.
- impl-repair: had no failure handler — if the repair workflow itself
  crashed, `ai-rejected` was already removed and re-review never fired,
  leaving the PR silently stuck. Added handler that restores the label and
  auto-retries once via a marker comment, then falls back to manual.
- impl-review: both failure paths (Claude crash + score=0 from missing
  quality_score.txt) immediately surfaced `ai-review-failed`, requiring
  manual rerun. Both now auto-retry once via repository_dispatch with a
  shared marker comment before giving up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 25, 2026 21:47
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens the implementation automation workflows against transient Claude Code Action failures so that generate/review/repair loops self-heal instead of leaving issues/PRs stuck.

Changes:

  • Increase impl-generate failure retry budget to 3 attempts and update failure messaging accordingly.
  • Add a failure handler to impl-repair to restore ai-rejected and auto-retry a crashed repair attempt once.
  • Add one-time auto-retry behavior to impl-review for both Claude crashes and missing/invalid review outputs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| `.github/workflows/impl-generate.yml` | Raises generation retry cap to 3 and updates attempt numbering/messages. |
| `.github/workflows/impl-repair.yml` | Adds failure handling to reapply `ai-rejected` and auto-retry a crashed repair attempt once. |
| `.github/workflows/impl-review.yml` | Adds marker-based one-time auto-retry via `repository_dispatch` for crash and score=0 failure paths. |

Comment thread .github/workflows/impl-review.yml Outdated
Comment on lines +181 to +182
```shell
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```

Copilot AI Apr 25, 2026


The retry guard counts marker comments via gh api .../comments but does not paginate (GitHub API defaults to 30 comments). On PRs with >30 comments, the marker may be older than the first page, causing RETRY_COUNT to incorrectly be 0 and potentially triggering repeated “first” auto-retries.

Consider adding pagination (e.g., gh api --paginate and/or per_page=100) when counting markers, or track the auto-retry via a label/check that can’t be pushed out of the first page.

Suggested change

```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(gh api --paginate --slurp "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
+  --jq "[.[][] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```

Comment thread .github/workflows/impl-review.yml Outdated
Comment on lines +471 to +472
```shell
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```

Copilot AI Apr 25, 2026


Same pagination issue as above: RETRY_COUNT is computed from the first page of PR comments only. If the marker comment is older than the most recent 30 comments, this step will incorrectly think no auto-retry has occurred and will re-dispatch again.

Recommendation: use a paginated comments fetch when searching for the marker, or persist retry state in a label.

Suggested change

```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments" 2>/dev/null \
+  | jq -s --arg marker "$MARKER" '[.[][] | select(.body != null and (.body | contains($marker)))] | length' \
+  || echo "0")
```

Comment thread .github/workflows/impl-repair.yml Outdated
Comment on lines +262 to +263
```shell
RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
```

Copilot AI Apr 25, 2026


RETRY_COUNT is derived from gh api .../comments without pagination (default 30). If the retry marker comment is older than the most recent page, this will miscount retries and can cause extra unintended auto-repair reruns.

Suggestion: fetch comments with pagination when searching for the marker (or store retry state in a label) to keep the retry cap reliable on long-lived PRs.

Suggested change

```diff
-RETRY_COUNT=$(gh api "repos/${{ github.repository }}/issues/${PR_NUM}/comments" \
-  --jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
+RETRY_COUNT=$(
+  (
+    gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
+      --jq ".[] | select(.body != null and (.body | contains(\"$MARKER\"))) | 1" 2>/dev/null || true
+  ) | wc -l | tr -d '[:space:]'
+)
```

```shell
gh api repos/${{ github.repository }}/dispatches \
  -f event_type=review-pr \
  -f "client_payload[pr_number]=$PR_NUM"
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
```

Copilot AI Apr 25, 2026


In the “first failure” path this step triggers a re-review via repository_dispatch but then exits successfully, so the workflow run can show as green even though no valid score/verdict was produced. That can be misleading and can also mask a dispatch failure (which would keep the PR stuck).

Consider exiting non-zero after dispatching the retry (without adding ai-review-failed) so the run status accurately reflects that the review did not complete, while still allowing the auto-retry to proceed.

Suggested change

```diff
 echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
+exit 1
```

Comment on lines +499 to +502
```shell
gh api repos/${{ github.repository }}/dispatches \
  -f event_type=review-pr \
  -f "client_payload[pr_number]=$PR_NUM"
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
```

Copilot AI Apr 25, 2026


After dispatching the auto-retry, this step exits successfully, which means the overall impl-review workflow can report success even though the AI review actually crashed and no verdict was produced. If this workflow is used as a required status check, it could incorrectly unblock merges.

Consider failing the job/run after scheduling the retry (or otherwise emitting a failing conclusion) while still avoiding the ai-review-failed label on the first crash.
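The fail-after-dispatch shape being suggested can be sketched as below. The function name and messages are illustrative assumptions, not the workflow's actual step; the argument stands in for the exit status of the preceding `gh api .../dispatches` call.

```shell
# Illustrative sketch only: names and messages are assumptions.
# $1 = exit status of the preceding `gh api .../dispatches` call.
finish_after_dispatch() {
  if [ "$1" -ne 0 ]; then
    echo "::error::repository_dispatch failed; review retry was NOT scheduled"
  else
    echo "::notice::Auto-re-triggered impl-review.yml"
  fi
  return 1   # either way, this run produced no verdict, so fail the step
}
```

Returning non-zero keeps a required review status check red until some run actually produces a verdict, while the dispatched retry proceeds in its own, separate workflow run.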

…etry

Address Copilot review on #5410:

- Marker counts (impl-generate / impl-repair / impl-review) used the default
  `gh api .../comments` page size of 30. On long-lived issues/PRs the marker
  could fall off the first page, causing repeated "first" auto-retries.
  Switched to `--paginate --per_page=100` so the count is reliable.
- impl-review's auto-retry paths now `exit 1` after dispatching the retry,
  so the run status accurately reflects that no verdict was produced and
  any dispatch failure stays visible. The auto-retry runs in a separate
  workflow run, so this doesn't break the recovery chain.
- Also paginate the failure-marker count in impl-generate (same bug class,
  not flagged by Copilot but same fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarkusNeusinger MarkusNeusinger merged commit 048f38f into main Apr 25, 2026
7 checks passed
@MarkusNeusinger MarkusNeusinger deleted the fix/workflow-retry-resilience branch April 25, 2026 22:00
MarkusNeusinger added a commit that referenced this pull request Apr 29, 2026
…ompts (#5520)

## Summary

Three workflows (`impl-review.yml`, `spec-create.yml`,
`report-validate.yml`) used shell-style `$VAR` inside `with: prompt: |`
blocks of `claude-code-action`. That block is a YAML string handed to a
Node/Bun action — **no shell ever runs**, so `$VAR` was sent to Claude
as a literal placeholder instead of the actual value. Result: Claude
couldn't reliably identify the PR / spec / library to review and
silently produced no `quality_score.txt`, which the validate step turns
into `ai-review-failed`.

## Symptoms observed today (2026-04-29)

5 stuck implementation PRs from 2026-04-27, all with `ai-review-failed`
despite the prior fixes branch (#5410) and the audit branch (#5515)
landing in between:

| PR | Branch | Pre-fix labels |
|----|--------|----------------|
| #5476 | seaborn/marimekko-basic | `ai-review-failed`, `quality:78` |
| #5480 | altair/marimekko-basic | `ai-review-failed`, `quality:82` |
| #5481 | letsplot/marimekko-basic | `ai-rejected`, `quality:76` |
| #5483 | plotnine/marimekko-basic | `ai-review-failed` |
| #5486 | plotly/line-basic | `ai-review-failed` |

Re-dispatching review on each confirmed the bug: the run log of `Run AI
Quality Review` shows the prompt being passed verbatim:

```
PROMPT: Read prompts/workflow-prompts/ai-quality-review.md and follow those instructions.

Variables for this run:
- LIBRARY: $LIBRARY    # ← literal, never expanded
- SPEC_ID: $SPEC_ID
- PR_NUMBER: $PR_NUMBER
- ATTEMPT: $ATTEMPT
```

Claude's review then either ran for ~20s and exited with no
`quality_score.txt` (4 PRs failed), or recovered by inferring values
from cwd (1 PR succeeded with `quality:82`). The intermittent pattern is
exactly what you'd expect from "the prompt is ambiguous and Claude has
to guess from context."

## Root cause

Commit `252977cf3` ("chore: fix critical audit findings", 2026-04-28
22:46) routed several `${{ github.event.* }}` and step-output values
through step-level `env:` and rewrote the in-prompt references as
`$VAR`. That is the correct mitigation for `run:` shell steps and Python
heredocs in the same workflows (and those changes stay in place). Inside
`with: prompt: |` it is the wrong tool: the value is consumed by a JS
action, not a shell, so there is no injection surface to mitigate and
`$VAR` does not interpolate.
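The two contexts can be illustrated side by side. This is a minimal sketch, not the repository's actual workflow; the step names are hypothetical and the action version pin is assumed.

```yaml
# A `run:` step goes through a shell, so env + $VAR is the safe pattern:
- name: Shell step (hypothetical)
  env:
    PR_NUMBER: ${{ github.event.pull_request.number }}
  run: echo "Reviewing PR #$PR_NUMBER"   # the shell expands $PR_NUMBER

# A `with: prompt:` block is consumed by the JS action -- no shell runs, so
# only ${{ }} expressions are substituted (by the runner, before the action):
- name: Claude step (hypothetical)
  uses: anthropics/claude-code-action@v1   # version pin assumed
  with:
    prompt: |
      PR_NUMBER: ${{ github.event.pull_request.number }}  # interpolated
      # A bare $PR_NUMBER here would reach Claude as a literal string
```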

`spec-create.yml` and `report-validate.yml` carry the identical
anti-pattern in their `prompt:` blocks. They haven't surfaced as
failures yet only because no triggering issue has come in since
2026-04-28.

## The fix

Revert **only** the descriptive header lines of each `prompt:` block
back to GitHub Actions Expression syntax (`${{ ... }}`), which the
runner substitutes into the YAML string before the action receives it.
Keep:

- All `env:` blocks (harmless; lets future prompt content reference env
vars if useful)
- All `$VAR` references inside **embedded bash code samples** in the
prompt (e.g. `gh issue edit $ISSUE_NUMBER`). Those are executed by
Claude's Bash tool which inherits the step `env:` and expands them
correctly — and rewriting them would re-enable the injection vector the
audit was right to close.

```diff
             Variables for this run:
-            - LIBRARY: $LIBRARY
-            - SPEC_ID: $SPEC_ID
-            - PR_NUMBER: $PR_NUMBER
-            - ATTEMPT: $ATTEMPT
+            - LIBRARY: ${{ steps.pr.outputs.library }}
+            - SPEC_ID: ${{ steps.pr.outputs.specification_id }}
+            - PR_NUMBER: ${{ steps.pr.outputs.pr_number }}
+            - ATTEMPT: ${{ steps.attempts.outputs.display }}
```

(analogous 8-line revert in `spec-create.yml` × 2 prompt blocks and
4-line revert in `report-validate.yml`).

Diff total: **3 files, ±16 lines**.

## Test plan

- [ ] After merge, redispatch `impl-review.yml` for the 4 stuck PRs (`gh
workflow run impl-review.yml -f pr_number=<N>` for 5476, 5483, 5486;
5480 already got a 82 in the redispatch and should now stabilize)
- [ ] Verify each run's `Run AI Quality Review` step log shows real
values (e.g. `- LIBRARY: plotly`) in the PROMPT echo, not `$LIBRARY`
- [ ] Verify `quality_score.txt` is produced and `ai-review-failed`
label is removed
- [ ] On next `spec-request`-labeled issue, verify the spec-create
prompt sees the issue title/body
- [ ] On next `report-pending`-labeled issue, verify the report-validate
prompt sees the issue title/body

🤖 Generated with [Claude Code](https://claude.com/claude-code)