feat(workflows): add review-retry listener and stuck-jobs watchdog #6084
Merged
Two new safety nets close the gap where the impl pipeline silently leaves
PRs and issues in a stuck state when a step crashes or times out:

- impl-review-retry.yml: listens for the `ai-review-failed` label and
  re-dispatches impl-review.yml exactly once. Bounded by an
  `ai-review-rescued` marker label so we never loop.
- watchdog-stuck-jobs.yml: cron every 6h (plus manual dispatch with
  optional dry_run). Catches three failure modes the regular workflows
  miss:
  1. PRs with `ai-review-failed` (rescue-listener safety net)
  2. PRs with `ai-attempt-N` + `quality:*` and no decision label:
     repair handoff crashed (e.g. PR #6002 today, altair attempt-1
     died and the PR sat for 16h with no further action).
  3. spec-ready issues with `generate:<lib>` or `impl:<lib>:failed`
     and no open PR for that (spec, lib) pair (e.g. #5237 plotnine
     failed; #5240 plotnine never generated).

Per-cause retries are bounded by marker labels (`ai-review-rescued`,
`watchdog:repair-rescued-<N>`, `watchdog:retried-<lib>`) so a stuck
state escalates to a `::warning::` for human attention rather than
looping forever.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
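The marker-label bound described above can be sketched as a small decision function. This is a minimal illustration, not the code from either workflow: the names `retry_decision` and `handle_stuck` are hypothetical, and the labels are assumed to arrive as a newline-separated list.

```bash
# Hypothetical helper: decide whether a stuck item may still be retried.
#   $1 = marker label that bounds the retry (e.g. "ai-review-rescued")
#   $2 = newline-separated list of labels currently on the PR or issue
# Prints "dispatch" when the marker is absent, "warn" when it is present.
retry_decision() {
  local marker="$1" labels="$2"
  # -F: fixed string, -x: whole-line match, so "foo" never matches "foo-bar".
  if grep -Fxq -- "$marker" <<<"$labels"; then
    echo "warn"
  else
    echo "dispatch"
  fi
}

# Hypothetical wrapper showing how the decision could be acted on.
# It only prints what would happen; no GitHub calls are made here.
handle_stuck() {
  local what="$1" marker="$2" labels="$3"
  case "$(retry_decision "$marker" "$labels")" in
    dispatch) echo "would dispatch retry for $what and add label $marker" ;;
    warn)     echo "::warning::$what already has $marker; needs human attention" ;;
  esac
}
```

Matching the whole label line matters because the per-library markers share a prefix: a check for `watchdog:retried-<lib>` should not be satisfied by a longer label that merely starts with the same text.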
Pull request overview
Adds two GitHub Actions “safety net” workflows to prevent the implementation pipeline from silently getting stuck by re-dispatching review/repair/generation in bounded, label-marked ways.
Changes:
- Added a PR-label listener workflow that re-dispatches `impl-review.yml` once when `ai-review-failed` is applied.
- Added a scheduled/manual watchdog workflow that scans open implementation PRs and `spec-ready` issues to detect stuck states and dispatch appropriate workflows, using marker labels for bounded retries.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `.github/workflows/watchdog-stuck-jobs.yml` | New scheduled/manual watchdog that scans for stuck PRs/issues and dispatches review/repair/generation, using marker labels to bound retries. |
| `.github/workflows/impl-review-retry.yml` | New `pull_request:labeled` listener that re-dispatches `impl-review.yml` once for `ai-review-failed`, guarded by an `ai-review-rescued` marker. |
Comment on lines +180 to +184

    if [[ "$open_pr" == "0" ]]; then
      dispatch "Issue #$num: generate (stuck pending) $lib" \
        bulk-generate.yml \
        -f specification_id="$spec_id" \
        -f library="$lib"
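The excerpt above calls a `dispatch` helper whose body is not shown on this page. A plausible shape for it, assuming the first argument is a log description, the second is the workflow file, and the rest are passed through to `gh workflow run`, might look like the sketch below. The `DRY_RUN` handling and the log formats are assumptions.

```bash
# Hypothetical dispatch helper; the real one in watchdog-stuck-jobs.yml
# is not visible in this excerpt.
#   $1    = human-readable description for the log
#   $2    = workflow file to run
#   $3... = extra arguments passed through to `gh workflow run`
#           (e.g. -f key=value)
# DRY_RUN is assumed to be the string "true" or "false".
dispatch() {
  local desc="$1" workflow="$2"
  shift 2
  if [[ "${DRY_RUN:-false}" == "true" ]]; then
    # Log only; nothing is sent to GitHub in a dry run.
    echo "[dry-run] $desc -> $workflow $*"
    return 0
  fi
  echo "Dispatching: $desc"
  gh workflow run "$workflow" "$@"
}
```

Keeping the dry-run check inside the helper means every call site gets the same behavior without repeating the condition.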
    }

    ensure_label() {
      gh label create "$1" --color "$2" --description "$3" 2>/dev/null || true
- ensure_label() short-circuits when DRY_RUN=true so dry-runs no longer mutate the repository label namespace.
- generate:<lib> rescue path now uses the same watchdog:retried-<lib> marker as impl:<lib>:failed, so a stuck `generate:<lib>` cannot re-dispatch every 6h indefinitely. Already-retried cases escalate to ::warning:: for human attention, matching the documented behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
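The `ensure_label()` short-circuit described in this commit could look roughly like the following. Only the `gh label create` line comes from the reviewed excerpt; the guard and its log message are a sketch of the described behavior, not the committed code.

```bash
# Sketch of a dry-run-aware ensure_label.
#   $1 = label name, $2 = color, $3 = description
ensure_label() {
  if [[ "${DRY_RUN:-false}" == "true" ]]; then
    # Assumed log format; a dry run must not create labels.
    echo "[dry-run] would ensure label $1"
    return 0
  fi
  # `|| true` keeps the step from failing when the label already exists.
  gh label create "$1" --color "$2" --description "$3" 2>/dev/null || true
}
```

With the guard first, a manual run with `dry_run=true` leaves the repository's label list unchanged, which is the point of the fix.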
MarkusNeusinger added a commit that referenced this pull request on May 18, 2026
Version bump for the v2.4.0 release. Release notes will be attached to the tag once this lands.

## Highlights since v2.3.0

- **R / ggplot2 added as the 10th library** + multi-language pipeline (#6944, #6961, #7052). 30 ggplot2 implementations landed across foundational plot types.
- **In-app feedback widget** (#7143).
- **Stats page** with Plausible visitors chart + daily-impl timeline (#6608).
- **Language across the site**: `/plots?lang=` filtering, cross-language carousel, language in URLs and titles (#7141, #7142, #7144).
- **UI polish**: pseudo-function styling for 404 / footer / empty state / library card (#6436); mobile fixes for `/stats`, `/mcp`, breadcrumb + FAB (#6902, #7283).
- **Pipeline**: review-retry listener + stuck-jobs watchdog (#6084); daily-regen 2h → hourly (#6943).
- **Dependencies**: mypy 1.20→2.1, urllib3 2.6→2.7, authlib bump, react/mui/python-minor groups.
- ~1200 implementation regenerations across all 10 libraries.

No SemVer-breaking changes.

**Full Changelog:** v2.3.0...main

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two new safety nets so the impl pipeline doesn't leave PRs and issues silently stuck when a step crashes or times out.

- `impl-review-retry.yml`: listener that re-dispatches `impl-review.yml` exactly once when a PR is labeled `ai-review-failed`. Bounded by an `ai-review-rescued` marker so we never loop.
- `watchdog-stuck-jobs.yml`: cron every 6h (also manual `workflow_dispatch` with `stale_hours` and `dry_run` inputs). Catches three failure modes today's regular workflows miss:
  1. PRs with `ai-review-failed` (acts as listener safety net).
  2. PRs with `ai-attempt-N` + `quality:*` but no `ai-approved` / `ai-rejected` after `stale_hours`: repair handoff crashed (e.g. PR #6002, "feat(altair): implement histogram-2d", today: altair attempt-1 died, PR sat for 16 h with no further action).
  3. `spec-ready` issues with `generate:<lib>` or `impl:<lib>:failed` and no open PR for that (spec, lib) pair (e.g. #5237, "[shap-waterfall] SHAP Waterfall Plot for Feature Attribution", plotnine failed; #5240, "[heatmap-adjacency] Network Adjacency Matrix Heatmap", plotnine never generated).

Per-cause retries are bounded by marker labels (`ai-review-rescued`, `watchdog:repair-rescued-<N>`, `watchdog:retried-<lib>`); when a marker is already present, the watchdog emits a `::warning::` instead of dispatching, so a truly stuck case escalates to a human rather than looping.

Test plan

- Run `Watchdog: Stuck Jobs` manually with `dry_run=true` once merged → confirm it logs the three open hangers (PR #5997, "feat(highcharts): implement box-grouped"; PR #6002, "feat(altair): implement histogram-2d"; plus any leftover) without dispatching
- Re-run with `dry_run=false` if the dry-run looked sane
- `impl-review-retry.yml` listener fires the next time a PR is labeled `ai-review-failed` (will happen organically the next time review hits a transient blip)
- Marker labels (`ai-review-rescued`, `watchdog:repair-rescued-<N>`, `watchdog:retried-<lib>`) get auto-created on first use

🤖 Generated with Claude Code
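The `stale_hours` cutoff mentioned in the summary amounts to a simple age comparison. A minimal sketch, assuming timestamps are already available as Unix epoch seconds (the function name `is_stale` and the argument order are hypothetical, and how the watchdog actually obtains the timestamps is not shown on this page):

```bash
# Hypothetical staleness check: succeeds (exit 0) when the last update is
# at least stale_hours old. All three arguments are integers.
#   $1 = last-update time as Unix epoch seconds
#   $2 = current time as Unix epoch seconds
#   $3 = stale_hours threshold
is_stale() {
  local updated="$1" now="$2" stale_hours="$3"
  # Arithmetic evaluation returns exit status 0 when the comparison is true.
  (( now - updated >= stale_hours * 3600 ))
}
```

A caller could use it as `if is_stale "$updated_epoch" "$(date +%s)" "$STALE_HOURS"; then ...`, where `updated_epoch` and `STALE_HOURS` are again placeholder names. Passing the current time in as an argument keeps the check easy to test.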