feat(workflows): add review-retry listener and stuck-jobs watchdog #6084

Merged: MarkusNeusinger merged 2 commits into main from feat/watchdog-and-review-retry on May 8, 2026

@MarkusNeusinger
Owner

Summary

Two new safety nets so the impl pipeline doesn't leave PRs and issues silently stuck when a step crashes or times out.

  • impl-review-retry.yml — listener that re-dispatches impl-review.yml exactly once when a PR is labeled ai-review-failed. Bounded by an ai-review-rescued marker so we never loop.
  • watchdog-stuck-jobs.yml — cron every 6h (also manual workflow_dispatch with stale_hours and dry_run inputs). Catches three failure modes today's regular workflows miss:
    1. PRs with ai-review-failed (acts as listener safety net).
    2. PRs with ai-attempt-N + quality:* but no ai-approved/ai-rejected after stale_hours, meaning the repair handoff crashed (e.g. PR #6002, feat(altair): implement histogram-2d, today: altair attempt-1 died and the PR sat for 16 h with no further action).
    3. spec-ready issues with generate:<lib> or impl:<lib>:failed and no open PR for that (spec, lib) pair (e.g. #5237, [shap-waterfall] SHAP Waterfall Plot for Feature Attribution, where plotnine failed; #5240, [heatmap-adjacency] Network Adjacency Matrix Heatmap, where plotnine never generated).
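
For the two PR-side failure modes, the check reduces to a label scan plus an age threshold. The following is a minimal sketch of that idea, not the workflow's actual code: the function name `classify_pr`, the comma-separated label format, the output strings, and the way age is measured are all assumptions. Only the label names come from the description above.

```shell
# Hypothetical helper: classify one PR from its labels and its age in hours.
# Arguments: comma-separated labels, age in hours, stale threshold in hours.
classify_pr() {
  local labels="$1" age_hours="$2" stale_hours="$3"
  local has_attempt=0 has_quality=0 has_decision=0 label
  local IFS=','   # split the label list on commas inside this function only

  for label in $labels; do
    case "$label" in
      ai-review-failed) echo "review-failed"; return 0 ;;  # failure mode 1
      ai-attempt-*) has_attempt=1 ;;
      quality:*) has_quality=1 ;;
      ai-approved|ai-rejected) has_decision=1 ;;
    esac
  done

  # Failure mode 2: attempted and scored, but no decision label after the threshold.
  if [ "$has_attempt" -eq 1 ] && [ "$has_quality" -eq 1 ] \
     && [ "$has_decision" -eq 0 ] && [ "$age_hours" -ge "$stale_hours" ]; then
    echo "repair-stuck"
  else
    echo "ok"
  fi
}
```

For example, `classify_pr "ai-attempt-1,quality:85" 16 6` prints `repair-stuck` (the `quality:85` value is made up for illustration), while the same labels plus `ai-approved` print `ok`.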

Per-cause retries are bounded by marker labels (ai-review-rescued, watchdog:repair-rescued-<N>, watchdog:retried-<lib>); when a marker is already present, the watchdog emits a ::warning:: instead of dispatching, so a truly stuck case escalates to a human rather than looping.
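
The "retry once, then escalate" rule can be sketched as a pure decision function. This is an illustration under assumptions: `decide_retry` is a hypothetical name, labels are passed as a comma-separated string, and the function prints the action instead of calling `gh`, so it can be run outside a workflow. The `::warning::` prefix is the standard GitHub Actions workflow command for emitting a warning annotation.

```shell
# Hypothetical helper: decide whether a stuck item may be retried.
# Arguments: comma-separated labels, marker label, human-readable subject.
decide_retry() {
  local labels="$1" marker="$2" subject="$3"
  case ",$labels," in
    *",$marker,"*)
      # Marker already present: the single retry was spent, so escalate.
      echo "::warning::$subject already has $marker; manual attention needed"
      ;;
    *)
      # No marker yet: the caller would add the marker and dispatch once.
      echo "dispatch"
      ;;
  esac
}
```

Wrapping both sides in commas makes the match exact, so `watchdog:retried-altair` does not accidentally match an item that only carries `watchdog:retried-plotnine`.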

Test plan

  • CI green on this PR
  • Run Watchdog: Stuck Jobs manually with dry_run=true once merged → confirm it logs the three open stuck items (PR #5997, feat(highcharts): implement box-grouped; PR #6002, feat(altair): implement histogram-2d; plus any leftover) without dispatching
  • Run again with dry_run=false if the dry-run looked sane
  • Verify impl-review-retry.yml listener fires the next time a PR is labeled ai-review-failed (will happen organically the next time review hits a transient blip)
  • Confirm marker labels (ai-review-rescued, watchdog:repair-rescued-<N>, watchdog:retried-<lib>) get auto-created on first use
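
The manual runs in the test plan can also be triggered from the GitHub CLI rather than the Actions UI. As a hedged sketch: the helper below only builds and prints the `gh workflow run` command. The input names `dry_run` and `stale_hours` come from the description above; the function name is hypothetical, and `stale_hours` is passed only when given so the workflow's own default applies otherwise.

```shell
# Hypothetical helper: print the gh command for a manual watchdog run.
# Arguments: dry_run value (true/false), optional stale_hours value.
watchdog_run_cmd() {
  local dry_run="$1" stale_hours="${2:-}"
  local cmd="gh workflow run watchdog-stuck-jobs.yml -f dry_run=$dry_run"
  if [ -n "$stale_hours" ]; then
    cmd="$cmd -f stale_hours=$stale_hours"
  fi
  printf '%s\n' "$cmd"
}
```

`watchdog_run_cmd true` prints `gh workflow run watchdog-stuck-jobs.yml -f dry_run=true`, which can be reviewed and then run from a checkout of the repository.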

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 8, 2026 21:25
Two new safety nets close the gap where the impl pipeline silently leaves
PRs and issues in a stuck state when a step crashes or times out:

- impl-review-retry.yml: listens for the `ai-review-failed` label and
  re-dispatches impl-review.yml exactly once. Bounded by an
  `ai-review-rescued` marker label so we never loop.

- watchdog-stuck-jobs.yml: cron every 6h (plus manual dispatch with
  optional dry_run). Catches three failure modes the regular workflows
  miss:
    1. PRs with `ai-review-failed` (rescue-listener safety net)
    2. PRs with `ai-attempt-N` + `quality:*` and no decision label —
       repair handoff crashed (e.g. PR #6002 today, altair attempt-1
       died and the PR sat for 16h with no further action).
    3. spec-ready issues with `generate:<lib>` or `impl:<lib>:failed`
       and no open PR for that (spec, lib) pair (e.g. #5237 plotnine
       failed; #5240 plotnine never generated).

Per-cause retries are bounded by marker labels (`ai-review-rescued`,
`watchdog:repair-rescued-<N>`, `watchdog:retried-<lib>`) so a stuck
state escalates to a `::warning::` for human attention rather than
looping forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds two GitHub Actions “safety net” workflows to prevent the implementation pipeline from silently getting stuck by re-dispatching review/repair/generation in bounded, label-marked ways.

Changes:

  • Added a PR-label listener workflow that re-dispatches impl-review.yml once when ai-review-failed is applied.
  • Added a scheduled/manual watchdog workflow that scans open implementation PRs and spec-ready issues to detect stuck states and dispatch appropriate workflows, using marker labels for bounded retries.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/workflows/watchdog-stuck-jobs.yml | New scheduled/manual watchdog that scans for stuck PRs/issues and dispatches review/repair/generation, using marker labels to bound retries. |
| .github/workflows/impl-review-retry.yml | New pull_request:labeled listener that re-dispatches impl-review.yml once for ai-review-failed, guarded by an ai-review-rescued marker. |

Comment on lines +180 to +184
if [[ "$open_pr" == "0" ]]; then
  dispatch "Issue #$num: generate (stuck pending) $lib" \
    bulk-generate.yml \
    -f specification_id="$spec_id" \
    -f library="$lib"
}

ensure_label() {
  gh label create "$1" --color "$2" --description "$3" 2>/dev/null || true
- ensure_label() short-circuits when DRY_RUN=true so dry-runs no longer
  mutate the repository label namespace.
- generate:<lib> rescue path now uses the same watchdog:retried-<lib>
  marker as impl:<lib>:failed, so a stuck `generate:<lib>` cannot
  re-dispatch every 6h indefinitely. Already-retried cases escalate to
  ::warning:: for human attention, matching the documented behavior.
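
A dry-run guard of the kind described in this commit might look like the sketch below. The `gh label create` call is taken from the excerpt quoted in the review, and the `dispatch` argument order (description, workflow file, then `-f` fields) follows the call shown there. The dry-run log format and the body of `dispatch` are assumptions, not the merged code.

```shell
# Sketch only: both side-effecting helpers return early when DRY_RUN=true.
DRY_RUN="${DRY_RUN:-false}"

ensure_label() {
  if [ "$DRY_RUN" = "true" ]; then
    # Dry-run: report the label but leave the repository's labels untouched.
    echo "[dry-run] would ensure label $1"
    return 0
  fi
  gh label create "$1" --color "$2" --description "$3" 2>/dev/null || true
}

dispatch() {
  local description="$1" workflow="$2"
  shift 2   # remaining arguments are passed through to gh (e.g. -f key=value)
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] would dispatch $workflow: $description"
    return 0
  fi
  echo "dispatching $workflow: $description"
  gh workflow run "$workflow" "$@"
}
```

Checking `DRY_RUN` inside each helper, rather than at every call site, keeps the scan logic identical between dry and live runs.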

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MarkusNeusinger MarkusNeusinger merged commit 8a1b32d into main May 8, 2026
7 checks passed
@MarkusNeusinger MarkusNeusinger deleted the feat/watchdog-and-review-retry branch May 8, 2026 21:39
MarkusNeusinger added a commit that referenced this pull request May 18, 2026
Version bump for the v2.4.0 release. Release notes will be attached to
the tag once this lands.

## Highlights since v2.3.0

- **R / ggplot2 added as the 10th library** + multi-language pipeline
(#6944, #6961, #7052). 30 ggplot2 implementations landed across
foundational plot types.
- **In-app feedback widget** (#7143).
- **Stats page** with Plausible visitors chart + daily-impl timeline
(#6608).
- **Language across the site**: `/plots?lang=` filtering, cross-language
carousel, language in URLs and titles (#7141, #7142, #7144).
- **UI polish**: pseudo-function styling for 404 / footer / empty state
/ library card (#6436); mobile fixes for `/stats`, `/mcp`, breadcrumb +
FAB (#6902, #7283).
- **Pipeline**: review-retry listener + stuck-jobs watchdog (#6084);
daily-regen 2h → hourly (#6943).
- **Dependencies**: mypy 1.20→2.1, urllib3 2.6→2.7, authlib bump,
react/mui/python-minor groups.
- ~1200 implementation regenerations across all 10 libraries.

No SemVer-breaking changes.

**Full Changelog:** v2.3.0...main

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>