feat(ci): add _ci-gate.yml shared workflow with queue watchdog #247

Merged
JacobPEvans merged 2 commits into main from feat/ci-gate-watchdog
Apr 30, 2026

Conversation

@JacobPEvans
Owner

Summary

  • New _ci-gate.yml reusable workflow centralizing the conditional-required-check / Merge Gatekeeper pattern that currently lives as ~110 lines of duplicated boilerplate in each of 8 consumer repos.
  • Built-in Queue Watchdog job cancels sibling jobs still in queued state after a configurable timeout, so the required Merge Gate status can never be stuck-pending indefinitely.
  • New scripts/ci-gate-watchdog.sh carries the watchdog logic (extracted from inline YAML so it's shellcheck-testable and stays under the inline-script-guard threshold).
  • Add timeout-minutes: 30 to _nix-validate.yml's validate job — defense-in-depth against the running-job variant of the same symptom.

Why

nix-home#205 exposed the exact failure mode the watchdog now prevents: when a self-hosted runner failed to claim the Nix Validate / Validate job, that job sat in queued for the entire PR lifetime. Because GitHub Actions only schedules a `needs:` dependent once every upstream job reaches a terminal state (success / failure / cancelled / skipped) — and `queued` is not terminal — the dependent `Merge Gate` job never scheduled, even with `if: ${{ !cancelled() }}`. The required check stayed permanently absent and the PR was unmergeable.
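
The scheduling rule described above can be modeled in a few lines of shell. This is only a toy model of the behavior the PR describes, not GitHub's actual scheduler; the function names `is_terminal` and `dependent_can_schedule` are made up for illustration.

```shell
# Toy model: a job state is terminal once it can no longer change.
is_terminal() {
  case "$1" in
    success|failure|cancelled|skipped) return 0 ;;
    *) return 1 ;;   # queued, in_progress, and anything else
  esac
}

# A `needs:` dependent is schedulable only when every upstream state is terminal.
dependent_can_schedule() {
  for state in "$@"; do
    is_terminal "$state" || return 1
  done
  return 0
}

# One upstream job stuck in `queued` keeps the dependent waiting.
if dependent_can_schedule success queued; then
  echo "gate schedules"
else
  echo "gate waits"
fi
```

In this model a failed or cancelled upstream job still lets the dependent schedule; only a non-terminal state such as `queued` blocks it.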

The watchdog closes the loop by waiting `queue_timeout_minutes` (default 10) and then cancelling any sibling job still in `queued`. Cancellation is terminal, so the gate always evaluates within a bounded time and `Merge Gate` always reports — pass, fail, or "stuck job cancelled → fail" (the right semantic; investigate the runner instead of letting the PR rot).
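
A minimal sketch of the wait-then-cancel flow, assuming the `gh` CLI is available and authenticated on the runner. The contents of scripts/ci-gate-watchdog.sh are not shown in this conversation, so the function names here are illustrative, and `cancel_job` is a hypothetical placeholder because the PR does not say which API call performs the cancellation.

```shell
# List IDs of jobs in a workflow run whose status is still "queued".
# Arguments: owner/repo, run ID.
list_queued_jobs() {
  gh api --paginate \
    "repos/$1/actions/runs/$2/jobs?per_page=100" \
    --jq '.jobs[] | select(.status == "queued") | .id'
}

# Hypothetical placeholder: the real cancellation call is not shown in the PR,
# so this sketch only reports which job it would cancel.
cancel_job() {
  echo "would cancel job $1"
}

# Wait out the timeout, then cancel whatever is still queued.
# Arguments: owner/repo, run ID, timeout in whole seconds.
watchdog() {
  sleep "$3"
  list_queued_jobs "$1" "$2" | while read -r job_id; do
    if [ -n "$job_id" ]; then
      cancel_job "$job_id"
    fi
  done
}
```

Inside a workflow step this could be called as `watchdog "$GITHUB_REPOSITORY" "$GITHUB_RUN_ID" 600`, with a token exposed to `gh` through `GH_TOKEN`.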

A follow-up DRY win: each consumer repo's `ci-gate.yml` will collapse to ~25 lines of thin caller.

Filter convention

Callers define filters under conventional names; the shared workflow ignores anything outside this set:

| Filter | Gated check |
| --- | --- |
| `nix` | `nix_validate`, also `file_size` |
| `markdown` | `markdown_lint`, also `file_size` |
| `python` | `python_security` |

Each conditional check is also gated by an opt-in input boolean (`nix_validate: true`, etc.). All default to `false` so a caller only pays for what it wants.
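
The two-part condition (opt-in input and matching filter) can be sketched in shell as follows. The helper names `checks_for_filter` and `should_run` are illustrative and do not come from the workflow; the mapping itself is copied from the filter table above.

```shell
# Print the checks gated by a given filter name.
checks_for_filter() {
  case "$1" in
    nix)      echo "nix_validate file_size" ;;
    markdown) echo "markdown_lint file_size" ;;
    python)   echo "python_security" ;;
    *)        : ;;   # filters outside the conventional set are ignored
  esac
}

# A check runs only if its opt-in input is "true" AND its filter output is "true".
# Arguments: input toggle, filter output.
should_run() {
  [ "$1" = "true" ] && [ "$2" = "true" ]
}

# Example: nix files changed, but the caller left nix_validate at its default.
if should_run false true; then
  echo "nix_validate runs"
else
  echo "nix_validate skipped"
fi
```

Because every toggle defaults to `false`, a matching filter alone never triggers a check.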

Branch protection

No rule changes needed. The gate job keeps `name: Merge Gate`, matching existing required_status_checks rulesets across all consumer repos.

Test plan

  • CI on this PR runs cleanly.
  • After merge, migrate `ai-assistant-instructions` as canary. Confirm gate runs and reports SUCCESS on a docs-only PR.
  • Stuck-runner test: open a draft PR in `nix-home` re-adding the dynamic RunsOn label to `nix-validate`. Confirm watchdog cancels at timeout, gate evaluates, `Merge Gate` reports FAILURE (correct).
  • After 48h canary bake, roll out to remaining 6 standard repos. Defer `terraform-runs-on` and `ansible-splunk` (bespoke gates).

Rollback

Callers reference `@main`, so reverting this PR on `main` rolls every consumer back automatically.

🤖 Generated with Claude Code

Centralizes the conditional-required-check / Merge Gatekeeper pattern
that previously lived as a duplicated ~110-line `ci-gate.yml` in every
consumer repo. New reusable workflow with four groups of jobs:

- changes: dorny/paths-filter on caller-supplied filter YAML
- conditional checks (nix_validate, markdown_lint, file_size,
  python_security): each gated on (input toggle && filter output)
- watchdog: sleeps queue_timeout_minutes (default 10) then cancels any
  sibling job still in `queued` state via the GitHub API
- gate (name "Merge Gate"): re-actors/alls-green aggregator
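
How the aggregator turns upstream results into one verdict can be sketched as a toy model. This is not the code of re-actors/alls-green; it assumes skipped conditional checks are tolerated, which the workflow's actual configuration may or may not do, and `gate_verdict` is an illustrative name.

```shell
# Toy model of an all-green aggregator.
# Each argument is one upstream job result. Skipped results are tolerated
# (an assumption); anything other than success or skipped fails the gate.
gate_verdict() {
  for result in "$@"; do
    case "$result" in
      success|skipped) ;;
      *) echo "failure"; return 0 ;;
    esac
  done
  echo "success"
}

gate_verdict success skipped skipped   # docs-only PR: unneeded checks skipped
gate_verdict success cancelled         # watchdog cancelled a stuck job
```

The first call prints `success` and the second prints `failure`, matching the "stuck job cancelled, gate fails" semantics described in the PR summary.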

The watchdog closes a class of bugs first observed on nix-home#205:
when a `needs:` dependency is stuck queued (e.g. a self-hosted runner
never claims it), the dependent gate job — even with `if: !cancelled()`
— never schedules, leaving the required `Merge Gate` status absent
forever and the PR unmergeable. By forcing a terminal state, the
watchdog guarantees the gate always evaluates within
queue_timeout_minutes + watchdog runtime.

Also adds `timeout-minutes: 30` to _nix-validate.yml's validate job so
a wedged running job (different failure mode, same symptom) also
terminates cleanly.

Watchdog logic lives in scripts/ci-gate-watchdog.sh, sparse-checked-out
at runtime so the YAML stays under the inline-script-guard threshold
and the logic is shellcheck-testable in isolation.

Branch protection unchanged: the gate job's `name: Merge Gate` matches
existing required_status_checks rulesets across all consumer repos.

Follow-up PRs will migrate consumer repos to thin callers of this
workflow, starting with ai-assistant-instructions as the canary.

(claude)
Copilot AI review requested due to automatic review settings April 29, 2026 07:12
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a centralized CI gatekeeper mechanism to standardize and improve the reliability of required status checks across multiple repositories. By implementing a watchdog job that monitors and cancels stalled queued jobs, it prevents PRs from becoming unmergeable due to stuck workflow dependencies, ensuring that the 'Merge Gate' status consistently reports a terminal result.

Highlights

  • Shared CI Gate Workflow: Introduced a reusable _ci-gate.yml workflow to centralize the Merge Gatekeeper pattern, reducing boilerplate across multiple repositories.
  • Queue Watchdog Implementation: Added a watchdog mechanism that cancels sibling jobs stuck in a 'queued' state after a configurable timeout, ensuring the Merge Gate status always reaches a terminal state.
  • Script Extraction: Created scripts/ci-gate-watchdog.sh to handle the watchdog logic, allowing for better testability and cleaner YAML configuration.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/_ci-gate.yml
    • .github/workflows/_nix-validate.yml

The queue was stuck in endless wait,
No runner found to seal its fate.
But now the watchdog guards the gate,
To clear the path and keep it straight.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

Copilot AI left a comment

Pull request overview

Adds a reusable CI “Merge Gatekeeper” workflow to centralize conditional required checks across consumer repos, including a watchdog to prevent required status checks from getting stuck when self-hosted runner jobs remain queued.

Changes:

  • Introduce reusable workflow .github/workflows/_ci-gate.yml with change detection, conditional check jobs, a queue watchdog, and an aggregate “Merge Gate”.
  • Add scripts/ci-gate-watchdog.sh to cancel sibling jobs stuck in queued after a configurable timeout.
  • Add timeout-minutes: 30 to _nix-validate.yml’s validate job as defense-in-depth for wedged running jobs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/ci-gate-watchdog.sh | Implements the queue watchdog logic used by the new CI gate workflow. |
| .github/workflows/_nix-validate.yml | Adds a hard timeout to avoid indefinitely running Nix validation blocking the gate. |
| .github/workflows/_ci-gate.yml | New reusable workflow coordinating conditional checks + watchdog + required “Merge Gate” aggregator. |

Comment thread .github/workflows/_ci-gate.yml Outdated
Comment thread .github/workflows/_ci-gate.yml Outdated
Comment thread scripts/ci-gate-watchdog.sh Outdated
Comment thread scripts/ci-gate-watchdog.sh Outdated
- gate: `always() && !cancelled()` ensures the gate runs even when a
  dependency fails (not just when not cancelled)
- watchdog: add `contents: read` permission so sparse-checkout succeeds
  when job-level perms override workflow-level perms
- watchdog script: use awk for float-safe minute→second conversion
- watchdog script: `gh api --paginate` to catch queued jobs beyond
  the first 100 in large runs
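
The float-safe conversion mentioned above can be sketched like this. Shell arithmetic with `$(( ))` only handles integers, so a fractional `queue_timeout_minutes` needs a tool such as awk; the function name `minutes_to_seconds` is illustrative and the real script may differ.

```shell
# Convert a possibly fractional number of minutes to whole seconds.
# awk does the multiplication; adding 0.5 before the integer conversion
# rounds to the nearest second instead of truncating.
minutes_to_seconds() {
  awk -v m="$1" 'BEGIN { printf "%d\n", m * 60 + 0.5 }'
}

minutes_to_seconds 10    # prints 600
minutes_to_seconds 0.5   # prints 30
```

The result can be passed straight to `sleep`, which keeps the timeout input free to accept values like `0.5` during testing.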

(claude)
JacobPEvans merged commit 4742f1c into main on Apr 30, 2026
2 checks passed
2 participants