
fix(ci): make Sync-from-Docs auto-merge actually work #143

Merged
acedatacloud-dev merged 1 commit into main from fix/sync-from-docs-reliable-merge
May 5, 2026


@acedatacloud-dev
Member

Why

The Sync from Docs workflow has been silently broken since it was created — every docs-updated dispatch leaks one or more PRs (stale drafts or 0-commit zombies), causing the open-PR backlog you can see in the queue right now (APIs ~40, Clis ~30, SDK ~7).

The Skills repo had the exact same workflow with the exact same bugs and was fixed in AceDataCloud/Skills#181. This PR ports that fix verbatim. The fix has been running cleanly in Skills for several days.

Root cause — five compounding bugs

| # | Bug | Effect |
|---|-----|--------|
| 1 | Agent-done detection used gh api .../actions/runs?event=dynamic filtered by workflow-run name. That query regularly returns 0 hits or a null conclusion, so the polling default "pending" sticks for the entire window. | 30-min timeout reached without ever attempting a merge → PR left in DRAFT. |
| 2 | PR detector grabbed the first copilot/* branch it saw, not the one bound to the current issue. | Stale draft from a prior run was tracked instead of the new one; new agent's PR was ignored. |
| 3 | create-task closed prior issues but not their orphaned PRs (including 0-commit zombies from agent boot failures). | Drafts and zombies pile up indefinitely. |
| 4 | timeout-minutes: 35 plus 90×20 s polling = 30-min budget, shorter than typical Copilot sync runs. | Job exited before agent finished. |
| 5 | gh pr merge had no --delete-branch, no retry, and shared the polling job. | Even when the rare success path triggered, branches stayed; one API flake killed the whole job. |
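
Bugs 1 and 4 can be illustrated with a small Python sketch. This is a hypothetical reconstruction for illustration only, not the workflow's actual shell code: `poll_status` and `polling_budget_minutes` are made-up names, and the run dictionaries are an assumed shape.

```python
# Hypothetical reconstruction of the old polling logic; not the workflow's actual code.
def poll_status(runs):
    """Return the first non-null conclusion, defaulting to 'pending'."""
    status = "pending"
    for run in runs:
        conclusion = run.get("conclusion")
        if conclusion:  # a null conclusion is skipped, so the default survives
            status = conclusion
            break
    return status


def polling_budget_minutes(attempts, interval_seconds):
    """Total time spent waiting between polls, in minutes."""
    return attempts * interval_seconds / 60


# Bug 1: zero hits and a null conclusion both leave the status at "pending".
print(poll_status([]))                      # pending
print(poll_status([{"conclusion": None}]))  # pending

# Bug 4: the old polling budget, and the new one introduced by this PR.
print(polling_budget_minutes(90, 20))       # 30.0
print(polling_budget_minutes(110, 30))      # 55.0
```

With the old numbers the loop gives up after 30 minutes regardless of the 35-minute job timeout; the new numbers leave the same 5 minutes of headroom under a 60-minute timeout.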

What this PR does

  1. Pre-cleanup: create-task now closes every leftover open copilot/* PR at the start of each dispatch (with branch deletion), not just the prior auto-sync issues. A new Docs commit makes prior attempts stale by definition.
  2. PR-to-issue binding: the wait loop locates the PR via #<issue> in the PR body (Copilot's standard linking format), with a single-PR fallback.
  3. Reliable completion signal: replaced the workflow_runs poll with the observable signal isDraft == false. Copilot itself marks PRs ready-for-review when its agent finishes successfully.
  4. Zombie guard: any copilot/* PR open for 30 min with 0 commits is auto-closed (with branch deleted).
  5. Bigger budget: timeout-minutes 35 → 60, polling 90×20 s → 110×30 s = 55 min.
  6. Hardened merge step: a new isolated step runs gh pr ready (idempotent) → settle checks → gh pr merge --squash --admin --delete-branch with up to 3 retries. On final failure, it comments on the PR.
  7. Wider permissions: contents: write and pull-requests: write, so the job can actually merge and delete branches under GITHUB_TOKEN (also helpful as fallback if a secret rotates).
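
The binding, completion-signal, and zombie-guard rules from the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the workflow's real implementation: the function names, the `commit_count` and `created_at` keys, and the PR and issue numbers are all hypothetical; only `isDraft`, the `#<issue>` body reference, and the 30-minute / 0-commit thresholds come from the description.

```python
import re
from datetime import datetime, timedelta, timezone


def find_pr_for_issue(open_prs, issue_number):
    """Prefer the PR whose body references #<issue>; fall back to a lone open PR."""
    pattern = re.compile(rf"#{issue_number}\b")  # \b keeps #12 from matching #120
    for pr in open_prs:
        if pattern.search(pr.get("body") or ""):
            return pr
    return open_prs[0] if len(open_prs) == 1 else None


def agent_finished(pr):
    """Completion signal: the PR is no longer a draft."""
    return pr["isDraft"] is False


def is_zombie(pr, now, max_age=timedelta(minutes=30)):
    """Zombie guard: open for at least max_age with zero commits."""
    return pr["commit_count"] == 0 and now - pr["created_at"] >= max_age


# Made-up data: one stale 0-commit draft and one PR bound to issue 12.
now = datetime(2026, 5, 5, 7, 0, tzinfo=timezone.utc)
open_prs = [
    {"number": 11, "body": "Fixes #120", "isDraft": True,
     "commit_count": 0, "created_at": now - timedelta(minutes=45)},
    {"number": 13, "body": "Fixes #12", "isDraft": False,
     "commit_count": 3, "created_at": now - timedelta(minutes=10)},
]

current = find_pr_for_issue(open_prs, 12)
print(current["number"], agent_finished(current))               # 13 True
print([pr["number"] for pr in open_prs if is_zombie(pr, now)])  # [11]
```

Matching on the issue reference rather than taking the first copilot/* branch is what keeps a stale draft from a prior run from being tracked in place of the new PR.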

Outcome contract

After this lands, every docs-updated dispatch resolves to one of three outcomes:

  • Merged to main — happy path
  • Issue closed by Copilot — no changes needed
  • Surfaced failure — PR commented explaining what broke (zombie / merge-failed / checks-failed)

No more silent draft accumulation.
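
How the merge step can resolve to either "merged" or a surfaced failure is sketched below in Python. It is an assumption-laden illustration rather than the workflow's code: `merge_with_retries`, the `merge` and `comment` callables, and the returned outcome strings are hypothetical stand-ins for the real gh pr merge call and PR comment.

```python
import time


def merge_with_retries(merge, comment, attempts=3, delay_seconds=0.0):
    """Call merge() up to `attempts` times; on final failure, comment once."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            merge()
            return "merged"
        except RuntimeError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)  # back off before the next try
    comment(f"merge-failed after {attempts} attempts: {last_error}")
    return "surfaced-failure"


# A merge that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_merge():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient API error")


comments = []
print(merge_with_retries(flaky_merge, comments.append))  # merged
print(calls["count"], comments)                          # 3 []


# A merge that never succeeds ends with exactly one explanatory comment.
def broken_merge():
    raise RuntimeError("checks failed")


print(merge_with_retries(broken_merge, comments.append))  # surfaced-failure
print(comments)  # ['merge-failed after 3 attempts: checks failed']
```

Either branch returns a definite outcome, so a single API flake no longer ends the run without a trace.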

Backlog cleanup

The existing open copilot/* PRs in this repo will be closed in a follow-up bulk-cleanup pass after this PR lands (the new create-task step would close them on the next dispatch anyway, but doing it explicitly produces a cleaner audit trail).

Test plan

  • YAML validated locally (yaml.safe_load).
  • Same logic is already live in Skills repo with successful runs since 2026-05-04.
  • Will be exercised by the next docs-updated dispatch from PlatformBackend.

Port the Skills-repo fix (AceDataCloud/Skills#181) to APIs. The current
sync-from-docs.yml has 5 compounding bugs that cause every dispatch to leak
one or more PRs:

  1. Agent-done detection used 'gh api .../actions/runs?event=dynamic' filtered
     by run name. Returns 0 hits / null conclusion most of the time, so the
     polling default 'pending' sticks for the full 30-min window.
  2. PR detector grabbed the first copilot/* branch it saw, not the one bound
     to the current issue. Stale PRs from prior runs get tracked instead of
     the new one.
  3. create-task closed prior issues but not their orphaned PRs (including
     0-commit zombies from agent boot failures).
  4. timeout-minutes 35 + 90x20s polling = 30-min budget, shorter than typical
     Copilot sync runs.
  5. gh pr merge had no --delete-branch, no retry, and shared the polling job.

Result: 40+ open copilot/* PRs (mix of stale syncs and 0-commit zombies).

This commit:
  - Pre-cleanup: closes every leftover open copilot/* PR at start of each
    dispatch (with branch deletion).
  - Reliable completion signal: poll isDraft == false (Copilot marks ready
    when its agent finishes).
  - PR-to-issue binding: locate PR via #<issue> in body, single-PR fallback.
  - Zombie guard: auto-close 0-commit PRs after 30 min.
  - Bigger budget: timeout 35->60 min, polling 90x20s -> 110x30s = 55 min.
  - Hardened merge: pr ready + check settle + 3-retry merge --admin --delete-branch.
  - Wider permissions: contents/pull-requests write so GITHUB_TOKEN fallback works.

acedatacloud-dev merged commit 2dd73d0 into main on May 5, 2026
acedatacloud-dev deleted the fix/sync-from-docs-reliable-merge branch on May 5, 2026 at 07:21