Skip to content

Bound WordPress diagnostic setup/collect in capture-html (fix 25-min hang under contention)#1650

Merged
chubes4 merged 1 commit into
mainfrom
cook/capture-html-nav-reliability
Jun 29, 2026
Merged

Bound WordPress diagnostic setup/collect in capture-html (fix 25-min hang under contention)#1650
chubes4 merged 1 commit into
mainfrom
cook/capture-html-nav-reliability

Conversation

@chubes4

@chubes4 chubes4 commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

What

Fixes wordpress.capture-html hanging the full 25-min recipe timeout under contention — the blocker behind the live-WP parity cross-check (all 5 fixtures of an offloaded matrix run hung at steps[3]: capture-html).

Root cause (verified — not the obvious one)

Navigation is NOT the problem: capture-html and browser-probe both route through runSingleBrowserProbeCommand, and page.goto is bounded identically (timeout: wallTimeoutMs + withBrowserProbeLiveness(..., "navigation"), 120s). The real issue is the WordPress diagnostic provider that only capture-html-style calls attach:

  • browser-probe-runner.ts await provider.setup(...) runs BEFORE navigation, unbounded. setup = installBrowserWordPressDiagnostics, which runs the external wordpress.browser-diagnostics-setup PHP command. Under contention (~21 children / ~10GB RSS) that wedges → unbounded await → the whole command hangs and rides the 25-min recipe timeout. (browser-probe failed fast at 120s because navigation timed out; capture-html never reached navigation — it hung in setup.)
  • await provider.collect(...) in the finally block was likewise unbounded.
  • Compounding: runHtmlCapture didn't pass abortSignal (browser-probe did), so the orchestrator couldn't interrupt capture-html.
  • (Blocked external requests are already route.abort("blockedbyclient")'d — they don't leave the page pending; no change needed there.)

Fix

  • runBoundedBrowserDiagnostic: wraps each provider setup/collect in withBrowserCommandLiveness with diagnosticsTimeoutMs (= the 120s nav wall). A stuck/failing provider now fails fast with a clear, non-empty liveness error surfaced as a non-fatal probe error; diagnostics degrade to best-effort instead of hanging. A discriminated result preserves legitimately-resolved falsy values (setup can return false).
  • Threaded abortSignal through runHtmlCaptureCommand + passed this.activeExecutionSignal from playground-runtime.runHtmlCapture, so capture-html is interruptible like browser-probe.
  • Left the nav wait-state as-is (already bounded by goto timeout + wall, so even networkidle against WP heartbeat fails fast at the wall rather than hanging — changing the default would alter the contract without fixing the actual hang).

Verification

  • npm run build (tsc -b) clean.
  • test:browser-diagnostic-providers, test:browser-runner-template, browser-probe-contract-smoke green. New tests/browser-capture-html-diagnostics-reliability.test.ts: hung op fails fast (<5s, names the phase), resolved-falsy preserved as ok, rejection surfaces the real message.

Honest limit

Single-fixture live capture-html confirmation is PENDING — it boots a full Playground+Chromium sandbox (RAM-heavy; site rules require offloaded homeboy-lab, not set up in this scoped pass). Why it resolves deterministically: the only capture-html-specific unbounded paths (provider setup pre-nav, collect in finally) are now bounded by the same 120s wall as navigation → worst case ~setup+nav+collect (a few min), well under the 1,500,000ms recipe timeout; otherwise DOM returns via page.content() normally with diagnostics degraded to best-effort. Unit test proves the bounding + real-error + value-preservation without a sandbox.

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Claude Opus 4.8, 1M context)
  • Used for: Root-cause investigation, fix, and tests under human review.

wordpress.capture-html attaches the WordPress browser diagnostic provider,
whose setup runs an external playground command (browser-diagnostics-setup)
before navigation and whose collect runs in the finally block. Both were
awaited with no liveness bound, so a provider wedged under runtime contention
rode the recipe-level timeout — observed as capture-html hanging for the full
25-minute recipe budget while the sibling browser-probe navigation path failed
fast at its 120s wall bound (navigation itself is already bounded identically
for both commands).

Wrap the provider setup and collect calls in a bounded liveness helper
(runBoundedBrowserDiagnostic) using the same wall budget as navigation, so a
stuck or failing provider fails fast with a clear, non-empty error surfaced as
a non-fatal probe error instead of blocking the capture. Diagnostics remain
best-effort: a timeout or rejection degrades to "diagnostics unavailable"
rather than aborting the run.

Also thread the runtime abort signal into runHtmlCaptureCommand so the
orchestrator can interrupt capture-html the same way it can interrupt
browser-probe, which previously received no abort signal.

Blocked external requests under network-policy=block are already completed via
route.abort("blockedbyclient"), so they do not leave the page pending — no
change needed there.

Add deterministic coverage proving the bounded helper fails fast with a real,
non-empty error, preserves legitimately resolved (including falsy) values, and
surfaces underlying rejection reasons.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chubes4 chubes4 merged commit 46281d6 into main Jun 29, 2026
@chubes4 chubes4 deleted the cook/capture-html-nav-reliability branch June 29, 2026 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant