ci(e2e): fail loud when wait-on times out, capture serve diagnostics by TortoiseWolfe · Pull Request #58 · TortoiseWolfe/ScriptHammer

TortoiseWolfe · 2026-04-27T10:46:45Z

Summary

CI run 24970006226 on commit `00edd23` had 14/26 jobs fail. All 7 firefox shards and all 6 webkit-gen shards failed with `NS_ERROR_CONNECTION_REFUSED` to `http://localhost:3000/account` in their first `beforeEach`. The static server (`npx serve out -l 3000`) wasn't responding when Playwright tried to connect, but the workflow ran the tests anyway, producing 50 minutes of cascading ECONNREFUSED failures per shard with no actionable signal.

The bug is in the workflow: `Start server` ran `npx serve ... &; sleep 5; npx wait-on ... --timeout 60000`, and wait-on's failure was not reliably propagating as a step error (likely `-e` interacting badly with the backgrounded job).
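
One well-known piece of the `-e` and background-job interaction can be reproduced in isolation. The sketch below is not the workflow's code and does not prove the hypothesis about wait-on; it only shows that `set -e` never reacts to a failing background job, and that a bare `wait` hides the status while `wait <pid>` surfaces it. The function names are hypothetical.

```shell
# Case 1: a background job fails under 'set -e'. The shell keeps going,
# because errexit only applies to foreground commands, and a bare 'wait'
# returns 0 no matter how the jobs exited.
demo_background_failure() {
  sh -c '
    set -e
    false &
    wait
    echo "still running after background failure"
  '
}

# Case 2: waiting on the specific PID returns that job's exit status,
# so 'set -e' aborts the script before the echo.
demo_wait_pid() {
  sh -c '
    set -e
    false &
    wait "$!"
    echo "not reached"
  '
}
```

Calling `demo_background_failure` prints the message and returns 0; `demo_wait_pid` prints nothing and returns 1. This is why the fix checks exit codes explicitly instead of relying on the shell's default flags.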

Fix

Replace all 6 identical `Start server` blocks with a defensive version that:

Captures serve's PID and tees output to `serve.log`
Verifies serve didn't exit immediately (port in use, missing `out/`, etc.)
Explicitly checks wait-on's exit code via `if ! ... ; then ...`
On failure: dumps `serve.log`, listening sockets (`ss`/`netstat`), serve process state (`ps`), and a direct `curl` probe — then `exit 1`
On success: prints `serve is responding on http://localhost:3000` for grep-friendly logs
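
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the block that was merged: the function and variable names (`start_server`, `dump_diagnostics`, `SERVE_CMD`, `WAIT_CMD`, `STARTUP_GRACE`) and the "serve exited immediately" message are hypothetical, the URL passed to wait-on is assumed to be `http://localhost:3000`, and output is redirected to `serve.log` rather than teed so that `$!` is serve's own PID in plain `sh`.

```shell
# Assumed defaults; each can be overridden for a local trial.
URL="${URL:-http://localhost:3000}"
SERVE_CMD="${SERVE_CMD:-npx serve out -l 3000}"
WAIT_CMD="${WAIT_CMD:-npx wait-on $URL --timeout 60000}"

# Print everything useful for a post-mortem. Every probe is best-effort.
dump_diagnostics() {
  echo "--- serve.log ---"
  cat serve.log 2>/dev/null || true
  echo "--- listening sockets ---"
  ss -ltn 2>/dev/null || netstat -ltn 2>/dev/null || true
  echo "--- serve process ---"
  ps -p "$SERVE_PID" -o pid,stat,etime,args 2>/dev/null \
    || echo "serve (PID=$SERVE_PID) is not running"
  echo "--- curl probe ---"
  curl -sv --max-time 5 "$URL" 2>&1 || true
}

start_server() {
  # Unquoted on purpose: the command string is split into words.
  $SERVE_CMD > serve.log 2>&1 &
  SERVE_PID=$!
  echo "serve started, PID=$SERVE_PID"

  sleep "${STARTUP_GRACE:-5}"
  # Catch an early exit (port in use, missing out/, etc.).
  if ! kill -0 "$SERVE_PID" 2>/dev/null; then
    echo "serve exited immediately"
    dump_diagnostics
    return 1
  fi

  # Check the readiness probe's exit code explicitly.
  if ! $WAIT_CMD; then
    echo "wait-on timed out"
    dump_diagnostics
    return 1
  fi
  echo "serve is responding on $URL"
}

# In a workflow step this would end with:  start_server || exit 1
```

Keeping the diagnostics in one function means both failure paths produce the same report, so a log reader always finds the same four sections in the same order.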

Affected jobs (one block each): `smoke`, `rate-limiting`, `auth-setup`, `e2e`, `e2e-firefox`, `e2e-webkit`. All 6 jobs preserve their existing `env: CI: true` blocks (where present on `e2e`, `e2e-firefox`, `e2e-webkit`).

What this gives us

After merge, the next failing run produces clear diagnostics at the moment of failure, instead of cascading into hundreds of test failures downstream. Total wall time on a serve-bind failure goes from ~50 minutes per shard to ~90 seconds. That signal will tell us why serve isn't binding (timeout too short? serve crash? port collision?). Currently the cascade noise drowns it out.

What this doesn't do

No timeout bump. We have no evidence yet that 60s is too short; bump it only after diagnostics confirm that, not speculatively. That is a follow-up, not this PR.
No source-code changes. Workflow only.
Doesn't address #50/#57 (cross-shard messaging test-user collision) — different root cause, different fix.

Test plan

YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(open('.github/workflows/e2e.yml'))"`); 8 jobs intact.
All 6 `Start server` blocks contain `serve started, PID=` and `wait-on timed out` strings (`grep -c` returns 6).
The 3 `env: CI: true` blocks (e2e/firefox/webkit jobs) are still attached.
Real CI verification on this PR's run: the smoke, rate-limiting, auth-setup, and chromium-gen jobs should produce the `serve is responding on http://localhost:3000` log line. Firefox/webkit jobs may either pass or fail loudly (≤90s) with diagnostics; either is the signal we want.
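
The marker check from the test plan could be scripted along these lines. This is a minimal sketch: the function name `check_start_server_blocks` is hypothetical, and it assumes each marker string appears on exactly one line per `Start server` block, since `grep -c` counts matching lines rather than occurrences.

```shell
# check_start_server_blocks FILE [EXPECTED]
# Verifies that both marker strings appear on EXPECTED lines (default 6).
check_start_server_blocks() {
  file="$1"
  expected="${2:-6}"
  if [ ! -r "$file" ]; then
    echo "FAIL: cannot read $file"
    return 1
  fi
  for marker in 'serve started, PID=' 'wait-on timed out'; do
    # grep -c prints 0 and exits non-zero when nothing matches.
    count=$(grep -c -F -- "$marker" "$file") || true
    if [ "$count" -ne "$expected" ]; then
      echo "FAIL: '$marker' found on $count lines, expected $expected"
      return 1
    fi
  done
  echo "OK: both markers found on $expected lines"
}
```

For this PR the call would be `check_start_server_blocks .github/workflows/e2e.yml 6`.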

🤖 Generated with Claude Code

CI run 24970006226 on commit 00edd23 had 14/26 jobs fail. All 7 firefox shards and all 6 webkit-gen shards failed with NS_ERROR_CONNECTION_REFUSED to http://localhost:3000/account in their first beforeEach. The static server (npx serve out -l 3000) was never responding when Playwright tried to connect, but the workflow proceeded to run tests anyway.

Root cause of the silent failure: the prior 'Start server' step ran 'npx serve ... &; sleep 5; npx wait-on ... --timeout 60000', and despite GitHub Actions' default '-eo pipefail' shell, wait-on's failure was not reliably propagating as a step error, likely an interaction between '-e' and the backgrounded job. So tests ran, every test cascaded to ECONNREFUSED, and 50 minutes of CI per shard produced no actionable signal beyond 'connection refused.'

Fix: replace all 6 identical 'Start server' blocks with a defensive version that:

1. Captures serve's PID and tees output to serve.log
2. Verifies serve didn't exit immediately
3. Explicitly checks wait-on's exit code via 'if ! ... ; then ...'
4. On failure, dumps serve.log, listening sockets (ss/netstat), serve process state (ps), and a direct curl probe, then exit 1
5. On success, prints a confirmation line for grep-friendly logs

Affected jobs: smoke, rate-limiting, auth-setup, e2e (chromium-gen + chromium-msg), e2e-firefox, e2e-webkit. Six identical blocks updated in place; preserves all existing 'env: CI: true' attachments on the e2e/firefox/webkit jobs.

After this lands, the next firefox/webkit cascade will fail in ~90s with captured diagnostics pointing at the actual root cause, instead of a 50-minute silent ECONNREFUSED storm. That signal will tell us whether to bump the wait-on timeout, switch the static server, or fix something else entirely. Currently the cascade noise drowns out the cause.

Out of scope:

- Cross-shard test-user collisions in messaging shards (#50, #57)
- The webkit-gen test failures themselves (theme-switching, etc.)

These are separate investigations once we know whether they're real or a cascade symptom of the serve-died problem this fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…andoff (#61) Captures end-of-session state after 6 PRs landed (#54, #55, #56, #58, #59, #60). Family A is empty (both stability hotspots resolved). Family D1 done. Recommended next pickup: B1 (#43 /payment/result page). The handoff doc is the load-bearing artifact for the next operator — it lists open issues by family, sharp edges, and a copy-pasteable quick-start. Co-authored-by: TurtleWolfe <TurtleWolfe@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test as shipped called context.setOffline(true) BEFORE page.goto(), which works on a dev server (page is in cache after first hit) but fails in CI's static-export flow with net::ERR_INTERNET_DISCONNECTED — there's no service-worker cache warmed for this URL when the page first loads. Fix: navigate online first, wait for the page to render the loaded / not-found branch (both of which mount OfflineRetryBanner), then flip offline. useOfflineStatus listens for the browser 'offline' event so the banner re-renders without another navigation. Same story as the diagnostic-loud failure modes PR #58 was designed to catch — the failure was clear from the CI log so this is a one-shot fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced Apr 27, 2026

[Stability] Messaging E2E: chromium-msg shard fails on real-time-delivery cross-window propagation #57

Open

[Gap-Audit] RLS test suite wedges on cloud Supabase when prior runs leave residue #50

Closed

TortoiseWolfe merged commit 5de242a into main Apr 27, 2026
28 checks passed

TortoiseWolfe deleted the ci/wait-on-fail-loud branch April 27, 2026 12:12

TortoiseWolfe mentioned this pull request Apr 27, 2026

docs(stability): refresh STATUS + add 2026-04-27 session handoff #61

Merged

TortoiseWolfe merged 1 commit into main from ci/wait-on-fail-loud
