
ci(e2e): fail loud when wait-on times out, capture serve diagnostics #58

Merged
TortoiseWolfe merged 1 commit into main from ci/wait-on-fail-loud
Apr 27, 2026

Conversation

@TortoiseWolfe
Owner

Summary

CI run 24970006226 on commit `00edd23` had 14/26 jobs fail. All 7 firefox shards and all 6 webkit-gen shards failed with `NS_ERROR_CONNECTION_REFUSED` to `http://localhost:3000/account` in their first `beforeEach`. The static server (`npx serve out -l 3000`) wasn't responding when Playwright tried to connect, but the workflow ran the tests anyway, producing about 50 minutes of cascading ECONNREFUSED failures per shard with no actionable signal.

The bug is in the workflow: the `Start server` step backgrounded `npx serve ...`, slept 5 seconds, then ran `npx wait-on ... --timeout 60000`, and wait-on's failure was not reliably propagating as a step error (likely `-e` interacting badly with the backgrounded job).
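
One behaviour is certain regardless of what happened in that run: `set -e` only reacts to foreground commands, so a background job that dies does not stop the script. The sketch below is a stand-alone demonstration (the function name is hypothetical and nothing in it is taken from the workflow). It does not establish the cause of the failure above; it only shows why an explicit liveness and exit-code check is the safer pattern.

```shell
# Demonstration only. `false &` stands in for a server that dies right after launch.
demo_background_failure() {
  bash -ec '
    false &
    sleep 1
    echo "still running after background failure"
  '
}
```

Running `demo_background_failure` prints the message and exits 0, even though the inner shell runs with `-e` and its background job failed.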

Fix

Replace all 6 identical `Start server` blocks with a defensive version that:

  1. Captures serve's PID and tees output to `serve.log`
  2. Verifies serve didn't exit immediately (port in use, missing `out/`, etc.)
  3. Explicitly checks wait-on's exit code via `if ! ... ; then ...`
  4. On failure: dumps `serve.log`, listening sockets (`ss`/`netstat`), serve process state (`ps`), and a direct `curl` probe — then `exit 1`
  5. On success: prints `serve is responding on http://localhost:3000` for grep-friendly logs
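
The five steps above could look roughly like the following. This is a minimal sketch, not the merged workflow: the function names, the 2-second settle delay, and the exact diagnostic flags are assumptions, and for simplicity it redirects serve's output to `serve.log` rather than teeing it.

```shell
# Sketch only: function names and the settle delay are assumptions.

# Print everything useful about a server that never became reachable.
dump_serve_diagnostics() {
  local log="$1" pid="$2" url="$3"
  echo "--- ${log} ---"
  cat "$log" 2>/dev/null || echo "(log not readable)"
  echo "--- listening sockets ---"
  ss -ltn 2>/dev/null || netstat -ltn 2>/dev/null || echo "(ss/netstat unavailable)"
  echo "--- serve process state ---"
  ps -p "$pid" 2>/dev/null || echo "(PID ${pid} is gone)"
  echo "--- curl probe ---"
  curl -sS --max-time 5 -o /dev/null "$url" 2>&1 || echo "(curl probe failed)"
}

# Run the readiness command given after the first three arguments and
# check its exit status explicitly instead of relying on `set -e`.
wait_or_fail_loud() {
  local log="$1" pid="$2" url="$3"
  shift 3
  if ! "$@"; then
    echo "wait-on timed out: ${url} never responded"
    dump_serve_diagnostics "$log" "$pid" "$url"
    return 1
  fi
  echo "serve is responding on ${url}"
}

# How a workflow step might tie it together (defined here, not invoked).
start_server_step() {
  local url="http://localhost:3000" log="serve.log"
  npx serve out -l 3000 >"$log" 2>&1 &
  local pid=$!
  echo "serve started, PID=${pid}"
  sleep 2   # assumed settle delay before the liveness check
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "serve exited immediately"
    cat "$log"
    return 1
  fi
  wait_or_fail_loud "$log" "$pid" "$url" npx wait-on "$url" --timeout 60000
}
```

Calling `start_server_step` as the last line of the step would make the step fail with the function's return status. Passing the readiness command as arguments keeps the exit-code check in one place and lets it be exercised with stand-ins such as `true` or `false`.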

Affected jobs (one block each): `smoke`, `rate-limiting`, `auth-setup`, `e2e`, `e2e-firefox`, `e2e-webkit`. All 6 jobs preserve their existing `env: CI: true` blocks (where present on `e2e`, `e2e-firefox`, `e2e-webkit`).

What this gives us

After merge, the next failing run produces clear diagnostics at the moment of failure instead of cascading into hundreds of downstream test failures. Total wall time on a serve-bind failure drops from ~50 minutes per shard to ~90 seconds. The diagnostics will tell us why serve isn't binding (timeout too short? serve crash? port collision?); currently the cascade noise drowns that out.

What this doesn't do

  • No timeout bump. We have no evidence yet that 60s is too short; bump it only after diagnostics confirm that it is. That is a follow-up, not this PR.
  • No source-code changes. Workflow only.
  • Doesn't address #50/#57 (cross-shard messaging test-user collision) — different root cause, different fix.

Test plan

  • YAML parses cleanly (`python3 -c "import yaml; yaml.safe_load(open('.github/workflows/e2e.yml'))"`); 8 jobs intact.
  • All 6 `Start server` blocks contain `serve started, PID=` and `wait-on timed out` strings (`grep -c` returns 6).
  • The 3 `env: CI: true` blocks (e2e/firefox/webkit jobs) are still attached.
  • Real CI verification on this PR's run: the smoke, rate-limiting, auth-setup, and chromium-gen jobs should produce the `serve is responding on http://localhost:3000` log line. Firefox/webkit jobs may either pass or fail loudly (≤90s) with diagnostics; either is the signal we want.
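
The `grep -c` check in the test plan can be made self-verifying. The helper below is a hypothetical sketch (its name and messages are not from the PR): it counts lines containing a fixed string and fails when the count differs from the expected one.

```shell
# Hypothetical helper: succeed only if `marker` appears on exactly
# `expected` lines of `file`. A missing file counts as zero matches.
assert_marker_count() {
  local file="$1" marker="$2" expected="$3" actual
  # grep -c exits non-zero on zero matches or errors; normalise to a number.
  actual="$(grep -cF -- "$marker" "$file" 2>/dev/null)" || actual="${actual:-0}"
  if [ "$actual" -ne "$expected" ]; then
    echo "expected ${expected} x '${marker}' in ${file}, found ${actual}"
    return 1
  fi
  echo "ok: ${actual} x '${marker}'"
}
```

For this PR the calls would be `assert_marker_count .github/workflows/e2e.yml 'serve started, PID=' 6` and the same with `'wait-on timed out'`. Note that `grep -c` counts matching lines rather than occurrences, which is equivalent here only if each block contains each string on a single line.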

🤖 Generated with Claude Code

CI run 24970006226 on commit 00edd23 had 14/26 jobs fail. All 7 firefox
shards and all 6 webkit-gen shards failed with NS_ERROR_CONNECTION_REFUSED
to http://localhost:3000/account in their first beforeEach. The static
server (npx serve out -l 3000) was never responding when Playwright tried
to connect, but the workflow proceeded to run tests anyway.

Root cause of the silent failure: the prior 'Start server' step ran
'npx serve ... &; sleep 5; npx wait-on ... --timeout 60000', and despite
GitHub Actions' default '-eo pipefail' shell, wait-on's failure was not
reliably propagating as a step error — likely an interaction between
'-e' and the backgrounded job. So tests ran, every test cascaded to
ECONNREFUSED, and 50 minutes of CI per shard produced no actionable
signal beyond 'connection refused.'

Fix: replace all 6 identical 'Start server' blocks with a defensive
version that:

1. Captures serve's PID and tees output to serve.log
2. Verifies serve didn't exit immediately
3. Explicitly checks wait-on's exit code via 'if ! ... ; then ...'
4. On failure, dumps serve.log, listening sockets (ss/netstat),
   serve process state (ps), and a direct curl probe — then exit 1
5. On success, prints a confirmation line for grep-friendly logs

Affected jobs: smoke, rate-limiting, auth-setup, e2e (chromium-gen +
chromium-msg), e2e-firefox, e2e-webkit. Six identical blocks updated
in place; preserves all existing 'env: CI: true' attachments on the
e2e/firefox/webkit jobs.

After this lands, the next firefox/webkit cascade will fail in ~90s
with captured diagnostics pointing at the actual root cause, instead
of a 50-minute silent ECONNREFUSED storm. That signal will tell us
whether to bump the wait-on timeout, switch the static server, or fix
something else entirely. Currently the cascade noise drowns out the
cause.

Out of scope:
- Cross-shard test-user collisions in messaging shards (#50, #57)
- The webkit-gen test failures themselves (theme-switching, etc.) —
  separate investigations once we know whether they're real or a
  cascade symptom of the serve-died problem this fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TortoiseWolfe TortoiseWolfe merged commit 5de242a into main Apr 27, 2026
28 checks passed
@TortoiseWolfe TortoiseWolfe deleted the ci/wait-on-fail-loud branch April 27, 2026 12:12
TortoiseWolfe added a commit that referenced this pull request Apr 27, 2026
…andoff (#61)

Captures end-of-session state after 6 PRs landed (#54, #55, #56, #58,
#59, #60). Family A is empty (both stability hotspots resolved).
Family D1 done. Recommended next pickup: B1 (#43 /payment/result page).
The handoff doc is the load-bearing artifact for the next operator —
it lists open issues by family, sharp edges, and a copy-pasteable
quick-start.

Co-authored-by: TurtleWolfe <TurtleWolfe@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TortoiseWolfe pushed a commit that referenced this pull request Apr 28, 2026
The test as shipped called context.setOffline(true) BEFORE page.goto(),
which works on a dev server (page is in cache after first hit) but fails
in CI's static-export flow with net::ERR_INTERNET_DISCONNECTED — there's
no service-worker cache warmed for this URL when the page first loads.

Fix: navigate online first, wait for the page to render the loaded /
not-found branch (both of which mount OfflineRetryBanner), then flip
offline. useOfflineStatus listens for the browser 'offline' event so the
banner re-renders without another navigation.

Same story as the diagnostic-loud failure modes PR #58 was designed to
catch — the failure was clear from the CI log so this is a one-shot fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>