ci(e2e): fail loud when wait-on times out, capture serve diagnostics#58
Merged
TortoiseWolfe merged 1 commit into main on Apr 27, 2026
Conversation
CI run 24970006226 on commit 00edd23 had 14/26 jobs fail. All 7 firefox shards and all 6 webkit-gen shards failed with NS_ERROR_CONNECTION_REFUSED to http://localhost:3000/account in their first beforeEach. The static server (npx serve out -l 3000) was never responding when Playwright tried to connect, but the workflow proceeded to run tests anyway.

Root cause of the silent failure: the prior 'Start server' step ran 'npx serve ... &; sleep 5; npx wait-on ... --timeout 60000', and despite GitHub Actions' default '-eo pipefail' shell, wait-on's failure was not reliably propagating as a step error, likely because of an interaction between '-e' and the backgrounded job. So tests ran, every test cascaded to ECONNREFUSED, and 50 minutes of CI per shard produced no actionable signal beyond 'connection refused.'

Fix: replace all 6 identical 'Start server' blocks with a defensive version that:

1. Captures serve's PID and tees output to serve.log
2. Verifies serve didn't exit immediately
3. Explicitly checks wait-on's exit code via 'if ! ... ; then ...'
4. On failure, dumps serve.log, listening sockets (ss/netstat), serve process state (ps), and a direct curl probe, then runs exit 1
5. On success, prints a confirmation line for grep-friendly logs

Affected jobs: smoke, rate-limiting, auth-setup, e2e (chromium-gen + chromium-msg), e2e-firefox, e2e-webkit. Six identical blocks updated in place; preserves all existing 'env: CI: true' attachments on the e2e/firefox/webkit jobs.

After this lands, the next firefox/webkit cascade will fail in ~90s with captured diagnostics pointing at the actual root cause, instead of a 50-minute silent ECONNREFUSED storm. That signal will tell us whether to bump the wait-on timeout, switch the static server, or fix something else entirely. Currently the cascade noise drowns out the cause.

Out of scope:

- Cross-shard test-user collisions in messaging shards (#50, #57)
- The webkit-gen test failures themselves (theme-switching, etc.)

These are separate investigations once we know whether they're real or a cascade symptom of the serve-died problem this fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TortoiseWolfe added a commit that referenced this pull request Apr 27, 2026
…andoff (#61) Captures end-of-session state after 6 PRs landed (#54, #55, #56, #58, #59, #60). Family A is empty (both stability hotspots resolved). Family D1 done. Recommended next pickup: B1 (#43 /payment/result page). The handoff doc is the load-bearing artifact for the next operator — it lists open issues by family, sharp edges, and a copy-pasteable quick-start. Co-authored-by: TurtleWolfe <TurtleWolfe@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TortoiseWolfe pushed a commit that referenced this pull request Apr 28, 2026
The test as shipped called context.setOffline(true) BEFORE page.goto(), which works on a dev server (page is in cache after first hit) but fails in CI's static-export flow with net::ERR_INTERNET_DISCONNECTED — there's no service-worker cache warmed for this URL when the page first loads. Fix: navigate online first, wait for the page to render the loaded / not-found branch (both of which mount OfflineRetryBanner), then flip offline. useOfflineStatus listens for the browser 'offline' event so the banner re-renders without another navigation. Same story as the diagnostic-loud failure modes PR #58 was designed to catch — the failure was clear from the CI log so this is a one-shot fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
CI run 24970006226 on commit `00edd23` had 14/26 jobs fail. All 7 firefox shards and all 6 webkit-gen shards failed with `NS_ERROR_CONNECTION_REFUSED` to `http://localhost:3000/account` in their first `beforeEach`. The static server (`npx serve out -l 3000`) wasn't responding when Playwright tried to connect, but the workflow proceeded to run tests anyway, producing 50 minutes of cascading ECONNREFUSED failures per shard with no actionable signal.
The bug is in the workflow: `Start server` ran `npx serve ... &; sleep 5; npx wait-on ... --timeout 60000`, and wait-on's failure was not reliably propagating as a step error (likely `-e` interacting badly with the backgrounded job).
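One piece of this is easy to show in isolation: under `-e`, bash only reacts to the exit status of foreground commands, so a backgrounded process that dies does not stop the script. The sketch below is a standalone demonstration, not a reproduction of the wait-on behaviour described above; `false` is a stand-in for a server that exits right after launch.

```shell
#!/usr/bin/env bash
# Demonstration only: `false` stands in for a server that dies at startup.

# Background case: the job fails, but errexit ignores asynchronous commands,
# so the inner script carries on and prints its message.
bg_output=$(bash -eo pipefail -c '
  false &
  sleep 0.2
  echo "still running"
')
echo "background case printed: $bg_output"

# Foreground case for contrast: the same failure aborts the inner script
# before the echo is reached.
fg_status=0
fg_output=$(bash -eo pipefail -c 'false; echo "unreachable"') || fg_status=$?
echo "foreground case exit status: $fg_status"
```

This is why a step that launches a server with `&` needs its own liveness and readiness checks rather than leaning on the shell options alone.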
Fix
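Replace all 6 identical `Start server` blocks with a defensive version that:
1. Captures serve's PID and tees output to `serve.log`
2. Verifies serve didn't exit immediately
3. Explicitly checks wait-on's exit code via `if ! ... ; then ...`
4. On failure, dumps `serve.log`, listening sockets (`ss`/`netstat`), serve process state (`ps`), and a direct `curl` probe, then runs `exit 1`
5. On success, prints a confirmation line for grep-friendly logs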
Affected jobs (one block each): `smoke`, `rate-limiting`, `auth-setup`, `e2e`, `e2e-firefox`, `e2e-webkit`. The existing `env: CI: true` blocks on `e2e`, `e2e-firefox`, and `e2e-webkit` are preserved.
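As a rough illustration of the start-and-verify pattern (capture the PID, confirm the process is still alive, then test the readiness command's exit status explicitly), here is a minimal sketch. It is not the code merged in this PR: `start_server_checked`, its arguments, and `SERVER_PID` are hypothetical names, and output is redirected to the log file instead of being tee'd so that `$!` refers to the server process itself.

```shell
#!/usr/bin/env bash
# Sketch only: start_server_checked and SERVER_PID are hypothetical names.
#   $1 = command that starts the server (in CI, the npx serve invocation)
#   $2 = command that blocks until ready (in CI, the npx wait-on invocation)
#   $3 = log file (defaults to serve.log)
start_server_checked() {
  local server_cmd=$1 ready_cmd=$2 log=${3:-serve.log}

  # Start the server in the background and remember its PID.
  # Simplification: redirect to the log rather than piping through tee.
  bash -c "$server_cmd" >"$log" 2>&1 &
  SERVER_PID=$!

  # Give it a moment, then confirm it did not exit immediately.
  sleep 1
  if ! kill -0 "$SERVER_PID" 2>/dev/null; then
    echo "ERROR: server exited immediately; log follows" >&2
    cat "$log" >&2
    return 1
  fi

  # Test the readiness command's status explicitly instead of relying on -e.
  if ! bash -c "$ready_cmd"; then
    echo "ERROR: readiness check failed; log follows" >&2
    cat "$log" >&2
    kill "$SERVER_PID" 2>/dev/null || true
    return 1
  fi

  # Confirmation line that is easy to grep for in CI logs.
  echo "Server ready (pid $SERVER_PID)"
}
```

In a workflow step this might be called as `start_server_checked 'npx serve out -l 3000' 'npx wait-on http://localhost:3000 --timeout 60000' || exit 1`, where the URL passed to wait-on is an assumption since the original command is elided.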
What this gives us
After merge, the next failing run produces clear diagnostics at the moment of failure instead of cascading into hundreds of downstream test failures. Total wall time on a serve-bind failure goes from ~50 minutes per shard to ~90 seconds. That signal will tell us why serve isn't binding (timeout too short? serve crash? port collision?); currently the cascade noise drowns it out.
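To make the three questions concrete, a diagnostics dump along these lines would surface a crash (via the log and `ps`), a port collision (via the listening sockets), and a slow start (via the `curl` probe). This is a hedged sketch rather than the merged workflow code; `dump_diagnostics` and its arguments are hypothetical, and every probe is guarded so that a missing tool cannot mask the original failure.

```shell
#!/usr/bin/env bash
# Sketch only: dump_diagnostics is a hypothetical helper name.
#   $1 = server PID, $2 = URL to probe, $3 = server log file
dump_diagnostics() {
  local pid=$1 url=$2 log=$3

  echo "=== serve log ($log) ==="
  cat "$log" 2>/dev/null || echo "(no log file)"

  # Listening TCP sockets: prefer ss, fall back to netstat.
  echo "=== listening sockets ==="
  if command -v ss >/dev/null 2>&1; then
    ss -ltn || true
  elif command -v netstat >/dev/null 2>&1; then
    netstat -ltn || true
  else
    echo "(neither ss nor netstat available)"
  fi

  # Is the server process still there, and in what state?
  echo "=== process state (pid $pid) ==="
  ps -p "$pid" -o pid,stat,etime,args || echo "(process $pid not running)"

  # Direct probe, independent of the readiness tool that just failed.
  echo "=== curl probe ($url) ==="
  if command -v curl >/dev/null 2>&1; then
    curl -sS -o /dev/null -w 'HTTP %{http_code}\n' --max-time 5 "$url" \
      || echo "(curl probe failed)"
  else
    echo "(curl not available)"
  fi
}
```

A failing step would call it just before exiting, for example `dump_diagnostics "$SERVER_PID" http://localhost:3000 serve.log; exit 1`, where `SERVER_PID` is assumed to hold the PID captured at launch.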
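What this doesn't do
- Address cross-shard test-user collisions in messaging shards (#50, #57)
- Fix the webkit-gen test failures themselves (theme-switching, etc.)
These are separate investigations once we know whether they're real or a cascade symptom of the serve-died problem this fixes.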
Test plan
🤖 Generated with Claude Code