Skip to content

chore: fix kv-store browser tests hangs#22721

Merged
mverzilli merged 12 commits intomerge-train/fairiesfrom
martin/fix-kv-store-browser-ci
May 4, 2026
Merged

chore: fix kv-store browser tests hangs#22721
mverzilli merged 12 commits intomerge-train/fairiesfrom
martin/fix-kv-store-browser-ci

Conversation

@mverzilli
Copy link
Copy Markdown
Contributor

@mverzilli mverzilli commented Apr 22, 2026

yarn-project/kv-store/yarn test:browser runs ~11 vitest browser-mode test files in a single vitest process, all sharing one chromium browser. In CI3's ISOLATE environment (docker run --cpus=2 --memory=8g --tmpfs /tmp:exec,size=1g), the run would intermittently hang at a test-file boundary, sit silent for >80s, then get killed by the outer CI timeout. Roughly 25% hit rate on CI; zero hit rate on developer machines.

The bug only manifests under CPU pressure. On unconstrained hardware (developer boxes with many cores), every chromium thread always gets scheduled in time and the bug never surfaces.

Root caused to a CDP teardown deadlock between vitest and chromium at test-file transitions:

  1. Vitest opens 10 CDP TCP connections to chromium's headless_shell --type=utility --utility-sub-type=network.mojom.NetworkService (the chromium "network service" process) at startup.
  2. Between test files, vitest tears down 6 of those 10 connections in a batch, sending FIN on each socket.
  3. The kernel transitions chromium's side of those 6 sockets to CLOSE_WAIT (peer closed, application hasn't called close() yet).
  4. Chromium's network service event loop, starved of CPU under --cpus=2, never gets scheduled to drain those closed FDs and call close().
  5. Vitest's teardown path is awaiting completion of the close handshake on those 6 connections (likely a Promise.all over CDPSession detach), and never gets the response.
  6. Both processes end up with zero on-CPU threads. vitest parked on futex_wait_queue, chromium parked on ep_poll -> deadlock

The hang is at upstream code (@vitest/browser-playwright ↔ playwright ↔ chromium CDP). The 6:4 close/open ratio reproduces exactly across runs.

The fix is to run each test file in a separate vitest invocation, instead of one vitest process iterating over all files.

  • yarn-project/kv-store/scripts/run-browser-tests.sh: new shell loop that finds each *.test.ts under src/indexeddb/ and src/sqlite-opfs/ and runs yarn vitest run "$file" per file, sequentially.
  • yarn-project/kv-store/package.local.json: test:browser overridden to bash scripts/run-browser-tests.sh

This avoids the bug entirely because the cross-file teardown path is never exercised: each vitest process only has to tear down at its own end-of-process, where chromium gets killed outright by the OS rather than asked to close gracefully via CDP.

Cost: ~5–10s of vite/yarn startup per file. With 11 files, that's ~60–100s of extra wall time per yarn test:browser run. Not great, but if we can stabilize the suite and keep it running, we can iteratively look for better ways (eg: reducing the amount of files to reduce overhead where sensible).

Verification

A reliable repro for future regressions is saved at yarn-project/kv-store/scripts/repro-browser-hang.sh. It runs the previous (single-process) test:browser shape under docker_isolate constraints; reproduces consistently (without this fix) in ~3 minutes.

Closes F-589

mverzilli and others added 7 commits April 22, 2026 11:24
Temporary diagnostic wrapper around `yarn test:browser` that runs
with a 90s timeout and captures system state (cgroup cpu.stat, memory,
/tmp usage, process states) every 1s to /tmp/diag/probe.log, then
dumps the tail to stderr on exit.

Goal: diagnose the CI hang documented in PR #22693 where the kv-store
browser tests hang silently around test file 5 only in the CI ISOLATE
container, not reproducible from the same docker args locally.

Revert once root-caused.
Upgrade the temporary CI diagnostic wrapper so the next hang captures
enough ground truth to identify the stall site.

- Dump full probe + stacks log on exit (v1 tail -400 truncated the
  transition moment).
- Hybrid cadence: 1s steady, 0.1s burst when `yarn test:browser` stdout
  has been silent for >3s, capped at ~20s dense sampling.
- Per-snapshot: /proc/net/{tcp,tcp6,unix} state counts.
- Per-burst stacks: lsof -i TCP -U, /proc/\$pid/syscall, per-thread wchan
  histogram, socket/pipe/anon_inode FD symlinks.
- Widen pids_of_interest to node|vitest|esbuild|yarn so we see the full
  vitest process tree, not just chromium.
Two small tweaks after observing a passing CI run (6e65957a4521a43d):

- Capture stacks_snapshot every 10 steady iterations (~10s of wall time)
  so we have "healthy waiting" reference data to diff against
  burst-triggered snapshots during a future hang. Without this, a
  passing run produced zero in-flight stacks because the burst trigger
  never armed.
- Silence the benign "No such file or directory" race when /proc/\$pid/
  disappears mid-snapshot by moving the redirect around the read, not
  around the pipe (the pipe's output-side didn't inherit it).
Force a fixed 100k-iteration awk arithmetic loop under bash's `time`
keyword each snapshot; emit real/user/sys as a "microbench:" line.

The (user+sys)/real ratio disambiguates the two scenarios consistent
with the v1 hang data (flat usage_usec, threads on futex_wait_queue):

  - scenario A (vitest/chromium IPC deadlock): no process in our cgroup
    is demanding CPU, so microbench gets all it asks for -> ratio ~1.0
  - scenario B (host-level preemption / noisy neighbor): microbench
    demands CPU but host scheduler doesn't serve it -> ratio <<1.0

Fires every 5th steady iteration (~0.4% overhead during tests) and
every burst iteration (overhead irrelevant during a hang since tests
aren't running anyway).
…eadlock

Vitest+chromium hit a CDP CLOSE_WAIT deadlock at test-file transitions
under CPU-constrained environments (CI3 ISOLATE: --cpus=2). Vitest closes
6 of 10 CDP TCP connections in a teardown batch; chromium's network
service can't drain them under contention; vitest's teardown blocks
indefinitely on the close-handshake. Both processes end up with zero
on-CPU threads (vitest on futex_wait_queue, chromium on ep_poll). Outer
ci3 timeout fires after ~90s of silence, CI fails.

Fix: run each test file in a separate vitest invocation, so CDP teardown
only happens at process exit (where the OS reaps everything). Cost is
~5-10s of vite startup per file, ~60-100s extra per browser test run.

Also adds:
- scripts/repro-browser-hang.sh: reproduces the hang reliably under
  docker_isolate (--cpus=2 --memory=8g --tmpfs /tmp:exec,size=1g) for
  future regression checks.
- scripts/probe-test-browser.sh: deep diagnostic wrapper with hybrid
  steady/burst sampling of /proc state, cgroup counters, and socket
  state. Used to root-cause this bug; kept for future debugging.
- .test_patterns.yml: transfer kv-store TIMEOUT flake-entry ownership
  to martin (the original owner alex held it from a prior incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… martin

Catch-all entry on yarn-project/kv-store with error_regex "vitest" sends
any browser-shaped failure (anything emitting "vitest" in its output) to
martin, so colleagues aren't blocked by residual flakiness while the new
run-browser-tests.sh approach soaks in. Existing narrow entries still
apply — owners are unioned across matching entries — so grego still gets
pinged for his historical patterns.

Intended to be removed after a few weeks of stable CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mverzilli mverzilli changed the base branch from next to merge-train/fairies April 28, 2026 11:09
@mverzilli mverzilli changed the title debug(kv-store): instrument test:browser to diagnose CI hang chore: fix kv-store browser tests hangs Apr 28, 2026
mverzilli and others added 3 commits April 28, 2026 11:51
The merge brought in next's package.json with "test": "yarn test:node"
(the historical disable workaround), overwriting the chained test set in
the fix commit. package.local.json still has the correct override; this
regenerates package.json from it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vitest.config.ts only includes src/sqlite-opfs/**/*.test.ts when
VITE_SQLITE_OPFS=1 (gated since #22693 due to a separate hang). The
wrapper script enumerated both directories unconditionally, so vitest
received a positional path that the include filter excluded, returned
"no test files found", and exited 1 — killing the loop via set -e.

Mirror the gate at file-enumeration time so the script only feeds
vitest paths that the config will accept.
…olation

The VITE_SQLITE_OPFS gate was added in #22693 as a workaround for the
2-CPU CI hang. That hang was a vitest+chromium CDP teardown deadlock
at test-file transitions, now fixed by per-file vitest invocation
(see run-browser-tests.sh). The gating was always meant to be removed
once the hang was root-caused.

Drop the env-flag gate in vitest.config.ts and stop mirroring it in
the wrapper script — vitest now discovers sqlite-opfs tests by default
and the wrapper enumerates them in the same loop as the indexeddb files.

The existing wide-net catch-all in .test_patterns.yml continues to
quarantine any residual flakiness to martin so colleagues aren't blocked.
@mverzilli mverzilli requested a review from Thunkar April 28, 2026 12:49
CI's bootstrap test_cmds machinery distributes each emitted command to a
separate parallel slot. kv-store was previously emitted as a single line
(`cd yarn-project/kv-store && yarn test`), which collapsed all 28 of its
test files into one slot — leaving the rest of the executor idle while
the package serialised through every file.

Add kv-store/bootstrap.sh with a test_cmds function that emits one line
per test file, mirroring the aztec/bootstrap.sh pattern. New
kv-store/scripts/run_test.sh dispatches a single file to vitest (browser
tests under indexeddb, sqlite-opfs) or mocha (everything else). Browser
files keep ISOLATE=1; node files run unisolated like other packages.

The local-dev `yarn test:browser` path (run-browser-tests.sh) now also
goes through run_test.sh, so per-file invocation has a single source of
truth for both CI and local reproduction.
Copy link
Copy Markdown
Contributor

@Thunkar Thunkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing investigation🙏

@mverzilli mverzilli merged commit 847f969 into merge-train/fairies May 4, 2026
28 checks passed
@mverzilli mverzilli deleted the martin/fix-kv-store-browser-ci branch May 4, 2026 10:38
AztecBot pushed a commit that referenced this pull request May 4, 2026
`yarn-project/kv-store/yarn test:browser` runs ~11 vitest browser-mode
test files in a single vitest process, all sharing one chromium browser.
In CI3's `ISOLATE` environment (`docker run
--cpus=2 --memory=8g --tmpfs /tmp:exec,size=1g`), the run would
intermittently hang at a test-file boundary, sit silent for >80s, then
get killed by the outer CI timeout. Roughly 25% hit rate on CI; zero hit
rate on developer machines.
                              
The bug only manifests under CPU pressure. On unconstrained hardware
(developer boxes with many cores), every chromium thread always gets
scheduled in time and the bug never surfaces.
                              
Root caused to a CDP teardown deadlock between vitest and chromium at
test-file transitions:

1. Vitest opens 10 CDP TCP connections to chromium's `headless_shell
--type=utility --utility-sub-type=network.mojom.NetworkService `(the
chromium "network service" process) at startup.
2. Between test files, vitest tears down 6 of those 10 connections in a
batch, sending FIN on each socket.
3. The kernel transitions chromium's side of those 6 sockets to
`CLOSE_WAIT` (peer closed, application hasn't called `close()` yet).
4. Chromium's network service event loop, starved of CPU under
`--cpus=2`, never gets scheduled to drain those closed FDs and call
`close()`.
5. Vitest's teardown path is awaiting completion of the close handshake
on those 6 connections (likely a `Promise.all` over `CDPSession`
detach), and never gets the response.
6. Both processes end up with zero on-CPU threads. `vitest` parked on
`futex_wait_queue`, chromium parked on `ep_poll` -> deadlock
The hang is at upstream code (@vitest/browser-playwright ↔ playwright ↔
chromium CDP). The 6:4 close/open ratio reproduces exactly across runs.

The fix is to run each test file in a separate vitest invocation,
instead of one vitest process iterating over all files.

- `yarn-project/kv-store/scripts/run-browser-tests.sh`: new shell loop
that finds each *.test.ts under src/indexeddb/ and src/sqlite-opfs/ and
runs yarn vitest run "$file" per file, sequentially.
- `yarn-project/kv-store/package.local.json`: test:browser overridden to
bash scripts/run-browser-tests.sh

This avoids the bug entirely because the cross-file teardown path is
never exercised: each vitest process only has to tear down at its own
end-of-process, where chromium gets killed outright by the OS rather
than asked to close gracefully via CDP.
Cost: ~5–10s of vite/yarn startup per file. With 11 files, that's
~60–100s of extra wall time per yarn test:browser run. Not great, but if
we can stabilize the suite and keep it running, we can iteratively look
for better ways (eg: reducing the amount of files to reduce overhead
where sensible).
  
## Verification

A reliable repro for future regressions is saved at
yarn-project/kv-store/scripts/repro-browser-hang.sh. It runs the
previous (single-process) test:browser shape under docker_isolate
constraints; reproduces consistently (without this fix) in ~3 minutes.

Closes F-589

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AztecBot
Copy link
Copy Markdown
Collaborator

AztecBot commented May 4, 2026

✅ Successfully backported to backport-to-v4-next-staging #22924.

mverzilli pushed a commit that referenced this pull request May 4, 2026
Combines both sides of the merge in ensurePool():
- Keeps poolDirectory tracking from #22721 (kv-store browser test hangs fix)
- Adopts the local 's = sqlite3' alias and sqlite3mc_vfs_create call from #22759

The handleDeleteDb codepath inherited from #22721 (async + ensurePool with
remembered poolDirectory) is preserved as-is and is orthogonal to the
encryption wrapper, which only kicks in via MC_SAH_POOL_VFS_NAME in handleInit.
mverzilli added a commit that referenced this pull request May 4, 2026
## Summary

Backport of #22759 (`feat: kv-store sqlite backend with page level
encryption`) to `backport-to-v4-next-staging`.

The auto-cherry-pick failed on
`yarn-project/kv-store/src/sqlite-opfs/worker.ts`. The conflict was
localized to the `ensurePool` function: the v4-next branch already
contains #22721 (kv-store browser test hangs fix), which added a
`poolDirectory = directory;` assignment, while #22759 introduced a local
`s = sqlite3` alias and the sqlite3mc VFS registration call. Both
changes are compatible and have been combined.

## Commit structure

1. `1944712b` — original cherry-pick committed as-is with the conflict
marker preserved in `worker.ts`, so reviewers can see exactly what
conflicted.
2. `0c397e2c` — conflict resolution: keeps `poolDirectory` tracking from
#22721 + the local `s` alias and `sqlite3mc_vfs_create` call from
#22759.

No build-fixes commit was needed — `kv-store` and `sqlite3mc-wasm` both
typecheck clean against the resolved tree.

## Verification

- `yarn install` succeeded.
- `tsgo --noEmit` is clean for both `kv-store` and `sqlite3mc-wasm`.
- Diff stat matches the original PR: 26 files, +699/-14.

## Original PR

- #22759
- Merge commit: `97c6e48fe9bf269d12fa0640e4e9303f5e7cbae5`

ClaudeBox log: https://claudebox.work/s/6446d1f92380c25a?run=1

ClaudeBox log: https://claudebox.work/s/6446d1f92380c25a?run=1
chrismarino pushed a commit to chrismarino/aztec-packages that referenced this pull request May 5, 2026
BEGIN_COMMIT_OVERRIDE
chore: fix kv-store browser tests hangs (AztecProtocol#22721)
feat: kv-store sqlite backend with page level encryption (AztecProtocol#22759)
fix: install node 22 for aztec-cli acceptance test (AztecProtocol#22917)
fix(kv-store): exclude @aztec/sqlite3mc-wasm from vitest optimizeDeps
(AztecProtocol#22929)
feat(txe): add tx private logs to tx side effects oracle (AztecProtocol#22889)
feat(aztec-nr): add call_self stubs for utility functions (AztecProtocol#22885)
END_COMMIT_OVERRIDE
AztecBot added a commit that referenced this pull request May 5, 2026
BEGIN_COMMIT_OVERRIDE
docs: add map and state variable docs  (#22824)
fix: e2e compat should not fail for contracts added after legacy stables
(#22900)
chore: fix kv-store browser tests hangs (#22721)
feat: kv-store sqlite backend with page level encryption (#22759)
fix: install node 22 for aztec-cli acceptance test (#22917)
feat: backport kv-store sqlite encryption (#22759) to v4-next (#22927)
fix(docs): correct llms.txt links for versioned developer docs (#22819)
feat(docs): improve discoverability of Aztec.nr API reference docs
(#22861)
feat(docs): backport improve discoverability of Aztec.nr API reference
docs (#22861) to v4-next (#22931)
feat(aztec-nr): add call_self stubs for utility functions (#22885)
docs: add map and state variable docs (backport #22824) (#22880)
refactor: `getPackageVersion` fn cleanup (#22941)
fix(ci): skip acceptance test for canary -commit. tags (#22951)
fix: closing db, correct stub side effects (#22939)
END_COMMIT_OVERRIDE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants