test(e2e_ha_full): parallel HA peer node teardown with per-node deadline #23539
Merged
The `afterAll` hook stopped HA peer nodes serially with no per-node timeout, so a single `sequencer.stop()` that hangs (e.g. an L1 publish whose tx-timeout was computed on a test-warped clock) burns the entire 20-minute jest hook budget and gets the PR dequeued from the merge train.
PhilWindle approved these changes May 24, 2026
This was referenced May 24, 2026
PhilWindle pushed a commit that referenced this pull request May 24, 2026
Dequeued from merge-train/spartan again: <http://ci.aztec-labs.com/136431da99834194>.

The HA full suite keeps failing under proposer pipelining with shifting symptoms. In this run the dashboard log shows recurring `validator:proposal-handler Timed out waiting for block with archive matching checkpoint proposal` warnings (slot 98, 115, …) and an `Error building checkpoint at slot 127: already proposed block for slot 127 index 0` on HA-4, i.e. the 5 HA peers race on the same proposal. The bundled #23539 (parallel peer teardown) and #23524 (afterAll hook timeout) entries did not catch this run because jest's per-test summary was not reached within the dashboard log capture.

This PR adds a broad regex-only entry under `.test_patterns.yml` to flag any failure of `yarn-project/end-to-end/scripts/run_test.sh ha src/composed/ha/e2e_ha_full.test.ts` as a flake. Owner: @PaLLa, matching the existing pipelining-flavoured entries for this suite.

The intent is to unblock the merge queue while the HA pipelining stabilisation work continues; narrow the regex (or add a real fix) once the failure modes settle down.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/d394ef6145e749ff) · group: `slackbot`*
Why
`e2e_ha_full.test.ts` dequeued PR #23344 from the merge train (log): all 8 tests passed but the `afterAll` cleanup hook exceeded its 20-minute jest timeout. The hook stops 5 HA peer nodes serially and HA-2's `sequencer.stop()` blocked for ~23 minutes waiting on an in-flight L1 publish whose internal tx-timeout was computed on a test-warped `dateProvider` clock and never fired.

The deeper bug (publish doesn't honor `stop()`) is being fixed separately. This PR is the minimum change to keep one stuck node from killing the whole hook and the merge train.
What
Replace the serial `for` loop with `Promise.allSettled(... Promise.race([stop, 30s timeout]))`, so the `node.stop()` calls run concurrently.

The 30s deadline is comfortably above the ~5ms each healthy node took in the failing log, so this is purely a safety net; if it ever fires we want the explicit error in the log to point at the next investigation.
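For readers who want to see the pattern concretely, here is a minimal sketch of a parallel teardown with a per-node deadline. It is not the code from this PR: the `Stoppable` interface, the `withDeadline` and `stopAllNodes` helpers, and the `HA peer N` labels are hypothetical names chosen for illustration. Only `Promise.allSettled`, `Promise.race`, and the 30s default come from the description above.

```typescript
// Hypothetical shape of a node handle; the real HA peer type is not shown in this PR description.
interface Stoppable {
  stop(): Promise<void>;
}

// Rejects with a labelled error if `promise` has not settled within `ms` milliseconds.
function withDeadline<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} did not stop within ${ms}ms`)), ms);
  });
  // Clear the timer once the race settles so it cannot keep the process alive.
  return Promise.race([promise, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Stops all nodes concurrently and returns one message per node that failed or timed out.
async function stopAllNodes(nodes: Stoppable[], deadlineMs = 30_000): Promise<string[]> {
  const results = await Promise.allSettled(
    nodes.map((node, i) => withDeadline(node.stop(), deadlineMs, `HA peer ${i}`)),
  );
  const failures: string[] = [];
  for (const result of results) {
    if (result.status === 'rejected') {
      const reason: unknown = result.reason;
      failures.push(reason instanceof Error ? reason.message : String(reason));
    }
  }
  return failures;
}
```

Because `Promise.allSettled` never rejects, one stuck node cannot prevent the others from being stopped or reported. Note that `Promise.race` does not cancel the losing promise: a hung `stop()` keeps running in the background, and the deadline only stops the hook from waiting on it.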
Scope
Test-only change. No production code touched.
Created by claudebox · group: `slackbot`