fix(sandbox): tolerate unsealed inbox in simulation, drop pipelining from docs/playground#23315

Closed
spalladino wants to merge 2 commits into merge-train/spartan from spl/sandbox-build-empty

@spalladino
Contributor

Motivation

Two failure modes surfaced on the spartan merge train after PR #23277 enabled SEQ_ENABLE_PROPOSER_PIPELINING=true in the sandbox-based test composes. Both showed up in docs/examples/bootstrap.sh execute (e.g. http://ci.aztec-labs.com/7f325afea4f00b31):

1. aztecjs_advanced failed deterministically in AztecNodeService.simulatePublicCalls with L1ToL2MessagesNotReadyError. This is the simulator+inboxLag mismatch already marked with TODOs in e2e_bot.test.ts:39, e2e_fees/*.test.ts, and e2e_avm_simulator.test.ts.
2. example_swap was SIGTERM'd at the docs-compose 600s mark while polling getBlockNumber('proven'), because the local sandbox's proven tip only advances via the slow-path wall-clock warp once the chain goes idle.

Approach

Two commits. The first is the real bug-fix: AztecNodeService.simulatePublicCalls catches L1ToL2MessagesNotReadyError thrown when querying the not-yet-sealed next-checkpoint's L1→L2 messages, and simulates without those messages. Simulation becomes best-effort across checkpoint boundaries under pipelining; block production continues to use sealed messages as before. The second commit narrows the blast radius for the demo sandboxes: removes SEQ_ENABLE_PROPOSER_PIPELINING=true from docs/examples/ts/docker-compose.yml and playground/docker-compose.yml, drops example_swap from the default docs runner (matching the existing aave_bridge precedent), and bumps docs/examples/bootstrap.sh test_cmds TIMEOUT to 20m to match the bumps from #23275.

Pipelining coverage is retained where it actually exercises sequencer/watcher behaviour: yarn-project/end-to-end/scripts/docker-compose.yml (compose-routed e2e + cli-wallet flows) and aztec-up/test/{amm_flow,basic_install,bridge_and_claim}.sh. The proven-tip stall and re-enabling of example_swap are deferred to a follow-up that gives the sandbox a way to advance the proven tip without a continuous tx stream.

Changes

  • yarn-project/aztec-node (AztecNodeService.simulatePublicCalls): narrow try/catch on L1ToL2MessagesNotReadyError (matched by err.name); rethrow anything else.
  • docs/examples (compose, runner, test_cmds): drop pipelining env, drop example_swap from defaults, bump compose TIMEOUT to 20m.
  • playground (compose): drop pipelining env.

Codex reviewed both rounds of the design; the unsuccessful buildCheckpointIfEmpty + watcher-gate variant was abandoned after a confirmed cascade race / deadlock and reverted before commit.

`AztecNodeService.simulatePublicCalls` opens a fork of world state at
the latest proposed block and, when the next block would start a new
checkpoint, appends that checkpoint's L1->L2 messages to the fork's
message tree so the simulated tx sees them.

Under proposer pipelining with non-trivial `inboxLag`, the
next-checkpoint's messages are not yet sealed on L1 — the archiver's
message store throws `L1ToL2MessagesNotReadyError` when queried for an
in-progress checkpoint (see `message_store.ts:233`). This makes every
public-call simulation at a checkpoint boundary deterministically fail,
which is the issue tracked by the existing
`TODO(palla/pipelining): re-opt-in once public-call simulation handles
inboxLag` comments in `e2e_bot.test.ts`, `e2e_fees/*.test.ts`, and
`e2e_avm_simulator.test.ts`, and which surfaced as the
`aztecjs_advanced` failures on PR #23253's merge-queue run.

Catch the error by name (`L1ToL2MessagesNotReadyError`) and proceed
with no next-checkpoint messages. Simulation becomes best-effort across
checkpoint boundaries under pipelining: a tx that depends on a
not-yet-sealed message may simulate incorrectly, but block production
will use the real (sealed) messages when they are available. All other
errors continue to throw.
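
The catch-by-name behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `AztecNodeService`: the `MessageSource` type, the `getMessagesBestEffort` helper, and the local `L1ToL2MessagesNotReadyError` class are hypothetical stand-ins, and messages are modelled as plain strings.

```typescript
// Hypothetical stand-in for the archiver's error; only its `name` matters here.
class L1ToL2MessagesNotReadyError extends Error {
  constructor(checkpoint: number) {
    super(`L1 to L2 messages for checkpoint ${checkpoint} are not ready`);
    this.name = 'L1ToL2MessagesNotReadyError';
  }
}

// Hypothetical message lookup: resolves with the checkpoint's messages,
// or rejects if the checkpoint is still in progress.
type MessageSource = (checkpoint: number) => Promise<string[]>;

// Returns the next checkpoint's messages, or an empty list when they are
// not sealed yet. Any other error is rethrown unchanged.
async function getMessagesBestEffort(
  source: MessageSource,
  checkpoint: number,
): Promise<string[]> {
  try {
    return await source(checkpoint);
  } catch (err) {
    // Match on the error name rather than on the class identity.
    if (err instanceof Error && err.name === 'L1ToL2MessagesNotReadyError') {
      return [];
    }
    throw err;
  }
}
```

In this sketch an unsealed checkpoint yields an empty message list, so the simulation proceeds without those messages, while unrelated failures still propagate to the caller.
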
Two unrelated failure modes surfaced when PR #23277 enabled
`SEQ_ENABLE_PROPOSER_PIPELINING=true` on the docs-examples and
playground compose sandboxes:

1. `example_swap` polls `getBlockNumber('proven')` after the swap's
   final tx lands and the sandbox goes idle. Under pipelining the
   proven tip only catches up via the watcher's slow-path wall-clock
   warp (~72s/slot), so under merge-queue load the example can run
   past its timeout and be SIGTERM'd. See
   http://ci.aztec-labs.com/b08ac48286302949 (block 86).
2. `aztecjs_advanced` deterministically failed in
   `AztecNodeService.simulatePublicCalls` with
   `L1ToL2MessagesNotReadyError` — that's the simulator+inboxLag
   mismatch fixed in the preceding commit.

The simulator commit lands the actual bug-fix. This commit ships the
narrower workarounds for the docs/playground demo sandboxes:

- Remove `SEQ_ENABLE_PROPOSER_PIPELINING=true` from
  `docs/examples/ts/docker-compose.yml` and
  `playground/docker-compose.yml`. These are developer-facing demos,
  not pipelining test coverage; the real coverage lives in
  `yarn-project/end-to-end/scripts/docker-compose.yml` and the
  `aztec-up/test/*.sh` shell scripts, both untouched.
- Drop `example_swap` from the default docs runner list, matching the
  existing `aave_bridge` precedent, since the proven-tip stall is a
  sandbox-side limitation that needs a separate sequencer-team fix.
- Bump `docs/examples/bootstrap.sh` `test_cmds` TIMEOUT to 20m to
  match the compose/web3signer/ha bumps in #23275 — defense-in-depth
  against cumulative runtime growth, no longer the primary fix.

Re-enable `example_swap` in a follow-up once the sandbox can advance
the proven tip without a continuous tx stream.
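
For context on the proven-tip stall, a client that polls the proven block number can bound its own wait instead of relying on an outer SIGTERM. The following is only a sketch under stated assumptions: `waitForProven`, the `Clock` type, and the `getProven` callback (standing in for a call like `getBlockNumber('proven')`) are hypothetical and are not taken from the example scripts.

```typescript
// Injectable clock so the loop can be exercised without real waiting.
type Clock = {
  now: () => number;
  sleep: (ms: number) => Promise<void>;
};

// Polls `getProven` until it reaches `target`, or throws once another
// poll interval would overrun the deadline.
async function waitForProven(
  getProven: () => Promise<number>,
  target: number,
  timeoutMs: number,
  intervalMs: number,
  clock: Clock,
): Promise<number> {
  const deadline = clock.now() + timeoutMs;
  for (;;) {
    const proven = await getProven();
    if (proven >= target) {
      return proven;
    }
    if (clock.now() + intervalMs > deadline) {
      throw new Error(
        `proven tip at ${proven}, wanted ${target} within ${timeoutMs}ms`,
      );
    }
    await clock.sleep(intervalMs);
  }
}
```

A real caller would pass a clock backed by `Date.now` and `setTimeout`. With a slow-path warp of roughly 72s per slot, any timeout would need to allow for several slots before the proven tip can be expected to move.
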
@spalladino spalladino requested a review from a team as a code owner May 15, 2026 14:10
@spalladino spalladino added the ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure label May 15, 2026
@spalladino spalladino closed this May 15, 2026