
refactor(prover): split ProvingOrchestrator into sub-tree + top-tree; handle reorg-after-finalization#22915

Open
PhilWindle wants to merge 12 commits into phil/a-955-optimistic-proving-checkpoint-driven-trigger from phil/proving-orchestrator-split

Conversation


@PhilWindle commented May 3, 2026

Stacks on #22783. Refactors the prover stack so a reorg that lands after finalisation has started can be handled cleanly, by splitting today's ProvingOrchestrator along the natural state-coupling boundary: per-checkpoint block-level work vs. epoch-level top-tree work.

The original review on #22783 surfaced four issues that all reduced to one cause — EpochProvingState mixed per-checkpoint state with epoch-level state derived from the whole set of checkpoints (previousOutHashHint chain, blob-accumulator chain, totalNumCheckpoints, finalBlobBatchingChallenges). Splitting the orchestrator at that boundary makes each sub-tree self-contained and confines the cross-checkpoint chain to the lifetime of the top-tree run.

Plan and design notes: prover-stack-split-plan.md on this branch.

Architecture

EpochProvingJob (prover-node)
├── Map<checkpointIndex, CheckpointSubTreeOrchestrator>
│       per-checkpoint, owns its own world-state forks, drives
│       chonk-verifier / base / merge / block-root / parity.
│       Result:  Promise<{ blockProofOutputs, previousArchiveSiblingPath }>
│
├── EpochProvingContext (per-epoch shared chonk-verifier cache)
│       Survives sub-tree cancellation: chonk proof for a tx whose
│       checkpoint is reorged out and re-appears in a replacement
│       can be reused.
│
└── TopTreeOrchestrator       (constructed at finalizeAndProve)
        Inputs from sub-tree promises + archiver:
          { blockProofs[],            // unawaited
            l2ToL1MsgsPerBlock[][],
            blobFields[],
            previousBlockHeader,
            previousArchiveSiblingPath } per checkpoint.
        Pre-computes hint chain immediately; pipelines each checkpoint's
        root rollup against its sub-tree's still-pending block proving.
        Drives  checkpoint-root → checkpoint-merge → root-rollup.
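
As a rough TypeScript sketch of that per-checkpoint hand-off (field names approximate the diagram above; `Fr`, `BlockHeader`, and `BlockProofOutputs` are hypothetical stand-ins for the real stdlib types):

```ts
// Hypothetical stand-ins for the real stdlib types.
type Fr = bigint;
interface BlockHeader { blockNumber: number }
interface BlockProofOutputs { proof: Uint8Array }

// Sketch of the data each sub-tree hands the top tree, per checkpoint.
interface CheckpointTopTreeData {
  blockProofs: Promise<BlockProofOutputs[]>; // unawaited — the top tree pipelines against it
  l2ToL1MsgsPerBlock: Fr[][];                // archiver-derivable, available immediately
  blobFields: Fr[];
  previousBlockHeader: BlockHeader;
  previousArchiveSiblingPath: Fr[];          // resolved synchronously by the sub-tree
}
```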

A single BrokerCircuitProverFacade per ProverClient, shared by every orchestrator in every concurrent epoch. This matters because the broker delivers each completed-job notification exactly once (drained on the first poll): multiple facades polling the same broker race, and the losers miss notifications until the 30s snapshot sync — far longer than the proof deadline for short epochs.
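
A minimal runnable model of that race, assuming nothing beyond the drain-on-first-poll behaviour described above:

```ts
// Minimal model of the drain-once queue: whichever facade polls first
// consumes every pending notification, including other facades' jobs.
class BrokerModel {
  private completedJobNotifications: string[] = [];
  complete(jobId: string) { this.completedJobNotifications.push(jobId); }
  getCompletedJobs(): string[] {
    const drained = this.completedJobNotifications;
    this.completedJobNotifications = []; // drained on the first poll
    return drained;
  }
}

const broker = new BrokerModel();
broker.complete('sub-tree-0/job-a');
broker.complete('top-tree/job-b');
console.log(broker.getCompletedJobs()); // facade #1 gets both notifications
console.log(broker.getCompletedJobs()); // facade #2 gets [] — stuck until snapshot sync
```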

What changes in user-visible behaviour

Before: removeCheckpoint was rejected once finalizationStarted was set. A late prune either had to be ignored (submitting a stale proof that L1 would reject) or had to fail the epoch.

After: removeCheckpoint is allowed at any point until the job reaches a terminal state. If the top tree is in flight when a removal lands, it is cancelled with TopTreeCancelledError; the finalizeAndProve loop catches it, recomputes finalBlobBatchingChallenges and the per-checkpoint hints from the survivors, and the next iteration submits a valid proof. If every checkpoint is removed mid-finalize, the loop throws and the job transitions to failed. Retries are bounded only by the existing this.deadline — there is no retry counter.
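
In sketch form — the names below follow this description, not necessarily the real EpochProvingJob code:

```ts
// Control-flow sketch only — an approximation of the restart loop described above.
class TopTreeCancelledError extends Error {}
interface CheckpointSubTree {}
interface TopTree { prove(): Promise<void> }

async function finalizeAndProveSketch(job: {
  deadlineReached(): boolean;
  survivingCheckpoints(): CheckpointSubTree[];
  buildTopTree(survivors: CheckpointSubTree[]): TopTree; // recomputes hints + blob challenges
}): Promise<void> {
  while (!job.deadlineReached()) {                 // bounded only by the existing deadline
    const survivors = job.survivingCheckpoints();
    if (survivors.length === 0) {
      throw new Error('all checkpoints removed mid-finalize'); // job transitions to failed
    }
    const topTree = job.buildTopTree(survivors);
    try {
      await topTree.prove();
      return;                                      // proof for the surviving set submitted
    } catch (err) {
      if (err instanceof TopTreeCancelledError) continue; // a removal landed — rebuild, retry
      throw err;
    }
  }
  throw new Error('deadline reached before a stable checkpoint set could be proven');
}
```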

The e2e test for this scenario (epochs_optimistic_proving.parallel.test.ts → "handles a reorg arriving while proving is in progress") used to rely on a storage cheat to simulate recovery; it now demonstrates the prover-node performing the recovery itself, asserting that the in-flight epoch is proven up to the surviving checkpoint range with no cheats.

Commits

| # | What |
| --- | --- |
| 1 | Add CheckpointSubTreeOrchestrator and TopTreeOrchestrator alongside the existing ProvingOrchestrator. EpochProver flow unchanged. |
| 2a | EpochProvingJob switches to the new orchestrators directly. New EpochProverFactory interface on ProverClient. Hooks added so e2e tests can interpose without monkey-patching internal classes. |
| 2b | Delete EpochProver, ServerEpochProver, and ProvingOrchestrator's package export. The class itself stays internal as the base for CheckpointSubTreeOrchestrator and the driver for the orchestrator_*.test.ts integration tests. |
| Fix | Single shared BrokerCircuitProverFacade per ProverClient (started at start(), stopped at stop()). Eliminates the 30s broker-notification race that the multi-facade design introduced. |
| 3 | Reorg-after-finalization restart loop in EpochProvingJob.finalizeAndProve. removeCheckpoint cancels the in-flight top tree; the catch arm rebuilds with the surviving set. |
| Fix | TopTreeOrchestrator.prove short-circuits with TopTreeCancelledError if cancel() was called before prove() ran (otherwise it would hang waiting on a completion promise that nothing would ever resolve). |
| Test rewrite | The reorg-during-proving e2e drops the storage cheat and asserts the in-flight epoch proves correctly. ProverNode.epochJobs is now protected so the test can poll the in-flight job's tracked-checkpoint count for a deterministic prune-observed signal. |
| 4 | EpochProvingContext hoists the chonk-verifier proof cache to per-epoch scope. Chonk-verifier broker jobs use the context's own AbortController list, so sub-tree cancellation does not abort them — a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof. |

Tests

  • 307 prover-client tests passing.
  • 90 prover-node tests passing (28 → 31 in epoch-proving-job.test.ts after adding the reorg-after-finalize describe block).
  • 7 new top-tree tests, 5 new sub-tree tests, 5 new EpochProvingContext tests.
  • Existing orchestrator_*.test.ts integration tests unchanged and passing.
  • E2e: epochs_optimistic_proving.parallel.test.ts rewritten for the new behaviour; the other reorg e2e tests (mid-epoch with replacement, mid-epoch without replacement, last-slot without replacement) are unmodified and continue to exercise the per-sub-tree path.

Test plan

  • yarn workspace @aztec/prover-client test — local pass.
  • yarn workspace @aztec/prover-node test — local pass.
  • yarn build + yarn lint clean.
  • Watch CI for e2e_prover/full (a tx-gathering flake hit once on this branch; not caused by the refactor — see analysis on the PR for details).

…reeOrchestrator

Adds the two new orchestrators that will replace the monolithic
ProvingOrchestrator in subsequent commits. ProvingOrchestrator stays
unchanged so the existing EpochProver flow keeps working.

CheckpointSubTreeOrchestrator extends ProvingOrchestrator and stops at
the checkpoint root rollup boundary, resolving a Promise<SubTreeResult>
once block-level proving completes.

TopTreeOrchestrator drives checkpoint-root through root rollup. Inputs
include per-checkpoint Promise<BlockProofs> so checkpoint root rollups
pipeline against in-flight sub-tree proving — out-hash and blob
accumulator hint chains are precomputed synchronously from
archiver-derivable data.

Also exposes getSubTreeOutputProofs / getLastArchiveSiblingPath on
CheckpointProvingState, makes ProvingOrchestrator's
checkAndEnqueueCheckpointRootRollup protected, and surfaces the
checkpoint object on TestContext.makeCheckpoint.

11 new unit tests; the existing 86 orchestrator tests still pass.
@PhilWindle added the ci-full and ci-no-fail-fast labels May 3, 2026
…rators directly

EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and
constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a
single EpochProver. The previous waitForAllCheckpointsReady step is gone — the
top tree starts as soon as every checkpoint is tracked and pipelines its
checkpoint root rollups against the still-pending sub-tree result promises.

Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a
prune) is now atomic and does not affect the others — the cross-checkpoint
state coupling that triggered Palla's review concerns on #22783 is contained
to the top-tree's lifetime.

Also:
- ProverClient implements a new EpochProverFactory interface with
  createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The
  legacy createEpochProver remains for the orchestrator_*.test.ts suite and
  is deleted in commit 2b.
- EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve,
  afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean
  patch surface — but the four affected tests migrate to spying on
  createTopTreeOrchestrator and patching prove(), which is the closer analog
  to the legacy finalizeEpoch patch.
- BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so
  the job can manage its lifecycle.
- CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the
  top-tree data assembly is synchronous (no awaiting block-level proving).

epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory
and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests
that used to spy on createEpochProver().finalizeEpoch are migrated to spy on
createTopTreeOrchestrator().orchestrator.prove.
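
A sketch of the EpochProvingJobHooks bag introduced above — only the three hook names come from this commit message; the signatures are assumptions:

```ts
// Assumed signatures; the real interface lives in prover-node and may differ.
interface EpochProvingJobHooks {
  beforeTopTreeProve?: () => void | Promise<void>;
  afterTopTreeProve?: () => void | Promise<void>;
  // If present, runs instead of the real prove; receives it so a test can gate or wrap it.
  topTreeProveOverride?: (realProve: () => Promise<void>) => Promise<void>;
}
```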
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from 142d0b2 to 19d3a3c May 3, 2026 18:03
…orting ProvingOrchestrator

Now that EpochProvingJob talks to CheckpointSubTreeOrchestrator and
TopTreeOrchestrator directly, the wrapper layer has no production users:

- Delete the EpochProver interface (stdlib/src/interfaces/epoch-prover.ts)
  and its re-export from interfaces/server.ts.
- Delete ServerEpochProver, the adapter that translated EpochProver calls
  onto ProvingOrchestrator + a broker facade.
- Drop createEpochProver from EpochProverManager and from ProverClient.
  ProverClient now exposes only the split factories.
- Drop ProvingOrchestrator from prover-client/orchestrator's package
  exports, and remove its `implements EpochProver` clause. The class file
  stays as the base for CheckpointSubTreeOrchestrator (which extends it)
  and as the single-class end-to-end driver used by orchestrator_*.test.ts;
  it is no longer reachable from outside the package.
- Switch test_context.ts to import ProvingOrchestrator via its relative
  module path (the orchestrator-internal test driver TestProvingOrchestrator
  still extends it).

All 301 prover-client tests, 87 prover-node tests, and 801 stdlib tests
still pass.
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from e4507a5 to 47f9709 May 3, 2026 19:02
…ors and epoch jobs

CI on the previous commit caught the symptom — finalize timed out at ~30s
while waiting for sub-tree results. Root cause: the broker maintains a
single global `completedJobNotifications` queue that is drained by the
first caller of `getCompletedJobs([])`. When multiple
`BrokerCircuitProverFacade` instances poll the same broker, the first one
to poll consumes every notification — including notifications for jobs the
others care about. The losers only catch up via the periodic 30-second
snapshot sync, which is far longer than the proof deadline for short
epochs.

Commit 2a turned this into a fast-path bug by accidentally creating N+1
facades per epoch (one per sub-tree, one for the top-tree). The same race
also exists across concurrent epoch jobs, so the right fix is one shared
facade for the whole prover-client lifetime, not just one per job.

- `ProverClient` now owns a single `BrokerCircuitProverFacade`, started in
  `start()` and stopped in `stop()`.
- `createCheckpointSubTreeOrchestrator()` and `createTopTreeOrchestrator()`
  no longer take a facade argument — they wire the orchestrator to the
  shared facade.
- `EpochProvingJob` no longer creates or manages a facade.
- The facade's job map deletes entries on resolve/reject, so memory growth
  is bounded by concurrent in-flight work, not by lifetime jobs.

All 28 epoch-proving-job tests, 87 prover-node tests, and 301 prover-client
tests still pass.
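
A sketch of the bounded job map described in the last bullet, assuming a promise-per-job bookkeeping pattern:

```ts
// Sketch of the bounded job map: entries are removed as soon as the job
// settles, so memory tracks in-flight work, not lifetime job count.
class FacadeJobMap<T> {
  private jobs = new Map<string, { resolve: (v: T) => void; reject: (e: Error) => void }>();

  track(id: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.jobs.set(id, {
        resolve: v => { this.jobs.delete(id); resolve(v); },
        reject: e => { this.jobs.delete(id); reject(e); },
      });
    });
  }
  settle(id: string, value: T) { this.jobs.get(id)?.resolve(value); }
  size() { return this.jobs.size; } // bounded by concurrent in-flight jobs
}
```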
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from 47f9709 to 8d3d7c6 May 3, 2026 19:08
PhilWindle added 4 commits May 3, 2026 20:00
Previously, `removeCheckpoint` was a no-op once `finalizationStarted` was
set — a late prune (e.g. an L1 reorg) couldn't be reflected in the proof
that had already been kicked off, so the only options were "ignore the
prune and submit a stale proof" or "fail the epoch". Both bad.

Now `removeCheckpoint` is allowed at any point until the job reaches a
terminal state. If the top tree is in flight when a removal lands, the
removal cancels it (`cancel({ abortJobs: true })`); the catch arm in
`finalizeAndProve` recognises `TopTreeCancelledError`, drops the cancelled
top tree, and the surrounding loop rebuilds `CheckpointTopTreeData[]` and
the blob batching challenges from the surviving sub-trees and tries again.

The only bound on retries is the job's existing deadline. No retry counter:
a pathological reorg loop fails the epoch via the deadline path it would
have taken anyway, with one less knob to tune.

If every checkpoint is removed mid-finalize, the next loop iteration sees
`survivors === 0` and throws — the catch arm transitions the job to
`failed`, no proof is published.

Three new tests in `reorg-after-finalize`:
- `removeCheckpoint` after finalize-start cancels the top tree and the
  loop restarts with the surviving set; second prove is given the smaller
  count; epoch completes.
- A middle-of-the-list checkpoint pruned mid-prove; submitted proof carries
  the surviving from/to range.
- All checkpoints removed mid-finalize → state transitions to `failed`,
  nothing is published.

90 prover-node tests pass; 301 prover-client tests still pass.
…rove

The reorg-during-proving e2e test (`epochs_optimistic_proving.parallel.test.ts`,
"handles a reorg arriving while proving is in progress") gates `topTree.prove`
via a test patch and fires the L1 reorg while the gate is held. The
prover-node receives the L2BlockStream prune events and calls
`removeCheckpoint`, which after commit 3 cancels the in-flight top tree.

But the patch had not yet released the gate, so `cancel()` ran *before*
`prove()`. The previous code only set `this.cancelled = true` and then, when
prove eventually ran, the per-checkpoint `.then` handlers all bailed on the
flag and the completion promise never resolved — prove hung forever.

Fix: check `this.cancelled` at the top of `prove()` and short-circuit with
`TopTreeCancelledError` immediately. Adds a unit test that constructs a
top-tree, cancels it, then calls prove — expecting the immediate rejection.
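
In sketch form (a simplification of the fix, not the real class):

```ts
class TopTreeCancelledError extends Error {}

// Simplified: the real orchestrator carries much more state.
class TopTreeOrchestratorSketch {
  private cancelled = false;

  cancel() { this.cancelled = true; }

  async prove(): Promise<void> {
    // The fix: reject immediately if cancel() beat prove(), instead of waiting
    // on per-checkpoint .then handlers that will all bail on the flag.
    if (this.cancelled) {
      throw new TopTreeCancelledError('cancelled before prove() started');
    }
    // ... enqueue checkpoint-root rollups and await the completion promise ...
  }
}
```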

Before commit 3 the prover-node could not handle a reorg that removed a
checkpoint after `finalizationStarted` — the in-flight proof referenced a
checkpoint that no longer existed on L1, so its submission was rejected.
The test simulated recovery by writing the new `proven` pointer directly
into the rollup's `stf` storage slot, releasing the gate, and asserting
that a *subsequent* epoch eventually proved.

With commit 3 the prover-node now cancels the in-flight top tree when a
prune lands and rebuilds with the surviving checkpoints. The test should
demonstrate that correct recovery, not the storage-cheat workaround.

Changes:
- After firing the L1 reorg, poll the in-flight job's tracked-checkpoint
  count until it drops. This is the deterministic signal that the prover-
  node observed the prune and called `removeCheckpoint`, which cancelled
  the in-flight top tree. (Without this we'd race the L2BlockStream poll
  and risk top tree #1 starting its real prove before cancellation lands.)
- Drop the storage cheat, the post-cheat block-production resume, and the
  wait-for-next-epoch sequence.
- Assert the *in-flight* epoch is proven up to `afterReorgCheckpoint` —
  the surviving range — directly on L1, no cheats needed.
- Make `ProverNode.epochJobs` `protected` and expose it on `TestProverNode`
  so the test can poll per-job tracked counts.
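
A standalone sketch of that polling signal (the e2e suite has its own retry helpers; this illustrative version assumes none of them):

```ts
// Sketch of the deterministic prune-observed signal: poll the in-flight
// job's tracked-checkpoint count until it drops below its initial value.
async function waitForCheckpointDrop(
  getCount: () => number,
  initial: number,
  timeoutMs = 60_000,
  intervalMs = 200,
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (getCount() < initial) return; // prover-node observed the prune
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error('prune was never observed by the prover-node');
}
```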

Until now each `CheckpointSubTreeOrchestrator` carried its own chonk-verifier
proof cache (on the sub-tree's internal `EpochProvingState`). When a
checkpoint was reorged out and a replacement landed in the same epoch,
every public tx in the replacement re-proved its chonk circuit, even though
the proof had already been computed for the original.

Introduces `EpochProvingContext`: a small per-epoch holder for the cache
that:

- Submits chonk-verifier broker jobs through its own AbortController list
  (not the sub-tree's). Sub-tree cancellation (e.g. `removeCheckpoint`
  with `abortJobs: true`) does **not** abort context-owned chonk jobs, so
  a replacement sub-tree can pick up the cached promise.
- Self-cleans cache entries on rejection so a future caller can re-enqueue.
- Exposes `stop()` to abort every in-flight chonk job at job teardown.

Plumbing:

- New `EpochProverFactory.createEpochProvingContext()` returns a context
  wired to `ProverClient`'s shared broker facade. `EpochProvingJob`
  constructs one per epoch and passes it to every sub-tree it creates.
- `CheckpointSubTreeOrchestrator` accepts an optional context. When
  supplied, its overrides for `startChonkVerifierCircuits` and
  `getOrEnqueueChonkVerifier` route through `context.enqueue` /
  `context.getCached` instead of the inherited per-sub-tree path.
- The legacy `ProvingOrchestrator` (test-only) is unchanged: it continues
  to use `EpochProvingState.cachedChonkVerifierProofs`.

5 new unit tests on `EpochProvingContext` cover dedup, get-after-enqueue,
reject-then-retry, abort-on-stop, and post-stop enqueue. All 307
prover-client tests, 90 prover-node tests, and existing e2e build pass.
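
A sketch of the context under these constraints — cache keyed per tx, context-owned AbortControllers, self-cleaning on rejection. Method names and signatures here are guesses, not the real API:

```ts
// Sketch of the per-epoch chonk-proof cache with context-owned aborts.
class EpochProvingContextSketch {
  private cache = new Map<string, Promise<Uint8Array>>();
  private controllers = new Set<AbortController>();
  private stopped = false;

  getOrEnqueue(txHash: string, run: (signal: AbortSignal) => Promise<Uint8Array>): Promise<Uint8Array> {
    const cached = this.cache.get(txHash);
    if (cached) return cached; // a replacement checkpoint reuses the original's proof
    if (this.stopped) return Promise.reject(new Error('context stopped'));
    const ctl = new AbortController(); // owned here, not by the sub-tree
    this.controllers.add(ctl);
    const p = run(ctl.signal).finally(() => this.controllers.delete(ctl));
    // Self-clean on rejection so a future caller can re-enqueue.
    p.catch(() => this.cache.delete(txHash));
    this.cache.set(txHash, p);
    return p;
  }

  stop() { this.stopped = true; this.controllers.forEach(c => c.abort()); }
}
```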
@PhilWindle changed the title from "feat(prover-client): split ProvingOrchestrator into sub-tree + top-tree (commit 1)" to "refactor(prover): split ProvingOrchestrator into sub-tree + top-tree; handle reorg-after-finalization" May 3, 2026
PhilWindle added 2 commits May 6, 2026 10:00
…ry call

`CheckpointSubTreeOrchestrator` now requires an `EpochProvingContext`
(no fallback to a private chonk cache) and starts its single epoch in
the constructor by reading the epoch number from the context. A new
static `start(...)` factory does the construction plus the single
internal `startNewCheckpoint(0, ...)` and stops the orchestrator
cleanly if the start fails.

`ProverClient.createCheckpointSubTreeOrchestrator` becomes async and
takes the per-checkpoint args, replacing the old three-step dance of
factory + `startNewEpoch` + `startNewCheckpoint` in `EpochProvingJob`.
`createEpochProvingContext` now takes an `epochNumber`, so the per-call
`epochNumber` arg is dropped from `EpochProvingContext.enqueue`.
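
A sketch of that factory shape (argument types are placeholders):

```ts
// Sketch of the static start(...) factory: construct, start the single
// checkpoint, and stop cleanly if startup fails.
class CheckpointSubTreeOrchestratorSketch {
  private constructor(private readonly context: { epochNumber: bigint }) {}

  static async start(context: { epochNumber: bigint }, checkpointArgs: unknown) {
    const orchestrator = new CheckpointSubTreeOrchestratorSketch(context);
    try {
      await orchestrator.startNewCheckpoint(0, checkpointArgs);
      return orchestrator;
    } catch (err) {
      await orchestrator.stop(); // don't leak forks/jobs on a failed startup
      throw err;
    }
  }
  private async startNewCheckpoint(_index: number, _args: unknown) {}
  async stop() {}
}
```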

AztecBot commented May 6, 2026

Flakey Tests

🤖 says: This CI run detected 2 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/85037e802ad5ce17): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_mbps.pipeline.parallel.test.ts "pipelining builds blocks using slot plus 1 proposer and proves them" (272s) (code: 0) group:e2e-p2p-epoch-flakes
FLAKED (http://ci.aztec-labs.com/b6b54bb7d18ed549): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_high_tps_block_building.test.ts (343s) (code: 0) group:e2e-p2p-epoch-flakes

PhilWindle added 2 commits May 6, 2026 14:44
… enumeration API

Restructure `EpochProvingJob` as an orchestrator over a registry of self-contained
jobs:

  - `CheckpointJob` — identified by `(checkpoint number, slot)` so a stale orphan
    and a re-org replacement at the same number coexist in the parent's map without
    colliding. Owns its own register-time data, sub-tree, blockProofs resolvers,
    abort controller, and tx-processing loop. `cancel()` is idempotent and
    fire-and-forget; `whenDone()` resolves once `provideTxs` and the cancel-driven
    teardown have unwound.

  - `TopTreeJob` — built from a snapshot of `CheckpointJob`s. `start()` runs
    `topTree.prove(...)`; `cancel()` rejects with `TopTreeCancelledError`.
    Hooks (`beforeTopTreeProve` / `afterTopTreeProve` / `topTreeProveOverride`)
    forward to it from `EpochProvingJobHooks`.

  - `EpochProvingJob` — slimmed from ~1020 lines to ~530. The job is now a thin
    driver over `Map<string, CheckpointJob>` and a single `topTreeJob`. The old
    `CheckpointStatus` (pending/tracked), `addCheckpointPromise` synchronisation,
    `accumulatedTxs` / `accumulatedL1ToL2Messages`, and inline restart-loop are
    gone.

`EpochProvingJob`'s public API now reflects intent rather than registry internals.
Three intent-level methods replace the tracked/pending list pair the prover-node
had to enumerate:

  - `removeCheckpointsAfter(threshold): number` — bulk remove for prune
  - `getCheckpointCount(): number` — total registered (live, uncancelled)
  - `cancelPendingCheckpoints(): void` — drop registered jobs that never got txs

`registerPendingCheckpoint` / `addCheckpoint` are renamed to `registerCheckpoint` /
`provideTxs` to reflect that registration carries all data the top tree needs and
txs arrive later. `removeCheckpoint` is now synchronous and idempotent — the
`(number, slot)` identity means multiple removes for the same number don't need
the old "await addCheckpointPromise" serialisation.

Test coverage: 89 prover-node tests pass (down from 90 — two "doesn't finalize
while gathering" tests merged into one that asserts the new early-start invariant).
…estrator infra

`ProvingOrchestrator` and `TopTreeOrchestrator` had duplicated copies of the same
broker-job submission envelope (~80 lines apiece): the `pendingProvingJobs`
controller list, the `SerialQueue` lifecycle, the `cancel`/`stop` plumbing, and
the `deferredProving<T>(state, request, callback)` wrapper that drops obsolete
results and routes errors to `state.reject(...)`.

Lift these to a new abstract `ProvingScheduler` base class. The minimal state
contract is `ProvingStateLike { verifyState(): boolean; reject(reason: string):
void }`, which `EpochProvingState` / `CheckpointProvingState` /
`BlockProvingState` and `TopTreeProvingState` all satisfy. The base owns:

  - `pendingProvingJobs`, `deferredJobQueue`, `getNumPendingProvingJobs`
  - `resetSchedulerState(abortJobs)` — drain + recreate queue, optionally abort
    in-flight jobs (the per-call abort flag covers both parent's
    `cancelJobsOnStop` config and top-tree's `{abortJobs}` arg)
  - `stop()` — standard "grab old queue, cancelInternal, await drain"
  - `deferredProving<S, T>(state, request, callback, isCancelled?)` — unified
    submit envelope. The `isCancelled` predicate covers top-tree's `cancelled`
    flag; the parent uses the default `() => false` and relies on `verifyState`.

Subclasses define `cancelInternal()` for their own cleanup (closing world-state
forks for the parent, propagating cancel into the proving state for top-tree).

Net code reduction: ~120 lines across the two orchestrators. The merge / padding
/ root rollup methods stay subclass-specific — they depend on state-class
methods that aren't unified here.
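
A sketch of the lifted base, following the contract named above (bodies simplified; the real class also owns the SerialQueue):

```ts
// Minimal state contract shared by every proving-state class.
interface ProvingStateLike {
  verifyState(): boolean;
  reject(reason: string): void;
}

abstract class ProvingSchedulerSketch {
  protected pendingProvingJobs: AbortController[] = [];

  // Unified submit envelope: drops obsolete results, routes errors to state.reject.
  protected async deferredProving<S extends ProvingStateLike, T>(
    state: S,
    request: (signal: AbortSignal) => Promise<T>,
    callback: (result: T) => void,
    isCancelled: () => boolean = () => false, // covers top-tree's `cancelled` flag
  ): Promise<void> {
    const ctl = new AbortController();
    this.pendingProvingJobs.push(ctl);
    try {
      const result = await request(ctl.signal);
      if (!isCancelled() && state.verifyState()) callback(result);
    } catch (err) {
      state.reject(String(err));
    } finally {
      this.pendingProvingJobs = this.pendingProvingJobs.filter(c => c !== ctl);
    }
  }

  // Subclass-specific cleanup: close forks (parent) or propagate cancel (top-tree).
  protected abstract cancelInternal(): void;
}
```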

Labels

ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure
