
refactor(prover): split ProvingOrchestrator into sub-tree + top-tree; handle reorg-after-finalization#22915

Open
PhilWindle wants to merge 12 commits into phil/a-955-optimistic-proving-checkpoint-driven-trigger from phil/proving-orchestrator-split

Conversation


@PhilWindle commented May 3, 2026

Stacks on #22783. Refactors the prover stack so a reorg that lands after finalisation has started can be handled cleanly, by splitting today's ProvingOrchestrator along the natural state-coupling boundary: per-checkpoint block-level work vs. epoch-level top-tree work.

The original review on #22783 surfaced four issues that all reduced to one cause — EpochProvingState mixed per-checkpoint state with epoch-level state derived from the whole set of checkpoints (previousOutHashHint chain, blob-accumulator chain, totalNumCheckpoints, finalBlobBatchingChallenges). Splitting the orchestrator at that boundary makes each sub-tree self-contained and confines the cross-checkpoint chain to the lifetime of the top-tree run.

Plan and design notes: prover-stack-split-plan.md on this branch.

Architecture

EpochProvingJob (prover-node)
├── Map<checkpointIndex, CheckpointSubTreeOrchestrator>
│       per-checkpoint, owns its own world-state forks, drives
│       chonk-verifier / base / merge / block-root / parity.
│       Result:  Promise<{ blockProofOutputs, previousArchiveSiblingPath }>
│
├── EpochProvingContext (per-epoch shared chonk-verifier cache)
│       Survives sub-tree cancellation: chonk proof for a tx whose
│       checkpoint is reorged out and re-appears in a replacement
│       can be reused.
│
└── TopTreeOrchestrator       (constructed at finalizeAndProve)
        Inputs from sub-tree promises + archiver:
          { blockProofs[],            // unawaited
            l2ToL1MsgsPerBlock[][],
            blobFields[],
            previousBlockHeader,
            previousArchiveSiblingPath } per checkpoint.
        Pre-computes hint chain immediately; pipelines each checkpoint's
        root rollup against its sub-tree's still-pending block proving.
        Drives  checkpoint-root → checkpoint-merge → root-rollup.
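
As a rough TypeScript sketch of that per-checkpoint hand-off (field names approximate the diagram above; `Fr`, `BlockHeader`, and `BlockProofOutputs` are hypothetical stand-ins for the real stdlib types):

```ts
// Hypothetical stand-ins for the real stdlib types.
type Fr = bigint;
interface BlockHeader { blockNumber: number }
interface BlockProofOutputs { proof: Uint8Array }

// Sketch of the data each sub-tree hands the top tree, per checkpoint.
interface CheckpointTopTreeData {
  blockProofs: Promise<BlockProofOutputs[]>; // unawaited — the top tree pipelines against it
  l2ToL1MsgsPerBlock: Fr[][];                // archiver-derivable, available immediately
  blobFields: Fr[];
  previousBlockHeader: BlockHeader;
  previousArchiveSiblingPath: Fr[];          // resolved synchronously by the sub-tree
}
```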

A single BrokerCircuitProverFacade per ProverClient, shared by every orchestrator in every concurrent epoch. This matters because the broker delivers each completed-job notification exactly once (drained on the first poll): multiple facades polling the same broker race, and the losers miss notifications until the 30s snapshot sync — far longer than the proof deadline for short epochs.
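
A minimal runnable model of that race, assuming nothing beyond the drain-on-first-poll behaviour described above:

```ts
// Minimal model of the drain-once queue: whichever facade polls first
// consumes every pending notification, including other facades' jobs.
class BrokerModel {
  private completedJobNotifications: string[] = [];
  complete(jobId: string) { this.completedJobNotifications.push(jobId); }
  getCompletedJobs(): string[] {
    const drained = this.completedJobNotifications;
    this.completedJobNotifications = []; // drained on the first poll
    return drained;
  }
}

const broker = new BrokerModel();
broker.complete('sub-tree-0/job-a');
broker.complete('top-tree/job-b');
console.log(broker.getCompletedJobs()); // facade #1 gets both notifications
console.log(broker.getCompletedJobs()); // facade #2 gets [] — stuck until snapshot sync
```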

What changes in user-visible behaviour

Before: removeCheckpoint was rejected once finalizationStarted was set. A late prune either had to be ignored (submitting a stale proof that L1 would reject) or had to fail the epoch.

After: removeCheckpoint is allowed at any point until the job reaches a terminal state. If the top tree is in flight when a removal lands, it is cancelled with TopTreeCancelledError; the finalizeAndProve loop catches it, recomputes finalBlobBatchingChallenges and the per-checkpoint hints from the survivors, and the next iteration submits a valid proof. If every checkpoint is removed mid-finalize, the loop throws and the job transitions to failed. Retries are bounded only by the existing this.deadline — there is no retry counter.
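
In sketch form — the names below follow this description, not necessarily the real EpochProvingJob code:

```ts
// Control-flow sketch only — an approximation of the restart loop described above.
class TopTreeCancelledError extends Error {}
interface CheckpointSubTree {}
interface TopTree { prove(): Promise<void> }

async function finalizeAndProveSketch(job: {
  deadlineReached(): boolean;
  survivingCheckpoints(): CheckpointSubTree[];
  buildTopTree(survivors: CheckpointSubTree[]): TopTree; // recomputes hints + blob challenges
}): Promise<void> {
  while (!job.deadlineReached()) {                 // bounded only by the existing deadline
    const survivors = job.survivingCheckpoints();
    if (survivors.length === 0) {
      throw new Error('all checkpoints removed mid-finalize'); // job transitions to failed
    }
    const topTree = job.buildTopTree(survivors);
    try {
      await topTree.prove();
      return;                                      // proof for the surviving set submitted
    } catch (err) {
      if (err instanceof TopTreeCancelledError) continue; // a removal landed — rebuild, retry
      throw err;
    }
  }
  throw new Error('deadline reached before a stable checkpoint set could be proven');
}
```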

The e2e test for this scenario (epochs_optimistic_proving.parallel.test.ts → "handles a reorg arriving while proving is in progress") used to rely on a storage cheat to simulate recovery; it now demonstrates the prover-node performing the recovery itself, asserting that the in-flight epoch is proven up to the surviving checkpoint range with no cheats.

Commits

| # | What |
| --- | --- |
| 1 | Add CheckpointSubTreeOrchestrator and TopTreeOrchestrator alongside the existing ProvingOrchestrator. EpochProver flow unchanged. |
| 2a | EpochProvingJob switches to the new orchestrators directly. New EpochProverFactory interface on ProverClient. Hooks added so e2e tests can interpose without monkey-patching internal classes. |
| 2b | Delete EpochProver, ServerEpochProver, and ProvingOrchestrator's package export. The class itself stays internal as the base for CheckpointSubTreeOrchestrator and the driver for the orchestrator_*.test.ts integration tests. |
| Fix | Single shared BrokerCircuitProverFacade per ProverClient (started at start(), stopped at stop()). Eliminates the 30s broker-notification race that the multi-facade design introduced. |
| 3 | Reorg-after-finalization restart loop in EpochProvingJob.finalizeAndProve. removeCheckpoint cancels the in-flight top tree; the catch arm rebuilds with the surviving set. |
| Fix | TopTreeOrchestrator.prove short-circuits with TopTreeCancelledError if cancel() was called before prove() ran (otherwise it would hang waiting on a completion promise that nothing would ever resolve). |
| Test rewrite | The reorg-during-proving e2e drops the storage cheat and asserts the in-flight epoch proves correctly. ProverNode.epochJobs is now protected so the test can poll the in-flight job's tracked-checkpoint count for a deterministic prune-observed signal. |
| 4 | EpochProvingContext hoists the chonk-verifier proof cache to per-epoch scope. Chonk-verifier broker jobs use the context's own AbortController list, so sub-tree cancellation does not abort them — a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof. |

Tests

  • 307 prover-client tests passing.
  • 90 prover-node tests passing (28 → 31 in epoch-proving-job.test.ts after adding the reorg-after-finalize describe block).
  • 7 new top-tree tests, 5 new sub-tree tests, 5 new EpochProvingContext tests.
  • Existing orchestrator_*.test.ts integration tests unchanged and passing.
  • E2e: epochs_optimistic_proving.parallel.test.ts rewritten for the new behaviour; the other reorg e2e tests (mid-epoch with replacement, mid-epoch without replacement, last-slot without replacement) are unmodified and continue to exercise the per-sub-tree path.

Test plan

  • yarn workspace @aztec/prover-client test — local pass.
  • yarn workspace @aztec/prover-node test — local pass.
  • yarn build + yarn lint clean.
  • Watch CI for e2e_prover/full (a tx-gathering flake hit once on this branch; not caused by the refactor — see analysis on the PR for details).

…reeOrchestrator

Adds the two new orchestrators that will replace the monolithic
ProvingOrchestrator in subsequent commits. ProvingOrchestrator stays
unchanged so the existing EpochProver flow keeps working.

CheckpointSubTreeOrchestrator extends ProvingOrchestrator and stops at
the checkpoint root rollup boundary, resolving a Promise<SubTreeResult>
once block-level proving completes.

TopTreeOrchestrator drives checkpoint-root through root rollup. Inputs
include per-checkpoint Promise<BlockProofs> so checkpoint root rollups
pipeline against in-flight sub-tree proving — out-hash and blob
accumulator hint chains are precomputed synchronously from
archiver-derivable data.

Also exposes getSubTreeOutputProofs / getLastArchiveSiblingPath on
CheckpointProvingState, makes ProvingOrchestrator's
checkAndEnqueueCheckpointRootRollup protected, and surfaces the
checkpoint object on TestContext.makeCheckpoint.

11 new unit tests; the existing 86 orchestrator tests still pass.
@PhilWindle added the ci-full and ci-no-fail-fast labels May 3, 2026
…rators directly

EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and
constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a
single EpochProver. The previous waitForAllCheckpointsReady step is gone — the
top tree starts as soon as every checkpoint is tracked and pipelines its
checkpoint root rollups against the still-pending sub-tree result promises.

Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a
prune) is now atomic and does not affect the others — the cross-checkpoint
state coupling that triggered Palla's review concerns on #22783 is contained
to the top-tree's lifetime.

Also:
- ProverClient implements a new EpochProverFactory interface with
  createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The
  legacy createEpochProver remains for the orchestrator_*.test.ts suite and
  is deleted in commit 2b.
- EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve,
  afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean
  patch surface — but the four affected tests migrate to spying on
  createTopTreeOrchestrator and patching prove(), which is the closer analog
  to the legacy finalizeEpoch patch.
- BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so
  the job can manage its lifecycle.
- CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the
  top-tree data assembly is synchronous (no awaiting block-level proving).

epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory
and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests
that used to spy on createEpochProver().finalizeEpoch are migrated to spy on
createTopTreeOrchestrator().orchestrator.prove.
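
A sketch of the EpochProvingJobHooks bag introduced above — only the three hook names come from this commit message; the signatures are assumptions:

```ts
// Assumed signatures; the real interface lives in prover-node and may differ.
interface EpochProvingJobHooks {
  beforeTopTreeProve?: () => void | Promise<void>;
  afterTopTreeProve?: () => void | Promise<void>;
  // If present, runs instead of the real prove; receives it so a test can gate or wrap it.
  topTreeProveOverride?: (realProve: () => Promise<void>) => Promise<void>;
}
```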
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from 142d0b2 to 19d3a3c May 3, 2026 18:03
…orting ProvingOrchestrator

Now that EpochProvingJob talks to CheckpointSubTreeOrchestrator and
TopTreeOrchestrator directly, the wrapper layer has no production users:

- Delete the EpochProver interface (stdlib/src/interfaces/epoch-prover.ts)
  and its re-export from interfaces/server.ts.
- Delete ServerEpochProver, the adapter that translated EpochProver calls
  onto ProvingOrchestrator + a broker facade.
- Drop createEpochProver from EpochProverManager and from ProverClient.
  ProverClient now exposes only the split factories.
- Drop ProvingOrchestrator from prover-client/orchestrator's package
  exports, and remove its `implements EpochProver` clause. The class file
  stays as the base for CheckpointSubTreeOrchestrator (which extends it)
  and as the single-class end-to-end driver used by orchestrator_*.test.ts;
  it is no longer reachable from outside the package.
- Switch test_context.ts to import ProvingOrchestrator via its relative
  module path (the orchestrator-internal test driver TestProvingOrchestrator
  still extends it).

All 301 prover-client tests, 87 prover-node tests, and 801 stdlib tests
still pass.
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from e4507a5 to 47f9709 May 3, 2026 19:02
…ors and epoch jobs

CI on the previous commit caught the symptom — finalize timed out at ~30s
while waiting for sub-tree results. Root cause: the broker maintains a
single global `completedJobNotifications` queue that is drained by the
first caller of `getCompletedJobs([])`. When multiple
`BrokerCircuitProverFacade` instances poll the same broker, the first one
to poll consumes every notification — including notifications for jobs the
others care about. The losers only catch up via the periodic 30-second
snapshot sync, which is far longer than the proof deadline for short
epochs.

Commit 2a turned this into a fast-path bug by accidentally creating N+1
facades per epoch (one per sub-tree, one for the top-tree). The same race
also exists across concurrent epoch jobs, so the right fix is one shared
facade for the whole prover-client lifetime, not just one per job.

- `ProverClient` now owns a single `BrokerCircuitProverFacade`, started in
  `start()` and stopped in `stop()`.
- `createCheckpointSubTreeOrchestrator()` and `createTopTreeOrchestrator()`
  no longer take a facade argument — they wire the orchestrator to the
  shared facade.
- `EpochProvingJob` no longer creates or manages a facade.
- The facade's job map deletes entries on resolve/reject, so memory growth
  is bounded by concurrent in-flight work, not by lifetime jobs.

All 28 epoch-proving-job tests, 87 prover-node tests, and 301 prover-client
tests still pass.
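
A sketch of the bounded job map described in the last bullet, assuming a promise-per-job bookkeeping pattern:

```ts
// Sketch of the bounded job map: entries are removed as soon as the job
// settles, so memory tracks in-flight work, not lifetime job count.
class FacadeJobMap<T> {
  private jobs = new Map<string, { resolve: (v: T) => void; reject: (e: Error) => void }>();

  track(id: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.jobs.set(id, {
        resolve: v => { this.jobs.delete(id); resolve(v); },
        reject: e => { this.jobs.delete(id); reject(e); },
      });
    });
  }
  settle(id: string, value: T) { this.jobs.get(id)?.resolve(value); }
  size() { return this.jobs.size; } // bounded by concurrent in-flight jobs
}
```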
@PhilWindle force-pushed the phil/proving-orchestrator-split branch from 47f9709 to 8d3d7c6 May 3, 2026 19:08
PhilWindle added 4 commits May 3, 2026 20:00
Previously, `removeCheckpoint` was a no-op once `finalizationStarted` was
set — a late prune (e.g. an L1 reorg) couldn't be reflected in the proof
that had already been kicked off, so the only options were "ignore the
prune and submit a stale proof" or "fail the epoch". Both bad.

Now `removeCheckpoint` is allowed at any point until the job reaches a
terminal state. If the top tree is in flight when a removal lands, the
removal cancels it (`cancel({ abortJobs: true })`); the catch arm in
`finalizeAndProve` recognises `TopTreeCancelledError`, drops the cancelled
top tree, and the surrounding loop rebuilds `CheckpointTopTreeData[]` and
the blob batching challenges from the surviving sub-trees and tries again.

The only bound on retries is the job's existing deadline. No retry counter:
a pathological reorg loop fails the epoch via the deadline path it would
have taken anyway, with one less knob to tune.

If every checkpoint is removed mid-finalize, the next loop iteration sees
`survivors === 0` and throws — the catch arm transitions the job to
`failed`, no proof is published.

Three new tests in `reorg-after-finalize`:
- `removeCheckpoint` after finalize-start cancels the top tree and the
  loop restarts with the surviving set; second prove is given the smaller
  count; epoch completes.
- A middle-of-the-list checkpoint pruned mid-prove; submitted proof carries
  the surviving from/to range.
- All checkpoints removed mid-finalize → state transitions to `failed`,
  nothing is published.

90 prover-node tests pass; 301 prover-client tests still pass.
…rove

The reorg-during-proving e2e test (`epochs_optimistic_proving.parallel.test.ts`,
"handles a reorg arriving while proving is in progress") gates `topTree.prove`
via a test patch and fires the L1 reorg while the gate is held. The
prover-node receives the L2BlockStream prune events and calls
`removeCheckpoint`, which after commit 3 cancels the in-flight top tree.

But the patch had not yet released the gate, so `cancel()` ran *before*
`prove()`. The previous code only set `this.cancelled = true` and then, when
prove eventually ran, the per-checkpoint `.then` handlers all bailed on the
flag and the completion promise never resolved — prove hung forever.

Fix: check `this.cancelled` at the top of `prove()` and short-circuit with
`TopTreeCancelledError` immediately. Adds a unit test that constructs a
top-tree, cancels it, then calls prove — expecting the immediate rejection.
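
In sketch form (a simplification of the fix, not the real class):

```ts
class TopTreeCancelledError extends Error {}

// Simplified: the real orchestrator carries much more state.
class TopTreeOrchestratorSketch {
  private cancelled = false;

  cancel() { this.cancelled = true; }

  async prove(): Promise<void> {
    // The fix: reject immediately if cancel() beat prove(), instead of waiting
    // on per-checkpoint .then handlers that will all bail on the flag.
    if (this.cancelled) {
      throw new TopTreeCancelledError('cancelled before prove() started');
    }
    // ... enqueue checkpoint-root rollups and await the completion promise ...
  }
}
```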

Before commit 3 the prover-node could not handle a reorg that removed a
checkpoint after `finalizationStarted` — the in-flight proof referenced a
checkpoint that no longer existed on L1, so its submission was rejected.
The test simulated recovery by writing the new `proven` pointer directly
into the rollup's `stf` storage slot, releasing the gate, and asserting
that a *subsequent* epoch eventually proved.

With commit 3 the prover-node now cancels the in-flight top tree when a
prune lands and rebuilds with the surviving checkpoints. The test should
demonstrate that correct recovery, not the storage-cheat workaround.

Changes:
- After firing the L1 reorg, poll the in-flight job's tracked-checkpoint
  count until it drops. This is the deterministic signal that the prover-
  node observed the prune and called `removeCheckpoint`, which cancelled
  the in-flight top tree. (Without this we'd race the L2BlockStream poll
  and risk top tree #1 starting its real prove before cancellation lands.)
- Drop the storage cheat, the post-cheat block-production resume, and the
  wait-for-next-epoch sequence.
- Assert the *in-flight* epoch is proven up to `afterReorgCheckpoint` —
  the surviving range — directly on L1, no cheats needed.
- Make `ProverNode.epochJobs` `protected` and expose it on `TestProverNode`
  so the test can poll per-job tracked counts.
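
A standalone sketch of that polling signal (the e2e suite has its own retry helpers; this illustrative version assumes none of them):

```ts
// Sketch of the deterministic prune-observed signal: poll the in-flight
// job's tracked-checkpoint count until it drops below its initial value.
async function waitForCheckpointDrop(
  getCount: () => number,
  initial: number,
  timeoutMs = 60_000,
  intervalMs = 200,
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    if (getCount() < initial) return; // prover-node observed the prune
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error('prune was never observed by the prover-node');
}
```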

Until now each `CheckpointSubTreeOrchestrator` carried its own chonk-verifier
proof cache (on the sub-tree's internal `EpochProvingState`). When a
checkpoint was reorged out and a replacement landed in the same epoch,
every public tx in the replacement re-proved its chonk circuit, even though
the proof had already been computed for the original.

Introduces `EpochProvingContext`: a small per-epoch holder for the cache
that:

- Submits chonk-verifier broker jobs through its own AbortController list
  (not the sub-tree's). Sub-tree cancellation (e.g. `removeCheckpoint`
  with `abortJobs: true`) does **not** abort context-owned chonk jobs, so
  a replacement sub-tree can pick up the cached promise.
- Self-cleans cache entries on rejection so a future caller can re-enqueue.
- Exposes `stop()` to abort every in-flight chonk job at job teardown.

Plumbing:

- New `EpochProverFactory.createEpochProvingContext()` returns a context
  wired to `ProverClient`'s shared broker facade. `EpochProvingJob`
  constructs one per epoch and passes it to every sub-tree it creates.
- `CheckpointSubTreeOrchestrator` accepts an optional context. When
  supplied, its overrides for `startChonkVerifierCircuits` and
  `getOrEnqueueChonkVerifier` route through `context.enqueue` /
  `context.getCached` instead of the inherited per-sub-tree path.
- The legacy `ProvingOrchestrator` (test-only) is unchanged: it continues
  to use `EpochProvingState.cachedChonkVerifierProofs`.

5 new unit tests on `EpochProvingContext` cover dedup, get-after-enqueue,
reject-then-retry, abort-on-stop, and post-stop enqueue. All 307
prover-client tests, 90 prover-node tests, and existing e2e build pass.
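
A sketch of the context under these constraints — cache keyed per tx, context-owned AbortControllers, self-cleaning on rejection. Method names and signatures here are guesses, not the real API:

```ts
// Sketch of the per-epoch chonk-proof cache with context-owned aborts.
class EpochProvingContextSketch {
  private cache = new Map<string, Promise<Uint8Array>>();
  private controllers = new Set<AbortController>();
  private stopped = false;

  getOrEnqueue(txHash: string, run: (signal: AbortSignal) => Promise<Uint8Array>): Promise<Uint8Array> {
    const cached = this.cache.get(txHash);
    if (cached) return cached; // a replacement checkpoint reuses the original's proof
    if (this.stopped) return Promise.reject(new Error('context stopped'));
    const ctl = new AbortController(); // owned here, not by the sub-tree
    this.controllers.add(ctl);
    const p = run(ctl.signal).finally(() => this.controllers.delete(ctl));
    // Self-clean on rejection so a future caller can re-enqueue.
    p.catch(() => this.cache.delete(txHash));
    this.cache.set(txHash, p);
    return p;
  }

  stop() { this.stopped = true; this.controllers.forEach(c => c.abort()); }
}
```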
@PhilWindle changed the title from "feat(prover-client): split ProvingOrchestrator into sub-tree + top-tree (commit 1)" to "refactor(prover): split ProvingOrchestrator into sub-tree + top-tree; handle reorg-after-finalization" May 3, 2026
PhilWindle added 2 commits May 6, 2026 10:00
…ry call

`CheckpointSubTreeOrchestrator` now requires an `EpochProvingContext`
(no fallback to a private chonk cache) and starts its single epoch in
the constructor by reading the epoch number from the context. A new
static `start(...)` factory does the construction plus the single
internal `startNewCheckpoint(0, ...)` and stops the orchestrator
cleanly if the start fails.

`ProverClient.createCheckpointSubTreeOrchestrator` becomes async and
takes the per-checkpoint args, replacing the old three-step dance of
factory + `startNewEpoch` + `startNewCheckpoint` in `EpochProvingJob`.
`createEpochProvingContext` now takes an `epochNumber`, so the per-call
`epochNumber` arg is dropped from `EpochProvingContext.enqueue`.
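
A sketch of that factory shape (argument types are placeholders):

```ts
// Sketch of the static start(...) factory: construct, start the single
// checkpoint, and stop cleanly if startup fails.
class CheckpointSubTreeOrchestratorSketch {
  private constructor(private readonly context: { epochNumber: bigint }) {}

  static async start(context: { epochNumber: bigint }, checkpointArgs: unknown) {
    const orchestrator = new CheckpointSubTreeOrchestratorSketch(context);
    try {
      await orchestrator.startNewCheckpoint(0, checkpointArgs);
      return orchestrator;
    } catch (err) {
      await orchestrator.stop(); // don't leak forks/jobs on a failed startup
      throw err;
    }
  }
  private async startNewCheckpoint(_index: number, _args: unknown) {}
  async stop() {}
}
```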

AztecBot commented May 6, 2026

Flakey Tests

🤖 says: This CI run detected 2 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/85037e802ad5ce17): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_mbps.pipeline.parallel.test.ts "pipelining builds blocks using slot plus 1 proposer and proves them" (272s) (code: 0) group:e2e-p2p-epoch-flakes
FLAKED (http://ci.aztec-labs.com/b6b54bb7d18ed549): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_high_tps_block_building.test.ts (343s) (code: 0) group:e2e-p2p-epoch-flakes

PhilWindle added 2 commits May 6, 2026 14:44
… enumeration API

Restructure `EpochProvingJob` as an orchestrator over a registry of self-contained
jobs:

  - `CheckpointJob` — identified by `(checkpoint number, slot)` so a stale orphan
    and a re-org replacement at the same number coexist in the parent's map without
    colliding. Owns its own register-time data, sub-tree, blockProofs resolvers,
    abort controller, and tx-processing loop. `cancel()` is idempotent and
    fire-and-forget; `whenDone()` resolves once `provideTxs` and the cancel-driven
    teardown have unwound.

  - `TopTreeJob` — built from a snapshot of `CheckpointJob`s. `start()` runs
    `topTree.prove(...)`; `cancel()` rejects with `TopTreeCancelledError`.
    Hooks (`beforeTopTreeProve` / `afterTopTreeProve` / `topTreeProveOverride`)
    forward to it from `EpochProvingJobHooks`.

  - `EpochProvingJob` — slimmed from ~1020 lines to ~530. The job is now a thin
    driver over `Map<string, CheckpointJob>` and a single `topTreeJob`. The old
    `CheckpointStatus` (pending/tracked), `addCheckpointPromise` synchronisation,
    `accumulatedTxs` / `accumulatedL1ToL2Messages`, and inline restart-loop are
    gone.

`EpochProvingJob`'s public API now reflects intent rather than registry internals.
Three intent-level methods replace the tracked/pending list pair the prover-node
had to enumerate:

  - `removeCheckpointsAfter(threshold): number` — bulk remove for prune
  - `getCheckpointCount(): number` — total registered (live, uncancelled)
  - `cancelPendingCheckpoints(): void` — drop registered jobs that never got txs

`registerPendingCheckpoint` / `addCheckpoint` are renamed to `registerCheckpoint` /
`provideTxs` to reflect that registration carries all data the top tree needs and
txs arrive later. `removeCheckpoint` is now synchronous and idempotent — the
`(number, slot)` identity means multiple removes for the same number don't need
the old "await addCheckpointPromise" serialisation.

Test coverage: 89 prover-node tests pass (down from 90 — two "doesn't finalize
while gathering" tests merged into one that asserts the new early-start invariant).
…estrator infra

`ProvingOrchestrator` and `TopTreeOrchestrator` had duplicated copies of the same
broker-job submission envelope (~80 lines apiece): the `pendingProvingJobs`
controller list, the `SerialQueue` lifecycle, the `cancel`/`stop` plumbing, and
the `deferredProving<T>(state, request, callback)` wrapper that drops obsolete
results and routes errors to `state.reject(...)`.

Lift these to a new abstract `ProvingScheduler` base class. The minimal state
contract is `ProvingStateLike { verifyState(): boolean; reject(reason: string):
void }`, which `EpochProvingState` / `CheckpointProvingState` /
`BlockProvingState` and `TopTreeProvingState` all satisfy. The base owns:

  - `pendingProvingJobs`, `deferredJobQueue`, `getNumPendingProvingJobs`
  - `resetSchedulerState(abortJobs)` — drain + recreate queue, optionally abort
    in-flight jobs (the per-call abort flag covers both parent's
    `cancelJobsOnStop` config and top-tree's `{abortJobs}` arg)
  - `stop()` — standard "grab old queue, cancelInternal, await drain"
  - `deferredProving<S, T>(state, request, callback, isCancelled?)` — unified
    submit envelope. The `isCancelled` predicate covers top-tree's `cancelled`
    flag; the parent uses the default `() => false` and relies on `verifyState`.

Subclasses define `cancelInternal()` for their own cleanup (closing world-state
forks for the parent, propagating cancel into the proving state for top-tree).

Net code reduction: ~120 lines across the two orchestrators. The merge / padding
/ root rollup methods stay subclass-specific — they depend on state-class
methods that aren't unified here.
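
A sketch of the lifted base, following the contract named above (bodies simplified; the real class also owns the SerialQueue):

```ts
// Minimal state contract shared by every proving-state class.
interface ProvingStateLike {
  verifyState(): boolean;
  reject(reason: string): void;
}

abstract class ProvingSchedulerSketch {
  protected pendingProvingJobs: AbortController[] = [];

  // Unified submit envelope: drops obsolete results, routes errors to state.reject.
  protected async deferredProving<S extends ProvingStateLike, T>(
    state: S,
    request: (signal: AbortSignal) => Promise<T>,
    callback: (result: T) => void,
    isCancelled: () => boolean = () => false, // covers top-tree's `cancelled` flag
  ): Promise<void> {
    const ctl = new AbortController();
    this.pendingProvingJobs.push(ctl);
    try {
      const result = await request(ctl.signal);
      if (!isCancelled() && state.verifyState()) callback(result);
    } catch (err) {
      state.reject(String(err));
    } finally {
      this.pendingProvingJobs = this.pendingProvingJobs.filter(c => c !== ctl);
    }
  }

  // Subclass-specific cleanup: close forks (parent) or propagate cancel (top-tree).
  protected abstract cancelInternal(): void;
}
```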

Labels

ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure
