refactor(prover): split ProvingOrchestrator into sub-tree + top-tree; handle reorg-after-finalization #22915
PhilWindle wants to merge 12 commits into `phil/a-955-optimistic-proving-checkpoint-driven-trigger`.
…reeOrchestrator

Adds the two new orchestrators that will replace the monolithic ProvingOrchestrator in subsequent commits. ProvingOrchestrator stays unchanged so the existing EpochProver flow keeps working.

CheckpointSubTreeOrchestrator extends ProvingOrchestrator and stops at the checkpoint root rollup boundary, resolving a `Promise<SubTreeResult>` once block-level proving completes. TopTreeOrchestrator drives checkpoint-root through root rollup. Inputs include a per-checkpoint `Promise<BlockProofs>` so checkpoint root rollups pipeline against in-flight sub-tree proving — out-hash and blob accumulator hint chains are precomputed synchronously from archiver-derivable data.

Also exposes getSubTreeOutputProofs / getLastArchiveSiblingPath on CheckpointProvingState, makes ProvingOrchestrator's checkAndEnqueueCheckpointRootRollup protected, and surfaces the checkpoint object on TestContext.makeCheckpoint.

11 new unit tests; the existing 86 orchestrator tests still pass.
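The "precomputed synchronously" point can be sketched as a simple fold: each checkpoint's "previous" hint is the running accumulation of the checkpoints before it, so the whole chain is derivable up front without awaiting any proving. This is a hedged illustration, not the real API; `CheckpointHint`, `computeHintChain`, and `combine` are hypothetical names.

```typescript
// Hypothetical shape of one entry in an out-hash hint chain.
interface CheckpointHint {
  previousOutHash: string; // accumulation of everything before this checkpoint
  outHash: string;         // this checkpoint's own out-hash
}

// Fold over archiver-derivable per-checkpoint out-hashes; no async work needed.
function computeHintChain(
  outHashes: string[],
  combine: (acc: string, next: string) => string,
): CheckpointHint[] {
  const hints: CheckpointHint[] = [];
  let acc = '';
  for (const outHash of outHashes) {
    hints.push({ previousOutHash: acc, outHash });
    acc = acc === '' ? outHash : combine(acc, outHash);
  }
  return hints;
}
```

Because the chain depends only on data the archiver already has, the top tree can hand each checkpoint root rollup its hints immediately, while the sub-tree proofs are still in flight.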
…rators directly

EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a single EpochProver. The previous waitForAllCheckpointsReady step is gone — the top tree starts as soon as every checkpoint is tracked and pipelines its checkpoint root rollups against the still-pending sub-tree result promises.

Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a prune) is now atomic and does not affect the others — the cross-checkpoint state coupling that triggered Palla's review concerns on #22783 is contained to the top-tree's lifetime.

Also:
- ProverClient implements a new EpochProverFactory interface with createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The legacy createEpochProver remains for the orchestrator_*.test.ts (deleted in commit 2b).
- EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve, afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean patch surface — but the four affected tests migrate to spying on createTopTreeOrchestrator and patching prove(), which is the closer analog to the legacy finalizeEpoch patch.
- BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so the job can manage its lifecycle.
- CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the top-tree data assembly is synchronous (no awaiting block-level proving).

epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests that used to spy on createEpochProver().finalizeEpoch are migrated to spy on createTopTreeOrchestrator().orchestrator.prove.
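The hooks bag pattern described above can be sketched as follows. This is an illustrative shape only; the real `EpochProvingJobHooks` interface may differ, and `runTopTreeProve` is a hypothetical stand-in for the call site inside `finalizeAndProve`.

```typescript
// Hypothetical hooks bag: every hook is optional, and an override replaces
// the real top-tree prove entirely (useful as an e2e test patch surface).
interface EpochProvingJobHooks<T> {
  beforeTopTreeProve?: () => void;
  afterTopTreeProve?: (result: T) => void;
  topTreeProveOverride?: () => Promise<T>;
}

async function runTopTreeProve<T>(
  proveFn: () => Promise<T>,
  hooks: EpochProvingJobHooks<T> = {},
): Promise<T> {
  hooks.beforeTopTreeProve?.();
  // If a test supplied an override, the real prove never runs.
  const result = await (hooks.topTreeProveOverride ? hooks.topTreeProveOverride() : proveFn());
  hooks.afterTopTreeProve?.(result);
  return result;
}
```

A test can then gate or replace the prove step without monkey-patching orchestrator internals.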
…orting ProvingOrchestrator

Now that EpochProvingJob talks to CheckpointSubTreeOrchestrator and TopTreeOrchestrator directly, the wrapper layer has no production users:
- Delete the EpochProver interface (stdlib/src/interfaces/epoch-prover.ts) and its re-export from interfaces/server.ts.
- Delete ServerEpochProver, the adapter that translated EpochProver calls onto ProvingOrchestrator + a broker facade.
- Drop createEpochProver from EpochProverManager and from ProverClient. ProverClient now exposes only the split factories.
- Drop ProvingOrchestrator from prover-client/orchestrator's package exports, and remove its `implements EpochProver` clause. The class file stays as the base for CheckpointSubTreeOrchestrator (which extends it) and as the single-class end-to-end driver used by orchestrator_*.test.ts; it is no longer reachable from outside the package.
- Switch test_context.ts to import ProvingOrchestrator via its relative module path (the orchestrator-internal test driver TestProvingOrchestrator still extends it).

All 301 prover-client tests, 87 prover-node tests, and 801 stdlib tests still pass.
…ors and epoch jobs

CI on the previous commit caught the symptom — finalize timed out at ~30s while waiting for sub-tree results. Root cause: the broker maintains a single global `completedJobNotifications` queue that is drained by the first caller of `getCompletedJobs([])`. When multiple `BrokerCircuitProverFacade` instances poll the same broker, the first one to poll consumes every notification — including notifications for jobs the others care about. The losers only catch up via the periodic 30-second snapshot sync, which is far longer than the proof deadline for short epochs.

Commit 2a turned this into a fast-path bug by accidentally creating N+1 facades per epoch (one per sub-tree, one for the top-tree). The same race also exists across concurrent epoch jobs, so the right fix is one shared facade for the whole prover-client lifetime, not just one per job.

- `ProverClient` now owns a single `BrokerCircuitProverFacade`, started in `start()` and stopped in `stop()`.
- `createCheckpointSubTreeOrchestrator()` and `createTopTreeOrchestrator()` no longer take a facade argument — they wire the orchestrator to the shared facade.
- `EpochProvingJob` no longer creates or manages a facade.
- The facade's job map deletes entries on resolve/reject, so memory growth is bounded by concurrent in-flight work, not by lifetime jobs.

All 28 epoch-proving-job tests, 87 prover-node tests, and 301 prover-client tests still pass.
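The notification race reduces to a very small mechanism, sketched below under stated assumptions: `BrokerSketch` is a toy model of the broker's single global notification queue, not the real class. Whichever facade polls first drains everything, including notifications for jobs it never submitted.

```typescript
// Toy model of the broker's global completed-job notification queue.
// Each notification is delivered exactly once: the first poller drains the
// queue, and every other poller sees nothing until the next snapshot sync.
class BrokerSketch {
  private completedJobNotifications: string[] = [];

  notifyCompleted(jobId: string): void {
    this.completedJobNotifications.push(jobId);
  }

  // Drain-on-first-poll semantics: this is why multiple facades racing on
  // one broker lose notifications, and why one shared facade is the fix.
  getCompletedJobs(): string[] {
    const drained = this.completedJobNotifications;
    this.completedJobNotifications = [];
    return drained;
  }
}
```

With a single shared facade there is exactly one poller, so every notification reaches the component that cares about it.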
Previously, `removeCheckpoint` was a no-op once `finalizationStarted` was
set — a late prune (e.g. an L1 reorg) couldn't be reflected in the proof
that had already been kicked off, so the only options were "ignore the
prune and submit a stale proof" or "fail the epoch". Both bad.
Now `removeCheckpoint` is allowed at any point until the job reaches a
terminal state. If the top tree is in flight when a removal lands, the
removal cancels it (`cancel({ abortJobs: true })`); the catch arm in
`finalizeAndProve` recognises `TopTreeCancelledError`, drops the cancelled
top tree, and the surrounding loop rebuilds `CheckpointTopTreeData[]` and
the blob batching challenges from the surviving sub-trees and tries again.
The only bound on retries is the job's existing deadline. No retry counter:
a pathological reorg loop fails the epoch via the deadline path it would
have taken anyway, with one less knob to tune.
If every checkpoint is removed mid-finalize, the next loop iteration sees
`survivors === 0` and throws — the catch arm transitions the job to
`failed`, no proof is published.
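The retry loop described above can be sketched as follows. `TopTreeCancelledError` is named in this commit; `finalizeWithRetries`, `getSurvivors`, and `proveTopTree` are hypothetical stand-ins for the real `finalizeAndProve` internals.

```typescript
// Error type from the commit; thrown when a removal cancels the top tree.
class TopTreeCancelledError extends Error {}

// Hedged sketch of the finalize retry loop: rebuild from the surviving
// checkpoints on each cancellation, bounded only by the job's deadline.
async function finalizeWithRetries(
  getSurvivors: () => number[],
  proveTopTree: (survivors: number[]) => Promise<string>,
  deadlineExceeded: () => boolean,
): Promise<string> {
  while (!deadlineExceeded()) {
    const survivors = getSurvivors();
    if (survivors.length === 0) {
      // Every checkpoint was removed mid-finalize: fail, publish nothing.
      throw new Error('all checkpoints removed mid-finalize');
    }
    try {
      return await proveTopTree(survivors);
    } catch (err) {
      if (err instanceof TopTreeCancelledError) {
        continue; // a removal cancelled the top tree; rebuild and retry
      }
      throw err;
    }
  }
  throw new Error('deadline exceeded');
}
```

A pathological reorg loop simply keeps taking the `continue` arm until the deadline check fails, which is the same terminal path the job already had.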
Three new tests in `reorg-after-finalize`:
- `removeCheckpoint` after finalize-start cancels the top tree and the
loop restarts with the surviving set; second prove is given the smaller
count; epoch completes.
- A middle-of-the-list checkpoint pruned mid-prove; submitted proof carries
the surviving from/to range.
- All checkpoints removed mid-finalize → state transitions to `failed`,
nothing is published.
90 prover-node tests pass; 301 prover-client tests still pass.
…rove

The reorg-during-proving e2e test (`epochs_optimistic_proving.parallel.test.ts`, "handles a reorg arriving while proving is in progress") gates `topTree.prove` via a test patch and fires the L1 reorg while the gate is held. The prover-node receives the L2BlockStream prune events and calls `removeCheckpoint`, which after commit 3 cancels the in-flight top tree. But the patch had not yet released the gate, so `cancel()` runs *before* `prove()`.

The previous code only set `this.cancelled = true` and then, when prove eventually ran, the per-checkpoint `.then` handlers all bailed on the flag and the completion promise never resolved — prove hung forever.

Fix: check `this.cancelled` at the top of `prove()` and short-circuit with `TopTreeCancelledError` immediately. Adds a unit test that constructs a top-tree, cancels it, then calls prove — expecting the immediate rejection.
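The fix is a one-line guard, sketched here in minimal form. `TopTreeCancelledError` is from the commit; `CancellableProver` is a hypothetical reduction of the top-tree orchestrator's prove path.

```typescript
class TopTreeCancelledError extends Error {}

// Minimal model of the cancel-before-prove fix: prove() checks the flag up
// front instead of relying on per-checkpoint handlers that would leave the
// completion promise unresolved forever.
class CancellableProver {
  private cancelled = false;

  cancel(): void {
    this.cancelled = true;
  }

  async prove(): Promise<string> {
    if (this.cancelled) {
      // Short-circuit: reject immediately rather than hang on a completion
      // promise that nothing will ever resolve.
      throw new TopTreeCancelledError('cancelled before prove()');
    }
    return 'proof';
  }
}
```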
Before commit 3 the prover-node could not handle a reorg that removed a checkpoint after `finalizationStarted` — the in-flight proof referenced a checkpoint that no longer existed on L1, so its submission was rejected. The test simulated recovery by writing the new `proven` pointer directly into the rollup's `stf` storage slot, releasing the gate, and asserting that a *subsequent* epoch eventually proved.

With commit 3 the prover-node now cancels the in-flight top tree when a prune lands and rebuilds with the surviving checkpoints. The test should demonstrate that correct recovery, not the storage-cheat workaround.

Changes:
- After firing the L1 reorg, poll the in-flight job's tracked-checkpoint count until it drops. This is the deterministic signal that the prover-node observed the prune and called `removeCheckpoint`, which cancelled the in-flight top tree. (Without this we'd race the L2BlockStream poll and risk top tree #1 starting its real prove before cancellation lands.)
- Drop the storage cheat, the post-cheat block-production resume, and the wait-for-next-epoch sequence.
- Assert the *in-flight* epoch is proven up to `afterReorgCheckpoint` — the surviving range — directly on L1, no cheats needed.
- Make `ProverNode.epochJobs` `protected` and expose it on `TestProverNode` so the test can poll per-job tracked counts.
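The deterministic-signal polling described above amounts to waiting for an observed count to drop before proceeding. A minimal sketch, assuming a generic `pollUntil` helper (hypothetical; the real test may use an existing retry utility):

```typescript
// Hypothetical polling helper: spin on a condition with a bounded timeout,
// so the test proceeds only after the prune has actually been observed.
async function pollUntil(
  condition: () => boolean,
  intervalMs = 10,
  timeoutMs = 2000,
): Promise<void> {
  const start = Date.now();
  while (!condition()) {
    if (Date.now() - start > timeoutMs) {
      throw new Error('pollUntil timed out');
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```

In the test's terms: fire the reorg, then `await pollUntil(() => job.trackedCheckpointCount < countBeforeReorg)` before releasing the prove gate, removing the race against the L2BlockStream poll.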
Until now each `CheckpointSubTreeOrchestrator` carried its own chonk-verifier proof cache (on the sub-tree's internal `EpochProvingState`). When a checkpoint was reorged out and a replacement landed in the same epoch, every public tx in the replacement re-proved its chonk circuit, even though the proof had already been computed for the original.

Introduces `EpochProvingContext`: a small per-epoch holder for the cache that:
- Submits chonk-verifier broker jobs through its own AbortController list (not the sub-tree's). Sub-tree cancellation (e.g. `removeCheckpoint` with `abortJobs: true`) does **not** abort context-owned chonk jobs, so a replacement sub-tree can pick up the cached promise.
- Self-cleans cache entries on rejection so a future caller can re-enqueue.
- Exposes `stop()` to abort every in-flight chonk job at job teardown.

Plumbing:
- New `EpochProverFactory.createEpochProvingContext()` returns a context wired to `ProverClient`'s shared broker facade. `EpochProvingJob` constructs one per epoch and passes it to every sub-tree it creates.
- `CheckpointSubTreeOrchestrator` accepts an optional context. When supplied, its overrides for `startChonkVerifierCircuits` and `getOrEnqueueChonkVerifier` route through `context.enqueue` / `context.getCached` instead of the inherited per-sub-tree path.
- The legacy `ProvingOrchestrator` (test-only) is unchanged: it continues to use `EpochProvingState.cachedChonkVerifierProofs`.

5 new unit tests on `EpochProvingContext` cover dedup, get-after-enqueue, reject-then-retry, abort-on-stop, and post-stop enqueue. All 307 prover-client tests, 90 prover-node tests, and existing e2e build pass.
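The cache semantics described above (dedup by key, self-clean on rejection, abort-all on stop) can be sketched generically. `ProofCacheSketch` is an illustrative shape, not the real `EpochProvingContext` API.

```typescript
// Hedged sketch of a per-epoch proof cache: dedup concurrent requests by
// key, drop entries on rejection so a later caller can retry, and abort
// every in-flight job on stop().
class ProofCacheSketch<T> {
  private cache = new Map<string, Promise<T>>();
  private controllers: AbortController[] = [];
  private stopped = false;

  getOrEnqueue(key: string, run: (signal: AbortSignal) => Promise<T>): Promise<T> {
    if (this.stopped) {
      return Promise.reject(new Error('context stopped'));
    }
    const cached = this.cache.get(key);
    if (cached) {
      return cached; // a replacement sub-tree picks up the cached promise
    }
    const controller = new AbortController(); // owned by the context, not the sub-tree
    this.controllers.push(controller);
    const promise = run(controller.signal);
    // Self-clean on rejection so a future caller can re-enqueue.
    promise.catch(() => this.cache.delete(key));
    this.cache.set(key, promise);
    return promise;
  }

  stop(): void {
    this.stopped = true;
    this.controllers.forEach(c => c.abort());
  }
}
```

Because the `AbortController`s belong to the cache holder rather than any one sub-tree, cancelling a sub-tree leaves the cached jobs running for whoever replaces it.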
…ry call

`CheckpointSubTreeOrchestrator` now requires an `EpochProvingContext` (no fallback to a private chonk cache) and starts its single epoch in the constructor by reading the epoch number from the context. A new static `start(...)` factory does the construction plus the single internal `startNewCheckpoint(0, ...)` and stops the orchestrator cleanly if the start fails.

`ProverClient.createCheckpointSubTreeOrchestrator` becomes async and takes the per-checkpoint args, replacing the old three-step dance of factory + `startNewEpoch` + `startNewCheckpoint` in `EpochProvingJob`. `createEpochProvingContext` now takes an `epochNumber`, which is also the per-call arg dropped from `EpochProvingContext.enqueue`.
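The static `start(...)` factory pattern mentioned above can be sketched as follows; `OrchestratorSketch` and `startFirstCheckpoint` are hypothetical names, not the real signatures.

```typescript
// Hedged sketch: construct + kick off the first checkpoint in one call,
// and stop cleanly on failure so a half-started instance is never leaked.
class OrchestratorSketch {
  stopped = false;

  private constructor(public readonly epochNumber: number) {}

  static async start(
    epochNumber: number,
    startFirstCheckpoint: (o: OrchestratorSketch) => Promise<void>,
  ): Promise<OrchestratorSketch> {
    const orchestrator = new OrchestratorSketch(epochNumber);
    try {
      await startFirstCheckpoint(orchestrator);
      return orchestrator;
    } catch (err) {
      await orchestrator.stop(); // clean up before surfacing the failure
      throw err;
    }
  }

  async stop(): Promise<void> {
    this.stopped = true;
  }
}
```

Making the constructor private forces every caller through the factory, which is what lets the "stop on failed start" invariant hold unconditionally.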
Flakey Tests 🤖 says: this CI run detected 2 tests that failed, but were tolerated due to a `.test_patterns.yml` entry.
… enumeration API
Restructure `EpochProvingJob` as an orchestrator over a registry of self-contained
jobs:
- `CheckpointJob` — identified by `(checkpoint number, slot)` so a stale orphan
and a re-org replacement at the same number coexist in the parent's map without
colliding. Owns its own register-time data, sub-tree, blockProofs resolvers,
abort controller, and tx-processing loop. `cancel()` is idempotent and
fire-and-forget; `whenDone()` resolves once `provideTxs` and the cancel-driven
teardown have unwound.
- `TopTreeJob` — built from a snapshot of `CheckpointJob`s. `start()` runs
`topTree.prove(...)`; `cancel()` rejects with `TopTreeCancelledError`.
Hooks (`beforeTopTreeProve` / `afterTopTreeProve` / `topTreeProveOverride`)
forward to it from `EpochProvingJobHooks`.
- `EpochProvingJob` — slimmed from ~1020 lines to ~530. The job is now a thin
driver over `Map<string, CheckpointJob>` and a single `topTreeJob`. The old
`CheckpointStatus` (pending/tracked), `addCheckpointPromise` synchronisation,
`accumulatedTxs` / `accumulatedL1ToL2Messages`, and inline restart-loop are
gone.
`EpochProvingJob`'s public API now reflects intent rather than registry internals.
Three intent-level methods replace the tracked/pending list pair the prover-node
had to enumerate:
- `removeCheckpointsAfter(threshold): number` — bulk remove for prune
- `getCheckpointCount(): number` — total registered (live, uncancelled)
- `cancelPendingCheckpoints(): void` — drop registered jobs that never got txs
`registerPendingCheckpoint` / `addCheckpoint` are renamed to `registerCheckpoint` /
`provideTxs` to reflect that registration carries all data the top tree needs and
txs arrive later. `removeCheckpoint` is now synchronous and idempotent — the
`(number, slot)` identity means multiple removes for the same number don't need
the old "await addCheckpointPromise" serialisation.
Test coverage: 89 prover-node tests pass (down from 90 — two "doesn't finalize
while gathering" tests merged into one that asserts the new early-start invariant).
…estrator infra
`ProvingOrchestrator` and `TopTreeOrchestrator` had duplicated copies of the same
broker-job submission envelope (~80 lines apiece): the `pendingProvingJobs`
controller list, the `SerialQueue` lifecycle, the `cancel`/`stop` plumbing, and
the `deferredProving<T>(state, request, callback)` wrapper that drops obsolete
results and routes errors to `state.reject(...)`.
Lift these to a new abstract `ProvingScheduler` base class. The minimal state
contract is `ProvingStateLike { verifyState(): boolean; reject(reason: string):
void }`, which both `EpochProvingState` / `CheckpointProvingState` /
`BlockProvingState` and `TopTreeProvingState` satisfy. The base owns:
- `pendingProvingJobs`, `deferredJobQueue`, `getNumPendingProvingJobs`
- `resetSchedulerState(abortJobs)` — drain + recreate queue, optionally abort
in-flight jobs (the per-call abort flag covers both parent's
`cancelJobsOnStop` config and top-tree's `{abortJobs}` arg)
- `stop()` — standard "grab old queue, cancelInternal, await drain"
- `deferredProving<S, T>(state, request, callback, isCancelled?)` — unified
submit envelope. The `isCancelled` predicate covers top-tree's `cancelled`
flag; the parent uses the default `() => false` and relies on `verifyState`.
Subclasses define `cancelInternal()` for their own cleanup (closing world-state
forks for the parent, propagating cancel into the proving state for top-tree).
Net code reduction: ~120 lines across the two orchestrators. The merge / padding
/ root rollup methods stay subclass-specific — they depend on state-class
methods that aren't unified here.
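The `ProvingStateLike` contract and the `deferredProving` envelope described above can be sketched as follows. The interface names come from the text; the base-class body is a hedged reduction, not the real `ProvingScheduler`.

```typescript
// Minimal state contract from the text: enough to drop obsolete results
// and route errors without knowing the concrete proving-state class.
interface ProvingStateLike {
  verifyState(): boolean;
  reject(reason: string): void;
}

// Hedged sketch of the shared scheduler base. Subclasses supply their own
// cancelInternal() cleanup (world-state forks, top-tree cancel, etc.).
abstract class ProvingSchedulerSketch {
  protected pendingProvingJobs: AbortController[] = [];

  getNumPendingProvingJobs(): number {
    return this.pendingProvingJobs.length;
  }

  // Unified submit envelope: run the request, drop obsolete results, and
  // route errors to state.reject(...). The isCancelled predicate covers a
  // subclass-level cancelled flag; the default relies on verifyState alone.
  protected async deferredProving<S extends ProvingStateLike, T>(
    state: S,
    request: () => Promise<T>,
    callback: (result: T) => void,
    isCancelled: () => boolean = () => false,
  ): Promise<void> {
    const controller = new AbortController();
    this.pendingProvingJobs.push(controller);
    try {
      const result = await request();
      if (state.verifyState() && !isCancelled() && !controller.signal.aborted) {
        callback(result); // still relevant: deliver it
      }
    } catch (err) {
      state.reject(String(err));
    } finally {
      this.pendingProvingJobs = this.pendingProvingJobs.filter(c => c !== controller);
    }
  }

  protected abstract cancelInternal(): void;
}
```

Any state class with `verifyState`/`reject` satisfies the contract, which is why the four existing state classes can share one envelope without a common base class of their own.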
Stacks on #22783. Refactors the prover stack so a reorg that lands after finalisation has started can be handled cleanly, by splitting today's `ProvingOrchestrator` along the natural state-coupling boundary: per-checkpoint block-level work vs. epoch-level top-tree work.

The original review on #22783 surfaced four issues that all reduced to one cause — `EpochProvingState` mixed per-checkpoint state with epoch-level state derived from the whole set of checkpoints (the `previousOutHashHint` chain, the blob-accumulator chain, `totalNumCheckpoints`, `finalBlobBatchingChallenges`). Splitting the orchestrator at that boundary makes each sub-tree self-contained and confines the cross-checkpoint chain to the lifetime of the top-tree run.

Plan and design notes: `prover-stack-split-plan.md` on this branch.

## Architecture
A single `BrokerCircuitProverFacade` per `ProverClient`, shared across every orchestrator in every concurrent epoch. The broker delivers each completed-job notification exactly once (drained on the first poll), so multiple facades polling the same broker race and lose notifications until the 30s snapshot sync — far longer than the proof deadline for short epochs.

## What changes in user-visible behaviour
**Before:** `removeCheckpoint` was rejected once `finalizationStarted` was set. A late prune had to be ignored (submitting a stale proof that L1 would reject) or fail the epoch.

**After:** `removeCheckpoint` is allowed at any point until the job reaches a terminal state. If the top tree is in flight when a removal lands, it is cancelled with `TopTreeCancelledError`; the `finalizeAndProve` loop catches it, recomputes `finalBlobBatchingChallenges` and the per-checkpoint hints from the survivors, and the next iteration submits a valid proof. If every checkpoint is removed mid-finalize, the loop throws and the job transitions to `failed`. Retries are bounded only by the existing `this.deadline` — there is no retry counter.

The e2e test for this scenario (`epochs_optimistic_proving.parallel.test.ts` → "handles a reorg arriving while proving is in progress") used to rely on a storage cheat to simulate recovery; it now demonstrates the prover-node performing the recovery itself, asserting the in-flight epoch is proven up to the surviving checkpoint range with no cheats.

## Commits
- Add `CheckpointSubTreeOrchestrator` and `TopTreeOrchestrator` alongside the existing `ProvingOrchestrator`; the `EpochProver` flow is unchanged.
- `EpochProvingJob` switches to the new orchestrators directly. New `EpochProverFactory` interface on `ProverClient`. Hooks added so e2e tests can interpose without monkey-patching internal classes.
- Remove `EpochProver`, `ServerEpochProver`, and `ProvingOrchestrator`'s package export. The class itself stays internal as the base for `CheckpointSubTreeOrchestrator` and the driver for the `orchestrator_*.test.ts` integration tests.
- One `BrokerCircuitProverFacade` per `ProverClient` (started at `start()`, stopped at `stop()`). Eliminates the 30s broker-notification race that the multi-facade design introduced.
- Reorg-after-finalization handling in `EpochProvingJob.finalizeAndProve`: `removeCheckpoint` cancels the in-flight top tree; the catch arm rebuilds with the surviving set.
- `TopTreeOrchestrator.prove` short-circuits with `TopTreeCancelledError` if `cancel()` was called before `prove()` ran (otherwise it would hang waiting on a completion promise that nothing would ever resolve).
- `ProverNode.epochJobs` is now `protected` so the test can poll the in-flight job's tracked-checkpoint count for a deterministic prune-observed signal.
- `EpochProvingContext` hoists the chonk-verifier proof cache to per-epoch scope. Chonk-verifier broker jobs use the context's own `AbortController` list, so sub-tree cancellation does not abort them — a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof.

## Tests
- `epoch-proving-job.test.ts`, after adding the reorg-after-finalize describe block.
- `EpochProvingContext` tests.
- `orchestrator_*.test.ts` integration tests unchanged and passing.
- `epochs_optimistic_proving.parallel.test.ts` rewritten for the new behaviour; the other reorg e2e tests (mid-epoch with replacement, mid-epoch without replacement, last-slot without replacement) are unmodified and continue to exercise the per-sub-tree path.

## Test plan
- `yarn workspace @aztec/prover-client test` — local pass.
- `yarn workspace @aztec/prover-node test` — local pass.
- `yarn build` + `yarn lint` clean.
- `e2e_prover/full` (a tx-gathering flake hit once on this branch; not caused by the refactor — see analysis on the PR for details).