fix(epoch-cache): use finalized L1 block and correct lag for committee guard #22153

Closed
spalladino wants to merge 12 commits into merge-train/spartan from palla/fix/epoch-cache-finalized-guard

@spalladino
Contributor

Motivation

The computeCommittee guard in epoch-cache had two bugs: it used lagInEpochsForValidatorSet (the looser constraint) instead of lagInEpochsForRandao (the binding one), and it queried the latest L1 block instead of finalized. This meant an L1 reorg could change the RANDAO seed for a committee we'd already cached, and when the two lag values differed the guard was less strict than the L1 contract.

Approach

Switch the guard to use the finalized block tag and lagInEpochsForRandao. Compute the sampling timestamp from the epoch start (not the individual slot timestamp) to match L1 contract logic. Introduce typed error classes (EpochNotFinalizedError, EpochNotStableError) so callers can distinguish between "not yet finalized" and "not yet stable on L1". Extract types and errors into separate files.
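
As a rough illustration of the guard described above, the sketch below derives the sampling timestamp from the epoch start and compares it with the timestamp of the finalized L1 block. Only `lagInEpochsForRandao`, `getStartTimestampForEpoch` and `EpochNotFinalizedError` are names taken from this PR; the other field names, the exact timestamp formula, the comparison operator and the error's constructor are assumptions made for illustration, not the actual epoch-cache code.

```typescript
// Hypothetical constants shape; only lagInEpochsForRandao is named in the PR.
interface EpochConstantsSketch {
  l1GenesisTime: number; // timestamp of L2 slot 0, in seconds (assumed field)
  slotDuration: number; // seconds per L2 slot (assumed field)
  epochDuration: number; // L2 slots per epoch (assumed field)
  lagInEpochsForRandao: number;
}

// Error name from the PR; the constructor signature is an assumption.
class EpochNotFinalizedError extends Error {
  readonly epoch: number;
  constructor(epoch: number, samplingTs: number, finalizedTs: number) {
    super(`Epoch ${epoch}: sampling timestamp ${samplingTs} is after finalized L1 timestamp ${finalizedTs}`);
    this.name = 'EpochNotFinalizedError';
    this.epoch = epoch;
    // Keep instanceof working even when compiling to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Timestamp of the first slot of the epoch.
function getStartTimestampForEpoch(epoch: number, c: EpochConstantsSketch): number {
  return c.l1GenesisTime + epoch * c.epochDuration * c.slotDuration;
}

// Assumed formula: step back lagInEpochsForRandao whole epochs from the epoch start.
function getSamplingTimestamp(epoch: number, c: EpochConstantsSketch): number {
  return getStartTimestampForEpoch(epoch, c) - c.lagInEpochsForRandao * c.epochDuration * c.slotDuration;
}

// Throws unless the finalized L1 block is at or past the sampling timestamp.
function assertSamplingPointFinalized(epoch: number, finalizedBlockTimestamp: number, c: EpochConstantsSketch): void {
  const samplingTs = getSamplingTimestamp(epoch, c);
  if (finalizedBlockTimestamp < samplingTs) {
    throw new EpochNotFinalizedError(epoch, samplingTs, finalizedBlockTimestamp);
  }
}
```

Because the sampling point is computed from the epoch start, every slot in the same epoch yields the same sampling timestamp, which is what lets a single cached committee serve the whole epoch.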

Changes

  • epoch-cache: Fix computeCommittee guard to use lagInEpochsForRandao, finalized block tag, and epoch-start-based sampling timestamp. Add EpochCacheConstants type and getEpochCacheConstants() accessor. Extract errors to errors.ts and types/interfaces to types.ts.
  • epoch-cache (tests): Use different lag values (lagInEpochsForValidatorSet=2, lagInEpochsForRandao=1) to exercise the fix. Add unit test for EpochNotStableError wrapping. Add integration tests against real Anvil: happy path (committee, caching, proposer selection) and two guard tests that independently trigger each error class.
  • epoch-cache (docs): Rewrite README with committee computation, LAG values, RANDAO, proposer selection, escape hatch, finalized block guard, and caching strategy.

Fixes A-680

@spalladino added the ci-no-fail-fast (sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) and backport-to-v4-next labels Mar 30, 2026
…e guard

The computeCommittee guard was using lagInEpochsForValidatorSet (the looser
constraint) instead of lagInEpochsForRandao (the binding constraint), and
queried the latest L1 block instead of the finalized one. This could allow
caching a committee whose RANDAO seed is not yet finalized on L1.

Fixes the guard to use lagInEpochsForRandao and the finalized block tag,
computes sampling timestamp from epoch start (not slot timestamp), introduces
EpochNotFinalizedError and EpochNotStableError, and adds integration tests
against a real Anvil instance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@spalladino force-pushed the palla/fix/epoch-cache-finalized-guard branch from eb6dcfa to c362c0e on March 30, 2026 17:19
spalladino and others added 4 commits March 30, 2026 14:24
Increases the Anvil slotsInAnEpoch from 1 to 8 so finalized = latest - 16
blocks, making tests less likely to pass due to off-by-one near the
finality boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ests

Only interval mining needs to be stopped/restored to control the gap
between latest and finalized blocks. Automine is not used since Anvil
is started with l1BlockTime (interval mining).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…llup cheat codes

Adds mineUntilTimestamp which mines real L1 blocks (via hardhat_mine with
a timestamp interval) so finalized block timestamps advance alongside latest.
This prevents epoch-cache's finalized guard from rejecting committees after
time advances in tests.

The method derives the block interval from the last two block timestamps
(to handle anvil_setBlockTimestampInterval overrides), stops interval mining
before the burst, and leaves it stopped so the caller controls when to resume.

Updates rollup cheat codes (advanceToEpoch, advanceToNextEpoch, advanceToNextSlot,
advanceSlots) to use mineUntilTimestamp with automatic interval restore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@spalladino force-pushed the palla/fix/epoch-cache-finalized-guard branch from a013a5b to 97da230 on March 30, 2026 21:54
spalladino and others added 2 commits March 30, 2026 19:15
…finalized block

warpL2TimeAtLeastTo used warp (single block jump), causing the finalized
L1 block to lag behind after large time jumps. This triggered
EpochNotFinalizedError in the epoch cache, blocking the sequencer from
building blocks after the warp.

Switches to mineUntilTimestamp which mines real blocks at the ethereum
slot interval so finalized advances alongside latest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eouts

For large time jumps (e.g., 1 day in crossTimestampOfChange), mining at
the ethereum slot interval (12s) would require thousands of blocks,
causing Anvil to time out. Caps at ~100 blocks and spreads the interval
to cover the full jump.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
spalladino and others added 3 commits March 31, 2026 09:43
…eouts

For large time jumps (e.g., 1 day in crossTimestampOfChange), mining at
the ethereum slot interval (12s) would require thousands of blocks,
causing Anvil to time out. Caps at ~1000 blocks and spreads the interval
to cover the full jump.

Also fixes mineUntilTimestamp to use evm_setNextBlockTimestamp + evm_mine
per block instead of hardhat_mine, because Anvil's hardhat_mine ignores
the interval parameter when anvil_setBlockTimestampInterval has been set.
Adds a unit test validating this workaround.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
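
The capping and per-block mining described in the commit message above can be sketched roughly as follows. The function names, the `RpcSend` type and the even-spreading formula are hypothetical; only the `evm_setNextBlockTimestamp` and `evm_mine` RPC methods and the cap of about 1000 blocks come from the commit message. This is a minimal sketch, not the cheat-code implementation.

```typescript
// Plans the timestamps of the blocks to mine so that the last block lands
// exactly on targetTs. If mining at `interval` seconds would need more than
// maxBlocks blocks, the blocks are spread evenly across the whole gap instead.
function planBlockTimestamps(latestTs: number, targetTs: number, interval: number, maxBlocks = 1000): number[] {
  if (interval <= 0) {
    throw new Error('interval must be positive');
  }
  if (targetTs <= latestTs) {
    return [];
  }
  const gap = targetTs - latestTs;
  const blocks = Math.min(Math.ceil(gap / interval), maxBlocks);
  const timestamps: number[] = [];
  for (let i = 1; i <= blocks; i++) {
    timestamps.push(latestTs + Math.floor((gap * i) / blocks));
  }
  return timestamps;
}

// Hypothetical JSON-RPC sender, e.g. a thin wrapper over an Anvil HTTP client.
type RpcSend = (method: string, params: unknown[]) => Promise<unknown>;

// Mines one real block per planned timestamp instead of a single time warp,
// so the finalized block advances together with the latest block.
async function mineUntilTimestampSketch(
  send: RpcSend,
  latestTs: number,
  targetTs: number,
  interval: number,
): Promise<number> {
  const plan = planBlockTimestamps(latestTs, targetTs, interval);
  for (const ts of plan) {
    await send('evm_setNextBlockTimestamp', [ts]);
    await send('evm_mine', []);
  }
  return plan.length;
}
```

For a one-day jump at a 12s interval, the uncapped plan would need 7200 blocks; with the cap it mines 1000 blocks spaced roughly 86s apart.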
…to catch up

Anvil computes finalized = latest - slotsInAnEpoch * 2 blocks. When
querying the committee for the next epoch right after advancing, the
sampling timestamp can be beyond the finalized block. Mine 3 extra
blocks past the target so finalized also advances past it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When querying the next epoch's committee after advancing, the finalized
block may not have caught up to the sampling timestamp yet (Anvil
computes finalized = latest - slotsInAnEpoch * 2). Catch the error
and mine extra blocks to push finalized forward before retrying.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@spalladino requested a review from a team as a code owner March 31, 2026 18:29
Anvil defaults to slotsInAnEpoch=32 (Ethereum mainnet), which means
finalized = latest - 64 blocks (~768s behind). This causes the epoch
cache finalized-block guard to reject all committee queries in test
environments. Setting slotsInAnEpoch=1 keeps finalized close to
latest (only 2 blocks behind).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
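
The finality arithmetic quoted in these commit messages (finalized = latest - slotsInAnEpoch * 2) can be checked with a small sketch. The helper names and the clamp at block 0 are assumptions for illustration; the 12s block time matches the Ethereum slot interval mentioned earlier.

```typescript
// Finalized block number under the rule stated in the commit messages.
// Clamping at 0 for short chains is an assumption of this sketch.
function anvilFinalizedBlockSketch(latestBlock: number, slotsInAnEpoch: number): number {
  return Math.max(0, latestBlock - slotsInAnEpoch * 2);
}

// How far, in seconds, the finalized block trails the latest block.
function finalityLagSeconds(slotsInAnEpoch: number, l1BlockTimeSeconds: number): number {
  return slotsInAnEpoch * 2 * l1BlockTimeSeconds;
}

const mainnetLikeLag = finalityLagSeconds(32, 12); // 64 blocks behind
const testLag = finalityLagSeconds(1, 12); // 2 blocks behind
```

With `slotsInAnEpoch=32` the lag is 64 blocks (768s at 12s per block); with `slotsInAnEpoch=1` it drops to 2 blocks (24s).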
@spalladino force-pushed the palla/fix/epoch-cache-finalized-guard branch from 9d60565 to 89e2520 on March 31, 2026 18:56
The HA test relies on slow finalization to keep attestations in the
P2P pool long enough for verification. With --slots-in-an-epoch 1,
Anvil finalizes every block immediately, triggering aggressive pool
cleanup that deletes attestations before the test can read them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@spalladino marked this pull request as draft April 1, 2026 00:28
spalladino added a commit that referenced this pull request Apr 8, 2026
…d correct lag (#22204)

## Motivation

PR #22153 introduced a hard "finalized block guard" that refuses to
compute committees if L1 data isn't finalized. While the safety goal is
valid (preventing L1 reorgs from invalidating cached committees), it
breaks many tests that don't properly set L1 finalized time and would
cause the chain to stall if L1 stops finalizing. This PR takes a
different approach that preserves safety while maintaining liveness.

Also fixes the lag parameter: the old code used
`lagInEpochsForValidatorSet` (the looser constraint) instead of
`lagInEpochsForRandao` (the binding one), and computed the sampling
timestamp from the slot rather than the epoch start.

Fixes A-680

## Approach

Instead of refusing to serve committee data that isn't finalized, use a
TTL-based cache: finalized entries are cached permanently, non-finalized
entries expire after one Ethereum slot (12s) and get re-fetched from L1.
The cache map stores both resolved entries and in-flight promises
directly, so concurrent callers for the same epoch coalesce on a single
L1 query. On fetch failure, the previous stale entry is restored so the
next caller retries cleanly.
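
A minimal sketch of the caching strategy described above, assuming a generic fetcher that reports whether its result came from finalized L1 data. The class name, the fetcher signature and the injected clock are hypothetical and chosen for illustration; the real `CachedEpochEntry` also carries L1 block provenance metadata, which is omitted here.

```typescript
interface CachedEntrySketch<T> {
  value: T;
  finalized: boolean;
  fetchedAtMs: number;
}

type FetcherSketch<T> = (epoch: number) => Promise<{ value: T; finalized: boolean }>;

class TtlEpochCacheSketch<T> {
  // Holds resolved entries and in-flight promises in the same map.
  private readonly entries = new Map<number, CachedEntrySketch<T> | Promise<CachedEntrySketch<T>>>();
  private readonly fetcher: FetcherSketch<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(fetcher: FetcherSketch<T>, ttlMs = 12000, now: () => number = () => Date.now()) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(epoch: number): Promise<T> {
    const existing = this.entries.get(epoch);
    if (existing instanceof Promise) {
      // Another caller is already fetching this epoch: share its query.
      return (await existing).value;
    }
    if (existing && (existing.finalized || this.now() - existing.fetchedAtMs < this.ttlMs)) {
      // Finalized entries never expire; others are served until the TTL elapses.
      return existing.value;
    }
    const pending = this.fetcher(epoch).then(
      result => {
        const entry = { value: result.value, finalized: result.finalized, fetchedAtMs: this.now() };
        this.entries.set(epoch, entry);
        return entry;
      },
      err => {
        // On failure, restore the stale entry (or clear the slot) so the next caller retries.
        if (existing) {
          this.entries.set(epoch, existing);
        } else {
          this.entries.delete(epoch);
        }
        throw err;
      },
    );
    this.entries.set(epoch, pending);
    return (await pending).value;
  }
}
```

Storing the promise in the map before awaiting it is what makes concurrent callers coalesce: the second caller sees the in-flight promise rather than an empty slot.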

## Changes

- **epoch-cache**: Replaced the simple `Map<EpochNumber,
EpochCommitteeInfo>` cache with `Map<EpochNumber, CachedEpochEntry |
Promise<CachedEpochEntry>>`. Each resolved entry carries L1 block
provenance metadata (number, hash, timestamp) and a `finalized` flag.
Switched from `lagInEpochsForValidatorSet` to `lagInEpochsForRandao` and
compute sampling timestamp from epoch start via
`getStartTimestampForEpoch`. Simplified `isEscapeHatchOpen` to delegate
cache management to `getCommittee`.
- **epoch-cache (tests)**: Updated unit tests for the new cache
structure. Added 4 new TTL tests: re-query after TTL, no re-query for
finalized, concurrent coalescing, eventual finalization promotion.
- **epoch-cache (integration tests)**: New integration test suite
against real Anvil with deployed L1 contracts and 4 validators. Tests
finalized committee retrieval, non-finalized TTL refresh, and cache
re-fetch after L1 reorg.
- **epoch-cache (README)**: Added comprehensive documentation covering
committee computation, LAG values, RANDAO seed, proposer selection,
escape hatch, TTL caching with finalization tracking, and configuration.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@spalladino
Contributor Author

Closing in favor of #22204

@spalladino closed this Apr 8, 2026