Skip to content

feat(slasher): per-slot data-withholding watcher (A-523, A-525)#23116

Merged
PhilWindle merged 7 commits into
merge-train/spartanfrom
phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2
May 15, 2026
Merged

feat(slasher): per-slot data-withholding watcher (A-523, A-525)#23116
PhilWindle merged 7 commits into
merge-train/spartanfrom
phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2

Conversation

@PhilWindle
Copy link
Copy Markdown
Collaborator

Summary

Moves the data-withholding slash from the L1-prune path to a per-slot check at slotStart(checkpoint.slot + slashDataWithholdingToleranceSlots), and removes the now-unnecessary VALID_EPOCH_PRUNED offense and EpochPruneWatcher.

Per AZIP-7: validators are responsible for making tx data available, not for ensuring proofs land. The new DataWithholdingWatcher ticks at quarter-eth-slot cadence and, for each published checkpoint older than dataWithholdingToleranceSlots (default 3), probes the local mempool for the txs in the checkpoint's blocks. Missing txs trigger a DATA_WITHHOLDING slash for the validators who actually attested to that checkpoint.

Highlights

  • New DataWithholdingWatcher (yarn-project/slasher/src/watchers/data_withholding_watcher.ts) with full unit-test coverage. Sentinel-style tick + restart floor (no KV).
  • Slot-keyed DATA_WITHHOLDING — moved from 'epoch' to 'slot' in getTimeUnitForOffense. Offense identity is now per-checkpoint, not per-epoch.
  • Single source of truth for tolerance. P2PClient.collectingMissingTxs anchors its tx-collection deadline to slotStart(block.slot + slashDataWithholdingToleranceSlots) so the collection effort runs to exactly the wall-clock instant the watcher renders its verdict. The ad-hoc p2pMissingTxCollectionDeadlineMs is removed.
  • A-525 deletions bundled in: OffenseType.VALID_EPOCH_PRUNED, slashPrunePenalty config + env var + spartan plumbing, EpochPruneWatcher class + tests, valid_epoch_pruned_slash.test.ts.
  • e2e test rewritten in data_withholding_slash.test.ts: 4 validators, slashSelfAllowed, tx is mined normally then stubbed missing on every node, watcher fires, slash executes on-chain, committee is kicked. Asserts slot-keyed offense identity + on-chain effect.

Test plan

  • yarn workspace @aztec/slasher test — 76 tests pass (incl. 9 new DataWithholdingWatcher tests).
  • yarn workspace @aztec/stdlib test src/slashing — 55 tests pass after the keying flip.
  • yarn workspace @aztec/p2p test src/client/p2p_client.test.ts — 20 tests pass with the new slot-anchored deadline.
  • yarn build clean across the monorepo.
  • The e2e (e2e_p2p_data_withholding_slash) is in place but only runs in CI.

Out of scope

  • L1 contract changes — none needed (offense type code is purely off-chain).
  • The L2PruneUnproven event/emitter is left in place; nothing subscribes to it after this PR but the event itself stays available for future observers.
  • Re-execution of checkpoints (covered by sibling A-1022).

Closes A-523.
Closes A-525.

@PhilWindle PhilWindle added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 8, 2026
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch 2 times, most recently from 78d0044 to ec8293d Compare May 12, 2026 19:34
@PhilWindle PhilWindle marked this pull request as ready for review May 12, 2026 19:37
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from ec8293d to 51fe9e3 Compare May 13, 2026 08:53
Copy link
Copy Markdown
Contributor

@spalladino spalladino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a few comments. Aside from them, something I'm worried about are race conditions that cause the TxPool to evict txs before the data withholding watcher has had the time to check availability. For instance: let's say an epoch gets proven really fast, so its txs get evicted from the mempool (since they are no longer needed) faster than the data withholding tolerance. When the watcher kicks in, the txs will have been removed from the tx pool, and the slash will kick on. Or even worse: in the event of a prune due to missing proof, the txs for all blocks in the epoch become unprotected (or removed if they're invalid, as they may have been anchored to pruned blocks), and the data withholding check will kick in a few slots after the prune and want to slash all attestors of the last few checkpoints in the epoch. I'm not sure what's the best way to fix this though, and apologies for not catching this earlier.

Comment on lines +617 to +620
// Staleness: the proposer can only publish this checkpoint within its wallclock slot,
// which is `slot - pipeliningOffset`.
const currentSlot = this.epochCache.getSlotNow();
const lastAttestableWallclockSlot = slotNumber - this.epochCache.pipeliningOffset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the same helpers as in the attestation_validator here, which I think are more lenient.

More in general, when in doubt, we should probably broadcast our attestation anyway. Better be penalized on p2p than slashed due to inactivity.

// before we declare its data missing. For checkpoint slot S, we therefore process S
// only once we are in slot `S + tolerance + 1` or later.
const tolerance = this.config.slashDataWithholdingToleranceSlots;
const currentSlot = this.epochCache.getSlotNow();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change to L2BlockSource.getSyncedL2SlotNumber in case the archiver is lagging behind for some reason

Comment on lines +140 to +143
this.log.warn(`Detected ${missingTxs.length} missing txs at slot ${slot} but no recoverable attesters`, {
slot,
missingTxs: missingTxs.map(h => h.toString()),
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird, the publisher itself should at least be an attester. But fine to leave this in just as a guard.

Comment on lines +222 to +223
// 4. Wait until the proposer is the designated current proposer.
t.logger.warn(`Waiting for ${proposerAddress} to be the current proposer`);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With pipelining you should wait until proposerAddress is to be the proposer in the next slot according to the rollup contract. I'm surprised the tests passes anyway.

Comment on lines +266 to +267
// 7. Wait for the slash to execute on L1. Advance to the slashing tally epoch so B/C/D
// can start voting on the offense round.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue we should start dropping this part from all e2e slashing tests, so they run faster, and keep it in only one of them.

Comment on lines +190 to +198
const proposerP2pService: any = (proposerNode as any).p2pClient.p2pService;
const originalPropagate = proposerP2pService.propagate.bind(proposerP2pService);
jest.spyOn(proposerP2pService, 'propagate').mockImplementation(((msg: any) => {
if (msg instanceof Tx) {
t.logger.info(`Suppressing outbound tx gossip from proposer ${proposerAddress}`);
return Promise.resolve();
}
return originalPropagate(msg);
}) as any);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer adding test-only config flags to control this behaviour, so e2e tests don't need to mess with internals. But it's a personal preference.

@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch 3 times, most recently from e48f310 to cdf0290 Compare May 13, 2026 16:40
A-523: Replace the L1-prune-time data-withholding check with a per-slot
watcher that runs once `slashDataWithholdingToleranceSlots` full slots
have elapsed after a published checkpoint, probes the local mempool via
ITxProvider.hasTxs, and emits a slot-keyed DATA_WITHHOLDING offense for
every attester (union of L1-published + p2p-pool) that signed the
checkpoint when any of its txs are still missing. Restart-floor at boot
slot, ticks at quarter-eth-slot cadence.

A-525: Remove the legacy VALID_EPOCH_PRUNED slashing path:
EpochPruneWatcher, slashPrunePenalty, and OffenseType.VALID_EPOCH_PRUNED.

Protocol fixes that close the data-withholding loophole:
- validateCheckpointProposal: refuse to attest if the block's enclosing
  checkpoint is already on L1 — we cannot legitimately attest based on
  L1-synced data we did not witness on p2p.
- shouldAttestToSlot: refuse a proposal whose proposer's wallclock window
  has elapsed (`currentSlot > slot - pipeliningOffset`).
- block-sync timeout in validateCheckpointProposal: use
  `getReexecutionDeadline` so the deadline is the end of the proposal's
  own slot when pipelining is off, and the start of the target slot when
  it is on.

P2P:
- ITxProvider.hasTxs(txHashes) added for the watcher's mempool probe.
- Optional p2pMissingTxCollectionDeadlineSlots clamped up to the
  tolerance window, so collection never gives up before the slash verdict.

E2E:
- Rewrite data_withholding_slash to model the realistic attack: stub
  proposer outbound tx gossip + two blind attesters + one honest member
  who naturally refuses via the new gates. Retry epoch advance until a
  full committee is found.
- Patch duplicate_attestation_slash: stub `notifyOwnCheckpointProposal`
  on the malicious nodes to break the self-loopback into local validation
  (with skipPushProposedBlocksToArchiver the proposer's block never lands
  in its archiver, so the loopback hung the proposer past the staleness
  gate and suppressed all duplicate attestations).
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from cdf0290 to 7c1cfff Compare May 13, 2026 17:29
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 7f1b01c to 425d9bf Compare May 14, 2026 17:37
…-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 425d9bf to cd34200 Compare May 14, 2026 17:39
}

public start(): Promise<void> {
this.initialSlot = this.epochCache.getSlotNow();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be l2BlockSource.getSyncedL2SlotNumber() here as well, sorry I missed it on the first pass

* driven externally via `removeBefore` (typically by the proposal handler, once a
* checkpoint reaches L1 finality).
*/
export class InMemoryCheckpointReexecutionTracker implements CheckpointReexecutionTracker {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I've started removing intermediate interfaces when we have a single implementation, since it makes code navigation more annoying. We should agree on whether it's a good policy or not.

Comment on lines +975 to +976
// We successfully re-executed every block in this checkpoint locally, record for any observers
this.reexecutionTracker.recordReexecuted(checkpointNumber, proposal.archive);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a recordTxsCollected before we reexecute, in case we do manage to collect the txs but fail to reexecute within the deadline?

Mirror the same archiver-synced-slot logic used in work(): use
l2BlockSource.getSyncedL2SlotNumber() rather than the wallclock to
floor processing on start. Avoids treating slots as already-started
before the archiver has ingested them. Per PR review.
@PhilWindle PhilWindle enabled auto-merge (squash) May 15, 2026 10:38
…kpoint_proposal_slash test

slashPrunePenalty was removed from SetupOptions as part of A-523/A-525.
This e2e test still referenced it, breaking the TS build.
@PhilWindle PhilWindle merged commit 58acdae into merge-train/spartan May 15, 2026
14 checks passed
@PhilWindle PhilWindle deleted the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch May 15, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants