feat(slasher): per-slot data-withholding watcher (A-523, A-525)#23116
Conversation
spalladino left a comment
Looks good! Left a few comments. Aside from them, something I'm worried about is race conditions that cause the TxPool to evict txs before the data-withholding watcher has had time to check availability. For instance: say an epoch gets proven really fast, so its txs get evicted from the mempool (since they are no longer needed) faster than the data-withholding tolerance. When the watcher kicks in, the txs will have been removed from the tx pool, and the slash will kick in. Or even worse: in the event of a prune due to a missing proof, the txs for all blocks in the epoch become unprotected (or removed if they're invalid, as they may have been anchored to pruned blocks), and the data-withholding check will kick in a few slots after the prune and want to slash all attesters of the last few checkpoints in the epoch. I'm not sure what the best way to fix this is, though, and apologies for not catching this earlier.
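One possible shape for a guard against this race, purely as a sketch: before rendering a verdict, the watcher could ask the L2 block source whether the checkpoint's epoch was already proven or pruned, and skip slashing in either case. Both accessor names below are assumptions for illustration, not the PR's actual API.

```ts
// Hypothetical guard for the race described above: if the checkpoint's epoch was
// proven (txs legitimately evicted from the mempool) or pruned (txs unprotected or
// dropped as invalid) before the tolerance elapsed, missing txs are not evidence
// of withholding. Both source methods are assumed names, not the real interface.
interface EpochStatusSource {
  getProvenL2EpochNumber(): Promise<bigint>;
  getLastPrunedEpoch(): Promise<bigint | undefined>;
}

async function shouldSkipVerdict(source: EpochStatusSource, checkpointEpoch: bigint): Promise<boolean> {
  if (checkpointEpoch <= (await source.getProvenL2EpochNumber())) {
    return true; // epoch proven: mempool eviction is expected
  }
  const lastPruned = await source.getLastPrunedEpoch();
  return lastPruned !== undefined && checkpointEpoch <= lastPruned; // epoch pruned
}
```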
```ts
// Staleness: the proposer can only publish this checkpoint within its wallclock slot,
// which is `slot - pipeliningOffset`.
const currentSlot = this.epochCache.getSlotNow();
const lastAttestableWallclockSlot = slotNumber - this.epochCache.pipeliningOffset();
```
We should use the same helpers as in the attestation_validator here, which I think are more lenient.
More generally, when in doubt we should probably broadcast our attestation anyway: better to be penalized on p2p than slashed due to inactivity.
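As a sketch of the lenient direction (the grace margin and gate shape are illustrative assumptions, not the attestation_validator's actual helpers):

```ts
// Sketch only: a more lenient staleness gate. Refuse to attest only when we are
// unambiguously past the proposer's wallclock window; on the borderline, attest
// anyway and let p2p scoring absorb a truly stale proposal. GRACE_SLOTS and the
// function shape are assumptions for illustration.
const GRACE_SLOTS = 1n;

function shouldAttestToSlot(slotNumber: bigint, currentSlot: bigint, pipeliningOffset: bigint): boolean {
  const lastAttestableWallclockSlot = slotNumber - pipeliningOffset;
  return currentSlot <= lastAttestableWallclockSlot + GRACE_SLOTS;
}
```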
```ts
// before we declare its data missing. For checkpoint slot S, we therefore process S
// only once we are in slot `S + tolerance + 1` or later.
const tolerance = this.config.slashDataWithholdingToleranceSlots;
const currentSlot = this.epochCache.getSlotNow();
```
I'd change to `L2BlockSource.getSyncedL2SlotNumber` in case the archiver is lagging behind for some reason.
```ts
this.log.warn(`Detected ${missingTxs.length} missing txs at slot ${slot} but no recoverable attesters`, {
  slot,
  missingTxs: missingTxs.map(h => h.toString()),
});
```
This is weird: the publisher itself should at least be an attester. But fine to leave this in just as a guard.
```ts
// 4. Wait until the proposer is the designated current proposer.
t.logger.warn(`Waiting for ${proposerAddress} to be the current proposer`);
```
With pipelining you should wait until proposerAddress is due to be the proposer in the next slot according to the rollup contract. I'm surprised the test passes anyway.
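A sketch of what that wait might look like, assuming a `retryUntil` helper as in `@aztec/foundation`; the rollup-contract accessor names (`getSlotNumber`, `getProposerAt`) are assumptions, and `rollup` / `proposerAddress` come from the test context:

```ts
import { retryUntil } from '@aztec/foundation/retry';

// With pipelining, the proposer for slot S publishes during slot S - pipeliningOffset,
// so wait until proposerAddress is the proposer of the *next* slot per the rollup
// contract, not merely the current designated proposer.
async function waitUntilNextSlotProposer(
  rollup: { getSlotNumber(): Promise<bigint>; getProposerAt(slot: bigint): Promise<{ toString(): string }> },
  proposerAddress: string,
): Promise<void> {
  await retryUntil(
    async () => {
      const nextSlot = (await rollup.getSlotNumber()) + 1n;
      const nextProposer = await rollup.getProposerAt(nextSlot);
      return nextProposer.toString().toLowerCase() === proposerAddress.toLowerCase();
    },
    'wait-for-next-slot-proposer',
    /* timeoutSeconds */ 120,
  );
}
```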
```ts
// 7. Wait for the slash to execute on L1. Advance to the slashing tally epoch so B/C/D
// can start voting on the offense round.
```
I'd argue we should start dropping this part from all e2e slashing tests, so they run faster, and keep it in only one of them.
```ts
const proposerP2pService: any = (proposerNode as any).p2pClient.p2pService;
const originalPropagate = proposerP2pService.propagate.bind(proposerP2pService);
jest.spyOn(proposerP2pService, 'propagate').mockImplementation(((msg: any) => {
  if (msg instanceof Tx) {
    t.logger.info(`Suppressing outbound tx gossip from proposer ${proposerAddress}`);
    return Promise.resolve();
  }
  return originalPropagate(msg);
}) as any);
```
I prefer adding test-only config flags to control this behaviour, so e2e tests don't need to mess with internals. But it's a personal preference.
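For illustration, the config-flag alternative might look like the sketch below; the flag name and wiring are hypothetical, not anything the codebase defines.

```ts
// Hypothetical test-only flag consulted inside the p2p service's propagate path,
// so e2e tests set config instead of jest-mocking internals.
interface TestOnlyP2POptions {
  suppressOutboundTxGossip?: boolean;
}

function makePropagate(
  doPropagate: (msg: unknown) => Promise<void>,
  isTx: (msg: unknown) => boolean,
  testOnly: TestOnlyP2POptions = {},
) {
  return (msg: unknown): Promise<void> => {
    if (testOnly.suppressOutboundTxGossip && isTx(msg)) {
      return Promise.resolve(); // silently drop outbound tx gossip, like the mock above
    }
    return doPropagate(msg);
  };
}
```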
A-523: Replace the L1-prune-time data-withholding check with a per-slot watcher that runs once `slashDataWithholdingToleranceSlots` full slots have elapsed after a published checkpoint, probes the local mempool via `ITxProvider.hasTxs`, and emits a slot-keyed DATA_WITHHOLDING offense for every attester (union of L1-published + p2p-pool) that signed the checkpoint when any of its txs are still missing. Restart-floor at boot slot, ticks at quarter-eth-slot cadence.

A-525: Remove the legacy VALID_EPOCH_PRUNED slashing path: `EpochPruneWatcher`, `slashPrunePenalty`, and `OffenseType.VALID_EPOCH_PRUNED`.

Protocol fixes that close the data-withholding loophole:
- `validateCheckpointProposal`: refuse to attest if the block's enclosing checkpoint is already on L1; we cannot legitimately attest based on L1-synced data we did not witness on p2p.
- `shouldAttestToSlot`: refuse a proposal whose proposer's wallclock window has elapsed (`currentSlot > slot - pipeliningOffset`).
- Block-sync timeout in `validateCheckpointProposal`: use `getReexecutionDeadline` so the deadline is the end of the proposal's own slot when pipelining is off, and the start of the target slot when it is on.

P2P:
- `ITxProvider.hasTxs(txHashes)` added for the watcher's mempool probe.
- Optional `p2pMissingTxCollectionDeadlineSlots` clamped up to the tolerance window, so collection never gives up before the slash verdict.

E2E:
- Rewrite `data_withholding_slash` to model the realistic attack: stub proposer outbound tx gossip + two blind attesters + one honest member who naturally refuses via the new gates. Retry epoch advance until a full committee is found.
- Patch `duplicate_attestation_slash`: stub `notifyOwnCheckpointProposal` on the malicious nodes to break the self-loopback into local validation (with `skipPushProposedBlocksToArchiver` the proposer's block never lands in its archiver, so the loopback hung the proposer past the staleness gate and suppressed all duplicate attestations).
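A condensed sketch of the per-slot decision the A-523 description walks through. The interface shapes are simplified stand-ins (and the review's `getSyncedL2SlotNumber` suggestion is folded in); only the tolerance gating, `hasTxs` probe, and per-attester slot-keyed offense follow the PR.

```ts
type TxHash = string; // simplified stand-ins for the real stdlib types
type EthAddress = string;

interface CheckpointInfo {
  slotNumber: bigint;
  txHashes: TxHash[];
  attesters: EthAddress[]; // union of L1-published and p2p-pool attestations
}

interface Offense {
  offenseType: 'DATA_WITHHOLDING';
  slotNumber: bigint; // slot-keyed: offense identity is per-checkpoint
  validator: EthAddress;
}

async function checkCheckpoint(
  checkpoint: CheckpointInfo,
  txProvider: { hasTxs(hashes: TxHash[]): Promise<boolean[]> },
  l2BlockSource: { getSyncedL2SlotNumber(): Promise<bigint> },
  toleranceSlots: bigint,
): Promise<Offense[]> {
  // Process checkpoint slot S only once slot S + tolerance + 1 has started, using
  // the archiver-synced slot rather than the wallclock (per review).
  const currentSlot = await l2BlockSource.getSyncedL2SlotNumber();
  if (currentSlot <= checkpoint.slotNumber + toleranceSlots) {
    return [];
  }
  const present = await txProvider.hasTxs(checkpoint.txHashes);
  if (present.every(Boolean)) {
    return []; // all tx data available: nothing to slash
  }
  // One slot-keyed DATA_WITHHOLDING offense per attester who signed the checkpoint.
  return checkpoint.attesters.map(validator => ({
    offenseType: 'DATA_WITHHOLDING' as const,
    slotNumber: checkpoint.slotNumber,
    validator,
  }));
}
```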
```ts
  }

  public start(): Promise<void> {
    this.initialSlot = this.epochCache.getSlotNow();
```
Should be `l2BlockSource.getSyncedL2SlotNumber()` here as well; sorry I missed it on the first pass.
```ts
 * driven externally via `removeBefore` (typically by the proposal handler, once a
 * checkpoint reaches L1 finality).
 */
export class InMemoryCheckpointReexecutionTracker implements CheckpointReexecutionTracker {
```
Personally I've started removing intermediate interfaces when we have a single implementation, since the indirection makes code navigation more annoying. We should agree on whether it's a good policy or not.
```ts
// We successfully re-executed every block in this checkpoint locally, record for any observers
this.reexecutionTracker.recordReexecuted(checkpointNumber, proposal.archive);
```
Should we add a `recordTxsCollected` before we reexecute, in case we do manage to collect the txs but fail to reexecute within the deadline?
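Roughly the ordering being suggested, as a fragment in the same method context as the quoted code; `recordTxsCollected` and `reexecuteBlocks` are hypothetical names here.

```ts
// Sketch: record successful tx collection *before* re-execution, so a checkpoint
// whose txs arrived but whose re-execution blew the deadline is not mistaken
// for data withholding. recordTxsCollected and reexecuteBlocks are assumed names.
this.reexecutionTracker.recordTxsCollected(checkpointNumber, proposal.archive);
const reexecuted = await this.reexecuteBlocks(proposal);
if (reexecuted) {
  this.reexecutionTracker.recordReexecuted(checkpointNumber, proposal.archive);
}
```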
Mirror the same archiver-synced-slot logic used in `work()`: use `l2BlockSource.getSyncedL2SlotNumber()` rather than the wallclock to floor processing on start. Avoids treating slots as already started before the archiver has ingested them. Per PR review.
…kpoint_proposal_slash test

`slashPrunePenalty` was removed from `SetupOptions` as part of A-523/A-525. This e2e test still referenced it, breaking the TS build.
Summary
Moves the data-withholding slash from the L1-prune path to a per-slot check at `slotStart(checkpoint.slot + slashDataWithholdingToleranceSlots)`, and removes the now-unnecessary `VALID_EPOCH_PRUNED` offense and `EpochPruneWatcher`.

Per AZIP-7: validators are responsible for making tx data available, not for ensuring proofs land. The new `DataWithholdingWatcher` ticks at quarter-eth-slot cadence and, for each published checkpoint older than `dataWithholdingToleranceSlots` (default 3), probes the local mempool for the txs in the checkpoint's blocks. Missing txs trigger a `DATA_WITHHOLDING` slash for the validators who actually attested to that checkpoint.

Highlights
- New `DataWithholdingWatcher` (`yarn-project/slasher/src/watchers/data_withholding_watcher.ts`) with full unit-test coverage. Sentinel-style tick + restart floor (no KV).
- `DATA_WITHHOLDING` moved from `'epoch'` to `'slot'` in `getTimeUnitForOffense` (sketched below). Offense identity is now per-checkpoint, not per-epoch.
- `P2PClient.collectingMissingTxs` anchors its tx-collection deadline to `slotStart(block.slot + slashDataWithholdingToleranceSlots)` so the collection effort runs to exactly the wall-clock instant the watcher renders its verdict. The ad-hoc `p2pMissingTxCollectionDeadlineMs` is removed.
- Removed: `OffenseType.VALID_EPOCH_PRUNED`, `slashPrunePenalty` config + env var + spartan plumbing, `EpochPruneWatcher` class + tests, `valid_epoch_pruned_slash.test.ts`.
- `data_withholding_slash.test.ts`: 4 validators, `slashSelfAllowed`, tx is mined normally then stubbed missing on every node, watcher fires, slash executes on-chain, committee is kicked. Asserts slot-keyed offense identity + on-chain effect.
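The keying flip in `getTimeUnitForOffense` amounts to something like the following sketch; the other enum arms are elided and the function shape is an assumption.

```ts
type TimeUnit = 'slot' | 'epoch' | 'round';

// Sketch of the flip: DATA_WITHHOLDING is now identified by the checkpoint's slot,
// so each withheld checkpoint is its own offense instead of collapsing per epoch.
// The remaining offense types (elided here) keep their existing units.
function getTimeUnitForOffense(offenseType: string): TimeUnit {
  switch (offenseType) {
    case 'DATA_WITHHOLDING':
      return 'slot'; // previously 'epoch'
    default:
      return 'epoch'; // placeholder for the elided arms
  }
}
```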
Test plan

- `yarn workspace @aztec/slasher test`: 76 tests pass (incl. 9 new `DataWithholdingWatcher` tests).
- `yarn workspace @aztec/stdlib test src/slashing`: 55 tests pass after the keying flip.
- `yarn workspace @aztec/p2p test src/client/p2p_client.test.ts`: 20 tests pass with the new slot-anchored deadline.
- `yarn build` clean across the monorepo.
- The e2e test (`e2e_p2p_data_withholding_slash`) is in place but only runs in CI.

Out of scope
- The `L2PruneUnproven` event/emitter is left in place; nothing subscribes to it after this PR, but the event itself stays available for future observers.

Closes A-523.
Closes A-525.