eth/protocols/wit, consensus/bor: WIT2 — BP-signed witness announcements with transitive relay and pre-import serving#2208
Conversation
Adds WIT2 (protocol version 3): block producers sign a chunked-parallel commitment over each witness, peers verify the signature and relay the announcement at network-RTT speed without execution, and any peer holding the body can serve it pre-import from an in-memory cache. Byte-correctness is verified by requesters against the BP-signed WitnessHash, attaching tampering blame to the server; content-correctness (state-root) failures attach to the BP. Removes the per-hop ~500 ms execution gate that today serialises witness propagation through stateless validators.

Witness commitment uses 1 MiB chunked-parallel keccak (keccak256 of the concatenation of per-chunk hashes), measured at ~13.5 ms wall-clock for 50 MiB witnesses on 8 cores vs ~88 ms single-shot. Wire format and signature shape are unchanged from a single-keccak commitment; only the function mapping bytes to the 32-byte commitment changes.

Producer-side signing reuses the engine SignerFn via consensus/bor.SignBytes with a dedicated mimetype (application/x-bor-wit2-announce) and a domain-separated digest tag, replay-resistant at both the digest and signer-call levels. Receivers verify ecrecover against the scheduled producer for the announced block; announces for blocks whose header is not yet locally available are deferred (no strike) so the block-cosend race does not punish honest relayers.

The pre-import serving cache (capacity 10) is fed from the paged-fetch path the moment the byte-correctness check passes, before chain write. Cache entries are gated on a BP-signed WitnessHash being on file — relayers never cache unverified bytes, and WIT1 fallback paths skip the cache entirely. handleGetWitness consults the cache before chain storage.

Wire: new protocol version WIT2 = 3, new message SignedNewWitnessHashesMsg = 0x06 with up to 64 announcements per packet. WitnessMetadataResponse is extended with WitnessHash.
WIT1 peers continue using NewWitnessHashes; mixed mesh tolerated. Rate-limits: 200 ms per-(blockHash, peer) relay rate-limit, 30 s announce TTL, per-peer token bucket (burst 256, refill 64/s), strike disconnect at 5 invalid signed announces per minute. Conflicting WitnessHash for the same BlockHash is rejected via signedWitnessCache.putIfNewer. Operator note: validators running Clef as their signer must whitelist the mimetype application/x-bor-wit2-announce; without it the producer falls back to unsigned WIT1 announces.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Code Review — 3 issues found. Checked for bugs and CLAUDE.md compliance.

1. Performance: redundant witness encoding. File: …
On a 50 MiB witness this adds ~100–300 ms of redundant CPU work per verified fetch — meaningful given WIT2's goal of eliminating per-hop latency. Suggested fix: Have …

2. Performance: unconditional encode+hash before signed-announcement check. File: …
Every witness broadcast — including from WIT1 peers — pays the full encode+hash cost (~150–450 ms on 50 MiB witnesses) even when the result is never used. Suggested fix: Check …

3. Bug: peer dropped on local EncodeRLP failure. File: …
When … This is inconsistent with the pattern in … Suggested fix: Change …

m.handleWitnessFetchFailureExt(hash, "", fmt.Errorf("witness encode failed: %w", err), false)
Codecov Report

❌ Patch coverage is 60.72%. Your patch check has failed because the patch coverage (60.72%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@ Coverage Diff @@
## develop #2208 +/- ##
=========================================
Coverage 52.29% 52.30%
=========================================
Files 884 886 +2
Lines 155571 156296 +725
=========================================
+ Hits 81355 81745 +390
- Misses 68989 69291 +302
- Partials 5227 5260 +33
... and 21 files with indirect coverage changes
- eth/handler_wit2.go: remove unused errInvalidSigner, contextBackground, wit2SpanLookupMissMeter, and the now-unused context import
- core/stateless/witness_commit_bench_test.go: drop redundant c := c loop-var copies (Go 1.22+ copyloopvar)
- goimports formatting on accounts/accounts.go, witness_commit_bench_test.go, witness_commit_helpers_test.go, eth/fetcher/witness_manager.go, eth/fetcher/witness_manager_wit2_test.go, eth/handler_wit2.go, eth/protocols/wit/protocol.go
… drop

- eth/fetcher/witness_manager.go: verifyAgainstSignedHash now returns the canonically-encoded body and signed hash on success, so the pre-import serving cache no longer re-encodes the same witness (~14 ms saved per verified fetch on 50 MiB witnesses). cacheVerifiedWitnessForServing takes the precomputed body directly.
- eth/fetcher/witness_manager.go: local EncodeRLP failure inside verifyAgainstSignedHash no longer drops the peer — re-encoding bytes the peer already delivered as valid RLP is a local invariant violation, not peer misbehavior. Mirrors the pattern already used by the cache path.
- eth/handler_wit.go: hoist signedWitnesses.get(hash) above the EncodeRLP + WitnessCommitHash work in handleBroadcastWitness. WIT1 broadcasts (no signed announcement on file) used to pay the full encode+hash cost only to discard the result; now they short-circuit.
- eth/fetcher/witness_manager_wit2_test.go: rename and retarget the no-signed-hash regression test onto verifyAgainstSignedHash, where the invariant now lives.
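The hoisted short-circuit in handleBroadcastWitness can be sketched as follows. This is an illustrative shape only, not the PR's actual code: SHA-256 over the raw bytes stands in for the real EncodeRLP + keccak WitnessCommitHash work, and all names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// commit stands in for the expensive encode+hash step (EncodeRLP +
// WitnessCommitHash in the PR; SHA-256 over raw bytes in this sketch).
func commit(witness []byte) [32]byte {
	return sha256.Sum256(witness)
}

// verifyBroadcast sketches the hoisted check: consult the signed-announce
// cache first; WIT1 broadcasts (no signed hash on file) skip the expensive
// encode+hash entirely. The second return reports whether hashing ran.
func verifyBroadcast(blockHash string, witness []byte, signed map[string][32]byte) (ok bool, hashed bool) {
	want, found := signed[blockHash]
	if !found {
		return true, false // WIT1 path: nothing to verify against, short-circuit
	}
	return commit(witness) == want, true
}

func main() {
	witness := []byte("witness-bytes")
	signed := map[string][32]byte{"0xaa": commit(witness)}

	ok, hashed := verifyBroadcast("0xaa", witness, signed) // WIT2: verified
	fmt.Println(ok, hashed)
	ok, hashed = verifyBroadcast("0xbb", witness, signed) // WIT1: no hash work
	fmt.Println(ok, hashed)
	ok, hashed = verifyBroadcast("0xaa", []byte("tampered"), signed)
	fmt.Println(ok, hashed)
}
```

The cheap map lookup runs first, so the ~150–450 ms worst-case cost cited in the review is paid only when a signed announcement actually exists to verify against.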
Review responses

All 3 review issues addressed in 12368a3:
The CI status
Three high-severity issues from the Codex adversarial review on PR #2208, each with a TDD regression test added first, then fixed:

1. Header-race: a signed announce arriving before the header was silently downgraded. handleSignedWitnessAnnouncements called peer.AddKnownAnnounce unconditionally before the verification gate, leaving a peer marked announce-known even on bad-signature / header-unknown rejection paths. That suppressed our own re-relay back to that peer if a valid version of the same hash arrived from someone else, killing the natural recovery path. Fix: gate AddKnownAnnounce on acceptSignedAnnouncement success so the announce-known bit only reflects verified delivery. Test: TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown.

2. pendingWitnessBodies TTL didn't actually evict. get() observed expiry and returned false but left the entry in the map; gcLocked only ran from put(), so a node that stopped receiving witnesses retained up to capacity (10) ~50 MB blobs indefinitely. Fix: when get() observes an expired entry, upgrade to the write lock and delete it (re-checking under the write lock to avoid clobbering a concurrent put). Test: TestPendingWitnessBodyCacheGetEvictsExpired.

3. Honest body-server dropped on a bad producer commitment. verifyAgainstSignedHash dropped the byte-server on every signed-hash mismatch, but the announcement only proves *some* BP signed *some* hash — not that the hash matches the canonical witness. A faulty or malicious scheduled producer that signed a bogus hash could weaponise this to disconnect every honest peer serving the real witness. Fix: reject the bytes (don't cache for serving) and back off the request without dropping the byte-server. A TODO comment is left for follow-up signer-quarantine work, which needs (signer, relayer) provenance the manager doesn't currently have. Test: TestProcessWitnessResponseDoesNotDropOnByteMismatch (replaces the previous TestProcessWitnessResponseDropsOnHashMismatch, whose policy this commit reverses).
Adversarial-review findings — addressed via TDD

Codex flagged 3 high-severity issues in an adversarial pass. For each I wrote a failing regression test first, then implemented the smallest correct fix, then confirmed all 919 tests across
What's still out of scope (filed as TODOs at the call site, not in this PR)
Fourth and final adversarial-review item.

Block and signed-announce gossip streams travel independently and can reach a node in either order. When the announce arrives first, isScheduledProducer returns (ok=false, headerAvailable=false) and the previous code dropped the announcement on the floor — relying on mesh re-gossip to reconstruct the signed hash for that block. In sparse meshes (single-cosend window, small fanout) re-gossip never fires and subsequent witness fetches silently fall back to the unsigned WIT1 path, leaking the WIT2 byte-verification guarantee for that block.

This commit holds the announcement instead of dropping it:

- A new deferredAnnounceCache mirrors pendingWitnessBodyCache: capacity 256, TTL = wit2AnnounceTTL (30 s), oldest-evict, in-place expiry on take().
- acceptSignedAnnouncement's deferral branch now puts the announcement into deferredAnnounces.
- A new drainDeferredAnnouncesFor(blockHash) re-runs verification for the matching announcement, caches it on success, credits the original sender as announce-known, and relays. On still-header-unknown (rare: the chain-head event fired but the indexed header isn't reachable yet by hash) the entry is re-stashed to ride the next chain-head event.
- handler.Start subscribes to ChainHeadEvent and runs deferredAnnouncesLoop, which calls drainDeferredAnnouncesFor on each imported block. handler.Stop unsubscribes via quitSync.

isScheduledProducer was reordered to check header presence first regardless of consensus engine. The previous early-return for non-bor test chains skipped the header check entirely, which was incorrect on its own (an announce we can't tie to a local block is unverifiable here) and prevented unit tests from exercising the deferral path. Bor producer recovery still runs only when a bor engine is present.

Test: TestDeferredSignedAnnounceDrainedAfterHeaderArrives covers the full lifecycle — announce arrives header-unknown (deferred, not cached, sender not credited), header lands, drain runs, announcement is now cached and the deferred entry is consumed.
Adversarial review — all 4 items now closed

Pushed b3cb00e: the deferred-announce queue closes the cosend race.
A small in-memory cache (capacity 256, TTL 30 s) holds signed announcements whose producer-binding could not be checked at receive time because the matching block header wasn't local yet. handler.Start subscribes to ChainHeadEvent; on each new block, drainDeferredAnnouncesFor(blockHash) re-runs verification, caches the announcement on success, credits the original sender as announce-known, and relays. Mirrors the existing pendingWitnessBodyCache lifecycle.

isScheduledProducer was also tightened: header presence is now checked first regardless of consensus engine. The previous non-bor early-return skipped the header check entirely, which was incorrect on its own — an announce we can't tie to a local block is unverifiable here.

920/920 tests pass across eth, eth/fetcher, eth/protocols/wit, consensus/bor, core/stateless. go build ./... clean.

Still deferred (not adversarial-review-flagged): signer-quarantine ban-list (needs signer/relayer provenance plumbed to the witness manager — TODO at the call site); byte-budgeted cap on pendingWitnessBodies (the count cap of 10 with a working TTL is bounded; future-proofing item).
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.


Summary
Adds WIT2 (witness protocol version 3): block producers sign a commitment over each witness, peers verify the signature and relay the announce at network-RTT speed without executing the block, and any peer that has fetched the body can serve it pre-import from an in-memory cache. The slow part of witness propagation — re-execution before relay — is removed from the critical path. Mixed mesh with WIT1 nodes is tolerated; no flag-day rollout required.
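The pre-import serving order the summary describes — consult the in-memory cache of verified-but-not-yet-imported bodies before chain storage — can be sketched as a two-tier lookup. A minimal sketch with illustrative names; the PR's actual code lives in the WIT2 handler (resolveWitnessBytes / handleGetWitness), not here.

```go
package main

import "fmt"

// witnessStore sketches the two serving tiers: a bounded pre-import cache
// of byte-verified bodies (capacity 10 in the PR), and witnesses already
// persisted at chain import. Names are illustrative.
type witnessStore struct {
	pending map[string][]byte // verified bodies, block not yet imported
	chain   map[string][]byte // witnesses persisted at import
}

// resolveWitness checks the pre-import cache first, then chain storage,
// and reports which tier served the bytes.
func (s *witnessStore) resolveWitness(blockHash string) ([]byte, string, bool) {
	if b, ok := s.pending[blockHash]; ok {
		return b, "pre-import cache", true
	}
	if b, ok := s.chain[blockHash]; ok {
		return b, "chain storage", true
	}
	return nil, "", false
}

func main() {
	s := &witnessStore{
		pending: map[string][]byte{"0xaa": []byte("verified-not-imported")},
		chain:   map[string][]byte{"0xbb": []byte("imported")},
	}
	_, src, _ := s.resolveWitness("0xaa")
	fmt.Println(src) // served before the block finishes importing
	_, src, _ = s.resolveWitness("0xbb")
	fmt.Println(src)
}
```

The first tier is what lets a relay answer a downstream fetch before it has executed the block; the second preserves today's WIT1 behaviour unchanged.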
Devnet result (4 scenarios, post-fork-only window, hop-chain topology with +300 ms per-hop import knob):
What we're solving
Today on Polygon mainnet, witness propagation through a stateless validator that is multiple hops away from a block producer accumulates a per-hop ~500 ms execution gate: each intermediate node must finish executing the block before it will relay the witness downstream. This serialises along the path and shows up at the receiver as milestone-voting latency — slow milestone votes on a fraction of blocks at multi-hop stateless validators. Adding more peers does not help: the delay is a serial chain of per-hop dependencies, roughly hop count × execution time, and extra fan-in cannot shortcut it.
The deliverable is to detach announce from execute so witness availability propagates at gossip speed, while keeping the same byte-correctness guarantee (hash check at the requester, with on-chain blame) and the same content-correctness guarantee (state-root, with BP blame).
How the code achieves it
1. BP-signed witness commitment
The producer needs to commit to which witness bytes are correct without paying ~88 ms of single-thread keccak on the announce path (otherwise we re-introduce the same gate we're trying to remove, just on a different node). See Signing-scheme evaluation below — short version: chunked-parallel keccak at 1 MiB chunks beats the next-best viable candidate by a clear margin and keeps the WIT1 wire format intact.
- `core/stateless/witness_commit.go::WitnessCommitHash(bytes)` = keccak256(concat(per-1 MiB-chunk keccak)). Each 1 MiB chunk is hashed in parallel; the final aggregate is one extra keccak over <1 KiB of chunk hashes. ~13.5 ms wall-clock for 50 MiB witnesses on 8 cores vs ~88 ms single-shot keccak — a 6.5× speedup, no wire-format change. Producer and verifier agree on the chunk size as a protocol constant.
- Producer-side signing goes through `consensus/bor.SignBytes`, reusing the engine's `SignerFn`, with a dedicated mimetype `application/x-bor-wit2-announce` and a domain-separated digest tag — replay-resistant at both the digest and signer-call levels.

2. Verify-and-relay without execution
New protocol version `WIT2 = 3` (`eth/protocols/wit/protocol.go`), new message `SignedNewWitnessHashesMsg = 0x06` carrying up to 64 announcements per packet. `eth/handler_wit2.go::handleSignedWitnessAnnouncements` does ecrecover against the scheduled producer for the announced block; on success the announce is cached and immediately relayed to peers that have not seen this hash. No state execution is touched.

3. Pre-import serving cache
`pendingWitnessBodies` (capacity 10) in the WIT2 handler is fed from the paged-fetch path the moment byte-correctness verification against the BP-signed `WitnessHash` passes — i.e. before chain write. `handleGetWitness` consults this cache before chain storage, so a peer that just received the body can serve it to a downstream stateless node before it has finished executing.

4. Blame model preserved
- Byte-correctness is checked by the requester against the BP-signed `WitnessHash`; failure attaches to the server that returned the bytes.
- A conflicting `WitnessHash` for the same `BlockHash` is rejected via `signedWitnessCache.putIfNewer`, so a peer cannot equivocate witnesses across announcements.

5. Rate-limits & DoS shape
6. Compatibility
- WIT1 peers continue to announce via `NewWitnessHashes`. The mixed WIT1/WIT2 mesh is tolerated: WIT2 nodes downgrade to WIT1 wire when peering with WIT1 peers (the relay handler skips peers with `Version() < wit.WIT2`).
- The `WitnessHash` field on `WitnessMetadataResponse` is set by WIT2 servers and ignored by WIT1 readers — wire forward-compatible.

Signing-scheme evaluation
Picking the right commitment function for the announce signature is load-bearing for the whole PR: too slow on the producer and we just move the per-hop gate from "execute the block" to "hash the witness"; too weak and we lose the byte-blame property that lets a downstream node disconnect a peer that returned tampered bytes. Four candidates were evaluated end-to-end on synthetic 1–50 MiB witnesses (Apple M4 Pro, Go 1.26.2, `go test -benchtime=3s -count=3`, median of three).

Candidates
- A — `keccak256(canonical_RLP(witness))`, single-thread
- B — `keccak(chunk0_hash ‖ … ‖ chunkN_hash)`, chunks hashed concurrently
- D — intrinsic: verify the received bytes against the BP-signed `header.StateRoot` to detect bad bytes

Result at 50 MiB — verifier wall-clock (best parallel config)
D (intrinsic, 4 cores) — 44 ms verifier wall-clock, 2.0× speedup, 0 ms producer cost

Why D was rejected post-bench
D had the most attractive numbers (zero producer cost, 2× verifier speedup, no signature on the announce path) — but a peer can serve a truncated witness whose included nodes all hash consistently up to the BP-signed `header.StateRoot`. Branch nodes embed child references as 32-byte hashes inside their own bytes, so dropping a subtree leaves the parent branch nodes' hashes unchanged. The intrinsic walker has no way to distinguish "this hash-reference belongs to a path that was never touched and is intentionally absent" from "this hash-reference belongs to a path that was touched and was adversarially omitted" — only attempting execution would. That destroys pre-execute byte-blame, which is the whole reason WIT2 introduced a content commitment in the first place. A, B, and C all preserve byte-blame because they sign over content: truncation changes the commitment, the signature mismatches, and the peer is dropped pre-execute.

Why B at 1 MiB chunks won
A chunk-size sweep at 50 MiB / 8 cores:
512 KiB shaves a tenth of a ms over 1 MiB at the cost of doubling the chunk count and the per-chunk overhead — 1 MiB is the knee of the curve. Below 512 KiB, per-chunk setup starts dominating. The 4 GB/s ceiling is the M4 Pro's aggregate keccak throughput across 8 P-cores; further parallelism doesn't help with the current keccak primitive.
Verifier-side scaling — B beats A non-trivially only ≥ 30 MiB
For the small witnesses Polygon emits today (typically 1–10 MiB) B is comparable to A; for the large witnesses we already see at the upper tail (30–50 MiB) B is the difference between the producer/verifier paying a ~90 ms gate vs ~14 ms. The fix is most impactful exactly where the problem is worst.
Why not C
C is dominated by every other viable candidate on these numbers: slower verifier than A (122 ms vs 88 ms), 91 MiB / 614 k allocations per verify at 50 MiB, no wire saving. C only becomes interesting if a future design needs sub-witness proofs (proving a specific node belongs to the committed set without sending the full body) — that's not on the roadmap, so C is a no-vote here.
Sensitivity caveats
Full bench artifact (raw numbers, reproduction commands, allocation breakdown): `agent-zero/investigations/witness-propagation/witness-commit-bench.md`.

Local devnet validation
A 9-node hop-chain devnet on `kurtosis-pos`: 4 BPs full-mesh, two relay full-nodes (F1/F2) carrying a +300 ms per-hop import-delay knob to amplify the gate without heavy tx loads, and three stateless validators at hop distances 1 / 2 / 3 from the closest BP (S1 ↔ BP1, S2 ↔ F1, S3 ↔ F2). Topology was enforced post-launch via `admin_removePeer` after every node imported past Giugliano (block 128 + 72-block settle), so the measurement window is post-fork and post-prune only — pre-fork blocks (different code path) are excluded.

Four scenarios, ~30 measured blocks each:
- `bor:develop` (control)
- `bor:wit2`
- … = `bor:wit2`, rest = `bor:develop`
- … = `bor:wit2`, rest = `bor:develop`

F2 import-lag (the relay just before S3) shows the mechanism: the median drops 805 → 305 ms in scenario 2 — one full per-hop inject overlapped with WIT2 announcement-driven pre-fetch, exactly what the design predicts.
Full report (per-scenario logs, lag tables, errors/warnings, peer-count snapshots, prune timestamps, image map): `agent-zero/investigations/witness-propagation/devnet-validation-2026-04-30b.md`.

Backward compatibility — explicit checks
- `pendingWitnessBodies` skipped when no signed WitnessHash is on file (`eth/handler_wit2.go::resolveWitnessBytes`)
- `WitnessHash` field on `WitnessMetadataResponse` ignored by WIT1 readers
- `consensus/bor.SignBytes` (`consensus/bor/signbytes_test.go`)

Test plan
`core/stateless/witness_commit_test.go`, `witness_commit_bench_test.go`, `consensus/bor/signbytes_test.go`, `eth/handler_wit2_test.go`, `eth/handler_wit_test.go`, `eth/peerset_test.go`, `eth/protocols/wit/protocol_wit2_test.go`, `eth/fetcher/witness_manager_wit2_test.go`.

`pendingWitnessBodies` — we don't expect more than a few in-flight unique witnesses at a time, but worth a second opinion under burst conditions.

`application/x-bor-wit2-announce`.

Diffguard / quality-gate notes
- `eth/handler_wit2.go` has unused symbols (`errInvalidSigner`, `contextBackground`, `wit2SpanLookupMissMeter`) — left over from earlier iterations; worth removing before merge.
- `eth/handler_wit2.go` is 504 lines (4 over the 500 threshold). Up to reviewers whether to split.
- `WitnessCommitHash` cognitive complexity is 18 vs the 10 threshold — driven by the parallel-keccak fan-out with bounded goroutines; not naturally simplifiable below ~12 without losing the parallelism. Open to suggestions.