[pull] main from erigontech:main#5
pull[bot] merged 538 commits into Dustin4444:main from erigontech:main on May 4, 2026
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`) to pick up erigontech/fastkeccak#8
Using &result (double indirection on interface{} already holding a
pointer) prevented correct unmarshalling of JSON null responses. Fix
mirrors go-ethereum PR #26723.
For erigon fixes:
https://github.com/orgs/erigontech/projects/21?pane=issue&itemId=94164858&issue=erigontech%7Csecurity%7C7
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
## Summary

The `--chain=dev` mode for Erigon has been updated so that it creates a dev beacon chain rather than relying on Clique consensus. In dev mode Erigon now runs the same PoS process as mainnet, including the same forking and block-production code. A single command starts a fully operational PoS node:

`./build/bin/erigon --chain=dev --beacon.api=beacon,validator,node,config`

### What's included

- **Programmatic beacon genesis** (`cl/clparams/devgenesis/`): builds a valid Deneb genesis state with deterministic BLS validators, sync committees, participation lists, and execution payload header — no external tooling needed.
- **Embedded dev validator** (`cl/validator/devvalidator/`): polls the Beacon API for proposer duties, fetches block templates, signs with proper BLS domain separation (RANDAO, block, attestation), and submits via the standard `/eth/v2/beacon/blocks` endpoint.
- **Same code paths as mainnet**: blocks flow through Caplin's fork choice, `NewPayload`, `ForkChoiceUpdated`, and the staged sync pipeline. No mock consensus or special-cased execution.

### New flags

| Flag | Default | Description |
|------|---------|-------------|
| `--dev-validator-seed` | `devnet` | BLS key derivation seed |
| `--dev-validator-count` | `64` | Number of genesis validators |
| `--dev.slot-time` | `6` | Seconds per slot (minimum: 2) |

`--beacon.api=beacon,validator,node,config` is required to enable the Beacon API endpoints used by the embedded validator.

### Removed

- `--mine` and `--dev.period` are no longer used for `--chain=dev`
- Clique genesis configuration removed from dev mode

### Dependencies

Built on top of #20190 (caplin minimal preset support), which is now merged.
## Test plan

- [x] Blocks produced every slot (tested at 6s and 2s slot times)
- [x] Epoch boundaries crossed cleanly (slots 1-16+, minimal preset 8 slots/epoch)
- [x] EL validates and commits each block (head advances)
- [x] No panics or errors over sustained runs
- [x] `make lint` clean
- [x] `make erigon integration` builds
- [ ] CI passes

---------

Co-authored-by: kewei-bot <kewei.train@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Close btree cursor when `Next()` advances past the prefix boundary in the FILE_CURSOR merge loop
- Drain heap on function return, closing any remaining btree cursors
- no `ToUpper()` call on each format
- no `logEnvBool` call on each format
An alternative to #20148: move the mutex from the log handler to the file writer. It seems the handler held a mutex to reduce allocations (via a shared buffer in the past), but no shared buffer is used there anymore. `TestStreamHandlerNoContention` with 5K goroutines: `5s -> 7ms`
## Summary

- add an early `new-step-size == 0` guard in `stepRebase`
- prevent division-by-zero panic paths in modulo/division checks
- return a clear validation error before any filesystem operations begin
**Since the 'sync from scratch' test creates an archive node, we can run the RPC test suite afterwards.** This approach has two benefits:

1. We are not reliant on a pre-built database, avoiding the fragilities inherent in that scenario.
2. We are testing a database that is entirely the product of the code under test.

Please note that when a pre-built database is used for RPC testing, only the code that responds to requests is tested, not the code that builds the database.
…20481)

- Replace `ERIGON_REBUILD_CONCURRENT_COMMITMENT` env var with the existing `statecfg.ExperimentalConcurrentCommitment` global in `RebuildCommitmentFiles()`
- Both rebuild paths (with-history and no-history) now use the `--experimental.concurrent-commitment` CLI flag as the single source of truth for trie variant selection
- Update test to set/restore the global directly instead of using `t.Setenv`
…st (#20444)

Fixes #20355

- Skip DB entries whose step falls within the file range (`step.ToTxNum < files.EndTxNum()`) in both the initial DB cursor push and the advance loop — they are never loaded into the heap
- Matches the "don't query DB for data in files" pattern from #20360, #20361, #20362
- Add `TestDomain_IteratePrefix_PrefersFilesOverDB` that simulates partial lexicographic prune and verifies file values win

Follow-up: #20474 applies the same fix to `DomainLatestIterFile.initCursorMDBX()` and `advanceInFiles()`

## Test plan

- [x] New test passes with fix, fails without it
- [x] `make lint` clean
- [x] `make erigon integration` builds
`allocs 8 -> 6`
…ure (#20486)

### Problem

When Erigon is running at chain tip, `MergeLoop` executes merge steps back-to-back with no pause between iterations. Each merge step involves heavy disk I/O (reading, compressing, and writing state files). Running these steps consecutively saturates the disk, starving block execution of I/O bandwidth.

The result is periodic block processing stalls: the node's reported block number freezes for minutes at a time while background merges consume all available I/O, then bursts forward when a merge step completes. During these stalls the node falls behind the chain tip and is marked unhealthy by load balancers.

### Observed behavior

On a production fleet running Erigon v3.3.x on AWS Graviton instances (64GB RAM, EBS gp3 volumes), we observed the following pattern during MergeLoop activity on individual nodes:

- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of continuous merge activity

The node always recovers eventually, but the stalls cause the node to be removed from load balancer rotation, reducing fleet capacity.

### Solution

Add a configurable delay between `MergeLoop` iterations via the `MERGE_THROTTLE_MS` environment variable (default 0, preserving current behavior). The delay is inserted after each successful `mergeLoopStep`, giving block execution a window to access the disk before the next merge step begins.
```
Before (current):
  mergeLoopStep() → heavy I/O
  mergeLoopStep() → immediately, more heavy I/O
  mergeLoopStep() → immediately, more heavy I/O

After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
```

### Production results

We have been running this patch on a 3-node production fleet since December 2025. Results:

- Individual node availability during merge-heavy periods improved from ~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with the remaining downtime caused by synchronized stalls that this patch and `AGGREGATION_DELAY_MS` (PR #20391) address together

Recommended values based on our testing:

| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |

### Notes

- This is complementary to `COMPRESS_WORKERS` (PR #18995), which reduces I/O pressure *within* each merge step by limiting worker parallelism. This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391, merged), which staggers the *start time* of aggregation across fleet nodes.
- No impact on single-node deployments or initial sync (default delay is 0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
## Summary
Fixes a panic seen in the wild:
```
panic: runtime error: index out of range [4] with length 1
cl/sentinel.(*Sentinel).findPeersForSubnets.func1
cl/sentinel/discovery.go:80
```
A malformed/short `attnets` ENR entry decodes into a
`bitfield.Bitvector64` shorter than the expected 8 bytes, so indexing
`peerSubnets[subnetIdx/8]` goes out of range when `subnetIdx >= 8`.
Adds a `len(peerSubnets) == 8` guard at all three call sites in
`cl/sentinel/discovery.go` (`findPeersForSubnets` filter, post-connect
coverage update, and `onConnection` underserved-subnet check).
Fixes #20238

## Summary

- Treat deletion entries (`len(v)==0`) as authoritative in `getLatestFromDb`, preventing fallthrough to `getLatestFromFiles`, which returns stale pre-deletion data from frozen snapshots (root cause of `gas used mismatch` diffs of exactly 17100 = `SSTORE_SET - SSTORE_RESET`)
- Fix `unwind()` to distinguish `nil` (different step, skip) from `[]byte{}` (key was deleted, write tombstone) — previously both were skipped via `len(value) > 0`, causing deletion markers to be lost after unwind
- Add LargeValues cross-check in `getLatest` for CodeDomain/RCacheDomain where interrupted `PruneSmallBatches` can leave stale old-step tombstones

## Test plan

- [x] Added `TestDomain_DeletedKeyNotResurrectedByFiles` — verifies `getLatestFromDb` returns deletion entries as authoritative regardless of step age
- [x] Added `TestDomain_UnwindRestoresDeletionMarker` — verifies `unwind()` writes empty tombstones when reverting a re-write of a previously deleted key (covers both DupSort and LargeValues paths, with serialization round-trip through `ChangeSets3`)
- [x] `make lint && make test-short` pass
- [x] Verified on Sepolia full re-sync from snapshots: block 10624211 (previously failing with 17100 gas mismatch) passes cleanly, node tracks chain tip with no errors
```
1. Node struct is cache-unfriendly (biggest win potential)

type node struct {
	val any    // 16 bytes (interface = type ptr + data ptr)
	n0  *node  // 8 bytes
	n1  *node  // 8 bytes
	p0  uint32 // 4 bytes
	p1  uint32 // 4 bytes
} // = 40 bytes per node

Every mf2.top.n0, mf2.top.p0 etc. is a pointer chase to a heap-scattered node. With thousands of nodes, this hammers L1/L2 cache.

Fix: Flatten into a []node arena, replace *node with uint32 index:

type node struct {
	p   [2]uint32 // paths (indexed by bit)
	n   [2]uint32 // child indices (0 = nil)
	val uint32    // index into code2pattern, 0 = no value
} // = 20 bytes, no pointers → GC-invisible

This gives contiguous memory, eliminates GC scanning, and halves node size.
```
Benchmarks show ~20% speedup on 64K dict size.

Reason: during large merges (large-file word-level compression), `MatchFinder.unfold` is a bottleneck:
<img width="2172" height="697" alt="Screenshot 2026-03-24 at 22 41 24"
src="https://github.com/user-attachments/assets/5a35614d-61d3-4bdb-9c86-06020565302f"
/>
---------
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
…0495)

## Summary

- `shouldGenerateChangeSets` no longer short-circuits on `initialCycle` — changesets are generated for blocks within `MaxReorgDepth` of the batch end regardless, so the node can always handle reorgs at the tip
- Removes the `initialCycle` parameter from `shouldGenerateChangeSets`

Cherry-picked from #20445.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#20432) This fixes a crash path when decoding an empty stateless witness. Some downstream witness logic assumes at least one header is present and may access `Headers[0]`, so accepting an empty witness can lead to a panic later. The change adds a small validation in `fromExtWitness` to reject witnesses with no headers during decode, and includes a regression test for the RLP decode path.
…tion (#20287)

## Summary

Adds `TrieReader` — a lightweight, stateless read-only navigator for the Patricia trie. Each `Lookup` call starts from the root and descends independently without any mutable grid state.

## Motivation

`HexPatriciaHashed` carries ~200KB of mutable grid state (`grid[64][16]cell`, `touchMap`, `afterMap`, etc.) designed for update processing. For read-only path queries (proofs, witnesses, integrity checks, tooling) this is unnecessary.

## Design

```go
type TrieReader struct {
	ctx           PatriciaContext
	accountKeyLen int16
	keccak        keccak.KeccakState
	hashBuf       [length.Hash]byte
}

func NewTrieReader(ctx PatriciaContext, accountKeyLen int) *TrieReader
func (tr *TrieReader) Lookup(hashedKey []byte) (cell, bool, error)
```

- **Stateless**: no grid, no activeRows. Each call allocates only a small nibble prefix buffer
- **Direct descent**: calls `ctx.Branch(compact(prefix))` at each level, parses cells from branch data
- **`deriveHashedKeys`**: reconstructs hashed nibble extensions from plain keys (accountAddr/storageAddr) via keccak — the same step `unfoldBranchNode` does. This is critical for navigating past leaf cells into storage sub-tries
- **Storage support**: account cells with hash + more key to consume continue into the storage sub-trie at depth 64+
- **Nibble validation**: rejects invalid nibbles (> 0x0f) with a clear error

### Exported cell accessors

```go
func (c *cell) AccountAddrLen() int
func (c *cell) StorageAddrLen() int
func (c *cell) HashLen() int
func (c *cell) GetAccountAddr() []byte
func (c *cell) GetStorageAddr() []byte
func (c *cell) CellHash() []byte
func (c *cell) Extension() []byte
```

## Tests

| Test | What it verifies |
|---|---|
| `TestTrieReader_RoundTripWithHPH` | 7 accounts + 3 storage keys through real `HexPatriciaHashed.Process` → TrieReader lookup |
| `TestTrieReader_RoundTripWithHPH_ManyAccounts` | 100 accounts, all 100 found |
| `TestTrieReader_IntegrationWithRealData` (db/state) | 200 accounts + 100 storage via full domain/aggregator/snapshot stack |
| 9 mock unit tests | Account/storage hits, misses, extensions, multi-level descent, error propagation |

## Files

- `execution/commitment/trie_reader.go` — implementation + cell accessors
- `execution/commitment/trie_reader_test.go` — 12 tests (mock + HPH round-trip)
- `db/state/trie_reader_integration_test.go` — full-stack integration test
- `execution/commitment/commitmentdb/commitment_context.go` — `NewTrieContextRo` for read-only branch access

## Related

- Part of commitment evolution: #13884
SimulationIntraBlockStateReader, used by eth_simulateV1, implements HasStorage by querying only RangeAsOf(firstMinTxNum), which reflects the canonical state at the start of the first simulated block. Storage written by prior simulated blocks lives exclusively in the RAM batch (sd.GetMemBatch()) and is invisible to RangeAsOf. When a subsequent simulated block calls HasStorage on a newly deployed contract, it incorrectly returns false.

The concrete impact: EIP-7610 requires CREATE/CREATE2 to revert if the target address already has storage but no code/nonce (collision detection). With the bug, the collision is not detected and CREATE2 succeeds where it should fail.

Fix: added HasPrefixInRAM to the TemporalMemBatch interface, scanning only the in-memory btree for StorageDomain without touching disk. HasStorage now checks the RAM batch first before falling back to RangeAsOf.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary of changes for compliance with ethereum/execution-apis#762.

The changes fix the EVM tracer JSON output (structLog) to align with the Ethereum JSON-RPC specification.

1. Last memory word truncation

Before: the loop used i+32 <= len(memData), silently dropping the last chunk when memory size was not an exact multiple of 32 bytes.
After: the loop iterates over all bytes; the last partial chunk is zero-padded to 32 bytes, as required by the spec (each memory word is always 32 bytes).

2. Missing 0x prefix on storage and memory

Before: storage keys/values and memory words were serialized as raw hex (e.g. "0000...0001").
After: all hex values are emitted with the 0x prefix (e.g. "0x0000...0001"), as required by the Ethereum JSON-RPC specification.

3. error field always present even without errors

Before: the "error" field was serialized as "error": "" even when no error occurred.
After: with omitempty the field is omitted entirely when there is no error, making the output cleaner and consistent with the behavior expected by clients and Hive tests.

Impacted APIs:

* debug_traceTransaction
* debug_traceBlockByNumber
* debug_traceBlockByHash
* debug_traceCall
* debug_traceCallMany

All three fixes are covered by unit tests in the new json_stream_test.go file.

Out of scope for this PR: the keywords to enable/disable these fields and their default values.

### Tracer Configuration Comparison

| Field | Erigon | Geth | Spec #762 | Aligned |
| :--- | :--- | :--- | :--- | :---: |
| **Memory** | `disable=false` (ON) | `Enable=false` (OFF) | `enable=false` (OFF) | ❌ |
| **Stack** | `disable=false` (ON) | `Disable=false` (ON) | `disable=false` (ON) | ✅ |
| **Storage** | `disable=false` (ON) | `Disable=false` (ON) | `disable=false` (ON) | ✅ |
| **Return** | `disable=false` (ON) | `Enable=false` (OFF) | `enable=false` (OFF) | ❌ |

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…he tests (#20498)

## Summary

- Remove `TestStateCacheDeletedStorageSSTOREGas` — it duplicated `TestStateCache_PutEmpty_ThenGet_IsCacheHit` and incorrectly attributed the SSTORE gas mismatch (#20169) to the StateCache layer
- Simplify the comment on `TestStateCache_PutEmpty_ThenGet_IsCacheHit` to frame it as a cache correctness property, not a regression test for #20169

The StateCache fix (PR #20265) is a valid cache correctness improvement, but the root cause of #20169 was in the domain/snapshot layer — a cache miss falling through to the membatch/MDBX path should return the correct value. The real fix was PR #20483.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This PR updates `eth_getLogs` so that when the requested block range is larger than the configured range limit, Erigon returns `-32602` (`invalid params`) instead of `-32000` (`server error`). This makes the behavior consistent with the other parameter validation errors already used in the same `eth_getLogs` flow and aligns Erigon with other clients such as Reth.

The change is implemented in `getLogsV3` by returning an RPC invalid-params error for the block range limit check, while keeping the existing error message format unchanged. The related test was also updated to verify the exact RPC behavior, including that the returned error code is `-32602` and that the message remains `query block range exceeds server limit, narrow your filter: 5.`
- Fix `SnapBlocksRead` to cap the iteration upper bound with `to` instead of the hardcoded `2`
- Keep the existing function signature and all other integrity flows unchanged
- Restore the intended bounded `[from, to)` behavior for future or direct bounded checks
Goal: reduce mutex contention and make txs cheaper to create (increase throughput).

In this PR:

- add `agg.visible` atomic field — all visible data is recalculated and stored there
- all existing atomics (like `visibleFilesMinimaxTxNum`) moved inside the `agg.visible` object, so they are re-calculated together with the other data (consistency)
- `Aggregator.BeginFilesRo()` is now lock-free
- I guess salt can move there too — but later
- `domain.BeginFilesRo()` is now a private method and accepts the `visible` object as a parameter (`domain._visible` field removed)

---------

Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Cherry-pick from `release/3.4` to `main`:

- #19677 agg: workers presets. ressplit workers
- #19919 Revert "flush: use etl.IdentityLoadFunc instead custom. part2"
- #19780 etl: zero-copy memDataProvider
- #19941 d_lru: disable for commitment
- #19942 TemporalMemBatch: re-use vals-slice when can
- #19996 etl: pool of bufwriter
- #19995 collate: replace bitmap by array
- #20002 skill creator review results
- #20033 seg: revert global limiter
- #20046 Caplin: prevent calling `glob` per file in `BuildMissingIndices`
- #20113 seg: more usage of bufio
- #20194 execution/state: revert CodeSizePath in codeChange journal entry
- #20431 remove `flush complete` log line
- #20440 etl: munmap temp files in Dispose to prevent disk space leak

93 commits skipped due to conflicts (branches diverged significantly).

---------

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: moskud <sudeepdino008@gmail.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Replace the boolean --rpc.testing flag (engine-port-only, JWT-required) with the standard --http.api mechanism: adding "testing" to the list (e.g. --http.api eth,erigon,testing) registers testing_buildBlockV1 on the plain HTTP port (8545) without JWT, matching Geth's behaviour. The namespace is disabled by default to prevent accidental production exposure.

In standalone rpcdaemon mode the entry is silently skipped with a Warn log, since block assembly requires direct execution service access unavailable over gRPC.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
)

## Summary

- Fix BAL (Block Access List) recording bug where `coinbase == transaction target` caused storage/code/nonce reads to be dropped from the BAL
- `finalizeTx` used `delete(result.TxIn, result.Coinbase)`, which removed the entire address from the shared read-set map, destroying execution reads needed by `AsBlockAccessList`
- Replace with `result.TxIn.Delete(addr, AccountKey{Path: BalancePath})` to surgically remove only the stale balance read, preserving all other reads

## Test plan

- [x] Unskipped `stBugs/random_statetest_default_minus_tue_07_58_41_minus_15153_minus_575192` in `TestExecutionSpecBlockchainDevnet`
- [x] Both amsterdam variants pass (base + london)
- [x] `make lint` clean
- [x] `make erigon integration` builds
- [ ] Full `TestExecutionSpecBlockchainDevnet` suite on Linux CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
## Summary

- Wire the existing `RevertWithDiffset` into `unwindExec3` so the state cache is surgically invalidated during unwinds — keys touched by the unwound blocks are evicted, untouched keys remain cached
- Previously `RevertWithDiffset` was implemented but never called; the cache survived unwinds only because `ValidateAndPrepare` cleared it on the next block (hash mismatch fallback)
- The `lastExecHash` (already fetched with a comment anticipating this wiring) detects if the cache was modified by a different execution path (e.g. rolled-back `ValidatePayload`), falling back to a full clear

## Context

Follow-up to #20265 and #20483. Those PRs fixed correctness bugs in the cache layer and domain unwind; this PR completes the picture by actually invalidating the cache during unwinds rather than relying on the next-block safety net.

## Test plan

- [x] `TestRevertWithDiffset_SurgicalEviction` — touched keys evicted, untouched preserved, blockHash updated
- [x] `TestRevertWithDiffset_HashMismatch_ClearsAll` — mismatched revertFromHash triggers full clear
- [x] `TestRevertWithDiffset_AccountChange_EvictsCode` — account diffs also evict corresponding code entries
- [x] `TestRevertWithDiffset_ThenValidateAndPrepare_Continuity` — end-to-end: surgical revert then re-execution preserves surviving cache data
- [x] `make lint` clean
- [ ] Full sync test on a testnet with unwinds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Story: part of `bt.Get()` optimization, as part of blocks exec optimization.

Currently `bt.Get` allocates heavily because it calls `.Next(nil)`. But it can call `.MatchCmp()` instead. `MatchCmp` is zero-copy (even in the compressed case) and allows partial key decompression.

MatchCmp: `58ns 1 alloc -> 24ns 0 alloc`

```
┌──────────────────────┬──────────┬───────────┐
│ Method               │ hit (ns) │ miss (ns) │
├──────────────────────┼──────────┼───────────┤
│ MatchCmp (streaming) │ 43.5     │ 24.1      │
├──────────────────────┼──────────┼───────────┤
│ MatchPrefix          │ 40.7     │ 7.9       │
├──────────────────────┼──────────┼───────────┤
│ Next + buf reuse     │ 39.4     │ —         │
└──────────────────────┴──────────┴───────────┘

┌─────────────────────────┬──────────┬───────────┐
│ Method                  │ hit (ns) │ miss (ns) │
├─────────────────────────┼──────────┼───────────┤
│ MatchCmpUncompressed    │ 8.1      │ 7.3       │
├─────────────────────────┼──────────┼───────────┤
│ MatchPrefixUncompressed │ 8.3      │ 7.1      │
├─────────────────────────┼──────────┼───────────┤
│ NextUncompressed        │ 5.3      │ —         │
└─────────────────────────┴──────────┴───────────┘
```

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

## Summary

- Hive `engine-cancun / Invalid PayloadAttributes, Missing BeaconRoot, Syncing=True (Cancun)` (test 376 of [this run](https://hive.ethpandaops.io/#/test/generic/1777279481-85063528cea3c966b05ec4fef21a48ad?testnumber=376)) was failing because Erigon validated `parentBeaconBlockRoot` before the SYNCING short-circuit and returned `-38003` when the head was unknown.
- Split `validatePayloadAttributes` into request-level (pre-FCU) and point-(8)-extension (post-FCU) halves so the spec-mandated ordering is preserved on both the SYNCING and VALID paths.
- The fix that motivated #20728 (hive `engine-withdrawals / Empty Withdrawals (Paris)`) still passes — its check (V2 wrong withdrawals presence) is in the pre-FCU group.

## Why

[`cancun.md`](https://github.com/ethereum/execution-apis/blob/main/src/engine/cancun.md#engine_forkchoiceupdatedv3) says fcuV3 _extends_ point (8) of `engine_forkchoiceUpdatedV1`:

> 2. **Extend point (8)** of the `engine_forkchoiceUpdatedV1` specification by defining the following sequence of checks that **MUST** be run over `payloadAttributes`:
>    1. `payloadAttributes` matches the `PayloadAttributesV3` structure, return `-38003: Invalid payload attributes` on failure.
>    2. `payloadAttributes.timestamp` does not fall within the time frame of the Cancun fork, return `-38005: Unsupported fork` on failure.
>    3. `payloadAttributes.timestamp` is greater than `timestamp` of a block referenced by `forkchoiceState.headBlockHash`, return `-38003: Invalid payload attributes` on failure.

…and point (8) of V1 (paris.md) gates the entire processing flow on the head being `VALID`:

> 8. Client software **MUST** process provided `payloadAttributes` **after successfully applying the `forkchoiceState`** and **only if the payload referenced by `forkchoiceState.headBlockHash` is `VALID`**.

So with an unknown head, point (8) doesn't fire and the response is the SYNCING one in point (9). The `Syncing=True` family of hive cancun tests pin exactly this.
The pre-FCU "wrong version of structure" rule is different — it lives in the V2 [Request section](https://github.com/ethereum/execution-apis/blob/main/src/engine/shanghai.md#engine_forkchoiceupdatedv2) and has no precondition on head state, which is why #20728 correctly moved it before the short-circuit.

## What changed

- `validatePayloadAttributesPreFCU` keeps the request-level checks: `parentBeaconBlockRoot` on V1/V2, V1/V2 fcu at a Cancun timestamp, withdrawals presence vs Shanghai. Runs before the SYNCING short-circuit.
- `validatePayloadAttributesPostFCU` runs the V3 point-(8) extensions: `parentBeaconBlockRoot != null` for V3+, Cancun-window for V3+, `SlotNumber != null` for V4+. Runs only when `status.Status == ValidStatus`.
- Two regression tests in `testing_api_test.go`:
  - `TestForkchoiceUpdatedV3DefersAttributesValidationWhenSyncing` — pins the hive `Syncing=True` scenario (no error, status=SYNCING).
  - `TestForkchoiceUpdatedV3RejectsMissingBeaconRootWhenValid` — pins the spec-mandated `-38003` when the head is VALID so the deferral can't regress into a permanent swallow.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
… mainnet (#20815)

This PR enables mainnet historical support using the following transport types:

- http
- http compression
- websockets
Cherry-pick of #20852 to main.
## Summary

- Add `cannot unwind past 0` checks before sender and tx lookup stage unwind arithmetic.
- Guard `state_stages` one-shot and loop unwind paths before subtracting from unsigned block numbers.
- Clamp the backward-loop stop boundary at genesis to avoid uint64 underflow while preserving the existing integration command flow.
… hash (#20825)

closes #20888

implements ethereum/execution-apis#786, which is needed for glamsterdam-devnet-0

---------

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
…20328)

## Problem

Tests in `execution/engineapi` were failing intermittently in the `race-tests / tests-linux (ubuntu-latest, execution-other)` CI job:

- `TestEngineApiHighGasContractsFillBlock` → `listen tcp 127.0.0.1:46053: bind: address already in use`
- `TestEngineApiBlockGasOverflowSpillsToNextBlock` → `empty response from JSON-RPC server`

Observed on commits a9db702 and acc9898 (2 out of 10 recent commits).

**Root cause:** The test infrastructure used `freeport.NextFreePort()` to pick a free port, then later called `net.Listen()` with that port. This TOCTOU (time-of-check/time-of-use) race meant another concurrent test could bind the same port in between, causing `address already in use`. Separately, the server might not be ready when the first client request arrived, causing `empty response from JSON-RPC server`.

## Fix

Pre-create `net.Listener` instances on port `:0` (kernel-assigned, guaranteed free and already bound) before any server starts, then hand the bound socket directly to `StartHTTPEndpoint`. The kernel guarantees no other process can steal the port once it's bound.
**Changes:**

- `node/endpoints.go`: add optional `Listener` field to `HttpEndpointConfig`; if set, skip the `net.Listen` call and use the pre-bound socket
- `cmd/rpcdaemon/cli/httpcfg/http_cfg.go`: add `HttpListener` / `AuthRpcListener` fields to `HttpCfg`
- `cmd/rpcdaemon/cli/config.go`: wire up pre-created listeners in both the regular RPC and engine API listener paths
- `execution/engineapi/engineapitester/engine_api_tester.go`: replace `freeport.NextFreePort()` with `net.Listen("tcp", "127.0.0.1:0")` for the JSON-RPC and engine API ports

## Testing

- `go build ./...` ✅
- No `*_test.go` files modified ✅
- 5 consecutive `go test -race ./execution/engineapi/...` runs all pass ✅

CI job reference: https://github.com/erigontech/erigon/actions/runs/23983504381/job/69951584262

---------

Co-authored-by: erigon-copilot[bot] <erigon-copilot[bot]@users.noreply.github.com>
Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
Co-authored-by: Matt Joiner <anacrolix@gmail.com>
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary

Follow-up to #20788. Address review feedback from @anacrolix:

> I thought there was a helper in the websocket library for this so you don't need to spin up a bunch of timers and goroutines?

The `coder/websocket` `Conn.Close(code, reason)` method already performs the close handshake with bounded internal timeouts (5s write, 5s read), so the manual goroutine + timer plumbing in `wsConnAdapter.Close` is redundant. Replace it with a direct call. The existing `TestWebsocketServerGracefulClose` test still passes — clients see a clean `StatusNormalClosure` (1000) instead of `StatusAbnormalClosure` (1006).

## Test plan

- [x] `go test ./rpc/ -run Websocket` — all pass
- [x] `make lint` — clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
close #17112

This PR replaces the existing single-path `compress/gzip` middleware in `node/rpcstack.go` with a dual-path implementation that distinguishes between streaming and non-streaming responses.

a) Non-streaming responses (standard JSON-RPC calls such as `eth_getBlockByNumber`):
- are now compressed in one shot using go-libdeflate, a lightweight CGo wrapper around ebiggers/libdeflate. Benchmarks show ~1.75x speedup vs stdlib gzip on ~30 KB JSON payloads (4.6 ms vs 8.0 ms, 122 MB/s vs 70 MB/s); under high concurrency the advantage is larger due to lower CPU usage per request.
- Since the compressed size is known before writing, `Content-Length` is now set to the exact compressed size, avoiding unnecessary `Transfer-Encoding: chunked`.
- Responses smaller than 1 KB are sent uncompressed: at that size the CPU overhead of initialising the compressor outweighs the benefit, and for very small payloads the compressed output can exceed the input size due to gzip framing overhead.

b) Streaming responses (e.g. `debug_traceTransaction`, `trace_filter`) are detected via `http.Flusher`: when the RPC handler calls `Flush()` before writing, the middleware switches to stdlib `compress/gzip` in streaming mode, compressing trace data incrementally without buffering the full response.

Changes:
- `node/rpcstack.go`: new `gzipResponseWriter` with buffer/stream dual mode; 4 `sync.Pool` instances for a zero-allocation hot path
- `rpc/http.go`: injects `http.Flusher` into the request context via `httpFlusherContextKey`
- `rpc/handler.go`: streaming methods call flush before writing to activate streaming mode
- `go.mod`: adds `github.com/erigontech/go-libdeflate` v0.1.0

# 🚀 Performance Benchmarks: Gzip Optimization

---

<details>
<summary><b>1. Isolated Compression Benchmarks (libdeflate vs stdlib)</b></summary>

```diff
  Benchmark (gzip isolation)   Latency    Throughput  Mem Alloc
- BenchmarkStdlibGzip          8.0ms/req  70 MB/s     66KB
+ BenchmarkLibdeflateGzip      4.6ms/req  122 MB/s    63KB  (~1.75x faster)
```

Note: We observed a ~1.75x speedup on a single thread. Under high concurrency, the advantage is even greater due to reduced CPU overhead.
</details>

<details>
<summary><b>2. eth_getBlockByNumber with txs (Old SW vs Main SW)</b></summary>

Old SW — results with instability and errors:

```diff
- [2.1] qps: 2000 -> [R=97.74% max=9.103s error=503 Service Unavailable]
- [3.3] qps: 2300 -> [R=98.50% max=5.841s error=503 Service Unavailable]
- [3.4] qps: 2300 -> [R=99.80% max=5.634s error=503 Service Unavailable]
- [3.5] qps: 2300 -> [R=98.25% max=5.877s error=503 Service Unavailable]
```

Main SW — stable results with 100% success rate:

```diff
+ [2.1] qps: 2000 -> [R=100.00% max=141.22ms]
+ [2.5] qps: 2000 -> [R=100.00% max=157.7ms]
+ [3.1] qps: 2300 -> [R=100.00% max=217.23ms]
+ [3.5] qps: 2300 -> [R=100.00% max=224.174ms]
```
</details>

<details>
<summary><b>3. trace_block</b></summary>

| Metric | Old SW | Main SW | Speedup |
| :--- | :--- | :--- | :--- |
| **Throughput** | 375.9 r/s | **455.0 r/s** | **+21%** |
| **p50 Latency** | 114.4 ms | **96.9 ms** | **1.18x** |
| **p90 Latency** | 232.0 ms | **181.2 ms** | **1.28x** |
| **Mean Latency** | 132.4 ms | **109.5 ms** | **1.21x** |
</details>

<details>
<summary><b>4. Executes all RPC using http, http-compressed and websockets</b></summary>

```
./run_all.sh -T http,http_comp,websocket
Run tests in parallel on localhost:8545/localhost:8551
Result directory: /home/simon/silkworm/tests/rpc-tests3/integration/results
Time: 2026-04-19 08:58:05.257624
Total round_trip time: 3:40:01.359044
Total marshalling time: 0:00:00.063308
Total unmarshalling time: 0:01:40.509521
No of json Diffs: 0
Test time-elapsed: 0:08:37.467460
Available tests: 1436
Available tested api: 112
Number of loop: 1
Number of executed tests: 4152
Number of NOT executed tests: 156
Number of success tests: 4152
Number of failed tests: 0
```
</details>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- `seg reset` regressed in #18273: `removeLocal` from the CLI is wired into `reset.Reset.RemoveUnknown` but not `RemoveLocal`, which gates chaindata + Heimdall/PolygonBridge DB removal in `db/datadir/reset/reset.go`. As a result, `seg reset` no longer touches chaindata at all.
- The library-level `TestResetChaindata` exists but sets `RemoveLocal: true` directly, so the CLI plumbing was uncovered.

closes #20260
…pruning (#20915)

## Summary
`--exec.no-prune` previously only gated the state-aggregator's pruning paths via `dbg.NoPrune()` (Domain/InvertedIndex in `db/state/aggregator.go`, forkable in `db/state/forkable_agg.go`). Stage-level pruning ran regardless:
- `PruneExecutionStage` was calling `rawdb.PruneTable` on `kv.ChangeSets3` and `kv.BlockAccessList`.
- `PruneTxLookup`, `PruneWitnessProcessingStage` and `SnapshotsPrune` (`PruneAncientBlocks`, `pruneCanonicalMarkers`, `RetireBlocksInBackground`, `pruneBlockSnapshots`) had no guard.

## Fix
Add early-return guards on `dbg.NoPrune()` at the top of each stage prune entrypoint so the flag genuinely blocks every path that removes rows from MDBX. Each function still calls `s.Done(tx)` so the staged-sync state machine doesn't re-enter the prune step on every cycle. Updated the flag usage text to describe the wider semantics.

Companion bal-devnet-3 PR: #20914

## Test plan
- [x] `go test ./execution/stagedsync -run TestNoPrune` — the new `no_prune_test.go` seeds rows in `kv.ChangeSets3`, `kv.BlockAccessList`, `kv.TxLookup`, `kv.BorWitnesses`; runs all four prune entrypoints with `NoPrune=true`; asserts no rows are deleted and `PruneProgress` is recorded.
- [x] `make lint` — clean.
- [x] `make erigon` — builds.

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
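The guard shape matters here: the stage must skip deletion but still record progress, or the staged-sync loop would retry the prune step every cycle. A minimal sketch of that pattern (names are illustrative, not Erigon's actual stage functions):

```go
package main

import "fmt"

// pruneStage sketches the early-return guard described above: when pruning
// is disabled, the stage still records its progress (the s.Done(tx)
// equivalent, so the state machine doesn't re-enter the step every cycle)
// but deletes nothing.
func pruneStage(noPrune bool, deleted *int, done *bool) {
	if noPrune {
		*done = true // record progress...
		return       // ...but remove no rows
	}
	*deleted += 100 // stand-in for rawdb.PruneTable etc.
	*done = true
}

func main() {
	var deleted int
	var done bool
	pruneStage(true, &deleted, &done)
	fmt.Println("deleted:", deleted, "done:", done)
	// → deleted: 0 done: true
}
```

Returning *before* marking done would be the subtle bug variant: nothing is deleted, but the stage re-runs forever.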
- Skipping pre-Byzantium receipts because of #20910

---------

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
## Summary
- The startup banner logs the full command line via `log.RedactArgs(os.Args)` in `cmd/erigon/main.go`. URLs, IP addresses, and `--datadir` are already redacted, but `--ethstats` was not — and its value has the form `nodename:secret@host:port`, so the ethstats username and password were being printed in the clear at every startup.
- This adds an `--ethstats` redaction following the same `--datadir` pattern (matching `--ethstats=value`, `--ethstats value`, and the single-dash variants), with tests.

## Test plan
- [x] `go test ./common/log/v3/...` (new `TestRedactArgsEthstats` plus existing tests pass)
- [x] `make lint` (0 issues)

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
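A redaction like the one described has to cover both argument shapes (`--ethstats=value` and `--ethstats value`) plus the single-dash variants. Below is a hypothetical sketch of that matching logic — `redactEthstats` and the regex are illustrative, not Erigon's actual `RedactArgs` implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// ethstatsFlag matches --ethstats / -ethstats either followed by "=" (inline
// value) or at end of argument (value in the next argument). Hypothetical
// sketch; Erigon's real pattern may differ.
var ethstatsFlag = regexp.MustCompile(`^--?ethstats(=|$)`)

func redactEthstats(args []string) []string {
	out := make([]string, len(args))
	copy(out, args)
	for i, a := range out {
		m := ethstatsFlag.FindString(a)
		if m == "" {
			continue
		}
		if m[len(m)-1] == '=' {
			out[i] = m + "<redacted>" // --ethstats=nodename:secret@host:port
		} else if i+1 < len(out) {
			out[i+1] = "<redacted>" // --ethstats nodename:secret@host:port
		}
	}
	return out
}

func main() {
	fmt.Println(redactEthstats([]string{
		"erigon", "--ethstats=node:pw@host:3000", "-ethstats", "n:p@h:1",
	}))
	// → [erigon --ethstats=<redacted> -ethstats <redacted>]
}
```

The anchored `(=|$)` alternative is what keeps an unrelated flag like `--ethstatsfoo` from being clobbered.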
…20869) This fixes the matcher cleanup loop in the discovery reply-handling code. When an expired matcher is removed from the list, `Remove(el)` invalidates the element's links, so a subsequent `el.Next()` returns nil and the loop terminates prematurely. The fix saves the next element before removing the current one, so iteration proceeds across the entire list. This ensures that all matchers that expired in the same tick are processed, and that `contTimeouts` is not undercounted when multiple matchers expire simultaneously.
…uilder (#20886) Adds `iidq.reader.Reset(0)` at the top of the retry loop in `SimpleAccessorBuilder.Build`, matching `buildVI` (5c7fe1b) and `buildHashMapAccessor`. Without it, retry iterations start with an exhausted `PagedReader` and feed zero keys to recsplit, producing a corrupt accessor index for forkable snapshots (headers/bodies). --------- Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com> Co-authored-by: moskud <sudeepdino008@gmail.com>
…es (#20892)

## Summary
Two fixes addressing all 29 failures in the Hive `legacy-cancun` [run](https://hive.ethpandaops.io/#/test/generic/1777223434-c411e5be836d6d6f8a466489c2874cf3) against `erigon_default`, plus a regression test that catches the EVM-side bug locally.

### (a) `cmd/utils/app` — TD-based fork choice for `erigon import`
The Hive consensus simulator writes each block to its own `00NN.rlp` file and runs `erigon import /blocks/00NN.rlp` for each — the simulator does not implement TD/fork choice itself; it relies on the client's native chain selection. `InsertChain` always called `UpdateForkChoice(importedTip)`, so every imported block became head even when its TD was lower than the current canonical head's. For `lotsOfLeafs`, block 6 (TD `0x9CCF4`) is the canonical winner, but Erigon kept switching head through blocks 7–13 (TD `0x9CC34`) and ended on `0x26ad10c0…` instead of `0xf7f9ea97…`.

This affects **23 tests** (all Berlin/Istanbul/London variants): `lotsOfLeafs`, `ChainAtoChainB_difficultyB`, `ForkStressTest`, `CallContractFromNotBestBlock`, `lotsOfBranchesOverrideAtTheMiddle`, `sideChainWithMoreTransactions`, `uncleBlockAtBlock3afterBlock4`, `blockChainFrontierWithLargerTDvsHomesteadBlockchain[2]`.

**Fix:** before calling `UpdateForkChoice`, compare the imported chain's total difficulty against the current head's TD using a shared `headerdownload.ShouldReorg` helper (extracted from the existing `FeedHeaderPoW` tie-break: TD → height → lexicographic hash). If the imported chain doesn't beat the current head, write the headers/bodies/TDs straight to the DB as a side chain (no execution, no head change) so future side-chain extensions can still validate. Once a side-chain extension surpasses the canonical TD, the regular `InsertBlocks` + `UpdateForkChoice` path triggers the reorg and executes those blocks.

PoS imports (all `difficulty=0`, so TD never grows) explicitly bypass this branch — `ShouldReorg`'s "shorter wins on tie" rule would otherwise prevent the head from advancing past genesis.

### (b) `execution/vm` — STATICCALL touch for EIP-161 state clearing
`RevertPrecompiledTouch_d3g0v0_{Berlin,Istanbul,London}` and `RevertPrecompiledTouch_storage_d3g0v0_{Berlin,Istanbul,London}` (**6 tests**) all touch precompile 3 (RIPEMD-160) via STATICCALL inside a frame that runs out of gas. Expected: ripemd state-cleared (deleted) at the end of the tx via the `journal.dirty(ripemd)` consensus quirk. Erigon left it in place.

The CALL path in `evm.call` calls `Exist(addr)` first, which loads ripemd into `stateObjects`; the subsequent `AddBalance(addr, 0)` then falls through to `TouchAccount` and the famous quirk fires, so `FinalizeTx` deletes ripemd. But the STATICCALL path skipped `Exist` and called `AddBalance(addr, 0)` directly — in serial mode that hits the `versionMap == nil && addr == ripemd && amount.IsZero()` shortcut that uses `balanceIncrease` instead of `GetOrNewStateObject`. ripemd never enters `stateObjects`, so on revert `FinalizeTx`'s `if !exist { continue }` branch skips it.

**Fix:** replace `AddBalance(addr, u256.Num0, …)` with `TouchAccount(addr)` for STATICCALL — same end behavior as CALL's post-`Exist` flow (loads the account, hits the dirty quirk on touch). Identical for non-ripemd / non-serial paths. The `d0` (CALL), `d1` (DELEGATECALL), and `d2` (CALLCODE) variants were unaffected: CALL goes through `Exist` + `Transfer`; DELEGATECALL and CALLCODE don't touch the callee at all (matching geth and the post-state expectations).

### (c) `execution/tests` — local regression test for the d3 RIPEMD path
Add `TestLegacyCancunState`, walking `legacy-tests/LegacyTests/Cancun/GeneralStateTests` in state-test format. `TestLegacyBlockchain` (`block_test.go`) explicitly skips the `LegacyTests/` subtree and `TestState` walks the EEST static_tests, neither of which covers the d3 ripemd-touch case.

Verified the new test fails at baseline (Berlin/1, Istanbul/1, London/1 — the d3 indexes) and passes with the fix in (b). Refactored the now-three state-test runners (`TestStateCornerCases` / `TestState` / `TestLegacyCancunState`) to share a `runStateTests(t, st, dir)` helper plus a `stateTestSetup(t)` helper for parallel/log/Windows-skip boilerplate.

Six fixtures fail on Constantinople-only post-state-root mismatches (sstoreGas / *_HighNonce* / Ecrecover_Overflow / ecrecoverShortBuff) — pre-existing Erigon-side divergences from geth, unrelated to this PR. Geth doesn't catch them locally because its runner walks `LegacyTests/Constantinople/GeneralStateTests` (an older snapshot that doesn't include these). Tracked separately in #20894 and `SkipLoad`-ed for now with a comment pointing at the issue.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Adds `--only-history` (alias `--history`) to `erigon seg rm-state`. When set, restricts deletion to history files (`SnapHistory` + `SnapIdx`) only — leaving domain data (`.kv` files in `SnapDomain`, accessor files) untouched. Useful when combined with `--step from-to` to prune old history without touching domain snapshots.
…-Byzantium status (#20916)

### New implementations
- `transaction(hash)`: implemented transaction lookup by hash (was panicking)
- `block(hash)`: implemented block lookup by hash (was unimplemented)
- `block.transactionAt(index)`: implemented with proper bounds checking (was panicking)

### Bug fixes
- `transaction.type`: changed the schema type from `Int` to `Long` so EIP-1559/EIP-4844 transactions return `"0x2"` / `"0x3"` (hex) instead of `2` / `3` (decimal), matching the GraphQL spec
- `transaction.status`: pre-Byzantium blocks now correctly return `"0x0"` instead of `null` (pre-Byzantium receipts carry `PostState` instead of `Status`)
- `transaction.gas`, `transaction.maxFeePerGas`, `transaction.maxPriorityFeePerGas`, `transaction.accessList`: fields were missing from the receipt map and are now populated correctly

### Hive simulator fix
- Added `HIVE_CANCUN_TIMESTAMP` to the client environment map in the GraphQL hive simulator. Without it, Erigon rejected Cancun blocks during chain import, truncating the canonical chain at block 29 and causing ~10 tests to fail (withdrawals, Shanghai/Cancun block queries, block range queries).

## Tests
- Added unit tests for the new `Long` scalar (marshal/unmarshal)
- Added unit tests for nil guards in the `convertData` helpers
- Added unit tests for `transactionAt` bounds checking
- Added unit tests for `transaction` invalid hash error handling

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
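The `Int` → `Long` schema change comes down to rendering the value as a hex quantity rather than a decimal number. A minimal sketch of that rendering (the helper name `marshalLong` is hypothetical, not the PR's scalar implementation):

```go
package main

import "fmt"

// marshalLong renders a value the way the GraphQL Long scalar described
// above does: as a 0x-prefixed hex quantity, so an EIP-1559 transaction's
// type serialises as "0x2" rather than decimal 2. Illustrative sketch only.
func marshalLong(v uint64) string {
	return fmt.Sprintf("0x%x", v)
}

func main() {
	fmt.Println(marshalLong(2), marshalLong(3))
	// → 0x2 0x3
}
```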
Without a deadline, a WebSocket write can block indefinitely when the peer stops reading (e.g. the server sends `StatusMessageTooBig` and closes). The blocked write holds the write mutex, preventing the read path from acquiring it to send its own close-frame response — a permanent deadlock that manifests as a 59-minute hang in `TestWebsocketLargeCall` under `-race`.

Fix: in `websocketCodec.WriteJSON`, when the caller provides no context deadline, wrap the context with a `wsPingInterval` (60 s) timeout before passing it to `jsonCodec.WriteJSON`. This ensures `SetWriteDeadline` is set to at most 60 s, matching the dead-connection detection window used by the ping loop.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…#20939) - `--resume` already supported with history commitment rebuild
## Summary
- Fix a long-standing race in `Client.dispatch` (rpc/client.go) where a fatal `readErr` racing against `reqSent` could orphan the in-flight op's `resp` channel, leaving `op.wait` blocked forever on a channel nobody would ever close.
- The readErr handler was passing `lastOp` to `cancelAllRequests`, which deliberately *preserves* the inflight op's resp — directly contradicting the comment immediately above:

  ```
  // A read error is fatal for the connection, and all pending requests
  // must be cancelled, including any that might still be considered in-flight.
  ```

  Pass `nil` instead so the inflight op is cancelled too. Clear `lastOp` so a concurrent reconnect doesn't re-register an op whose resp is already closed (a later `op.resp <- batch` would panic). Nil-guard `addRequestOp` in the reconnect handler accordingly.

## Why
Surfaced as the long-running CI flake #16875 — `TestWebsocketLargeCall` hanging for ~59 minutes until the test timeout fires. From a recent failing run's goroutine dump:
- the test goroutine sat in `(*requestOp).wait` (rpc/client.go:144) for 58 minutes — i.e., past `c.send`, with the write already returned;
- `Client.dispatch` was idle in `select` (rpc/client.go:597) for the same 58 minutes;
- there were no goroutines inside `coder/websocket` Read or Write — both had returned.

The only way to reach that state is: `readErr` fired, `dispatch` processed it before/instead of `reqSent`'s `lastOp = nil`, `cancelAllRequests` skipped the in-flight op (because `lastOp` was passed as `inflightReq`), and so `op.resp` was never closed. The select between `c.readErr` and `c.reqSent` (which is buffered, size 1) is a coin flip — that's the flake.

## Test plan
- [x] `make lint` — clean
- [x] `make erigon integration` — builds
- [x] `go test -race ./rpc/...` — full suite passes
- [x] `go test -race -count=10 -run '^TestWebsocketLargeCall$' ./rpc/` — 10/10 pass
- [ ] CI race-tests / core-rpc — should be green

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `canonicalHashCache` LRU (10,000 entries, 32 bytes/entry, ~320 KB) to `BlockReader` to cache `blockNum → hash` lookups for snapshot-range blocks.

**Why a separate cache instead of increasing `headerByNumCache`:** `headerByNumCache` stores full `*Header` objects (~500 bytes/entry, 1,000 slots, ~500 KB). To cover the same 10,000-block working set it would require ~5 MB. `canonicalHashCache` achieves the same coverage at 16× less memory by storing only the 32-byte hash.

**Two regimes:**
- **Working set ≤ 1,000 blocks** (`headerByNumCache` warm): `canonicalHashCache` short-circuits `ViewSingleFile` (refcount increments + closure allocation), which main always pays even when the `*Header` is cached → **~3.3×** (572 → 167 ns)
- **Working set > 1,000 and ≤ 10,000 blocks** (`headerByNumCache` evicted, `canonicalHashCache` still warm): main falls back to the full snapshot path → **~33×** (5,119 → 167 ns)

Results (real mainnet snapshots, i9-14900HX):

| Benchmark | ns/op | B/op | allocs/op | Notes |
|-----------|------:|-----:|----------:|-------|
| `BenchmarkCanonicalHash_RealSnapshot` | 167 | 8 | 1 | this PR — `canonicalHashCache` hit |
| `BenchmarkCanonicalHash_RealSnapshot_MainEquivalent` | 572 | 176 | 5 | main equivalent — `headerByNumCache` warm |
| `BenchmarkCanonicalHash_RealSnapshot_Cold` | 5,119 | 1,657 | 11 | working set > 1,000 blocks |

> **Why MainEquivalent (572 ns) is higher than the theoretical estimate (~235 ns):** `ViewSingleFile` is called on every `CanonicalHash` invocation even when `headerByNumCache` is warm. It atomically increments the refcount of each non-frozen segment and allocates a release closure — 5 allocs, 176 B per call. The theoretical estimate only counted the MDBX miss + LRU lookup + atomic hash load and did not account for this overhead.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
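The memory argument above hinges on what each entry stores: a fixed 32-byte hash versus a full header. A minimal stdlib LRU keyed by block number illustrates the structure (Erigon uses an LRU library; this `container/list` + map version is only a sketch of the data layout, not the PR's code):

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	num  uint64
	hash [32]byte // only the canonical hash — the whole point of the cache
}

// hashLRU: blockNum → 32-byte hash, evicting least-recently-used.
type hashLRU struct {
	cap   int
	order *list.List               // front = most recently used
	items map[uint64]*list.Element // blockNum → element in order
}

func newHashLRU(capacity int) *hashLRU {
	return &hashLRU{cap: capacity, order: list.New(), items: map[uint64]*list.Element{}}
}

func (c *hashLRU) Get(num uint64) ([32]byte, bool) {
	if el, ok := c.items[num]; ok {
		c.order.MoveToFront(el) // refresh recency on hit
		return el.Value.(entry).hash, true
	}
	return [32]byte{}, false
}

func (c *hashLRU) Put(num uint64, h [32]byte) {
	if el, ok := c.items[num]; ok {
		el.Value = entry{num, h}
		c.order.MoveToFront(el)
		return
	}
	c.items[num] = c.order.PushFront(entry{num, h})
	if c.order.Len() > c.cap {
		old := c.order.Back() // least recently used
		delete(c.items, old.Value.(entry).num)
		c.order.Remove(old)
	}
}

func main() {
	c := newHashLRU(2)
	c.Put(1, [32]byte{0xaa})
	c.Put(2, [32]byte{0xbb})
	c.Put(3, [32]byte{0xcc}) // evicts block 1
	_, ok1 := c.Get(1)
	h3, ok3 := c.Get(3)
	fmt.Println(ok1, ok3, h3[0] == 0xcc)
	// → false true true
}
```

At ~40–50 bytes per entry including map/list overhead, 10,000 entries stay in the hundreds-of-KB range, versus megabytes for full-header caching — the 16× trade-off the PR describes.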
…20746) The `v1-` prefix check and `SchemeMinSupportedVersions` version floor in `CheckSnapshotsCompatibility` were guards for the upgrade window from pre-3.1 snapshot naming. At 3.5 that window is long closed.

- Removes `CheckSnapshotsCompatibility` and `SchemeMinSupportedVersions`
- Removes the `sanityOldNaming` field and `SanityOldNaming()` method from `AggOpts`
- Removes the unconditional call in rpcdaemon startup
- Deletes `version_schema.go` (its only job was populating `SchemeMinSupportedVersions`); `init()` now calls `InitSchemasGen()` directly
…#20944) This PR fixes logic around handling shortened keys in commitment files:

- `ValuesPlainKeyReferencingThresholdReached`: as a general rule, if a file contains 1 step it uses plain keys, otherwise references. This works today because all merged files contain an even number of steps (2, 4, 8, 16, 32, …). The problem is that the logic uses a modulo parity check, implying a range with an odd number of steps should contain plain keys as well. Because I'm modifying the step-rebase command to allow an arbitrary number of steps inside a merged range, converting an existing node so that it ends up with an odd number of steps would crash on the next merge.
- I noticed a very similar function, `MayContainValuesPlainKeyReferencing`, which gets the formula mostly right but misses files with 2 steps. It is used only by integrity checks, currently skips 2-step files, and is IMO incorrect. I'm removing that near-duplicate function and retrofitting all callers to use `ValuesPlainKeyReferencingThresholdReached`.
…dition (#20872)

## Summary
Fix two related cancellation bugs in `.github/workflows/release.yml`:

1. **Downstream jobs ignored cancellation.** `build-debian-pkg`, `publish-docker-image`, and `publish-release` used `if: always() && ...`, which kept dispatching them onto runners even after the operator cancelled the workflow. The intermediate `contains(needs.X.result, 'failure')` checks did not match `'cancelled'`, and `always()` explicitly bypasses workflow-level cancellation.
2. **Rollback skipped cancellations.** `In-case-of-failure` enumerated specific failure modes (`contains(..., 'failure')` on each upstream) and required `build-release.result == 'success'`. Cancellations and skipped-due-to-upstream cases didn't satisfy any branch, so the git tag was left on the remote whenever the operator cancelled. The rollback step itself also assumed the tag was always present, so calling it on a workflow that never reached the tag-push step would emit a confusing "tag does not exist" failure.

## Changes per job

### build-release — no `if:` change
Always runs. The tag-push step (line 119) is still gated by `inputs.perform_release && inputs.release_version != ''`, so a dry run never creates a tag.

### test-release — no `if:` change
The job-level `if:` remains `! inputs.skip_tests`. Skipped tests yield result `'skipped'`, which downstream jobs continue to treat as a pass — same as before.

### build-debian-pkg
```diff
-if: always() && contains(needs.build-release.result, 'success') && !contains(needs.test-release.result, 'failure')
+if: !cancelled() && contains(needs.build-release.result, 'success') && !contains(needs.test-release.result, 'failure')
```
- Full success: runs.
- `test-release` failed: skipped.
- `test-release` skipped (`skip_tests=true`): runs.
- `build-release` failed/cancelled: skipped.
- **Workflow cancelled: skipped** — was the bug; previously ran because `'cancelled'` does not contain `'failure'`.

### publish-docker-image
Same change as `build-debian-pkg`, same behaviour matrix. The internal docker push step is additionally gated by `inputs.perform_release` for dry-run support.

### publish-release
```diff
-if: always() && contains(... 'success') && contains(... 'success') && contains(... 'success')
+if: !cancelled() && contains(... 'success') && contains(... 'success') && contains(... 'success')
```
- Full success: runs.
- Any upstream failed/cancelled/skipped: skipped.
- **Workflow cancelled in the window between all upstreams finishing as success and this job starting: skipped** — was the bug.
- The internal `gh release create` step keeps its current `set +e` soft-failure behaviour **intentionally**: a half-published draft must be completable manually using the printed instructions and the already-built artefacts. The job is by design not allowed to fail. `publish-release.result` therefore stays `'success'` even when the GitHub release object itself was not created, and rollback does not fire — preserving the tag and artefacts for the operator's manual recovery.

### In-case-of-failure
The condition is simplified from a four-branch failure-mode enumeration to a single positive invariant: roll back whenever the release did not fully publish.
```diff
 if: >-
-  always() && inputs.perform_release &&
-  contains(needs.build-release.result, 'success') &&
-  (
-    contains(needs.test-release.result, 'failure') ||
-    contains(needs.build-debian-pkg.result, 'failure') ||
-    contains(needs.publish-docker-image.result, 'failure') ||
-    contains(needs.publish-release.result, 'failure')
-  )
+  inputs.release_version != '' &&
+  needs.publish-release.result != 'success'
```
`!= 'success'` matches `'failure'`, `'cancelled'`, and `'skipped'` uniformly. The `build-release.result == 'success'` precondition is removed so rollback also runs when `build-release` was cancelled mid-flight (the tag may already have been pushed at line 119 before the cancel signal arrived).

The rollback step is now idempotent via `git ls-remote --exit-code --quiet --tags`:
```diff
-git push -d origin ${RELEASE_VERSION}
+if git ls-remote --exit-code --quiet --tags origin "${RELEASE_VERSION}"; then
+  git push -d origin "${RELEASE_VERSION}"
+  echo "Tag ${RELEASE_VERSION} removed."
+else
+  echo "Tag ${RELEASE_VERSION} not present on remote — nothing to rollback."
+fi
```

## Behaviour matrix (assume `perform_release=true` unless noted)

| Scenario | publish-release | rollback runs |
|---------------------------------------------------|-----------------|---------------|
| Happy path | success | no |
| Dry run (`perform_release=false`) | success | no |
| `skip_tests=true`, otherwise success | success | no |
| build-release fails (tag may or may not exist) | skipped | **yes** |
| test-release fails | skipped | **yes** |
| build-debian-pkg fails | skipped | **yes** |
| publish-docker-image fails | skipped | **yes** |
| Cancel during build-release | skipped | **yes** |
| Cancel during test-release | skipped | **yes** |
| Cancel during build-debian-pkg | skipped | **yes** |
| Cancel during publish-docker-image | skipped | **yes** |
| Cancel during publish-release | cancelled | **yes** |
| Cancel after publish-release succeeded (too late) | success | no |
| `gh release create` fails (by design, soft-fail) | success | no |
| Operator cancels rollback itself | - | cancelled — tag may persist; manual cleanup |

## Step-level conditions left unchanged
- `if: always()` on the diagnostic torrent-client-status upload (test-release line 420). Step-level `always()` is contained to the job; it lets us capture diagnostics even when the QA test step itself failed.
- `if: ${{ inputs.perform_release }}` on the publish-side steps (lines 78, 113, 586, 627, 696). These gate API calls and tag creation for dry-run support.
- `if: ${{ ! inputs.disable_version_check }}` on the version validation step (line 164).

## Test plan
- [ ] Dispatch with `perform_release: false` (dry run) — confirm all jobs run as before, no tag created, rollback not triggered.
- [ ] Dispatch with `perform_release: true` and let it complete normally — confirm release published, rollback not triggered.
- [ ] Dispatch with `perform_release: true`, cancel during `test-release` — confirm `build-debian-pkg` and `publish-docker-image` are skipped (not dispatched), rollback runs and removes the tag.
- [ ] Dispatch with `perform_release: true`, cancel during `build-debian-pkg` — confirm `publish-release` is skipped, rollback runs and removes the tag.
- [ ] Dispatch with `perform_release: true` against a non-existent commit/branch to force a `build-release` failure — confirm rollback runs and reports "nothing to rollback" (idempotent).

Cherry-pick to `release/3.4` will follow once this is merged to `main`.