[pull] main from erigontech:main#5

Merged
pull[bot] merged 538 commits into Dustin4444:main from erigontech:main
May 4, 2026

Conversation


pull[bot] commented May 4, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

AskAlexSharov and others added 30 commits April 10, 2026 04:25
Update `github.com/erigontech/fastkeccak` to latest master (`08e7b660`)
to pick up erigontech/fastkeccak#8
Using &result (double indirection on interface{} already holding a
pointer) prevented correct unmarshalling of JSON null responses. Fix
mirrors go-ethereum PR #26723.

For erigon fixes:

https://github.com/orgs/erigontech/projects/21?pane=issue&itemId=94164858&issue=erigontech%7Csecurity%7C7

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
## Summary

The `--chain=dev` mode for Erigon has been updated so that it creates a dev beacon chain rather than relying on Clique consensus. In dev mode Erigon now runs the same PoS process as mainnet, including the same forking and block production code.

A single command starts a fully operational PoS node:

./build/bin/erigon --chain=dev --beacon.api=beacon,validator,node,config

### What's included

- **Programmatic beacon genesis** (`cl/clparams/devgenesis/`): builds a valid Deneb genesis state with deterministic BLS validators, sync committees, participation lists, and execution payload header — no external tooling needed.

- **Embedded dev validator** (`cl/validator/devvalidator/`): polls the Beacon API for proposer duties, fetches block templates, signs with proper BLS domain separation (RANDAO, block, attestation), and submits via the standard `/eth/v2/beacon/blocks` endpoint.

- **Same code paths as mainnet**: blocks flow through Caplin's fork choice, `NewPayload`, `ForkChoiceUpdated`, and the staged sync pipeline. No mock consensus or special-cased execution.

### New flags

| Flag | Default | Description |
|------|---------|-------------|
| `--dev-validator-seed` | `devnet` | BLS key derivation seed |
| `--dev-validator-count` | `64` | Number of genesis validators |
| `--dev.slot-time` | `6` | Seconds per slot (minimum: 2) |

`--beacon.api=beacon,validator,node,config` is required to enable the Beacon API endpoints used by the embedded validator.

### Removed

- `--mine` and `--dev.period` are no longer used for `--chain=dev`
- Clique genesis configuration removed from dev mode

### Dependencies

Built on top of #20190 (caplin minimal preset support), which is now
merged.

## Test plan

- [x] Blocks produced every slot (tested at 6s and 2s slot times)
- [x] Epoch boundaries crossed cleanly (slots 1-16+, minimal preset 8
slots/epoch)
- [x] EL validates and commits each block (head advances)
- [x] No panics or errors over sustained runs
- [x] `make lint` clean
- [x] `make erigon integration` builds
- [ ] CI passes

---------

Co-authored-by: kewei-bot <kewei.train@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Close btree cursor when `Next()` advances past the prefix boundary in
the FILE_CURSOR merge loop
- Drain heap on function return, closing any remaining btree cursors
- no `ToUpper()` call on each format
- no `logEnvBool` call on each format
an alternative for: #20148

Move the mutex from the log handler to the file writer. It seems we used a mutex in the handler to reduce allocations (via a shared buffer, in the past), but we no longer use any shared buffer there.

`TestStreamHandlerNoContention` with 5K goroutines: `5s -> 7ms`
## Summary

  - add an early `new-step-size == 0` guard in `stepRebase`
  - prevent division-by-zero panic paths in modulo/division checks
- return a clear validation error before any filesystem operations begin
**Since the 'sync from scratch' test creates an archive node, we can run the RPC test suite afterwards.**

This approach has two benefits:

1. We are no longer reliant on a pre-built database, avoiding the fragilities inherent in that scenario.

2. We test a database that is entirely the product of the code under test. Note that when a pre-built database is used for RPC testing, only the code that responds to requests is exercised, not the code that builds the database.
…20481)

- Replace `ERIGON_REBUILD_CONCURRENT_COMMITMENT` env var with the
existing `statecfg.ExperimentalConcurrentCommitment` global in
`RebuildCommitmentFiles()`
- Both rebuild paths (with-history and no-history) now use the
`--experimental.concurrent-commitment` CLI flag as the single source of
truth for trie variant selection
- Update test to set/restore the global directly instead of using
`t.Setenv`
…st (#20444)

Fixes #20355

- Skip DB entries whose step falls within the file range (`step.ToTxNum
< files.EndTxNum()`) in both the initial DB cursor push and the advance
loop — they are never loaded into the heap
- Matches the "don't query DB for data in files" pattern from #20360,
#20361, #20362
- Add `TestDomain_IteratePrefix_PrefersFilesOverDB` that simulates
partial lexicographic prune and verifies file values win

Follow-up: #20474 applies the same fix to
`DomainLatestIterFile.initCursorMDBX()` and `advanceInFiles()`

## Test plan
- [x] New test passes with fix, fails without it
- [x] `make lint` clean
- [x] `make erigon integration` builds
…ure (#20486)

### Problem
 
When Erigon is running at chain tip, `MergeLoop` executes merge steps
back-to-back with no pause between iterations. Each merge step involves
heavy disk I/O (reading, compressing, and writing state files). Running
these steps consecutively saturates the disk, starving block execution
of I/O bandwidth.
 
The result is periodic block processing stalls: the node's reported
block number freezes for minutes at a time while background merges
consume all available I/O, then bursts forward when a merge step
completes. During these stalls the node falls behind the chain tip and
is marked unhealthy by load balancers.
 
### Observed behavior
 
On a production fleet running Erigon v3.3.x on AWS Graviton instances
(64GB RAM, EBS gp3 volumes), we observed the following pattern during
MergeLoop activity on individual nodes:
 
- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of
continuous merge activity
 
The node always recovers eventually, but the stalls cause the node to be
removed from load balancer rotation, reducing fleet capacity.
 
### Solution
 
Add a configurable delay between `MergeLoop` iterations via the `ERIGON_MERGE_THROTTLE_MS` environment variable (default 0, preserving current behavior). The delay is inserted after each successful `mergeLoopStep`, giving block execution a window to access the disk before the next merge step begins.
 
```
Before (current):
  mergeLoopStep()  → heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
 
After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
```
 
### Production results
 
We have been running this patch on a 3-node production fleet since
December 2025. Results:
 
- Individual node availability during merge-heavy periods improved from
~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just
spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with
the remaining downtime caused by synchronized stalls that this patch and
`AGGREGATION_DELAY_MS` (PR #20391) address together
 
Recommended values based on our testing:
 
| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |
 
### Notes
 
- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces
I/O pressure *within* each merge step by limiting worker parallelism.
This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391,
merged) which staggers the *start time* of aggregation across fleet
nodes.
- No impact on single-node deployments or initial sync (default delay is
0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
## Summary
Fixes a panic seen in the wild:

```
panic: runtime error: index out of range [4] with length 1
  cl/sentinel.(*Sentinel).findPeersForSubnets.func1
        cl/sentinel/discovery.go:80
```

A malformed/short `attnets` ENR entry decodes into a
`bitfield.Bitvector64` shorter than the expected 8 bytes, so indexing
`peerSubnets[subnetIdx/8]` goes out of range when `subnetIdx >= 8`.

Adds a `len(peerSubnets) == 8` guard at all three call sites in
`cl/sentinel/discovery.go` (`findPeersForSubnets` filter, post-connect
coverage update, and `onConnection` underserved-subnet check).
Fixes #20238
## Summary

- Treat deletion entries (`len(v)==0`) as authoritative in
`getLatestFromDb`, preventing fallthrough to `getLatestFromFiles` which
returns stale pre-deletion data from frozen snapshots (root cause of
`gas used mismatch` diffs of exactly 17100 = `SSTORE_SET -
SSTORE_RESET`)
- Fix `unwind()` to distinguish `nil` (different step, skip) from
`[]byte{}` (key was deleted, write tombstone) — previously both were
skipped via `len(value) > 0`, causing deletion markers to be lost after
unwind
- Add LargeValues cross-check in `getLatest` for CodeDomain/RCacheDomain
where interrupted `PruneSmallBatches` can leave stale old-step
tombstones

## Test plan

- [x] Added `TestDomain_DeletedKeyNotResurrectedByFiles` — verifies
`getLatestFromDb` returns deletion entries as authoritative regardless
of step age
- [x] Added `TestDomain_UnwindRestoresDeletionMarker` — verifies
`unwind()` writes empty tombstones when reverting a re-write of a
previously deleted key (covers both DupSort and LargeValues paths, with
serialization round-trip through `ChangeSets3`)
- [x] `make lint && make test-short` pass
- [x] Verified on Sepolia full re-sync from snapshots: block 10624211
(previously failing with 17100 gas mismatch) passes cleanly, node tracks
chain tip with no errors
```
  1. Node struct is cache-unfriendly (biggest win potential)

  type node struct {
      val any     // 16 bytes (interface = type ptr + data ptr)
      n0  *node   // 8 bytes
      n1  *node   // 8 bytes
      p0  uint32  // 4 bytes
      p1  uint32  // 4 bytes
  }                // = 40 bytes per node

  Every mf2.top.n0, mf2.top.p0 etc. is a pointer chase to a heap-scattered node. With thousands of nodes, this hammers L1/L2 cache.

  Fix: Flatten into a []node arena, replace *node with uint32 index:

  type node struct {
      p  [2]uint32  // paths (indexed by bit)
      n  [2]uint32  // child indices (0 = nil)
      val uint32    // index into code2pattern, 0 = no value
  }                 // = 20 bytes, no pointers → GC-invisible

  This gives contiguous memory, eliminates GC scanning, and halves node size.
```

Bench showing 20% faster on 64K dict size

Reason: during a large merge (large-file word-level compression), MatchFinder.unfold is a bottleneck:
<img width="2172" height="697" alt="Screenshot 2026-03-24 at 22 41 24"
src="https://github.com/user-attachments/assets/5a35614d-61d3-4bdb-9c86-06020565302f"
/>

---------

Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
…0495)

## Summary
- `shouldGenerateChangeSets` no longer short-circuits on `initialCycle`
— changesets are generated for blocks within `MaxReorgDepth` of the
batch end regardless, so the node can always handle reorgs at the tip
- Removes the `initialCycle` parameter from `shouldGenerateChangeSets`

Cherry-picked from #20445.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#20432)

This fixes a crash path when decoding an empty stateless witness. Some
downstream witness logic assumes at least one header is present and may
access `Headers[0]`, so accepting an empty witness can lead to a panic
later. The change adds a small validation in `fromExtWitness` to reject
witnesses with no headers during decode, and includes a regression test
for the RLP decode path.
…tion (#20287)

## Summary

Adds `TrieReader` — a lightweight, stateless read-only navigator for the
Patricia trie. Each `Lookup` call starts from the root and descends
independently without any mutable grid state.

## Motivation

`HexPatriciaHashed` carries ~200KB of mutable grid state
(`grid[64][16]cell`, `touchMap`, `afterMap`, etc.) designed for update
processing. For read-only path queries (proofs, witnesses, integrity
checks, tooling) this is unnecessary.

## Design

```go
type TrieReader struct {
    ctx           PatriciaContext
    accountKeyLen int16
    keccak        keccak.KeccakState
    hashBuf       [length.Hash]byte
}

func NewTrieReader(ctx PatriciaContext, accountKeyLen int) *TrieReader
func (tr *TrieReader) Lookup(hashedKey []byte) (cell, bool, error)
```

- **Stateless**: no grid, no activeRows. Each call allocates only a
small nibble prefix buffer
- **Direct descent**: calls `ctx.Branch(compact(prefix))` at each level,
parses cells from branch data
- **`deriveHashedKeys`**: reconstructs hashed nibble extensions from
plain keys (accountAddr/storageAddr) via keccak — the same step
`unfoldBranchNode` does. This is critical for navigating past leaf cells
into storage sub-tries
- **Storage support**: account cells with hash + more key to consume
continue into the storage sub-trie at depth 64+
- **Nibble validation**: rejects invalid nibbles (> 0x0f) with a clear
error

### Exported cell accessors

```go
func (c *cell) AccountAddrLen() int
func (c *cell) StorageAddrLen() int
func (c *cell) HashLen() int
func (c *cell) GetAccountAddr() []byte
func (c *cell) GetStorageAddr() []byte
func (c *cell) CellHash() []byte
func (c *cell) Extension() []byte
```

## Tests

| Test | What it verifies |
|---|---|
| `TestTrieReader_RoundTripWithHPH` | 7 accounts + 3 storage keys
through real `HexPatriciaHashed.Process` → TrieReader lookup |
| `TestTrieReader_RoundTripWithHPH_ManyAccounts` | 100 accounts, all 100
found |
| `TestTrieReader_IntegrationWithRealData` (db/state) | 200 accounts +
100 storage via full domain/aggregator/snapshot stack |
| 9 mock unit tests | Account/storage hits, misses, extensions,
multi-level descent, error propagation |

## Files

- `execution/commitment/trie_reader.go` — implementation + cell
accessors
- `execution/commitment/trie_reader_test.go` — 12 tests (mock + HPH
round-trip)
- `db/state/trie_reader_integration_test.go` — full-stack integration
test
- `execution/commitment/commitmentdb/commitment_context.go` —
`NewTrieContextRo` for read-only branch access

## Related

- Part of commitment evolution: #13884
SimulationIntraBlockStateReader, used by eth_simulateV1, implements
HasStorage by querying only RangeAsOf(firstMinTxNum), which reflects the
canonical state at the start of the first simulated block. Storage
written by prior simulated blocks lives exclusively in the RAM batch
(sd.GetMemBatch()) and is invisible to RangeAsOf. When a subsequent
simulated block calls HasStorage on a newly deployed contract, it
incorrectly returns false.
The concrete impact: EIP-7610 requires CREATE/CREATE2 to revert if the
target address already has storage but no code/nonce (collision
detection). With the bug, the collision is not detected and CREATE2
succeeds where it should fail.
Fix: added HasPrefixInRAM to the TemporalMemBatch interface, scanning
only the in-memory btree for StorageDomain without touching disk.
HasStorage now checks the RAM batch first before falling back to
RangeAsOf.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary of changes for compliance with ethereum/execution-apis PR 762 (see ethereum/execution-apis#762).

The changes fix the EVM tracer JSON output (structLog) to align with the
Ethereum JSON-RPC specification.
---
1. Last memory word truncation
Before: the loop used `i+32 <= len(memData)`, silently dropping the last chunk when the memory size was not an exact multiple of 32 bytes.
After: the loop iterates over all bytes; the last partial chunk is zero-padded to 32 bytes, as required by the spec (each memory word is always 32 bytes).
---
2. Missing 0x prefix on storage and memory
Before: storage keys/values and memory words were serialized as raw hex (e.g. "0000...0001").
After: all hex values are emitted with the 0x prefix (e.g. "0x0000...0001"), as required by the Ethereum JSON-RPC specification.
---
3. error field present even without errors
Before: the "error" field was serialized as "error": "" even when no error occurred.
After: with omitempty the field is omitted entirely when there is no error, making the output cleaner and consistent with the behavior expected by clients and Hive tests.
---
Impacted APIs
* debug_traceTransaction
* debug_traceBlockByNumber 
* debug_traceBlockByHash
* debug_traceCall
* debug_traceCallMany


All three fixes are covered by unit tests in the new json_stream_test.go
file.

Out of scope for this PR: keywords to enable/disable these fields, and their default values.

### Tracer Configuration Comparison
| Field | Erigon | Geth | Spec #762 | Aligned |
| :--- | :--- | :--- | :--- | :---: |
| **Memory** | `disable=false` (ON) | `Enable=false` (OFF) | `enable=false` (OFF) | ❌ |
| **Stack** | `disable=false` (ON) | `Disable=false` (ON) | `disable=false` (ON) | ✅ |
| **Storage** | `disable=false` (ON) | `Disable=false` (ON) | `disable=false` (ON) | ✅ |
| **Return** | `disable=false` (ON) | `Enable=false` (OFF) | `enable=false` (OFF) | ❌ |

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…he tests (#20498)

## Summary

- Remove `TestStateCacheDeletedStorageSSTOREGas` — it duplicated
`TestStateCache_PutEmpty_ThenGet_IsCacheHit` and incorrectly attributed
the SSTORE gas mismatch (#20169) to the StateCache layer
- Simplify the comment on `TestStateCache_PutEmpty_ThenGet_IsCacheHit`
to frame it as a cache correctness property, not a regression test for
#20169

The StateCache fix (PR #20265) is a valid cache correctness improvement,
but the root cause of #20169 was in the domain/snapshot layer — a cache
miss falling through to the membatch/MDBX path should return the correct
value. The real fix was PR #20483.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This PR updates `eth_getLogs` so that when the requested block range is larger than the configured range limit, Erigon returns `-32602` (`invalid params`) instead of `-32000` (`server error`). This makes the behavior consistent with the other parameter validation errors already used in the same `eth_getLogs` flow and aligns Erigon with other clients such as Reth.

The change is implemented in `getLogsV3` by returning an RPC
invalid-params error for the block range limit check, while keeping the
existing error message format unchanged. The related test was also
updated to verify the exact RPC behavior, including that the returned
error code is `-32602` and that the message remains `query block range
exceeds server limit, narrow your filter: 5.`
- Fix `SnapBlocksRead` to cap the iteration upper bound with `to`
instead of the hardcoded `2`
- Keep the existing function signature and all other integrity flows
unchanged
- Restore the intended bounded `[from, to)` behavior for future or
direct bounded checks
Goal: reduce mutex contention and make txs cheaper to create (increase throughput).

In this PR:
- add `agg.visible` atomic field
- all visible data is recalculated and stored there
- all existing atomics (like `visibleFilesMinimaxTxNum`) moved inside the `agg.visible` object, so they are recalculated together with the other data (consistency)
- `Aggregator.BeginFilesRo()` is now lock-free
- I guess salt can move there too - but later
- `domain.BeginFilesRo()` is now a private method and accepts the `visible` object as a parameter (`domain._visible` field removed)

---------

Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Cherry-pick from `release/3.4` to `main`:

- #19677 agg: workers presets. ressplit workers
- #19919 Revert "flush: use etl.IdentityLoadFunc instead custom. part2"
- #19780 etl: zero-copy memDataProvider
- #19941 d_lru: disable for commitment
- #19942 TemporalMemBatch: re-use vals-slice when can
- #19996 etl: pool of bufwriter
- #19995 collate: replace bitmap by array
- #20002 skill creator review results
- #20033 seg: revert global limiter
- #20046 Caplin: prevent calling `glob` per file in
`BuildMissingIndices`
- #20113 seg: more usage of bufio
- #20194 execution/state: revert CodeSizePath in codeChange journal
entry
- #20431 remove `flush complete` log line
- #20440 etl: munmap temp files in Dispose to prevent disk space leak

93 commits skipped due to conflicts (branches diverged significantly).

---------

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: moskud <sudeepdino008@gmail.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Replace the boolean --rpc.testing flag (engine-port-only, JWT-required)
with the standard --http.api mechanism: adding "testing" to the list
(e.g. --http.api eth,erigon,testing) registers testing_buildBlockV1 on
the plain HTTP port (8545) without JWT, matching Geth's behaviour.

The namespace is disabled by default to prevent accidental production
exposure. In standalone rpcdaemon mode the entry is silently skipped
with a Warn log, since block assembly requires direct execution service
access unavailable over gRPC.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
)

## Summary
- Fix BAL (Block Access List) recording bug where `coinbase ==
transaction target` caused storage/code/nonce reads to be dropped from
the BAL
- `finalizeTx` used `delete(result.TxIn, result.Coinbase)` which removed
the entire address from the shared read-set map, destroying execution
reads needed by `AsBlockAccessList`
- Replace with `result.TxIn.Delete(addr, AccountKey{Path: BalancePath})`
to surgically remove only the stale balance read, preserving all other
reads

## Test plan
- [x] Unskipped
`stBugs/random_statetest_default_minus_tue_07_58_41_minus_15153_minus_575192`
in `TestExecutionSpecBlockchainDevnet`
- [x] Both amsterdam variants pass (base + london)
- [x] `make lint` clean
- [x] `make erigon integration` builds
- [ ] Full `TestExecutionSpecBlockchainDevnet` suite on Linux CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
## Summary

- Wire the existing `RevertWithDiffset` into `unwindExec3` so the state
cache is surgically invalidated during unwinds — keys touched by the
unwound blocks are evicted, untouched keys remain cached
- Previously `RevertWithDiffset` was implemented but never called; the
cache survived unwinds only because `ValidateAndPrepare` cleared it on
the next block (hash mismatch fallback)
- The `lastExecHash` (already fetched with a comment anticipating this
wiring) detects if the cache was modified by a different execution path
(e.g. rolled-back `ValidatePayload`), falling back to a full clear

## Context

Follow-up to #20265 and #20483. Those PRs fixed correctness bugs in the
cache layer and domain unwind; this PR completes the picture by actually
invalidating the cache during unwinds rather than relying on the
next-block safety net.

## Test plan

- [x] `TestRevertWithDiffset_SurgicalEviction` — touched keys evicted,
untouched preserved, blockHash updated
- [x] `TestRevertWithDiffset_HashMismatch_ClearsAll` — mismatched
revertFromHash triggers full clear
- [x] `TestRevertWithDiffset_AccountChange_EvictsCode` — account diffs
also evict corresponding code entries
- [x] `TestRevertWithDiffset_ThenValidateAndPrepare_Continuity` —
end-to-end: surgical revert then re-execution preserves surviving cache
data
- [x] `make lint` clean
- [ ] Full sync test on a testnet with unwinds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Story: part of `bt.Get()` optimization, itself part of block execution optimization. Currently `bt.Get` allocates a lot because it calls `.Next(nil)`. It can call `.MatchCmp()` instead: `MatchCmp` is zero-copy (even in the compressed case) and allows partial-key decompression.

MatchCmp: `58ns 1 alloc -> 24ns 0 alloc`

```

  ┌──────────────────────┬──────────┬───────────┐
  │        Method        │ hit (ns) │ miss (ns) │
  ├──────────────────────┼──────────┼───────────┤
  │ MatchCmp (streaming) │ 43.5     │ 24.1      │
  ├──────────────────────┼──────────┼───────────┤
  │ MatchPrefix          │ 40.7     │ 7.9       │
  ├──────────────────────┼──────────┼───────────┤
  │ Next + buf reuse     │ 39.4     │ —         │
  └──────────────────────┴──────────┴───────────┘


  ┌─────────────────────────┬──────────┬───────────┐
  │         Method          │ hit (ns) │ miss (ns) │
  ├─────────────────────────┼──────────┼───────────┤
  │ MatchCmpUncompressed    │ 8.1      │ 7.3       │
  ├─────────────────────────┼──────────┼───────────┤
  │ MatchPrefixUncompressed │ 8.3      │ 7.1       │
  ├─────────────────────────┼──────────┼───────────┤
  │ NextUncompressed        │ 5.3      │ —         │
  └─────────────────────────┴──────────┴───────────┘
```

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
yperbasis and others added 27 commits April 29, 2026 03:22
)

## Summary

- Hive `engine-cancun / Invalid PayloadAttributes, Missing BeaconRoot,
Syncing=True (Cancun)` (test 376 of [this
run](https://hive.ethpandaops.io/#/test/generic/1777279481-85063528cea3c966b05ec4fef21a48ad?testnumber=376))
was failing because Erigon validated `parentBeaconBlockRoot` before the
SYNCING short-circuit and returned `-38003` when the head was unknown.
- Split `validatePayloadAttributes` into request-level (pre-FCU) and
point-(8)-extension (post-FCU) halves so the spec-mandated ordering is
preserved on both the SYNCING and VALID paths.
- The fix that motivated #20728 (hive `engine-withdrawals / Empty
Withdrawals (Paris)`) still passes — its check (V2 wrong withdrawals
presence) is in the pre-FCU group.

## Why


[`cancun.md`](https://github.com/ethereum/execution-apis/blob/main/src/engine/cancun.md#engine_forkchoiceupdatedv3)
says fcuV3 _extends_ point (8) of `engine_forkchoiceUpdatedV1`:

> 2. **Extend point (8)** of the `engine_forkchoiceUpdatedV1`
specification by defining the following sequence of checks that **MUST**
be run over `payloadAttributes`:
> 1. `payloadAttributes` matches the `PayloadAttributesV3` structure,
return `-38003: Invalid payload attributes` on failure.
> 2. `payloadAttributes.timestamp` does not fall within the time frame
of the Cancun fork, return `-38005: Unsupported fork` on failure.
> 3. `payloadAttributes.timestamp` is greater than `timestamp` of a
block referenced by `forkchoiceState.headBlockHash`, return `-38003:
Invalid payload attributes` on failure.

…and point (8) of V1 (paris.md) gates the entire processing flow on the
head being `VALID`:

> 8. Client software **MUST** process provided `payloadAttributes`
**after successfully applying the `forkchoiceState`** and **only if the
payload referenced by `forkchoiceState.headBlockHash` is `VALID`**.

So with an unknown head, point (8) doesn't fire and the response is the
SYNCING one in point (9). The `Syncing=True` family of hive cancun tests
pin exactly this.

The pre-FCU "wrong version of structure" rule is different — it lives in
the V2 [Request
section](https://github.com/ethereum/execution-apis/blob/main/src/engine/shanghai.md#engine_forkchoiceupdatedv2)
and has no precondition on head state, which is why #20728 correctly
moved it before the short-circuit.

## What changed

- `validatePayloadAttributesPreFCU` keeps the request-level checks:
`parentBeaconBlockRoot` on V1/V2, V1/V2 fcu at a Cancun timestamp,
withdrawals presence vs Shanghai. Runs before the SYNCING short-circuit.
- `validatePayloadAttributesPostFCU` runs the V3 point-(8) extensions:
`parentBeaconBlockRoot != null` for V3+, Cancun-window for V3+,
`SlotNumber != null` for V4+. Runs only when `status.Status ==
ValidStatus`.
- Two regression tests in `testing_api_test.go`:
- `TestForkchoiceUpdatedV3DefersAttributesValidationWhenSyncing` — pins
the hive `Syncing=True` scenario (no error, status=SYNCING).
- `TestForkchoiceUpdatedV3RejectsMissingBeaconRootWhenValid` — pins the
spec-mandated `-38003` when the head is VALID so the deferral can't
regress into a permanent swallow.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
… mainnet (#20815)

This PR enables mainnet historical support using the following transport types:
- http
- http compression
- websockets
## Summary

- Add `cannot unwind past 0` checks before sender and tx lookup stage
unwind arithmetic.
- Guard `state_stages` one-shot and loop unwind paths before subtracting
from unsigned block numbers.
- Clamp the backward-loop stop boundary at genesis to avoid uint64
underflow while preserving the existing integration command flow.
… hash (#20825)

closes #20888

implements ethereum/execution-apis#786

which is needed for glamsterdam-devnet-0

---------

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
…20328)

## Problem

Tests in `execution/engineapi` were failing intermittently in the
`race-tests / tests-linux (ubuntu-latest, execution-other)` CI job:

- `TestEngineApiHighGasContractsFillBlock` → `listen tcp
127.0.0.1:46053: bind: address already in use`
- `TestEngineApiBlockGasOverflowSpillsToNextBlock` → `empty response
from JSON-RPC server`

Observed on commits a9db702 and acc9898 (2 out of 10 recent
commits).

**Root cause:** The test infrastructure used `freeport.NextFreePort()`
to pick a free port, then later called `net.Listen()` with that port.
This TOCTOU (time-of-check/time-of-use) race meant another concurrent
test could bind the same port in between, causing `address already in
use`. Separately, the server might not be ready when the first client
request arrived, causing `empty response from JSON-RPC server`.

## Fix

Pre-create `net.Listener` instances on port `:0` (kernel-assigned,
guaranteed free and already bound) before any server starts, then hand
the bound socket directly to `StartHTTPEndpoint`. The kernel guarantees
no other process can steal the port once it's bound.

**Changes:**
- `node/endpoints.go`: add optional `Listener` field to
`HttpEndpointConfig`; if set, skip the `net.Listen` call and use the
pre-bound socket
- `cmd/rpcdaemon/cli/httpcfg/http_cfg.go`: add `HttpListener` /
`AuthRpcListener` fields to `HttpCfg`
- `cmd/rpcdaemon/cli/config.go`: wire up pre-created listeners in both
the regular RPC and engine API listener paths
- `execution/engineapi/engineapitester/engine_api_tester.go`: replace
`freeport.NextFreePort()` with `net.Listen("tcp", "127.0.0.1:0")` for
the JSON-RPC and engine API ports

## Testing

- `go build ./...` ✅
- No `*_test.go` files modified ✅
- 5 consecutive `go test -race ./execution/engineapi/...` runs all pass
✅

CI job reference:
https://github.com/erigontech/erigon/actions/runs/23983504381/job/69951584262

---------

Co-authored-by: erigon-copilot[bot] <erigon-copilot[bot]@users.noreply.github.com>
Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>
Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
Co-authored-by: Matt Joiner <anacrolix@gmail.com>
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Andrew Ashikhmin <34320705+yperbasis@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary

Follow-up to #20788. Address review feedback from @anacrolix:

> I thought there was a helper in the websocket library for this so you
don't need to spin up a bunch of timers and goroutines?

The `coder/websocket` `Conn.Close(code, reason)` method already performs
the close handshake with bounded internal timeouts (5s write, 5s read),
so the manual goroutine + timer plumbing in `wsConnAdapter.Close` is
redundant. Replace it with a direct call.

The existing `TestWebsocketServerGracefulClose` test still passes —
clients see a clean `StatusNormalClosure` (1000) instead of
`StatusAbnormalClosure` (1006).

## Test plan

- [x] `go test ./rpc/ -run Websocket` — all pass
- [x] `make lint` — clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
close #17112 
This PR replaces the existing single-path compress/gzip middleware in
node/rpcstack.go with a dual-path implementation that distinguishes
between streaming and non-streaming responses.
a) Non-streaming responses (standard JSON-RPC calls such as
eth_getBlockByNumber):
- are now compressed in one shot using go-libdeflate, a lightweight CGo
wrapper around ebiggers/libdeflate. Benchmarks show ~1.75x speedup vs
stdlib gzip on ~30 KB JSON payloads (4.6 ms vs 8.0 ms, 122 MB/s vs 70
MB/s); under high concurrency the advantage is larger due to lower CPU
usage per request.
- Since the compressed size is known before writing, Content-Length is
now set to the exact compressed size, avoiding unnecessary
Transfer-Encoding: chunked.
- Responses smaller than 1 KB are sent uncompressed: at that size the
CPU overhead of initialising the compressor outweighs the benefit, and
for very small payloads the compressed output can exceed the input size
due to gzip framing overhead.
b) Streaming responses (e.g. debug_traceTransaction, trace_filter) are
detected via http.Flusher: when the RPC handler calls Flush() before
writing, the middleware switches to stdlib compress/gzip in streaming
mode, compressing trace data incrementally without buffering the full
response.
   
Changes:
- node/rpcstack.go: new gzipResponseWriter with buffer/stream dual mode;
4 sync.Pool instances for zero-allocation hot path
- rpc/http.go: injects http.Flusher into request context via
httpFlusherContextKey
- rpc/handler.go: streaming methods call flush before writing to
activate streaming mode
  - go.mod: adds github.com/erigontech/go-libdeflate v0.1.0 
  
# 🚀 Performance Benchmarks: Gzip Optimization
---
<details>
<summary><b>1. Isolated Compression Benchmarks (libdeflate vs
stdlib)</b></summary>

```diff
  Benchmark (Gzip isolation)    Latency      Throughput   Mem Alloc
- BenchmarkStdlibGzip           8.0ms/req    70 MB/s      66KB
+ BenchmarkLibdeflateGzip       4.6ms/req    122 MB/s     63KB   (~1.75x faster)
```

Note: we observed a ~1.75x speedup on a single thread. Under high
concurrency, the advantage is even greater due to reduced CPU overhead.

</details>

<details>
<summary><b>2. eth_getBlockByNumber with txs (Old SW vs Main
SW)</b></summary>

Old SW - results with instability and errors:

```diff
- [2. 1] qps: 2000 -> [R=97.74% max=9.103s error=503 Service Unavailable]
- [3. 3] qps: 2300 -> [R=98.50% max=5.841s error=503 Service Unavailable]
- [3. 4] qps: 2300 -> [R=99.80% max=5.634s error=503 Service Unavailable]
- [3. 5] qps: 2300 -> [R=98.25% max=5.877s error=503 Service Unavailable]
```

Main SW - stable results with 100% success rate:

```diff
+ [2. 1] qps: 2000 -> [R=100.00% max=141.22ms]
+ [2. 5] qps: 2000 -> [R=100.00% max=157.7ms]
+ [3. 1] qps: 2300 -> [R=100.00% max=217.23ms]
+ [3. 5] qps: 2300 -> [R=100.00% max=224.174ms]
```
</details>

<details>
<summary><b>3. trace_block </b></summary>

| Metric | Old SW | Main SW | Speedup |
| :--- | :--- | :--- | :--- |
| **Throughput** | 375.9 r/s | **455.0 r/s** | **+21%** |
| **p50 Latency** | 114.4 ms | **96.9 ms** | **1.18x** |
| **p90 Latency** | 232.0 ms | **181.2 ms** | **1.28x** |
| **Mean Latency** | 132.4 ms | **109.5 ms** | **1.21x** |

</details>

<details>
<summary><b>4. Executes all RPC using http, http-compressed and
websockets</b></summary>

```text
./run_all.sh -T http,http_comp,websocket
Run tests in parallel on localhost:8545/localhost:8551
Result directory:             /home/simon/silkworm/tests/rpc-tests3/integration/results
Time:                         2026-04-19 08:58:05.257624
Total round_trip time:        3:40:01.359044
Total marshalling time:       0:00:00.063308
Total unmarshalling time:     0:01:40.509521
No of json Diffs:             0
Test time-elapsed:            0:08:37.467460
Available tests:              1436
Available tested api:         112
Number of loop:               1
Number of executed tests:     4152
Number of NOT executed tests: 156
Number of success tests:      4152
Number of failed tests:       0
```
</details>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
- `seg reset` regressed in #18273: `removeLocal` from the CLI is
wired into `reset.Reset.RemoveUnknown` but not `RemoveLocal`, which
gates chaindata + Heimdall/PolygonBridge DB removal in
`db/datadir/reset/reset.go`. Result: `seg reset` no longer touches
chaindata at all.
- Library-level `TestResetChaindata` exists but sets `RemoveLocal:
true` directly, so the CLI plumbing was uncovered.

closes #20260
…pruning (#20915)

## Summary

`--exec.no-prune` previously only gated the state-aggregator's pruning
paths via `dbg.NoPrune()` (Domain/InvertedIndex in
`db/state/aggregator.go`, forkable in `db/state/forkable_agg.go`).
Stage-level pruning ran regardless:

- `PruneExecutionStage` was calling `rawdb.PruneTable` on
`kv.ChangeSets3` and `kv.BlockAccessList`.
- `PruneTxLookup`, `PruneWitnessProcessingStage` and
`SnapshotsPrune` (`PruneAncientBlocks`, `pruneCanonicalMarkers`,
`RetireBlocksInBackground`, `pruneBlockSnapshots`) had no guard.

## Fix

Add early-return guards on `dbg.NoPrune()` at the top of each stage
prune entrypoint so the flag genuinely blocks every path that removes
rows from MDBX. Each function still calls `s.Done(tx)` so the
staged-sync state machine doesn't re-enter the prune step on every
cycle.

Updated flag usage text to describe the wider semantic.

Companion bal-devnet-3 PR: #20914

## Test plan

- [x] `go test ./execution/stagedsync -run TestNoPrune` — new
`no_prune_test.go` seeds rows in `kv.ChangeSets3`,
`kv.BlockAccessList`, `kv.TxLookup`, `kv.BorWitnesses`; runs all
four prune entrypoints with `NoPrune=true`; asserts no rows deleted
and `PruneProgress` is recorded.
- [x] `make lint` — clean.
- [x] `make erigon` — builds.

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
- skipping pre-byzantium receipts because of
#20910

---------

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
)

## Summary
- The startup banner logs the full command line via
`log.RedactArgs(os.Args)` in `cmd/erigon/main.go`. URLs, IP addresses,
and `--datadir` are already redacted, but `--ethstats` was not — and its
value has the form `nodename:secret@host:port`, so the ethstats username
and password were being printed in the clear at every startup.
- This adds an `--ethstats` redaction following the same `--datadir`
pattern (matches `--ethstats=value`, `--ethstats value`, and the
single-dash variants), with tests.
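A minimal sketch of the idea, assuming a hypothetical `redactEthstats` helper (the real change extends `log.RedactArgs`; this is illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"strings"
)

// redactEthstats is an illustrative stand-in for the --ethstats handling
// added to log.RedactArgs. The flag value has the form
// nodename:secret@host:port, so the whole value is masked. It handles
// --ethstats=value, --ethstats value, and the single-dash variants.
func redactEthstats(args []string) []string {
	out := make([]string, len(args))
	copy(out, args)
	for i := 0; i < len(out); i++ {
		arg := strings.TrimLeft(out[i], "-") // accept -ethstats and --ethstats
		switch {
		case arg == "ethstats" && i+1 < len(out): // "--ethstats value" form
			out[i+1] = "***"
		case strings.HasPrefix(arg, "ethstats="): // "--ethstats=value" form
			eq := strings.Index(out[i], "=")
			out[i] = out[i][:eq+1] + "***"
		}
	}
	return out
}

func main() {
	fmt.Println(redactEthstats([]string{"erigon", "--ethstats=node:pw@host:3000"}))
	fmt.Println(redactEthstats([]string{"erigon", "-ethstats", "node:pw@host:3000"}))
}
```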

## Test plan
- [x] `go test ./common/log/v3/...` (new `TestRedactArgsEthstats` plus
existing tests pass)
- [x] `make lint` (0 issues)

Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
…20869)

This update resolves the matcher cleanup loop in the discover reply
handling code. When an expired matcher is removed from the list, calling
`Remove(el)` invalidates its links, therefore invoking `el.Next()`
subsequently may cause the loop to terminate prematurely. To avoid this,
the next element is preserved before removal, allowing iteration to
proceed appropriately across the entire list. This ensures that all
expired matchers in the same tick are processed and that `contTimeouts`
are not undercounted when multiple matchers expire simultaneously.
…uilder (#20886)

Adds `iidq.reader.Reset(0)` at the top of the retry loop in
`SimpleAccessorBuilder.Build`, matching `buildVI` (5c7fe1b) and
`buildHashMapAccessor`. Without it, retry iterations start with an
exhausted `PagedReader` and feed zero keys to recsplit, producing a
corrupt accessor index for forkable snapshots (headers/bodies).

---------

Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
Co-authored-by: moskud <sudeepdino008@gmail.com>
…es (#20892)

## Summary

Two fixes addressing all 29 failures in the Hive `legacy-cancun`
[run](https://hive.ethpandaops.io/#/test/generic/1777223434-c411e5be836d6d6f8a466489c2874cf3)
against `erigon_default`, plus a regression test that catches the
EVM-side bug locally.

### (a) `cmd/utils/app` — TD-based fork choice for `erigon import`

The Hive consensus simulator writes each block to its own `00NN.rlp`
file and runs `erigon import /blocks/00NN.rlp` for each — the simulator
does not implement TD/fork-choice itself, it relies on the client's
native chain selection. `InsertChain` always called
`UpdateForkChoice(importedTip)`, so every imported block became head
even when its TD was lower than the current canonical head's. For
`lotsOfLeafs` block 6 (TD `0x9CCF4`) is the canonical winner, but Erigon
kept switching head through blocks 7-13 (TD `0x9CC34`) and ended on
`0x26ad10c0…` instead of `0xf7f9ea97…`.

This affects **23 tests** (all Berlin/Istanbul/London variants):
`lotsOfLeafs`, `ChainAtoChainB_difficultyB`, `ForkStressTest`,
`CallContractFromNotBestBlock`, `lotsOfBranchesOverrideAtTheMiddle`,
`sideChainWithMoreTransactions`, `uncleBlockAtBlock3afterBlock4`,
`blockChainFrontierWithLargerTDvsHomesteadBlockchain[2]`.

**Fix:** before calling `UpdateForkChoice`, compare the imported chain's
total difficulty against the current head's TD using a shared
`headerdownload.ShouldReorg` helper (extracted from the existing
`FeedHeaderPoW` tie-break: TD → height → lex hash). If the imported
chain doesn't beat the current head, write the headers/bodies/TDs
straight to the DB as a side chain (no execution, no head change) so
future side-chain extensions can still validate. Once a side-chain
extension surpasses the canonical TD, the regular `InsertBlocks` +
`UpdateForkChoice` path triggers the reorg and executes those blocks.
PoS imports (all `difficulty=0`, TD never grows) explicitly bypass this
branch — `ShouldReorg`'s "shorter wins on tie" rule would otherwise
prevent the head from advancing past genesis.

### (b) `execution/vm` — STATICCALL touch for EIP-161 state clearing

`RevertPrecompiledTouch_d3g0v0_{Berlin,Istanbul,London}` and
`RevertPrecompiledTouch_storage_d3g0v0_{Berlin,Istanbul,London}` (**6
tests**) all touch precompile 3 (RIPEMD-160) via STATICCALL inside a
frame that runs out of gas. Expected: ripemd state-cleared (deleted) at
end of tx via the `journal.dirty(ripemd)` consensus quirk. Erigon left
it.

The CALL path in `evm.call` calls `Exist(addr)` first, which loads
ripemd into `stateObjects`; the subsequent `AddBalance(addr, 0)` then
falls through to `TouchAccount` and the famous quirk fires, so
`FinalizeTx` deletes ripemd. But the STATICCALL path skipped `Exist` and
called `AddBalance(addr, 0)` directly — in serial mode that hits the
`versionMap == nil && addr == ripemd && amount.IsZero()` shortcut that
uses `balanceIncrease` instead of `GetOrNewStateObject`. ripemd never
enters `stateObjects`, so on revert `FinalizeTx`'s `if !exist { continue
}` branch skips it.

**Fix:** replace `AddBalance(addr, u256.Num0, …)` with
`TouchAccount(addr)` for STATICCALL — same end behavior as CALL's
post-`Exist` flow (loads the account, hits the dirty quirk on touch).
Identical for non-ripemd / non-serial paths.

`d0` (CALL), `d1` (DELEGATECALL), `d2` (CALLCODE) variants were
unaffected: CALL goes through `Exist`+`Transfer`; DELEGATECALL and
CALLCODE don't touch the callee at all (matching geth and the post-state
expectations).

### (c) `execution/tests` — local regression test for the d3 RIPEMD path

Add `TestLegacyCancunState`, walking
`legacy-tests/LegacyTests/Cancun/GeneralStateTests` in state-test
format. `TestLegacyBlockchain` (`block_test.go`) explicitly skips the
`LegacyTests/` subtree and `TestState` walks the EEST static_tests,
neither of which covers the d3 ripemd-touch case. Verified the new test
fails at baseline (Berlin/1, Istanbul/1, London/1 — the d3 indexes) and
passes with the fix in (b). Refactored the now-three state-test runners
(`TestStateCornerCases` / `TestState` / `TestLegacyCancunState`) to
share a `runStateTests(t, st, dir)` helper plus a `stateTestSetup(t)`
helper for parallel/log/Windows-skip boilerplate.

Six fixtures fail on Constantinople-only post-state-root mismatches
(sstoreGas / *_HighNonce* / Ecrecover_Overflow / ecrecoverShortBuff) —
pre-existing Erigon-side divergences from geth, unrelated to this PR.
Geth doesn't catch them locally because its runner walks
`LegacyTests/Constantinople/GeneralStateTests` (older snapshot, doesn't
include these). Tracked separately in #20894 and `SkipLoad`-ed for now
with a comment pointing at the issue.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Adds `--only-history` (alias `--history`) to `erigon seg rm-state`.

When set, restricts deletion to history files (`SnapHistory` +
`SnapIdx`) only — leaving domain data (`.kv` files in `SnapDomain`,
accessor files) untouched.

Useful when combined with `--step from-to` to prune old history without
touching domain snapshots.
…-Byzantium status (#20916)

### New implementations
- `transaction(hash)`: implemented transaction lookup by hash (was
panicking)
- `block(hash)`: implemented block lookup by hash (was unimplemented)
- `block.transactionAt(index)`: implemented with proper bounds checking
(was panicking)
### Bug fixes
- `transaction.type`: changed schema from `Int` to `Long` so
EIP-1559/EIP-4844
transactions return `"0x2"` / `"0x3"` (hex) instead of `2` / `3`
(decimal),
matching the GraphQL spec
- `transaction.status`: pre-Byzantium blocks now correctly return
`"0x0"` instead
of `null` (pre-Byzantium receipts carry `PostState` instead of `Status`)
- `transaction.gas`, `transaction.maxFeePerGas`,
`transaction.maxPriorityFeePerGas`,
`transaction.accessList`: fields were missing from the receipt map and
are now
populated correctly
                                                                        
### Hive simulator fix
- Added `HIVE_CANCUN_TIMESTAMP` to the client environment map in the
GraphQL hive
simulator. Without it, Erigon rejected Cancun blocks during chain
import,
truncating the canonical chain at block 29 and causing ~10 tests to fail
(withdrawals, Shanghai/Cancun block queries, block range queries).
## Tests
- Added unit tests for the new `Long` scalar (marshal/unmarshal)
- Added unit tests for nil guards in the `convertData` helpers
- Added unit tests for `transactionAt` bounds checking
- Added unit tests for `transaction` invalid hash error handling

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Without a deadline, a WebSocket write can block indefinitely when the
peer stops reading (e.g. server sends `StatusMessageTooBig` and closes).
The blocked write holds the write mutex, preventing the read path from
acquiring it to send its own close-frame response — a permanent deadlock
that manifests as a 59-minute hang in `TestWebsocketLargeCall` under
`-race`.

Fix: in `websocketCodec.WriteJSON`, when the caller provides no context
deadline, wrap the context with a `wsPingInterval` (60 s) timeout before
passing it to `jsonCodec.WriteJSON`. This ensures `SetWriteDeadline` is
set to at most 60 s, matching the dead-connection detection window used
by the ping loop.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…#20939)

- `--resume` already supported with history commitment rebuild
## Summary

- Fix a long-standing race in `Client.dispatch` (rpc/client.go) where a
fatal `readErr` racing against `reqSent` could orphan the in-flight op's
`resp` channel, leaving `op.wait` blocked forever on a channel nobody
would ever close.
- The readErr handler was passing `lastOp` to `cancelAllRequests`, which
deliberately *preserves* the inflight op's resp — directly contradicting
the comment immediately above:
  ```
  // A read error is fatal for the connection, and all pending requests
// must be cancelled, including any that might still be considered
in-flight.
  ```
Pass `nil` instead so the inflight op is cancelled too. Clear `lastOp`
so a concurrent reconnect doesn't re-register an op whose resp is
already closed (a later `op.resp <- batch` would panic). Nil-guard
`addRequestOp` in the reconnect handler accordingly.

## Why

Surfaced as the long-running CI flake
[#16875](#16875) —
`TestWebsocketLargeCall` hanging for ~59 minutes until the test timeout
fires. From a recent failing run's goroutine dump:

- the test goroutine sat in `(*requestOp).wait` (rpc/client.go:144) for
58 minutes — i.e., past `c.send`, with the write already returned;
- `Client.dispatch` was idle in `select` (rpc/client.go:597) for the
same 58 minutes;
- there were no goroutines inside `coder/websocket` Read or Write — both
had returned.

The only way to reach that state is: `readErr` fired, `dispatch`
processed it before/instead of `reqSent`'s `lastOp = nil`,
`cancelAllRequests` skipped the in-flight op (because `lastOp` was
passed as `inflightReq`), and so `op.resp` was never closed. The select
between `c.readErr` and `c.reqSent` (which is buffered, size 1) is a
coin flip — that's the flake.
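A toy model of the fixed cancellation path (the types and `cancelAll` here are simplified stand-ins for the rpc/client.go internals, not the actual code):

```go
package main

import "fmt"

type requestOp struct{ resp chan string }

// cancelAll models the fixed behaviour: with inflight == nil, every
// pending op's resp channel is closed, so no waiter can block forever.
// The bug was passing the in-flight op here, which skipped its close.
func cancelAll(pending map[int]*requestOp, inflight *requestOp) {
	for id, op := range pending {
		if op == inflight {
			continue // the buggy path preserved the in-flight op
		}
		close(op.resp)
		delete(pending, id)
	}
}

func main() {
	inflight := &requestOp{resp: make(chan string, 1)}
	pending := map[int]*requestOp{1: {resp: make(chan string, 1)}, 2: inflight}

	// Fixed path: pass nil, so the in-flight op is cancelled too and its
	// waiter unblocks with a closed channel instead of hanging.
	cancelAll(pending, nil)
	_, stillOpen := <-inflight.resp
	fmt.Println(len(pending), stillOpen) // nothing left pending, channel closed
}
```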

## Test plan

- [x] `make lint` — clean
- [x] `make erigon integration` — builds
- [x] `go test -race ./rpc/...` — full suite passes
- [x] `go test -race -count=10 -run '^TestWebsocketLargeCall$' ./rpc/` —
10/10 pass
- [ ] CI race-tests / core-rpc — should be green

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `canonicalHashCache` LRU (10,000 entries, 32 bytes/entry, ~320
KB) to `BlockReader` to cache `blockNum → hash` lookups for
snapshot-range blocks.

**Why a separate cache instead of increasing `headerByNumCache`:**
`headerByNumCache` stores full `*Header` objects (~500 bytes/entry,
1,000 slots, ~500 KB). To cover the same 10,000-block working set it
would require ~5 MB. `canonicalHashCache` achieves the same coverage at
16× less memory by storing only the 32-byte hash.
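A minimal, non-concurrent model of such a blockNum → hash LRU (illustrative only; Erigon's `canonicalHashCache` is a separate, concurrency-safe implementation):

```go
package main

import (
	"container/list"
	"fmt"
)

// hashLRU is a fixed-capacity blockNum -> hash cache: a map for O(1)
// lookup plus a list ordered by recency (front = most recently used).
type hashLRU struct {
	cap   int
	order *list.List
	items map[uint64]*list.Element
}

type entry struct {
	num  uint64
	hash [32]byte
}

func newHashLRU(capacity int) *hashLRU {
	return &hashLRU{cap: capacity, order: list.New(), items: map[uint64]*list.Element{}}
}

func (c *hashLRU) Get(num uint64) ([32]byte, bool) {
	el, ok := c.items[num]
	if !ok {
		return [32]byte{}, false
	}
	c.order.MoveToFront(el) // mark as most recently used
	return el.Value.(*entry).hash, true
}

func (c *hashLRU) Add(num uint64, hash [32]byte) {
	if el, ok := c.items[num]; ok {
		el.Value.(*entry).hash = hash
		c.order.MoveToFront(el)
		return
	}
	c.items[num] = c.order.PushFront(&entry{num, hash})
	if c.order.Len() > c.cap { // evict the least recently used entry
		old := c.order.Back()
		c.order.Remove(old)
		delete(c.items, old.Value.(*entry).num)
	}
}

func main() {
	c := newHashLRU(2)
	c.Add(1, [32]byte{0xaa})
	c.Add(2, [32]byte{0xbb})
	c.Get(1)                 // touch 1, so 2 becomes least recently used
	c.Add(3, [32]byte{0xcc}) // evicts 2
	_, ok := c.Get(2)
	fmt.Println(ok) // 2 was evicted
}
```

Each entry costs roughly the 32-byte hash plus map/list bookkeeping, which is the memory argument made above for caching hashes instead of full headers.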

**Two regimes:**

- **Working set ≤ 1,000 blocks** (`headerByNumCache` warm):
`canonicalHashCache` short-circuits `ViewSingleFile` (refcount
increments + closure allocation) which main always pays even when the
`*Header` is cached → **~3.3×** (572 → 167 ns)
- **Working set > 1,000 and ≤ 10,000 blocks** (`headerByNumCache`
evicted, `canonicalHashCache` still warm): main falls back to full
snapshot path → **~33×** (5,119 → 167 ns)

Results (real mainnet snapshots, i9-14900HX):

| Benchmark | ns/op | B/op | allocs/op | Notes |
|-----------|------:|-----:|----------:|-------|
| `BenchmarkCanonicalHash_RealSnapshot` | 167 | 8 | 1 | this PR —
`canonicalHashCache` hit |
| `BenchmarkCanonicalHash_RealSnapshot_MainEquivalent` | 572 | 176 | 5 |
main equivalent — `headerByNumCache` warm |
| `BenchmarkCanonicalHash_RealSnapshot_Cold` | 5,119 | 1,657 | 11 |
working set > 1,000 blocks |

> **Why MainEquivalent (572 ns) is higher than the theoretical estimate
(~235 ns):** `ViewSingleFile` is called on every `CanonicalHash`
invocation even when `headerByNumCache` is warm. It atomically
increments the refcount of each non-frozen segment and allocates a
release closure — 5 allocs, 176 B per call. The theoretical estimate
only counted MDBX miss + LRU lookup + atomic hash load and did not
account for this overhead.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…20746)

The `v1-` prefix check and `SchemeMinSupportedVersions` version floor in
`CheckSnapshotsCompatibility` were guards for the upgrade window from
pre-3.1 snapshot naming. At 3.5 that window is long closed.

- Removes `CheckSnapshotsCompatibility` and `SchemeMinSupportedVersions`
- Removes `sanityOldNaming` field and `SanityOldNaming()` method from
`AggOpts`
- Removes the unconditional call in rpcdaemon startup
- Deletes `version_schema.go` (its only job was populating
`SchemeMinSupportedVersions`); `init()` now calls `InitSchemasGen()`
directly
…#20944)

This PR fixes some logic re: handling shortened keys in commitment
files:

- `ValuesPlainKeyReferencingThresholdReached` as a general rule, if file
contains 1 step, then use plain keys, otherwise references. This works
well currently because all merged files contains even number of steps
(2, 4, 8, 16, 32, etc.).

The problem is the logic uses %0 implying odd number of steps in range
should contain plain keys as well. Because I'm modifying the step-rebase
command to allow arbitrary number of steps inside a merged range, if you
convert an existing node and it ends up containing an odd number of
steps, it'll crash on next merge.

- I noticed a very similar function
`MayContainValuesPlainKeyReferencing`, which kind of does the formula
correctly, but it misses files with 2 steps. That function is used only
by integrity checks, it is currently missing 2-step files and it is IMO
incorrect. I'm removing that almost-duplicate function and retrofitting
all callers to use `ValuesPlainKeyReferencingThresholdReached`.
…dition (#20872)

## Summary

Fix two related cancellation bugs in `.github/workflows/release.yml`:

1. **Downstream jobs ignored cancellation.** `build-debian-pkg`,
`publish-docker-image`, and `publish-release` used `if: always() &&
...`, which kept dispatching them onto runners even after the operator
cancelled the workflow. The intermediate `contains(needs.X.result,
'failure')` checks did not match `'cancelled'`, and `always()`
explicitly bypasses workflow-level cancellation.
2. **Rollback skipped cancellations.** `In-case-of-failure` enumerated
specific failure modes (`contains(..., 'failure')` on each upstream) and
required `build-release.result == 'success'`. Cancellations and
skipped-due-to-upstream cases didn't satisfy any branch, so the git tag
was left on the remote whenever the operator cancelled. The rollback
step itself also assumed the tag was always present, so calling it on a
workflow that never reached the tag-push step would emit a confusing
"tag does not exist" failure.

## Changes per job

### build-release — no `if:` change
Always runs. Tag-push step (line 119) is still gated by
`inputs.perform_release && inputs.release_version != ''`, so a dry-run
never creates a tag.

### test-release — no `if:` change
Job-level `if:` remains `! inputs.skip_tests`. Skipped tests yield
result `'skipped'`, which downstream jobs continue to treat as a pass —
same as before.

### build-debian-pkg
```diff
-if: always() && contains(needs.build-release.result, 'success') && !contains(needs.test-release.result, 'failure')
+if: !cancelled() && contains(needs.build-release.result, 'success') && !contains(needs.test-release.result, 'failure')
```
- Full success: runs.
- `test-release` failed: skipped.
- `test-release` skipped (`skip_tests=true`): runs.
- `build-release` failed/cancelled: skipped.
- **Workflow cancelled: skipped** — was the bug; previously ran because
`'cancelled'` does not contain `'failure'`.

### publish-docker-image
Same change as `build-debian-pkg`, same behaviour matrix. Internal
docker push step is additionally gated by `inputs.perform_release` for
dry-run support.

### publish-release
```diff
-if: always() && contains(... 'success') && contains(... 'success') && contains(... 'success')
+if: !cancelled() && contains(... 'success') && contains(... 'success') && contains(... 'success')
```
- Full success: runs.
- Any upstream failed/cancelled/skipped: skipped.
- **Workflow cancelled in the window between all upstreams finishing as
success and this job starting: skipped** — was the bug.
- The internal `gh release create` step keeps its current `set +e`
soft-failure behaviour **intentionally**: a half-published draft must be
completable manually using the printed instructions and the
already-built artefacts. The job is by design not allowed to fail.
`publish-release.result` therefore stays `'success'` even when the
GitHub release object itself was not created, and rollback does not fire
— preserving the tag and artefacts for the operator's manual recovery.

### In-case-of-failure
Condition simplified from a four-branch failure-mode enumeration to a
single positive invariant: roll back whenever the release did not fully
publish.

```diff
 if: >-
   always() &&
   inputs.perform_release &&
-  contains(needs.build-release.result, 'success') &&
-  (
-    contains(needs.test-release.result, 'failure') ||
-    contains(needs.build-debian-pkg.result, 'failure') ||
-    contains(needs.publish-docker-image.result, 'failure') ||
-    contains(needs.publish-release.result, 'failure')
-  )
+  inputs.release_version != '' &&
+  needs.publish-release.result != 'success'
```

`!= 'success'` matches `'failure'`, `'cancelled'`, and `'skipped'`
uniformly. The `build-release.result == 'success'` precondition is
removed so rollback also runs when `build-release` was cancelled
mid-flight (the tag may already have been pushed at line 119 before the
cancel signal arrived).

The rollback step is now idempotent via `git ls-remote --exit-code
--quiet --tags`:

```diff
- git push -d origin ${RELEASE_VERSION}
+ if git ls-remote --exit-code --quiet --tags origin "${RELEASE_VERSION}"; then
+   git push -d origin "${RELEASE_VERSION}"
+   echo "Tag ${RELEASE_VERSION} removed."
+ else
+   echo "Tag ${RELEASE_VERSION} not present on remote — nothing to rollback."
+ fi
```

## Behaviour matrix (assume `perform_release=true` unless noted)

| Scenario | publish-release | rollback runs |
|---------------------------------------------------|-----------------|---------------|
| Happy path | success | no |
| Dry run (`perform_release=false`) | success | no |
| `skip_tests=true`, otherwise success | success | no |
| build-release fails (tag may or may not exist) | skipped | **yes** |
| test-release fails | skipped | **yes** |
| build-debian-pkg fails | skipped | **yes** |
| publish-docker-image fails | skipped | **yes** |
| Cancel during build-release | skipped | **yes** |
| Cancel during test-release | skipped | **yes** |
| Cancel during build-debian-pkg | skipped | **yes** |
| Cancel during publish-docker-image | skipped | **yes** |
| Cancel during publish-release | cancelled | **yes** |
| Cancel after publish-release succeeded (too late) | success | no |
| `gh release create` fails (by design, soft-fail) | success | no |
| Operator cancels rollback itself | - | cancelled — tag may persist; manual cleanup |

## Step-level conditions left unchanged

- `if: always()` on the diagnostic torrent-client-status upload
(test-release line 420). Step-level `always()` is contained to the job;
it lets us capture diagnostics even when the QA test step itself failed.
- `if: ${{ inputs.perform_release }}` on the publish-side steps (lines
78, 113, 586, 627, 696). These gate API calls and tag creation for
dry-run support.
- `if: ${{ ! inputs.disable_version_check }}` on the version validation
step (line 164).

## Test plan

- [ ] Dispatch with `perform_release: false` (dry run) — confirm all
jobs run as before, no tag created, rollback not triggered.
- [ ] Dispatch with `perform_release: true` and let it complete normally
— confirm release published, rollback not triggered.
- [ ] Dispatch with `perform_release: true`, cancel during
`test-release` — confirm `build-debian-pkg` and `publish-docker-image`
are skipped (not dispatched), rollback runs and removes tag.
- [ ] Dispatch with `perform_release: true`, cancel during
`build-debian-pkg` — confirm `publish-release` is skipped, rollback runs
and removes tag.
- [ ] Dispatch with `perform_release: true` against a non-existent
commit/branch to force a `build-release` failure — confirm rollback runs
and reports "nothing to rollback" (idempotent).

Cherry-pick to `release/3.4` will follow once this is merged to `main`.
@pull pull Bot locked and limited conversation to collaborators May 4, 2026
@pull pull Bot added the ⤵️ pull label May 4, 2026
@pull pull Bot merged commit b8a7235 into Dustin4444:main May 4, 2026
1 check was pending