Skip to content

miner, consensus/bor: fix leaked-wedge family in worker state machine#2220

Open
cffls wants to merge 7 commits into
developfrom
wedge-fix-audit
Open

miner, consensus/bor: fix leaked-wedge family in worker state machine#2220
cffls wants to merge 7 commits into
developfrom
wedge-fix-audit

Conversation

@cffls
Copy link
Copy Markdown
Contributor

@cffls cffls commented May 12, 2026

Description

Closes four silent stall paths in the producer state machine

  1. miner.mainLoop: when PeerCount==0 on production chains, the dropped newWorkReq branch now clears pendingWorkBlock instead of leaking it.
  2. miner.commitWork: the defer that clears pendingWorkBlock is now registered above the early syncing-check return.
  3. miner.taskLoop: interrupt() now deletes the previous sealhash's pendingTasks entry. Bor.Seal's stop-branch returns silently without posting to resultCh, so resultLoop never cleans the entry.
  4. consensus/bor.Seal: the result-delivery goroutine's second select no longer has a silent default branch — it blocks on send or exits via <-stop.

Regression tests cover each path (four unit + one integration):

  • TestMainLoopClearsPendingWorkBlockOnPeerCountZero
  • TestCommitWorkLeaksPendingWorkBlockWhenSyncing
  • TestTaskLoopInterruptCleansStalePendingTasks
  • TestSeal_BlocksOnFullResultChannelInsteadOfSilentDrop
  • TestProducerRecoversAfterMiningRestart (tests/bor, integration tag)

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@claude
Copy link
Copy Markdown

claude Bot commented May 12, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@cffls cffls force-pushed the wedge-fix-audit branch from ec127a3 to 065a7da Compare May 12, 2026 02:07
Closes four silent stall paths in the producer state machine

1. miner.mainLoop: when PeerCount==0 on production chains, the dropped
   newWorkReq branch now clears pendingWorkBlock instead of leaking it.
2. miner.commitWork: the defer that clears pendingWorkBlock is now
   registered above the early syncing-check return.
3. miner.taskLoop: interrupt() now deletes the previous sealhash's
   pendingTasks entry. Bor.Seal's stop-branch returns silently without
   posting to resultCh, so resultLoop never cleans the entry.
4. consensus/bor.Seal: the result-delivery goroutine's second select
   no longer has a silent default branch — it blocks on send or exits
   via <-stop.

Regression tests cover each path (four unit + one integration):
- TestMainLoopClearsPendingWorkBlockOnPeerCountZero
- TestCommitWorkLeaksPendingWorkBlockWhenSyncing
- TestTaskLoopInterruptCleansStalePendingTasks
- TestSeal_BlocksOnFullResultChannelInsteadOfSilentDrop
- TestProducerRecoversAfterMiningRestart (tests/bor, integration tag)
@cffls cffls force-pushed the wedge-fix-audit branch from 065a7da to 6e5052f Compare May 12, 2026 02:23
The previous chain-ID gate (BorMainnet/Mumbai/Amoy) blocked the
PeerCount==0 drop from being exercised by kurtosis chaos tests on
chain ID 4927. Replacing it with chain-ID list omissions broke
dev/test setups (Clique single-node, intentional-disconnection
tests like TestValidatorWentOffline) where peer count is
legitimately zero by design.

Gate on heimdall presence instead. The HeimdallClient is nil on
all test/dev setups (--bor.withoutheimdall, Clique, Ethash) and
non-nil for any real-network Bor node, so the semantic 'this is
a real network node' is captured directly. Forward-compatible:
new production chain IDs don't need to be added to a list.

Also restore the !IsRio removal in the veblop fallback: the
post-Rio kurtosis bootstrap needs the periodic retry trigger
after the PeerCount drop. Gate the fallback on !isBor instead so
Clique/Ethash don't accidentally fire it.

Fixes CI failures introduced by the prior commit:
- TestServer_DeveloperMode (internal/cli/server)
- TestCommand_DebugBlock   (internal/cli)
- TestValidatorWentOffline (tests/bor, integration)
@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 95.06173% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.35%. Comparing base (f0ed88c) to head (0eeb164).

Files with missing lines Patch % Lines
miner/worker.go 94.28% 2 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2220      +/-   ##
===========================================
+ Coverage    52.29%   52.35%   +0.05%     
===========================================
  Files          885      885              
  Lines       155727   155792      +65     
===========================================
+ Hits         81441    81560     +119     
+ Misses       69055    69000      -55     
- Partials      5231     5232       +1     
Files with missing lines Coverage Δ
consensus/bor/bor.go 85.83% <100.00%> (+0.47%) ⬆️
miner/fake_miner.go 91.87% <100.00%> (+1.25%) ⬆️
miner/worker.go 74.52% <94.28%> (+1.29%) ⬆️

... and 17 files with indirect coverage changes

Files with missing lines Coverage Δ
consensus/bor/bor.go 85.83% <100.00%> (+0.47%) ⬆️
miner/fake_miner.go 91.87% <100.00%> (+1.25%) ⬆️
miner/worker.go 74.52% <94.28%> (+1.29%) ⬆️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cffls
Copy link
Copy Markdown
Contributor Author

cffls commented May 12, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 12, 2026

Claude finished @cffls's task in 2m 38s —— View job


Code Review: PR #2220 — fix leaked-wedge family in worker state machine


Summary

All four fixes are well-motivated and correctly address the leak paths described in the PR. The regression tests are thorough and include excellent post-mortem context in their docstrings. I found one bug (data race) and have a few nits / observations below.


Bug: Data race in warnIfStalledlen(w.pendingTasks) without lock

miner/worker.go:2425warnIfStalled reads len(w.pendingTasks) without holding pendingMu:

log.Warn("Possible producer stall: veblop fallback skipping while chain is stale",
    ...
    "pendingTasksCount", len(w.pendingTasks),  // ← no lock held
    "peerCount", w.eth.PeerCount())

This is called from newWorkLoop (line 765), while pendingTasks is concurrently modified by taskLoop (add at line 1005, delete via deletePendingTask) and resultLoop (delete at line 1050+) — all under pendingMu. The hasPendingTasks parameter was correctly obtained under RLock at lines 760–762, but the log statement bypasses that.

go test -race would flag this. Fix: either pass hasPendingTasks (already a bool parameter) to the log instead, or read the count under the lock before calling warnIfStalled.

Fix this →


Observations (non-blocking)

1. Behavioral change: veblopTimer now fires for pre-Rio Bor chains

miner/worker.go:755 — The condition changed from:

// OLD: skip veblop fallback commit for non-Rio chains
if w.chainConfig.Bor == nil || !w.chainConfig.Bor.IsRio(currentBlock.Number) {

to:

// NEW: skip veblop fallback commit only for non-Bor engines
if !isBor || w.chainConfig.Bor == nil {

This intentionally extends the veblop fallback retry to all Bor chains (not just Rio+). The comment explains the rationale (pre-Rio needs it for peer-outage recovery). This is a behavior change beyond the four bug fixes — worth noting in the PR description for reviewers assessing regression risk on pre-Rio chain configs.

2. Behavioral change: PeerCount gate — chain-ID → heimdall-presence

miner/worker.go:846 — The PeerCount gate changed from chain-ID matching (BorMainnet || Mumbai || Amoy) to bor.HeimdallClient != nil. This is cleaner — it automatically covers future chain IDs and correctly skips the gate for --bor.withoutheimdall setups. The logic is equivalent for all current production chains.

3. Fix #4 (Seal): blocking on full resultCh

consensus/bor/bor.go:1495–1499 — Replacing default with case <-stop means the Seal goroutine will block if resultCh (buffered, size 10) is full and stop hasn't been closed. This is correct behavior — 10 unconsumed sealed blocks would indicate resultLoop is severely stuck, and blocking here is the right backpressure signal. The companion fix in taskLoop.interrupt() (which closes stop and cleans pendingTasks) ensures clean exit on the interrupt path.

4. Test quality

The regression tests are excellent — thorough docstrings explaining the exact pre/post-fix behavior, correct isolation of each bug path, and a thoughtful integration test that exercises the stop/start recovery cycle end-to-end. The TestTaskLoopInterruptCleansStalePendingTasks test's approach of manually injecting pendingTasks entries to simulate the Seal-stop leak without driving actual seal authorization is a good pattern.


Verdict

The four core fixes are correct and well-tested. The data race in warnIfStalled should be fixed before merge.

Comment thread miner/worker.go
Caller in newWorkLoop's veblopTimer case captured len(w.pendingTasks)
under pendingMu, but the helper warnIfStalled read len(w.pendingTasks)
again unguarded in its log statement. Concurrent taskLoop /
resultLoop mutations would race with the read.

Capture the count once under the lock and pass it through to the
helper instead.

Also drop the redundant DevFakeAuthor=false write from
TestMainLoopClearsPendingWorkBlockOnPeerCountZero — NewFakeBor
already constructs with DevFakeAuthor=false, and writing after
mainLoop has captured the value at startup is itself a data race
that -race flagged.

Reported in PR #2220 review.
@cffls
Copy link
Copy Markdown
Contributor Author

cffls commented May 12, 2026

Thanks for the careful review.

Race fix landed in eee7f8b6. The caller in newWorkLoop's veblopTimer case now captures len(w.pendingTasks) once under pendingMu.RLock() and passes the count through to warnIfStalled as a parameter, so the helper no longer reads the slice unguarded. go test -race ./miner/... ./consensus/bor/... is clean.

While fixing it, -race also flagged a second race in TestMainLoopClearsPendingWorkBlockOnPeerCountZero — the test was writing bor.DevFakeAuthor = false after mainLoop had already captured the value at startup. NewFakeBor already constructs with DevFakeAuthor=false, so I dropped the redundant write. Same commit.

On the non-blocking observations:

  1. Veblop fallback fires pre-Rio — intentional. Without it, the kurtosis dev chain (pre-Rio at boot) has no retry trigger after the new PeerCount drop, so a transient peer outage at startup would wedge it permanently. Pre-Rio Bor chains use wiggle * (succession-1) * blockTime delays inside Bor.Seal, so any extra commits from the fallback either get dedup'd by pendingWorkBlock == head+1 or back off naturally — no in-flight regression risk.

  2. PeerCount gate (chain-ID → heimdall-presence) — agreed, the new gate is equivalent for current production chains and automatically covers future ones. Also fixes the unintended exclusion of test setups (Clique, --bor.withoutheimdall) that the chain-ID list was accidentally enforcing.

  3. Bor.Seal blocking on full resultCh — exactly the reasoning. The companion taskLoop.interrupt() cleanup closes stopCh on the next task arrival, which is the only path that would meaningfully unblock a stuck seal goroutine.

Ready for re-review.

Comment thread miner/worker_test.go
cffls added a commit that referenced this pull request May 12, 2026
…-gate

The PeerCount==0 fix was originally chain-ID-gated (BorMainnet/Mumbai/Amoy)
and the test docstrings reflected that. After review the gate was changed
to heimdall-presence (`bor.HeimdallClient != nil`), but the docstrings
were not updated.

This commit corrects the inaccuracies:
- TestMainLoopClearsPendingWorkBlockOnPeerCountZero now correctly
  describes the heimdall-presence trigger and notes that the test uses
  the mock heimdall client (non-nil) from NewFakeBor — no ChainID
  override exists.
- TestProducerRecoversAfterMiningRestart now correctly explains that
  Bug 1's drop branch isn't exercised because the test uses
  withoutHeimdall=true (HeimdallClient nil), not because of chain-ID.

Reported in PR #2220 review.
Comment thread miner/worker.go
cffls and others added 2 commits May 13, 2026 11:06
…-gate

The PeerCount==0 fix was originally chain-ID-gated (BorMainnet/Mumbai/Amoy)
and the test docstrings reflected that. After review the gate was changed
to heimdall-presence (`bor.HeimdallClient != nil`), but the docstrings
were not updated.

This commit corrects the inaccuracies:
- TestMainLoopClearsPendingWorkBlockOnPeerCountZero now correctly
  describes the heimdall-presence trigger and notes that the test uses
  the mock heimdall client (non-nil) from NewFakeBor — no ChainID
  override exists.
- TestProducerRecoversAfterMiningRestart now correctly explains that
  Bug 1's drop branch isn't exercised because the test uses
  withoutHeimdall=true (HeimdallClient nil), not because of chain-ID.
…exits

The previous taskLoop fix made interrupt() unconditionally call
deletePendingTask(prev), which races with Bor.Seal's success path:
when the goroutine has already delivered the result to resultCh but
resultLoop is busy, interrupt() deletes the entry resultLoop is about
to look up, and the !exist branch silently drops the validly-sealed
block.

Move cleanup into Bor.Seal itself, gated to stop-branch exits, via a
new SealWithStopHook(..., onStopExit func()) method. The existing
Seal becomes a thin wrapper passing nil to preserve consensus.Engine.

taskLoop type-asserts to *bor.Bor and uses SealWithStopHook with a
per-sealhash cleanup closure. interrupt() now only closes stopCh.

Adds three bor-level tests covering both stop-branch exits and the
success path, and rewrites the worker-level test to assert the new
contract: interrupt() must NOT delete pendingTasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cffls cffls force-pushed the wedge-fix-audit branch from 0161c71 to dbfe887 Compare May 13, 2026 18:07
Comment thread tests/bor/bor_test.go Outdated
cffls and others added 2 commits May 13, 2026 11:27
…ction

Two docstrings still described the pre-revision design where
interrupt() deleted pendingTasks directly:

- tests/bor/bor_test.go bullet #3 named a unit test
  (TestTaskLoopInterruptCleansStalePendingTasks) that no longer
  exists; the actual test is TestTaskLoopInterruptPreservesPendingTasks
  asserting the opposite semantic.
- miner/worker.go deletePendingTask docstring said "Used by
  taskLoop.interrupt" but the caller is the per-task onStopExit
  closure passed to Bor.SealWithStopHook.

Documentation-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, reopen this pull request to trigger a review.

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
6.4% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@claude
Copy link
Copy Markdown

claude Bot commented May 15, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant