miner, consensus/bor: fix leaked-wedge family in worker state machine by cffls · Pull Request #2220 · 0xPolygon/bor

cffls · 2026-05-12T01:50:58Z

Description

Closes four silent stall paths in the producer state machine

miner.mainLoop: when PeerCount==0 on production chains, the dropped newWorkReq branch now clears pendingWorkBlock instead of leaking it.
miner.commitWork: the defer that clears pendingWorkBlock is now registered above the early syncing-check return.
miner.taskLoop: interrupt() now deletes the previous sealhash's pendingTasks entry. Bor.Seal's stop-branch returns silently without posting to resultCh, so resultLoop never cleans the entry.
consensus/bor.Seal: the result-delivery goroutine's second select no longer has a silent default branch — it blocks on send or exits via <-stop.

Regression tests cover each path (four unit + one integration):

TestMainLoopClearsPendingWorkBlockOnPeerCountZero
TestCommitWorkLeaksPendingWorkBlockWhenSyncing
TestTaskLoopInterruptCleansStalePendingTasks
TestSeal_BlocksOnFullResultChannelInsteadOfSilentDrop
TestProducerRecoversAfterMiningRestart (tests/bor, integration tag)

Changes

Bugfix (non-breaking change that solves an issue)
Hotfix (change that solves an urgent issue, and requires immediate attention)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)
Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

I have added at least 2 reviewer or the whole pos-v1 team
I have added sufficient documentation in code
I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

This PR requires changes to heimdall
- In case link the PR here:
This PR requires changes to matic-cli
- In case link the PR here:

Testing

I have added unit tests
I have added tests to CI
I have tested this code manually on local environment
I have tested this code manually on remote devnet using express-cli
I have tested this code manually on amoy
I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

_{Tip: disable this comment in your organization's Code Review settings.}

claude · 2026-05-12T01:57:30Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Closes four silent stall paths in the producer state machine 1. miner.mainLoop: when PeerCount==0 on production chains, the dropped newWorkReq branch now clears pendingWorkBlock instead of leaking it. 2. miner.commitWork: the defer that clears pendingWorkBlock is now registered above the early syncing-check return. 3. miner.taskLoop: interrupt() now deletes the previous sealhash's pendingTasks entry. Bor.Seal's stop-branch returns silently without posting to resultCh, so resultLoop never cleans the entry. 4. consensus/bor.Seal: the result-delivery goroutine's second select no longer has a silent default branch — it blocks on send or exits via <-stop. Regression tests cover each path (four unit + one integration): - TestMainLoopClearsPendingWorkBlockOnPeerCountZero - TestCommitWorkLeaksPendingWorkBlockWhenSyncing - TestTaskLoopInterruptCleansStalePendingTasks - TestSeal_BlocksOnFullResultChannelInsteadOfSilentDrop - TestProducerRecoversAfterMiningRestart (tests/bor, integration tag)

The previous chain-ID gate (BorMainnet/Mumbai/Amoy) blocked the PeerCount==0 drop from being exercised by kurtosis chaos tests on chain ID 4927. Replacing it with chain-ID list omissions broke dev/test setups (Clique single-node, intentional-disconnection tests like TestValidatorWentOffline) where peer count is legitimately zero by design. Gate on heimdall presence instead. The HeimdallClient is nil on all test/dev setups (--bor.withoutheimdall, Clique, Ethash) and non-nil for any real-network Bor node, so the semantic 'this is a real network node' is captured directly. Forward-compatible: new production chain IDs don't need to be added to a list. Also restore the !IsRio removal in the veblop fallback: the post-Rio kurtosis bootstrap needs the periodic retry trigger after the PeerCount drop. Gate the fallback on !isBor instead so Clique/Ethash don't accidentally fire it. Fixes CI failures introduced by the prior commit: - TestServer_DeveloperMode (internal/cli/server) - TestCommand_DebugBlock (internal/cli) - TestValidatorWentOffline (tests/bor, integration)

codecov · 2026-05-12T18:15:21Z

Codecov Report

❌ Patch coverage is 95.06173% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.35%. Comparing base (f0ed88c) to head (0eeb164).

Files with missing lines	Patch %	Lines
miner/worker.go	94.28%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2220      +/-   ##
===========================================
+ Coverage    52.29%   52.35%   +0.05%     
===========================================
  Files          885      885              
  Lines       155727   155792      +65     
===========================================
+ Hits         81441    81560     +119     
+ Misses       69055    69000      -55     
- Partials      5231     5232       +1

Files with missing lines	Coverage Δ
consensus/bor/bor.go	`85.83% <100.00%> (+0.47%)`	⬆️
miner/fake_miner.go	`91.87% <100.00%> (+1.25%)`	⬆️
miner/worker.go	`74.52% <94.28%> (+1.29%)`	⬆️

... and 17 files with indirect coverage changes

Files with missing lines	Coverage Δ
consensus/bor/bor.go	`85.83% <100.00%> (+0.47%)`	⬆️
miner/fake_miner.go	`91.87% <100.00%> (+1.25%)`	⬆️
miner/worker.go	`74.52% <94.28%> (+1.29%)`	⬆️

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cffls · 2026-05-12T18:31:59Z

@claude review

claude · 2026-05-12T18:32:20Z

Claude finished @cffls's task in 2m 38s —— View job

Code Review: PR #2220 — fix leaked-wedge family in worker state machine

Read changed production code (miner/worker.go, consensus/bor/bor.go, miner/fake_miner.go)
Read changed test files (miner/worker_test.go, consensus/bor/bor_test.go, miner/miner_test.go, tests/bor/bor_test.go)
Review fix MAT-156 Span sync #1: mainLoop clears pendingWorkBlock on PeerCount==0
Review fix build(deps): bump mixin-deep from 1.3.1 to 1.3.2 in /dashboard/assets #2: commitWork defer placement above syncing-check
Review fix build(deps): bump eslint-utils from 1.3.1 to 1.4.3 in /dashboard/assets #3: taskLoop interrupt cleans stale pendingTasks
Review fix build(deps): bump lodash from 4.17.11 to 4.17.15 in /dashboard/assets #4: Seal removes silent default branch
Security review (consensus-path safety, race conditions, resource leaks)
Post final review

Summary

All four fixes are well-motivated and correctly address the leak paths described in the PR. The regression tests are thorough and include excellent post-mortem context in their docstrings. I found one bug (data race) and have a few nits / observations below.

Bug: Data race in `warnIfStalled` — `len(w.pendingTasks)` without lock

miner/worker.go:2425 — warnIfStalled reads len(w.pendingTasks) without holding pendingMu:

log.Warn("Possible producer stall: veblop fallback skipping while chain is stale",
    ...
    "pendingTasksCount", len(w.pendingTasks),  // ← no lock held
    "peerCount", w.eth.PeerCount())

This is called from newWorkLoop (line 765), while pendingTasks is concurrently modified by taskLoop (add at line 1005, delete via deletePendingTask) and resultLoop (delete at line 1050+) — all under pendingMu. The hasPendingTasks parameter was correctly obtained under RLock at lines 760–762, but the log statement bypasses that.

go test -race would flag this. Fix: either pass hasPendingTasks (already a bool parameter) to the log instead, or read the count under the lock before calling warnIfStalled.

Fix this →

Observations (non-blocking)

1. Behavioral change: veblopTimer now fires for pre-Rio Bor chains

miner/worker.go:755 — The condition changed from:

// OLD: skip veblop fallback commit for non-Rio chains
if w.chainConfig.Bor == nil || !w.chainConfig.Bor.IsRio(currentBlock.Number) {

to:

// NEW: skip veblop fallback commit only for non-Bor engines
if !isBor || w.chainConfig.Bor == nil {

This intentionally extends the veblop fallback retry to all Bor chains (not just Rio+). The comment explains the rationale (pre-Rio needs it for peer-outage recovery). This is a behavior change beyond the four bug fixes — worth noting in the PR description for reviewers assessing regression risk on pre-Rio chain configs.

2. Behavioral change: PeerCount gate — chain-ID → heimdall-presence

miner/worker.go:846 — The PeerCount gate changed from chain-ID matching (BorMainnet || Mumbai || Amoy) to bor.HeimdallClient != nil. This is cleaner — it automatically covers future chain IDs and correctly skips the gate for --bor.withoutheimdall setups. The logic is equivalent for all current production chains.

3. Fix #4 (Seal): blocking on full `resultCh`

consensus/bor/bor.go:1495–1499 — Replacing default with case <-stop means the Seal goroutine will block if resultCh (buffered, size 10) is full and stop hasn't been closed. This is correct behavior — 10 unconsumed sealed blocks would indicate resultLoop is severely stuck, and blocking here is the right backpressure signal. The companion fix in taskLoop.interrupt() (which closes stop and cleans pendingTasks) ensures clean exit on the interrupt path.

4. Test quality

The regression tests are excellent — thorough docstrings explaining the exact pre/post-fix behavior, correct isolation of each bug path, and a thoughtful integration test that exercises the stop/start recovery cycle end-to-end. The TestTaskLoopInterruptCleansStalePendingTasks test's approach of manually injecting pendingTasks entries to simulate the Seal-stop leak without driving actual seal authorization is a good pattern.

Verdict

The four core fixes are correct and well-tested. The data race in warnIfStalled should be fixed before merge.

Caller in newWorkLoop's veblopTimer case captured len(w.pendingTasks) under pendingMu, but the helper warnIfStalled read len(w.pendingTasks) again unguarded in its log statement. Concurrent taskLoop / resultLoop mutations would race with the read. Capture the count once under the lock and pass it through to the helper instead. Also drop the redundant DevFakeAuthor=false write from TestMainLoopClearsPendingWorkBlockOnPeerCountZero — NewFakeBor already constructs with DevFakeAuthor=false, and writing after mainLoop has captured the value at startup is itself a data race that -race flagged. Reported in PR #2220 review.

cffls · 2026-05-12T19:23:12Z

Thanks for the careful review.

Race fix landed in eee7f8b6. The caller in newWorkLoop's veblopTimer case now captures len(w.pendingTasks) once under pendingMu.RLock() and passes the count through to warnIfStalled as a parameter, so the helper no longer reads the slice unguarded. go test -race ./miner/... ./consensus/bor/... is clean.

While fixing it, -race also flagged a second race in TestMainLoopClearsPendingWorkBlockOnPeerCountZero — the test was writing bor.DevFakeAuthor = false after mainLoop had already captured the value at startup. NewFakeBor already constructs with DevFakeAuthor=false, so I dropped the redundant write. Same commit.

On the non-blocking observations:

Veblop fallback fires pre-Rio — intentional. Without it, the kurtosis dev chain (pre-Rio at boot) has no retry trigger after the new PeerCount drop, so a transient peer outage at startup would wedge it permanently. Pre-Rio Bor chains use wiggle * (succession-1) * blockTime delays inside Bor.Seal, so any extra commits from the fallback either get dedup'd by pendingWorkBlock == head+1 or back off naturally — no in-flight regression risk.
PeerCount gate (chain-ID → heimdall-presence) — agreed, the new gate is equivalent for current production chains and automatically covers future ones. Also fixes the unintended exclusion of test setups (Clique, --bor.withoutheimdall) that the chain-ID list was accidentally enforcing.
Bor.Seal blocking on full resultCh — exactly the reasoning. The companion taskLoop.interrupt() cleanup closes stopCh on the next task arrival, which is the only path that would meaningfully unblock a stuck seal goroutine.

Ready for re-review.

…-gate The PeerCount==0 fix was originally chain-ID-gated (BorMainnet/Mumbai/Amoy) and the test docstrings reflected that. After review the gate was changed to heimdall-presence (`bor.HeimdallClient != nil`), but the docstrings were not updated. This commit corrects the inaccuracies: - TestMainLoopClearsPendingWorkBlockOnPeerCountZero now correctly describes the heimdall-presence trigger and notes that the test uses the mock heimdall client (non-nil) from NewFakeBor — no ChainID override exists. - TestProducerRecoversAfterMiningRestart now correctly explains that Bug 1's drop branch isn't exercised because the test uses withoutHeimdall=true (HeimdallClient nil), not because of chain-ID. Reported in PR #2220 review.

…-gate The PeerCount==0 fix was originally chain-ID-gated (BorMainnet/Mumbai/Amoy) and the test docstrings reflected that. After review the gate was changed to heimdall-presence (`bor.HeimdallClient != nil`), but the docstrings were not updated. This commit corrects the inaccuracies: - TestMainLoopClearsPendingWorkBlockOnPeerCountZero now correctly describes the heimdall-presence trigger and notes that the test uses the mock heimdall client (non-nil) from NewFakeBor — no ChainID override exists. - TestProducerRecoversAfterMiningRestart now correctly explains that Bug 1's drop branch isn't exercised because the test uses withoutHeimdall=true (HeimdallClient nil), not because of chain-ID.

…exits The previous taskLoop fix made interrupt() unconditionally call deletePendingTask(prev), which races with Bor.Seal's success path: when the goroutine has already delivered the result to resultCh but resultLoop is busy, interrupt() deletes the entry resultLoop is about to look up, and the !exist branch silently drops the validly-sealed block. Move cleanup into Bor.Seal itself, gated to stop-branch exits, via a new SealWithStopHook(..., onStopExit func()) method. The existing Seal becomes a thin wrapper passing nil to preserve consensus.Engine. taskLoop type-asserts to *bor.Bor and uses SealWithStopHook with a per-sealhash cleanup closure. interrupt() now only closes stopCh. Adds three bor-level tests covering both stop-branch exits and the success path, and rewrites the worker-level test to assert the new contract: interrupt() must NOT delete pendingTasks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ction Two docstrings still described the pre-revision design where interrupt() deleted pendingTasks directly: - tests/bor/bor_test.go bullet #3 named a unit test (TestTaskLoopInterruptCleansStalePendingTasks) that no longer exists; the actual test is TestTaskLoopInterruptPreservesPendingTasks asserting the opposite semantic. - miner/worker.go deletePendingTask docstring said "Used by taskLoop.interrupt" but the caller is the per-task onStopExit closure passed to Bor.SealWithStopHook. Documentation-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, reopen this pull request to trigger a review.

sonarqubecloud · 2026-05-15T20:23:41Z

Quality Gate failed

Failed conditions
6.4% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

claude · 2026-05-15T20:30:45Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

claude Bot reviewed May 12, 2026

View reviewed changes

cffls force-pushed the wedge-fix-audit branch from ec127a3 to 065a7da Compare May 12, 2026 02:07

cffls force-pushed the wedge-fix-audit branch from 065a7da to 6e5052f Compare May 12, 2026 02:23

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread miner/worker.go

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread miner/worker_test.go

claude Bot reviewed May 12, 2026

View reviewed changes

Comment thread miner/worker.go

cffls and others added 2 commits May 13, 2026 11:06

cffls force-pushed the wedge-fix-audit branch from 0161c71 to dbfe887 Compare May 13, 2026 18:07

claude Bot reviewed May 13, 2026

View reviewed changes

Comment thread tests/bor/bor_test.go Outdated

cffls and others added 2 commits May 13, 2026 11:27

Merge remote-tracking branch 'origin/develop' into wedge-fix-audit

0eeb164

claude Bot reviewed May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

miner, consensus/bor: fix leaked-wedge family in worker state machine#2220

miner, consensus/bor: fix leaked-wedge family in worker state machine#2220
cffls wants to merge 7 commits into
developfrom
wedge-fix-audit

cffls commented May 12, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot commented May 12, 2026

Uh oh!

codecov Bot commented May 12, 2026 •

edited

Loading

Uh oh!

cffls commented May 12, 2026

Uh oh!

claude Bot commented May 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

cffls commented May 12, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Uh oh!

sonarqubecloud Bot commented May 15, 2026

Uh oh!

claude Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cffls commented May 12, 2026

Description

Changes

Breaking changes

Nodes audience

Checklist

Cross repository changes

Testing

Manual tests

Additional comments

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

claude Bot commented May 12, 2026

Code review

Uh oh!

codecov Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cffls commented May 12, 2026

Uh oh!

claude Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: PR #2220 — fix leaked-wedge family in worker state machine

Summary

Bug: Data race in warnIfStalled — len(w.pendingTasks) without lock

Observations (non-blocking)

1. Behavioral change: veblopTimer now fires for pre-Rio Bor chains

2. Behavioral change: PeerCount gate — chain-ID → heimdall-presence

3. Fix #4 (Seal): blocking on full resultCh

4. Test quality

Verdict

Uh oh!

Uh oh!

cffls commented May 12, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 15, 2026

Quality Gate failed

Uh oh!

claude Bot commented May 15, 2026

Code review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented May 12, 2026 •

edited

Loading

claude Bot commented May 12, 2026 •

edited

Loading

Bug: Data race in `warnIfStalled` — `len(w.pendingTasks)` without lock

3. Fix #4 (Seal): blocking on full `resultCh`