miner, consensus/bor: fix leaked-wedge family in worker state machine #2220
cffls wants to merge 7 commits into
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Closes four silent stall paths in the producer state machine

1. miner.mainLoop: when PeerCount==0 on production chains, the dropped newWorkReq branch now clears pendingWorkBlock instead of leaking it.
2. miner.commitWork: the defer that clears pendingWorkBlock is now registered above the early syncing-check return.
3. miner.taskLoop: interrupt() now deletes the previous sealhash's pendingTasks entry. Bor.Seal's stop-branch returns silently without posting to resultCh, so resultLoop never cleans the entry.
4. consensus/bor.Seal: the result-delivery goroutine's second select no longer has a silent default branch; it blocks on send or exits via <-stop.

Regression tests cover each path (four unit + one integration):

- TestMainLoopClearsPendingWorkBlockOnPeerCountZero
- TestCommitWorkLeaksPendingWorkBlockWhenSyncing
- TestTaskLoopInterruptCleansStalePendingTasks
- TestSeal_BlocksOnFullResultChannelInsteadOfSilentDrop
- TestProducerRecoversAfterMiningRestart (tests/bor, integration tag)
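The fourth fix can be illustrated with a minimal sketch. It is not the code from consensus/bor: `deliver` and its integer "block" payload are hypothetical stand-ins, and the sketch only assumes the behaviour described above (a select with no default branch, so the send either succeeds or the goroutine exits through the stop channel).

```go
package main

import "fmt"

// deliver is a hypothetical stand-in for the result-delivery step in Seal.
// The select has no default branch, so a full results channel can never
// cause a silent drop: the call blocks until the send succeeds or stop closes.
func deliver(results chan<- int, stop <-chan struct{}, block int) bool {
	select {
	case results <- block:
		return true // the consumer will receive the sealed block
	case <-stop:
		return false // the task was interrupted before delivery
	}
}

func main() {
	results := make(chan int, 1)
	stop := make(chan struct{})
	results <- 1 // fill the buffer so the next send may have to wait

	done := make(chan bool)
	go func() { done <- deliver(results, stop, 2) }()

	<-results // the consumer drains one slot, making room for block 2
	fmt.Println("delivered:", <-done)
	fmt.Println("block:", <-results)
}
```

With a `default` branch in the select, the second block could have been discarded while the buffer was still full, with nothing logged and nothing left to wake the consumer.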
The previous chain-ID gate (BorMainnet/Mumbai/Amoy) blocked the PeerCount==0 drop from being exercised by kurtosis chaos tests on chain ID 4927. Replacing it with chain-ID list omissions broke dev/test setups (Clique single-node, intentional-disconnection tests like TestValidatorWentOffline) where peer count is legitimately zero by design.

Gate on heimdall presence instead. The HeimdallClient is nil on all test/dev setups (--bor.withoutheimdall, Clique, Ethash) and non-nil for any real-network Bor node, so the semantic 'this is a real network node' is captured directly. Forward-compatible: new production chain IDs don't need to be added to a list.

Also restore the !IsRio removal in the veblop fallback: the post-Rio kurtosis bootstrap needs the periodic retry trigger after the PeerCount drop. Gate the fallback on !isBor instead so Clique/Ethash don't accidentally fire it.

Fixes CI failures introduced by the prior commit:

- TestServer_DeveloperMode (internal/cli/server)
- TestCommand_DebugBlock (internal/cli)
- TestValidatorWentOffline (tests/bor, integration)
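As a rough sketch of the heimdall-presence gate described above: the types and the `shouldDropWork` helper below are hypothetical and heavily trimmed; only the `HeimdallClient != nil` condition and the PeerCount==0 check come from the commit message.

```go
package main

import "fmt"

// heimdallClient is a hypothetical stand-in for the real client interface.
type heimdallClient interface{ Ping() error }

// mockClient plays the role of a non-nil client, as on a real-network node.
type mockClient struct{}

func (mockClient) Ping() error { return nil }

// engine is a hypothetical, trimmed-down Bor engine.
type engine struct {
	HeimdallClient heimdallClient // nil on dev/test setups in this sketch
}

// shouldDropWork reports whether a new-work request should be dropped:
// only when the node looks like a real-network node (heimdall client present)
// and currently has no peers. No chain-ID list is consulted.
func shouldDropWork(e *engine, peerCount int) bool {
	return e != nil && e.HeimdallClient != nil && peerCount == 0
}

func main() {
	dev := &engine{}                              // e.g. a setup without heimdall
	prod := &engine{HeimdallClient: mockClient{}} // real-network style node

	fmt.Println("dev, 0 peers:", shouldDropWork(dev, 0))
	fmt.Println("prod, 0 peers:", shouldDropWork(prod, 0))
	fmt.Println("prod, 3 peers:", shouldDropWork(prod, 3))
}
```

One caveat in Go: an interface field holding a typed nil pointer compares as non-nil, so a gate like this relies on the field being left entirely unset when heimdall is disabled.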
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | develop | #2220 | +/- |
|---|---|---|---|
| Coverage | 52.29% | 52.35% | +0.05% |
| Files | 885 | 885 | |
| Lines | 155727 | 155792 | +65 |
| Hits | 81441 | 81560 | +119 |
| Misses | 69055 | 69000 | -55 |
| Partials | 5231 | 5232 | +1 |

... and 17 files with indirect coverage changes
@claude review
Claude finished @cffls's task in 2m 38s
Code Review: PR #2220: fix leaked-wedge family in worker state machine
Summary
All four fixes are well-motivated and correctly address the leak paths described in the PR. The regression tests are thorough and include excellent post-mortem context in their docstrings. I found one bug (data race) and have a few nits / observations below.
Bug: Data race in
The caller in newWorkLoop's veblopTimer case captured len(w.pendingTasks) under pendingMu, but the helper warnIfStalled read len(w.pendingTasks) again, unguarded, in its log statement. Concurrent taskLoop / resultLoop mutations would race with that read. Capture the count once under the lock and pass it through to the helper instead.

Also drop the redundant DevFakeAuthor=false write from TestMainLoopClearsPendingWorkBlockOnPeerCountZero: NewFakeBor already constructs with DevFakeAuthor=false, and writing after mainLoop has captured the value at startup is itself a data race that -race flagged.

Reported in PR #2220 review.
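The shape of the race fix can be sketched as follows. The `worker` type, `pendingCount`, and `stallMessage` are hypothetical simplifications, not the code in miner/worker.go; the point is only that the map length is read once under the lock and the helper receives a plain value instead of touching the map.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical, trimmed-down version of the miner worker.
type worker struct {
	pendingMu    sync.RWMutex
	pendingTasks map[int]struct{}
}

// pendingCount reads the map length while holding the lock.
func (w *worker) pendingCount() int {
	w.pendingMu.RLock()
	defer w.pendingMu.RUnlock()
	return len(w.pendingTasks)
}

// stallMessage takes the already-captured count by value, so it never
// touches pendingTasks and cannot race with concurrent mutations.
func stallMessage(pending int) string {
	if pending == 0 {
		return ""
	}
	return fmt.Sprintf("producer may be stalled: %d pending task(s)", pending)
}

func main() {
	w := &worker{pendingTasks: map[int]struct{}{1: {}, 2: {}}}

	n := w.pendingCount() // captured once, under the lock
	fmt.Println(stallMessage(n))
}
```

Passing the count as an argument also keeps the logged number consistent with whatever decision the caller made from the same snapshot.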
Thanks for the careful review. Race fix landed in While fixing it, On the non-blocking observations:
Ready for re-review.
…-gate

The PeerCount==0 fix was originally chain-ID-gated (BorMainnet/Mumbai/Amoy) and the test docstrings reflected that. After review the gate was changed to heimdall-presence (`bor.HeimdallClient != nil`), but the docstrings were not updated. This commit corrects the inaccuracies:

- TestMainLoopClearsPendingWorkBlockOnPeerCountZero now correctly describes the heimdall-presence trigger and notes that the test uses the mock heimdall client (non-nil) from NewFakeBor; no ChainID override exists.
- TestProducerRecoversAfterMiningRestart now correctly explains that Bug 1's drop branch isn't exercised because the test uses withoutHeimdall=true (HeimdallClient nil), not because of chain-ID.

Reported in PR #2220 review.
…exits

The previous taskLoop fix made interrupt() unconditionally call deletePendingTask(prev), which races with Bor.Seal's success path: when the goroutine has already delivered the result to resultCh but resultLoop is busy, interrupt() deletes the entry resultLoop is about to look up, and the !exist branch silently drops the validly-sealed block.

Move cleanup into Bor.Seal itself, gated to stop-branch exits, via a new SealWithStopHook(..., onStopExit func()) method. The existing Seal becomes a thin wrapper passing nil to preserve consensus.Engine. taskLoop type-asserts to *bor.Bor and uses SealWithStopHook with a per-sealhash cleanup closure. interrupt() now only closes stopCh.

Adds three bor-level tests covering both stop-branch exits and the success path, and rewrites the worker-level test to assert the new contract: interrupt() must NOT delete pendingTasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
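The stop-hook contract described above can be sketched in a few lines. The lowercase `sealWithStopHook` and `seal` below are hypothetical and synchronous for clarity; the real SealWithStopHook signature is elided in the commit message and is not reproduced here. The sketch only assumes that the hook runs on stop-branch exits and never on the success path.

```go
package main

import "fmt"

// sealWithStopHook is a hypothetical simplification of the delivery step.
// onStopExit runs only when the seal exits through the stop branch; on the
// success path the consumer of results stays responsible for cleanup.
func sealWithStopHook(results chan<- string, stop <-chan struct{}, block string, onStopExit func()) {
	select {
	case results <- block:
		// Success: leave the pending entry for the result consumer to find.
	case <-stop:
		if onStopExit != nil {
			onStopExit() // per-task cleanup supplied by the caller
		}
	}
}

// seal mirrors the thin wrapper idea: same behaviour, no hook.
func seal(results chan<- string, stop <-chan struct{}, block string) {
	sealWithStopHook(results, stop, block, nil)
}

func main() {
	pending := map[string]bool{"hashA": true, "hashB": true}
	results := make(chan string, 1)
	stop := make(chan struct{})

	// Success path: the entry for hashA must survive.
	sealWithStopHook(results, stop, "hashA", func() { delete(pending, "hashA") })
	fmt.Println("hashA still pending:", pending["hashA"])

	// Stop path: the channel is full and stop is closed, so the hook runs.
	close(stop)
	sealWithStopHook(results, stop, "hashB", func() { delete(pending, "hashB") })
	fmt.Println("hashB still pending:", pending["hashB"])

	seal(results, stop, "hashC") // a nil hook is simply skipped
}
```

Tying cleanup to the exit branch rather than to interrupt() means the entry is removed only when no result will ever be posted for that sealhash.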
…ction

Two docstrings still described the pre-revision design where interrupt() deleted pendingTasks directly:

- tests/bor/bor_test.go bullet #3 named a unit test (TestTaskLoopInterruptCleansStalePendingTasks) that no longer exists; the actual test is TestTaskLoopInterruptPreservesPendingTasks asserting the opposite semantic.
- miner/worker.go deletePendingTask docstring said "Used by taskLoop.interrupt" but the caller is the per-task onStopExit closure passed to Bor.SealWithStopHook.

Documentation-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

