Conversation
The nightly-race workflow had been failing on every run for a month. Most
failures were either race-detector overhead pushing test timings past
narrow assertion margins, tests mutating shared global state that
collided with parallel tests, or a single real data race that cascaded
into many victim tests.
internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did
a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor
aliased to the global BorConfig, then mutated MadhugiriBlock on it.
Concurrent tests calling IsMadhugiri raced with the write. Deep-copy
BorConfig like the sibling testBackendWithNilBorTx already does. This
single race was the root cause of the ~10 cascading ethapi test failures
(TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.).
consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and
TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the
primary mock, leaving FetchStatus returning success. The registry's
background probe then raced with the test's MarkUnhealthy call and could
flip the primary back to healthy (or the active gauge back to 0) before
the assertion ran. Make the primary mock fail FetchStatus too so probe
and API are consistent.
core/state: TestConcurrentUsedParallelism measures wall-clock parallel
speedup and asserts >=2x. Race instrumentation serializes atomic/mutex
ops and skews the measurement to ~1.7x. Skip under -race via a new
race_{on,off}_test.go build-tag pair; the test still guards against the
global-lock regression in non-race runs.
core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock,
ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect
timeout that is genuinely too short once -race instrumentation is added
(the bare test takes ~12s). Bump to 60s — still catches real deadlocks,
no longer fires on legitimate completion.
eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that
CI routinely exceeded under -race. Bump to 30s.
TestSkeletonSyncRetrievals had four 2s polling loops that could exit
before background serving goroutines finished incrementing the served
counter (the assertion then read a partial count). Bump the budget to
30s, cap per-iteration sleep at 500ms so exponential backoff stays
responsive, and fold the served counter into the polling condition so
we wait for both subchain state and served totals before asserting.
eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx
parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving
only 100ms of slack before the client-side 2s timeout fired; under
-race that slack was easily exhausted. Halve the handler sleep to
rpcTimeout/2 — still proves parallelism (3 serial calls would exceed
the unchanged 2s upper bound) with comfortable margin for -race.
metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded
global RNG and wall-clock timing, making the reservoir's average a
noisy statistic that occasionally drifted outside [14, 16] under
-race. Drive the test with a seeded RNG and synthesised monotonic
timestamps; override t0/t1 after NewExpDecaySample so dt in update()
stays positive. Test still catches the priority-overflow regression
(average would stick at 10 under the buggy formula).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
|
test comment |
Code review1 issue found. Checked for bugs and CLAUDE.md compliance. eth/downloader/skeleton_test.go line 922 (inline comment): Comment says "10s budget" but the loop on line 931 uses Per CLAUDE.md: comments should only be added when the WHY is non-obvious, and when added they must be accurate. A factually wrong concrete number is worse than no comment. Note: This was intended as a PR review with an inline comment but was posted as a regular comment due to environment constraints. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #2193 +/- ##
===========================================
- Coverage 51.99% 51.98% -0.01%
===========================================
Files 884 884
Lines 155503 155503
===========================================
- Hits 80854 80842 -12
- Misses 69448 69460 +12
Partials 5201 5201 see 19 files with indirect coverage changes 🚀 New features to boost your workflow:
|



The nightly-race workflow had been failing on every run for a month. Most failures were either race-detector overhead pushing test timings past narrow assertion margins, tests mutating shared global state that collided with parallel tests, or a single real data race that cascaded into many victim tests.
internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor aliased to the global BorConfig, then mutated MadhugiriBlock on it. Concurrent tests calling IsMadhugiri raced with the write. Deep-copy BorConfig like the sibling testBackendWithNilBorTx already does. This single race was the root cause of the ~10 cascading ethapi test failures (TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.).
consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the primary mock, leaving FetchStatus returning success. The registry's background probe then raced with the test's MarkUnhealthy call and could flip the primary back to healthy (or the active gauge back to 0) before the assertion ran. Make the primary mock fail FetchStatus too so probe and API are consistent.
core/state: TestConcurrentUsedParallelism measures wall-clock parallel speedup and asserts >=2x. Race instrumentation serializes atomic/mutex ops and skews the measurement to ~1.7x. Skip under -race via a new race_{on,off}_test.go build-tag pair; the test still guards against the global-lock regression in non-race runs.
core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock, ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect timeout that is genuinely too short once -race instrumentation is added (the bare test takes ~12s). Bump to 60s — still catches real deadlocks, no longer fires on legitimate completion.
eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that CI routinely exceeded under -race. Bump to 30s.
TestSkeletonSyncRetrievals had four 2s polling loops that could exit before background serving goroutines finished incrementing the served counter (the assertion then read a partial count). Bump the budget to 30s, cap per-iteration sleep at 500ms so exponential backoff stays responsive, and fold the served counter into the polling condition so we wait for both subchain state and served totals before asserting.
eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving only 100ms of slack before the client-side 2s timeout fired; under -race that slack was easily exhausted. Halve the handler sleep to rpcTimeout/2 — still proves parallelism (3 serial calls would exceed the unchanged 2s upper bound) with comfortable margin for -race.
metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded global RNG and wall-clock timing, making the reservoir's average a noisy statistic that occasionally drifted outside [14, 16] under -race. Drive the test with a seeded RNG and synthesised monotonic timestamps; override t0/t1 after NewExpDecaySample so dt in update() stays positive. Test still catches the priority-overflow regression (average would stick at 10 under the buggy formula).