Skip to content

all: fix nightly race test flakes#2193

Merged
cffls merged 1 commit into
developfrom
fix_race
Apr 23, 2026
Merged

all: fix nightly race test flakes#2193
cffls merged 1 commit into
developfrom
fix_race

Conversation

@cffls
Copy link
Copy Markdown
Contributor

@cffls cffls commented Apr 21, 2026

The nightly-race workflow had been failing on every run for a month. Most failures were either race-detector overhead pushing test timings past narrow assertion margins, tests mutating shared global state that collided with parallel tests, or a single real data race that cascaded into many victim tests.

internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor aliased to the global BorConfig, then mutated MadhugiriBlock on it. Concurrent tests calling IsMadhugiri raced with the write. Deep-copy BorConfig like the sibling testBackendWithNilBorTx already does. This single race was the root cause of the ~10 cascading ethapi test failures (TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.).

consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the primary mock, leaving FetchStatus returning success. The registry's background probe then raced with the test's MarkUnhealthy call and could flip the primary back to healthy (or the active gauge back to 0) before the assertion ran. Make the primary mock fail FetchStatus too so probe and API are consistent.

core/state: TestConcurrentUsedParallelism measures wall-clock parallel speedup and asserts >=2x. Race instrumentation serializes atomic/mutex ops and skews the measurement to ~1.7x. Skip under -race via a new race_{on,off}_test.go build-tag pair; the test still guards against the global-lock regression in non-race runs.

core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock, ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect timeout that is genuinely too short once -race instrumentation is added (the bare test takes ~12s). Bump to 60s — still catches real deadlocks, no longer fires on legitimate completion.

eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that CI routinely exceeded under -race. Bump to 30s.
TestSkeletonSyncRetrievals had four 2s polling loops that could exit before background serving goroutines finished incrementing the served counter (the assertion then read a partial count). Bump the budget to 30s, cap per-iteration sleep at 500ms so exponential backoff stays responsive, and fold the served counter into the polling condition so we wait for both subchain state and served totals before asserting.

eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving only 100ms of slack before the client-side 2s timeout fired; under -race that slack was easily exhausted. Halve the handler sleep to rpcTimeout/2 — still proves parallelism (3 serial calls would exceed the unchanged 2s upper bound) with comfortable margin for -race.

metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded global RNG and wall-clock timing, making the reservoir's average a noisy statistic that occasionally drifted outside [14, 16] under -race. Drive the test with a seeded RNG and synthesised monotonic timestamps; override t0/t1 after NewExpDecaySample so dt in update() stays positive. Test still catches the priority-overflow regression (average would stick at 10 under the buggy formula).

The nightly-race workflow had been failing on every run for a month. Most
failures were either race-detector overhead pushing test timings past
narrow assertion margins, tests mutating shared global state that
collided with parallel tests, or a single real data race that cascaded
into many victim tests.

internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did
a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor
aliased to the global BorConfig, then mutated MadhugiriBlock on it.
Concurrent tests calling IsMadhugiri raced with the write. Deep-copy
BorConfig like the sibling testBackendWithNilBorTx already does. This
single race was the root cause of the ~10 cascading ethapi test failures
(TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.).

consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and
TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the
primary mock, leaving FetchStatus returning success. The registry's
background probe then raced with the test's MarkUnhealthy call and could
flip the primary back to healthy (or the active gauge back to 0) before
the assertion ran. Make the primary mock fail FetchStatus too so probe
and API are consistent.

core/state: TestConcurrentUsedParallelism measures wall-clock parallel
speedup and asserts >=2x. Race instrumentation serializes atomic/mutex
ops and skews the measurement to ~1.7x. Skip under -race via a new
race_{on,off}_test.go build-tag pair; the test still guards against the
global-lock regression in non-race runs.

core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock,
ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect
timeout that is genuinely too short once -race instrumentation is added
(the bare test takes ~12s). Bump to 60s — still catches real deadlocks,
no longer fires on legitimate completion.

eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that
CI routinely exceeded under -race. Bump to 30s.
TestSkeletonSyncRetrievals had four 2s polling loops that could exit
before background serving goroutines finished incrementing the served
counter (the assertion then read a partial count). Bump the budget to
30s, cap per-iteration sleep at 500ms so exponential backoff stays
responsive, and fold the served counter into the polling condition so
we wait for both subchain state and served totals before asserting.

eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx
parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving
only 100ms of slack before the client-side 2s timeout fired; under
-race that slack was easily exhausted. Halve the handler sleep to
rpcTimeout/2 — still proves parallelism (3 serial calls would exceed
the unchanged 2s upper bound) with comfortable margin for -race.

metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded
global RNG and wall-clock timing, making the reservoir's average a
noisy statistic that occasionally drifted outside [14, 16] under
-race. Drive the test with a seeded RNG and synthesised monotonic
timestamps; override t0/t1 after NewExpDecaySample so dt in update()
stays positive. Test still catches the priority-overflow regression
(average would stick at 10 under the buggy formula).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@sonarqubecloud
Copy link
Copy Markdown

@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

test comment

@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Code review

1 issue found. Checked for bugs and CLAUDE.md compliance.


eth/downloader/skeleton_test.go line 922 (inline comment):

Comment says "10s budget" but the loop on line 931 uses 30*time.Second. Should read "30s budget".

Per CLAUDE.md: comments should only be added when the WHY is non-obvious, and when added they must be accurate. A factually wrong concrete number is worse than no comment.

			// window. 30s budget tolerates -race overhead.

Note: This was intended as a PR review with an inline comment but was posted as a regular comment due to environment constraints.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.98%. Comparing base (7cd4e72) to head (f16ef0d).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2193      +/-   ##
===========================================
- Coverage    51.99%   51.98%   -0.01%     
===========================================
  Files          884      884              
  Lines       155503   155503              
===========================================
- Hits         80854    80842      -12     
- Misses       69448    69460      +12     
  Partials      5201     5201              

see 19 files with indirect coverage changes
see 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread core/state/race_off_test.go
@cffls cffls merged commit 366be79 into develop Apr 23, 2026
26 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants