all: fix nightly race test flakes by cffls · Pull Request #2193 · 0xPolygon/bor

cffls · 2026-04-21T19:31:48Z

The nightly-race workflow had been failing on every run for a month. Most failures were either race-detector overhead pushing test timings past narrow assertion margins, tests mutating shared global state that collided with parallel tests, or a single real data race that cascaded into many victim tests.

internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor aliased to the global BorConfig, then mutated MadhugiriBlock on it. Concurrent tests calling IsMadhugiri raced with the write. Deep-copy BorConfig like the sibling testBackendWithNilBorTx already does. This single race was the root cause of the ~10 cascading ethapi test failures (TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.).

consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the primary mock, leaving FetchStatus returning success. The registry's background probe then raced with the test's MarkUnhealthy call and could flip the primary back to healthy (or the active gauge back to 0) before the assertion ran. Make the primary mock fail FetchStatus too so probe and API are consistent.

core/state: TestConcurrentUsedParallelism measures wall-clock parallel speedup and asserts >=2x. Race instrumentation serializes atomic/mutex ops and skews the measurement to ~1.7x. Skip under -race via a new race_{on,off}_test.go build-tag pair; the test still guards against the global-lock regression in non-race runs.

core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock, ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect timeout that is genuinely too short once -race instrumentation is added (the bare test takes ~12s). Bump to 60s — still catches real deadlocks, no longer fires on legitimate completion.

eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that CI routinely exceeded under -race. Bump to 30s.
TestSkeletonSyncRetrievals had four 2s polling loops that could exit before background serving goroutines finished incrementing the served counter (the assertion then read a partial count). Bump the budget to 30s, cap per-iteration sleep at 500ms so exponential backoff stays responsive, and fold the served counter into the polling condition so we wait for both subchain state and served totals before asserting.

eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving only 100ms of slack before the client-side 2s timeout fired; under -race that slack was easily exhausted. Halve the handler sleep to rpcTimeout/2 — still proves parallelism (3 serial calls would exceed the unchanged 2s upper bound) with comfortable margin for -race.

metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded global RNG and wall-clock timing, making the reservoir's average a noisy statistic that occasionally drifted outside [14, 16] under -race. Drive the test with a seeded RNG and synthesised monotonic timestamps; override t0/t1 after NewExpDecaySample so dt in update() stays positive. Test still catches the priority-overflow regression (average would stick at 10 under the buggy formula).

The nightly-race workflow had been failing on every run for a month. Most failures were either race-detector overhead pushing test timings past narrow assertion margins, tests mutating shared global state that collided with parallel tests, or a single real data race that cascaded into many victim tests. internal/ethapi: testBackendWithPreMadhuguriBorReceipt.ChainConfig() did a shallow copy of params.AllEthashProtocolChanges, leaving cfg.Bor aliased to the global BorConfig, then mutated MadhugiriBlock on it. Concurrent tests calling IsMadhugiri raced with the write. Deep-copy BorConfig like the sibling testBackendWithNilBorTx already does. This single race was the root cause of the ~10 cascading ethapi test failures (TestBorForks, TestBorGetLogs_*, TestCoinbase, TestEstimateGas, etc.). consensus/bor/heimdall: TestFailover_SwitchOnPrimaryDown and TestRegistry_MarkUnhealthyOnRealFailure set only getSpanFn on the primary mock, leaving FetchStatus returning success. The registry's background probe then raced with the test's MarkUnhealthy call and could flip the primary back to healthy (or the active gauge back to 0) before the assertion ran. Make the primary mock fail FetchStatus too so probe and API are consistent. core/state: TestConcurrentUsedParallelism measures wall-clock parallel speedup and asserts >=2x. Race instrumentation serializes atomic/mutex ops and skews the measurement to ~1.7x. Skip under -race via a new race_{on,off}_test.go build-tag pair; the test still guards against the global-lock regression in non-race runs. core/txpool/legacypool: TestLockOrdering_{PricedHeapNoDeadlock, ReplacePendingNoDeadlock,RemovedNoDeadlock} used a 10s deadlock-detect timeout that is genuinely too short once -race instrumentation is added (the bare test takes ~12s). Bump to 60s — still catches real deadlocks, no longer fires on legitimate completion. eth/downloader: TestBeaconSync68/69Full used a 3s sync timeout that CI routinely exceeded under -race. Bump to 30s. TestSkeletonSyncRetrievals had four 2s polling loops that could exit before background serving goroutines finished incrementing the served counter (the assertion then read a partial count). Bump the budget to 30s, cap per-iteration sleep at 500ms so exponential backoff stays responsive, and fold the served counter into the polling condition so we wait for both subchain state and served totals before asserting. eth/relay: TestCheckTxStatus, TestSubmitPreconfTx, TestSubmitPrivateTx parallel subtests had handlers sleeping for rpcTimeout-100ms, leaving only 100ms of slack before the client-side 2s timeout fired; under -race that slack was easily exhausted. Halve the handler sleep to rpcTimeout/2 — still proves parallelism (3 serial calls would exceed the unchanged 2s upper bound) with comfortable margin for -race. metrics: TestExpDecaySampleNanosecondRegression relied on the unseeded global RNG and wall-clock timing, making the reservoir's average a noisy statistic that occasionally drifted outside [14, 16] under -race. Drive the test with a seeded RNG and synthesised monotonic timestamps; override t0/t1 after NewExpDecaySample so dt in update() stays positive. Test still catches the priority-overflow regression (average would stick at 10 under the buggy formula). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

_{Tip: disable this comment in your organization's Code Review settings.}

sonarqubecloud · 2026-04-21T19:33:05Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

claude · 2026-04-21T19:52:41Z

test comment

claude · 2026-04-21T19:54:41Z

Code review

1 issue found. Checked for bugs and CLAUDE.md compliance.

eth/downloader/skeleton_test.go line 922 (inline comment):

Comment says "10s budget" but the loop on line 931 uses 30*time.Second. Should read "30s budget".

Per CLAUDE.md: comments should only be added when the WHY is non-obvious, and when added they must be accurate. A factually wrong concrete number is worse than no comment.

			// window. 30s budget tolerates -race overhead.

Note: This was intended as a PR review with an inline comment but was posted as a regular comment due to environment constraints.

codecov · 2026-04-21T20:00:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.98%. Comparing base (7cd4e72) to head (f16ef0d).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2193      +/-   ##
===========================================
- Coverage    51.99%   51.98%   -0.01%     
===========================================
  Files          884      884              
  Lines       155503   155503              
===========================================
- Hits         80854    80842      -12     
- Misses       69448    69460      +12     
  Partials      5201     5201

see 19 files with indirect coverage changes
see 19 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

claude Bot reviewed Apr 21, 2026

View reviewed changes

lucca30 reviewed Apr 22, 2026

View reviewed changes

Comment thread core/state/race_off_test.go

lucca30 approved these changes Apr 22, 2026

View reviewed changes

kamuikatsurgi approved these changes Apr 23, 2026

View reviewed changes

cffls merged commit 366be79 into develop Apr 23, 2026
26 of 29 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

all: fix nightly race test flakes#2193

all: fix nightly race test flakes#2193
cffls merged 1 commit into
developfrom
fix_race

cffls commented Apr 21, 2026

Uh oh!

claude Bot left a comment

Uh oh!

sonarqubecloud Bot commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026

Uh oh!

codecov Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cffls commented Apr 21, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

sonarqubecloud Bot commented Apr 21, 2026

Quality Gate passed

Uh oh!

claude Bot commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026

Code review

Uh oh!

codecov Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Apr 21, 2026 •

edited

Loading