consensus/bor: fix race in SpanStore.PurgeCache #2235

Merged
kamuikatsurgi merged 2 commits into develop from lmartins/fix-span-store-purgecache-race
May 26, 2026

@lucca30
Contributor

@lucca30 lucca30 commented May 19, 2026

Summary

SpanStore.PurgeCache clears latestSpanCache (and the other cache fields) via an atomic Store(nil), but does not stop the background polling goroutine that NewSpanStore spawns (span_store.go:62-89). That goroutine ticks every 200ms and writes the latest span back into latestSpanCache. If a tick lands between the clear and the caller's next read, the "purge" is silently undone.

The window is microscopic in the original test (the assertion runs in microseconds after the clear), so this manifests as a low-rate CI flake on TestSpanStore_PurgeCache rather than a consistent failure.

Where it showed up

Caught while iterating on #2192.

PurgeCache is test-only — it is only invoked from bor_test.go, tests/bor/bor_test.go, and the span-store test itself — so changing its lifecycle behavior is low-risk.

Fix

  • Extract the poll loop into a runPollLoop method tracked by a sync.WaitGroup.
  • Have PurgeCache cancel the context and Wait() for the goroutine to exit before resetting state. Close() uses the same path (stopPollLoop) so it's also race-safe.
  • PurgeCache no longer restarts the loop. On-demand reads via getLatestSpan/spanById fall back to inline updateLatestSpan when the cache is empty, so callers still get fresh data — they just don't get background warming until the next NewSpanStore. This matches the test-only use case.
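The lifecycle described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from span_store.go: the `store` type, `newStore`, `updateLatest`, `getLatest`, and `purgeStaysEmpty` are hypothetical stand-ins, the fetch is simulated with a counter, and the tick is shortened from 200ms to 5ms so the example runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// store is a hypothetical stand-in for SpanStore, not the real bor type.
type store struct {
	latest  atomic.Pointer[uint64] // stand-in for latestSpanCache
	fetches atomic.Uint64          // counts simulated upstream fetches
	cancel  context.CancelFunc
	wg      sync.WaitGroup
}

func newStore(tick time.Duration) *store {
	ctx, cancel := context.WithCancel(context.Background())
	s := &store{cancel: cancel}
	s.wg.Add(1) // register the goroutine before starting it
	go s.runPollLoop(ctx, tick)
	return s
}

// updateLatest simulates fetching the latest span and caching it.
func (s *store) updateLatest() uint64 {
	v := s.fetches.Add(1)
	s.latest.Store(&v)
	return v
}

// runPollLoop warms the cache on every tick until the context is cancelled.
func (s *store) runPollLoop(ctx context.Context, tick time.Duration) {
	defer s.wg.Done()
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.updateLatest()
		}
	}
}

// stopPollLoop cancels the loop and waits for it to exit. Safe to call twice.
func (s *store) stopPollLoop() {
	s.cancel()
	s.wg.Wait()
}

// PurgeCache joins the goroutine first, so nothing can write after the clear.
func (s *store) PurgeCache() {
	s.stopPollLoop()
	s.latest.Store(nil)
}

// getLatest falls back to an inline fetch when the cache is empty.
func (s *store) getLatest() uint64 {
	if p := s.latest.Load(); p != nil {
		return *p
	}
	return s.updateLatest()
}

// purgeStaysEmpty reports whether the cache is still empty several ticks after a purge.
func purgeStaysEmpty() bool {
	s := newStore(5 * time.Millisecond)
	time.Sleep(30 * time.Millisecond) // let the loop warm the cache
	s.PurgeCache()
	time.Sleep(20 * time.Millisecond) // several ticks would have fired by now
	return s.latest.Load() == nil
}

func main() {
	fmt.Println("cache empty after purge:", purgeStaysEmpty())
}
```

Because the clear happens only after `wg.Wait()` returns, no tick can land between the clear and the next read. The trade-off is the one the description names: after a purge, reads are served by the inline fallback rather than by background warming.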

Also slimmed the loop body by extracting the rate-limited error logger into a small helper so the new method stays under diffguard's cognitive-complexity threshold.
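For readers unfamiliar with the pattern, a rate-limited error logger helper of the kind mentioned above might look like the sketch below. The PR text does not show the helper, so everything here is an assumption made for illustration: the `errorLogLimiter` type, the `allow` method, the one-second interval, and the `loggedCount` driver are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// errorLogLimiter is a hypothetical helper: it allows at most one log line
// per interval. It is not safe for concurrent use; the sketch assumes only
// the poll loop goroutine calls it.
type errorLogLimiter struct {
	interval time.Duration
	last     time.Time // zero until the first allowed log
}

// allow reports whether an error occurring at time now should be logged.
func (l *errorLogLimiter) allow(now time.Time) bool {
	if !l.last.IsZero() && now.Sub(l.last) < l.interval {
		return false // still inside the quiet window
	}
	l.last = now
	return true
}

// loggedCount simulates ten errors spaced 200ms apart with a 1s limit.
func loggedCount() int {
	l := &errorLogLimiter{interval: time.Second}
	start := time.Unix(0, 0)
	logged := 0
	for i := 0; i < 10; i++ {
		now := start.Add(time.Duration(i) * 200 * time.Millisecond)
		if l.allow(now) {
			logged++
		}
	}
	return logged
}

func main() {
	fmt.Println("logged:", loggedCount()) // errors at 0ms and 1000ms pass the limiter
}
```

Moving the time check into one small method leaves the loop body with a single branch per error, which is the kind of reduction a cognitive-complexity check rewards.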

Reproducer

TestSpanStore_PurgeCache_RaceWithPollLoop sleeps 300ms after PurgeCache (past one tick of the 200ms poll). On develop it fails 5/5 with the same span Id:0x2 signature as the CI flake; with the fix it passes deterministically.
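The shape of such a reproducer can be sketched outside the bor codebase. The example below is an illustration under assumptions, not the actual test: `poller`, `purgeNaive`, `purgeJoined`, and `refilledAfter` are hypothetical names, and the timings are scaled down (10ms tick, 100ms sleep) from the 200ms tick and 300ms sleep described above.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// poller is a hypothetical stand-in for a store with a background poll loop.
type poller struct {
	cache  atomic.Pointer[int]
	cancel context.CancelFunc
	done   chan struct{} // closed when the loop goroutine exits
}

func startPoller(tick time.Duration) *poller {
	ctx, cancel := context.WithCancel(context.Background())
	p := &poller{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(p.done)
		t := time.NewTicker(tick)
		defer t.Stop()
		for n := 1; ; n++ {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				v := n
				p.cache.Store(&v) // write the "latest span" back
			}
		}
	}()
	return p
}

// purgeNaive mirrors the old behavior: clear the cache, leave the loop running.
func (p *poller) purgeNaive() { p.cache.Store(nil) }

// purgeJoined mirrors the fix: stop and join the loop, then clear.
func (p *poller) purgeJoined() {
	p.cancel()
	<-p.done
	p.cache.Store(nil)
}

// refilledAfter purges, sleeps well past one tick, and reports whether
// the cache was written again.
func refilledAfter(purge func(*poller)) bool {
	p := startPoller(10 * time.Millisecond)
	defer p.cancel()
	purge(p)
	time.Sleep(100 * time.Millisecond)
	return p.cache.Load() != nil
}

func main() {
	fmt.Println("naive purge refilled: ", refilledAfter((*poller).purgeNaive))
	fmt.Println("joined purge refilled:", refilledAfter((*poller).purgeJoined))
}
```

Sleeping past a tick is what turns the low-rate flake into a reliable failure: the naive variant gives the loop time to write again, while the joined variant has no goroutine left to do so.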

Test plan

  • go build ./... clean
  • go test -count=20 -run TestSpanStore_PurgeCache ./consensus/bor/ — 40/40 pass (new reproducer + original)
  • go test -race ./consensus/bor/... — 526 tests pass
  • diffguard --base origin/develop --skip-mutation . — all sections PASS

PurgeCache cleared latestSpanCache via atomic Store(nil) but did not stop
the background polling goroutine started by NewSpanStore. That goroutine
ticks every 200ms and writes the latest span back into latestSpanCache,
silently undoing the purge whenever a tick lands between the clear and
the caller's next read.

Fix: extract the loop into runPollLoop, track it with a WaitGroup, and
have PurgeCache stop and join the goroutine before resetting state.
Close uses the same path. PurgeCache no longer restarts the loop —
on-demand reads via getLatestSpan fall back to updateLatestSpan inline,
so callers still get fresh data without the race window.

Adds TestSpanStore_PurgeCache_RaceWithPollLoop, a deterministic
reproducer that sleeps past one tick before asserting. Fails reliably on
develop, passes with the fix.


claude Bot commented May 19, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

With the poll loop stopped after PurgeCache, a stale heimdallStatus
(typically CatchingUp:false) would persist and let
waitUntilHeimdallIsSynced return immediately without refreshing against
a freshly-swapped heimdall client. Clear it alongside the other
atomics, and extend the reproducer to assert the invariant.
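The stale-status hazard in this commit message can be illustrated with a small sketch. It assumes, as the message implies but does not show, that the sync check trusts a cached status when one is present and fetches from the client only when the cache is empty. The `store`, `statusClient`, `fixedClient`, `isSynced`, `purge`, and `syncedAfterSwap` names are hypothetical; only `heimdallStatus` and `CatchingUp` come from the text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// heimdallStatus mirrors only the one field named in the commit message.
type heimdallStatus struct{ CatchingUp bool }

// statusClient is a hypothetical interface for fetching the status.
type statusClient interface{ FetchStatus() heimdallStatus }

// fixedClient always reports the same status.
type fixedClient struct{ catchingUp bool }

func (c fixedClient) FetchStatus() heimdallStatus {
	return heimdallStatus{CatchingUp: c.catchingUp}
}

// store is a hypothetical stand-in; no poll loop is running, as after PurgeCache.
type store struct {
	client statusClient
	status atomic.Pointer[heimdallStatus]
}

// isSynced trusts the cached status if present, else fetches inline (assumed behavior).
func (s *store) isSynced() bool {
	st := s.status.Load()
	if st == nil {
		fresh := s.client.FetchStatus()
		st = &fresh
		s.status.Store(st)
	}
	return !st.CatchingUp
}

// purge optionally clears the cached status alongside the other cache fields.
func (s *store) purge(clearStatus bool) {
	if clearStatus {
		s.status.Store(nil)
	}
}

// syncedAfterSwap caches a synced status, purges, swaps in a client that is
// still catching up, and reports what the store then believes.
func syncedAfterSwap(clearStatus bool) bool {
	s := &store{client: fixedClient{catchingUp: false}}
	_ = s.isSynced() // caches CatchingUp:false
	s.purge(clearStatus)
	s.client = fixedClient{catchingUp: true}
	return s.isSynced()
}

func main() {
	fmt.Println("status kept:   ", syncedAfterSwap(false)) // stale answer: true
	fmt.Println("status cleared:", syncedAfterSwap(true))  // refreshed answer: false
}
```

With no background loop left to refresh it, a status that survives the purge is never corrected, which is why clearing it together with the other atomics matters.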

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.68%. Comparing base (92e427a) to head (fd2eccd).
⚠️ Report is 1 commits behind head on develop.

@@             Coverage Diff             @@
##           develop    #2235      +/-   ##
===========================================
+ Coverage    52.58%   52.68%   +0.10%     
===========================================
  Files          885      885              
  Lines       156286   156686     +400     
===========================================
+ Hits         82179    82547     +368     
- Misses       68845    68878      +33     
+ Partials      5262     5261       -1     
Files with missing lines Coverage Δ
consensus/bor/span_store.go 92.06% <100.00%> (+1.33%) ⬆️

... and 31 files with indirect coverage changes


Contributor

@cffls cffls left a comment

lgtm, thanks!

@kamuikatsurgi kamuikatsurgi merged commit 4eb7548 into develop May 26, 2026
28 of 30 checks passed
@kamuikatsurgi kamuikatsurgi deleted the lmartins/fix-span-store-purgecache-race branch May 26, 2026 04:54