consensus/bor: fix race in SpanStore.PurgeCache#2235
Merged
Conversation
PurgeCache cleared latestSpanCache via atomic Store(nil) but did not stop the background polling goroutine started by NewSpanStore. That goroutine ticks every 200ms and writes the latest span back into latestSpanCache, silently undoing the purge whenever a tick lands between the clear and the caller's next read. Fix: extract the loop into runPollLoop, track it with a WaitGroup, and have PurgeCache stop and join the goroutine before resetting state. Close uses the same path. PurgeCache no longer restarts the loop — on-demand reads via getLatestSpan fall back to updateLatestSpan inline, so callers still get fresh data without the race window. Adds TestSpanStore_PurgeCache_RaceWithPollLoop, a deterministic reproducer that sleeps past one tick before asserting. Fails reliably on develop, passes with the fix.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
With the poll loop stopped after PurgeCache, a stale heimdallStatus (typically CatchingUp:false) would persist and let waitUntilHeimdallIsSynced return immediately without refreshing against a freshly-swapped heimdall client. Clear it alongside the other atomics, and extend the reproducer to assert the invariant.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #2235 +/- ##
===========================================
+ Coverage 52.58% 52.68% +0.10%
===========================================
Files 885 885
Lines 156286 156686 +400
===========================================
+ Hits 82179 82547 +368
- Misses 68845 68878 +33
+ Partials 5262 5261 -1
... and 31 files with indirect coverage changes
🚀 New features to boost your workflow:
|
manav2401
approved these changes
May 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
SpanStore.PurgeCacheclearslatestSpanCache(and the other cache fields) via an atomicStore(nil), but does not stop the background polling goroutine thatNewSpanStorespawns (span_store.go:62-89). That goroutine ticks every 200ms and writes the latest span back intolatestSpanCache. If a tick lands between the clear and the caller's next read, the "purge" is silently undone.The window is microscopic in the original test (the assertion runs in microseconds after the clear), so this manifests as a low-rate CI flake on
TestSpanStore_PurgeCacherather than a consistent failure.Where it showed up
Caught while iterating on #2192:
latestSpanCachecontainedspan Id:0x2— the exact value the mock heimdall client returns fromGetLatestSpan, confirming the polling goroutine wrote it back after the purge.PurgeCacheis test-only — it is only invoked frombor_test.go,tests/bor/bor_test.go, and the span-store test itself — so changing its lifecycle behavior is low-risk.Fix
runPollLoopmethod tracked by async.WaitGroup.PurgeCachecancel the context andWait()for the goroutine to exit before resetting state.Close()uses the same path (stopPollLoop) so it's also race-safe.PurgeCacheno longer restarts the loop. On-demand reads viagetLatestSpan/spanByIdfall back to inlineupdateLatestSpanwhen the cache is empty, so callers still get fresh data — they just don't get background warming until the nextNewSpanStore. This matches the test-only use case.Also slimmed the loop body by extracting the rate-limited error logger into a small helper so the new method stays under diffguard's cognitive-complexity threshold.
Reproducer
TestSpanStore_PurgeCache_RaceWithPollLoopsleeps 300ms afterPurgeCache(past one tick of the 200ms poll). Ondevelopit fails 5/5 with the samespan Id:0x2signature as the CI flake; with the fix it passes deterministically.Test plan
go build ./...cleango test -count=20 -run TestSpanStore_PurgeCache ./consensus/bor/— 40/40 pass (new reproducer + original)go test -race ./consensus/bor/...— 526 tests passdiffguard --base origin/develop --skip-mutation .— all sections PASS