fix(pruning): wire periodic janitor for partition pruner caches#450
Conversation
internal/pruning/partition_pruner.go has two TTL caches — globCache (30s) and partitionCache (60s) — both used on every query. Their get() returns "expired" as a miss but does NOT evict the stale entry; neither cache has a max-size cap. The public CleanupGlobCache() and CleanupPartitionCache() methods existed since the 2024-12 and 2026-01 cache PRs but had ZERO production callers (verified: only definitions + test files + a progress doc reference them). High-cardinality workloads accumulate map entries monotonically until either InvalidateAllCaches() runs post- compaction or the process exits — one component of the 24h RSS climb on the satellite-data demo. Fix: PartitionPruner.StartCleanup(ctx, interval) spawns a janitor goroutine that sweeps both caches at DefaultCleanupInterval = 30s (matches the shorter TTL → bounds worst-case retention at ~2× TTL). Exits cleanly on ctx cancel. Idempotent via cleanupStarted atomic — repeat calls log a warn and return without spawning a second goroutine. QueryHandler.StartBackgroundWorkers(ctx) is the seam from main.go; cmd/arc/main.go wires it via context.WithCancel + a shutdown hook at PriorityHTTPServer (= 10, the earliest tier — the janitor has no flush state, just a ticker + an in-memory map). Matches the WAL maintenance pattern at line 730. Identified by the 2026-05-22 memory-retention triage.
There was a problem hiding this comment.
Code Review
This pull request implements a background janitor for the partition pruner to periodically evict expired entries from the glob and partition caches, addressing potential memory retention issues. The implementation includes a new StartCleanup method with idempotency guards, integration with the application's shutdown coordinator, and several unit tests verifying eviction and lifecycle management. Feedback suggests running the individual cache cleanup tasks concurrently to ensure that a delay in one does not impact the other.
| p.CleanupGlobCache() | ||
| p.CleanupPartitionCache() |
There was a problem hiding this comment.
While the cleanup operations are expected to be fast, running them in parallel would make the janitor more robust. Since CleanupGlobCache and CleanupPartitionCache operate on different data structures with separate locks, they can be executed concurrently. This would prevent a potential slowdown in one cache's cleanup from delaying the other.
var wg sync.WaitGroup
wg.Add(2)
go func() {
defer wg.Done()
p.CleanupGlobCache()
}()
go func() {
defer wg.Done()
p.CleanupPartitionCache()
}()
wg.Wait()|
The implementation looks excellent. You have correctly identified the memory growth issue and provided a robust, idiomatic solution. The use of |
Gemini medium finding: run glob + partition cache cleanups on independent goroutines per tick. The two mutexes are unrelated (globCache.mu vs partitionCache.mu) so a future hot-glob-cache that grows between sweeps shouldn't delay the partition sweep behind it, or vice versa. WaitGroup keeps the tick boundary sensible — both must finish before the next ticker fires. Also adds the missing release notes entry under "Bug fixes" in RELEASE_NOTES_2026.06.1.md (oversight in 3329e68).
|
Round-1 fix pushed in 09e788c, plus the release notes entry I forgot to include in the initial commit: Concurrent sweeps: Release notes: added a "Partition pruner caches no longer grow unboundedly over process lifetime" entry under All tests still green under /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a background janitor to periodically sweep expired entries from the partition pruner's glob and partition caches, addressing a memory leak where stale entries accumulated indefinitely. The implementation includes lifecycle management via the shutdown coordinator, idempotency guards using atomic flags, and new unit tests. Review feedback suggests removing internal metadata—such as PR numbers and AI model references—from source code comments. Additionally, it was recommended to simplify the cleanup loop by executing the cache sweeps sequentially, as the overhead of spawning goroutines and managing a WaitGroup likely exceeds the cost of the lightweight map operations being performed.
| // goroutine per pruner. Idempotent on repeat calls — second and | ||
| // later invocations log a warn and return without spawning. Guards | ||
| // against a future hot-reload or test refactor accidentally | ||
| // multiplying goroutines silently. Internal-review M1 on PR #450. |
There was a problem hiding this comment.
Please remove internal development references such as PR numbers and specific review round identifiers from the code comments. These details are useful during the pull request phase but add unnecessary noise to the production codebase's long-term maintenance.
| // multiplying goroutines silently. Internal-review M1 on PR #450. | |
| // multiplying goroutines silently. |
| // The two cleanups run on independent goroutines per tick because they | ||
| // take unrelated mutexes (globCache.mu vs partitionCache.mu). A future | ||
| // hot-glob-cache that grows entries between sweeps shouldn't delay the | ||
| // partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.) |
There was a problem hiding this comment.
Avoid including references to specific AI models or internal PR numbers in production code comments. This information is meta-data that belongs in the commit history or PR description rather than the source code itself.
| // partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.) | |
| // partition sweep behind it, or vice versa. |
| var wg sync.WaitGroup | ||
| wg.Add(2) | ||
| go func() { | ||
| defer wg.Done() | ||
| p.CleanupGlobCache() | ||
| }() | ||
| go func() { | ||
| defer wg.Done() | ||
| p.CleanupPartitionCache() | ||
| }() | ||
| wg.Wait() | ||
| } |
There was a problem hiding this comment.
Spawning two goroutines and using a sync.WaitGroup for every cleanup tick introduces unnecessary overhead for operations that are extremely lightweight (simple map iterations with a mutex). While the intent to prevent cross-cache delays is noted, map iteration and deletion are fast enough that sequential execution would be more idiomatic and efficient here. If you prefer to keep them concurrent, consider removing the WaitGroup and just spawning the goroutines, as the ticker naturally provides the necessary periodic scheduling without needing to block the loop.
case <-ticker.C:
p.CleanupGlobCache()
p.CleanupPartitionCache()Gemini round 2 reversed its round-1 suggestion: parallelising the two cache sweeps adds more overhead (goroutine spawn + WaitGroup sync ~ hundreds of ns) than it removes (the cleanup() itself is ~µs at realistic key cardinality). Round 1 was a pessimisation. Reverting to sequential, but keeping the comment noting the trigger for revisiting: a profile showing the sweep itself blocking concurrent get() calls under a hot-cache workload. Also drops the "concurrent sweep" bullet from the 26.06.1 release notes. Declining the round-2 metadata-cleanup suggestion (remove PR numbers and "Gemini round N" references from comments) — the project convention, established across PRs #442–#449, is to keep these audit trails in code. Future reviewers benefit from seeing "this shape exists because round N flagged X."
|
Round-2 fixes pushed in 780a1a9: Sequential sweep (reversal of round-1 suggestion) — you're right, the goroutine-spawn + WaitGroup overhead exceeds the actual cleanup cost at realistic key cardinality. Reverted to the sequential shape. The comment now explicitly notes the trigger for revisiting (a profile showing sweep blocking concurrent Declining the metadata-cleanup suggestion — keeping the All tests still green under /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a background janitor goroutine to periodically sweep expired entries from the globCache and partitionCache within the PartitionPruner, addressing a memory leak where stale entries accumulated indefinitely. The janitor is integrated into the main application lifecycle via the shutdown coordinator and includes unit tests for eviction, cancellation, and idempotency. Review feedback suggests enforcing a minimum cleanup interval to prevent CPU-intensive spin loops and resetting the cleanupStarted flag upon goroutine termination to support component restarts or hot-reloads.
| if interval <= 0 { | ||
| interval = DefaultCleanupInterval | ||
| } |
There was a problem hiding this comment.
While the code correctly handles non-positive intervals by falling back to the default, it does not prevent extremely small positive intervals (e.g., 1ns). A very small interval could lead to a CPU-intensive spin loop in the background goroutine. It is recommended to enforce a minimum sensible interval (e.g., 100ms) to prevent accidental performance degradation.
| if interval <= 0 { | |
| interval = DefaultCleanupInterval | |
| } | |
| if interval <= 0 { | |
| interval = DefaultCleanupInterval | |
| } else if interval < 100*time.Millisecond { | |
| interval = 100 * time.Millisecond | |
| } |
References
- Enforce defensive programming by ensuring that invalid inputs (like an extremely small ticker interval) are safely handled to prevent performance issues.
| func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) { | ||
| ticker := time.NewTicker(interval) |
There was a problem hiding this comment.
The cleanupStarted flag is currently set to true but never reset to false. If the background goroutine exits (e.g., because the context was cancelled), subsequent calls to StartCleanup will be ignored even if a new, valid context is provided. To support scenarios like hot-reloads or component restarts (as mentioned in the comments), the flag should be reset when the goroutine terminates.
| func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) { | |
| ticker := time.NewTicker(interval) | |
| func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) { | |
| defer p.cleanupStarted.Store(false) | |
| ticker := time.NewTicker(interval) |
Gemini round-3 finding (1 of 2): a caller passing sub-millisecond interval would pin a CPU in a tight ticker loop. Adds a 1ms floor via minCleanupInterval; sub-floor requests log a Warn naming both the requested and clamped values. 1ms is several orders of magnitude below any plausible production interval AND below every fast-iteration test in this package (which use 5-10ms). Declining round-3 finding (2 of 2) — "reset cleanupStarted on goroutine termination to support hot-reload": YAGNI. There is no hot-reload path today and adding the reset expands the API surface (state-machine invariants, more atomics, tests for the overlap window between cancel and second-start). Will revisit when a concrete restart use case lands.
|
Round-3 fixes pushed in 6723ed1: Accepted (1/2): minimum cleanup interval clamp. Added Declining (2/2): reset 5 tests green under Side note: I want to flag something interesting that I think is worth knowing for both of us. Round 1 told us to parallelise the sweeps. Round 2 reversed that and told us to serialise. Both presented with the same confidence; neither pass measured. This is a pattern with LLM perf reviews — plausible-sounding architectural priors that can land on either side depending on context. The right move for the next perf-shaped suggestion is to bench first — same posture we already use for declining non-perf findings, just symmetric on the accept side. /gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a background janitor to the partition pruner to prevent unbounded cache growth by periodically sweeping expired entries. The changes include wiring the janitor into the application lifecycle with graceful shutdown, implementing the cleanup logic with idempotency and safety clamping, and adding comprehensive tests. Review feedback focuses on improving code maintainability by removing ephemeral meta-comments, internal review references, and external links from the source code, as well as standardizing spelling.
| // goroutine per pruner. Idempotent on repeat calls — second and | ||
| // later invocations log a warn and return without spawning. Guards | ||
| // against a future hot-reload or test refactor accidentally | ||
| // multiplying goroutines silently. Internal-review M1 on PR #450. |
There was a problem hiding this comment.
| // minCleanupInterval is the smallest interval StartCleanup will honor. | ||
| // Anything below this is clamped up to prevent a misconfigured caller | ||
| // (or a fuzz input that passes a sub-millisecond duration) from | ||
| // pinning a CPU core in a tight ticker loop. 1ms is a generous floor | ||
| // — orders of magnitude below any plausible production interval and | ||
| // any fast-iteration test (tests in this package use 5–10ms). | ||
| // (Gemini round 3 / PR #450.) |
There was a problem hiding this comment.
Remove meta-comments referencing review rounds. This information is ephemeral and does not add value to the long-term maintenance of the code.
| // minCleanupInterval is the smallest interval StartCleanup will honor. | |
| // Anything below this is clamped up to prevent a misconfigured caller | |
| // (or a fuzz input that passes a sub-millisecond duration) from | |
| // pinning a CPU core in a tight ticker loop. 1ms is a generous floor | |
| // — orders of magnitude below any plausible production interval and | |
| // any fast-iteration test (tests in this package use 5–10ms). | |
| // (Gemini round 3 / PR #450.) | |
| // minCleanupInterval is the smallest interval StartCleanup will honor. | |
| // Anything below this is clamped up to prevent a misconfigured caller | |
| // (or a fuzz input that passes a sub-millisecond duration) from | |
| // pinning a CPU core in a tight ticker loop. 1ms is a generous floor | |
| // — orders of magnitude below any plausible production interval and | |
| // any fast-iteration test (tests in this package use 5–10ms). |
| // fills that gap. See https://github.com/Basekick-Labs/arc/pull/450 | ||
| // (Q2 finding from the 2026-05-22 memory-retention triage). |
There was a problem hiding this comment.
Remove external PR links and internal triage references from code comments. The technical context provided in the comment is sufficient for future maintainers.
| // fills that gap. See https://github.com/Basekick-Labs/arc/pull/450 | |
| // (Q2 finding from the 2026-05-22 memory-retention triage). | |
| // fills that gap. |
| // goroutine-spawn + WaitGroup-sync overhead that parallelising the | ||
| // two sweeps would add. Run sequentially; if a future profile ever | ||
| // shows the sweep itself blocking concurrent get() calls under a | ||
| // hot-cache workload, that's the trigger to revisit. (Gemini round 2 | ||
| // / PR #450 reversed a round-1 suggestion to parallelise after | ||
| // pointing out the overhead inversion.) |
There was a problem hiding this comment.
Correct the spelling of 'parallelizing' to match the American English standard used throughout the project (e.g., 'optimized', 'initialization'). Also, remove the meta-comment referencing review history.
| // goroutine-spawn + WaitGroup-sync overhead that parallelising the | |
| // two sweeps would add. Run sequentially; if a future profile ever | |
| // shows the sweep itself blocking concurrent get() calls under a | |
| // hot-cache workload, that's the trigger to revisit. (Gemini round 2 | |
| // / PR #450 reversed a round-1 suggestion to parallelise after | |
| // pointing out the overhead inversion.) | |
| // goroutine-spawn + WaitGroup-sync overhead that parallelizing the | |
| // two sweeps would add. Run sequentially; if a future profile ever | |
| // shows the sweep itself blocking concurrent get() calls under a | |
| // hot-cache workload, that's the trigger to revisit. |
…und 4) Drops the "Gemini round N / PR #450" stamps and the 2026-05-22-triage reference from comments. For a 4-round PR on a small fix, the audit trail belongs in the PR description and commit log, not in the source. (The X1 PR kept references in-source because 12 review rounds genuinely needed the trail; calibrate per PR size.) Also "parallelising" → "parallelizing" to match the project's predominantly US spelling. No behavior change.
|
Round-4 fixes pushed in 80264bb — dropped the No behavior change; 5 tests still green under PR is ready for merge from my side. |
Summary
pruner.CleanupGlobCache()+pruner.CleanupPartitionCache()every 30s. Pre-PR both methods existed and were ready to use but had zero production callers — both maps grew monotonically over the process lifetime.*PartitionPruner; started viaQueryHandler.StartBackgroundWorkers(ctx)fromcmd/arc/main.gowith a shutdown hook atPriorityHTTPServer. Idempotent viacleanupStartedatomic — repeat calls are no-ops.Why this is a real bug, not a perf knob
get()on both caches returns "expired" as a cache miss but does not remove the stale entry (verified atpartition_pruner.go:45-65and:188-206). Neither cache has a max-size cap. Under satellite-shape workloads with high-cardinality glob patterns or distinct(path, sql)keys, both maps accumulate forever — capped only by post-compactionInvalidateAllCaches()(which itself is rare and not a designed eviction strategy). The 2026-01-25 progress doc explicitly says "prevents unbounded cache growth" aboutCleanupPartitionCache()and then nobody wired the cleanup.Test plan
go test -race ./internal/pruning/... ./internal/api/...— all green, including 4 new testsgo vet ./internal/pruning/... ./internal/api/... ./cmd/arc/...cleangofmt -lclean on all 4 touched filescmd/arc/main.go):Partition pruner cache janitor started interval=30000at bootPartition pruner cache janitor stoppedon SIGINTcleanupStarted.CompareAndSwap, M2 doc accuracy onStartBackgroundWorkers)Tests added
TestStartCleanup_RemovesExpiredEntries— seeds both caches, starts janitor with sub-TTL interval, asserts entries gone within 500ms.TestStartCleanup_StopsOnContextCancel— verifies the goroutine exits on ctx cancel by seeding a post-cancel entry and confirming it does NOT get swept.TestStartCleanup_DefaultInterval—interval=0→DefaultCleanupInterval, no panic.TestStartCleanup_Idempotent— repeatStartCleanupcalls don't spawn additional goroutines; the first janitor remains alive and sweeping.Out of scope
mainby PRs fix(query): break streaming loop on Flush error so DuckDB result buffers free on client disconnect #424 and fix(storage): release glibc heap pages after S3 retention/delete #420 respectively — the satellite demo (v26.05.1) needs an upgrade to pick those up; this PR is the only finding that's still broken onmain.🤖 Generated with Claude Code