fix(pruning): wire periodic janitor for partition pruner caches by xe-nvdk · Pull Request #450 · Basekick-Labs/arc

xe-nvdk · 2026-05-22T23:38:19Z

Summary

Wires a background janitor goroutine to call pruner.CleanupGlobCache() + pruner.CleanupPartitionCache() every 30s. Pre-PR both methods existed and were ready to use but had zero production callers — both maps grew monotonically over the process lifetime.
One component of the 24h RSS climb on the satellite-data demo container (identified by the 2026-05-22 memory-retention triage).
Goroutine is owned by *PartitionPruner; started via QueryHandler.StartBackgroundWorkers(ctx) from cmd/arc/main.go with a shutdown hook at PriorityHTTPServer. Idempotent via cleanupStarted atomic — repeat calls are no-ops.
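
For readers unfamiliar with the wiring shape, here is a minimal sketch of "context.WithCancel + a shutdown hook". It is not Arc's code: `shutdownCoordinator`, `register`, `shutdown`, and `wireJanitor` are hypothetical names, and the rule that a lower priority number runs earlier is an assumption taken from the PR's description of `PriorityHTTPServer` (= 10) as the earliest tier.

```go
package main

import (
	"context"
	"sort"
)

// priorityHTTPServer mirrors the value quoted in the PR (10, earliest tier).
// Everything else in this file is a hypothetical stand-in for Arc's types.
const priorityHTTPServer = 10

type shutdownHook struct {
	name     string
	priority int
	fn       func()
}

type shutdownCoordinator struct {
	hooks []shutdownHook
}

// register records a hook; nothing runs until shutdown is called.
func (s *shutdownCoordinator) register(name string, priority int, fn func()) {
	s.hooks = append(s.hooks, shutdownHook{name: name, priority: priority, fn: fn})
}

// shutdown runs hooks in ascending priority order (assumed: lower runs
// first) and returns the names in the order they ran.
func (s *shutdownCoordinator) shutdown() []string {
	sort.SliceStable(s.hooks, func(i, j int) bool {
		return s.hooks[i].priority < s.hooks[j].priority
	})
	order := make([]string, 0, len(s.hooks))
	for _, h := range s.hooks {
		h.fn()
		order = append(order, h.name)
	}
	return order
}

// wireJanitor derives a cancellable context for the janitor and registers
// the cancel func as the shutdown hook. The janitor holds no flush state,
// so cancelling the context is the entire shutdown procedure.
func wireJanitor(parent context.Context, s *shutdownCoordinator) context.Context {
	ctx, cancel := context.WithCancel(parent)
	s.register("pruner-cache-janitor", priorityHTTPServer, cancel)
	return ctx
}

// demoWiring registers a later-tier hook, wires the janitor, runs shutdown,
// and reports whether the janitor context ended up cancelled.
func demoWiring() (cancelled bool, order []string) {
	s := &shutdownCoordinator{}
	s.register("storage-flush", 50, func() {})
	ctx := wireJanitor(context.Background(), s)
	order = s.shutdown()
	return ctx.Err() != nil, order
}
```

In this sketch the janitor hook runs before the later-tier hook even though it was registered second, because ordering is by priority rather than registration order.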

Why this is a real bug, not a perf knob

$ grep -rn "CleanupGlobCache\|CleanupPartitionCache" \
    --include='*.go' --exclude='*_test.go' \
    --exclude='*/partition_pruner.go'
# zero matches

get() on both caches returns "expired" as a cache miss but does not remove the stale entry (verified at partition_pruner.go:45-65 and :188-206). Neither cache has a max-size cap. Under satellite-shape workloads with high-cardinality glob patterns or distinct (path, sql) keys, both maps accumulate forever — capped only by post-compaction InvalidateAllCaches() (which itself is rare and not a designed eviction strategy). The 2026-01-25 progress doc explicitly says "prevents unbounded cache growth" about CleanupPartitionCache() and then nobody wired the cleanup.
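
To make the failure mode concrete, the sketch below reproduces it with a toy cache. It assumes nothing about the real `partition_pruner.go` beyond what is stated above: `ttlCache`, its methods, and `demoGrowth` are hypothetical names, and the clock is passed in explicitly so the example does not need to sleep.

```go
package main

import (
	"strconv"
	"sync"
	"time"
)

// ttlCache is a hypothetical stand-in for globCache / partitionCache.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]ttlEntry
}

type ttlEntry struct {
	value     string
	expiresAt time.Time
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, entries: make(map[string]ttlEntry)}
}

func (c *ttlCache) set(key, value string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = ttlEntry{value: value, expiresAt: now.Add(c.ttl)}
}

// get reports an expired entry as a miss but leaves it in the map,
// which is the behavior described in the PR.
func (c *ttlCache) get(key string, now time.Time) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || !now.Before(e.expiresAt) {
		return "", false
	}
	return e.value, true
}

// cleanup deletes every expired entry and returns how many it removed.
func (c *ttlCache) cleanup(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, e := range c.entries {
		if !now.Before(e.expiresAt) {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

func (c *ttlCache) size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.entries)
}

// demoGrowth inserts n distinct keys, reads them all after the TTL has
// passed (every read is a miss), and returns the map size before and
// after an explicit cleanup.
func demoGrowth(n int) (afterReads, afterSweep int) {
	c := newTTLCache(30 * time.Second)
	start := time.Now()
	for i := 0; i < n; i++ {
		c.set(strconv.Itoa(i), "v", start)
	}
	later := start.Add(time.Minute) // well past the 30s TTL
	for i := 0; i < n; i++ {
		c.get(strconv.Itoa(i), later)
	}
	afterReads = c.size()
	c.cleanup(later)
	return afterReads, c.size()
}
```

With no caller for `cleanup`, the map still holds all n entries after every one of them has expired and been read as a miss; only the explicit sweep brings it back to zero.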

Test plan

go test -race ./internal/pruning/... ./internal/api/... — all green, including 4 new tests
go vet ./internal/pruning/... ./internal/api/... ./cmd/arc/... clean
gofmt -l clean on all 4 touched files
Binary smoke (CLAUDE.md mandates this when touching cmd/arc/main.go):
- Partition pruner cache janitor started interval=30000 at boot
- Partition pruner cache janitor stopped on SIGINT
- Process exits cleanly without SIGKILL
Internal review (configuration matrix + deep reviewer): 0 blockers / 0 high / 2 medium — both addressed (M1 idempotent guard via cleanupStarted.CompareAndSwap, M2 doc accuracy on StartBackgroundWorkers)
Gemini Code Assist review (requested below)

Tests added

TestStartCleanup_RemovesExpiredEntries — seeds both caches, starts janitor with sub-TTL interval, asserts entries gone within 500ms.
TestStartCleanup_StopsOnContextCancel — verifies the goroutine exits on ctx cancel by seeding a post-cancel entry and confirming it does NOT get swept.
TestStartCleanup_DefaultInterval — interval=0 → DefaultCleanupInterval, no panic.
TestStartCleanup_Idempotent — repeat StartCleanup calls don't spawn additional goroutines; the first janitor remains alive and sweeping.
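
The stop-on-cancel check is the least obvious of these, so here is a rough sketch of the technique under simplified assumptions. It is not the PR's test: `expiringSet`, `runSweeper`, and `checkStopOnCancel` are hypothetical, every entry is treated as already expired, and the sketch waits on a `done` channel for determinism, which the real test may not have access to.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// expiringSet is a hypothetical cache in which every entry counts as
// already expired, so any sweep removes everything.
type expiringSet struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func (s *expiringSet) add(k string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.keys[k] = struct{}{}
}

func (s *expiringSet) sweep() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k := range s.keys {
		delete(s.keys, k)
	}
}

func (s *expiringSet) has(k string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.keys[k]
	return ok
}

// runSweeper starts a ticker loop and returns a channel that is closed
// once the goroutine has exited.
func runSweeper(ctx context.Context, interval time.Duration, sweep func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				sweep()
			}
		}
	}()
	return done
}

// checkStopOnCancel reports (1) whether an entry seeded while the sweeper
// was running got removed, and (2) whether an entry seeded after cancel
// survived several would-be intervals.
func checkStopOnCancel() (sweptWhileRunning, survivedAfterCancel bool) {
	set := &expiringSet{keys: map[string]struct{}{}}
	ctx, cancel := context.WithCancel(context.Background())
	done := runSweeper(ctx, 5*time.Millisecond, set.sweep)

	set.add("before-cancel")
	deadline := time.Now().Add(500 * time.Millisecond)
	for set.has("before-cancel") && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	sweptWhileRunning = !set.has("before-cancel")

	cancel()
	<-done // the goroutine has exited; no further sweeps can happen
	set.add("after-cancel")
	time.Sleep(25 * time.Millisecond) // several intervals' worth of time
	return sweptWhileRunning, set.has("after-cancel")
}
```

The post-cancel entry acts as a canary: if the goroutine had leaked, one of the next ticks would delete it.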

Out of scope

Adding a max-size LRU cap on either cache. The sweep-cost analysis (S2 in the internal review) said this is unlikely to bite at realistic key cardinality; revisit when telemetry shows sweep latency > 10ms.
The other findings from the 2026-05-22 triage (Q1 client-disconnect leak, Q3 + C3 glibc-arena retention via DuckDB pool). Q1 and Q3 are already addressed on main by PR #424 ("fix(query): break streaming loop on Flush error so DuckDB result buffers free on client disconnect") and PR #420 ("fix(storage): release glibc heap pages after S3 retention/delete") respectively; the satellite demo (v26.05.1) needs an upgrade to pick those up. This PR addresses the only finding that's still broken on main.

🤖 Generated with Claude Code

internal/pruning/partition_pruner.go has two TTL caches, globCache (30s) and partitionCache (60s), both used on every query. Their get() returns "expired" as a miss but does NOT evict the stale entry; neither cache has a max-size cap. The public CleanupGlobCache() and CleanupPartitionCache() methods existed since the 2024-12 and 2026-01 cache PRs but had ZERO production callers (verified: only definitions + test files + a progress doc reference them). High-cardinality workloads accumulate map entries monotonically until either InvalidateAllCaches() runs post-compaction or the process exits. This is one component of the 24h RSS climb on the satellite-data demo.

Fix: PartitionPruner.StartCleanup(ctx, interval) spawns a janitor goroutine that sweeps both caches at DefaultCleanupInterval = 30s (matches the shorter TTL, which bounds worst-case retention at ~2× TTL). Exits cleanly on ctx cancel. Idempotent via the cleanupStarted atomic: repeat calls log a warn and return without spawning a second goroutine.

QueryHandler.StartBackgroundWorkers(ctx) is the seam from main.go; cmd/arc/main.go wires it via context.WithCancel + a shutdown hook at PriorityHTTPServer (= 10, the earliest tier; the janitor has no flush state, just a ticker + an in-memory map). Matches the WAL maintenance pattern at line 730.

Identified by the 2026-05-22 memory-retention triage.
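
As an illustration of the janitor shape this commit message describes, the following is a minimal sketch rather than the merged code. `janitor`, `newJanitor`, `start`, and `demoIdempotentStart` are hypothetical names; the real method is `PartitionPruner.StartCleanup`, which also logs a warning on repeat calls. The sketch assumes Go 1.19+ for `atomic.Bool`.

```go
package main

import (
	"context"
	"sync/atomic"
	"time"
)

// defaultCleanupInterval mirrors the 30s default quoted in the PR.
const defaultCleanupInterval = 30 * time.Second

// janitor is a hypothetical stand-in for the pruner's cleanup state.
type janitor struct {
	started atomic.Bool   // set once by the first successful start
	sweeps  atomic.Int64  // number of completed sweeps, for observation
	sweep   func()        // the work to run on every tick
	done    chan struct{} // closed when the loop goroutine exits
}

func newJanitor(sweep func()) *janitor {
	return &janitor{sweep: sweep, done: make(chan struct{})}
}

// start launches the loop at most once per janitor. It returns true only
// for the call that actually spawned the goroutine; later calls are no-ops.
func (j *janitor) start(ctx context.Context, interval time.Duration) bool {
	if interval <= 0 {
		interval = defaultCleanupInterval
	}
	if !j.started.CompareAndSwap(false, true) {
		return false
	}
	go j.loop(ctx, interval)
	return true
}

// loop sweeps on every tick and exits when ctx is cancelled.
func (j *janitor) loop(ctx context.Context, interval time.Duration) {
	defer close(j.done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			j.sweep()
			j.sweeps.Add(1)
		}
	}
}

// demoIdempotentStart calls start twice, lets the loop tick for a while,
// cancels, and reports what happened.
func demoIdempotentStart() (first, second bool, sweeps int64) {
	j := newJanitor(func() {})
	ctx, cancel := context.WithCancel(context.Background())
	first = j.start(ctx, 5*time.Millisecond)
	second = j.start(ctx, 5*time.Millisecond)
	time.Sleep(60 * time.Millisecond)
	cancel()
	<-j.done
	return first, second, j.sweeps.Load()
}
```

Because the flag is never reset, this sketch has the same one-shot-per-instance contract the PR settles on: after the context is cancelled, a further `start` call on the same janitor does nothing.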

xe-nvdk · 2026-05-22T23:38:24Z

@gemini-code-assist

gemini-code-assist

Code Review

This pull request implements a background janitor for the partition pruner to periodically evict expired entries from the glob and partition caches, addressing potential memory retention issues. The implementation includes a new StartCleanup method with idempotency guards, integration with the application's shutdown coordinator, and several unit tests verifying eviction and lifecycle management. Feedback suggests running the individual cache cleanup tasks concurrently to ensure that a delay in one does not impact the other.

gemini-code-assist · 2026-05-22T23:40:27Z

+			p.CleanupGlobCache()
+			p.CleanupPartitionCache()


While the cleanup operations are expected to be fast, running them in parallel would make the janitor more robust. Since CleanupGlobCache and CleanupPartitionCache operate on different data structures with separate locks, they can be executed concurrently. This would prevent a potential slowdown in one cache's cleanup from delaying the other.

var wg sync.WaitGroup
wg.Add(2)
go func() {
	defer wg.Done()
	p.CleanupGlobCache()
}()
go func() {
	defer wg.Done()
	p.CleanupPartitionCache()
}()
wg.Wait()

gemini-code-assist · 2026-05-22T23:40:36Z

The implementation looks excellent. You have correctly identified the memory growth issue and provided a robust, idiomatic solution. The use of atomic.Bool for idempotency and the integration with the existing shutdownCoordinator pattern are spot on. The added tests are thorough and correctly verify the janitor's behavior under various conditions, including cancellation and idempotency. This change effectively addresses the identified memory retention issue without introducing unnecessary complexity.

Gemini medium finding: run glob + partition cache cleanups on independent goroutines per tick. The two mutexes are unrelated (globCache.mu vs partitionCache.mu) so a future hot-glob-cache that grows between sweeps shouldn't delay the partition sweep behind it, or vice versa. WaitGroup keeps the tick boundary sensible — both must finish before the next ticker fires. Also adds the missing release notes entry under "Bug fixes" in RELEASE_NOTES_2026.06.1.md (oversight in 3329e68).

xe-nvdk · 2026-05-22T23:42:15Z

Round-1 fix pushed in 09e788c, plus the release notes entry I forgot to include in the initial commit:

Concurrent sweeps: cleanupLoop now spawns a sub-goroutine for each cache and sync.WaitGroup.Wait()s before the next tick. The mutexes (globCache.mu vs partitionCache.mu) are unrelated so they parallelise cleanly, and the WaitGroup keeps the tick boundary sensible (next sweep starts only after both finish).

Release notes: added a "Partition pruner caches no longer grow unboundedly over process lifetime" entry under ## Bug fixes in RELEASE_NOTES_2026.06.1.md covering the root cause, fix shape, operator-visible log lines, and the LRU-cap follow-up.

All tests still green under -race.

/gemini review

gemini-code-assist

Code Review

This pull request introduces a background janitor to periodically sweep expired entries from the partition pruner's glob and partition caches, addressing a memory leak where stale entries accumulated indefinitely. The implementation includes lifecycle management via the shutdown coordinator, idempotency guards using atomic flags, and new unit tests. Review feedback suggests removing internal metadata—such as PR numbers and AI model references—from source code comments. Additionally, it was recommended to simplify the cleanup loop by executing the cache sweeps sequentially, as the overhead of spawning goroutines and managing a WaitGroup likely exceeds the cost of the lightweight map operations being performed.

gemini-code-assist · 2026-05-22T23:47:26Z

+	// goroutine per pruner. Idempotent on repeat calls — second and
+	// later invocations log a warn and return without spawning. Guards
+	// against a future hot-reload or test refactor accidentally
+	// multiplying goroutines silently. Internal-review M1 on PR #450.


Please remove internal development references such as PR numbers and specific review round identifiers from the code comments. These details are useful during the pull request phase but add unnecessary noise to the production codebase's long-term maintenance.

Suggested change

-	// multiplying goroutines silently. Internal-review M1 on PR #450.
+	// multiplying goroutines silently.

gemini-code-assist · 2026-05-22T23:47:26Z

+// The two cleanups run on independent goroutines per tick because they
+// take unrelated mutexes (globCache.mu vs partitionCache.mu). A future
+// hot-glob-cache that grows entries between sweeps shouldn't delay the
+// partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.)


Avoid including references to specific AI models or internal PR numbers in production code comments. This information is meta-data that belongs in the commit history or PR description rather than the source code itself.

Suggested change

-// partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.)
+// partition sweep behind it, or vice versa.

gemini-code-assist · 2026-05-22T23:47:26Z

+			var wg sync.WaitGroup
+			wg.Add(2)
+			go func() {
+				defer wg.Done()
+				p.CleanupGlobCache()
+			}()
+			go func() {
+				defer wg.Done()
+				p.CleanupPartitionCache()
+			}()
+			wg.Wait()
+		}


Spawning two goroutines and using a sync.WaitGroup for every cleanup tick introduces unnecessary overhead for operations that are extremely lightweight (simple map iterations with a mutex). While the intent to prevent cross-cache delays is noted, map iteration and deletion are fast enough that sequential execution would be more idiomatic and efficient here. If you prefer to keep them concurrent, consider removing the WaitGroup and just spawning the goroutines, as the ticker naturally provides the necessary periodic scheduling without needing to block the loop.

case <-ticker.C:
	p.CleanupGlobCache()
	p.CleanupPartitionCache()

Gemini round 2 reversed its round-1 suggestion: parallelising the two cache sweeps adds more overhead (goroutine spawn + WaitGroup sync ~ hundreds of ns) than it removes (the cleanup() itself is ~µs at realistic key cardinality). Round 1 was a pessimisation.

Reverting to sequential, but keeping the comment noting the trigger for revisiting: a profile showing the sweep itself blocking concurrent get() calls under a hot-cache workload. Also drops the "concurrent sweep" bullet from the 26.06.1 release notes.

Declining the round-2 metadata-cleanup suggestion (remove PR numbers and "Gemini round N" references from comments): the project convention, established across PRs #442–#449, is to keep these audit trails in code. Future reviewers benefit from seeing "this shape exists because round N flagged X."

xe-nvdk · 2026-05-22T23:49:18Z

Round-2 fixes pushed in 780a1a9:

Sequential sweep (reversal of round-1 suggestion) — you're right, the goroutine-spawn + WaitGroup overhead exceeds the actual cleanup cost at realistic key cardinality. Reverted to the sequential shape. The comment now explicitly notes the trigger for revisiting (a profile showing sweep blocking concurrent get() under hot cache). Also dropped the now-stale 'concurrent sweep' bullet from the release notes.

Declining the metadata-cleanup suggestion — keeping the PR #450 and Gemini round N references in source comments is the established project convention (every recent security-fix PR — #442 through #449 — uses the same shape). These are useful audit trails for future reviewers: 'this shape exists because round N flagged X' has saved real review time on later PRs. Happy to revisit if the project ever changes that policy globally.

All tests still green under -race.

/gemini review

gemini-code-assist

Code Review

This pull request introduces a background janitor goroutine to periodically sweep expired entries from the globCache and partitionCache within the PartitionPruner, addressing a memory leak where stale entries accumulated indefinitely. The janitor is integrated into the main application lifecycle via the shutdown coordinator and includes unit tests for eviction, cancellation, and idempotency. Review feedback suggests enforcing a minimum cleanup interval to prevent CPU-intensive spin loops and resetting the cleanupStarted flag upon goroutine termination to support component restarts or hot-reloads.

gemini-code-assist · 2026-05-22T23:51:10Z

+	if interval <= 0 {
+		interval = DefaultCleanupInterval
+	}


While the code correctly handles non-positive intervals by falling back to the default, it does not prevent extremely small positive intervals (e.g., 1ns). A very small interval could lead to a CPU-intensive spin loop in the background goroutine. It is recommended to enforce a minimum sensible interval (e.g., 100ms) to prevent accidental performance degradation.

Suggested change

-	if interval <= 0 {
-		interval = DefaultCleanupInterval
-	}
+	if interval <= 0 {
+		interval = DefaultCleanupInterval
+	} else if interval < 100*time.Millisecond {
+		interval = 100 * time.Millisecond
+	}

References

Enforce defensive programming by ensuring that invalid inputs (like an extremely small ticker interval) are safely handled to prevent performance issues.

gemini-code-assist · 2026-05-22T23:51:10Z

+func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
+	ticker := time.NewTicker(interval)


The cleanupStarted flag is currently set to true but never reset to false. If the background goroutine exits (e.g., because the context was cancelled), subsequent calls to StartCleanup will be ignored even if a new, valid context is provided. To support scenarios like hot-reloads or component restarts (as mentioned in the comments), the flag should be reset when the goroutine terminates.

Suggested change

-func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
-	ticker := time.NewTicker(interval)
+func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
+	defer p.cleanupStarted.Store(false)
+	ticker := time.NewTicker(interval)

Gemini round-3 finding (1 of 2): a caller passing sub-millisecond interval would pin a CPU in a tight ticker loop. Adds a 1ms floor via minCleanupInterval; sub-floor requests log a Warn naming both the requested and clamped values. 1ms is several orders of magnitude below any plausible production interval AND below every fast-iteration test in this package (which use 5-10ms).

Declining round-3 finding (2 of 2), "reset cleanupStarted on goroutine termination to support hot-reload": YAGNI. There is no hot-reload path today and adding the reset expands the API surface (state-machine invariants, more atomics, tests for the overlap window between cancel and second-start). Will revisit when a concrete restart use case lands.

xe-nvdk · 2026-05-22T23:53:45Z

Round-3 fixes pushed in 6723ed1:

Accepted (1/2): minimum cleanup interval clamp. Added minCleanupInterval = 1ms floor; any request below that logs a Warn and clamps. 1ms is well below any production interval and below every fast-iteration test in this package (5-10ms). New test TestStartCleanup_ClampsTinyInterval pins it.

Declining (2/2): reset cleanupStarted on goroutine termination. YAGNI today — there is no hot-reload path in Arc. Adding the reset expands the API surface (state-machine invariants for a brief overlap window between cancel and second-start) for a hypothetical use case. Happy to revisit when a concrete restart path lands; the test TestStartCleanup_StopsOnContextCancel already pins the current contract (one-shot per pruner).

5 tests green under -race.
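
The accepted clamp from this round can be sketched as a small pure function. This is an assumption-laden illustration, not the merged code: `normalizeInterval` is a hypothetical name, the real logic lives inside `StartCleanup`, and the real code logs a Warn with both the requested and clamped values where this sketch only returns a flag.

```go
package main

import "time"

const (
	// DefaultCleanupInterval mirrors the 30s default quoted in the PR.
	DefaultCleanupInterval = 30 * time.Second
	// minCleanupInterval mirrors the 1ms floor added in round 3.
	minCleanupInterval = time.Millisecond
)

// normalizeInterval returns the interval the janitor should actually use
// and whether the request had to be clamped up to the floor.
//   - zero or negative: fall back to the default (not counted as a clamp)
//   - positive but below the floor: raise to the floor
//   - anything else: use as requested
func normalizeInterval(requested time.Duration) (time.Duration, bool) {
	switch {
	case requested <= 0:
		return DefaultCleanupInterval, false
	case requested < minCleanupInterval:
		return minCleanupInterval, true
	default:
		return requested, false
	}
}
```

Keeping the fallback and the clamp as separate cases preserves the existing meaning of `interval=0` (use the default) while still rejecting values such as 1ns that would turn the ticker loop into a busy loop.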

Side note: I want to flag something interesting that I think is worth knowing for both of us. Round 1 told us to parallelise the sweeps. Round 2 reversed that and told us to serialise. Both presented with the same confidence; neither pass measured. This is a pattern with LLM perf reviews — plausible-sounding architectural priors that can land on either side depending on context. The right move for the next perf-shaped suggestion is to bench first — same posture we already use for declining non-perf findings, just symmetric on the accept side.
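
One possible way to "bench first" is sketched below. It is not something this PR ran and it reports no numbers: `lockedMap`, `sweepSequential`, `sweepConcurrent`, and `compareSweeps` are hypothetical names, the maps hold only live entries so each sweep measures a full scan with no deletions, and results will vary by machine and key cardinality. `testing.Benchmark` is the standard-library entry point for running a benchmark outside `go test`.

```go
package main

import (
	"sync"
	"testing"
)

// lockedMap is a hypothetical stand-in for one pruner cache.
// A value of true marks the entry as expired.
type lockedMap struct {
	mu sync.Mutex
	m  map[int]bool
}

func newLockedMap(n int, expired bool) *lockedMap {
	l := &lockedMap{m: make(map[int]bool, n)}
	for i := 0; i < n; i++ {
		l.m[i] = expired
	}
	return l
}

// sweep scans the whole map under the lock and deletes expired entries.
func (l *lockedMap) sweep() {
	l.mu.Lock()
	defer l.mu.Unlock()
	for k, expired := range l.m {
		if expired {
			delete(l.m, k)
		}
	}
}

func (l *lockedMap) size() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.m)
}

// sweepSequential is the shape the PR ended up with.
func sweepSequential(a, b *lockedMap) {
	a.sweep()
	b.sweep()
}

// sweepConcurrent is the round-1 shape: one goroutine per cache,
// joined with a WaitGroup before returning.
func sweepConcurrent(a, b *lockedMap) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		a.sweep()
	}()
	go func() {
		defer wg.Done()
		b.sweep()
	}()
	wg.Wait()
}

// compareSweeps benchmarks both shapes at the given cardinality and
// returns nanoseconds per sweep pair for each.
func compareSweeps(entries int) (seqNs, concNs int64) {
	a, b := newLockedMap(entries, false), newLockedMap(entries, false)
	seq := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sweepSequential(a, b)
		}
	})
	conc := testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sweepConcurrent(a, b)
		}
	})
	return seq.NsPerOp(), conc.NsPerOp()
}
```

Calling `compareSweeps` at a few cardinalities (for example 100, 10,000, and 1,000,000 entries) would show where, if anywhere, the two shapes cross over, which is the measurement neither review round had.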

/gemini review

gemini-code-assist

Code Review

This pull request introduces a background janitor to the partition pruner to prevent unbounded cache growth by periodically sweeping expired entries. The changes include wiring the janitor into the application lifecycle with graceful shutdown, implementing the cleanup logic with idempotency and safety clamping, and adding comprehensive tests. Review feedback focuses on improving code maintainability by removing ephemeral meta-comments, internal review references, and external links from the source code, as well as standardizing spelling.

gemini-code-assist · 2026-05-22T23:56:39Z

+	// goroutine per pruner. Idempotent on repeat calls — second and
+	// later invocations log a warn and return without spawning. Guards
+	// against a future hot-reload or test refactor accidentally
+	// multiplying goroutines silently. Internal-review M1 on PR #450.


Remove internal review references from code comments to maintain a clean codebase. These details are better suited for the PR description or commit messages.

Suggested change

-	// multiplying goroutines silently. Internal-review M1 on PR #450.
+	// multiplying goroutines silently.

gemini-code-assist · 2026-05-22T23:56:39Z

+// minCleanupInterval is the smallest interval StartCleanup will honor.
+// Anything below this is clamped up to prevent a misconfigured caller
+// (or a fuzz input that passes a sub-millisecond duration) from
+// pinning a CPU core in a tight ticker loop. 1ms is a generous floor
+// — orders of magnitude below any plausible production interval and
+// any fast-iteration test (tests in this package use 5–10ms).
+// (Gemini round 3 / PR #450.)


Remove meta-comments referencing review rounds. This information is ephemeral and does not add value to the long-term maintenance of the code.

Suggested change

-// minCleanupInterval is the smallest interval StartCleanup will honor.
-// Anything below this is clamped up to prevent a misconfigured caller
-// (or a fuzz input that passes a sub-millisecond duration) from
-// pinning a CPU core in a tight ticker loop. 1ms is a generous floor
-// — orders of magnitude below any plausible production interval and
-// any fast-iteration test (tests in this package use 5–10ms).
-// (Gemini round 3 / PR #450.)
+// minCleanupInterval is the smallest interval StartCleanup will honor.
+// Anything below this is clamped up to prevent a misconfigured caller
+// (or a fuzz input that passes a sub-millisecond duration) from
+// pinning a CPU core in a tight ticker loop. 1ms is a generous floor
+// — orders of magnitude below any plausible production interval and
+// any fast-iteration test (tests in this package use 5–10ms).

gemini-code-assist · 2026-05-22T23:56:39Z

+// fills that gap. See https://github.com/Basekick-Labs/arc/pull/450
+// (Q2 finding from the 2026-05-22 memory-retention triage).


Remove external PR links and internal triage references from code comments. The technical context provided in the comment is sufficient for future maintainers.

Suggested change

-// fills that gap. See https://github.com/Basekick-Labs/arc/pull/450
-// (Q2 finding from the 2026-05-22 memory-retention triage).
+// fills that gap.

gemini-code-assist · 2026-05-22T23:56:39Z

+// goroutine-spawn + WaitGroup-sync overhead that parallelising the
+// two sweeps would add. Run sequentially; if a future profile ever
+// shows the sweep itself blocking concurrent get() calls under a
+// hot-cache workload, that's the trigger to revisit. (Gemini round 2
+// / PR #450 reversed a round-1 suggestion to parallelise after
+// pointing out the overhead inversion.)


Correct the spelling of 'parallelizing' to match the American English standard used throughout the project (e.g., 'optimized', 'initialization'). Also, remove the meta-comment referencing review history.

Suggested change

-// goroutine-spawn + WaitGroup-sync overhead that parallelising the
-// two sweeps would add. Run sequentially; if a future profile ever
-// shows the sweep itself blocking concurrent get() calls under a
-// hot-cache workload, that's the trigger to revisit. (Gemini round 2
-// / PR #450 reversed a round-1 suggestion to parallelise after
-// pointing out the overhead inversion.)
+// goroutine-spawn + WaitGroup-sync overhead that parallelizing the
+// two sweeps would add. Run sequentially; if a future profile ever
+// shows the sweep itself blocking concurrent get() calls under a
+// hot-cache workload, that's the trigger to revisit.

…und 4) Drops the "Gemini round N / PR #450" stamps and the 2026-05-22-triage reference from comments. For a 4-round PR on a small fix, the audit trail belongs in the PR description and commit log, not in the source. (The X1 PR kept references in-source because 12 review rounds genuinely needed the trail; calibrate per PR size.) Also "parallelising" → "parallelizing" to match the project's predominantly US spelling. No behavior change.

xe-nvdk · 2026-05-23T00:00:26Z

Round-4 fixes pushed in 80264bb — dropped the Gemini round N / PR #450 stamps and the triage-date reference from inline comments, and changed "parallelising" to "parallelizing" to match the project's predominantly US spelling. You're right that for a 4-round PR on a small fix, the audit trail belongs in the PR description + commit log, not in the source. (Different calibration than the X1 PR, which kept refs in-source because 12 review rounds genuinely needed the trail.)

No behavior change; 5 tests still green under -race.

PR is ready for merge from my side.

gemini-code-assist Bot reviewed May 22, 2026

xe-nvdk merged commit 0a02877 into main May 23, 2026
6 checks passed

xe-nvdk deleted the fix/pruner-cache-janitor branch May 23, 2026 00:03
