fix(pruning): wire periodic janitor for partition pruner caches #450

Merged: xe-nvdk merged 5 commits into main from fix/pruner-cache-janitor on May 23, 2026

Conversation

xe-nvdk (Member) commented May 22, 2026

Summary

  • Wires a background janitor goroutine to call pruner.CleanupGlobCache() + pruner.CleanupPartitionCache() every 30s. Pre-PR both methods existed and were ready to use but had zero production callers — both maps grew monotonically over the process lifetime.
  • One component of the 24h RSS climb on the satellite-data demo container (identified by the 2026-05-22 memory-retention triage).
  • Goroutine is owned by *PartitionPruner; started via QueryHandler.StartBackgroundWorkers(ctx) from cmd/arc/main.go with a shutdown hook at PriorityHTTPServer. Idempotent via cleanupStarted atomic — repeat calls are no-ops.

Why this is a real bug, not a perf knob

$ grep -rn "CleanupGlobCache\|CleanupPartitionCache" \
    --include='*.go' --exclude='*_test.go' \
    --exclude='*/partition_pruner.go'
# zero matches

get() on both caches returns "expired" as a cache miss but does not remove the stale entry (verified at partition_pruner.go:45-65 and :188-206). Neither cache has a max-size cap. Under satellite-shape workloads with high-cardinality glob patterns or distinct (path, sql) keys, both maps accumulate forever — capped only by post-compaction InvalidateAllCaches() (which itself is rare and not a designed eviction strategy). The 2026-01-25 progress doc explicitly says "prevents unbounded cache growth" about CleanupPartitionCache() and then nobody wired the cleanup.
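
A minimal sketch of the failure mode described above, assuming a simplified cache shape. The names `ttlCache`, `entry`, and `runDemo` are hypothetical stand-ins and not the types in partition_pruner.go. It shows a `get()` that reports an expired entry as a miss without deleting it, so the map only shrinks when a separate sweep runs:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry and ttlCache are hypothetical stand-ins for the pruner's caches.
type entry struct {
	value     string
	expiresAt time.Time
}

type ttlCache struct {
	mu      sync.RWMutex
	entries map[string]entry
}

func newTTLCache() *ttlCache {
	return &ttlCache{entries: make(map[string]entry)}
}

func (c *ttlCache) set(key, value string, ttl time.Duration, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = entry{value: value, expiresAt: now.Add(ttl)}
}

// get reports an expired entry as a miss but leaves it in the map.
func (c *ttlCache) get(key string, now time.Time) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[key]
	if !ok || now.After(e.expiresAt) {
		return "", false
	}
	return e.value, true
}

// cleanup deletes every expired entry and returns how many it removed.
func (c *ttlCache) cleanup(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, e := range c.entries {
		if now.After(e.expiresAt) {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

func (c *ttlCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.entries)
}

// runDemo seeds 1000 distinct keys, lets them expire, and then compares
// the map size before and after an explicit sweep.
func runDemo() (hit bool, sizeBefore, removed, sizeAfter int) {
	c := newTTLCache()
	start := time.Now()
	for i := 0; i < 1000; i++ {
		c.set(fmt.Sprintf("glob-%d", i), "paths", 30*time.Second, start)
	}
	later := start.Add(time.Minute) // simulated clock, past the TTL
	_, hit = c.get("glob-0", later)
	sizeBefore = c.size()
	removed = c.cleanup(later)
	sizeAfter = c.size()
	return
}

func main() {
	hit, before, removed, after := runDemo()
	fmt.Println("hit after expiry:", hit)
	fmt.Println("size before sweep:", before, "removed:", removed, "size after:", after)
}
```

Passing the clock in as a parameter is a choice made for this sketch so that expiry can be simulated without sleeping; the real caches may read the time internally.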

Test plan

  • go test -race ./internal/pruning/... ./internal/api/... — all green, including 4 new tests
  • go vet ./internal/pruning/... ./internal/api/... ./cmd/arc/... clean
  • gofmt -l clean on all 4 touched files
  • Binary smoke (CLAUDE.md mandates this when touching cmd/arc/main.go):
    • Partition pruner cache janitor started interval=30000 at boot
    • Partition pruner cache janitor stopped on SIGINT
    • Process exits cleanly without SIGKILL
  • Internal review (configuration matrix + deep reviewer): 0 blockers / 0 high / 2 medium — both addressed (M1 idempotent guard via cleanupStarted.CompareAndSwap, M2 doc accuracy on StartBackgroundWorkers)
  • Gemini Code Assist review (requested below)

Tests added

  • TestStartCleanup_RemovesExpiredEntries — seeds both caches, starts janitor with sub-TTL interval, asserts entries gone within 500ms.
  • TestStartCleanup_StopsOnContextCancel — verifies the goroutine exits on ctx cancel by seeding a post-cancel entry and confirming it does NOT get swept.
  • TestStartCleanup_DefaultInterval: interval=0 falls back to DefaultCleanupInterval, no panic.
  • TestStartCleanup_Idempotent — repeat StartCleanup calls don't spawn additional goroutines; the first janitor remains alive and sweeping.

Out of scope

🤖 Generated with Claude Code

internal/pruning/partition_pruner.go has two TTL caches — globCache
(30s) and partitionCache (60s) — both used on every query. Their
get() returns "expired" as a miss but does NOT evict the stale
entry; neither cache has a max-size cap. The public
CleanupGlobCache() and CleanupPartitionCache() methods existed
since the 2024-12 and 2026-01 cache PRs but had ZERO production
callers (verified: only definitions + test files + a progress doc
reference them). High-cardinality workloads accumulate map entries
monotonically until either InvalidateAllCaches() runs post-
compaction or the process exits — one component of the 24h RSS
climb on the satellite-data demo.

Fix: PartitionPruner.StartCleanup(ctx, interval) spawns a janitor
goroutine that sweeps both caches at DefaultCleanupInterval = 30s
(matches the shorter TTL → bounds worst-case retention at ~2× TTL).
Exits cleanly on ctx cancel. Idempotent via cleanupStarted atomic
— repeat calls log a warn and return without spawning a second
goroutine.
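
The shape of the fix described in this paragraph can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `janitor`, `startCleanup`, `loop`, and `runDemo` are hypothetical names, the sweep is replaced by a counter, and `atomic.Bool` requires Go 1.19 or later.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

const defaultCleanupInterval = 30 * time.Second

// janitor is a hypothetical stand-in for the pruner's cleanup state.
type janitor struct {
	started atomic.Bool   // set once by the first startCleanup call
	sweeps  atomic.Int64  // counts ticks; stands in for the two cache sweeps
	done    chan struct{} // closed when the loop goroutine exits
}

func newJanitor() *janitor {
	return &janitor{done: make(chan struct{})}
}

// startCleanup spawns the loop on the first call and reports true.
// Repeat calls lose the CompareAndSwap and return false without spawning.
func (j *janitor) startCleanup(ctx context.Context, interval time.Duration) bool {
	if interval <= 0 {
		interval = defaultCleanupInterval
	}
	if !j.started.CompareAndSwap(false, true) {
		return false
	}
	go j.loop(ctx, interval)
	return true
}

func (j *janitor) loop(ctx context.Context, interval time.Duration) {
	defer close(j.done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			j.sweeps.Add(1) // the real loop would sweep both caches here
		}
	}
}

// runDemo starts the janitor twice, waits for at least one tick, then
// cancels the context and waits for the goroutine to exit.
func runDemo() (first, second, swept, stopped bool) {
	ctx, cancel := context.WithCancel(context.Background())
	j := newJanitor()
	first = j.startCleanup(ctx, 5*time.Millisecond)
	second = j.startCleanup(ctx, 5*time.Millisecond)

	deadline := time.Now().Add(2 * time.Second)
	for j.sweeps.Load() == 0 && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	swept = j.sweeps.Load() > 0

	cancel()
	select {
	case <-j.done:
		stopped = true
	case <-time.After(2 * time.Second):
	}
	return
}

func main() {
	first, second, swept, stopped := runDemo()
	fmt.Println("first start spawned:", first, "second start spawned:", second)
	fmt.Println("swept at least once:", swept, "stopped on cancel:", stopped)
}
```

The `done` channel exists only so the demo can observe the exit; nothing in the PR text says the real pruner exposes one.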

QueryHandler.StartBackgroundWorkers(ctx) is the seam from main.go;
cmd/arc/main.go wires it via context.WithCancel + a shutdown hook
at PriorityHTTPServer (= 10, the earliest tier — the janitor has
no flush state, just a ticker + an in-memory map). Matches the WAL
maintenance pattern at line 730.
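
As a rough sketch of the wiring described here, assuming a coordinator that runs hooks in ascending priority order: `shutdownCoordinator`, `register`, `shutdown`, `wire`, `priorityStorage`, and the hook names are all invented for illustration, and only the value `PriorityHTTPServer = 10` comes from the text above.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// Priority tiers; lower values run first. Only PriorityHTTPServer = 10
// is taken from the PR text. priorityStorage is invented for the demo.
const (
	PriorityHTTPServer = 10
	priorityStorage    = 50
)

type hook struct {
	name     string
	priority int
	fn       func()
}

// shutdownCoordinator is a hypothetical stand-in for Arc's coordinator.
type shutdownCoordinator struct {
	hooks []hook
}

func (s *shutdownCoordinator) register(name string, priority int, fn func()) {
	s.hooks = append(s.hooks, hook{name: name, priority: priority, fn: fn})
}

// shutdown runs hooks in ascending priority order and returns their names
// in the order they ran.
func (s *shutdownCoordinator) shutdown() []string {
	sort.SliceStable(s.hooks, func(i, j int) bool {
		return s.hooks[i].priority < s.hooks[j].priority
	})
	order := make([]string, 0, len(s.hooks))
	for _, h := range s.hooks {
		h.fn()
		order = append(order, h.name)
	}
	return order
}

// wire mimics the main.go seam: derive a cancellable context for the
// background workers and register its cancel func as a shutdown hook.
func wire(parent context.Context, s *shutdownCoordinator) context.Context {
	workerCtx, cancel := context.WithCancel(parent)
	s.register("pruner-janitor", PriorityHTTPServer, cancel)
	return workerCtx
}

// runDemo registers a later-tier hook first, wires the janitor context,
// and then triggers shutdown.
func runDemo() (order []string, cancelled bool) {
	s := &shutdownCoordinator{}
	s.register("storage-flush", priorityStorage, func() {})
	workerCtx := wire(context.Background(), s)
	order = s.shutdown()
	return order, workerCtx.Err() != nil
}

func main() {
	order, cancelled := runDemo()
	fmt.Println("hook order:", order)
	fmt.Println("worker context cancelled:", cancelled)
}
```

Because the janitor holds no flush state, cancelling its context in the earliest tier is enough; the worker context would be the one handed to `StartBackgroundWorkers(ctx)`.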

Identified by the 2026-05-22 memory-retention triage.

xe-nvdk (Member, Author) commented May 22, 2026

@gemini-code-assist

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request implements a background janitor for the partition pruner to periodically evict expired entries from the glob and partition caches, addressing potential memory retention issues. The implementation includes a new StartCleanup method with idempotency guards, integration with the application's shutdown coordinator, and several unit tests verifying eviction and lifecycle management. Feedback suggests running the individual cache cleanup tasks concurrently to ensure that a delay in one does not impact the other.

Comment on lines +1000 to +1001
p.CleanupGlobCache()
p.CleanupPartitionCache()

medium

While the cleanup operations are expected to be fast, running them in parallel would make the janitor more robust. Since CleanupGlobCache and CleanupPartitionCache operate on different data structures with separate locks, they can be executed concurrently. This would prevent a potential slowdown in one cache's cleanup from delaying the other.

var wg sync.WaitGroup
wg.Add(2)
go func() {
	defer wg.Done()
	p.CleanupGlobCache()
}()
go func() {
	defer wg.Done()
	p.CleanupPartitionCache()
}()
wg.Wait()

@gemini-code-assist

The implementation looks excellent. You have correctly identified the memory growth issue and provided a robust, idiomatic solution. The use of atomic.Bool for idempotency and the integration with the existing shutdownCoordinator pattern are spot on. The added tests are thorough and correctly verify the janitor's behavior under various conditions, including cancellation and idempotency. This change effectively addresses the identified memory retention issue without introducing unnecessary complexity.

Gemini medium finding: run glob + partition cache cleanups on
independent goroutines per tick. The two mutexes are unrelated
(globCache.mu vs partitionCache.mu) so a future hot-glob-cache
that grows between sweeps shouldn't delay the partition sweep
behind it, or vice versa. WaitGroup keeps the tick boundary
sensible — both must finish before the next ticker fires.

Also adds the missing release notes entry under "Bug fixes" in
RELEASE_NOTES_2026.06.1.md (oversight in 3329e68).

xe-nvdk (Member, Author) commented May 22, 2026

Round-1 fix pushed in 09e788c, plus the release notes entry I forgot to include in the initial commit:

Concurrent sweeps: cleanupLoop now spawns a sub-goroutine for each cache and sync.WaitGroup.Wait()s before the next tick. The mutexes (globCache.mu vs partitionCache.mu) are unrelated so they parallelise cleanly, and the WaitGroup keeps the tick boundary sensible (next sweep starts only after both finish).

Release notes: added a "Partition pruner caches no longer grow unboundedly over process lifetime" entry under ## Bug fixes in RELEASE_NOTES_2026.06.1.md covering the root cause, fix shape, operator-visible log lines, and the LRU-cap follow-up.

All tests still green under -race.

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a background janitor to periodically sweep expired entries from the partition pruner's glob and partition caches, addressing a memory leak where stale entries accumulated indefinitely. The implementation includes lifecycle management via the shutdown coordinator, idempotency guards using atomic flags, and new unit tests. Review feedback suggests removing internal metadata—such as PR numbers and AI model references—from source code comments. Additionally, it was recommended to simplify the cleanup loop by executing the cache sweeps sequentially, as the overhead of spawning goroutines and managing a WaitGroup likely exceeds the cost of the lightweight map operations being performed.

Comment thread internal/pruning/partition_pruner.go Outdated
// goroutine per pruner. Idempotent on repeat calls — second and
// later invocations log a warn and return without spawning. Guards
// against a future hot-reload or test refactor accidentally
// multiplying goroutines silently. Internal-review M1 on PR #450.

medium

Please remove internal development references such as PR numbers and specific review round identifiers from the code comments. These details are useful during the pull request phase but add unnecessary noise to the production codebase's long-term maintenance.

Suggested change
- // multiplying goroutines silently. Internal-review M1 on PR #450.
+ // multiplying goroutines silently.

Comment thread internal/pruning/partition_pruner.go Outdated
// The two cleanups run on independent goroutines per tick because they
// take unrelated mutexes (globCache.mu vs partitionCache.mu). A future
// hot-glob-cache that grows entries between sweeps shouldn't delay the
// partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.)

medium

Avoid including references to specific AI models or internal PR numbers in production code comments. This information is meta-data that belongs in the commit history or PR description rather than the source code itself.

Suggested change
- // partition sweep behind it, or vice versa. (Gemini round 1 / PR #450.)
+ // partition sweep behind it, or vice versa.

Comment thread internal/pruning/partition_pruner.go Outdated
Comment on lines +1004 to +1015
var wg sync.WaitGroup
wg.Add(2)
go func() {
	defer wg.Done()
	p.CleanupGlobCache()
}()
go func() {
	defer wg.Done()
	p.CleanupPartitionCache()
}()
wg.Wait()
}

medium

Spawning two goroutines and using a sync.WaitGroup for every cleanup tick introduces unnecessary overhead for operations that are extremely lightweight (simple map iterations with a mutex). While the intent to prevent cross-cache delays is noted, map iteration and deletion are fast enough that sequential execution would be more idiomatic and efficient here. If you prefer to keep them concurrent, consider removing the WaitGroup and just spawning the goroutines, as the ticker naturally provides the necessary periodic scheduling without needing to block the loop.

		case <-ticker.C:
			p.CleanupGlobCache()
			p.CleanupPartitionCache()

Gemini round 2 reversed its round-1 suggestion: parallelising the
two cache sweeps adds more overhead (goroutine spawn + WaitGroup
sync ~ hundreds of ns) than it removes (the cleanup() itself is
~µs at realistic key cardinality). Round 1 was a pessimisation.

Reverting to sequential, but keeping the comment noting the
trigger for revisiting: a profile showing the sweep itself
blocking concurrent get() calls under a hot-cache workload.

Also drops the "concurrent sweep" bullet from the 26.06.1
release notes.

Declining the round-2 metadata-cleanup suggestion (remove PR
numbers and "Gemini round N" references from comments) — the
project convention, established across PRs #442 through #449, is to keep
these audit trails in code. Future reviewers benefit from seeing
"this shape exists because round N flagged X."

xe-nvdk (Member, Author) commented May 22, 2026

Round-2 fixes pushed in 780a1a9:

Sequential sweep (reversal of round-1 suggestion) — you're right, the goroutine-spawn + WaitGroup overhead exceeds the actual cleanup cost at realistic key cardinality. Reverted to the sequential shape. The comment now explicitly notes the trigger for revisiting (a profile showing sweep blocking concurrent get() under hot cache). Also dropped the now-stale 'concurrent sweep' bullet from the release notes.

Declining the metadata-cleanup suggestion — keeping the PR #450 and Gemini round N references in source comments is the established project convention (every recent security-fix PR — #442 through #449 — uses the same shape). These are useful audit trails for future reviewers: 'this shape exists because round N flagged X' has saved real review time on later PRs. Happy to revisit if the project ever changes that policy globally.

All tests still green under -race.

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a background janitor goroutine to periodically sweep expired entries from the globCache and partitionCache within the PartitionPruner, addressing a memory leak where stale entries accumulated indefinitely. The janitor is integrated into the main application lifecycle via the shutdown coordinator and includes unit tests for eviction, cancellation, and idempotency. Review feedback suggests enforcing a minimum cleanup interval to prevent CPU-intensive spin loops and resetting the cleanupStarted flag upon goroutine termination to support component restarts or hot-reloads.

Comment on lines +978 to +980
if interval <= 0 {
	interval = DefaultCleanupInterval
}

medium

While the code correctly handles non-positive intervals by falling back to the default, it does not prevent extremely small positive intervals (e.g., 1ns). A very small interval could lead to a CPU-intensive spin loop in the background goroutine. It is recommended to enforce a minimum sensible interval (e.g., 100ms) to prevent accidental performance degradation.

Suggested change
- if interval <= 0 {
- 	interval = DefaultCleanupInterval
- }
+ if interval <= 0 {
+ 	interval = DefaultCleanupInterval
+ } else if interval < 100*time.Millisecond {
+ 	interval = 100 * time.Millisecond
+ }
References
  1. Enforce defensive programming by ensuring that invalid inputs (like an extremely small ticker interval) are safely handled to prevent performance issues.

Comment on lines +997 to +998
func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)

medium

The cleanupStarted flag is currently set to true but never reset to false. If the background goroutine exits (e.g., because the context was cancelled), subsequent calls to StartCleanup will be ignored even if a new, valid context is provided. To support scenarios like hot-reloads or component restarts (as mentioned in the comments), the flag should be reset when the goroutine terminates.

Suggested change
- func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
- 	ticker := time.NewTicker(interval)
+ func (p *PartitionPruner) cleanupLoop(ctx context.Context, interval time.Duration) {
+ 	defer p.cleanupStarted.Store(false)
+ 	ticker := time.NewTicker(interval)

Gemini round-3 finding (1 of 2): a caller passing sub-millisecond
interval would pin a CPU in a tight ticker loop. Adds a 1ms floor
via minCleanupInterval; sub-floor requests log a Warn naming both
the requested and clamped values. 1ms is several orders of
magnitude below any plausible production interval AND below every
fast-iteration test in this package (which use 5-10ms).

Declining round-3 finding (2 of 2) — "reset cleanupStarted on
goroutine termination to support hot-reload": YAGNI. There is no
hot-reload path today and adding the reset expands the API
surface (state-machine invariants, more atomics, tests for the
overlap window between cancel and second-start). Will revisit
when a concrete restart use case lands.

xe-nvdk (Member, Author) commented May 22, 2026

Round-3 fixes pushed in 6723ed1:

Accepted (1/2): minimum cleanup interval clamp. Added minCleanupInterval = 1ms floor; any request below that logs a Warn and clamps. 1ms is well below any production interval and below every fast-iteration test in this package (5-10ms). New test TestStartCleanup_ClampsTinyInterval pins it.
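
A small sketch of the clamp as described in this comment, assuming it lives in a helper. `effectiveInterval` is a hypothetical name and the constants are lower-cased stand-ins; the merged code may inline the check in StartCleanup and log the Warn there.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultCleanupInterval = 30 * time.Second
	minCleanupInterval     = time.Millisecond
)

// effectiveInterval returns the interval the janitor would actually use
// and whether the request was clamped up to the floor. Non-positive
// requests fall back to the default and are not counted as clamped.
func effectiveInterval(requested time.Duration) (time.Duration, bool) {
	switch {
	case requested <= 0:
		return defaultCleanupInterval, false
	case requested < minCleanupInterval:
		return minCleanupInterval, true
	default:
		return requested, false
	}
}

func main() {
	for _, req := range []time.Duration{0, time.Nanosecond, 5 * time.Millisecond} {
		got, clamped := effectiveInterval(req)
		fmt.Printf("requested=%v effective=%v clamped=%v\n", req, got, clamped)
	}
}
```

Returning the clamped flag keeps the logging decision with the caller, which is one way to name both the requested and the clamped value in a single Warn line.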

Declining (2/2): reset cleanupStarted on goroutine termination. YAGNI today — there is no hot-reload path in Arc. Adding the reset expands the API surface (state-machine invariants for a brief overlap window between cancel and second-start) for a hypothetical use case. Happy to revisit when a concrete restart path lands; the test TestStartCleanup_StopsOnContextCancel already pins the current contract (one-shot per pruner).

5 tests green under -race.

Side note: I want to flag something interesting that I think is worth knowing for both of us. Round 1 told us to parallelise the sweeps. Round 2 reversed that and told us to serialise. Both presented with the same confidence; neither pass measured. This is a pattern with LLM perf reviews — plausible-sounding architectural priors that can land on either side depending on context. The right move for the next perf-shaped suggestion is to bench first — same posture we already use for declining non-perf findings, just symmetric on the accept side.
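
One way to act on "bench first" is a small standalone harness like the sketch below. It is not from the PR: `cache`, `sweepSequential`, `sweepParallel`, `bench`, and `removedCounts` are hypothetical, the cache is reduced to a map of expiry times, and the numbers it prints depend entirely on the machine and key count, so no results are claimed here.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
	"time"
)

// cache is a hypothetical stand-in: a mutex-guarded map of expiry times.
type cache struct {
	mu      sync.Mutex
	expires map[int]time.Time
}

func newCache(n int, expiry time.Time) *cache {
	c := &cache{expires: make(map[int]time.Time, n)}
	for i := 0; i < n; i++ {
		c.expires[i] = expiry
	}
	return c
}

// cleanup deletes expired entries and returns how many it removed.
func (c *cache) cleanup(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, exp := range c.expires {
		if now.After(exp) {
			delete(c.expires, k)
			removed++
		}
	}
	return removed
}

func sweepSequential(a, b *cache, now time.Time) int {
	return a.cleanup(now) + b.cleanup(now)
}

func sweepParallel(a, b *cache, now time.Time) int {
	var wg sync.WaitGroup
	var ra, rb int
	wg.Add(2)
	go func() { defer wg.Done(); ra = a.cleanup(now) }()
	go func() { defer wg.Done(); rb = b.cleanup(now) }()
	wg.Wait()
	return ra + rb
}

// bench measures one sweep shape over two caches of n live entries each,
// so every iteration walks the maps without deleting anything.
func bench(sweep func(a, b *cache, now time.Time) int, n int) testing.BenchmarkResult {
	now := time.Now()
	a, b := newCache(n, now.Add(time.Hour)), newCache(n, now.Add(time.Hour))
	return testing.Benchmark(func(tb *testing.B) {
		for i := 0; i < tb.N; i++ {
			sweep(a, b, now)
		}
	})
}

// removedCounts checks that both shapes evict the same entries.
func removedCounts(n int) (seq, par int) {
	now := time.Now()
	past := now.Add(-time.Second)
	seq = sweepSequential(newCache(n, past), newCache(n, past), now)
	par = sweepParallel(newCache(n, past), newCache(n, past), now)
	return seq, par
}

func main() {
	for _, n := range []int{100, 100000} {
		fmt.Printf("n=%d sequential=%dns/op parallel=%dns/op\n",
			n, bench(sweepSequential, n).NsPerOp(), bench(sweepParallel, n).NsPerOp())
	}
}
```

Running it at a few key counts close to the real cache sizes would show where, if anywhere, the crossover between the two shapes sits.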

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a background janitor to the partition pruner to prevent unbounded cache growth by periodically sweeping expired entries. The changes include wiring the janitor into the application lifecycle with graceful shutdown, implementing the cleanup logic with idempotency and safety clamping, and adding comprehensive tests. Review feedback focuses on improving code maintainability by removing ephemeral meta-comments, internal review references, and external links from the source code, as well as standardizing spelling.

Comment thread internal/pruning/partition_pruner.go Outdated
// goroutine per pruner. Idempotent on repeat calls — second and
// later invocations log a warn and return without spawning. Guards
// against a future hot-reload or test refactor accidentally
// multiplying goroutines silently. Internal-review M1 on PR #450.

medium

Remove internal review references from code comments to maintain a clean codebase. These details are better suited for the PR description or commit messages.

Suggested change
- // multiplying goroutines silently. Internal-review M1 on PR #450.
+ // multiplying goroutines silently.

Comment thread internal/pruning/partition_pruner.go Outdated
Comment on lines +951 to +957
// minCleanupInterval is the smallest interval StartCleanup will honor.
// Anything below this is clamped up to prevent a misconfigured caller
// (or a fuzz input that passes a sub-millisecond duration) from
// pinning a CPU core in a tight ticker loop. 1ms is a generous floor
// — orders of magnitude below any plausible production interval and
// any fast-iteration test (tests in this package use 5–10ms).
// (Gemini round 3 / PR #450.)

medium

Remove meta-comments referencing review rounds. This information is ephemeral and does not add value to the long-term maintenance of the code.

Suggested change
- // minCleanupInterval is the smallest interval StartCleanup will honor.
- // Anything below this is clamped up to prevent a misconfigured caller
- // (or a fuzz input that passes a sub-millisecond duration) from
- // pinning a CPU core in a tight ticker loop. 1ms is a generous floor
- // — orders of magnitude below any plausible production interval and
- // any fast-iteration test (tests in this package use 5–10ms).
- // (Gemini round 3 / PR #450.)
+ // minCleanupInterval is the smallest interval StartCleanup will honor.
+ // Anything below this is clamped up to prevent a misconfigured caller
+ // (or a fuzz input that passes a sub-millisecond duration) from
+ // pinning a CPU core in a tight ticker loop. 1ms is a generous floor
+ // — orders of magnitude below any plausible production interval and
+ // any fast-iteration test (tests in this package use 5–10ms).

Comment thread internal/pruning/partition_pruner.go Outdated
Comment on lines +977 to +978
// fills that gap. See https://github.com/Basekick-Labs/arc/pull/450
// (Q2 finding from the 2026-05-22 memory-retention triage).

medium

Remove external PR links and internal triage references from code comments. The technical context provided in the comment is sufficient for future maintainers.

Suggested change
- // fills that gap. See https://github.com/Basekick-Labs/arc/pull/450
- // (Q2 finding from the 2026-05-22 memory-retention triage).
+ // fills that gap.

Comment thread internal/pruning/partition_pruner.go Outdated
Comment on lines +1007 to +1012
// goroutine-spawn + WaitGroup-sync overhead that parallelising the
// two sweeps would add. Run sequentially; if a future profile ever
// shows the sweep itself blocking concurrent get() calls under a
// hot-cache workload, that's the trigger to revisit. (Gemini round 2
// / PR #450 reversed a round-1 suggestion to parallelise after
// pointing out the overhead inversion.)

medium

Correct the spelling of 'parallelizing' to match the American English standard used throughout the project (e.g., 'optimized', 'initialization'). Also, remove the meta-comment referencing review history.

Suggested change
- // goroutine-spawn + WaitGroup-sync overhead that parallelising the
- // two sweeps would add. Run sequentially; if a future profile ever
- // shows the sweep itself blocking concurrent get() calls under a
- // hot-cache workload, that's the trigger to revisit. (Gemini round 2
- // / PR #450 reversed a round-1 suggestion to parallelise after
- // pointing out the overhead inversion.)
+ // goroutine-spawn + WaitGroup-sync overhead that parallelizing the
+ // two sweeps would add. Run sequentially; if a future profile ever
+ // shows the sweep itself blocking concurrent get() calls under a
+ // hot-cache workload, that's the trigger to revisit.

…und 4)

Drops the "Gemini round N / PR #450" stamps and the
2026-05-22-triage reference from comments. For a 4-round PR
on a small fix, the audit trail belongs in the PR description
and commit log, not in the source. (The X1 PR kept references
in-source because 12 review rounds genuinely needed the trail;
calibrate per PR size.)

Also "parallelising" → "parallelizing" to match the project's
predominantly US spelling.

No behavior change.

xe-nvdk (Member, Author) commented May 23, 2026

Round-4 fixes pushed in 80264bb — dropped the Gemini round N / PR #450 stamps and the triage-date reference from inline comments, and changed "parallelising" to "parallelizing" to match the project's predominantly US spelling. You're right that for a 4-round PR on a small fix, the audit trail belongs in the PR description + commit log, not in the source. (Different calibration than the X1 PR, which kept refs in-source because 12 review rounds genuinely needed the trail.)

No behavior change; 5 tests still green under -race.

PR is ready for merge from my side.

xe-nvdk merged commit 0a02877 into main on May 23, 2026 (6 checks passed)
xe-nvdk deleted the fix/pruner-cache-janitor branch on May 23, 2026 00:03