fix: use runtime heap stats for memory-based eviction #564
estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500) that ignored distHops (1.5M records), distPaths (93K records), and spIndex (4.1M entries). This caused the formula to report ~1.2GB while actual heap was ~5GB, so the memory-based eviction check barely fired and the process grew until OOM-killed.

Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures are counted.

Replace the inner eviction simulation loop (which re-used the wrong formula) with a proportional calculation: if heap is N× over budget, evict enough packets to keep (1/N × 0.9) of the current count, with a 10% buffer so the next ingest cycle doesn't immediately re-trigger.

Add memoryEstimator func() float64 field to PacketStore for test injection, so eviction tests stay deterministic without spawning goroutines or relying on process heap size.

Addresses the "Packet store estimated memory" item in epic Kpa-clawbot#559.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
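The heap-stats replacement described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual diff; the function name heapAllocMB is assumed:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAllocMB returns the live heap size in MB via runtime.ReadMemStats,
// so every data structure (distHops, distPaths, spIndex, ...) is counted
// automatically instead of being modeled by a per-packet formula.
func heapAllocMB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // stop-the-world, but called only once per eviction tick
	return float64(m.HeapAlloc) / (1024 * 1024)
}

func main() {
	fmt.Printf("heap: %.1f MB\n", heapAllocMB())
}
```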
Kpa-clawbot left a comment
Charlie Munger Review — Final
Mental models applied: Inversion, Incentive bias, Man with a hammer, Second-order effects.
The Inversion Question: "What would make this fail catastrophically?"
1. Over-eviction from non-PacketStore heap pressure (second-order effect)
runtime.ReadMemStats measures the entire process heap, not just the PacketStore. If a large HTTP request, a goroutine burst, or any transient allocation spikes HeapAlloc, the eviction logic will blame the PacketStore and aggressively shed packets that shouldn't be evicted.
However — applying inversion again — the old code failed by under-counting (OOM kill). The new code's failure mode is over-counting (evict too many packets). Over-eviction means data loss but the process stays alive. That's strictly better. The assumption that PacketStore dominates heap is stated in the PR description (production: ~5GB heap, most of it distHops/distPaths/spIndex) and appears well-founded. Not a must-fix — just acknowledge the tradeoff.
2. GC timing and eviction cascades
After evicting 80% of packets, do the underlying slice capacities actually shrink? The distHops[:0] append-filter pattern retains the backing array. s.packets uses copy but doesn't reallocate. So HeapAlloc might not drop as much as expected after one GC cycle, potentially triggering another round of eviction on the next 60s tick.
In practice: HeapAlloc reflects live objects, not capacity. The evicted StoreTx pointers and their observation slices become unreachable and will be collected. The empty capacity in the distHops/distPaths backing arrays is relatively small compared to the pointed-to data. Not a must-fix — the math is self-correcting over 1-2 GC cycles.
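The capacity-retention point above can be seen directly: the append-filter idiom shrinks len but keeps the backing array. A minimal demonstration with illustrative values (not the PR's code):

```go
package main

import "fmt"

// filterInPlace keeps only even values using the append-filter idiom
// (out := s[:0]; append). The backing array is retained, so cap() does
// not shrink even though len() does.
func filterInPlace(s []int) []int {
	out := s[:0]
	for _, v := range s {
		if v%2 == 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	s := make([]int, 1000)
	for i := range s {
		s[i] = i
	}
	s = filterInPlace(s)
	fmt.Println(len(s), cap(s)) // len drops to 500; cap stays 1000
}
```

For a []distHopRecord of small structs this retained capacity is cheap; the bulk of the heap lives behind pointers that do become collectable after eviction.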
3. runtime.ReadMemStats STW pause
ReadMemStats stops the world. Called once per 60s eviction tick — negligible. But GetPerfStoreStats and GetPerfStoreStatsTyped now also call it on every /api/perf/store request. If a dashboard polls this endpoint every second, that's a STW pause per second. On a 5GB heap, that's noticeable.
Suggestion (not must-fix): Cache the ReadMemStats result with a short TTL (e.g., 5s) so rapid API hits don't cause repeated STW pauses.
4. The 0.9 buffer at boundary conditions
If heap is barely over budget (1.05×), fractionToKeep = (1/1.05) * 0.9 ≈ 0.857 — evicts ~14% of packets for a 5% overshoot. This is aggressive but prevents oscillation. Acceptable tradeoff. The incentive structure is correct: better to evict slightly too much once than to oscillate around the boundary forever.
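The boundary arithmetic above as a runnable check; fractionToKeep is an illustrative name for the rule the PR describes:

```go
package main

import "fmt"

// fractionToKeep implements the proportional eviction rule: when heap is
// N× over budget, keep (1/N) * 0.9 of the current packet count.
func fractionToKeep(overBudgetFactor float64) float64 {
	return (1 / overBudgetFactor) * 0.9
}

func main() {
	fmt.Printf("%.3f\n", fractionToKeep(1.05)) // 0.857: ~14% evicted for a 5% overshoot
	fmt.Printf("%.3f\n", fractionToKeep(5.0))  // 0.180: ~82% evicted in the production scenario
}
```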
5. Test coverage
The UnderestimatedHeap test uses a constant estimator (always returns 2500MB regardless of evictions). This correctly tests the initial eviction math but doesn't verify that the feedback loop works across multiple ticks. Since the real estimator uses ReadMemStats which naturally reflects state changes, this is fine for unit testing. Not a must-fix.
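A sketch of how the estimator injection makes eviction deterministic, with simplified types standing in for the real PacketStore (field and method names other than memoryEstimator are assumptions):

```go
package main

import "fmt"

// PacketStore is a simplified stand-in; memoryEstimator is the test seam
// the PR adds, defaulting to real heap stats in production.
type PacketStore struct {
	packets         []int
	maxMemoryMB     float64
	memoryEstimator func() float64
}

// evictIfOverBudget applies the proportional rule: if heap is N× over
// budget, keep (1/N) * 0.9 of the packets.
func (s *PacketStore) evictIfOverBudget() {
	heapMB := s.memoryEstimator()
	if heapMB <= s.maxMemoryMB {
		return
	}
	over := heapMB / s.maxMemoryMB
	keep := int(float64(len(s.packets)) * (1 / over) * 0.9)
	s.packets = s.packets[:keep]
}

func main() {
	s := &PacketStore{
		packets:         make([]int, 1000),
		maxMemoryMB:     500,
		memoryEstimator: func() float64 { return 2500 }, // constant: always 5× over, like the UnderestimatedHeap test
	}
	s.evictIfOverBudget()
	fmt.Println(len(s.packets)) // 180 packets kept: ~82% evicted
}
```

Because the injected estimator is a plain func, tests need no goroutines and no dependence on the test process's actual heap.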
Verdict
The core fix is sound. The old formula was dangerously wrong — a classic "man with a hammer" problem where someone modeled memory as packets * constant and forgot the other data structures exist. Using actual heap stats is the right call. The proportional eviction replaces a buggy simulation loop with clean arithmetic. The tradeoffs are well-understood and acceptable.
Zero must-fix items. Approving.
Problem
Closes #563. Addresses the "Packet store estimated memory" item in #559.
estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500). This ignored three data structures that grow continuously with every ingest cycle:
- distHops []distHopRecord
- distPaths []distPathRecord
- spIndex map[string]int

Result: the formula reported ~1.2 GB while actual heap was ~5 GB. With maxMemoryMB: 1024, eviction calculated it only needed to shed ~200 MB, removed a handful of packets, and stopped. Memory kept growing until the OOM killer fired.

Fix
Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures are automatically counted.

Replace the eviction simulation loop (which re-used the same wrong formula) with a proportional calculation: if heap is N× over budget, evict enough packets to keep (1/N) × 0.9 of the current count. The 0.9 factor adds a 10% buffer so the next ingest cycle doesn't immediately re-trigger. All major data structures (distHops, distPaths, spIndex) scale with packet count, so removing a fraction of packets frees roughly the same fraction of total heap.

Testing
- TestEvictStale_MemoryBasedEviction: injects a deterministic estimator via the new memoryEstimator field.
- TestEvictStale_MemoryBasedEviction_UnderestimatedHeap: verifies that when actual heap is 5× over limit (the production failure scenario), eviction correctly removes ~80%+ of packets.

Full suite: go test ./... — ok (10.3s)

Perf note

runtime.ReadMemStats runs once per eviction tick (every 60 s) and once per /api/perf/store call. Cost is negligible.