
fix: use runtime heap stats for memory-based eviction#564

Merged
KpaBap merged 1 commit into Kpa-clawbot:master from efiten:fix/eviction-memory-estimate
Apr 4, 2026

Conversation

@efiten (Contributor) commented Apr 4, 2026

Problem

Closes #563. Addresses the Packet store estimated memory item in #559.

`estimatedMemoryMB()` used a hardcoded formula:

```go
return float64(len(s.packets)*5120+s.totalObs*500) / 1048576.0
```

This ignored three data structures that grow continuously with every ingest cycle:

| Structure | Production size | Uncounted heap |
| --- | --- | --- |
| `distHops []distHopRecord` | 1,556,833 records | ~300 MB |
| `distPaths []distPathRecord` | 93,090 records | ~25 MB |
| `spIndex map[string]int` | 4,113,234 entries | ~400 MB |

Result: formula reported ~1.2 GB while actual heap was ~5 GB. With maxMemoryMB: 1024, eviction calculated it only needed to shed ~200 MB, removed a handful of packets, and stopped. Memory kept growing until the OOM killer fired.

Fix

Replace the body of `estimatedMemoryMB()` with `runtime.ReadMemStats`, so all data structures are counted automatically:

```go
func (s *PacketStore) estimatedMemoryMB() float64 {
	if s.memoryEstimator != nil {
		return s.memoryEstimator()
	}
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return float64(ms.HeapAlloc) / 1048576.0
}
```

Replace the eviction simulation loop (which re-used the same wrong formula) with a proportional calculation: if heap is N× over budget, evict enough packets to keep (1/N) × 0.9 of the current count. The 0.9 factor adds a 10% buffer so the next ingest cycle doesn't immediately re-trigger. All major data structures (distHops, distPaths, spIndex) scale with packet count, so removing a fraction of packets frees roughly the same fraction of total heap.

Testing

- Updated `TestEvictStale_MemoryBasedEviction` to inject a deterministic estimator via the new `memoryEstimator` field.
- Added `TestEvictStale_MemoryBasedEviction_UnderestimatedHeap`: verifies that when actual heap is 5× over limit (the production failure scenario), eviction correctly removes ~80%+ of packets.
```
=== RUN   TestEvictStale_MemoryBasedEviction
[store] Evicted 538 packets (1076 obs)
--- PASS

=== RUN   TestEvictStale_MemoryBasedEviction_UnderestimatedHeap
[store] Evicted 820 packets (1640 obs)
--- PASS
```

Full suite: `go test ./...` — ok (10.3s)

Perf note

runtime.ReadMemStats runs once per eviction tick (every 60 s) and once per /api/perf/store call. Cost is negligible.

estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500)
that ignored distHops (1.5M records), distPaths (93K records), and spIndex
(4.1M entries). This caused the formula to report ~1.2GB while actual heap
was ~5GB, so the memory-based eviction check barely fired and the process
grew until OOM-killed.

Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures
are counted. Replace the inner eviction simulation loop (which re-used the
wrong formula) with a proportional calculation: if heap is N× over budget,
evict enough packets to keep (1/N × 0.9) of the current count, with a 10%
buffer so the next ingest cycle doesn't immediately re-trigger.

Add memoryEstimator func() float64 field to PacketStore for test injection,
so eviction tests stay deterministic without spawning goroutines or relying
on process heap size.

Addresses the "Packet store estimated memory" item in epic Kpa-clawbot#559.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Kpa-clawbot (Owner) left a comment

Charlie Munger Review — Final

Mental models applied: Inversion, Incentive bias, Man with a hammer, Second-order effects.

The Inversion Question: "What would make this fail catastrophically?"

1. Over-eviction from non-PacketStore heap pressure (second-order effect)

runtime.ReadMemStats measures the entire process heap, not just the PacketStore. If a large HTTP request, a goroutine burst, or any transient allocation spikes HeapAlloc, the eviction logic will blame the PacketStore and aggressively shed packets that shouldn't be evicted.

However — applying inversion again — the old code failed by under-counting (OOM kill). The new code's failure mode is over-counting (evict too many packets). Over-eviction means data loss but the process stays alive. That's strictly better. The assumption that PacketStore dominates heap is stated in the PR description (production: ~5GB heap, most of it distHops/distPaths/spIndex) and appears well-founded. Not a must-fix — just acknowledge the tradeoff.

2. GC timing and eviction cascades

After evicting 80% of packets, do the underlying slice capacities actually shrink? The distHops[:0] append-filter pattern retains the backing array. s.packets uses copy but doesn't reallocate. So HeapAlloc might not drop as much as expected after one GC cycle, potentially triggering another round of eviction on the next 60s tick.

In practice: HeapAlloc reflects live objects, not capacity. The evicted StoreTx pointers and their observation slices become unreachable and will be collected. The empty capacity in the distHops/distPaths backing arrays is relatively small compared to the pointed-to data. Not a must-fix — the math is self-correcting over 1-2 GC cycles.

3. runtime.ReadMemStats STW pause

ReadMemStats stops the world. Called once per 60s eviction tick — negligible. But GetPerfStoreStats and GetPerfStoreStatsTyped now also call it on every /api/perf/store request. If a dashboard polls this endpoint every second, that's a STW pause per second. On a 5GB heap, that's noticeable.

Suggestion (not must-fix): Cache the ReadMemStats result with a short TTL (e.g., 5s) so rapid API hits don't cause repeated STW pauses.

4. The 0.9 buffer at boundary conditions

If heap is barely over budget (1.05×), fractionToKeep = (1/1.05) * 0.9 ≈ 0.857 — evicts ~14% of packets for a 5% overshoot. This is aggressive but prevents oscillation. Acceptable tradeoff. The incentive structure is correct: better to evict slightly too much once than to oscillate around the boundary forever.

5. Test coverage

The UnderestimatedHeap test uses a constant estimator (always returns 2500MB regardless of evictions). This correctly tests the initial eviction math but doesn't verify that the feedback loop works across multiple ticks. Since the real estimator uses ReadMemStats which naturally reflects state changes, this is fine for unit testing. Not a must-fix.

Verdict

The core fix is sound. The old formula was dangerously wrong — a classic "man with a hammer" problem where someone modeled memory as packets * constant and forgot the other data structures exist. Using actual heap stats is the right call. The proportional eviction replaces a buggy simulation loop with clean arithmetic. The tradeoffs are well-understood and acceptable.

Zero must-fix items. Approving.

@KpaBap KpaBap merged commit f897ce1 into Kpa-clawbot:master Apr 4, 2026
5 checks passed
Closes issue: fix: memory-based eviction never fires — estimatedMemoryMB() ignores distHops/distPaths/spIndex