fix: use runtime heap stats for memory-based eviction #564
estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500) that ignored distHops (1.5M records), distPaths (93K records), and spIndex (4.1M entries). This caused the formula to report ~1.2GB while actual heap was ~5GB, so the memory-based eviction check barely fired and the process grew until OOM-killed.

Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures are counted.

Replace the inner eviction simulation loop (which re-used the wrong formula) with a proportional calculation: if heap is N× over budget, evict enough packets to keep (1/N × 0.9) of the current count, with a 10% buffer so the next ingest cycle doesn't immediately re-trigger.

Add memoryEstimator func() float64 field to PacketStore for test injection, so eviction tests stay deterministic without spawning goroutines or relying on process heap size.

Addresses the "Packet store estimated memory" item in epic Kpa-clawbot#559.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
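The heap-stats replacement described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual diff; the function name heapAllocMB is assumed:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAllocMB returns the live heap size in MB via runtime.ReadMemStats,
// so every data structure (distHops, distPaths, spIndex, ...) is counted
// automatically instead of being modeled by a per-packet formula.
func heapAllocMB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // stop-the-world, but called only once per eviction tick
	return float64(m.HeapAlloc) / (1024 * 1024)
}

func main() {
	fmt.Printf("heap: %.1f MB\n", heapAllocMB())
}
```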
Kpa-clawbot left a comment
Charlie Munger Review — Final
Mental models applied: Inversion, Incentive bias, Man with a hammer, Second-order effects.
The Inversion Question: "What would make this fail catastrophically?"
1. Over-eviction from non-PacketStore heap pressure (second-order effect)
runtime.ReadMemStats measures the entire process heap, not just the PacketStore. If a large HTTP request, a goroutine burst, or any transient allocation spikes HeapAlloc, the eviction logic will blame the PacketStore and aggressively shed packets that shouldn't be evicted.
However — applying inversion again — the old code failed by under-counting (OOM kill). The new code's failure mode is over-counting (evict too many packets). Over-eviction means data loss but the process stays alive. That's strictly better. The assumption that PacketStore dominates heap is stated in the PR description (production: ~5GB heap, most of it distHops/distPaths/spIndex) and appears well-founded. Not a must-fix — just acknowledge the tradeoff.
2. GC timing and eviction cascades
After evicting 80% of packets, do the underlying slice capacities actually shrink? The distHops[:0] append-filter pattern retains the backing array. s.packets uses copy but doesn't reallocate. So HeapAlloc might not drop as much as expected after one GC cycle, potentially triggering another round of eviction on the next 60s tick.
In practice: HeapAlloc reflects live objects, not capacity. The evicted StoreTx pointers and their observation slices become unreachable and will be collected. The empty capacity in the distHops/distPaths backing arrays is relatively small compared to the pointed-to data. Not a must-fix — the math is self-correcting over 1-2 GC cycles.
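The capacity-retention point above can be seen directly: the append-filter idiom shrinks len but keeps the backing array. A minimal demonstration with illustrative values (not the PR's code):

```go
package main

import "fmt"

// filterInPlace keeps only even values using the append-filter idiom
// (out := s[:0]; append). The backing array is retained, so cap() does
// not shrink even though len() does.
func filterInPlace(s []int) []int {
	out := s[:0]
	for _, v := range s {
		if v%2 == 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	s := make([]int, 1000)
	for i := range s {
		s[i] = i
	}
	s = filterInPlace(s)
	fmt.Println(len(s), cap(s)) // len drops to 500; cap stays 1000
}
```

For a []distHopRecord of small structs this retained capacity is cheap; the bulk of the heap lives behind pointers that do become collectable after eviction.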
3. runtime.ReadMemStats STW pause
ReadMemStats stops the world. Called once per 60s eviction tick — negligible. But GetPerfStoreStats and GetPerfStoreStatsTyped now also call it on every /api/perf/store request. If a dashboard polls this endpoint every second, that's a STW pause per second. On a 5GB heap, that's noticeable.
Suggestion (not must-fix): Cache the ReadMemStats result with a short TTL (e.g., 5s) so rapid API hits don't cause repeated STW pauses.
4. The 0.9 buffer at boundary conditions
If heap is barely over budget (1.05×), fractionToKeep = (1/1.05) * 0.9 ≈ 0.857 — evicts ~14% of packets for a 5% overshoot. This is aggressive but prevents oscillation. Acceptable tradeoff. The incentive structure is correct: better to evict slightly too much once than to oscillate around the boundary forever.
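The boundary arithmetic above as a runnable check; fractionToKeep is an illustrative name for the rule the PR describes:

```go
package main

import "fmt"

// fractionToKeep implements the proportional eviction rule: when heap is
// N× over budget, keep (1/N) * 0.9 of the current packet count.
func fractionToKeep(overBudgetFactor float64) float64 {
	return (1 / overBudgetFactor) * 0.9
}

func main() {
	fmt.Printf("%.3f\n", fractionToKeep(1.05)) // 0.857: ~14% evicted for a 5% overshoot
	fmt.Printf("%.3f\n", fractionToKeep(5.0))  // 0.180: ~82% evicted in the production scenario
}
```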
5. Test coverage
The UnderestimatedHeap test uses a constant estimator (always returns 2500MB regardless of evictions). This correctly tests the initial eviction math but doesn't verify that the feedback loop works across multiple ticks. Since the real estimator uses ReadMemStats which naturally reflects state changes, this is fine for unit testing. Not a must-fix.
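A sketch of how the estimator injection makes eviction deterministic, with simplified types standing in for the real PacketStore (field and method names other than memoryEstimator are assumptions):

```go
package main

import "fmt"

// PacketStore is a simplified stand-in; memoryEstimator is the test seam
// the PR adds, defaulting to real heap stats in production.
type PacketStore struct {
	packets         []int
	maxMemoryMB     float64
	memoryEstimator func() float64
}

// evictIfOverBudget applies the proportional rule: if heap is N× over
// budget, keep (1/N) * 0.9 of the packets.
func (s *PacketStore) evictIfOverBudget() {
	heapMB := s.memoryEstimator()
	if heapMB <= s.maxMemoryMB {
		return
	}
	over := heapMB / s.maxMemoryMB
	keep := int(float64(len(s.packets)) * (1 / over) * 0.9)
	s.packets = s.packets[:keep]
}

func main() {
	s := &PacketStore{
		packets:         make([]int, 1000),
		maxMemoryMB:     500,
		memoryEstimator: func() float64 { return 2500 }, // constant: always 5× over, like the UnderestimatedHeap test
	}
	s.evictIfOverBudget()
	fmt.Println(len(s.packets)) // 180 packets kept: ~82% evicted
}
```

Because the injected estimator is a plain func, tests need no goroutines and no dependence on the test process's actual heap.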
Verdict
The core fix is sound. The old formula was dangerously wrong — a classic "man with a hammer" problem where someone modeled memory as packets * constant and forgot the other data structures exist. Using actual heap stats is the right call. The proportional eviction replaces a buggy simulation loop with clean arithmetic. The tradeoffs are well-understood and acceptable.
Zero must-fix items. Approving.
Problem
Closes #563. Addresses the "Packet store estimated memory" item in #559.
estimatedMemoryMB() used a hardcoded formula (packets × 5120 + obs × 500). This ignored three data structures that grow continuously with every ingest cycle:
- distHops []distHopRecord
- distPaths []distPathRecord
- spIndex map[string]int

Result: the formula reported ~1.2 GB while actual heap was ~5 GB. With maxMemoryMB: 1024, eviction calculated it only needed to shed ~200 MB, removed a handful of packets, and stopped. Memory kept growing until the OOM killer fired.

Fix
Replace estimatedMemoryMB() with runtime.ReadMemStats so all data structures are automatically counted.

Replace the eviction simulation loop (which re-used the same wrong formula) with a proportional calculation: if heap is N× over budget, evict enough packets to keep (1/N) × 0.9 of the current count. The 0.9 factor adds a 10% buffer so the next ingest cycle doesn't immediately re-trigger. All major data structures (distHops, distPaths, spIndex) scale with packet count, so removing a fraction of packets frees roughly the same fraction of total heap.

Testing
- TestEvictStale_MemoryBasedEviction: injects a deterministic estimator via the new memoryEstimator field.
- TestEvictStale_MemoryBasedEviction_UnderestimatedHeap: verifies that when actual heap is 5× over limit (the production failure scenario), eviction correctly removes ~80%+ of packets.

Full suite: go test ./... — ok (10.3s)

Perf note

runtime.ReadMemStats runs once per eviction tick (every 60 s) and once per /api/perf/store call. Cost is negligible.