
data/query: memory leak under load (4.8GB cumulative observed) — causes indexer+persona cascade #945

@joelteply

Observed

During shared-cognition shim validation (branch feature/shared-cognition-rust, 2026-04-19), both memento and anvil hit the same failure mode on independent dev machines:

  • continuum-core-server RSS climbed steadily under moderate load (indexer + a few personas)
  • `data/query` request timeouts started firing repeatedly (Rust-side WARN in rag.log: "CodebaseIndexer: Failed to load content hashes: Request timeout: data/query")
  • anvil's independent measurement: data/query leaked ~4.8GB cumulative before Rust core OOM-crashed
  • My machine: Rust core RSS hit 939MB, system swap 11.6GB, memory pressure 100% for ~2 minutes, Rust process then crashed and restarted

Impact

Chat becomes unresponsive system-wide. Personas get stuck mid-evaluation when ChatRAGBuilder's ORM.query call into data/query never returns. Result: an end-user ping is followed by 10+ minutes of silence, or no response at all.

Workaround

Added `SKIP_CODEBASE_INDEX=1` env gate to `ServiceInitializer.initializeCodebaseIndexing` (commit 048a823). Skipping the 120s-after-boot codebase index prevents the saturation event.

Real fix

Two layers:

  1. data/query shouldn't leak. Whatever object holds query result sets isn't being dropped. Candidate suspects: SQLite row set in orm/sqlite.rs, in-flight IPC response buffers, or a retained tokio channel. Instrument an RSS delta per query invocation and bisect.
  2. Indexer backpressure. Even without the leak, the indexer's 16-embeddings-per-500ms rate is not throttled by data/query saturation. Add a circuit breaker: if data/query p99 > N ms, pause the indexer; resume when it recovers. The indexer is already yield-polite (EMBEDDING_BATCH_PAUSE_MS), but yielding alone isn't enough when downstream storage is degraded.
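
A minimal sketch of the circuit breaker from layer 2, assuming the indexer can feed it one latency sample per data/query call and check it before each embedding batch. `QueryBreaker`, the window size, and both thresholds are hypothetical; nothing here exists in the codebase yet, and the "N ms" value is still to be chosen. A separate, lower resume threshold is used so the indexer doesn't flap on and off around a single limit.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical breaker: keeps a sliding window of recent data/query
/// latencies and decides whether the indexer should pause.
pub struct QueryBreaker {
    window: VecDeque<Duration>,
    capacity: usize,
    pause_above: Duration,  // trip when p99 exceeds this (the "N ms")
    resume_below: Duration, // resume only once p99 drops under this
    paused: bool,
}

impl QueryBreaker {
    pub fn new(capacity: usize, pause_above: Duration, resume_below: Duration) -> Self {
        assert!(capacity > 0, "window must hold at least one sample");
        assert!(resume_below <= pause_above, "resume threshold must not exceed pause threshold");
        Self {
            window: VecDeque::with_capacity(capacity),
            capacity,
            pause_above,
            resume_below,
            paused: false,
        }
    }

    /// Record one data/query latency. A timed-out request should be
    /// recorded with the timeout duration so it counts against p99.
    pub fn record(&mut self, latency: Duration) {
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(latency);

        let p99 = self.p99();
        if !self.paused && p99 > self.pause_above {
            self.paused = true;
        } else if self.paused && p99 < self.resume_below {
            self.paused = false;
        }
    }

    /// Nearest-rank p99 over the current window (zero if empty).
    pub fn p99(&self) -> Duration {
        if self.window.is_empty() {
            return Duration::ZERO;
        }
        let mut sorted: Vec<Duration> = self.window.iter().copied().collect();
        sorted.sort();
        let rank = (sorted.len() * 99 + 99) / 100; // ceil(0.99 * n)
        sorted[rank - 1]
    }

    /// The indexer would check this before starting each embedding batch.
    pub fn should_pause(&self) -> bool {
        self.paused
    }
}
```

With a small window, p99 is effectively the maximum, so a single timeout keeps the indexer paused until that sample ages out. A larger window, or a time-based one, would make the breaker less twitchy at the cost of reacting more slowly.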

Repro

  1. `npm start` (cold)
  2. Wait 120s for the indexer to kick in
  3. Post a few chat messages to general
  4. Watch RSS in `.continuum/jtag/logs/system/core.log` (memory_pressure ticks)
  5. Observe data/query timeouts in `.continuum/jtag/logs/system/rag.log`
  6. Within a few minutes, Rust core hits critical memory and OOM-crashes
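
To turn the repro into numbers for bisecting, the per-invocation RSS delta suggested under "Real fix" could be sketched as below. All names are hypothetical (`with_rss_delta`, `LeakLedger`, `linux_rss_bytes`), and the labels passed to the ledger would be whatever code paths are being bisected, e.g. the suspects listed above. The RSS sampler is passed in as a closure because the right source is platform-specific: the `/proc/self/status` reader shown here only works on Linux, so a macOS dev machine would need a different sampler.

```rust
use std::collections::BTreeMap;

/// Run `op`, sampling RSS before and after with `sample_rss`.
/// Returns the op's result and the RSS change in bytes (may be negative).
pub fn with_rss_delta<T>(sample_rss: impl Fn() -> u64, op: impl FnOnce() -> T) -> (T, i64) {
    let before = sample_rss();
    let out = op();
    let after = sample_rss();
    (out, after as i64 - before as i64)
}

/// Accumulates call counts and cumulative RSS deltas per label so the
/// worst-growing code path can be picked out.
#[derive(Default)]
pub struct LeakLedger {
    totals: BTreeMap<String, (u64, i64)>, // label -> (calls, cumulative bytes)
}

impl LeakLedger {
    pub fn record(&mut self, label: &str, delta: i64) {
        let entry = self.totals.entry(label.to_string()).or_insert((0, 0));
        entry.0 += 1;
        entry.1 += delta;
    }

    /// Label with the largest cumulative RSS growth, if any were recorded.
    pub fn worst(&self) -> Option<(&str, i64)> {
        self.totals
            .iter()
            .max_by_key(|(_, v)| v.1)
            .map(|(k, v)| (k.as_str(), v.1))
    }
}

/// Linux-only sampler: parses the VmRSS line (reported in kB) from
/// /proc/self/status. Returns None on other platforms or parse failure.
pub fn linux_rss_bytes() -> Option<u64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    let kb: u64 = line.split_whitespace().nth(1)?.parse().ok()?;
    Some(kb * 1024)
}
```

RSS is process-wide, so deltas measured this way are noisy when queries run concurrently, and an allocator that holds on to freed pages can mask or exaggerate a single call's cost. The cumulative totals over many invocations are the meaningful signal, not any one sample.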

Related

  • feature/shared-cognition-rust branch — indexer disable is tracked there, not the actual leak fix
