
feat(paging): EmbeddingCache → PagedResourcePool (Phase 2 — fixes 0/64 hit rate) #933

Merged
joelteply merged 2 commits into main from feature/embedding-cache-pool on Apr 19, 2026

Conversation

@joelteply
Contributor

Summary

The 0/64 hit rate fix the architecture doc promised. Internal Rust callers (`DataModule.backfill_vectors`, `ModuleBackedEmbeddingProvider`, anywhere using `generate_embedding{,s_batch}`) were silently bypassing the embedding cache because it lived inside the IPC handler — `handle_generate` had its own ad-hoc `HashMap<(String, u64), CachedEmbedding>` with djb2 keys + 5-minute TTL + 10k count cap. Internal callers went around it.

This PR moves the cache one layer down: the pool sits behind `generate_embedding{,s_batch}`, so every embedder — IPC handler and Rust caller — hits the same cache uniformly. The IPC handler becomes a thin wrapper around the batched function. The architecture-doc claim is now true.

What's in this commit

  • `EmbeddingKey { model: String, content_hash: [u8; 32] }` — structured key (Joel's pass-the-struct rule), model-namespaced (different models map text → different vectors → distinct slots), SHA-256 content hash (fixed-size, collision-resistant vs djb2) — see the key sketch after this list
  • `EMBEDDING_POOL: PagedResourcePool<EmbeddingKey, Vec<f32>>` replaces `EMBEDDING_CACHE`. 256 MB byte-driven budget (~170k entries at 384d FP32, vs the old 10k count cap), `size_weighted_lru` eviction. Eventually overridden by recipe-declared budgets (Phase 9) and pressure broker (Phase 7b).
  • `generate_embedding` / `generate_embeddings_batch` are pool-aware — the actual user-visible behavior change. Batch path: per-text cache check → one `model.embed()` for the miss subset → per-text insert. One model invocation per batch (vs N for per-text single-flight) keeps the GPU/ONNX path efficient.
  • `handle_generate` collapses to ~15 lines (was ~70) by delegating to `generate_embeddings_batch`
  • `handle_cache_stats` now exposes `pressure` + `maxBytes` + `evictions` + `pinned` alongside hits/misses — broker-aware cache visibility for free
  • `handle_cache_clear` → `pool.clear()` (new pool method that drains entries + resets hit/miss/eviction counters)
  • `PagedResourcePool::clear()` — admin-level reset, drops pinned too. Distinct from `evict_under_pressure` (which respects pins). Documented inline.
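
For concreteness, a minimal sketch of the key type from the list above — the struct fields match the PR, the `new` constructor is illustrative, and the `sha2` crate is an assumption. The budget arithmetic checks out: at 384 dims × 4 bytes, one FP32 vector is ~1.5 KB, so 256 MB holds on the order of 170k entries.

```rust
// Sketch only: constructor name is illustrative; assumes the sha2 crate.
use sha2::{Digest, Sha256};

/// Model-namespaced cache key: the same text embedded by two different
/// models occupies two distinct slots.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct EmbeddingKey {
    pub model: String,
    pub content_hash: [u8; 32],
}

impl EmbeddingKey {
    /// SHA-256 of the raw text: fixed-size and collision-resistant,
    /// unlike the old 64-bit djb2 hash.
    pub fn new(model: &str, text: &str) -> Self {
        Self {
            model: model.to_string(),
            content_hash: Sha256::digest(text.as_bytes()).into(),
        }
    }
}
```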

Tests (5/5 passing — `cargo test --lib modules::embedding::tests`)

  • `generate_embeddings_batch_hits_pool_before_loading_model` — the regression test for the 0/64 bug. Pre-populates pool with vector for model "nonexistent-model-name"; the call returns the cached vector without trying to load that nonexistent model. If the cache path were broken, model load would fail — proves the short-circuit works.
  • `single_embedding_hits_pool_before_loading_model` — same proof for single-text path
  • `embedding_key_is_model_namespaced` — same text + different model = distinct cache slots (correctness invariant)
  • `pool_clear_resets_stats_and_drops_entries` — `pool.clear()` drains entries AND resets counters
  • `batch_with_partial_hits_records_correct_hit_count` — an all-cached batch (the zero-miss edge of the partial-hit path) returns without calling `model.embed()` and records the correct hit count

Tests serialize via `TEST_LOCK` on the global pool — `EMBEDDING_POOL` is a process-global singleton (matches IPC reality) and parallel test execution would race shared hit/miss counters. The pure key-shape test doesn't need the lock.
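
For reference, the serialization pattern is the standard one for tests against a process-global singleton — a minimal sketch with illustrative names, not the PR's actual code:

```rust
// Sketch only: names are illustrative.
use std::sync::Mutex;

/// Tests that touch the process-global EMBEDDING_POOL hold this lock so
/// a parallel test can't race the shared hit/miss counters.
static TEST_LOCK: Mutex<()> = Mutex::new(());

#[test]
fn pool_test_template() {
    // Recover from poisoning so one panicked test doesn't cascade.
    let _guard = TEST_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // ... exercise the global pool, assert on hits/misses ...
}
```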

50/50 paging tests still pass — no regressions.

Branch base

This PR is based on `feature/pressure-broker` (PR #932), because Phase 2 needs `stats_blocking` (sync stats for the cache_stats handler called from sync IPC code). When #932 lands, this rebases cleanly onto main.

Test plan

  • `cargo test --features metal,accelerate --lib modules::embedding::tests` — 5/5 pass
  • `cargo test --features metal,accelerate --lib paging` — 50/50 pass (no regression)
  • `cargo build --features metal,accelerate -p continuum-core` — clean (0 errors, only pre-existing warnings)
  • Live verification (after merge): run codebase indexer, watch `./jtag embedding/cache/stats` — hit rate should climb past 0% as the same chunk text hits the cache from both backfill_vectors and IPC paths

🤖 Generated with Claude Code

joelteply and others added 2 commits April 17, 2026 21:49
The PagedResourcePool primitive is the per-resource brain. The broker is
the cross-resource brain: one orchestrator that reads pressure from every
registered pool, decides which to relieve, and pulls the eviction lever.
Same broker is the future home of recipe-aware priority arbitration,
ML-policy-driven tiering, and LLM-mediated control for novel pressure
situations — those land in 7b/7c without changing the pool API.

What lands here:

  - PressureSource trait — name + pressure + evict_some + stats (see the sketch after this list)
  - PressureBroker struct with register/unregister/relieve/snapshot/spawn_tick
  - PressureTier (Normal/Warning/High/Critical at 0.6/0.8/0.95)
  - Tier-driven relief: Normal/Warning observe; High evicts worst pool;
    Critical evicts all over-budget pools
  - Blanket impl: every PagedResourcePool<K, V> auto-satisfies
    PressureSource — consumers register Arc<pool>, no adapter struct
  - PagedResourcePool::evict_under_pressure (broker-callable eviction)
  - PagedResourcePool::stats_blocking (sync stats for non-async trait)
  - PagedResourcePool::config_name (stable name accessor)
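
A hedged sketch of what the trait and tier mapping above might look like — the signatures are assumptions (the real trait also exposes stats, omitted here), only the names and thresholds come from the commit:

```rust
// Sketch only: exact signatures are assumed; stats accessor omitted.
pub trait PressureSource: Send + Sync {
    fn name(&self) -> &str;
    /// 0.0 = empty, 1.0 = at budget.
    fn pressure(&self) -> f32;
    /// Evict toward `target` pressure; returns bytes freed.
    fn evict_some(&self, target: f32) -> u64;
}

/// Tier thresholds from the commit: 0.6 / 0.8 / 0.95.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PressureTier {
    Normal,   // < 0.60: observe
    Warning,  // 0.60..0.80: observe
    High,     // 0.80..0.95: evict the single worst pool
    Critical, // >= 0.95: evict every over-budget pool
}

impl PressureTier {
    pub fn from_pressure(p: f32) -> Self {
        if p >= 0.95 {
            Self::Critical
        } else if p >= 0.80 {
            Self::High
        } else if p >= 0.60 {
            Self::Warning
        } else {
            Self::Normal
        }
    }
}
```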

Tests (8/8 passing):

  - tier_thresholds_match_gpu_pressure_watcher
  - no_action_when_all_pools_normal
  - high_pressure_evicts_only_worst_pool
  - critical_pressure_evicts_all_over_budget_pools
  - registration_dedups_by_name
  - unregister_removes_pool
  - real_paged_resource_pool_plugs_into_broker_via_blanket_impl
  - snapshot_orders_pools_by_pressure_descending

The integration test proves the architectural point: build a real
PagedResourcePool<String, Vec<u8>>, fill it past 0.80 pressure,
register it via blanket impl, call broker.relieve(), watch pressure
drop. No adapter struct, no per-consumer wiring boilerplate.

Doc updated: RESOURCE-ARCHITECTURE.md Phase 7 status + 7a/7b/7c split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 0/64 hit rate fix. Internal Rust callers (DataModule.backfill_vectors,
ModuleBackedEmbeddingProvider, anywhere using generate_embedding{,s_batch})
were silently bypassing the embedding cache because it was wired only into
the IPC handler — handle_generate had its own ad-hoc HashMap<(String, u64),
CachedEmbedding> with djb2 keys + 5-minute TTL + 10k count cap.

This commit moves the cache one layer down: the pool lives behind
generate_embedding{,s_batch}, so every embedder — IPC and Rust — hits the
same cache uniformly. The IPC handler becomes a thin wrapper around the
batched function. The architecture-doc claim that the migration "fixes
the 0/64 hit rate" is now true.

What's in this commit:

  - EmbeddingKey { model: String, content_hash: [u8; 32] } — structured
    key, model-namespaced (different models map text → different vectors,
    so they need distinct cache slots), SHA-256 content hash (fixed-size,
    collision-resistant vs djb2)
  - EMBEDDING_POOL: PagedResourcePool<EmbeddingKey, Vec<f32>> replaces
    EMBEDDING_CACHE. 256 MB byte-driven budget (~170k entries at 384d FP32,
    >> previous 10k count cap), size_weighted_lru eviction, eventually
    overridden by recipe-declared budgets (Phase 9) + pressure broker (7b)
  - generate_embedding / generate_embeddings_batch are pool-aware — the
    actual user-visible behavior change. Batch path: per-text cache check,
    one model.embed() for the miss subset, per-text insert. Single path:
    pool.get → fall through to model.embed → pool.insert. (The batch path
    is sketched after this list.)
  - handle_generate collapses to ~15 lines (was ~70) by delegating to
    generate_embeddings_batch
  - handle_cache_stats now exposes pressure + max_bytes + evictions in
    addition to hits/misses/size — broker-aware cache visibility
  - handle_cache_clear → pool.clear() (new pool method that drains entries
    + resets hit/miss/eviction counters)
  - PagedResourcePool::clear() — admin-level reset, drops pinned too,
    distinct from evict_under_pressure (which respects pins)
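
A minimal sketch of the batch path just described — `EMBEDDING_POOL.get`/`.insert`, `load_model`, the batch `embed`, and the `anyhow` error type are all assumed signatures, not the real API:

```rust
// Sketch only: pool/model signatures are assumptions.
use anyhow::Result;

async fn generate_embeddings_batch_sketch(
    model_name: &str,
    texts: &[String],
) -> Result<Vec<Vec<f32>>> {
    let mut out: Vec<Option<Vec<f32>>> = vec![None; texts.len()];
    let mut miss_idx: Vec<usize> = Vec::new();

    // 1. Per-text cache check against the shared pool.
    for (i, text) in texts.iter().enumerate() {
        let key = EmbeddingKey::new(model_name, text);
        match EMBEDDING_POOL.get(&key).await {
            Some(cached) => out[i] = Some(cached),
            None => miss_idx.push(i),
        }
    }

    // 2. One model invocation for the whole miss subset (not N
    //    single-flights) keeps the GPU/ONNX path efficient. An all-hit
    //    batch never reaches this block — the 0/64 short-circuit.
    if !miss_idx.is_empty() {
        let miss_texts: Vec<&str> =
            miss_idx.iter().map(|&i| texts[i].as_str()).collect();
        let fresh = load_model(model_name).await?.embed(&miss_texts)?;

        // 3. Per-text insert so every future caller — IPC or internal — hits.
        for (&i, vector) in miss_idx.iter().zip(fresh) {
            EMBEDDING_POOL
                .insert(EmbeddingKey::new(model_name, &texts[i]), vector.clone())
                .await;
            out[i] = Some(vector);
        }
    }

    Ok(out.into_iter().map(|v| v.expect("every slot filled")).collect())
}
```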

Tests (5/5 passing, 50/50 paging tests still passing):

  - generate_embeddings_batch_hits_pool_before_loading_model — proves
    cache hit short-circuits BEFORE fastembed model load. Pre-populates
    pool with vector for "nonexistent-model-name"; if the cache path
    were broken, model load would fail. This is the regression test for
    the 0/64 bug.
  - single_embedding_hits_pool_before_loading_model — same proof for
    single-text generate_embedding
  - embedding_key_is_model_namespaced — same text + different model =
    distinct cache slots (correctness invariant)
  - pool_clear_resets_stats_and_drops_entries — pool.clear() drains
    entries AND resets hit/miss/eviction counters
  - batch_with_partial_hits_records_correct_hit_count — an all-cached
    batch (the zero-miss edge of the partial-hit path) returns without
    model.embed() and records the correct hit count

Tests serialize via TEST_LOCK on the global pool because EMBEDDING_POOL
is a process-global singleton (matches IPC reality) and parallel test
execution would race the shared hit/miss counters.

Branch base: feature/pressure-broker (PR #932), needs stats_blocking +
PressureSource. When #932 lands, this rebases cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request Apr 18, 2026
Phase 3a of the LoRAAdapterPool migration: ship the eviction lever the
PressureBroker needs to drive genome adapter eviction without an
activate_skill call, plus the value-aware EvictionPriority<V> the deeper
Phase 3b migration will need to express priority-weighted scoring as a
pool callback. Bridge step — keeps GenomePagingEngine's 21 existing
tests intact while making it broker-driveable.

What's in this commit:

  - PagedResourcePool: EvictionPriority generalized to EvictionPriority<V>
    so consumers can read domain-specific metadata from the value during
    eviction scoring (an adapter's priority field, an MoE expert's routing
    weight, a memory-recall entry's salience). lru_priority and
    size_weighted_lru stay value-blind via the V param being unused. This
    is the lever Phase 3b's full pool migration of GenomePagingEngine
    will use to express the existing `age_seconds / (priority * 10)` formula
    as a pool callback (see the score sketch after this list).

  - GenomePagingEngine::evict_under_pressure(target_pressure: f32) -> u64
    Drives the existing select_eviction_victim in a loop until memory
    pressure drops to target, returning bytes freed. Same victim-selection
    formula and critical-adapter protection (priority > 0.9) as
    activate_skill's implicit eviction — no behavioral divergence, just a
    different driver. The same code path picks the victim either way; this
    commit just lets the broker call it (see the loop sketch after this
    list).

  - cognition/genome-evict-under-pressure IPC handler — exposes the lever
    so it's testable manually + ready for the broker singleton to call
    once Phase 3b wires per-persona PressureSource wrappers into the
    DashMap<persona_id, PersonaCognition>.
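
For illustration, the existing genome formula expressed as the value-aware score EvictionPriority<V> enables — the `LoraAdapter` layout and the callback's exact signature are assumptions:

```rust
// Sketch only: field layout and signature are assumed.
struct LoraAdapter {
    priority: f32, // 0.0..=1.0; > 0.9 means critical, never evicted
    // weights, GPU guard, ...
}

/// Higher score = better eviction victim. Reading `v.priority` off the
/// value is what EvictionPriority<V> enables; the value-blind policies
/// (lru_priority, size_weighted_lru) simply ignore `v`.
fn genome_eviction_score(age_seconds: f64, v: &LoraAdapter) -> f64 {
    age_seconds / (v.priority as f64 * 10.0)
}
```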
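And a hedged sketch of the eviction loop's shape — `memory_pressure` and `deactivate` are assumed helper names; `select_eviction_victim` is the existing selector the commit reuses:

```rust
// Sketch only: memory_pressure() and deactivate() are assumed helpers.
impl GenomePagingEngine {
    pub fn evict_under_pressure(&mut self, target_pressure: f32) -> u64 {
        let mut freed: u64 = 0;
        while self.memory_pressure() > target_pressure {
            // Same victim formula as activate_skill's implicit eviction;
            // critical adapters (priority > 0.9) are never selected.
            let Some(victim) = self.select_eviction_victim() else {
                break; // only critical adapters remain — terminate honestly
            };
            freed += self.deactivate(victim); // drops GPU guard, repatriates
        }
        freed
    }
}
```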

Tests (4 new + 21 existing, all passing — 38 total in genome_paging
suite, 54 in paging, 5 in modules::embedding = 97 across the stack):

  - test_evict_under_pressure_no_op_when_below_target — already-healthy
    pool returns 0 without touching anything
  - test_evict_under_pressure_drops_lru_until_target — three normal
    adapters at 90% pressure → drop oldest two to reach ≤ 50% target
  - test_evict_under_pressure_never_drops_critical_adapters — all-critical
    pool yields zero bytes freed; loop terminates honestly without
    panicking or dropping a priority>0.9 adapter
  - test_evict_under_pressure_releases_gpu_guard_for_each_victim —
    verifies guard cleanup + available-map repatriation, even with no
    gpu_manager set (clean fallback)

Phase 3b (separate PR — needs broker singleton wiring):

  - PressureSource wrapper struct over Arc<Mutex<GenomePagingEngine>>
  - Internal pool migration (active HashMap → PagedResourcePool with
    EvictionPriority<V> reading adapter.priority)
  - GpuAllocationGuard moves into the pool's value type (Drop releases
    GPU memory on eviction without engine intervention)

Branch base: feature/embedding-cache-pool (PR #933) → which is based on
feature/pressure-broker (PR #932). When both upstream PRs land, this
rebases cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from feature/pressure-broker to main April 19, 2026 00:35
@joelteply joelteply merged commit 8981c8e into main Apr 19, 2026
3 checks passed
@joelteply joelteply deleted the feature/embedding-cache-pool branch April 19, 2026 00:35