
feat(paging): EmbeddingCache → PagedResourcePool (Phase 2 — fixes 0/64 hit rate) #933

Merged
joelteply merged 2 commits into main from feature/embedding-cache-pool on Apr 19, 2026

Conversation

@joelteply
Contributor

Summary

The 0/64 hit rate fix the architecture doc promised. Internal Rust callers (`DataModule.backfill_vectors`, `ModuleBackedEmbeddingProvider`, anywhere using `generate_embedding{,s_batch}`) were silently bypassing the embedding cache because it lived inside the IPC handler — `handle_generate` had its own ad-hoc `HashMap<(String, u64), CachedEmbedding>` with djb2 keys + 5-minute TTL + 10k count cap. Internal callers went around it.

This PR moves the cache one layer down: the pool sits behind `generate_embedding{,s_batch}`, so every embedder — IPC handler and Rust caller — hits the same cache uniformly. The IPC handler becomes a thin wrapper around the batched function. The architecture-doc claim is now true.

What's in this commit

  • `EmbeddingKey { model: String, content_hash: [u8; 32] }` — structured key (Joel's pass-the-struct rule), model-namespaced (different models map text → different vectors → distinct slots), SHA-256 content hash (fixed-size, collision-resistant vs djb2) — see the key sketch after this list
  • `EMBEDDING_POOL: PagedResourcePool<EmbeddingKey, Vec<f32>>` replaces `EMBEDDING_CACHE`. 256 MB byte-driven budget (~170k entries at 384d FP32, vs the old 10k count cap), `size_weighted_lru` eviction. Eventually overridden by recipe-declared budgets (Phase 9) and pressure broker (Phase 7b).
  • `generate_embedding` / `generate_embeddings_batch` are pool-aware — the actual user-visible behavior change. Batch path: per-text cache check → one `model.embed()` for the miss subset → per-text insert. One model invocation per batch (vs N for per-text single-flight) keeps the GPU/ONNX path efficient.
  • `handle_generate` collapses to ~15 lines (was ~70) by delegating to `generate_embeddings_batch`
  • `handle_cache_stats` now exposes `pressure` + `maxBytes` + `evictions` + `pinned` alongside hits/misses — broker-aware cache visibility for free
  • `handle_cache_clear` → `pool.clear()` (new pool method that drains entries + resets hit/miss/eviction counters)
  • `PagedResourcePool::clear()` — admin-level reset, drops pinned too. Distinct from `evict_under_pressure` (which respects pins). Documented inline.
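
For concreteness, a minimal sketch of the key type from the list above — the struct fields match the PR, the `new` constructor is illustrative, and the `sha2` crate is an assumption. The budget arithmetic checks out: at 384 dims × 4 bytes, one FP32 vector is ~1.5 KB, so 256 MB holds on the order of 170k entries.

```rust
// Sketch only: constructor name is illustrative; assumes the sha2 crate.
use sha2::{Digest, Sha256};

/// Model-namespaced cache key: the same text embedded by two different
/// models occupies two distinct slots.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct EmbeddingKey {
    pub model: String,
    pub content_hash: [u8; 32],
}

impl EmbeddingKey {
    /// SHA-256 of the raw text: fixed-size and collision-resistant,
    /// unlike the old 64-bit djb2 hash.
    pub fn new(model: &str, text: &str) -> Self {
        Self {
            model: model.to_string(),
            content_hash: Sha256::digest(text.as_bytes()).into(),
        }
    }
}
```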

Tests (5/5 passing — `cargo test --lib modules::embedding::tests`)

  • `generate_embeddings_batch_hits_pool_before_loading_model` — the regression test for the 0/64 bug. Pre-populates pool with vector for model "nonexistent-model-name"; the call returns the cached vector without trying to load that nonexistent model. If the cache path were broken, model load would fail — proves the short-circuit works.
  • `single_embedding_hits_pool_before_loading_model` — same proof for single-text path
  • `embedding_key_is_model_namespaced` — same text + different model = distinct cache slots (correctness invariant)
  • `pool_clear_resets_stats_and_drops_entries` — `pool.clear()` drains entries AND resets counters
  • `batch_with_partial_hits_records_correct_hit_count` — an all-cached batch (the zero-miss edge of the partial-hit path) returns without calling `model.embed()` and records the correct hit count

Tests serialize via `TEST_LOCK` on the global pool — `EMBEDDING_POOL` is a process-global singleton (matches IPC reality) and parallel test execution would race shared hit/miss counters. The pure key-shape test doesn't need the lock.
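
For reference, the serialization pattern is the standard one for tests against a process-global singleton — a minimal sketch with illustrative names, not the PR's actual code:

```rust
// Sketch only: names are illustrative.
use std::sync::Mutex;

/// Tests that touch the process-global EMBEDDING_POOL hold this lock so
/// a parallel test can't race the shared hit/miss counters.
static TEST_LOCK: Mutex<()> = Mutex::new(());

#[test]
fn pool_test_template() {
    // Recover from poisoning so one panicked test doesn't cascade.
    let _guard = TEST_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    // ... exercise the global pool, assert on hits/misses ...
}
```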

50/50 paging tests still pass — no regressions.

Branch base

This PR is based on `feature/pressure-broker` (PR #932), because Phase 2 needs `stats_blocking` (sync stats for the cache_stats handler called from sync IPC code). When #932 lands, this rebases cleanly onto main.

Test plan

  • `cargo test --features metal,accelerate --lib modules::embedding::tests` — 5/5 pass
  • `cargo test --features metal,accelerate --lib paging` — 50/50 pass (no regression)
  • `cargo build --features metal,accelerate -p continuum-core` — clean (0 errors, only pre-existing warnings)
  • Live verification (after merge): run codebase indexer, watch `./jtag embedding/cache/stats` — hit rate should climb past 0% as the same chunk text hits the cache from both backfill_vectors and IPC paths

🤖 Generated with Claude Code

joelteply and others added 2 commits April 17, 2026 21:49
The PagedResourcePool primitive is the per-resource brain. The broker is
the cross-resource brain: one orchestrator that reads pressure from every
registered pool, decides which to relieve, and pulls the eviction lever.
Same broker is the future home of recipe-aware priority arbitration,
ML-policy-driven tiering, and LLM-mediated control for novel pressure
situations — those land in 7b/7c without changing the pool API.

What lands here:

  - PressureSource trait — name + pressure + evict_some + stats (see the sketch after this list)
  - PressureBroker struct with register/unregister/relieve/snapshot/spawn_tick
  - PressureTier (Normal/Warning/High/Critical at 0.6/0.8/0.95)
  - Tier-driven relief: Normal/Warning observe; High evicts worst pool;
    Critical evicts all over-budget pools
  - Blanket impl: every PagedResourcePool<K, V> auto-satisfies
    PressureSource — consumers register Arc<pool>, no adapter struct
  - PagedResourcePool::evict_under_pressure (broker-callable eviction)
  - PagedResourcePool::stats_blocking (sync stats for non-async trait)
  - PagedResourcePool::config_name (stable name accessor)
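
A hedged sketch of what the trait and tier mapping above might look like — the signatures are assumptions (the real trait also exposes stats, omitted here), only the names and thresholds come from the commit:

```rust
// Sketch only: exact signatures are assumed; stats accessor omitted.
pub trait PressureSource: Send + Sync {
    fn name(&self) -> &str;
    /// 0.0 = empty, 1.0 = at budget.
    fn pressure(&self) -> f32;
    /// Evict toward `target` pressure; returns bytes freed.
    fn evict_some(&self, target: f32) -> u64;
}

/// Tier thresholds from the commit: 0.6 / 0.8 / 0.95.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PressureTier {
    Normal,   // < 0.60: observe
    Warning,  // 0.60..0.80: observe
    High,     // 0.80..0.95: evict the single worst pool
    Critical, // >= 0.95: evict every over-budget pool
}

impl PressureTier {
    pub fn from_pressure(p: f32) -> Self {
        if p >= 0.95 {
            Self::Critical
        } else if p >= 0.80 {
            Self::High
        } else if p >= 0.60 {
            Self::Warning
        } else {
            Self::Normal
        }
    }
}
```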

Tests (8/8 passing):

  - tier_thresholds_match_gpu_pressure_watcher
  - no_action_when_all_pools_normal
  - high_pressure_evicts_only_worst_pool
  - critical_pressure_evicts_all_over_budget_pools
  - registration_dedups_by_name
  - unregister_removes_pool
  - real_paged_resource_pool_plugs_into_broker_via_blanket_impl
  - snapshot_orders_pools_by_pressure_descending

The integration test proves the architectural point: build a real
PagedResourcePool<String, Vec<u8>>, fill it past 0.80 pressure,
register it via blanket impl, call broker.relieve(), watch pressure
drop. No adapter struct, no per-consumer wiring boilerplate.

Doc updated: RESOURCE-ARCHITECTURE.md Phase 7 status + 7a/7b/7c split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 0/64 hit rate fix. Internal Rust callers (DataModule.backfill_vectors,
ModuleBackedEmbeddingProvider, anywhere using generate_embedding{,s_batch})
were silently bypassing the embedding cache because it was wired only into
the IPC handler — handle_generate had its own ad-hoc HashMap<(String, u64),
CachedEmbedding> with djb2 keys + 5-minute TTL + 10k count cap.

This commit moves the cache one layer down: the pool lives behind
generate_embedding{,s_batch}, so every embedder — IPC and Rust — hits the
same cache uniformly. The IPC handler becomes a thin wrapper around the
batched function. The architecture-doc claim that the migration "fixes
the 0/64 hit rate" is now true.

What's in this commit:

  - EmbeddingKey { model: String, content_hash: [u8; 32] } — structured
    key, model-namespaced (different models map text → different vectors,
    so they need distinct cache slots), SHA-256 content hash (fixed-size,
    collision-resistant vs djb2)
  - EMBEDDING_POOL: PagedResourcePool<EmbeddingKey, Vec<f32>> replaces
    EMBEDDING_CACHE. 256 MB byte-driven budget (~170k entries at 384d FP32,
    >> previous 10k count cap), size_weighted_lru eviction, eventually
    overridden by recipe-declared budgets (Phase 9) + pressure broker (7b)
  - generate_embedding / generate_embeddings_batch are pool-aware — the
    actual user-visible behavior change. Batch path: per-text cache check,
    one model.embed() for the miss subset, per-text insert. Single path:
    pool.get → fall through to model.embed → pool.insert. (The batch path
    is sketched after this list.)
  - handle_generate collapses to ~15 lines (was ~70) by delegating to
    generate_embeddings_batch
  - handle_cache_stats now exposes pressure + max_bytes + evictions in
    addition to hits/misses/size — broker-aware cache visibility
  - handle_cache_clear → pool.clear() (new pool method that drains entries
    + resets hit/miss/eviction counters)
  - PagedResourcePool::clear() — admin-level reset, drops pinned too,
    distinct from evict_under_pressure (which respects pins)
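
A minimal sketch of the batch path just described — `EMBEDDING_POOL.get`/`.insert`, `load_model`, the batch `embed`, and the `anyhow` error type are all assumed signatures, not the real API:

```rust
// Sketch only: pool/model signatures are assumptions.
use anyhow::Result;

async fn generate_embeddings_batch_sketch(
    model_name: &str,
    texts: &[String],
) -> Result<Vec<Vec<f32>>> {
    let mut out: Vec<Option<Vec<f32>>> = vec![None; texts.len()];
    let mut miss_idx: Vec<usize> = Vec::new();

    // 1. Per-text cache check against the shared pool.
    for (i, text) in texts.iter().enumerate() {
        let key = EmbeddingKey::new(model_name, text);
        match EMBEDDING_POOL.get(&key).await {
            Some(cached) => out[i] = Some(cached),
            None => miss_idx.push(i),
        }
    }

    // 2. One model invocation for the whole miss subset (not N
    //    single-flights) keeps the GPU/ONNX path efficient. An all-hit
    //    batch never reaches this block — the 0/64 short-circuit.
    if !miss_idx.is_empty() {
        let miss_texts: Vec<&str> =
            miss_idx.iter().map(|&i| texts[i].as_str()).collect();
        let fresh = load_model(model_name).await?.embed(&miss_texts)?;

        // 3. Per-text insert so every future caller — IPC or internal — hits.
        for (&i, vector) in miss_idx.iter().zip(fresh) {
            EMBEDDING_POOL
                .insert(EmbeddingKey::new(model_name, &texts[i]), vector.clone())
                .await;
            out[i] = Some(vector);
        }
    }

    Ok(out.into_iter().map(|v| v.expect("every slot filled")).collect())
}
```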

Tests (5/5 passing, 50/50 paging tests still passing):

  - generate_embeddings_batch_hits_pool_before_loading_model — proves
    cache hit short-circuits BEFORE fastembed model load. Pre-populates
    pool with vector for "nonexistent-model-name"; if the cache path
    were broken, model load would fail. This is the regression test for
    the 0/64 bug.
  - single_embedding_hits_pool_before_loading_model — same proof for
    single-text generate_embedding
  - embedding_key_is_model_namespaced — same text + different model =
    distinct cache slots (correctness invariant)
  - pool_clear_resets_stats_and_drops_entries — pool.clear() drains
    entries AND resets hit/miss/eviction counters
  - batch_with_partial_hits_records_correct_hit_count — an all-cached
    batch (the zero-miss edge of the partial-hit path) returns without
    model.embed() and records the correct hit count

Tests serialize via TEST_LOCK on the global pool because EMBEDDING_POOL
is a process-global singleton (matches IPC reality) and parallel test
execution would race the shared hit/miss counters.

Branch base: feature/pressure-broker (PR #932), needs stats_blocking +
PressureSource. When #932 lands, this rebases cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request Apr 18, 2026
Phase 3a of the LoRAAdapterPool migration: ship the eviction lever the
PressureBroker needs to drive genome adapter eviction without an
activate_skill call, plus the value-aware EvictionPriority<V> the deeper
Phase 3b migration will need to express priority-weighted scoring as a
pool callback. Bridge step — keeps GenomePagingEngine's 21 existing
tests intact while making it broker-driveable.

What's in this commit:

  - PagedResourcePool: EvictionPriority generalized to EvictionPriority<V>
    so consumers can read domain-specific metadata from the value during
    eviction scoring (an adapter's priority field, an MoE expert's routing
    weight, a memory-recall entry's salience). lru_priority and
    size_weighted_lru stay value-blind via the V param being unused. This
    is the lever Phase 3b's full pool migration of GenomePagingEngine
    will use to express the existing `age_seconds / (priority * 10)` formula
    as a pool callback (see the score sketch after this list).

  - GenomePagingEngine::evict_under_pressure(target_pressure: f32) -> u64
    Drives the existing select_eviction_victim in a loop until memory
    pressure drops to target, returning bytes freed. Same victim-selection
    formula and critical-adapter protection (priority > 0.9) as
    activate_skill's implicit eviction — no behavioral divergence, just a
    different driver. The same code path picks the victim either way; this
    commit just lets the broker call it (see the loop sketch after this
    list).

  - cognition/genome-evict-under-pressure IPC handler — exposes the lever
    so it's testable manually + ready for the broker singleton to call
    once Phase 3b wires per-persona PressureSource wrappers into the
    DashMap<persona_id, PersonaCognition>.
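
For illustration, the existing genome formula expressed as the value-aware score EvictionPriority<V> enables — the `LoraAdapter` layout and the callback's exact signature are assumptions:

```rust
// Sketch only: field layout and signature are assumed.
struct LoraAdapter {
    priority: f32, // 0.0..=1.0; > 0.9 means critical, never evicted
    // weights, GPU guard, ...
}

/// Higher score = better eviction victim. Reading `v.priority` off the
/// value is what EvictionPriority<V> enables; the value-blind policies
/// (lru_priority, size_weighted_lru) simply ignore `v`.
fn genome_eviction_score(age_seconds: f64, v: &LoraAdapter) -> f64 {
    age_seconds / (v.priority as f64 * 10.0)
}
```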
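And a hedged sketch of the eviction loop's shape — `memory_pressure` and `deactivate` are assumed helper names; `select_eviction_victim` is the existing selector the commit reuses:

```rust
// Sketch only: memory_pressure() and deactivate() are assumed helpers.
impl GenomePagingEngine {
    pub fn evict_under_pressure(&mut self, target_pressure: f32) -> u64 {
        let mut freed: u64 = 0;
        while self.memory_pressure() > target_pressure {
            // Same victim formula as activate_skill's implicit eviction;
            // critical adapters (priority > 0.9) are never selected.
            let Some(victim) = self.select_eviction_victim() else {
                break; // only critical adapters remain — terminate honestly
            };
            freed += self.deactivate(victim); // drops GPU guard, repatriates
        }
        freed
    }
}
```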

Tests (4 new + 21 existing, all passing — 38 total in genome_paging
suite, 54 in paging, 5 in modules::embedding = 97 across the stack):

  - test_evict_under_pressure_no_op_when_below_target — already-healthy
    pool returns 0 without touching anything
  - test_evict_under_pressure_drops_lru_until_target — three normal
    adapters at 90% pressure → drop oldest two to reach ≤ 50% target
  - test_evict_under_pressure_never_drops_critical_adapters — all-critical
    pool yields zero bytes freed; loop terminates honestly without
    panicking or dropping a priority>0.9 adapter
  - test_evict_under_pressure_releases_gpu_guard_for_each_victim —
    verifies guard cleanup + available-map repatriation, even with no
    gpu_manager set (clean fallback)

Phase 3b (separate PR — needs broker singleton wiring):

  - PressureSource wrapper struct over Arc<Mutex<GenomePagingEngine>>
  - Internal pool migration (active HashMap → PagedResourcePool with
    EvictionPriority<V> reading adapter.priority)
  - GpuAllocationGuard moves into the pool's value type (Drop releases
    GPU memory on eviction without engine intervention)

Branch base: feature/embedding-cache-pool (PR #933) → which is based on
feature/pressure-broker (PR #932). When both upstream PRs land, this
rebases cleanly onto main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from feature/pressure-broker to main April 19, 2026 00:35
@joelteply joelteply merged commit 8981c8e into main Apr 19, 2026
3 checks passed
@joelteply joelteply deleted the feature/embedding-cache-pool branch April 19, 2026 00:35