
feature(semantic-cache): add discovery marker protocol (0.2.0)#127

Merged
jamby77 merged 8 commits into master from feature/semantic-cache-discovery-markers on Apr 29, 2026
Conversation

Collaborator

jamby77 commented Apr 24, 2026

Summary

Implements the semantic-cache side of a shared __betterdb:* discovery marker protocol. BetterDB Monitor (and the upcoming Cache Intelligence MCP tools) can now enumerate semantic-cache instances on any Valkey they already monitor — no customer configuration beyond constructing the library as before.

This is the first of a pair: the agent-cache counterpart ships in a separate PR.

Protocol

  • __betterdb:caches — hash. Field = options.name, value = JSON metadata (type, prefix, version, protocol_version, capabilities, index/stats/config keys, default threshold, etc).
  • __betterdb:heartbeat:<name> — string, TTL 60s, refreshed every 30s.
  • __betterdb:protocol — string, SET NX 1. Version marker.

No sensitive data is written — only cache metadata.
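The three markers above can be sketched against a minimal in-memory stand-in for a Valkey client. This is illustrative only: the `Store` type and `registerMarkers` helper are invented for the example, not the library's API, and TTL handling is elided.

```typescript
// Illustrative only: a Map-based stand-in for the Valkey commands the
// protocol uses (HSET, SET EX, SET NX). Key names match the protocol.
type Store = {
  hashes: Map<string, Map<string, string>>;
  strings: Map<string, string>;
};

function registerMarkers(store: Store, name: string, meta: object): void {
  // __betterdb:caches: HSET field = cache name, value = JSON metadata
  let registry = store.hashes.get("__betterdb:caches");
  if (!registry) {
    registry = new Map();
    store.hashes.set("__betterdb:caches", registry);
  }
  registry.set(name, JSON.stringify(meta));

  // __betterdb:heartbeat:<name>: would be SET with EX 60 on a real client
  store.strings.set(`__betterdb:heartbeat:${name}`, "1");

  // __betterdb:protocol: SET NX, so only the first writer wins
  if (!store.strings.has("__betterdb:protocol")) {
    store.strings.set("__betterdb:protocol", "1");
  }
}
```

On a real client the heartbeat write would carry the 60s TTL and be repeated every 30s by the interval described below.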

What lands

  • initialize() registers the instance in __betterdb:caches and starts a 30s heartbeat with a 60s TTL
  • flush() and a new dispose() method stop the heartbeat and delete the heartbeat key, preserving the registry entry so Monitor retains history after the cache is gone
  • Name collisions across cache types (e.g. constructing a semantic-cache where an agent-cache already holds the same name) throw SemanticCacheUsageError on initialize(). All other Valkey write failures are best-effort and increment a new semantic_cache_discovery_write_failed_total counter so operators can alert on ACL issues
  • Opt out via SemanticCacheOptions.discovery = { enabled: false }. heartbeatIntervalMs and includeCategories are also exposed
  • threshold_adjust is intentionally omitted from published capabilities until SemanticCache.check() re-reads the config hash at runtime — that's a separate follow-up
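The opt-out surface in the bullets above might look like the following sketch. Field names are taken from the PR text; the exact types and defaults are assumptions.

```typescript
// Assumed shape of SemanticCacheOptions.discovery per the PR description.
interface DiscoveryOptions {
  enabled?: boolean;            // false skips all marker writes
  heartbeatIntervalMs?: number; // PR text says 30s refresh, 60s TTL
  includeCategories?: boolean;  // false drops category fields from metadata
}

function resolveDiscovery(opts?: DiscoveryOptions) {
  return {
    enabled: opts?.enabled ?? true,
    heartbeatIntervalMs: opts?.heartbeatIntervalMs ?? 30_000,
    includeCategories: opts?.includeCategories ?? true,
  };
}
```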

Bumps @betterdb/semantic-cache 0.1.0 → 0.2.0.

Test plan

  • pnpm --filter @betterdb/semantic-cache test — 43 tests pass, including 13 new unit tests covering schema correctness, cross-type collision (throws), version-mismatch warn+overwrite, ACL-denied resilience, heartbeat TTL, opt-out, and registry-not-touched-on-stop
  • pnpm --filter @betterdb/semantic-cache typecheck clean
  • pnpm --filter @betterdb/semantic-cache build clean
  • Integration tests (2 new) run in CI against valkey/valkey-bundle with the Search module: verify HGETALL __betterdb:caches is populated after initialize(), heartbeat TTL in range, dispose() clears heartbeat key, discovery: { enabled: false } writes nothing

Note

Medium Risk
Adds new best-effort Valkey writes and a background heartbeat tied to initialize()/flush()/dispose(), which could affect deployments with strict ACLs or unexpected keyspace policies. Core cache behavior is largely unchanged, but lifecycle/initialization paths now have more moving parts and timing concerns.

Overview
Adds a BetterDB discovery-marker protocol to @betterdb/semantic-cache (0.1.0 → 0.2.0): initialize() now registers cache metadata in __betterdb:caches, sets __betterdb:protocol, and maintains a 60s-TTL __betterdb:heartbeat:<name> refreshed on an interval (opt-out via SemanticCacheOptions.discovery).

Introduces cache.dispose() for graceful shutdown and updates flush()/initialization concurrency handling to stop/delete the heartbeat without deleting the registry entry; all discovery writes are best-effort with a new Prometheus counter semantic_cache_discovery_write_failed_total for ACL/permission failures, plus tests covering collisions, ACL-denied behavior, and heartbeat semantics.

Reviewed by Cursor Bugbot for commit 2d136fc.

jamby77 added a commit that referenced this pull request Apr 24, 2026
Three findings from Cursor Bugbot on PR #127:

- register() now writes the initial heartbeat key before scheduling the
  setInterval, so Monitor sees the cache as alive immediately after
  initialize() returns instead of waiting up to heartbeatIntervalMs for
  the first scheduled tick
- tickHeartbeat() SET failures now call onWriteFailed(), so the
  semantic_cache_discovery_write_failed_total counter reflects heartbeat
  ACL issues as documented
- SemanticCache._doInitialize() now sets _initialized = true only after
  registerDiscovery() succeeds, so a name-collision error correctly
  leaves the cache unusable (previous order let check()/store() still
  work after initialize() rejected)

Regression tests added for the initial heartbeat write and the
heartbeat-failure counter path.
Collaborator Author

jamby77 commented Apr 24, 2026

Code review

Found 1 issue:

  1. _doInitialize widens an existing race window. The _initGeneration guard at L365 runs once, before the newly added await this.registerDiscovery() at L367. If flush() fires during the multiple network awaits inside DiscoveryManager.register() (HGET + HSET + SET protocol + tickHeartbeat), it bumps the generation and sets _initialized = false — but once registerDiscovery() resumes, L368 re-asserts _initialized = true, and the heartbeat interval (scheduled from inside manager.register() before the method returned) keeps running against a cache whose index has already been dropped. The same window affects dispose(): this.discovery is still null when dispose() is called, so it no-ops; manager.register() then completes, schedules the interval, and assigns this.discovery = manager — a leaked heartbeat the caller thinks they've stopped. Consider re-checking this._initGeneration !== gen (and/or a shutdown/disposed flag) after await registerDiscovery() resolves, and tearing the manager down if it changed — mirroring the shutdownCalled check that @betterdb/agent-cache uses around its fire-and-forget analytics init.

```ts
private async _doInitialize(): Promise<void> {
  const gen = this._initGeneration;
  return this.traced('initialize', async () => {
    const dim = await this.ensureIndexAndGetDimension();
    // If flush() ran while we were initializing, don't overwrite its state.
    if (this._initGeneration !== gen) return;
    this._dimension = dim;
    await this.registerDiscovery();
    this._initialized = true;
  });
}
```

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

jamby77 added a commit that referenced this pull request Apr 24, 2026
…dispose

Second review pass on PR #127: the _initGeneration guard in
_doInitialize runs before the new await this.registerDiscovery(), but
registerDiscovery has its own internal awaits (HGET + HSET + SET +
tickHeartbeat) during which flush() or dispose() can slip in. With the
old code, those concurrent calls saw this.discovery still null and
no-opped, then registerDiscovery finished, the manager's interval was
already scheduled, and _initialized = true silently overrode the
concurrent flush's _initialized = false — leaving the cache with a
leaked heartbeat and no index.

Fix:
- Return the DiscoveryManager from registerDiscovery() instead of
  assigning to this.discovery internally
- In _doInitialize, after the await, re-check both _initGeneration
  (flush guard) and a new _disposed flag. If either indicates the cache
  was torn down mid-register, stop the manager we just created
  (deleting the heartbeat key) and return without setting
  _initialized = true or this.discovery = manager
- dispose() sets _disposed = true before stopping the existing manager
  so an in-flight registration sees the signal
- _doInitialize resets _disposed = false at entry so a cache can be
  re-initialized after dispose()
jamby77 added a commit that referenced this pull request Apr 27, 2026
CacheProposalService with type-specific validation, duplicate-pending
rejection, and per-connection 30/hour rate limit. Resolves caches via
HGETALL __betterdb:caches (discovery markers from PRs #127/#128).
Typed errors (validation, invalid cache type, not found, duplicate,
rate limited) are mapped to HTTP status codes at the MCP controller.

Three MCP tools registered:
- cache_propose_threshold_adjust (semantic_cache only, 0..2 range)
- cache_propose_tool_ttl_adjust  (agent_cache only, 10..86400 range)
- cache_propose_invalidate       (filter_kind discriminates by type;
                                  warns when estimated_affected > 10000)

All proposals start as 'pending' with a 24h expiry, awaiting human
approval (Day 5 spec). The module imports StorageModule for proposal
CRUD and ConnectionsModule for the Valkey client used to read markers.

Stacks on feature/cache-proposal-data-model (PR #134).
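Cache resolution against the discovery hash, as described above, might look like this sketch. The error strings echo the typed errors named in the commit message; the marker shape and exact strings are assumptions.

```typescript
// Sketch: resolve a cache by name from an HGETALL of __betterdb:caches.
type CacheMeta = { type: string; prefix: string };

function resolveCache(
  markers: Record<string, string>, // HGETALL result: name -> JSON metadata
  name: string,
  wantType: string,
): CacheMeta {
  const raw = markers[name];
  if (raw === undefined) throw new Error("CACHE_NOT_FOUND");         // mapped to 404
  const meta = JSON.parse(raw) as CacheMeta;
  if (meta.type !== wantType) throw new Error("INVALID_CACHE_TYPE"); // mapped to 400
  return meta;
}
```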
Member

KIvanow commented Apr 27, 2026

A few things

  1. Please resolve the conflicts
  2. The version and changelog will need to be bumped
  3. What new metadata does this PR introduce, beyond flagging whether the DB is used from our library?
  4. This will need to be ported to the Python library in a separate PR as well, but I can take a crack at that after we clear the first 3 questions

jamby77 added 5 commits April 27, 2026 19:59
Implements the semantic-cache side of the shared __betterdb:* discovery
marker protocol. Consumers (BetterDB Monitor, the upcoming cache
intelligence MCP tools) can now enumerate semantic caches on any
monitored Valkey instance without customer configuration.

- initialize() registers the instance in the __betterdb:caches hash
  and starts a 30s heartbeat with a 60s TTL
- flush() and new dispose() method stop the heartbeat and remove the
  heartbeat key, preserving the registry entry so Monitor retains
  history
- Name collisions across cache types (semantic vs agent) throw; all
  other Valkey write failures are best-effort and increment the new
  semantic_cache_discovery_write_failed_total counter
- Opt out via SemanticCacheOptions.discovery.enabled = false
- threshold_adjust intentionally omitted from capabilities until
  SemanticCache.check() re-reads the config hash at runtime

Bumps @betterdb/semantic-cache to 0.2.0. Spec:
docs/plans/specs/spec-semantic-cache-discovery-markers.md
- CacheType in this package is always 'semantic_cache' — dropped the
  sibling-union, exported the literal as CACHE_TYPE, and switched the
  collision check to compare against that constant directly
- Read PACKAGE_VERSION from the library's own package.json at runtime
  instead of duplicating the version string in source
- Stripped explanatory comments from production source per project rule
…ignalling abort

Previous fix used a _disposed flag that made _doInitialize early-return
on concurrent dispose(). The downside was the stuck-state bug bugbot
flagged: _initPromise retained a resolved promise after the early
return, so a subsequent initialize() short-circuited via
!this._initPromise and never re-ran _doInitialize — leaving the cache
permanently un-initialized with no recovery path other than flush().

Switch to the same pattern used by @betterdb/agent-cache's shutdown():
dispose() awaits the in-flight _initPromise before tearing down the
discovery manager. This lets _doInitialize complete normally — assigning
this.discovery = manager and _initialized = true — then dispose's
existing code stops the manager and deletes the heartbeat key. Removes
the _disposed field and the second guard check; _initGeneration still
covers the flush()-during-init race.

_initPromise.catch(() => {}) guards against propagating a reject from
the in-flight init (e.g. collision error) into dispose()'s caller.
jamby77 force-pushed the feature/semantic-cache-discovery-markers branch from 2d136fc to 6c7b1f1 on April 27, 2026 17:04
jamby77 force-pushed the feature/semantic-cache-discovery-markers branch from 6c7b1f1 to ee0f92d on April 27, 2026 17:20
@cursor Bot commented

Cursor Bugbot has reviewed your changes and found 1 potential issue. There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit ee0f92d.

Collaborator Author

jamby77 commented Apr 27, 2026

A few things

  1. Please resolve the conflicts

Rebased

  1. The version and changelog will need to be bumped

Bumped in package.json and CHANGELOG.md

  1. What new metadata does this PR introduce, beyond flagging whether the DB is used from our library?

The PR adds three DB keys, all under a __betterdb:* namespace:

  • __betterdb:caches is a hash where each field is a cache name and the value is a small JSON blob describing that cache. That's how Monitor will enumerate live caches without us having to ship per-tenant config.
  • __betterdb:heartbeat:<name> is a short-TTL key the cache rewrites every 30 seconds. If it's there, the cache is alive; if it's gone, the process died or shut down cleanly. dispose() deletes it explicitly.
  • __betterdb:protocol is just a version string so Monitor can tell when it's talking to an older library.

The JSON in the registry hash has: type (semantic_cache here, agent_cache for the agent-cache package), version, the prefix the cache uses for its own keys (so Monitor can scope SCANs correctly), protocol_version, and a capabilities array — that last one tells Monitor which control-plane operations the running version actually supports, e.g. ["threshold_adjust"].

For semantic-cache specifically, the marker also carries defaultThreshold, categoryThresholds, and uncertaintyBand so Monitor can render the active config without an RPC. If any customer treats their category names as sensitive, they can set discovery: { includeCategories: false } and those fields drop out.

Nothing else changes — the existing {name}:entry:* / {name}:embed:* / {name}:__stats keys are untouched, and the whole protocol can be turned off entirely with discovery: { enabled: false }. All writes are best-effort, so if Valkey rejects them (ACLs, OOM, whatever) check/store keep working.
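As an illustration, a registry hash value with the fields described above could look like the following. The concrete values are invented for the example; only the field names come from the PR text.

```typescript
// Invented example values; field names follow the reply above.
const entry = {
  type: "semantic_cache",
  version: "0.2.0",
  protocol_version: 1,
  prefix: "qa-cache",               // lets Monitor scope SCANs correctly
  capabilities: [] as string[],     // threshold_adjust deliberately absent for now
  defaultThreshold: 0.35,
  categoryThresholds: { faq: 0.3 }, // dropped when includeCategories: false
  uncertaintyBand: 0.05,
};
// Stored as the hash value for field "qa-cache" in __betterdb:caches
const field = JSON.stringify(entry);
```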

jamby77 added 2 commits April 27, 2026 20:35
…re await

bugbot LOW: flush() updated _initialized / _initPromise / _initGeneration
synchronously but called 'await this.discovery.stop()' before nulling
this.discovery. A concurrent _doInitialize() that passed the generation
guard could complete during that await and assign a fresh DiscoveryManager
to this.discovery; flush() would then overwrite it with null and leak
the new manager's heartbeat interval. Capture the reference and null
this.discovery before the await, mirroring how the other state fields
are handled at the top of flush().
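The capture-and-null ordering the commit describes, as a stand-alone sketch. The class is a stand-in; only the pattern mirrors the fix.

```typescript
// Stand-in sketch: flush() captures the manager and nulls the field
// synchronously, so a concurrent init that assigns a fresh manager
// during the await is not clobbered.
class FlushSketch {
  discovery: { stop(): Promise<void> } | null = null;

  async flush(): Promise<void> {
    const manager = this.discovery; // capture the reference
    this.discovery = null;          // null before the await
    if (manager) await manager.stop();
  }
}
```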
KIvanow force-pushed the feature/semantic-cache-discovery-markers branch 2 times, most recently from 2d136fc to af4eefd on April 28, 2026 22:35
jamby77 added a commit that referenced this pull request Apr 29, 2026
## Summary

The 6 read-only MCP tools from
[`spec-cache-mcp-readonly-tools.md`](../../tree/feature/cache-mcp-readonly-tools/docs/plans/specs/spec-cache-mcp-readonly-tools.md).
**Stacked on #134** (cache proposal data model + service); base branch is `feature/cache-proposal-data-model` so it'll auto-rebase to `master` when #134 merges.

- `cache_list` — enumerate `__betterdb:caches`, return hit rate + total ops + live/stale status per cache.
- `cache_health` — discriminated by cache_type. Semantic returns category breakdown + uncertain_hit_rate; agent returns per-tool breakdown. No nullable-field fudging.
- `cache_threshold_recommendation` — semantic only. Replicates `SemanticCache.thresholdEffectiveness()` by reading `{prefix}:__similarity_window` directly.
- `cache_tool_effectiveness` — agent only. Reads `{prefix}:__stats` and applies the same TTL-recommendation rules as `AgentCache.toolEffectiveness()`.
- `cache_similarity_distribution` — exactly 20 buckets of width 0.1 across the 0..2 cosine-distance range, with optional category + window-hours filters.
- `cache_recent_changes` — recent proposals (any status) for a single cache, newest first, so agents can avoid re-proposing already-pending changes.

Per-type dispatch surfaces `INVALID_CACHE_TYPE` (HTTP 400) when a semantic-only or agent-only tool is called against the wrong cache type, and `CACHE_NOT_FOUND` (HTTP 404) when the cache isn't in the discovery markers.

The cache-library methods (`thresholdEffectiveness`, `toolEffectiveness`) require `initialize()`, which side-effect-creates the FT search index. Calling them from an HTTP request would mutate the cache as a side effect of an inspection, so the service reads the underlying Valkey keys directly instead.

## Test plan
- [x] `pnpm --filter api test -- --testPathPatterns=cache-readonly` — 14 tests pass (list both types, health discriminated union, INVALID_CACHE_TYPE for both wrong-type calls, distribution 20 buckets w=0.1 with non-negative integer counts, recent_changes ordering+limit, threshold_recommendation insufficient_data + tighten_threshold)
- [x] `pnpm --filter api exec tsc --noEmit`
- [x] `pnpm --filter @betterdb/mcp build`
- [ ] After #127 + #128 + #134 merge: integration test driving the full read flow against real Valkey + real Postgres

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds new endpoints and background expiry plus code that can delete
cache keys and modify runtime cache config/TTL policies; mistakes could
cause unintended invalidations or incorrect proposal state transitions.
> 
> **Overview**
> Adds a **cache proposal review + execution pipeline**: new
`CacheProposalController` endpoints to list pending/history, fetch
details with audit, and `approve`/`reject`/`edit-and-approve`;
`CacheProposalService` now enforces pending/expiry rules, records audit
events with `actor_source`, supports proposal expiry, and (optionally)
triggers apply via `CacheApplyService`.
> 
> Implements proposal application to Valkey via `CacheApplyDispatcher`
(semantic threshold overrides, agent tool TTL policy updates, and both
cache types’ invalidate flows), including safer error propagation
(`ApplyFailedError` w/ proposalId) and metrics like
`actualAffected`/duration.
> 
> Introduces `CacheReadonlyService` plus new MCP HTTP endpoints and
`@betterdb/mcp` tools for cache listing/health, threshold
recommendations, tool effectiveness, similarity distributions, and
recent changes, and adds a web “Cache Proposals” page with unread
badge/polling, pending cards (approve/reject/edit), and history/detail
views.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
1f6442e. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
jamby77 merged commit c84b52e into master on Apr 29, 2026
5 checks passed
jamby77 deleted the feature/semantic-cache-discovery-markers branch on April 29, 2026 07:02
github-actions bot locked and limited conversation to collaborators on Apr 29, 2026