Skip to content

feature(agent-cache): add discovery marker protocol (0.5.0)#128

Merged
jamby77 merged 6 commits into
masterfrom
feature/agent-cache-discovery-markers
Apr 30, 2026
Merged

feature(agent-cache): add discovery marker protocol (0.5.0)#128
jamby77 merged 6 commits into
masterfrom
feature/agent-cache-discovery-markers

Conversation

@jamby77
Copy link
Copy Markdown
Collaborator

@jamby77 jamby77 commented Apr 24, 2026

Summary

Companion to #127 (@betterdb/semantic-cache 0.2.0). Implements the agent-cache side of the shared __betterdb:* discovery marker protocol so BetterDB Monitor (and the upcoming Cache Intelligence MCP tools) can enumerate agent-cache instances on any Valkey they already monitor — no customer configuration beyond constructing the library as before.

Protocol (shared with semantic-cache)

  • __betterdb:caches — hash. Field = options.name, value = JSON metadata (type, prefix, version, protocol_version, capabilities, tier TTL defaults, tool policy names, cost-table presence, etc).
  • __betterdb:heartbeat:<name> — string, TTL 60s, refreshed every 30s.
  • __betterdb:protocolSET NX 1. Version marker.

No sensitive data is written — only cache metadata.

What lands

  • Constructor kicks off a fire-and-forget registerDiscovery() alongside the existing tool.loadPolicies() and createAnalytics() patterns — the cache is usable immediately
  • Heartbeat ticks refresh the registry metadata so tool_policies added via cache.tool.setPolicy(...) become visible within 30s, and re-assert the protocol NX key as belt-and-braces against LRU eviction
  • tool_policies is capped at 500 names with a tool_policies_truncated flag
  • shutdown() now stops the heartbeat and deletes the heartbeat key; the registry entry is preserved so Monitor retains history
  • New cache.ensureDiscoveryReady() lets callers opt into strict collision enforcement — rejects with AgentCacheUsageError when another cache type is registered under the same name on this Valkey
  • New ToolCache.listPolicyNames() exposes registered tool names for the marker
  • Opt out via AgentCacheOptions.discovery = { enabled: false }. heartbeatIntervalMs and includeToolPolicies are also exposed
  • agent_cache_discovery_write_failed_total counter so ACL-denied writes are alertable

Bumps @betterdb/agent-cache 0.4.0 → 0.5.0.

Test plan

  • pnpm --filter @betterdb/agent-cache test — 221 tests pass (12 new unit tests + 1 new integration test for marker registration, heartbeat TTL, metadata refresh after setPolicy, cross-type collision, version mismatch warn+overwrite, ACL-denied resilience)
  • pnpm --filter @betterdb/agent-cache typecheck clean
  • pnpm --filter @betterdb/agent-cache build clean
  • Integration tests run in CI against valkey:8.1: verify HGETALL __betterdb:caches populated, heartbeat TTL in range, shutdown() clears heartbeat key, registry preserved

Notes for review

  • Fire-and-forget registration is intentional and matches the existing pattern for analytics.createAnalytics() / tool.loadPolicies() in the same constructor. Consumers who want strict collision enforcement can await cache.ensureDiscoveryReady() right after construction.
  • The collision-enforcement trade-off is documented in the README and in the PR of the spec. If the team decides we need every tier operation to gate on discovery readiness, that's a follow-up — v1 covers the common case (construct → first op is async anyway → race is vanishingly rare on a healthy Valkey).

Note

Medium Risk
Adds new background Valkey writes and timers during AgentCache lifecycle (constructor/heartbeat/shutdown), which could surface ACL/permission issues or unexpected load in some deployments. Core caching behavior is intended to be unchanged, but lifecycle and monitoring side effects are new.

Overview
Introduces a BetterDB discovery-marker protocol for @betterdb/agent-cache (bumping to 0.5.0): each AgentCache now best-effort registers JSON metadata in __betterdb:caches, maintains a periodic __betterdb:heartbeat:<name> key (and sets __betterdb:protocol), and refreshes metadata on each heartbeat.

Adds AgentCacheOptions.discovery controls (enable/disable, heartbeat interval, include tool policies), a new cache.ensureDiscoveryReady() for callers that want cross-cache-type collision enforcement, and updates shutdown() to stop the heartbeat and delete the heartbeat key while leaving the registry entry intact. Observability is extended with a new Prometheus counter *_discovery_write_failed_total, and tests/docs are updated to cover registration, TTL behavior, collision handling, and ACL-denied write resilience.

Reviewed by Cursor Bugbot for commit 3b504ca. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/agent-cache/src/AgentCache.ts
Comment thread packages/agent-cache/src/discovery.ts
Comment thread packages/agent-cache/src/discovery.ts Outdated
jamby77 added a commit that referenced this pull request Apr 24, 2026
Three findings from Cursor Bugbot on PR #128:

- ensureDiscoveryReady() now re-checks this.discoveryError after awaiting
  the registration promise. The internal .catch() handler resolves the
  promise with undefined on collision, so the await alone silently
  succeeded. The re-check surfaces the AgentCacheUsageError to the first
  caller
- register() now writes the initial heartbeat synchronously via a new
  private writeHeartbeat() before scheduling the setInterval, so Monitor
  sees the cache as alive immediately after construction instead of
  waiting up to heartbeatIntervalMs
- writeHeartbeat() (and therefore tickHeartbeat()) now calls
  onWriteFailed() on SET failure, so the
  agent_cache_discovery_write_failed_total counter reflects heartbeat
  ACL issues as documented

Regression tests added for all three paths (ensureDiscoveryReady rejects
on collision without races, initial heartbeat is written during
register(), heartbeat SET failure bumps the counter).
Comment thread packages/agent-cache/src/AgentCache.ts
jamby77 added a commit that referenced this pull request Apr 24, 2026
Bugbot finding on PR #128: shutdown() checked this.discovery but never
awaited this.discoveryReady. When shutdown() raced with an in-flight
register(), this.discovery was still null, so shutdown() no-opped on
discovery cleanup and returned. register() then completed,
startHeartbeat() ran, and the .then() handler's shutdownCalled guard
eventually stopped the manager — but by then the caller already had a
resolved shutdown() promise and may have closed the Valkey client,
causing late heartbeat writes to hit a dead connection.

Fix: await this.discoveryReady inside shutdown() before deciding on
cleanup. The .catch() handler always resolves the promise (collision
errors are captured on discoveryError, not re-thrown), so awaiting is
safe. By the time the await settles, the .then() handler has either
stored the manager on this.discovery or torn it down via the
shutdownCalled branch — both paths are then handled by the existing
code below.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0de940d. Configure here.

Comment thread packages/agent-cache/src/__tests__/AgentCache.integration.test.ts
Comment thread packages/agent-cache/src/AgentCache.ts
jamby77 added a commit that referenced this pull request Apr 24, 2026
Two bugbot findings on PR #128:

- registerDiscovery's .catch handler only stored AgentCacheUsageError
  in this.discoveryError, silently dropping any other error thrown out
  of manager.register(). Most paths are already wrapped in
  safeCall/safeHget, but checkCollision calls buildMetadata() without
  a try-catch, giving a concrete escape route. Store any error via
  `err instanceof Error ? err : new Error(String(err))` so
  ensureDiscoveryReady() reflects the real outcome
- The discovery integration test was accidentally nested inside the
  describe('Flush', ...) block, making its "Flush" test-suite label
  misleading. Moved it into its own describe('Discovery markers') block
jamby77 added a commit that referenced this pull request Apr 27, 2026
CacheProposalService with type-specific validation, duplicate-pending
rejection, and per-connection 30/hour rate limit. Resolves caches via
HGETALL __betterdb:caches (discovery markers from PRs #127/#128).
Typed errors (validation, invalid cache type, not found, duplicate,
rate limited) are mapped to HTTP status codes at the MCP controller.

Three MCP tools registered:
- cache_propose_threshold_adjust (semantic_cache only, 0..2 range)
- cache_propose_tool_ttl_adjust  (agent_cache only, 10..86400 range)
- cache_propose_invalidate       (filter_kind discriminates by type;
                                  warns when estimated_affected > 10000)

All proposals start as 'pending' with a 24h expiry, awaiting human
approval (Day 5 spec). The module imports StorageModule for proposal
CRUD and ConnectionsModule for the Valkey client used to read markers.

Stacks on feature/cache-proposal-data-model (PR #134).
jamby77 added a commit that referenced this pull request Apr 27, 2026
CacheProposalService with type-specific validation, duplicate-pending
rejection, and per-connection 30/hour rate limit. Resolves caches via
HGETALL __betterdb:caches (discovery markers from PRs #127/#128).
Typed errors (validation, invalid cache type, not found, duplicate,
rate limited) are mapped to HTTP status codes at the MCP controller.

Three MCP tools registered:
- cache_propose_threshold_adjust (semantic_cache only, 0..2 range)
- cache_propose_tool_ttl_adjust  (agent_cache only, 10..86400 range)
- cache_propose_invalidate       (filter_kind discriminates by type;
                                  warns when estimated_affected > 10000)

All proposals start as 'pending' with a 24h expiry, awaiting human
approval (Day 5 spec). The module imports StorageModule for proposal
CRUD and ConnectionsModule for the Valkey client used to read markers.

Stacks on feature/cache-proposal-data-model (PR #134).
jamby77 added 6 commits April 27, 2026 20:21
Implements the agent-cache side of the shared __betterdb:* discovery
marker protocol (companion to @betterdb/semantic-cache 0.2.0). BetterDB
Monitor and the upcoming Cache Intelligence MCP tools can now enumerate
agent-cache instances on any monitored Valkey with zero customer
configuration.

- Constructor kicks off a fire-and-forget registerDiscovery() alongside
  the existing tool.loadPolicies() and createAnalytics() patterns; the
  cache is usable immediately
- Heartbeat ticks every 30s refresh the registry metadata so tool
  policies added via tool.setPolicy(...) become visible within 30s, and
  re-assert the protocol NX key as belt-and-braces against LRU eviction
- tool_policies is capped at 500 names with tool_policies_truncated flag
- shutdown() now stops the heartbeat and deletes the heartbeat key,
  preserving the registry entry so Monitor retains history
- ensureDiscoveryReady() lets callers opt into strict collision
  enforcement — rejects with AgentCacheUsageError when another cache
  type is registered under the same name on this Valkey
- Opt out via AgentCacheOptions.discovery = { enabled: false }
- New ToolCache.listPolicyNames() exposes registered tool names for the
  marker
- agent_cache_discovery_write_failed_total counter for ACL alerting

Bumps @betterdb/agent-cache to 0.5.0.
Three findings from Cursor Bugbot on PR #128:

- ensureDiscoveryReady() now re-checks this.discoveryError after awaiting
  the registration promise. The internal .catch() handler resolves the
  promise with undefined on collision, so the await alone silently
  succeeded. The re-check surfaces the AgentCacheUsageError to the first
  caller
- register() now writes the initial heartbeat synchronously via a new
  private writeHeartbeat() before scheduling the setInterval, so Monitor
  sees the cache as alive immediately after construction instead of
  waiting up to heartbeatIntervalMs
- writeHeartbeat() (and therefore tickHeartbeat()) now calls
  onWriteFailed() on SET failure, so the
  agent_cache_discovery_write_failed_total counter reflects heartbeat
  ACL issues as documented

Regression tests added for all three paths (ensureDiscoveryReady rejects
on collision without races, initial heartbeat is written during
register(), heartbeat SET failure bumps the counter).
- CacheType in this package is always 'agent_cache' — dropped the
  sibling-union, exported the literal as CACHE_TYPE, simplified the
  collision check to compare against that constant directly
- Removed cacheType from DiscoveryManagerDeps (no longer needed now that
  the collision check references the constant)
- Read PACKAGE_VERSION from the library's own package.json at runtime
  instead of duplicating the version string in source
- Removed the cached this.telemetry field; pass Telemetry into
  registerDiscovery() instead
- Stripped explanatory comments from production source per project rule
Bugbot finding on PR #128: shutdown() checked this.discovery but never
awaited this.discoveryReady. When shutdown() raced with an in-flight
register(), this.discovery was still null, so shutdown() no-opped on
discovery cleanup and returned. register() then completed,
startHeartbeat() ran, and the .then() handler's shutdownCalled guard
eventually stopped the manager — but by then the caller already had a
resolved shutdown() promise and may have closed the Valkey client,
causing late heartbeat writes to hit a dead connection.

Fix: await this.discoveryReady inside shutdown() before deciding on
cleanup. The .catch() handler always resolves the promise (collision
errors are captured on discoveryError, not re-thrown), so awaiting is
safe. By the time the await settles, the .then() handler has either
stored the manager on this.discovery or torn it down via the
shutdownCalled branch — both paths are then handled by the existing
code below.
Two bugbot findings on PR #128:

- registerDiscovery's .catch handler only stored AgentCacheUsageError
  in this.discoveryError, silently dropping any other error thrown out
  of manager.register(). Most paths are already wrapped in
  safeCall/safeHget, but checkCollision calls buildMetadata() without
  a try-catch, giving a concrete escape route. Store any error via
  `err instanceof Error ? err : new Error(String(err))` so
  ensureDiscoveryReady() reflects the real outcome
- The discovery integration test was accidentally nested inside the
  describe('Flush', ...) block, making its "Flush" test-suite label
  misleading. Moved it into its own describe('Discovery markers') block
@jamby77 jamby77 force-pushed the feature/agent-cache-discovery-markers branch from 55d312e to 3b504ca Compare April 27, 2026 17:23
jamby77 added a commit that referenced this pull request Apr 29, 2026
## Summary

The 6 read-only MCP tools from
[`spec-cache-mcp-readonly-tools.md`](../../tree/feature/cache-mcp-readonly-tools/docs/plans/specs/spec-cache-mcp-readonly-tools.md).
**Stacked on #134** (cache proposal data model + service); base branch
is \`feature/cache-proposal-data-model\` so it'll auto-rebase to
\`master\` when #134 merges.

- \`cache_list\` — enumerate \`__betterdb:caches\`, return hit rate +
total ops + live/stale status per cache.
- \`cache_health\` — discriminated by cache_type. Semantic returns
category breakdown + uncertain_hit_rate; agent returns per-tool
breakdown. No nullable-field fudging.
- \`cache_threshold_recommendation\` — semantic only. Replicates
\`SemanticCache.thresholdEffectiveness()\` by reading
\`{prefix}:__similarity_window\` directly.
- \`cache_tool_effectiveness\` — agent only. Reads \`{prefix}:__stats\`
and applies the same TTL-recommendation rules as
\`AgentCache.toolEffectiveness()\`.
- \`cache_similarity_distribution\` — exactly 20 buckets of width 0.1
across the 0..2 cosine-distance range, with optional category +
window-hours filters.
- \`cache_recent_changes\` — recent proposals (any status) for a single
cache, newest first, so agents can avoid re-proposing already-pending
changes.

Per-type dispatch surfaces \`INVALID_CACHE_TYPE\` (HTTP 400) when a
semantic-only or agent-only tool is called against the wrong cache type.
\`CACHE_NOT_FOUND\` (HTTP 404) when the cache isn't in the discovery
markers.

The cache-library methods (\`thresholdEffectiveness\`,
\`toolEffectiveness\`) require \`initialize()\` which
side-effect-creates the FT search index. Calling them from an HTTP
request would mutate the cache as a side effect of an inspection. The
service reads the underlying Valkey keys directly instead.

## Test plan
- [x] \`pnpm --filter api test -- --testPathPatterns=cache-readonly\` —
14 tests pass (list both types, health discriminated union,
INVALID_CACHE_TYPE for both wrong-type calls, distribution 20 buckets
w=0.1 with non-negative integer counts, recent_changes ordering+limit,
threshold_recommendation insufficient_data + tighten_threshold)
- [x] \`pnpm --filter api exec tsc --noEmit\`
- [x] \`pnpm --filter @betterdb/mcp build\`
- [ ] After #127 + #128 + #134 merge: integration test driving the full
read flow against real Valkey + real Postgres

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Adds new endpoints and background expiry plus code that can delete
cache keys and modify runtime cache config/TTL policies; mistakes could
cause unintended invalidations or incorrect proposal state transitions.
> 
> **Overview**
> Adds a **cache proposal review + execution pipeline**: new
`CacheProposalController` endpoints to list pending/history, fetch
details with audit, and `approve`/`reject`/`edit-and-approve`;
`CacheProposalService` now enforces pending/expiry rules, records audit
events with `actor_source`, supports proposal expiry, and (optionally)
triggers apply via `CacheApplyService`.
> 
> Implements proposal application to Valkey via `CacheApplyDispatcher`
(semantic threshold overrides, agent tool TTL policy updates, and both
cache types’ invalidate flows), including safer error propagation
(`ApplyFailedError` w/ proposalId) and metrics like
`actualAffected`/duration.
> 
> Introduces `CacheReadonlyService` plus new MCP HTTP endpoints and
`@betterdb/mcp` tools for cache listing/health, threshold
recommendations, tool effectiveness, similarity distributions, and
recent changes, and adds a web “Cache Proposals” page with unread
badge/polling, pending cards (approve/reject/edit), and history/detail
views.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
1f6442e. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Comment on lines +330 to +332
if (this.discoveryError) {
throw this.discoveryError;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which cases will we enter this if, when line 325 does the same check first already?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Await on line 329 can result on setting this.discoveryError

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ensureDiscoveryReady() is called while registration is still in flight, line 325 sees discoveryError === null so it doesn't throw, then line 329 awaits the in-flight promise. The promise itself never rejects — line 379 ends the chain with a .catch() that swallows the rejection and stores it in this.discoveryError instead. So if registration fails during that await, the only way to surface it is to re-check discoveryError after the await resolves — that's what 330–332 do.

The .catch() is there because discoveryReady has multiple potential awaiters (also shutdown() at line 315) and is started from the constructor — without it, an unawaited rejection would log UnhandledPromiseRejection. The trade-off is we have to read the error back through the field rather than the promise.

A simpler form would be to drop the .catch() and let the promise reject; then line 329 would throw and lines 330–332 become unreachable. Happy to refactor it that way if you prefer.

@KIvanow
Copy link
Copy Markdown
Member

KIvanow commented Apr 29, 2026

@jamby77 looks great overall! Had only 1 comment

@jamby77 jamby77 merged commit 32d8a47 into master Apr 30, 2026
3 checks passed
@jamby77 jamby77 deleted the feature/agent-cache-discovery-markers branch April 30, 2026 07:15
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 30, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants