Skip to content

feat(rc.11) PR-B: typed ACK errors + RPC caching + LU-6 runbook + provenance telemetry#672

Closed
branarakic wants to merge 9 commits into
feat/rc11-A-honest-ack-tentative-vm-cleanupfrom
feat/rc11-B-typed-errors-runbook-provenance
Closed

feat(rc.11) PR-B: typed ACK errors + RPC caching + LU-6 runbook + provenance telemetry#672
branarakic wants to merge 9 commits into
feat/rc11-A-honest-ack-tentative-vm-cleanupfrom
feat/rc11-B-typed-errors-runbook-provenance

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

PR-B of the rc.11 two-PR train. Stacks on top of #671
(feat/rc11-A-honest-ack-tentative-vm-cleanup); change the base to
main after #671 lands.

This PR covers T3 (typed ACK errors + RPC caching), T4 (LU-6
Phase B verification + runbook + MCP tool)
, and T5 (ACK-provenance
telemetry)
.

T3 — Honest publish error reporting + RPC caching

  • packages/publisher/src/ack-errors.ts (NEW): typed
    ACKProviderError hierarchy with
    • RpcPreconditionError — RPC pre-flight died on chainId /
      KAV10 / minRequiredSignatures.
    • QuorumUnmetError — dialled N, got M, needed K, with
      PeerOutcome[] carrying
      {peerId, dialOk, protocolSupported, swmHostModeAdvertised, reason}
      so an operator can see which core declined and why.
    • wrapAsRpcPreconditionIfApplicable helper for the agent's V10
      provider wrapper to upgrade legacy throws into the typed shape.
  • packages/chain/src/evm-adapter.ts: 1h-TTL cache for chainId,
    KAV10 address, and minimumRequiredSignatures. Steady-state cost:
    zero RPC per publish (previously one round-trip per attempt).
    invalidatePublishPreflightCache exposed on the EVM adapter only;
    the mock-parity test documents this as EVM-only.
  • packages/cli/src/daemon/lifecycle.ts + publisher-runner.ts:
    chainRpcUrl flows through from config. Daemon emits a startup
    WARN when an operator runs in publishing role against a known
    public RPC.
  • packages/cli/test/lifecycle-public-rpc-warn.test.ts (NEW):
    pins the WARN for the dzudza-style failure mode.

T4 — LU-6 Phase B verification + CHANGELOG + runbook + MCP tool

  • CHANGELOG.md: "Deferred (Phase A sub-task)" stub rewritten into
    "Added — OT-RFC-38 Phase B (cores auto-host curated CGs)"
    documenting the four discovery paths actually wired in rc.10:
    chain-event auto-subscribe (dkg-agent.ts:1737-1803), discovery
    beacon (1810-1838), periodic reconciler (9347-9376), and
    operator-driven enrolment (memory.ts:960-977). Genuine remaining
    gaps (sharding-table-aware selection once shards > 1; cross-core
    ciphertext replay) stay in Deferred.
  • docs/runbooks/RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md (NEW):
    operator playbook for the manual fourth-path
    POST /api/shared-memory/host-mode/subscribe. When / why / how,
    response-shape reference, anti-patterns, undo procedure.
  • packages/mcp-dkg/src/{client.ts,tools/setup.ts}: new
    dkg_request_hosting MCP tool wrapping the manual subscribe
    endpoint. Operator UX sugar; refuses cleanly when local is already
    a CG member (memberMode handler takes precedence) or on non-core
    daemons.
  • packages/mcp-dkg/test/drop-sweep.test.ts: locked count 25 → 26.

T5 — ACK-provenance telemetry

  • packages/core/src/proto/storage-ack.ts:
    StorageACKMsg.subscriptionSource (wire field 8, strictly
    additive); closed SUBSCRIPTION_SOURCES enum
    (chain-event | beacon | reconciler | manual | member).
  • packages/agent/src/dkg-agent.ts: swmHostModeSubscribed
    converted from Set<string> to Map<string, SubscriptionSource>.
    The four Phase B paths call
    reconcileSwmHostModeSubscription(cgId, source) with their
    respective sources; wireSwmHostModeHandler records it. New public
    getSwmSubscriptionSource(...candidateIds) resolver handles the
    cleartext / wire-hash key-shape difference per path. Bound to
    StorageACKHandlerConfig.getSubscriptionSourceForCg so every
    signed ACK carries provenance.
  • packages/publisher/src/storage-ack-handler.ts: populates new
    subscriptionSource field on both signed-ACK paths
    (encrypted + plaintext).
  • packages/publisher/src/ack-collector.ts:
    CollectedACK.subscriptionSource? plumbed through;
    snapshotPeerOutcomes populates swmHostModeAdvertised from the
    ACK source so QuorumUnmetError.peers[].swmHostModeAdvertised is
    meaningful for diagnosis.
  • packages/publisher/src/dkg-publisher.ts: per-publish
    ACK-provenance summary line
    (V10: Collected N core node ACKs [peer:source, ...]) emitted
    after successful ACK collection. Lets an operator answer "why did
    this CG get hosted by these cores?" from the daemon log alone
    instead of cross-referencing chain poller + beacon + reconciler.

Test plan

  • pnpm -r build clean on this branch.
  • pnpm -r build clean on the integration branch
    (release/v10.0.0-rc.11).
  • RFC-38 round-2 cold-start devnet sweep on the integration
    branch: 11/11 PASS.
  • pnpm -r test:unit on the integration branch: 317/317 PASS.
  • Targeted devnet scenarios on the integration branch:
    devnet-test-rc11-promote-crash-recovery.sh GREEN +
    devnet-test-rc11-shutdown-mid-publish.sh GREEN.

Notes for reviewers

  • The new QuorumUnmetError shape is strictly additive to the
    wire format (proto field 8). Pre-rc.11 cores send no
    subscriptionSource; their ACKs appear in the publisher's
    per-publish provenance line as ? (the honest "we don't know yet"
    shape until they upgrade).
  • The 1h-TTL cache invalidator is intentionally EVM-adapter only.
    Adding it to the MockAdapter would create false test confidence;
    the mock-parity test pins this asymmetry.
  • After feat(rc.11) PR-A: delete self-signed ACK fallback + delete tentative-VM concept #671 lands on main, please retarget this PR's base from
    feat/rc11-A-honest-ack-tentative-vm-cleanup back to main.

Made with Cursor

…ce telemetry

PR-B in the 5-PR rc.11 train (T3+T4+T5 in the train commit 543c2c14).
Stacks on top of feat/rc11-A-honest-ack-tentative-vm-cleanup which
ships T1+T2.

T3 — Honest publish error reporting + RPC caching
- packages/publisher/src/ack-errors.ts (NEW): typed `ACKProviderError`
  hierarchy with `RpcPreconditionError` (RPC pre-flight died on
  chainId / KAV10 / minRequiredSignatures) vs `QuorumUnmetError`
  (dialled N, got M, needed K — with `PeerOutcome[]` carrying
  `{peerId, dialOk, protocolSupported, swmHostModeAdvertised,
  reason}`). Includes `wrapAsRpcPreconditionIfApplicable` helper
  for the agent's V10 provider wrapper to upgrade legacy throws.
- packages/chain/src/evm-adapter.ts: 1h-TTL cache for chainId, KAV10
  address, and minimumRequiredSignatures. Steady-state cost: zero
  RPC per publish (previously one round-trip per attempt). Cache
  invalidator (`invalidatePublishPreflightCache`) exposed on the
  EVM adapter only — mock parity test docs this as EVM-only.
- packages/cli/src/daemon/lifecycle.ts + publisher-runner.ts:
  `chainRpcUrl` flows through from config. Daemon emits startup
  WARN when an operator runs in publishing role against a known
  public RPC ("node is configured to publish but using public RPC X
  — ACK pre-flight will be rate-limited, recommend setting
  chainRpcUrl to a private endpoint").
- packages/cli/test/lifecycle-public-rpc-warn.test.ts (NEW): pins
  the WARN for the dzudza-style failure mode.

T4 — LU-6 Phase B verification + CHANGELOG + runbook + MCP tool
- CHANGELOG.md: "Deferred (Phase A sub-task)" stub rewritten into
  "Added — OT-RFC-38 Phase B (cores auto-host curated CGs)"
  documenting the four discovery paths actually wired in rc.10:
  chain-event auto-subscribe (dkg-agent.ts:1737-1803), discovery
  beacon (1810-1838), periodic reconciler (9347-9376), and
  operator-driven enrolment (memory.ts:960-977). Genuine remaining
  gaps (sharding-table-aware selection once shards>1; cross-core
  ciphertext replay) stay in Deferred.
- docs/runbooks/RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md (NEW): operator
  playbook for the manual fourth-path `POST /api/shared-memory/
  host-mode/subscribe`. When / why / how, response-shape reference,
  anti-patterns, undo procedure.
- packages/mcp-dkg/src/{client.ts,tools/setup.ts}: new
  `dkg_request_hosting` MCP tool wrapping the manual subscribe
  endpoint. Operator UX sugar; refuses cleanly when local is
  already a CG member (memberMode handler takes precedence) or
  on non-core daemons.
- packages/mcp-dkg/test/drop-sweep.test.ts: locked count 25 → 26.

T5 — ACK-provenance telemetry
- packages/core/src/proto/storage-ack.ts: `StorageACKMsg.subscription
  Source` (wire field 8, strictly additive); closed `SUBSCRIPTION_
  SOURCES` enum (chain-event | beacon | reconciler | manual | member).
- packages/agent/src/dkg-agent.ts: `swmHostModeSubscribed` converted
  from `Set<string>` to `Map<string, SubscriptionSource>`. The four
  Phase B paths call `reconcileSwmHostModeSubscription(cgId, source)`
  with their respective sources; `wireSwmHostModeHandler` records it.
  New public `getSwmSubscriptionSource(...candidateIds)` resolver
  handles the cleartext / wire-hash key-shape difference per path.
  Bound to `StorageACKHandlerConfig.getSubscriptionSourceForCg` so
  every signed ACK carries provenance.
- packages/publisher/src/storage-ack-handler.ts: populates new
  `subscriptionSource` field on both signed-ACK paths
  (encrypted + plaintext).
- packages/publisher/src/ack-collector.ts: `CollectedACK.subscription
  Source?` plumbed through; `snapshotPeerOutcomes` populates
  `swmHostModeAdvertised` from the ACK source so `QuorumUnmetError.
  peers[].swmHostModeAdvertised` is meaningful for diagnosis.
- packages/publisher/src/dkg-publisher.ts: per-publish ACK-provenance
  summary line ("V10: Collected N core node ACKs [peer:source, ...]")
  emitted after successful ACK collection. Lets an operator answer
  "*why* did this CG get hosted by these cores?" from the daemon log
  alone instead of cross-referencing chain poller + beacon + reconciler.

Evidence:
- Full repo `pnpm -r build` clean on this branch.
- Integration branch (`release/v10.0.0-rc.11`, includes this PR
  + PR-A + all 15 branarakic rc.11 PRs) sweep:
    * RFC-38 round-2 cold-start: 11/11 PASS.
    * `pnpm -r test:unit`: 317/317 PASS.
    * Two new targeted devnet scenarios (async-promote crash +
      restart, SIGTERM mid-publish under load): both GREEN.

Co-authored-by: Cursor <cursoragent@cursor.com>
`nodeIdentityId ${this.config.nodeIdentityId} exceeds uint64 wire format`,
);
}
const encryptedSubscriptionSource = this.config.getSubscriptionSourceForCg?.(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This resolver only sees the numeric cgId plus the optional cleartext swmGraphId, but DKGAgent.swmHostModeSubscribed stores chain-event/beacon subscriptions under the wire hash. On those common host-only paths getSwmSubscriptionSource() will miss and emit no subscriptionSource, so the new ACK-provenance logs/reporting degrade to ?. Pass the wire hash/gossip topic here, or canonicalize candidates through gossipWireIdFor() before lookup.


export function isLikelyPublicRpc(url: string): boolean {
const lower = url.toLowerCase();
return KNOWN_PUBLIC_RPC_HOSTS.some((host) => lower.includes(host));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: includes() is checking the entire URL string, not just the hostname. That means a private proxy URL like https://rpc.my-company.example/https://sepolia.base.org will be warned as "public" even though the host is private. Parse the URL and match against hostname only, which is what the comment above says this helper is doing.


1. The daemon log on the core should include
`Host-mode subscribed CG=<id>` (or the persistence-queue equivalent).
2. `GET /api/shared-memory/host-mode/list` reports the CG (along with
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: This verification step points operators at GET /api/shared-memory/host-mode/list, but the daemon route added in this PR is /api/shared-memory/host-mode/stats. As written, the runbook's "verify it worked" step will 404. Update the runbook to use the existing stats endpoint (or add the list route if that was the intended API).

3 issues from `github-actions[bot]` Codex review on PR #672. Strictly
additive — no new behaviour, just closes the misses Codex flagged on
the PR5 ACK-provenance resolver, the PR3 public-RPC heuristic, and
the LU-6 manual-subscribe runbook.

## 1. dkg-agent.ts:1627 — canonicalize ACK resolver candidates
       through `gossipWireIdFor()` (🔴)

The PR5 `getSubscriptionSourceForCg` closure passed only the numeric
on-chain `cgId` and the cleartext `swmGraphId` to
`getSwmSubscriptionSource(...candidateIds)`. But chain-event AND
discovery-beacon paths key `swmHostModeSubscribed` by the WIRE form
(the curator-committed `nameHash`, normalized through
`gossipWireIdFor` — see `dkg-agent.ts:9317-9325`). The resolver
therefore missed every chain-event / beacon host-only subscription
and the publisher's ACK-provenance log line degraded to `?` on the
common host-only paths.

Fix: in the same closure, also compute the wire form of each
non-empty candidate via `this.gossipWireIdFor(...)` and pass it as an
extra candidate. `getSwmSubscriptionSource` is already variadic and
dedupes via its internal `seen` Set, so over-passing is cheap,
order-independent, and a no-op when one of the four shapes already
hits. `gossipWireIdFor` is a pure local computation — no extra
network or chain reads.

## 2. lifecycle.ts:408 — `isLikelyPublicRpc` parses the URL and
       matches against `hostname` only (🟡)

Pre-fix: `lower.includes(host)` against the WHOLE URL string. A
private proxy URL that embeds a known-public hostname in its PATH
— e.g. `https://rpc.my-company.example/upstream/sepolia.base.org`
or `https://proxy.my-company.example/?target=https://rpc.ankr.com`
— would be misclassified as public and the daemon would emit a
spurious WARN. Worse: an operator who suppressed the WARN by
setting `chain.rpcUrl` to that same URL would still trip the same
check.

Fix: parse `new URL(url).hostname.toLowerCase()` and match with
endpoint semantics (`===` OR `endsWith('.' + host)`). Subdomains of
known-public hosts (e.g. `eu.rpc.ankr.com`) still flag correctly.
Adjacent-suffix lookalikes (e.g. `notsepolia.base.org`) do NOT.
Falls back to the old substring scan for unparseable inputs (bare
hostnames without a scheme) so a malformed `chain.rpcUrl` an
operator still wants flagged keeps tripping the WARN.

Added 4 new test cases to `lifecycle-public-rpc-warn.test.ts`:
two private-proxy-with-embedded-public-path regression cases, and
two subdomain-handling cases (`eu.rpc.ankr.com` → true,
`notsepolia.base.org` → false). All 17 cases pass.

## 3. RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md — point operators at the
       endpoint that actually exists (🟡)

The "How to verify the subscribe took effect" section referenced
`GET /api/shared-memory/host-mode/list`, but the daemon only exposes
`GET /api/shared-memory/host-mode/stats` (served by
`packages/cli/src/daemon/routes/memory.ts:947-957`, backed by
`DKGAgent.getSwmHostModeStats()`). As written, an operator following
the runbook would hit a 404.

Fix: point at `host-mode/stats`, document its actual response shape
(`{ enabled, cgCount, totalBytes, totalEntries, subscribedCgIds }`),
and explain why `totalBytes` / `totalEntries` stay at zero until the
first gossiped envelope ingests (so an operator following step 3
isn't surprised by zero counters on a fresh subscribe).

Co-authored-by: Cursor <cursoragent@cursor.com>
@branarakic
Copy link
Copy Markdown
Contributor Author

Codex review addressed in 6cbfaeb9. Three fixes, all strictly additive:

  1. 🔴 dkg-agent.ts:1627 — canonicalize ACK resolver candidates via gossipWireIdFor() — the closure now also computes this.gossipWireIdFor(cgId) and this.gossipWireIdFor(swmGraphId) and passes them as extra candidates to getSwmSubscriptionSource(...candidateIds). Chain-event AND discovery-beacon paths key swmHostModeSubscribed by the WIRE form (the curator-committed nameHash, normalized through gossipWireIdFor — see dkg-agent.ts:9317-9325); the resolver was missing those subscriptions entirely. getSwmSubscriptionSource is already variadic and dedupes via its internal seen Set, so over-passing is cheap and order-independent. ACK-provenance no longer degrades to ? on the common host-only paths.

  2. 🟡 lifecycle.ts:408 — match against parsed URL.hostnameisLikelyPublicRpc now parses new URL(url).hostname.toLowerCase() and matches with endpoint semantics (=== OR endsWith('.' + host)). Falls back to the old substring scan for unparseable inputs (bare hostnames without a scheme) so a malformed chain.rpcUrl still trips the WARN. Subdomains of known-public hosts (e.g. eu.rpc.ankr.com) still flag correctly; adjacent-suffix lookalikes (notsepolia.base.org) and private proxies embedding a public host in the PATH (https://rpc.my-company.example/upstream/sepolia.base.org) no longer false-positive.

    Added 4 new test cases to lifecycle-public-rpc-warn.test.ts: two private-proxy-with-embedded-public-path regression cases, two subdomain-handling cases. All 17 cases pass locally.

  3. 🟡 RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md — point at the endpoint that exists — step 2 of "How to verify the subscribe took effect" now references GET /api/shared-memory/host-mode/stats (the actual route, served by packages/cli/src/daemon/routes/memory.ts:947-957) instead of the nonexistent host-mode/list. Documented the actual response shape ({ enabled, cgCount, totalBytes, totalEntries, subscribedCgIds }) and added an explanation of why totalBytes / totalEntries stay at zero until the first gossiped envelope ingests.

Builds clean on dkg-publisher, dkg-agent, and @origintrail-official/dkg. CI re-running.

Comment thread packages/mcp-dkg/src/tools/setup.ts Outdated
// call returns a clean `hostingEnabled: false` rather than a
// throw — the tool surfaces that as a helpful no-op message
// instead of an error.
server.registerTool(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This registers the manual host-mode subscribe path as a normal MCP tool, but the runbook/agent instructions treat /api/shared-memory/host-mode/subscribe as operator-only because it changes who hosts a curated CG. Exposing it without any explicit human confirmation lets agents invoke a hosting change autonomously and bypass that trust boundary. Please gate this behind an operator-confirmed flow (or keep it out of the default tool surface).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e913ec77. Removed the dkg_request_hosting MCP tool registration entirely (packages/mcp-dkg/src/tools/setup.ts) — the route stays in place for operator use via curl / dkg CLI, but it's no longer agent-invokable through MCP. Added the tool name to DROPPED_TOOLS in packages/mcp-dkg/test/drop-sweep.test.ts so a future re-registration trips the regression guard (locked tool count goes 26 → 25). Updated docs/runbooks/RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md with an operator-only callout up top and strengthened the existing anti-pattern entry. DkgClient.requestHostMode() stays so operator-side scripts that wrap the daemon API still work; only the agent-facing MCP surface drops.

Comment thread packages/agent/src/dkg-agent.ts Outdated
const wireCgId = this.gossipWireIdFor(contextGraphId);
const swmTopic = contextGraphWorkspaceTopic(wireCgId);
this.swmHostModeSubscribed.add(contextGraphId);
this.swmHostModeSubscribed.set(contextGraphId, source);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: swmHostModeSubscribed is now keyed by the raw contextGraphId, but this PR also documents that the same CG can arrive as a cleartext id, numeric id, or wire hash. After a chain-event/beacon subscribe (hash key), a later manual subscribe with the cleartext id will miss has() and wire a second handler on the same topic, causing duplicate host-mode ingest/persistence and ambiguous provenance. Canonicalize the bookkeeping key (e.g. wireCgId) or check both local+wire forms before wiring.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e913ec77. Introduced DKGAgent.canonicalSwmHostModeKey() (thin alias around the existing gossipWireIdFor helper that resolves cleartext / numeric / pre-canonical / hash shapes all to the curator-committed wire hash) and routed every swmHostModeSubscribed / swmHostModeHandlers insert/lookup/delete site through it. Touch sites:

  • wireSwmHostModeHandler — canonicalise first, key both maps off wireCgId so re-entry under a different shape short-circuits at the has() check.
  • unwireSwmHostModeHandler — canonical-key lookup so a cleartext subscribe + hash unwire (or vice versa) releases the wired handler.
  • reconcileSharedMemoryGossipSubscription — canonical-key delete when member authorisation is lost so the immediate host-mode reconcile call sees the entry cleared.
  • reconcileSwmHostModeSubscription + enableSwmHostModeFor — idempotent re-entry has() checks go through the canonical key.
  • getSwmSubscriptionSource — canonicalise each candidate before host-mode lookup; seen Set dedups raw + canonical forms; updated the lifecycle.ts getSubscriptionSourceForCg callback comment to reflect that internal canonicalisation.

Regression coverage: new packages/agent/test/swm/host-mode-key-canonicalization.test.ts (4 tests, added to the unit config) verifies cleartext-then-hash AND hash-then-cleartext subscribe orderings each leave exactly one canonical-keyed entry in both bookkeeping maps, with no duplicate gossip handler wired on the same topic, and that a mixed-shape unwire releases the wire path. All 115 agent unit tests + 218 mcp-dkg tests pass.

A small side effect: getSwmHostModeStats().subscribedCgIds now always returns the canonical wire-hash form (regardless of subscribe path / shape), so operators reading the diagnostics endpoint will see the hash even after a cleartext POST. Documented in the runbook with an updated example response and a pointer to canonicalSwmHostModeKey.

quads: [q(`${ENTITY}/D6revert`, 'http://schema.org/name', '"FakeId"')],
publisherNodeIdentityIdOverride: fakeId,
}),
).rejects.toThrow();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: rejects.toThrow() is too broad for the behavior this test is trying to pin. With the new PR1/PR3 failure modes, this will also pass on unrelated rejections (ACK quorum failure, RPC preflight, missing attestation, etc.), so the fake-identity contract revert can silently stop being covered. Assert on the expected revert shape/message here to keep the regression targeted.

// reject a publish that used the stale quorum so a single
// mis-routed retry past the boundary is the worst-case symptom.
const now = Date.now();
if (EVMChainAdapter.preflightCacheFresh(this.cachedMinRequiredSignatures, now)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: caching minimumRequiredSignatures() for 1 hour can reject otherwise valid publishes after governance lowers the quorum (for example 3 -> 2). The contract only protects the stale-low case; in the stale-high case the publisher bails out locally before submit. This value needs a much shorter TTL or an explicit invalidation path on parameter changes.

`nodeIdentityId ${this.config.nodeIdentityId} exceeds uint64 wire format`,
);
}
const encryptedSubscriptionSource = this.config.getSubscriptionSourceForCg?.(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: the new getSubscriptionSourceForCg contract says resolvers can inspect the gossip-topic/wire-form candidate, but both ACK paths only pass cgId and swmGraphId. On direct publishes where swmGraphId is omitted because it equals the numeric contextGraphId, the resolver never sees a key that matches member/host subscriptions, so subscriptionSource falls back to undefined for a normal ACK. Pass the third candidate here (and in the non-encrypted path) or narrow the API/docs.

// without a source field are pre-PR5; we leave
// `swmHostModeAdvertised` undefined rather than asserting
// either way.
const advertised = ack.subscriptionSource !== undefined ? true : undefined;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: protobufjs decodes an unset string field as "", not undefined (you already rely on that for declineCode). With this check, legacy/pre-PR5 ACKs get reported as swmHostModeAdvertised: true even though they did not send a provenance source. Gate this on a validated non-empty enum value (isSubscriptionSource) before setting the telemetry flag.

Comment thread packages/agent/src/dkg-agent.ts Outdated
contextGraphId: string,
source: SubscriptionSource = SUBSCRIPTION_SOURCES.RECONCILER,
): void {
if (this.swmHostModeHandlers.has(contextGraphId)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This idempotency check keys host-mode subscriptions by the raw contextGraphId, but the same CG can reach this method as a wire hash (chain-event/beacon) or as a cleartext id (reconciler/manual). After a hash-keyed subscribe, a later cleartext pass misses this guard, registers a second handler on the same gossip topic, and ingests each envelope twice under two IDs. Canonicalize the key (for example the wire id) or dedupe across equivalent ids before wiring.

}

async getMinimumRequiredSignatures(): Promise<number> {
// PR3 / RC11: TTL-cached. Governance vote that changes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: Caching minimumRequiredSignatures for a full hour creates a correctness window after any governance change. Once the on-chain quorum changes, the publisher will keep using the stale threshold until TTL expiry, so it can either wait for ACKs that are no longer needed or submit with too few ACKs and get reverted for up to an hour. This value needs a much shorter invalidation path (or no long-lived cache), even if chainId and KAV10 stay cached.

curl -X POST \
-H "Authorization: Bearer $AUTH" \
-d '{ "contextGraphId": "<id>" }' \
http://127.0.0.1:9201/api/shared-memory/host-mode/unsubscribe
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This runbook tells operators to call /api/shared-memory/host-mode/unsubscribe, but this PR does not add that route and there is no matching handler in packages/cli/src/daemon/routes/memory.ts. Following these steps will 404 while the doc claims unsubscribe is supported and idempotent. Either implement the endpoint or remove/update the undo section.

Branimir Rakic and others added 3 commits May 26, 2026 10:51
… MCP tool

PR #672 Codex review pass — two trust/correctness fixes flagged on
PR-B (`feat/rc11-B-typed-errors-runbook-provenance`).

### 1. `swmHostModeSubscribed` key canonicalisation (id=3302086589)

The four LU-6 Phase B discovery paths deliver the same CG to host-mode
wiring in different shapes: chain-event/beacon carry the curator-
committed wire hash, while reconciler/manual carry the cleartext local
id (or whatever string the operator POSTed). The previous code keyed
`swmHostModeSubscribed` / `swmHostModeHandlers` by the raw caller-
supplied id, so a chain-event-driven hash subscribe followed by a
manual cleartext subscribe for the SAME CG would miss `has()` and wire
a second handler on the same gossip topic — doubling host-mode ingest,
duplicating persistence, and producing ambiguous ACK provenance.

Fix: introduce `DKGAgent.canonicalSwmHostModeKey()` (thin alias around
the existing `gossipWireIdFor` helper) and route every host-mode
bookkeeping insert/lookup/delete through it so the maps are always
keyed by the curator-committed wire hash. Touch sites updated:

  * `wireSwmHostModeHandler` — canonicalise first, key both maps off
    `wireCgId` so re-entry under a different shape short-circuits the
    handler set.
  * `unwireSwmHostModeHandler` — canonical-key lookup on the unwire
    path so a cleartext subscribe + hash unwire (or vice versa)
    releases the wired handler.
  * `reconcileSharedMemoryGossipSubscription` — delete-by-canonical
    when member authorisation is lost so the immediate host-mode
    reconcile call sees the entry cleared.
  * `reconcileSwmHostModeSubscription` + `enableSwmHostModeFor` — the
    idempotent re-entry `has()` checks go through the canonical key.
  * `getSwmSubscriptionSource` — canonicalise each candidate before
    the host-mode lookup; the `seen` Set now dedups raw + canonical
    forms.
  * `lifecycle.ts` `getSubscriptionSourceForCg` callback — docs
    updated to reflect that `getSwmSubscriptionSource` now does the
    canonicalisation internally on the host-mode path.

Regression coverage: new
`packages/agent/test/swm/host-mode-key-canonicalization.test.ts`
(added to the unit config) exercises cleartext-then-hash and
hash-then-cleartext subscribe orderings plus mixed-shape unwire,
asserting both bookkeeping maps end with exactly one canonical-keyed
entry per CG and zero residue after unwire.

### 2. Remove `dkg_request_hosting` MCP tool (id=3302086584)

The PR-4 honest-ACK cleanup added an MCP tool that let any connected
agent flip its core into a host for any curated CG by name. That
changes a trust-boundary decision the curator owns — the operator-only
`POST /api/shared-memory/host-mode/subscribe` route is for direct
`curl` / `dkg` CLI use, not for autonomous agent invocation.

Fix: removed the `server.registerTool('dkg_request_hosting', …)` call
from `packages/mcp-dkg/src/tools/setup.ts` and added it to the
`DROPPED_TOOLS` allowlist in `drop-sweep.test.ts` so a future revival
trips the regression guard. Locked tool count goes 26 → 25.

`DkgClient.requestHostMode()` stays for operator scripts that wrap
the daemon API for their own internal tooling — only the MCP-
discoverable tool surface drops.

`docs/runbooks/RUNBOOK_HOST_MODE_MANUAL_SUBSCRIBE.md` gains an
operator-only callout up top and a corrected `subscribedCgIds` example
showing the canonical wire-hash form returned post-fix #1.

### Tests

`pnpm --filter @origintrail-official/dkg-mcp test`: 218 passed.
`pnpm --filter @origintrail-official/dkg-agent test:unit`: 115 passed
(includes the 4 new canonicalisation tests + existing 25 host-mode-
store, 16 host-catchup-sign, 13 host-catchup-wire suites).

Co-authored-by: Cursor <cursoragent@cursor.com>
* second adapter pointed at a different chain has its own cache
* with no cross-talk.
*/
private static readonly PREFLIGHT_TTL_MS = 60 * 60 * 1000;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: a single 1h TTL for all pre-flight reads makes publishes fail after legitimate chain changes. minimumRequiredSignatures can change via governance and KnowledgeAssetsV10 can rotate via Hub; until this cache expires, the publisher keeps using stale quorum/address data, so every publish is built against the wrong pre-flight values and is rejected on-chain. Consider only caching immutable data like chainId, or add explicit invalidation for the mutable reads.

…aleatoric/dev/dkg-pr-rc11-a into feat/rc11-B-typed-errors-runbook-provenance
@branarakic
Copy link
Copy Markdown
Contributor Author

Folded into #680 (release/v10.0.0-rc.11 integration branch), now on main as commit 6c09049. All commits on this branch are ancestors of origin/main — verified via git log origin/main..HEAD --oneline returning empty.

Closing as superseded. Any unaddressed Codex findings on this PR will be tracked in #676 (rc.12 follow-ups) and addressed in a follow-up PR.

@branarakic branarakic closed this May 26, 2026
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
… into release/rc.11

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/agent/src/dkg-agent.ts
matic031 pushed a commit to KilianTrunk/dkg that referenced this pull request Jun 2, 2026
Bump root + 17 workspace packages from 10.0.0-rc.10 to 10.0.0-rc.11.
Promote the CHANGELOG "Unreleased" block to the dated rc.11 section.

Release contents (PR OriginTrail#680 — release/rc.11 integration branch):

  Core-stability hardening (rc.10 deadlock workstream):
    OriginTrail#655 hard shutdown timeout
    OriginTrail#657 async-promote queue library
    OriginTrail#659 auto-update install-source override
    OriginTrail#669 AbortSignal plumbing through DKGNode.stop()
    OriginTrail#670 chain provider filter log-spam silencer
    OriginTrail#666 dkg migrate-to-npm CLI subcommand
    OriginTrail#668 AutoNAT boot self-probe
    OriginTrail#661 core relay capability sanity check
    OriginTrail#662 relay metrics in /api/status
    OriginTrail#664 supervisor positive-liveness probe

  ERC-721 mint ordering:
    OriginTrail#681 CEI mint-last at every mint site (supersedes OriginTrail#663,
         which proposed _safeMint and was rejected as a public-API
         break for older Gnosis Safes / DAO timelocks / strategy
         wrappers). Keeps _mint; reorders so _mint is the last
         state-changing call. relock moves _burn before _mint.

  Async-promote queue stack:
    OriginTrail#660 /promote-async route wiring with worker-readiness gate
    OriginTrail#665 async-promote worker supervisor
    OriginTrail#667 async-promote queue config + e2e tests

  Honest ACK + tentative VM cleanup:
    OriginTrail#671 delete self-signed ACK fallback + tentative-VM concept
    OriginTrail#672 typed errors + LU-6 runbook + provenance telemetry

  Test infra:
    OriginTrail#673 rc.11 test infrastructure fixes

Verification on the integration branch (release/rc.11):
  pnpm -r build                              clean
  pnpm --filter @origintrail-official/dkg test:unit   403/403 PASS
  evm-module 278/278 PASS (NFT + CG contract tests)
  devnet-test-rc11-promote-crash-recovery.sh GREEN
  devnet-test-rc11-shutdown-mid-publish.sh   GREEN (549ms shutdown,
                                             0 [shutdown-timeout] lines)
  devnet-test-rfc38-all.sh                   10/11 PASS (lj is the
                                             pre-existing documented
                                             LU-6 cores-only gap)
  devnet-test.sh                             343/347 PASS — 4 fails
                                             tracked in OriginTrail#676 as stale
                                             test expectations against
                                             OriginTrail#671's seal contract +
                                             V10 auto-registration.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant