Skip to content

fix(publisher): filter ACK quorum candidates by <contextGraphsServed>#556

Open
branarakic wants to merge 4 commits into
mainfrom
fix/publisher-ack-quorum-filter-by-hosting
Open

fix(publisher): filter ACK quorum candidates by <contextGraphsServed>#556
branarakic wants to merge 4 commits into
mainfrom
fix/publisher-ack-quorum-filter-by-hosting

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

Summary

Fixes the publish failure shape described in #541 by stopping the ACK collector from dialling cores that don't advertise hosting the target context graph.

When the ACK collector picks candidates for a quorum, it currently uses every connected core regardless of which context graphs that core has advertised hosting. Cores that don't host the target CG can only respond by throwing inside their StorageACK handler (typically No data found in SWM graph …), which the publisher sees as a libp2p stream reset — and the publish times out / fails on-chain with MinSignaturesRequirementNotMet.

What's in this PR

  • New optional getCorePeersHostingContextGraph(cgIdStr) dependency on ACKCollectorDeps. When provided, the collector queries it before ACK collection:
    • Filtered set ≥ requiredACKs → uses it; logs how many cores were excluded with their last-8 ids.
    • Filtered set < requiredACKs → logs a single WARN naming the CG and spelling out which connected cores do vs. don't advertise hosting it, then falls back to the full connected-core set. This deliberately keeps publishes live during discovery races / stale-registry windows, but makes hosting-coverage bugs visible in the log instead of presenting as opaque ACK timeouts.
  • Implementation: packages/publisher/src/hosting-resolver.ts runs a single SPARQL SELECT against the agent-registry graph and is delimiter-aware so e.g. …/repnet never falsely matches …/repnet-edge-smoke.
  • Both call sites — packages/agent/src/dkg-agent.ts:createV10ACKProvider and the daemon's ackTransportFactory in packages/cli/src/daemon/lifecycle.ts — wire the dep through. packages/cli/src/publisher-runner.ts:ACKTransportFactory threads the optional method through so existing embedders keep working unchanged.

The decision to warn-and-fall-back rather than hard-fail keeps the same publish path live for legitimate cases where the hosting registry is incomplete (e.g. discovery just hasn't caught up yet), while making the actual #541 root cause — beacon-3 not advertising repnet-v2-official despite the CG being replicationPolicy=full — pop out of the warning logs immediately. A separate issue covers fixing beacon-3's chain-event poller / CG-discovery loop so it reconciles against the on-chain CG list rather than relying purely on live events.

Test plan

  • pnpm --filter @origintrail-official/dkg-publisher build — green
  • pnpm --filter @origintrail-official/dkg-agent build — green
  • pnpm --filter "./packages/cli" build — green
  • pnpm --filter @origintrail-official/dkg-publisher test — 921 passed / 1 skipped (62 files), all green
  • Spot-checked agent suites that exercise the V10 ACK provider (p2p-resilience.test.ts, dkg-agent-diagnostics.test.ts) — green
  • Six new tests in packages/publisher/test/v10-ack-edge-cases.test.ts cover the new collector behaviour: filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→fallback+WARN, async filter, and no-dep parity with today
  • Five new tests cover resolvePeersHostingContextGraph directly: exact-UAL match, prefix-isolation guard, the literal RepNet CG15 publish cannot collect identity 1 ACK while public graph can #541 hosting layout (cores A and B advertise the CG, C doesn't, only A and B come back), empty-UAL, and SPARQL-escaped pathological UAL

Backward compatibility

The new dependency is optional: any embedder that didn't pass getCorePeersHostingContextGraph keeps the exact pre-existing collector behaviour. Existing ACKCollectorDeps consumers compile and run unchanged.

Related

Made with Cursor

When the ACK collector picks core-node candidates for a publish quorum,
it currently uses every connected core regardless of which context
graphs that core has advertised hosting. Cores that don't host the
target CG can only respond by throwing inside their StorageACK handler
(typically `No data found in SWM graph ...`), which the publisher sees
as a libp2p stream reset and has to time out / retry against. In
practice this surfaces as opaque quorum failures (see GitHub issue #541
— `repnet-v2-official` is `accessPolicy=public, replicationPolicy=full`,
so every core must host it; one beacon failed to enroll the CG after a
chain redeploy, the publisher kept selecting it as an ACK candidate,
and the publish failed with `MinSignaturesRequirementNotMet`).

Add an optional `getCorePeersHostingContextGraph(cgIdStr)` dependency
on `ACKCollectorDeps`. When provided, the collector queries it before
ACK collection:

- Filtered set ≥ `requiredACKs` → use it; log how many cores were
  excluded by the filter (with their last-8 ids).
- Filtered set < `requiredACKs` → log a single WARN naming the CG and
  spelling out which connected cores do vs. don't advertise hosting it,
  then **fall back to the full connected-core set**. This deliberately
  keeps publishes live during discovery races / stale-registry windows
  but makes hosting-coverage bugs visible in the log instead of as
  opaque ACK timeouts.

Implementation lives in `packages/publisher/src/hosting-resolver.ts` and
runs a single SPARQL `SELECT` against the agent registry graph,
delimiter-aware so `repnet` never falsely matches `repnet-edge-smoke`.
Both call sites — `packages/agent/src/dkg-agent.ts:createV10ACKProvider`
and the daemon's `ackTransportFactory` in
`packages/cli/src/daemon/lifecycle.ts` — wire the dep through; the
indirection in `packages/cli/src/publisher-runner.ts:ACKTransportFactory`
threads the optional method through so existing embedders keep working
unchanged.

Tests in `packages/publisher/test/v10-ack-edge-cases.test.ts` cover:
filtered-ok, partial→fallback+WARN, empty→fallback+WARN, throw→
fallback+WARN, async filter, no-dep parity with today, and a small
`resolvePeersHostingContextGraph` suite incl. the literal #541 hosting
layout (peer-A and peer-B advertise the CG, peer-C doesn't, only A and
B come back).

Co-authored-by: Cursor <cursoragent@cursor.com>
PREFIX skill: <https://dkg.origintrail.io/skill#>
SELECT DISTINCT ?peerId WHERE {
GRAPH ?g {
?agent dkg:peerId ?peerId ;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: This query never constrains the result set to core-node profiles (dkg:nodeRole "core" / dkg:CoreNode). During early startup getConnectedCorePeers() intentionally falls back to all connected peers until protocol discovery completes, so a few edge nodes that advertise the CG can satisfy this filter and the collector will send /storage-ack requests only to peers that do not implement that protocol. Filter to core nodes here, or intersect against the discovered core-peer set before these matches are treated as ACK candidates.

Comment thread packages/publisher/src/ack-collector.ts Outdated
const matched = allConnected.filter(p => hostingSet.has(p));
const excluded = allConnected.filter(p => !hostingSet.has(p));

if (matched.length >= REQUIRED_ACKS) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: When matched.length >= REQUIRED_ACKS, the hosting-filtered set becomes the only candidate list. That turns the registry into a hard gate: one stale/false-positive advertisement in matched can drop the publish to 2/3 ACKs even though excluded connected peers could have satisfied quorum, which is a behavioral regression from the previous 'try every connected core' path. Safer behavior is to prioritize matched first, then continue with the remaining connected peers if quorum is still missing.

- hosting-resolver: constrain SPARQL to dkg:CoreNode agents so an
  edge node that publishes contextGraphsServed (e.g. for join-time
  discovery) can no longer leak into the ACK candidate set; without
  this the early-startup fallback in getConnectedCorePeers() to
  "all connected peers" could let the collector dial a peer that
  just stream-resets /dkg/10.0.0/storage-ack.

- ack-collector: replace the hard-gate "matched OR fallback" choice
  with two-wave dialling. The priority wave (advertised cores) is
  dialled first; the fallback wave (the rest of the connected pool)
  is dialled only if the priority wave fails to reach quorum. This
  removes the regression where a single stale advertisement on a
  matched peer could fail a publish the rest of the pool could have
  satisfied, while still avoiding wasted dials on cores that don't
  host the CG in the happy path.

- tests: cover edge-node filtering in the resolver, fallback-on-
  stale-advertisement and "no fallback dialled when priority alone
  satisfies quorum" in the collector. Update fixture to emit
  rdf:type dkg:CoreNode/EdgeNode to match the new query.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/publisher/src/ack-collector.ts Outdated
if (this.deps.getCorePeersHostingContextGraph) {
let hostingPeers: string[] = [];
try {
hostingPeers = await Promise.resolve(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: The hosting-filter lookup is awaited before the ACK timeout starts. If the local registry query hangs or gets slow, publish can block indefinitely without ever dialing peers. Please bound this lookup with a short fail-open timeout, or start the overall timeout before awaiting it.

PREFIX dkg: <https://dkg.network/ontology#>
PREFIX skill: <https://dkg.origintrail.io/skill#>
SELECT DISTINCT ?peerId WHERE {
GRAPH ?g {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: GRAPH ?g makes ACK routing trust matching triples from any named graph in the store, not just the agent-registry graph where profiles are published. A copied or stale profile elsewhere can skew the priority wave toward the wrong peers. Scope this query to the authoritative registry data graph instead of all graphs.

- ack-collector: bound the optional `getCorePeersHostingContextGraph`
  lookup with `HOSTING_FILTER_TIMEOUT_MS` (1.5s). The lookup runs
  against the local triple store BEFORE the `ACK_TIMEOUT_MS` budget
  even begins, so a slow / hung registry query could previously
  block a publish indefinitely. On timeout we treat the result as
  "no hosting signal" and fall back to the legacy single-wave path.

- hosting-resolver: pin the SPARQL to the authoritative agent-
  registry named graph (`did:dkg:context-graph:agents`, mirroring
  `AGENT_REGISTRY_GRAPH` in `packages/agent/src/profile.ts`).
  `GRAPH ?g` previously let ACK routing trust profile triples
  landing in any named graph (e.g. a stale snapshot held elsewhere
  in the store), which could skew the priority wave toward the
  wrong peers. Re-export the constant as `AGENT_REGISTRY_NAMED_GRAPH`
  for downstream callers / tests.

- tests: add coverage for both — a lookup-hangs-forever case (must
  fall back to single-wave dialling well below the ACK timeout) and
  a stray-named-graph case (a copied profile in a non-registry
  graph must NOT be picked up).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread packages/publisher/src/ack-collector.ts Outdated
let hostingPeers: string[] = [];
try {
const lookupPromise = Promise.resolve(
this.deps.getCorePeersHostingContextGraph(contextGraphIdStr),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: The hosting lookup is keyed by contextGraphIdStr, but the ACK handler loads data from swmGraphId ?? contextGraphIdStr. In the remap flow, peers advertise the source SWM graph, not the target on-chain ACK domain, so this will prioritize the wrong peers or miss the hosting signal entirely. Use params.swmGraphId ?? contextGraphIdStr for the hosting-filter query.

Comment thread packages/agent/src/dkg-agent.ts Outdated
.map(p => p.toString())
.filter(id => id !== this.peerId)
.filter(id => advertisedSet.has(id));
} catch {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: Catching and returning [] here hides real resolver/store failures from ACKCollector, so production regressions become indistinguishable from a legitimate 'no hosting signal' fallback. The collector already handles failures safely and logs the cause; either let the error propagate to it, or log the exception before returning [] here (same issue in packages/cli/src/daemon/lifecycle.ts).

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review produced 2 comment(s) but all targeted lines outside the diff and were dropped. Check the workflow logs for details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant