Skip to content

feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538

Merged
SgtPooki merged 5 commits into
mainfrom
feat/retrieval-parallel-transport-counter
May 19, 2026
Merged

feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538
SgtPooki merged 5 commits into
mainfrom
feat/retrieval-parallel-transport-counter

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

What changed

In pgBoss mode, `performAllRetrievals` gated the `/ipfs` transport on IPNI verification: any IPNI lag bypassed the transport path entirely and pinned `retrievalStatus=failure.*`. Combined with the silent-timeout bug (#524), the failure mode was invisible in logs.

This PR runs IPNI and transport concurrently and composes the terminal status as `AND(discoverabilityStatus, retrievalTransportStatus)`. A new `retrievalTransportStatus` counter records the `/ipfs`-only outcome so a single dashboard panel can attribute a `retrievalStatus` failure to IPNI vs transport.

Implements option A from #524. Matches the parallel flow already documented in `docs/checks/retrievals.md`.

Files

  • `apps/backend/src/retrieval/retrieval.service.ts` — parallel orchestration via `Promise.allSettled`, classify + compose helpers.
  • `apps/backend/src/metrics-prometheus/{metrics-prometheus.module.ts,check-metrics.service.ts}` — register `retrievalTransportStatus` counter and `recordTransportStatus` method.
  • `docs/checks/{events-and-metrics.md,retrievals.md}` — document the new counter and the composite-status rule.
  • `apps/backend/src/retrieval/retrieval.service.spec.ts` — 4 new tests covering IPNI+transport success, IPNI fail / transport success, IPNI timeout / transport success, and concurrent execution ordering.

How to verify

```
pnpm --filter dealbot-backend typecheck
pnpm --filter dealbot-backend lint
pnpm --filter dealbot-backend test
```

After deploy, watch:

  • `retrievalTransportStatus{value=success}` should equal `/ipfs` success volume per SP.
  • For approved SPs with high `retrievalStatus=failure.*` but high `retrievalTransportStatus=success`, the failure is in IPNI discoverability, not transport.

Notes

`retrievalStatus` semantics unchanged (AND of discoverability and transport). No dashboard migration needed; the new `retrievalTransportStatus` chart is opt-in.

Observability follow-ups for the silent-timeout cases (the catch-order bug in `ipni-verification.service.ts` and the missing `retrieval_skipped_ipni_failure` log) ship as a separate PR (option C from #524) on top of this one.

…s counters

In pgBoss mode, performAllRetrievals previously gated /ipfs transport on
IPNI verification: any IPNI lag or failure marked the deal as
retrievalStatus=failure.* without ever exercising the transport path. This
conflated discoverability and transport failures on the dashboard, and
when paired with the silent-timeout bug (#524) made the failure mode
invisible in logs.

Run both stages concurrently via Promise.allSettled and compose the
terminal retrievalStatus as AND(discoverabilityStatus, retrievalTransportStatus).
A new retrievalTransportStatus counter records the /ipfs-only outcome so
operators can attribute a retrievalStatus failure to IPNI vs transport
from a single dashboard.

Tracks dealbot#524 option A.

Tests: 9/9 retrieval.service.spec pass, 354/354 backend pass.
Typecheck + biome lint clean.
Copilot AI review requested due to automatic review settings May 14, 2026 18:40
@FilOzzy FilOzzy added this to FOC May 14, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 14, 2026
@SgtPooki SgtPooki self-assigned this May 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the backend Retrieval check to run IPNI verification and /ipfs transport in parallel, and introduces a new Prometheus counter (retrievalTransportStatus) to separately attribute failures to transport vs discoverability while keeping retrievalStatus as the composite (AND) outcome.

Changes:

  • Parallelize IPNI + transport execution in RetrievalService using Promise.allSettled, then classify and compose terminal status from both sub-statuses.
  • Add and register the retrievalTransportStatus Prometheus counter and expose a recordTransportStatus helper on RetrievalCheckMetrics.
  • Update retrieval/check docs and add tests covering combined outcomes plus a concurrency-ordering test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
docs/checks/retrievals.md Documents parallel IPNI + transport flow and clarifies composite retrievalStatus semantics.
docs/checks/events-and-metrics.md Defines the new retrievalTransportStatus metric and documents the composite rule for retrievalStatus.
apps/backend/src/retrieval/retrieval.service.ts Runs IPNI + transport concurrently; records retrievalTransportStatus; composes terminal retrievalStatus.
apps/backend/src/retrieval/retrieval.service.spec.ts Adds pgBoss-mode tests for combined IPNI/transport outcomes and a concurrency assertion.
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts Registers the new retrievalTransportStatus counter provider.
apps/backend/src/metrics-prometheus/check-metrics.service.ts Injects the new counter and adds recordTransportStatus.

Comment thread apps/backend/src/retrieval/retrieval.service.ts
Comment thread apps/backend/src/retrieval/retrieval.service.spec.ts
- Split retrievals_completed message by allSuccess so partial-failure
  runs are obvious from the log line (Copilot #538 review).
- Replace setTimeout-based concurrency test with deferred-promise
  barrier; both sides resolve their started signal before awaiting the
  shared release. Removes wall-clock dependency.
- Drop "pgBoss mode" from describe block; the conditional now reads as
  IPNI orchestration, not a separate run mode.
SgtPooki added a commit that referenced this pull request May 14, 2026
Two adjacent observability bugs left IPNI failures invisible whenever
the outer pg-boss retrieval-job timeout fired:

1. IpniVerificationService.verify checked signal?.aborted before
   verificationSignal.aborted. On a race where both signals were
   asserted by the time the catch handler ran, signal.throwIfAborted()
   re-threw before the ipni_verification_timed_out log could fire.
   Inner check now runs first so the inner timeout log fires whenever
   the inner timeout signal aborted, regardless of outer state.

2. RetrievalService.verifyIpniForRetrieval's catch returned a silent
   failure.timedout when signal?.aborted, with no log. Added
   retrieval_ipni_verification_timed_out so the retrieval-side caller
   records the abort.

Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
BigLep
BigLep previously requested changes May 15, 2026
Copy link
Copy Markdown
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes to get aligned on what counters are emitted and when.

Comment thread docs/checks/events-and-metrics.md
Comment thread docs/checks/events-and-metrics.md Outdated
Comment thread docs/checks/retrievals.md Outdated
@github-project-automation github-project-automation Bot moved this from 📌 Triage to ⌨️ In Progress in FOC May 15, 2026
Copy link
Copy Markdown
Collaborator

@silent-cipher silent-cipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code portion looks good overall. Just one comment.

Comment thread apps/backend/src/retrieval/retrieval.service.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

apps/backend/src/retrieval/retrieval.service.spec.ts:506

  • Blocker: Similar to the previous case, this test expects retrievalStatus=success when IPNI times out. If retrievalStatus is the composite retrieval outcome (as documented), an IPNI timeout should make retrievalStatus a failure (with a separate transport-only metric tracking /ipfs success).
  it("keeps retrievalStatus=success when IPNI times out but transport succeeds (timeout recorded on discoverabilityStatus)", async () => {
    service = await createService();
    setupCommonMocks();
    mockIpniVerificationService.verify.mockResolvedValue({
      verified: 0,
      unverified: 1,
      total: 1,
      rootCIDVerified: false,
      durationMs: 10_000,
      failedCIDs: [{ cid: "x", reason: "timeout" }],
      verifiedAt: new Date().toISOString(),
    });
    mockRetrievalAddonsService.testAllRetrievalMethods.mockResolvedValue(successfulTransport);

    await service.performAllRetrievals(buildDealWithIpni());

    expect(mockRetrievalMetrics.recordStatus).toHaveBeenCalledWith(labels, "success");
    expect(mockDiscoverabilityMetrics.recordStatus).toHaveBeenCalledWith(labels, "failure.timedout");
  });

Comment thread apps/backend/src/retrieval/retrieval.service.ts
Comment thread apps/backend/src/retrieval/retrieval.service.spec.ts
@SgtPooki SgtPooki requested review from BigLep and silent-cipher May 18, 2026 14:34
SgtPooki added a commit that referenced this pull request May 18, 2026
Two adjacent observability bugs left IPNI failures invisible whenever
the outer pg-boss retrieval-job timeout fired:

1. IpniVerificationService.verify checked signal?.aborted before
   verificationSignal.aborted. On a race where both signals were
   asserted by the time the catch handler ran, signal.throwIfAborted()
   re-threw before the ipni_verification_timed_out log could fire.
   Inner check now runs first so the inner timeout log fires whenever
   the inner timeout signal aborted, regardless of outer state.

2. RetrievalService.verifyIpniForRetrieval's catch returned a silent
   failure.timedout when signal?.aborted, with no log. Added
   retrieval_ipni_verification_timed_out so the retrieval-side caller
   records the abort.

Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
Copy link
Copy Markdown
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment to review, but looks good to me from a docs regard. I'll leave to @silent-cipher to give the code approval.

Comment thread docs/checks/retrievals.md Outdated
@SgtPooki SgtPooki moved this from ⌨️ In Progress to 🔎 Awaiting review in FOC May 18, 2026
Copy link
Copy Markdown
Collaborator

@silent-cipher silent-cipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@SgtPooki SgtPooki dismissed BigLep’s stale review May 19, 2026 12:38

already addressed.

@SgtPooki SgtPooki merged commit cbdc2c7 into main May 19, 2026
7 checks passed
@SgtPooki SgtPooki deleted the feat/retrieval-parallel-transport-counter branch May 19, 2026 12:38
@github-project-automation github-project-automation Bot moved this from 🔎 Awaiting review to 🎉 Done in FOC May 19, 2026
SgtPooki added a commit that referenced this pull request May 19, 2026
Two adjacent observability bugs left IPNI failures invisible whenever
the outer pg-boss retrieval-job timeout fired:

1. IpniVerificationService.verify checked signal?.aborted before
   verificationSignal.aborted. On a race where both signals were
   asserted by the time the catch handler ran, signal.throwIfAborted()
   re-threw before the ipni_verification_timed_out log could fire.
   Inner check now runs first so the inner timeout log fires whenever
   the inner timeout signal aborted, regardless of outer state.

2. RetrievalService.verifyIpniForRetrieval's catch returned a silent
   failure.timedout when signal?.aborted, with no log. Added
   retrieval_ipni_verification_timed_out so the retrieval-side caller
   records the abort.

Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
SgtPooki added a commit that referenced this pull request May 19, 2026
Two adjacent observability bugs left IPNI failures invisible whenever
the outer pg-boss retrieval-job timeout fired:

1. IpniVerificationService.verify checked signal?.aborted before
   verificationSignal.aborted. On a race where both signals were
   asserted by the time the catch handler ran, signal.throwIfAborted()
   re-threw before the ipni_verification_timed_out log could fire.
   Inner check now runs first so the inner timeout log fires whenever
   the inner timeout signal aborted, regardless of outer state.

2. RetrievalService.verifyIpniForRetrieval's catch returned a silent
   failure.timedout when signal?.aborted, with no log. Added
   retrieval_ipni_verification_timed_out so the retrieval-side caller
   records the abort.

Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🎉 Done

Development

Successfully merging this pull request may close these issues.

5 participants