Skip to content

feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538

Open
SgtPooki wants to merge 2 commits into
mainfrom
feat/retrieval-parallel-transport-counter
Open

feat(retrieval): parallel IPNI + transport with retrievalTransportStatus counter#538
SgtPooki wants to merge 2 commits into
mainfrom
feat/retrieval-parallel-transport-counter

Conversation

@SgtPooki
Copy link
Copy Markdown
Collaborator

What changed

In pgBoss mode, `performAllRetrievals` gated the `/ipfs` transport on IPNI verification: any IPNI lag bypassed the transport path entirely and pinned `retrievalStatus=failure.*`. Combined with the silent-timeout bug (#524), the failure mode was invisible in logs.

This PR runs IPNI and transport concurrently and composes the terminal status as `AND(discoverabilityStatus, retrievalTransportStatus)`. A new `retrievalTransportStatus` counter records the `/ipfs`-only outcome so a single dashboard panel can attribute a `retrievalStatus` failure to IPNI vs transport.

Implements option A from #524. Matches the parallel flow already documented in `docs/checks/retrievals.md`.

Files

  • `apps/backend/src/retrieval/retrieval.service.ts` — parallel orchestration via `Promise.allSettled`, classify + compose helpers.
  • `apps/backend/src/metrics-prometheus/{metrics-prometheus.module.ts,check-metrics.service.ts}` — register `retrievalTransportStatus` counter and `recordTransportStatus` method.
  • `docs/checks/{events-and-metrics.md,retrievals.md}` — document the new counter and the composite-status rule.
  • `apps/backend/src/retrieval/retrieval.service.spec.ts` — 4 new tests covering IPNI+transport success, IPNI fail / transport success, IPNI timeout / transport success, and concurrent execution ordering.

How to verify

```
pnpm --filter dealbot-backend typecheck
pnpm --filter dealbot-backend lint
pnpm --filter dealbot-backend test
```

After deploy, watch:

  • `retrievalTransportStatus{value=success}` should equal `/ipfs` success volume per SP.
  • For approved SPs with high `retrievalStatus=failure.*` but high `retrievalTransportStatus=success`, the failure is in IPNI discoverability, not transport.

Notes

`retrievalStatus` semantics unchanged (AND of discoverability and transport). No dashboard migration needed; the new `retrievalTransportStatus` chart is opt-in.

Observability follow-ups for the silent-timeout cases (the catch-order bug in `ipni-verification.service.ts` and the missing `retrieval_skipped_ipni_failure` log) ship as a separate PR (option C from #524) on top of this one.

…s counters

In pgBoss mode, performAllRetrievals previously gated /ipfs transport on
IPNI verification: any IPNI lag or failure marked the deal as
retrievalStatus=failure.* without ever exercising the transport path. This
conflated discoverability and transport failures on the dashboard, and
when paired with the silent-timeout bug (#524) made the failure mode
invisible in logs.

Run both stages concurrently via Promise.allSettled and compose the
terminal retrievalStatus as AND(discoverabilityStatus, retrievalTransportStatus).
A new retrievalTransportStatus counter records the /ipfs-only outcome so
operators can attribute a retrievalStatus failure to IPNI vs transport
from a single dashboard.

Tracks dealbot#524 option A.

Tests: 9/9 retrieval.service.spec pass, 354/354 backend pass.
Typecheck + biome lint clean.
Copilot AI review requested due to automatic review settings May 14, 2026 18:40
@FilOzzy FilOzzy added this to FOC May 14, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC May 14, 2026
@SgtPooki SgtPooki self-assigned this May 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the backend Retrieval check to run IPNI verification and /ipfs transport in parallel, and introduces a new Prometheus counter (retrievalTransportStatus) to separately attribute failures to transport vs discoverability while keeping retrievalStatus as the composite (AND) outcome.

Changes:

  • Parallelize IPNI + transport execution in RetrievalService using Promise.allSettled, then classify and compose terminal status from both sub-statuses.
  • Add and register the retrievalTransportStatus Prometheus counter and expose a recordTransportStatus helper on RetrievalCheckMetrics.
  • Update retrieval/check docs and add tests covering combined outcomes plus a concurrency-ordering test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
docs/checks/retrievals.md Documents parallel IPNI + transport flow and clarifies composite retrievalStatus semantics.
docs/checks/events-and-metrics.md Defines the new retrievalTransportStatus metric and documents the composite rule for retrievalStatus.
apps/backend/src/retrieval/retrieval.service.ts Runs IPNI + transport concurrently; records retrievalTransportStatus; composes terminal retrievalStatus.
apps/backend/src/retrieval/retrieval.service.spec.ts Adds pgBoss-mode tests for combined IPNI/transport outcomes and a concurrency assertion.
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts Registers the new retrievalTransportStatus counter provider.
apps/backend/src/metrics-prometheus/check-metrics.service.ts Injects the new counter and adds recordTransportStatus.

Comment on lines +203 to +210
const successCount = retrievals.filter((r) => r.status === RetrievalStatus.SUCCESS).length;
this.logger.log({
...retrievalLogContext,
event: "retrievals_completed",
message: "Retrievals successful",
successCount,
totalCount: retrievals.length,
});
Comment on lines +516 to +535
mockIpniVerificationService.verify.mockImplementation(async () => {
order.push("ipni-start");
await new Promise((r) => setTimeout(r, 20));
order.push("ipni-end");
return {
verified: 1,
unverified: 0,
total: 1,
rootCIDVerified: true,
durationMs: 20,
failedCIDs: [],
verifiedAt: new Date().toISOString(),
};
});
mockRetrievalAddonsService.testAllRetrievalMethods.mockImplementation(async () => {
order.push("transport-start");
await new Promise((r) => setTimeout(r, 10));
order.push("transport-end");
return successfulTransport;
});
- Split retrievals_completed message by allSuccess so partial-failure
  runs are obvious from the log line (Copilot #538 review).
- Replace setTimeout-based concurrency test with deferred-promise
  barrier; both sides resolve their started signal before awaiting the
  shared release. Removes wall-clock dependency.
- Drop "pgBoss mode" from describe block; the conditional now reads as
  IPNI orchestration, not a separate run mode.
SgtPooki added a commit that referenced this pull request May 14, 2026
Two adjacent observability bugs left IPNI failures invisible whenever
the outer pg-boss retrieval-job timeout fired:

1. IpniVerificationService.verify checked signal?.aborted before
   verificationSignal.aborted. On a race where both signals were
   asserted by the time the catch handler ran, signal.throwIfAborted()
   re-threw before the ipni_verification_timed_out log could fire.
   Inner check now runs first so the inner timeout log fires whenever
   the inner timeout signal aborted, regardless of outer state.

2. RetrievalService.verifyIpniForRetrieval's catch returned a silent
   failure.timedout when signal?.aborted, with no log. Added
   retrieval_ipni_verification_timed_out so the retrieval-side caller
   records the abort.

Tracks dealbot#524 option C. Pairs with #538 (parallel IPNI + transport).
Copy link
Copy Markdown
Contributor

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes to get aligned on what counters are emitted and when.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://github.com/FilOzone/dealbot/blob/main/docs/checks/data-storage.md#deal-status-progression and https://github.com/FilOzone/dealbot/blob/main/docs/checks/events-and-metrics.md#retrievalStatus, this retrievalStatus counter is supposed to increase when the ipfsRetrievalIntegrityChecked event fires. It wasn't intended to be when ipfsRetrievalIntegrityChecked and ipniVerificationComplete.

Assuming we implement it that way, do we even need a counter for when ipfsRetrievalIntegrityChecked and ipniVerificationComplete occur? Back when specing this out, I wasn't seeing what that got us.

My understanding of the situation:

  • Today: retrievalStatus is successful when ipfsRetrievalIntegrityChecked and ipniVerificationCompleteoccur with success, and those happen sequentially.
  • Your proposal:
    • retrievalStatus is successful when ipfsRetrievalIntegrityChecked and ipniVerificationComplete occur with success, and those happen in parallel.
    • Add retrievalTransportStatus for when ipfsRetrievalIntegrityChecked fires.
  • My proposal (as originally speced, or at least what was meant but I can accept it wasn't clear)
    • retrievalStatus when ipfsRetrievalIntegrityChecked fires.
    • Discoverability and Retrieval portions happen in parallel.
    • No additional counters.

Comment on lines +105 to +106
| <a id="retrievalStatus"></a>`retrievalStatus` | Data Storage, Retrieval | [`ipfsRetrievalIntegrityChecked`](#ipfsRetrievalIntegrityChecked) | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). Composite: `success` requires both `discoverabilityStatus=success` and `retrievalTransportStatus=success`. | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) |
| <a id="retrievalTransportStatus"></a>`retrievalTransportStatus` | Retrieval | When the `/ipfs` transport stage completes (success, all-blocks fetched and validated; or failure on HTTP, content, or abort). Scoped to the transport stage; does **not** include IPNI discoverability. Pair with [`discoverabilityStatus`](#discoverabilityStatus) to attribute a `retrievalStatus` failure to IPNI vs transport. | `success`, `failure.timedout`, `failure.other` from [Data Storage Sub-status meanings](./data-storage.md#sub-status-meanings). | [`retrieval.service.ts`](../../apps/backend/src/retrieval/retrieval.service.ts) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per file comment, I'm questioning whether we need to change anything here. I think we can just implement retrievalStatus as originally specced aand just have one counter here.

Comment thread docs/checks/retrievals.md
### Retrieval Checks

For each selected piece, dealbot performs the following in parallel:
For each selected piece, dealbot performs the following in parallel. Each stage emits its own sub-status counter ([`discoverabilityStatus`](./events-and-metrics.md#discoverabilityStatus) for IPNI, [`retrievalTransportStatus`](./events-and-metrics.md#retrievalTransportStatus) for `/ipfs`). The composite [`retrievalStatus`](./events-and-metrics.md#retrievalStatus) records `success` only when both sub-statuses succeed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure more documentation is needed here, but if we are wanting to add extra documentation around metrics, lets put it in https://github.com/FilOzone/dealbot/blob/main/docs/checks/retrievals.md#metrics-recorded ?

@github-project-automation github-project-automation Bot moved this from 📌 Triage to ⌨️ In Progress in FOC May 15, 2026
Copy link
Copy Markdown
Collaborator

@silent-cipher silent-cipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code portion looks good overall. Just one comment.

)
: Promise.resolve({ ok: true as const });

const [transportSettled, ipniSettled] = await Promise.allSettled([transportPromise, ipniPromise]);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may extend runtime to the slower of the two timeouts. For example, we could end up waiting for the IPNI timeout even though the transport has already implied failure.timedout since there’s no early exit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ⌨️ In Progress

Development

Successfully merging this pull request may close these issues.

5 participants