feat: cap deal/retrievals with abort signals by SgtPooki · Pull Request #263 · FilOzone/dealbot

SgtPooki · 2026-02-11T19:50:09Z

Summary
Adds end-to-end abort propagation with shared utilities, introduces job-level timeout enforcement for deal/retrieval jobs, and improves observability/testing around aborted runs and retrieval failures. Updates HTTP timeout defaults and documents new job timeout env vars.

Problem
Long-running deal/retrieval jobs and downstream steps lacked consistent abort handling, causing wasted work and incomplete observability when timeouts or cancellations occur. Abort reasons could be lost, and job metrics didn’t clearly distinguish aborts from failures.

Solution

Added abort-utils helpers (createAbortError, awaitWithAbort, delay) with tests.
Propagated AbortSignal through deal/retrieval flows, add‑ons, and IPNI polling; prevent new work on abort while preserving partial results.
Job runner now enforces per‑job timeouts (deal/retrieval) via AbortController, records handler_result="aborted", and keeps success vs business failure semantics.
Retrieval results carry an aborted flag; errors preserve non‑Error abort reasons.
Added/updated tests for abort behavior and error preservation.
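
For readers unfamiliar with the pattern, here is a minimal sketch of what helpers with these three names could look like. The function names come from the PR description, but the signatures and bodies below are assumptions for illustration, not the code in `abort-utils.ts`.

```typescript
// Hypothetical sketch; the PR's actual abort-utils.ts may differ.

/** Build an Error from a signal's abort reason, keeping non-Error reasons readable. */
export function createAbortError(signal: AbortSignal): Error {
  const reason: unknown = signal.reason;
  if (reason instanceof Error) return reason;
  const error = new Error(reason === undefined ? "Operation aborted" : String(reason));
  error.name = "AbortError";
  return error;
}

/** Settle with `promise`, or reject as soon as `signal` aborts. */
export function awaitWithAbort<T>(promise: Promise<T>, signal?: AbortSignal): Promise<T> {
  if (!signal) return promise;
  if (signal.aborted) {
    // Avoid an unhandled rejection from the promise we are abandoning.
    promise.catch(() => undefined);
    return Promise.reject(createAbortError(signal));
  }
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(createAbortError(signal));
    signal.addEventListener("abort", onAbort, { once: true });
    promise
      .then(resolve, reject)
      .finally(() => signal.removeEventListener("abort", onAbort));
  });
}

/** Sleep for `ms` milliseconds, rejecting early if `signal` aborts. */
export function delay(ms: number, signal?: AbortSignal): Promise<void> {
  if (signal?.aborted) return Promise.reject(createAbortError(signal));
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer); // Do not keep the event loop alive after an abort.
      reject(createAbortError(signal!));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

Note that `awaitWithAbort` in this sketch only stops waiting; the underlying work keeps running unless it also observes the signal, which is why the signal is passed down into each step.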

Notes

New env vars: DEAL_JOB_TIMEOUT_SECONDS, RETRIEVAL_JOB_TIMEOUT_SECONDS (defaults 6m/1m) and docs updated.
HTTP request timeout defaults reduced to 4m to align with expected transfer throughput.
Metrics doc now describes jobs_completed_total handler result values (success, aborted, error).
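
As a rough illustration of how the two env vars might be read, the sketch below parses them with the stated defaults (6 minutes and 1 minute). The `loadJobTimeouts` and `parseSeconds` names and the fallback behavior for invalid values are assumptions; the PR's `app.config.ts` uses its own schema and loader.

```typescript
// Hypothetical loader; only the env var names and defaults come from the PR.
interface JobTimeouts {
  dealJobTimeoutMs: number;
  retrievalJobTimeoutMs: number;
}

/** Parse a positive number of seconds, falling back when unset or invalid. */
function parseSeconds(raw: string | undefined, fallbackSeconds: number): number {
  if (raw === undefined || raw.trim() === "") return fallbackSeconds;
  const value = Number(raw);
  return Number.isFinite(value) && value > 0 ? value : fallbackSeconds;
}

export function loadJobTimeouts(env: Record<string, string | undefined>): JobTimeouts {
  return {
    dealJobTimeoutMs: parseSeconds(env.DEAL_JOB_TIMEOUT_SECONDS, 360) * 1000,
    retrievalJobTimeoutMs: parseSeconds(env.RETRIEVAL_JOB_TIMEOUT_SECONDS, 60) * 1000,
  };
}
```

In a Node process this would typically be called as `loadJobTimeouts(process.env)`.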

Fixes #258

Copilot

Pull request overview

Adds end-to-end abort propagation and job-level timeout enforcement for deal/retrieval workflows so long-running jobs can be actively cancelled while improving metrics/logging around abort vs failure.

Changes:

Introduces shared abort helpers (abort-utils) and adopts AbortSignal propagation across deal/retrieval flows (including add-ons and IPNI polling).
Enforces per-job timeouts in the pg-boss job runner and records handler_result="aborted" for timed-out executions.
Updates defaults/docs for job timeouts and HTTP request timeouts, plus adds/updates tests for abort behavior and error preservation.
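
The per-job timeout described above can be sketched roughly as follows. This is an illustrative reduction, not the job runner in `jobs.service.ts`: `runWithJobTimeout` and `classifyOutcome` are hypothetical names, and the real runner also records metrics and distinguishes business failures.

```typescript
// Hypothetical sketch of job-level timeout enforcement with AbortController.
type HandlerResult = "success" | "aborted" | "error";

/** An abort takes precedence over whether the handler threw. */
export function classifyOutcome(aborted: boolean, threw: boolean): HandlerResult {
  if (aborted) return "aborted";
  return threw ? "error" : "success";
}

export async function runWithJobTimeout(
  handler: (signal: AbortSignal) => Promise<void>,
  timeoutMs: number,
): Promise<HandlerResult> {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error(`Job timed out after ${timeoutMs}ms`)),
    timeoutMs,
  );
  try {
    await handler(controller.signal);
    // A handler may return normally with partial results after seeing the abort.
    return classifyOutcome(controller.signal.aborted, false);
  } catch {
    return classifyOutcome(controller.signal.aborted, true);
  } finally {
    clearTimeout(timer); // Never leave the timeout pending after the job ends.
  }
}
```

The returned value corresponds to the `handler_result` label values listed for `jobs_completed_total` (success, aborted, error).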

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| docs/environment-variables.md | Documents new job timeout env vars and adds them to the quick reference. |
| apps/backend/src/retrieval/retrieval.service.ts | Propagates abort signals through retrieval execution and adjusts batch behavior on abort. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Updates tests for abort behavior and aligns Deal IDs with UUID strings. |
| apps/backend/src/retrieval-addons/types.ts | Extends retrieval test result shape with an optional `aborted` flag. |
| apps/backend/src/retrieval-addons/retrieval-addons.service.ts | Adds abort checks, uses shared abort-aware `delay`, and improves error capture for non-`Error` throws (partially). |
| apps/backend/src/retrieval-addons/retrieval-addons.service.spec.ts | Adds test ensuring non-`Error` throws are captured in execution results. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Documents `handler_result` semantics for `jobs_completed_total`. |
| apps/backend/src/jobs/jobs.service.ts | Enforces job-level timeouts via `AbortController` and reports aborted jobs distinctly. |
| apps/backend/src/jobs/jobs.service.spec.ts | Adds metrics and timeout-abort tests for deal/retrieval jobs; updates private-call signatures. |
| apps/backend/src/deal/deal.service.ts | Propagates abort signal into deal creation/upload/IPNI/retrieval checks; preserves non-`Error` error messages. |
| apps/backend/src/deal/deal.service.spec.ts | Adds coverage for preserving non-`Error` abort reasons through deal creation failure recording. |
| apps/backend/src/deal-addons/strategies/ipni.strategy.ts | Propagates abort signal through IPNI monitoring/polling and uses abort-aware delay. |
| apps/backend/src/deal-addons/interfaces/deal-addon.interface.ts | Extends `onUploadComplete` to accept an optional abort signal. |
| apps/backend/src/deal-addons/deal-addons.service.ts | Propagates abort signal through upload-complete add-on handlers and uses `awaitWithAbort`. |
| apps/backend/src/config/app.config.ts | Adds config schema + loader for job timeout env vars; reduces default HTTP request timeouts. |
| apps/backend/src/common/abort-utils.ts | Adds shared helpers: `createAbortError`, `awaitWithAbort`, and abort-aware `delay`. |
| apps/backend/src/common/abort-utils.spec.ts | Adds unit tests for abort utilities. |
| apps/backend/.env.example | Adds new job timeout vars and updates HTTP timeout defaults/comments. |

Comments suppressed due to low confidence (2)

apps/backend/src/retrieval-addons/retrieval-addons.service.ts:207

When a retrieval promise rejects in testAllRetrievalMethods, the recorded error uses result.reason?.message || "Unknown error". If a strategy throws a non-Error (the new spec covers this), .message will be undefined and the reason is lost. Prefer result.reason instanceof Error ? result.reason.message : String(result.reason) so execution results preserve the real failure details.

    const executionResults: RetrievalExecutionResult[] = results.map((result, index) => {
      if (result.status === "fulfilled") {
        return result.value;
      } else {
        // Create failed result - retryCount unknown for catastrophic failures
        return {
          url: urlResults[index].url,
          method: urlResults[index].method,
          data: Buffer.alloc(0),
          metrics: {
            latency: 0,
            ttfb: 0,
            throughput: 0,
            statusCode: 0,
            timestamp: new Date(),
            responseSize: 0,
          },
          success: false,
          error: result.reason?.message || "Unknown error",
          retryCount: undefined, // Unknown for catastrophic failures
        };
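
A small sketch of the suggested change, assuming a hypothetical helper named `describeRejection`; only the `instanceof Error` expression itself comes from the comment above.

```typescript
// Hypothetical helper illustrating the suggested fix for rejected results.

/** Turn any rejection reason into a message without losing non-Error values. */
export function describeRejection(reason: unknown): string {
  return reason instanceof Error ? reason.message : String(reason);
}

/** The current expression, kept here only for comparison. */
export function describeRejectionLossy(reason: unknown): string {
  return (reason as { message?: string } | undefined)?.message || "Unknown error";
}

/** Collect one message per rejected entry of a Promise.allSettled result. */
export function collectErrors(results: PromiseSettledResult<unknown>[]): string[] {
  return results
    .filter((r): r is PromiseRejectedResult => r.status === "rejected")
    .map((r) => describeRejection(r.reason));
}
```

With a strategy that throws the string `"gateway timeout"`, the lossy version records `"Unknown error"`, while `describeRejection` records `"gateway timeout"`.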

apps/backend/src/retrieval/retrieval.service.ts:201

performAllRetrievals logs "All retrievals failed" at error level for any thrown error, including aborts thrown via signal.throwIfAborted(). This will create noisy failure logs for expected cancellations/timeouts. Consider skipping the error log (or downgrading to warn) when signal?.aborted is true, similar to the batch-level handling above.

    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);
      this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
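
One way the suggestion could look, as a sketch: the `MinimalLogger` interface and `logRetrievalFailure` function are hypothetical stand-ins for the service's logger and catch block, and downgrading to `warn` (rather than skipping the log) is just one of the two options mentioned.

```typescript
// Hypothetical sketch: pick the log level based on whether the job was aborted.
interface MinimalLogger {
  warn(message: string): void;
  error(message: string): void;
}

export function logRetrievalFailure(
  logger: MinimalLogger,
  pieceCid: string,
  error: unknown,
  signal?: AbortSignal,
): void {
  const errorMessage = error instanceof Error ? error.message : String(error);
  if (signal?.aborted) {
    // Expected cancellation or timeout: keep it out of the error stream.
    logger.warn(`Retrievals aborted for ${pieceCid}: ${errorMessage}`);
    return;
  }
  logger.error(`All retrievals failed for ${pieceCid}: ${errorMessage}`);
}
```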

BigLep

@SgtPooki: if we don't get @silent-cipher's review during his 2026-02-13 workday, I think this would be a good candidate for having another agent double-check the change. It should be able to reason about AbortControllers and their standard behavior, and then trace through the code to make sure the signal is propagated everywhere.

silent-cipher

Looks good to me! Nothing blocking, just a few comments.

Co-authored-by: Steve Loeppky <biglep@protocol.ai>

SgtPooki · 2026-02-13T17:44:44Z

Also updated deal and retrieval timeouts to 6m and 1m respectively.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

apps/backend/src/jobs/jobs.service.ts:360

Inside handleRetrievalJob, timeoutMs is declared for the job-level abort timer and then re-declared inside recordJobExecution for the interval-based retrieval deadline. Reusing the same name makes it easy to pass the wrong value in future edits; renaming one of them would reduce confusion and prevent subtle bugs.

      try {
        const timeoutsConfig = this.configService.get("timeouts");
        const intervalMs = data.intervalSeconds * 1000;
        const timeoutMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
        const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);

SgtPooki added 2 commits February 11, 2026 11:47

fix: checks and jobs have a maximum timeout

09a51d7

fix: propogate aborts down to checks

a2bd22f

Copilot AI review requested due to automatic review settings February 11, 2026 19:50

SgtPooki linked an issue Feb 11, 2026 that may be closed by this pull request

we need to set deal/retrieval max timeout #258

Closed

FilOzzy added this to FOC Feb 11, 2026

github-project-automation Bot moved this to 📌 Triage in FOC Feb 11, 2026

Copilot started reviewing on behalf of SgtPooki February 11, 2026 19:50

Copilot AI reviewed Feb 11, 2026

SgtPooki self-assigned this Feb 11, 2026

SgtPooki moved this from 📌 Triage to 🔎 Awaiting review in FOC Feb 11, 2026

SgtPooki mentioned this pull request Feb 11, 2026

fix: remove RETRIEVAL_TIMEOUT_BUFFER_MS #266

Merged

chore: address pr comments

963b5bd

SgtPooki requested a review from silent-cipher February 11, 2026 20:39

This was referenced Feb 11, 2026

lower timeouts for deals/retrievals #267

Open

fix: use single pgboss queue to enforce per SP lock #247

Merged

BigLep reviewed Feb 13, 2026

Comment thread docs/environment-variables.md

Comment thread docs/environment-variables.md Outdated

Comment thread docs/environment-variables.md Outdated

silent-cipher approved these changes Feb 13, 2026

Comment thread apps/backend/src/jobs/jobs.service.ts Outdated

Comment thread apps/backend/src/jobs/jobs.service.ts

Comment thread apps/backend/src/deal/deal.service.ts Outdated

BigLep moved this from 🔎 Awaiting review to ✔️ Approved by reviewer in FOC Feb 13, 2026

SgtPooki and others added 4 commits February 13, 2026 11:27

chore: address pr comments

8bde4bb

Merge branch 'main' into 258-we-need-to-set-dealretrieval-max-timeout

c1f6963

Update docs/environment-variables.md

15735d4

Co-authored-by: Steve Loeppky <biglep@protocol.ai>

Update docs/environment-variables.md

1a31734

Co-authored-by: Steve Loeppky <biglep@protocol.ai>

SgtPooki requested a review from Copilot February 13, 2026 17:44

Copilot started reviewing on behalf of SgtPooki February 13, 2026 17:45

Copilot AI reviewed Feb 13, 2026

SgtPooki added 2 commits February 13, 2026 16:37

Merge branch 'main' into 258-we-need-to-set-dealretrieval-max-timeout

d439bdc

chore: address pr comments

d818327

rjan90 added this to the M4.1: mainnet ready milestone Feb 16, 2026

Merge branch 'main' into 258-we-need-to-set-dealretrieval-max-timeout

a8d668b

SgtPooki merged commit 0623bcf into main Feb 16, 2026
6 checks passed

github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Feb 16, 2026

SgtPooki deleted the 258-we-need-to-set-dealretrieval-max-timeout branch February 16, 2026 12:58

github-actions Bot mentioned this pull request Feb 13, 2026

chore: release to production (main) #261

Merged

SgtPooki mentioned this pull request Apr 27, 2026

docs(checks): close resolved TBDs in data-storage, events, README #481

Open

4 tasks
