fix(deal): combine FWSS, PDPVerifier, and SP-HTTP probes for dataset liveness #537
Merged
…liveness
`isDataSetLive` returns true only when all three signals agree:
WarmStorage.validateDataSet, PDPVerifier.dataSetLive, and an unauthenticated
POST to the SP's `/pdp/data-sets/{id}/pieces` endpoint. Curio returns HTTP 409
on that endpoint when `unrecoverable_proving_failure_epoch` is set, which is
the only signal observable when a dataset is dead on the SP but chain still
reports it as live.
Chain probes rethrow on transient errors so transient outages don't get
misclassified as termination. The SP HTTP probe treats any non-409 response
(including auth failures and network errors) as live.
Threads `providerAddress` through `isDataSetLive` so the SP probe can resolve
the serviceURL from the provider registry.
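The SP HTTP probe rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `probeSpDataSetLive`, the `FetchLike` type, the numeric `dataSetId`, and sending an empty string as the body are all assumptions made for the example.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub.
// (Assumption: the real service may use a different HTTP client.)
type FetchLike = (
  url: string,
  init: { method: string; body: string },
) => Promise<{ status: number }>;

// Hypothetical helper: reports "not live" only on a conclusive HTTP 409.
async function probeSpDataSetLive(
  serviceURL: string,
  dataSetId: number,
  fetchImpl: FetchLike,
): Promise<boolean> {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const base = serviceURL.replace(/\/+$/, "");
  const url = `${base}/pdp/data-sets/${dataSetId}/pieces`;
  try {
    // Unauthenticated POST with an empty body.
    const res = await fetchImpl(url, { method: "POST", body: "" });
    // Only 409 is treated as a termination signal; 401, 404, 5xx all count as live.
    return res.status !== 409;
  } catch {
    // Network errors are inconclusive, so the dataset is treated as live.
    return true;
  }
}
```

Defaulting to "live" on every inconclusive outcome keeps a flaky SP from triggering the destructive repair path.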
Contributor
Pull request overview
Updates backend dataset liveness classification to be more robust by composing multiple independent termination signals (on-chain FWSS, on-chain PDPVerifier, and an SP HTTP probe) so that repair can trigger for datasets Curio has already marked unrecoverably terminated even when chain state hasn’t propagated yet.
Changes:
- Expands `DealService.isDataSetLive(...)` into a composite check that runs FWSS `validateDataSet`, PDPVerifier `dataSetLive`, and an SP HTTP `POST /pdp/data-sets/{id}/pieces` probe (409 => terminated).
- Threads `providerAddress` through call sites that need SP registry info for the HTTP probe.
- Adds unit tests covering the three probes, including SP HTTP 409 behavior, request shape, and transient-error handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| apps/backend/src/deal/deal.service.ts | Implements composite liveness probing (FWSS + PDPVerifier + SP HTTP) and updates provisioning status logic to use it. |
| apps/backend/src/deal/deal.service.spec.ts | Adds mocks/stubs and new test cases validating the combined probe behavior and SP HTTP request details. |
…errors
`Promise.all` previously let a transient chain-probe rejection mask a conclusive SP HTTP 409. Switch to `Promise.allSettled` and return false when any probe positively reports terminated; only rethrow when no probe reported termination.
Also require the SP HTTP probe to match `unrecoverable proving failure` in the response body in addition to HTTP 409. Defends against a future Curio reusing 409 for a non-terminal conflict, which would otherwise trigger destructive `terminateDataSet` + deal cleanup.
Tests cover both the SP-409-rescues-transient-chain-error path and the 409-with-different-body-treated-as-live path.
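The aggregation rule in this commit message can be illustrated with a short sketch. The names `allProbesLive`, `Probe`, and `isTerminalSpResponse` are hypothetical, and the exact substring test is an assumption; the actual matching logic in `deal.service.ts` may differ.

```typescript
// Each probe resolves true (live) or false (terminated), or rejects on a transient error.
type Probe = () => Promise<boolean>;

// Hypothetical aggregator: a positive "terminated" verdict wins over rejections.
async function allProbesLive(probes: Probe[]): Promise<boolean> {
  // allSettled waits for every probe, so one rejection cannot hide another result.
  const results = await Promise.allSettled(probes.map((p) => p()));
  if (results.some((r) => r.status === "fulfilled" && r.value === false)) {
    return false;
  }
  // No probe reported termination: surface the first transient error, if any.
  const failed = results.find((r) => r.status === "rejected");
  if (failed && failed.status === "rejected") {
    throw failed.reason;
  }
  return true;
}

// Hypothetical classifier: a 409 counts as terminal only with the expected body text.
function isTerminalSpResponse(status: number, body: string): boolean {
  return status === 409 && body.includes("unrecoverable proving failure");
}
```

With `Promise.all`, the first rejection would short-circuit the whole check; settling every probe first is what lets a conclusive SP 409 take precedence over a transient chain error.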
SgtPooki
commented
May 14, 2026
Collaborator
Author
SgtPooki
left a comment
self review.. a little heavy on the tests here, but at least our expectations are documented.
silent-cipher
approved these changes
May 15, 2026
Collaborator
silent-cipher
left a comment
Overall looks good to me. Just one comment.
This was referenced May 18, 2026
What changed
`DealService.isDataSetLive(providerAddress, dataSetId, signal)` now runs three independent liveness probes and returns true only when all agree:
- `WarmStorage.validateDataSet` (chain) - catches FWSS-side termination. Preserves the behaviour from PR #518 (fix: handle PDP-terminated datasets via data_set_creation repair): rethrows on non-terminal errors so a transient RPC outage cannot misclassify a healthy dataset as terminated.
- `PDPVerifier.dataSetLive` (chain, via `@filoz/synapse-core/pdp-verifier`) - catches PDPVerifier-side termination.
- `POST /pdp/data-sets/{id}/pieces` with an empty body (off-chain, unauthenticated) - catches Curio's `unrecoverable_proving_failure_epoch` state, where the SP refuses `addPieces` with HTTP 409 while both chain signals still report the dataset live.
The SP HTTP probe is the only signal that detects PDP-terminated datasets on `sp-playground` on calibration today, where 197/197 provider-24 datasets return 409 from the addPieces endpoint while both chain probes report them live. Curio's handler returns 409 exclusively for the terminated check on this endpoint (curio/pdp/handlers_add.go#L302-L305), so the probe matches on status code rather than body text.
The SP HTTP probe treats any non-409 response (including 401, 404, 5xx, and network errors) as live to avoid triggering destructive repair on transient SP outages or future auth changes.
`getDataSetProvisioningStatus` and the existing repair flow (`data-set-creation.handler.ts` -> `repairTerminatedDataSet`) are untouched; the stronger probe automatically widens what they classify as terminated.
How to verify
`pnpm --filter dealbot-backend test src/deal/deal.service.spec.ts`
New cases cover each probe returning false in isolation, transient errors, and the request shape sent to the SP.
Notes
AbortSignal.
`DATASET_CREATIONS_PER_SP_PER_HOUR=0.25` on staging means the existing 197 terminated `sp-playground` datasets will repair at ~1 per 4 hours per provider once deployed. No backfill script needed; the scheduled `data_set_creation` job picks them up via the wider `isDataSetLive`.
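As a rough worked example of the repair rate above, the backlog drain time can be estimated as below. It assumes that the per-provider rate limit is the only constraint and that all 197 datasets belong to a single provider; `hoursToDrain` is a hypothetical helper, not part of the codebase.

```typescript
// Hypothetical helper: hours needed to repair a backlog at a fixed per-provider rate.
function hoursToDrain(backlog: number, creationsPerHour: number): number {
  if (creationsPerHour <= 0) {
    throw new RangeError("creationsPerHour must be positive");
  }
  return backlog / creationsPerHour;
}

// 197 datasets at 0.25 creations per hour (one every 4 hours).
const drainHours = hoursToDrain(197, 0.25); // 788 hours
const drainDays = drainHours / 24; // roughly 32.8 days
```

Under those assumptions the backlog would take about a month to clear on staging.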