The data_retention_poll job records jobs_completed_total{handler_result="success"} on every run, including when the data-retention check could not run because its subgraph dependency was unavailable. Job success should mean the check was able to execute, so a dependency outage should be recorded as a job failure.
Impact
A subgraph endpoint outage took down mainnet and calibration data-retention checks for about 6 days with no alarm, and SP approval stalled on insufficient dataSetChallengeStatus samples. The external cause was fixed in FilOzone/infra#216 (config) and follow-up is tracked in FilOzone/pdp-explorer#106 (stable tags). This issue covers the silent failure and the alerting gap that hid it.
Root cause
pollDataRetention() catches the subgraph error without rethrowing (data-retention.service.ts:192), handleDataRetentionJob() hardcodes return "success" (jobs.service.ts:656), and the data_retention_poll jobs stopped alarm counts success and error completions together, so it stayed green while the job errored hourly.
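The first two causes can be illustrated with a minimal sketch. It is not the actual service code: only pollDataRetention() and handleDataRetentionJob() are names from this issue, while fetchProvidersFromSubgraph, jobsCompleted, and logged are hypothetical stand-ins, and everything is kept synchronous for brevity.

```typescript
type HandlerResult = "success" | "error";

// Hypothetical stand-in for jobs_completed_total, keyed by handler_result.
const jobsCompleted: Record<HandlerResult, number> = { success: 0, error: 0 };
// Hypothetical stand-in for the service logger.
const logged: string[] = [];

// Hypothetical subgraph call that simulates the endpoint being down.
function fetchProvidersFromSubgraph(): string[] {
  throw new Error("subgraph endpoint unavailable");
}

function pollDataRetention(): void {
  try {
    const providers = fetchProvidersFromSubgraph();
    // Per-provider checks would run here; they are never reached in an outage.
    logged.push(`checked ${providers.length} providers`);
  } catch (err) {
    // The error is logged but not rethrown, so the caller never sees it.
    logged.push(err instanceof Error ? err.message : String(err));
  }
}

function handleDataRetentionJob(): HandlerResult {
  pollDataRetention();
  return "success"; // hardcoded, regardless of what happened above
}

// One scheduled run during the outage.
jobsCompleted[handleDataRetentionJob()] += 1;
```

In this sketch the run lands under success even though no check executed, and the only trace of the outage is the log line.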
Proposed fix
- Code: pollDataRetention() fails the job when a check dependency is not operating (subgraph fetch or query failure, baseline-load failure, missing endpoint in prod). The trigger is dependency failure, not output volume; a poll over an empty-but-healthy provider set and transient per-provider failures stay success.
- Alarm: filter the data_retention_poll jobs stopped alarm to handler_result='success', and confirm an absent success series (for example a pod rotating during an outage that only emits error) fires rather than evaluating to no data.
- Optional run heartbeat: a metric such as data_retention_poll_success_timestamp_seconds, set only when the poll ran against a healthy subgraph, alerted on staleness. Not a dataSetChallengeStatus zero-sample alarm: that counter only moves on proving-period deltas (~24h on mainnet), so it would be noisy.
- Sibling: audit providers_refresh, which discards the boolean from loadProviders() (wallet-sdk.service.ts:90) and also hardcodes success.
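One possible shape for the code change and the optional heartbeat is sketched below, under stated assumptions. DependencyUnavailableError, PollDeps, fetchProviders, checkProvider, and readHeartbeat are hypothetical names introduced for illustration, the heartbeat is a plain variable standing in for a real gauge, and the code is synchronous for brevity.

```typescript
type HandlerResult = "success" | "error";

// Hypothetical error type meaning "the check could not run at all".
class DependencyUnavailableError extends Error {
  constructor(dependency: string, cause: unknown) {
    const detail = cause instanceof Error ? cause.message : String(cause);
    super(`${dependency} unavailable: ${detail}`);
    this.name = "DependencyUnavailableError";
  }
}

// Hypothetical injected dependencies, so the outage can be simulated.
interface PollDeps {
  fetchProviders: () => string[]; // subgraph query; throws on outage
  checkProvider: (id: string) => void; // per-provider check; may fail transiently
}

// Stand-in for data_retention_poll_success_timestamp_seconds.
let lastHealthyPollSeconds: number | null = null;
function readHeartbeat(): number | null {
  return lastHealthyPollSeconds;
}

function pollDataRetention(deps: PollDeps): { checked: number; failed: number } {
  let providers: string[];
  try {
    providers = deps.fetchProviders();
  } catch (err) {
    // Dependency failure: surface it instead of swallowing it.
    throw new DependencyUnavailableError("subgraph", err);
  }
  let failed = 0;
  for (const id of providers) {
    try {
      deps.checkProvider(id);
    } catch (err) {
      failed += 1; // transient per-provider failure; the poll itself still ran
    }
  }
  // Set only after the poll ran against a healthy subgraph.
  lastHealthyPollSeconds = Math.floor(Date.now() / 1000);
  return { checked: providers.length, failed };
}

function handleDataRetentionJob(deps: PollDeps): HandlerResult {
  try {
    pollDataRetention(deps);
    return "success";
  } catch (err) {
    return "error"; // recorded as handler_result="error"
  }
}
```

The result depends on whether the dependency responded, not on how many providers were checked, so an empty provider list from a healthy subgraph still returns success.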
Definition of done
- data_retention_poll records handler_result="error" when the subgraph or another check dependency is unavailable; partial per-provider failures stay success.
- The data_retention_poll jobs stopped alarm fires when only error results exist.
- providers_refresh failure handling reviewed, then fixed or confirmed correct.

Evidence, code references, and cross-agent review notes: https://gist.github.com/SgtPooki/a4cb6d7eac4fdf0c5fd99e84b1635443