Skip to content

data_retention_poll reports success when its subgraph dependency is down, so outages fire no alarm #591

@SgtPooki

Description

@SgtPooki

The data_retention_poll job records jobs_completed_total{handler_result="success"} on every run, including when the data-retention check could not run because its subgraph dependency was unavailable. Job success should mean the check was able to execute, so a dependency outage should be recorded as a job failure.

Impact

A subgraph endpoint outage took down mainnet and calibration data-retention checks for about 6 days with no alarm, and SP approval stalled on insufficient dataSetChallengeStatus samples. The external cause was fixed in FilOzone/infra#216 (config) and follow-up is tracked in FilOzone/pdp-explorer#106 (stable tags). This issue covers the silent failure and the alerting gap that hid it.

Root cause

pollDataRetention() catches the subgraph error without rethrowing (data-retention.service.ts:192), handleDataRetentionJob() hardcodes return "success" (jobs.service.ts:656), and the data_retention_poll jobs stopped alarm counts success and error completions together, so it stayed green while the job errored hourly.

Proposed fix

  • Code: pollDataRetention() fails the job when a check dependency is not operating (subgraph fetch or query failure, baseline-load failure, missing endpoint in prod). The trigger is dependency failure, not output volume; a poll over an empty-but-healthy provider set and transient per-provider failures stay success.
  • Alarm: filter the data_retention_poll jobs stopped alarm to handler_result='success', and confirm an absent success series (for example a pod rotating during an outage that only emits error) fires rather than evaluating to no data.
  • Optional run heartbeat: a metric such as data_retention_poll_success_timestamp_seconds, set only when the poll ran against a healthy subgraph, alerted on staleness. Not a dataSetChallengeStatus zero-sample alarm: that counter only moves on proving-period deltas (~24h on mainnet), so it would be noisy.
  • Sibling: audit providers_refresh, which discards the boolean from loadProviders() (wallet-sdk.service.ts:90) and also hardcodes success.

Definition of done

  • data_retention_poll records handler_result="error" when the subgraph or another check dependency is unavailable; partial per-provider failures stay success.
  • The job-completion alarm fires on a sustained dependency outage, including when only error results exist.
  • providers_refresh failure handling reviewed, then fixed or confirmed correct.

Evidence, code references, and cross-agent review notes: https://gist.github.com/SgtPooki/a4cb6d7eac4fdf0c5fd99e84b1635443

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingready-for-workTriaged: scope, plan, and DoD are clear; contributor can pick up

Type

No type
No fields configured for issues without a type.

Projects

Status
📌 Triage

Relationships

None yet

Development

No branches or pull requests

Issue actions