data_retention_poll reports success when its subgraph dependency is down, so outages fire no alarm

The `data_retention_poll` job records `jobs_completed_total{handler_result="success"}` on every run, including when the data-retention check could not run because its subgraph dependency was unavailable. Job success should mean the check was able to execute, so a dependency outage should be recorded as a job failure.

## Impact

A subgraph endpoint outage took down mainnet and calibration data-retention checks for about 6 days with no alarm, and SP approval stalled on insufficient `dataSetChallengeStatus` samples. The external cause was fixed in FilOzone/infra#216 (config) and follow-up is tracked in FilOzone/pdp-explorer#106 (stable tags). This issue covers the silent failure and the alerting gap that hid it.

## Root cause

Three things combined to hide the outage:

- `pollDataRetention()` catches the subgraph error without rethrowing (`data-retention.service.ts:192`).
- `handleDataRetentionJob()` hardcodes `return "success"` (`jobs.service.ts:656`).
- The `data_retention_poll jobs stopped` alarm counts `success` and `error` completions together, so it stayed green while the poll failed hourly.
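For readers without the code open, the failure mode can be reproduced in a few lines. This is a simplified sketch, not the actual service code: `querySubgraph`, `pollSwallowingErrors`, and `handleJobWithHardcodedResult` are hypothetical stand-ins, and everything is synchronous for brevity where the real handlers are presumably async.

```typescript
type HandlerResult = "success" | "error";

// Hypothetical stand-in for the subgraph query; it always throws to simulate an outage.
function querySubgraph(): string[] {
  throw new Error("subgraph endpoint unreachable");
}

// The error is caught and logged but not rethrown, so the caller never sees it.
function pollSwallowingErrors(log: string[]): void {
  try {
    querySubgraph();
  } catch (err) {
    log.push(`data retention poll failed: ${(err as Error).message}`);
  }
}

// The handler result is a constant, independent of what the poll did.
function handleJobWithHardcodedResult(log: string[]): HandlerResult {
  pollSwallowingErrors(log);
  return "success";
}

const outageLog: string[] = [];
const outageResult = handleJobWithHardcodedResult(outageLog);
console.log(outageResult, outageLog.length); // prints "success 1"
```

The only trace of the outage is the log line; the metric label says `success`, so any alarm built on job completions has nothing to react to.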

## Proposed fix

- Code: `pollDataRetention()` fails the job when a check dependency is not operating (subgraph fetch or query failure, baseline-load failure, missing endpoint in prod). The trigger is dependency failure, not output volume; a poll over an empty-but-healthy provider set and transient per-provider failures stay `success`.
- Alarm: filter the `data_retention_poll jobs stopped` alarm to `handler_result='success'`, and confirm an absent `success` series (for example a pod rotating during an outage that only emits `error`) fires rather than evaluating to no data.
- Optional run heartbeat: a metric such as `data_retention_poll_success_timestamp_seconds`, set only when the poll ran against a healthy subgraph, alerted on staleness. This is preferable to a `dataSetChallengeStatus` zero-sample alarm: that counter only moves on proving-period deltas (~24h on mainnet), so a zero-sample alarm would be noisy.
- Sibling: audit `providers_refresh`, which discards the boolean from `loadProviders()` (`wallet-sdk.service.ts:90`) and also hardcodes `success`.
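One possible shape for the code change is sketched below, assuming the poll reports a tagged outcome that the handler maps to `handler_result`. The `PollOutcome` type and the `loadProviders` / `checkProvider` parameters are hypothetical and chosen for illustration; the real fix could equally rethrow a dedicated error. The functions are synchronous only to keep the sketch short.

```typescript
type HandlerResult = "success" | "error";

// Hypothetical outcome type: either the check ran, or a dependency stopped it.
type PollOutcome =
  | { kind: "ran"; providersChecked: number; providerFailures: number }
  | { kind: "dependency_unavailable"; dependency: string; reason: string };

// Dependency failure aborts the poll; per-provider failures are only counted.
function pollDataRetention(
  loadProviders: () => string[],
  checkProvider: (id: string) => void
): PollOutcome {
  let providers: string[];
  try {
    providers = loadProviders();
  } catch (err) {
    return {
      kind: "dependency_unavailable",
      dependency: "subgraph",
      reason: (err as Error).message,
    };
  }
  let providerFailures = 0;
  for (const id of providers) {
    try {
      checkProvider(id);
    } catch (_err) {
      providerFailures += 1; // transient, does not fail the job
    }
  }
  return { kind: "ran", providersChecked: providers.length, providerFailures };
}

// The result now depends on whether the check was able to execute.
function handleDataRetentionJob(poll: () => PollOutcome): HandlerResult {
  return poll().kind === "ran" ? "success" : "error";
}

const noop = (_id: string): void => {};
const flaky = (id: string): void => {
  if (id === "sp2") throw new Error("timeout");
};

const duringOutage = handleDataRetentionJob(() =>
  pollDataRetention(() => { throw new Error("503"); }, noop)
);
const emptyButHealthy = handleDataRetentionJob(() => pollDataRetention(() => [], noop));
const partialFailure = handleDataRetentionJob(() =>
  pollDataRetention(() => ["sp1", "sp2"], flaky)
);
console.log(duringOutage, emptyButHealthy, partialFailure); // prints "error success success"
```

Keying the result on the outcome kind rather than on `providersChecked` keeps the trigger tied to dependency health, so an empty provider set does not page anyone.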

## Definition of done

- [ ] `data_retention_poll` records `handler_result="error"` when the subgraph or another check dependency is unavailable; partial per-provider failures stay `success`.
- [ ] The job-completion alarm fires on a sustained dependency outage, including when only `error` results exist.
- [ ] `providers_refresh` failure handling reviewed, then fixed or confirmed correct.
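
The alarm criterion can be stated as a small predicate, which may help when reviewing the actual alert rule. This is a sketch of the intended semantics only, not the alerting configuration: the `Completion` shape, the function name, and the three-hour window are assumptions made for illustration.

```typescript
type HandlerResult = "success" | "error";

// Hypothetical record of one job completion as the alarm would see it.
interface Completion {
  timestampMs: number;
  result: HandlerResult;
}

// Fires when no successful completion falls inside the window ending at nowMs.
// An empty or error-only history fires too, rather than reading as "no data".
function jobsStoppedAlarmFires(
  history: Completion[],
  nowMs: number,
  windowMs: number
): boolean {
  const recentSuccesses = history.filter(
    (c) => c.result === "success" && nowMs - c.timestampMs <= windowMs
  );
  return recentSuccesses.length === 0;
}

const HOUR_MS = 60 * 60 * 1000;
const WINDOW_MS = 3 * HOUR_MS; // assumed window, not taken from the real alarm
const now = 10 * HOUR_MS;

// Pod that only ever emitted errors during the outage.
const errorOnly: Completion[] = [
  { timestampMs: now - 2 * HOUR_MS, result: "error" },
  { timestampMs: now - 1 * HOUR_MS, result: "error" },
];
// Healthy job with one success in the window.
const healthy: Completion[] = [{ timestampMs: now - 1 * HOUR_MS, result: "success" }];

console.log(
  jobsStoppedAlarmFires(errorOnly, now, WINDOW_MS),
  jobsStoppedAlarmFires([], now, WINDOW_MS),
  jobsStoppedAlarmFires(healthy, now, WINDOW_MS)
); // prints "true true false"
```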

Evidence, code references, and cross-agent review notes: https://gist.github.com/SgtPooki/a4cb6d7eac4fdf0c5fd99e84b1635443

