Integration does not auto-recover from a transient HTTP 502 — stays unavailable indefinitely after one panel-side blip

## Summary

The integration enters a permanent `unavailable` state after a single transient HTTP 502 (or other 5xx) from the panel's local API and never retries on its own. The panel itself remains reachable and healthy; only a manual config-entry reload (or HA core restart) recovers the integration. This is reminiscent of the long-closed #29, but it is happening on the current 2.0.x line and at the connection-handler level rather than at the entity-state level.

## Environment

- SPAN Panel running spanos2 firmware (v3 series)
- Integration: `span_panel` v2.0.6
- Home Assistant: recent stable
- Network: panel is on the local LAN; sub-2ms ping; HTTP 200 from `/api/v1/status` once the integration is reloaded

## Observed behaviour

- Integration was healthy for weeks, then a brief panel-side blip produced a single HTTP 502 from `/api/v1/status`.
- All entities transitioned to `unavailable` and stayed that way for ~4 days continuously.
- During the outage the panel itself remained reachable from the HA host — manual `curl http://<panel>/api/v1/status` returned 200 the whole time; ping was sub-2ms.
- Recovery was instantaneous after issuing a config-entry reload via `POST /api/config/config_entries/entry/<entry_id>/reload` — no re-auth, no zeroconf rediscovery, no HA restart. Every entity came back within seconds.

## Expected behaviour

A transient 5xx or timeout should not be terminal. The integration should retry on a backoff cadence (or re-arm the coordinator) so it recovers without operator intervention once the panel is healthy again.

## Why this is distinct from existing fixes

The v2.0.7 release notes call out `unavailable` fixes for the door, grid-islandable, and BESS-connected binary sensors — those address entity-state mapping when the panel reports `UNKNOWN`. That is the right fix for the entity layer, but it does not address the connection-handler layer where the coordinator gives up after a single 5xx. The pattern reported here looks like the same failure-mode family as #29 but on the 2.0.x rewrite — once the coordinator decides the panel is gone, nothing brings it back short of a reload.

## Suggested direction

Two non-exclusive options:

1. **Don't treat a single 5xx as fatal in the coordinator**: a failure that survives the existing retry / catch-up logic in `_async_fetch_with_retry` should fall back into the coordinator's normal polling cadence, not flip the integration into a permanent unavailable state. The next polling interval should re-attempt.
2. **Add backoff-based reconnect**: if N consecutive polls fail with 5xx / timeouts, schedule an exponential-backoff retry (with a reasonable cap, e.g. 5–10 min). The retry could go through the coordinator's `async_request_refresh()` (the entity-level `async_schedule_update_ha_state(force_refresh=True)` would also work, but it is triggered per entity), or it could reload the live entry, i.e. the equivalent of re-running `async_setup_entry`. Either path is strictly better than waiting for the operator to notice.
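
To make option 2 concrete, here is a minimal sketch of the delay calculation only, not the integration's actual code. The class name `ReconnectBackoff`, the 15 s base interval, and the threshold of 3 failures are all assumptions picked for illustration; the 600 s cap follows the 5–10 min range suggested above. The point is that a failure always produces a next retry time and never a "give up" state.

```python
from dataclasses import dataclass


@dataclass
class ReconnectBackoff:
    """Hypothetical helper: tracks consecutive poll failures and returns the next retry delay."""

    base_seconds: float = 15.0   # assumed normal polling interval
    cap_seconds: float = 600.0   # upper bound on the retry delay (10 min)
    failure_threshold: int = 3   # "N" consecutive failures before backing off
    consecutive_failures: int = 0

    def record_success(self) -> None:
        """A good poll resets the counter, so the next failure starts from the base interval."""
        self.consecutive_failures = 0

    def record_failure(self) -> float:
        """Register a failed poll and return the seconds to wait before the next attempt."""
        self.consecutive_failures += 1
        over = self.consecutive_failures - self.failure_threshold
        if over < 0:
            # Below the threshold: keep polling at the normal cadence.
            return self.base_seconds
        # At or above the threshold: double the delay per failure, up to the cap.
        return min(self.base_seconds * (2 ** (over + 1)), self.cap_seconds)
```

With these assumed defaults, eight consecutive failures give delays of 15, 15, 30, 60, 120, 240, 480 and 600 seconds, and every later failure stays at 600. A single 502 followed by a healthy panel therefore costs one polling interval rather than four days.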

A 4-day silent outage from a single blip is the kind of thing operators only catch via the energy dashboard going flat, which is too late.

## Workaround

For anyone hitting this in the meantime: a periodic config-entry reload via the HA REST API restores service immediately. It's not a fix, but it's reliable as a stopgap and doesn't disturb the panel.
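
A rough sketch of that stopgap, using only the Python standard library. The endpoint path is the one used for the manual recovery above; the function names are hypothetical, and the base URL, entry ID and long-lived access token are placeholders you would supply yourself. It assumes the usual `Authorization: Bearer <token>` header for the HA REST API.

```python
import urllib.request


def build_reload_request(base_url: str, entry_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that reloads one config entry."""
    url = f"{base_url.rstrip('/')}/api/config/config_entries/entry/{entry_id}/reload"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; this sketch assumes the endpoint needs no payload
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def reload_entry(base_url: str, entry_id: str, token: str, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url, entry_id, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `reload_entry("http://<ha-host>:8123", "<entry_id>", "<token>")` from a cron job or similar scheduler gives the periodic reload. A blind periodic reload will presumably also reload a healthy integration, so a longer interval (or a check that the entities are actually `unavailable` first) seems sensible.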

## Happy to help

If a maintainer can point at the right place in the 2.0.x coordinator / API client layer, I'm happy to test a patch in production on the same panel that produced this report.
