Skip to content

Integration does not auto-recover from a transient HTTP 502 — stays unavailable indefinitely after one panel-side blip #242

@rancur

Description

@rancur

Summary

The integration enters a permanent unavailable state after a single transient HTTP 502 (or other 5xx) from the panel's local API, and never auto-retries on its own. The panel itself remains reachable and healthy; only a manual config-entry reload (or HA core restart) recovers the integration. This is reminiscent of the long-closed #29 but is happening on the current 2.0.x line and at the connection-handler level rather than at entity state.

Environment

  • SPAN Panel running spanos2 firmware (v3 series)
  • Integration: span_panel v2.0.6
  • Home Assistant: recent stable
  • Network: panel is on the local LAN; sub-2ms ping; HTTP 200 from /api/v1/status once the integration is reloaded

Observed behaviour

  • Integration was healthy for weeks, then a brief panel-side blip produced a single HTTP 502 from /api/v1/status.
  • All entities transitioned to unavailable and stayed that way for ~4 days continuously.
  • During the outage the panel itself remained reachable from the HA host — manual curl http://<panel>/api/v1/status returned 200 the whole time; ping was sub-2ms.
  • Recovery was instantaneous after issuing a config-entry reload via POST /api/config/config_entries/entry/<entry_id>/reload — no re-auth, no zeroconf rediscovery, no HA restart. Every entity came back within seconds.

Expected behaviour

A transient 5xx or timeout should not be terminal. The integration should retry on a backoff cadence (or re-arm the coordinator) so it recovers without operator intervention once the panel is healthy again.

Why this is distinct from existing fixes

The v2.0.7 release notes call out unavailable fixes for the door, grid-islandable, and BESS-connected binary sensors — those address entity-state mapping when the panel reports UNKNOWN. That is the right fix for the entity layer, but it does not address the connection-handler layer where the coordinator gives up after a single 5xx. The pattern reported here looks like the same failure-mode family as #29 but on the 2.0.x rewrite — once the coordinator decides the panel is gone, nothing brings it back short of a reload.

Suggested direction

Two non-exclusive options:

  1. Don't treat a single 5xx as fatal in the coordinator — the existing retry / catch-up logic in _async_fetch_with_retry should bubble back into the coordinator's normal polling cadence, not flip the integration into a permanent unavailable state. The next polling interval should re-attempt.
  2. Add backoff-based reconnect — if N consecutive polls fail with 5xx / timeouts, schedule an exponential-backoff retry (with a reasonable cap, e.g. 5–10 min) using async_schedule_update_ha_state(force_refresh=True) or by re-invoking the equivalent of async_setup_entry for the live entry. Either path is strictly better than waiting for the operator to notice.

A 4-day silent outage from a single blip is the kind of thing operators only catch via the energy dashboard going flat, which is too late.

Workaround

For anyone hitting this in the meantime: a periodic config-entry reload via the HA REST API restores service immediately. It's not a fix, but it's reliable as a stopgap and doesn't disturb the panel.

Happy to help

If a maintainer can point at the right place in the 2.0.x coordinator / API client layer, happy to test a patch in production on the same panel that produced this report.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions