Summary
The integration enters a permanent unavailable state after a single transient HTTP 502 (or other 5xx) from the panel's local API, and never auto-retries on its own. The panel itself remains reachable and healthy; only a manual config-entry reload (or HA core restart) recovers the integration. This is reminiscent of the long-closed #29 but is happening on the current 2.0.x line and at the connection-handler level rather than at entity state.
Environment
- SPAN Panel running spanos2 firmware (v3 series)
- Integration:
span_panel v2.0.6
- Home Assistant: recent stable
- Network: panel is on the local LAN; sub-2ms ping; HTTP 200 from
/api/v1/status once the integration is reloaded
Observed behaviour
- Integration was healthy for weeks, then a brief panel-side blip produced a single HTTP 502 from
/api/v1/status.
- All entities transitioned to
unavailable and stayed that way for ~4 days continuously.
- During the outage the panel itself remained reachable from the HA host — manual
curl http://<panel>/api/v1/status returned 200 the whole time; ping was sub-2ms.
- Recovery was instantaneous after issuing a config-entry reload via
POST /api/config/config_entries/entry/<entry_id>/reload — no re-auth, no zeroconf rediscovery, no HA restart. Every entity came back within seconds.
Expected behaviour
A transient 5xx or timeout should not be terminal. The integration should retry on a backoff cadence (or re-arm the coordinator) so it recovers without operator intervention once the panel is healthy again.
Why this is distinct from existing fixes
The v2.0.7 release notes call out unavailable fixes for the door, grid-islandable, and BESS-connected binary sensors — those address entity-state mapping when the panel reports UNKNOWN. That is the right fix for the entity layer, but it does not address the connection-handler layer where the coordinator gives up after a single 5xx. The pattern reported here looks like the same failure-mode family as #29 but on the 2.0.x rewrite — once the coordinator decides the panel is gone, nothing brings it back short of a reload.
Suggested direction
Two non-exclusive options:
- Don't treat a single 5xx as fatal in the coordinator — the existing retry / catch-up logic in
_async_fetch_with_retry should bubble back into the coordinator's normal polling cadence, not flip the integration into a permanent unavailable state. The next polling interval should re-attempt.
- Add backoff-based reconnect — if N consecutive polls fail with 5xx / timeouts, schedule an exponential-backoff retry (with a reasonable cap, e.g. 5–10 min) using
async_schedule_update_ha_state(force_refresh=True) or by re-invoking the equivalent of async_setup_entry for the live entry. Either path is strictly better than waiting for the operator to notice.
A 4-day silent outage from a single blip is the kind of thing operators only catch via the energy dashboard going flat, which is too late.
Workaround
For anyone hitting this in the meantime: a periodic config-entry reload via the HA REST API restores service immediately. It's not a fix, but it's reliable as a stopgap and doesn't disturb the panel.
Happy to help
If a maintainer can point at the right place in the 2.0.x coordinator / API client layer, happy to test a patch in production on the same panel that produced this report.
Summary
The integration enters a permanent
unavailablestate after a single transient HTTP 502 (or other 5xx) from the panel's local API, and never auto-retries on its own. The panel itself remains reachable and healthy; only a manual config-entry reload (or HA core restart) recovers the integration. This is reminiscent of the long-closed #29 but is happening on the current 2.0.x line and at the connection-handler level rather than at entity state.Environment
span_panelv2.0.6/api/v1/statusonce the integration is reloadedObserved behaviour
/api/v1/status.unavailableand stayed that way for ~4 days continuously.curl http://<panel>/api/v1/statusreturned 200 the whole time; ping was sub-2ms.POST /api/config/config_entries/entry/<entry_id>/reload— no re-auth, no zeroconf rediscovery, no HA restart. Every entity came back within seconds.Expected behaviour
A transient 5xx or timeout should not be terminal. The integration should retry on a backoff cadence (or re-arm the coordinator) so it recovers without operator intervention once the panel is healthy again.
Why this is distinct from existing fixes
The v2.0.7 release notes call out
unavailablefixes for the door, grid-islandable, and BESS-connected binary sensors — those address entity-state mapping when the panel reportsUNKNOWN. That is the right fix for the entity layer, but it does not address the connection-handler layer where the coordinator gives up after a single 5xx. The pattern reported here looks like the same failure-mode family as #29 but on the 2.0.x rewrite — once the coordinator decides the panel is gone, nothing brings it back short of a reload.Suggested direction
Two non-exclusive options:
_async_fetch_with_retryshould bubble back into the coordinator's normal polling cadence, not flip the integration into a permanent unavailable state. The next polling interval should re-attempt.async_schedule_update_ha_state(force_refresh=True)or by re-invoking the equivalent ofasync_setup_entryfor the live entry. Either path is strictly better than waiting for the operator to notice.A 4-day silent outage from a single blip is the kind of thing operators only catch via the energy dashboard going flat, which is too late.
Workaround
For anyone hitting this in the meantime: a periodic config-entry reload via the HA REST API restores service immediately. It's not a fix, but it's reliable as a stopgap and doesn't disturb the panel.
Happy to help
If a maintainer can point at the right place in the 2.0.x coordinator / API client layer, happy to test a patch in production on the same panel that produced this report.