23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,29 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.6.4] - 05/2026

### Fixed

- **MQTT reconnect now self-heals after persistent failure** — `AsyncMqttBridge._reconnect_loop` rebuilds the paho client from scratch (re-fetching the panel CA, constructing a fresh client, resetting the Homie accumulator) after
`MQTT_FULL_REBUILD_AFTER_FAILURES` (3) consecutive failures, or immediately on any `ssl.SSLError`. The previous behavior pinned the panel's CA certificate into the paho client once at `connect()` time and re-used it across all reconnect attempts; if the
panel rotated its private CA — most plausibly during a firmware upgrade — every subsequent reconnect raised `ssl.SSLCertVerificationError` (caught by the broad `OSError` clause and silently retried) and the bridge could not recover without a config-entry
reload. The rebuild mirrors what a manual reload does without going through HA's `config_entry` teardown, so entities stay registered and the integration's grace-period logic continues to apply unchanged. The threshold-cadence design (counter reset on
every rebuild attempt, success or failure) keeps the recovery path active throughout extended outages: multi-day disconnections recover whenever the panel becomes usable again, even if the CA rotates a second time mid-outage. See
`SpanPanel_Docs/span-panel-api/2026-05-17-mqtt-ca-refresh-on-reconnect-design.md` for the full design.
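
The threshold cadence described in this entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the bridge's actual code: `ReconnectPolicy` and its methods are hypothetical names, and only the constant's name and value (3) come from the entry above.

```python
import ssl

MQTT_FULL_REBUILD_AFTER_FAILURES = 3  # value quoted in the changelog entry


class ReconnectPolicy:
    """Hypothetical helper: decides when a failed reconnect should trigger a full rebuild."""

    def __init__(self, threshold: int = MQTT_FULL_REBUILD_AFTER_FAILURES) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def on_success(self) -> None:
        # A successful reconnect clears the failure streak.
        self.consecutive_failures = 0

    def on_failure(self, exc: BaseException) -> bool:
        """Record a failure; return True if a full client rebuild should be attempted."""
        self.consecutive_failures += 1
        tls_problem = isinstance(exc, ssl.SSLError)
        if tls_problem or self.consecutive_failures >= self.threshold:
            # Reset regardless of how the rebuild turns out, so the next
            # rebuild is attempted after another `threshold` failures.
            self.consecutive_failures = 0
            return True
        return False


policy = ReconnectPolicy()
# Seven plain connection failures: a rebuild is requested on the 3rd and 6th.
decisions = [policy.on_failure(OSError("connection refused")) for _ in range(7)]
# A TLS failure requests a rebuild immediately, whatever the current count.
tls_decision = policy.on_failure(ssl.SSLError("certificate verify failed"))
```

Because the counter resets on every rebuild attempt, the rebuild path keeps firing at a steady cadence for as long as the outage lasts instead of running once and giving up.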

### Added

- **`AsyncMqttBridge._rebuild_client()`** — internal recovery method invoked by the reconnect loop on persistent failure. Re-fetches the panel CA via `download_ca_cert()`, builds a fresh paho client via the new `_make_paho_client()` factory, fires the
optional pre-rebuild callback so consumers can reset their own state, tears down the old client, and submits the initial connect via the executor. Restores the previous client on any failure.
- **`AsyncMqttBridge.set_pre_rebuild_callback()`** — internal API for `SpanMqttClient` to register a hook that fires before each rebuild. Used to reset the Homie accumulator so retained messages on the new subscription start from a clean slate.
- **`MQTT_FULL_REBUILD_AFTER_FAILURES`** constant in `mqtt/const.py`.

### Changed

- **`SpanPanelAPIError` now in the bridge's CA-fetch exception list** — a `download_ca_cert()` failure during rebuild (e.g. panel returns HTTP 502 mid-outage) is caught, logged at WARNING, and the loop continues retrying with the previous client instead of
letting the reconnect task die.
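
As a rough illustration of why the exception list matters, the sketch below catches a CA-fetch failure, logs it at WARNING, and carries on with the previous state. `SpanPanelAPIError` is redefined locally as a stand-in, and `CA_FETCH_ERRORS` and `refresh_ca` are hypothetical names, not identifiers from the library.

```python
import logging
from typing import Callable

_LOGGER = logging.getLogger("mqtt_bridge_sketch")


class SpanPanelAPIError(Exception):
    """Local stand-in for the library's exception of the same name."""


# Hypothetical tuple of errors treated as "CA fetch failed, try again later".
CA_FETCH_ERRORS = (OSError, SpanPanelAPIError)


def refresh_ca(fetch_ca: Callable[[], str], current_ca: str) -> str:
    """Return a freshly fetched CA, or the current one if the fetch fails."""
    try:
        return fetch_ca()
    except CA_FETCH_ERRORS as exc:
        # An exception type missing from the tuple would propagate and
        # end the surrounding reconnect task.
        _LOGGER.warning("CA refresh failed, keeping previous client: %s", exc)
        return current_ca


def failing_fetch() -> str:
    raise SpanPanelAPIError("HTTP 502 from panel")


ca = refresh_ca(failing_fetch, "ca-v1")              # failure: previous CA kept
ca_after_recovery = refresh_ca(lambda: "ca-v2", ca)  # panel back: new CA adopted
```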

## [2.6.2] - 04/2026

### Changed
3 changes: 1 addition & 2 deletions pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "span-panel-api"
-version = "2.6.2"
+version = "2.6.4"
description = "A client library for SPAN Panel API"
authors = [
{name = "SpanPanel"}
@@ -129,7 +129,6 @@ omit = [
"*/tests/*",
"*/.venv/*",
"*/venv/*",
-"src/span_panel_api/mqtt/connection.py",
]

[tool.coverage.report]
34 changes: 34 additions & 0 deletions src/span_panel_api/mqtt/client.py
@@ -63,6 +63,10 @@ def __init__(
        self._field_metadata: dict[str, FieldMetadata] | None = None
        self._schema_hash: str | None = None
        self._previous_schema_types: HomieSchemaTypes | None = None
        # Cached at connect() so the pre-rebuild hook can reconstruct the
        # Homie accumulator with the same panel size after a transport-level
        # rebuild. Schema cannot change within a session, so caching is safe.
        self._panel_size: int | None = None

    def _require_homie(self) -> HomieDeviceConsumer:
        """Return the HomieDeviceConsumer, raising if not yet connected."""
@@ -115,6 +119,7 @@ async def connect(self) -> None:

        # Fetch schema to determine panel size and build field metadata
        schema = await get_homie_schema(self._host, port=self._panel_http_port)
        self._panel_size = schema.panel_size
        self._accumulator = HomiePropertyAccumulator(self._serial_number)
        self._homie = HomieDeviceConsumer(self._accumulator, schema.panel_size)

@@ -157,6 +162,11 @@ async def connect(self) -> None:
        # Wire message handler
        self._bridge.set_message_callback(self._on_message)
        self._bridge.set_connection_callback(self._on_connection_change)
        # Pre-rebuild hook: reset Homie accumulator before the bridge swaps
        # paho clients, so retained messages on the new subscription start
        # from a clean slate (no stale `$state=disconnected` cached from
        # the original outage).
        self._bridge.set_pre_rebuild_callback(self._on_pre_rebuild)

        # Connect to broker
        _LOGGER.debug("MQTT: Connecting to broker...")
@@ -369,6 +379,30 @@ def _on_connection_change(self, connected: bool) -> None:
except Exception: # pylint: disable=broad-exception-caught
_LOGGER.warning("Connection callback raised", exc_info=True)

    def _on_pre_rebuild(self) -> None:
        """Reset Homie accumulator state before the bridge rebuilds its paho client.

        Called synchronously from the bridge's `_rebuild_client` before the
        old paho client is torn down and the new one is wired up. Discards
        any stale `$state=disconnected` cached during the outage so the
        new subscription's retained messages repopulate from a clean slate.

        Schema-derived state (`_field_metadata`, `_schema_hash`,
        `_previous_schema_types`) is intentionally preserved — the Homie
        schema cannot change within a session, so the cache remains valid
        and a refetch would just add cost. If the panel reboots and the
        schema actually changed, the existing drift-detection log fires on
        the next session's `connect()`.
        """
        if self._panel_size is None:
            # Pre-rebuild fired before connect() cached the panel size.
            # Treat as a no-op — there is no accumulator state to reset
            # because connect() never completed.
            return
        _LOGGER.debug("Pre-rebuild — resetting Homie accumulator")
        self._accumulator = HomiePropertyAccumulator(self._serial_number)
        self._homie = HomieDeviceConsumer(self._accumulator, self._panel_size)

async def _wait_for_circuit_names(self, timeout: float) -> None:
"""Wait for all circuit-like nodes to have a ``name`` property.
