fix: autolock timer persistence, race conditions, and reload safety #601
raman325 wants to merge 23 commits into FutureTense:main
Conversation
cancel() can null _end_time while _persist_to_store awaits async_load, causing an AttributeError on isoformat(). Snapshot values into locals before the first await so the persist uses captured state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
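A minimal sketch of the snapshot-before-await pattern this commit describes; `TimerSketch` and its dict-backed store are hypothetical stand-ins for keymaster's `KeymasterTimer` and `Store`, and `asyncio.sleep(0)` stands in for the `async_load` await:

```python
import asyncio
from datetime import datetime, timedelta, timezone


class TimerSketch:
    def __init__(self) -> None:
        self._end_time: datetime | None = datetime.now(timezone.utc) + timedelta(minutes=5)
        self._duration: timedelta | None = timedelta(minutes=5)
        self.store: dict[str, dict] = {}

    def cancel(self) -> None:
        # cancel() may run while _persist_to_store is suspended at an await
        self._end_time = None
        self._duration = None

    async def _persist_to_store(self) -> None:
        # Snapshot into locals BEFORE the first await, so a concurrent
        # cancel() can't null _end_time out from under isoformat().
        end_time = self._end_time
        duration = self._duration
        if end_time is None or duration is None:
            return
        await asyncio.sleep(0)  # stands in for await store.async_load()
        self.store["timer"] = {
            "end_time": end_time.isoformat(),
            "duration": duration.total_seconds(),
        }


async def demo() -> dict:
    t = TimerSketch()
    task = asyncio.create_task(t._persist_to_store())
    await asyncio.sleep(0)  # let persist run up to its await
    t.cancel()              # cancel lands mid-persist
    await task              # no AttributeError: the snapshot is used
    return t.store
```

Note this only shows the crash fix from this commit; the follow-up commits below change what the persist does when it detects that cancel won the race.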
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
##           main    #601   +/- ##
==================================
  Coverage      ?  90.17%
==================================
  Files         ?      28
  Lines         ?    3614
  Branches      ?       0
==================================
  Hits          ?    3259
  Misses        ?     355
  Partials      ?       0
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Pull request overview
Fixes a race in KeymasterTimer._persist_to_store() where cancel() can clear _end_time during an await, leading to AttributeError when persisting timer state (reported in #594).
Changes:
- Snapshot `_end_time` and `_duration` into locals before the first `await` in `_persist_to_store()`.
- Add a regression test that simulates cancellation occurring during persistence.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `custom_components/keymaster/helpers.py` | Snapshots timer state before awaiting store I/O to avoid `NoneType.isoformat()` crashes. |
| `tests/test_helpers.py` | Adds an async test intended to reproduce the concurrent-cancel persistence crash. |
After async_load yields, verify _end_time still matches the snapshot before saving. If cancel() ran during the await, bail out instead of re-adding the entry. Updates test to assert cancel wins the race. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the snapshot+re-check approach with an asyncio.Lock shared between _persist_to_store and _remove_from_store, fully preventing interleaved store operations that could resurrect canceled timers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
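A compact sketch of the shared-lock approach this commit describes; the `StoreOps` class and its dict store are stand-ins (not keymaster's real code), with `asyncio.sleep(0)` standing in for the store's `async_load`/`async_save` awaits:

```python
import asyncio


class StoreOps:
    def __init__(self, lock: asyncio.Lock) -> None:
        self._store_lock = lock
        self._data: dict[str, dict] = {}

    async def _persist_to_store(self, timer_id: str, entry: dict) -> None:
        async with self._store_lock:   # persist and remove can't interleave
            await asyncio.sleep(0)     # stands in for store.async_load()
            self._data[timer_id] = entry
            await asyncio.sleep(0)     # stands in for store.async_save()

    async def _remove_from_store(self, timer_id: str) -> None:
        async with self._store_lock:
            await asyncio.sleep(0)
            self._data.pop(timer_id, None)
            await asyncio.sleep(0)


async def demo() -> dict:
    ops = StoreOps(asyncio.Lock())
    persist = asyncio.create_task(ops._persist_to_store("t1", {"end": "soon"}))
    remove = asyncio.create_task(ops._remove_from_store("t1"))
    # remove queues behind persist, so it deletes the freshly persisted
    # entry rather than interleaving and resurrecting a canceled timer
    await asyncio.gather(persist, remove)
    return ops._data
```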
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
@raman325 you've got some Copilot reviews to respond to.
The previous two tests didn't actually exercise the race:

- one cleared _end_time before persist, exiting at the early guard
- the other ran start()/cancel() sequentially, so the lock was untested

Replaces both with a concurrent test using asyncio.Event to coordinate: persist holds the lock while awaiting async_load, cancel is invoked concurrently and must wait for the lock, then runs strictly after persist. Verified to fail without the asyncio.Lock and pass with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
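The coordination pattern the commit describes can be sketched like this; the coroutines and the `order` list are illustrative stand-ins for the real test's timer methods:

```python
import asyncio


async def demo() -> list[str]:
    order: list[str] = []
    lock = asyncio.Lock()
    in_persist = asyncio.Event()

    async def persist() -> None:
        async with lock:
            order.append("persist:start")
            in_persist.set()        # signal: we are holding the lock
            await asyncio.sleep(0)  # stands in for store.async_load()
            order.append("persist:end")

    async def cancel() -> None:
        await in_persist.wait()     # wait until persist holds the lock
        async with lock:            # must queue behind persist
            order.append("cancel")

    await asyncio.gather(persist(), cancel())
    return order
```

With the lock in place, cancel runs strictly after persist; remove the `async with lock` from `cancel` and the "cancel" entry can land between "persist:start" and "persist:end".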
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
The per-instance asyncio.Lock only protected one timer at a time. Since all KeymasterTimer instances persist into the same _timer_store dict, two timers could load the dict concurrently, mutate their own keys, and the later async_save would clobber the earlier write — silently dropping another timer's persisted entry. Hoists the lock to the coordinator (one per _timer_store) and accepts it via setup(store_lock=...). The per-instance default is kept as a safe fallback for tests/standalone use. Adds a regression test that fails without the shared lock and passes with it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
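A sketch of why the lock must be shared per store, not per timer; the whole-dict read-modify-write below models the load-then-save pattern the commit describes, and all names are stand-ins:

```python
import asyncio


class TimerSketch:
    """Two timers persisting into one shared store dict, like _timer_store."""

    def __init__(self, timer_id: str, store: dict, store_lock: asyncio.Lock) -> None:
        self.timer_id = timer_id
        self._store = store
        self._store_lock = store_lock  # one lock per store, coordinator-owned

    async def persist(self, entry: dict) -> None:
        async with self._store_lock:
            await asyncio.sleep(0)      # stands in for store.async_load()
            data = dict(self._store)    # whole-dict read-modify-write:
            data[self.timer_id] = entry  # without the SHARED lock, the later
            await asyncio.sleep(0)      # save would clobber the earlier
            self._store.clear()         # timer's key
            self._store.update(data)


async def demo() -> dict:
    store: dict[str, dict] = {}
    shared = asyncio.Lock()             # like the coordinator's _timer_store_lock
    a = TimerSketch("a", store, shared)
    b = TimerSketch("b", store, shared)
    await asyncio.gather(a.persist({"end": "t+5m"}), b.persist({"end": "t+9m"}))
    return store
```

With per-instance locks, both timers would load the empty dict concurrently and the second save would drop the first timer's entry; the shared lock serializes the two read-modify-write cycles so both keys survive.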
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
The per-instance asyncio.Lock default existed only to keep tests passing without threading a lock through every fixture. In production the coordinator always supplies the shared lock, and a per-instance lock is the bug we just fixed (cross-timer overwrite race). Make store_lock a required setup() parameter so the type system enforces correct wiring, and update existing tests to pass the lock explicitly. Adds an assertion in test_setup_timer_passes_timer_id_and_store that the coordinator wires its shared _timer_store_lock to setup(), so a future refactor can't silently drop it and reintroduce the race. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a config entry was reloaded, _update_lock replaced the old KeymasterLock with a new instance and called _setup_timer() to wire up a fresh KeymasterTimer. The old timer was orphaned but its async_call_later callback stayed scheduled, holding a stale kmlock reference. On firing it would invoke _timer_triggered with the dead kmlock, producing duplicate or out-of-date autolock actions. Adds KeymasterTimer.detach() — like cancel(), but skips the store removal so the replacement timer can resume from the persisted entry on setup(). _update_lock now detaches the old timer before _setup_timer runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
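The cancel-versus-detach contrast can be sketched as follows; the class, the `_unsubbed` flag (standing in for `async_call_later`'s unsubscribe), and the dict store are all hypothetical stand-ins:

```python
class TimerSketch:
    def __init__(self, store: dict) -> None:
        self._store = store
        self._detached = False
        self._unsubbed = False
        self._end_time: str | None = "2025-01-01T00:05:00+00:00"

    def _cancel_callbacks(self) -> None:
        self._unsubbed = True          # stale scheduled callback can no longer fire

    def cancel(self) -> None:
        self._cancel_callbacks()
        self._store.pop("t1", None)    # user cancel: forget the autolock
        self._end_time = None

    def detach(self) -> None:
        self._cancel_callbacks()       # orphan can't fire with a stale kmlock
        self._detached = True
        # Unlike cancel(): the store entry and _end_time survive, so the
        # replacement timer's setup() can resume from persisted state.


store = {"t1": {"end_time": "2025-01-01T00:05:00+00:00"}}
old = TimerSketch(store)
old.detach()
```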
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
…rden detach

- _persist_to_store / _remove_from_store: log error instead of silently no-opping when _store_lock is missing. The old behavior would mask a bug where setup() was skipped, silently dropping autolock state across restarts.
- _persist_to_store: clarify the snapshot comment — it guards against detach() nulling _end_time while the persist was queued behind the lock, not against an unprotected race (the lock handles the latter).
- detach(): wrap _cancel_callbacks() in try/finally so an unsub raise (e.g. during HA shutdown) doesn't leave the orphan with stale refs.
- coordinator._update_lock: trim 4-line comment to 1, since detach()'s docstring covers the why.
- Drop the redundant _store_lock field comment.
- Add test for _persist_to_store after detach() being a no-op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Three real bugs caught by Copilot review of 7391c0a:

1. setup() didn't reset _detached: rollback path reuses the detached timer instance via _setup_timer(current); without resetting the flag, future cancel() calls would silently no-op and the user couldn't stop the restored autolock.
2. setup() recovery path removed the store entry BEFORE running the action. If the lock action failed during startup recovery, the entry was gone and the autolock state was lost forever instead of being replayed on next restart. Mirrors the _on_expired contract: extract a _run_recovery_action helper that only removes on success.
3. _update_lock rollback only re-attached the timer, not listeners. _unsubscribe_listeners runs early in the try block; if a later step raises, the surviving lock has no listeners until another reload. Add _update_listeners(current) to the rollback path.

Plus coverage gap (behavior D from review): no test was asserting _persist_to_store / _remove_from_store raise RuntimeError when _store_lock is missing. Add two short tests so a silent regression back to log-and-return would be caught.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY the cleanup quartet (_remove_from_store / _cancel_callbacks / _end_time / _duration) shared between cancel() and _on_expired's success path. Per code-simplifier review. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
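A minimal sketch of the shared-cleanup helper this commit describes; the surrounding class, its flags, and the dict store are stand-ins, not keymaster's real code:

```python
import asyncio


class TimerSketch:
    def __init__(self) -> None:
        self.store: dict[str, dict] = {"t1": {"end_time": "t+5m"}}
        self.callbacks_cancelled = False
        self._end_time: str | None = "t+5m"
        self._duration: int | None = 300

    async def _remove_from_store(self) -> None:
        self.store.pop("t1", None)

    def _cancel_callbacks(self) -> None:
        self.callbacks_cancelled = True

    async def _clear_timer_state(self) -> None:
        """The cleanup quartet shared by cancel() and _on_expired's success path."""
        await self._remove_from_store()
        self._cancel_callbacks()
        self._end_time = None
        self._duration = None

    async def cancel(self) -> None:
        await self._clear_timer_state()


async def demo() -> "TimerSketch":
    t = TimerSketch()
    await t.cancel()
    return t
```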
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.
…claim
Two unrelated fixes batched:
1. State transfer: Add KeymasterCodeSlotDayOfWeek.derive_from_existing()
and KeymasterCodeSlot.derive_from_existing() classmethods, plus
KeymasterLock.inherit_state_from() instance method that orchestrates
the entire old→new state transfer. The brittle 40-line copy block in
coordinator._update_lock collapses to `new.inherit_state_from(old)`.
Per review: this business logic belongs on the dataclass, not buried
in the coordinator.
2. setup() recovery race: revert to remove-under-lock + schedule action
pattern. The previous "background _run_recovery_action that removes
on success" introduced two new races (Copilot-flagged):
- A new start() between setup-return and recovery-completion gets
silently wiped by the recovery's _remove_from_store
- Two setup()s on the same timer_id (rollback) both see the entry
before the background recovery removes it → double-fire
The atomic-claim approach (remove BEFORE scheduling) prevents both.
Trade-off: recovery action failure loses the autolock state. Accept
this for the startup-only edge case — user's next lock interaction
creates a fresh autolock cycle.
Replace the now-obsolete "preserves on action failure" test with one
that verifies the atomic-claim semantics: two concurrent setup()s
yield exactly one action call.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
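The state-transfer helpers described in point 1 above might look roughly like this; the field names and the simplified two-dataclass shape are illustrative assumptions, not keymaster's real dataclasses:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CodeSlot:
    pin: str | None = None
    enabled: bool = False

    @classmethod
    def derive_from_existing(cls, old: CodeSlot) -> CodeSlot:
        # Fresh instance carrying over the user-visible state
        return cls(pin=old.pin, enabled=old.enabled)


@dataclass
class Lock:
    lock_state: str = "unknown"
    code_slots: dict[int, CodeSlot] = field(default_factory=dict)

    def inherit_state_from(self, old: Lock) -> None:
        # One call replaces the coordinator's brittle 40-line copy block.
        self.lock_state = old.lock_state
        self.code_slots = {
            num: CodeSlot.derive_from_existing(slot)
            for num, slot in old.code_slots.items()
        }


old = Lock(lock_state="locked", code_slots={1: CodeSlot(pin="1234", enabled=True)})
new = Lock()
new.inherit_state_from(old)
```

Putting the copy logic on the dataclasses keeps each class responsible for knowing which of its own fields carry user state, so the coordinator can't silently drift out of sync when a field is added.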
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
…etup

Two more Copilot-flagged real reload races:

1. Detaching the old timer BEFORE _unsubscribe_listeners exposed a window where in-flight provider/door callbacks (lock/door state change handlers) could call old.autolock_timer.start()/cancel() on the now-detached timer — start() would silently fail (refs nulled), cancel() would noop (_detached=True). A user lock/unlock racing with reload could either drop the autolock entirely or leave a stale store entry that the replacement timer replays.
   Fix: swap the order — unsubscribe listeners first, then detach. detach still awaits the in-flight _on_expired so the earlier "timer fires during await chain" concern stays addressed.
2. setup() called on a previously-detached instance (rollback path) only reset _detached. detach() preserves _end_time/_duration so an in-flight persist can still write them — but if setup's load found no entry in the store, those preserved fields stayed set on the rebound instance, making is_running falsely report True for a timer with no scheduled callback.
   Fix: also clear _end_time/_duration in setup(). They get repopulated by _resume() if there's a valid persisted entry.

Test updates:
- test_update_lock_detaches_old_timer_before_any_await — invert the ordering assertion to match the new correct order
- test_keymaster_timer_setup_resets_detached_and_state — extend to also assert _end_time/_duration are cleared and is_running is False after re-setup with an empty store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
…lback
Two more Copilot-flagged real reload races:
1. Clearing _end_time/_duration in setup() BEFORE acquiring _store_lock
meant a queued in-flight _persist_to_store would see None on its
post-lock recheck and abort, dropping a start() that was issued just
before reload.
Fix: don't clear pre-lock. Inside the lock, after the load:
- If the store has an entry, use it (existing behavior).
- If the store is empty BUT _end_time is set (preserved by detach
from a recent start()), claim it: write back + _resume.
- If the store is empty AND _end_time is None, clear stale fields
so is_running doesn't lie.
This handles both Copilot concerns (don't drop pending state, don't
leave stale fields) without losing the queued persist's data.
2. _update_lock rollback restored listeners BEFORE the timer, mirroring
the original buggy ordering: a provider/door callback could arrive
in the gap and hit the still-detached timer or no timer at all.
Fix: restore in reverse order — _setup_timer first, then
_update_listeners. Listeners only fire after both timer and
listeners exist on the rebound kmlock.
Test updates:
- Split test_keymaster_timer_setup_resets_detached_and_state into:
- test_keymaster_timer_setup_resets_detached_flag (just _detached)
- test_keymaster_timer_setup_rescues_preserved_state (the new
rescue+resume behavior)
- test_keymaster_timer_setup_clears_state_when_no_pending (the
idle-rebind path)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
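The three-way branch inside the lock (store entry wins; store empty but state preserved: claim it; nothing anywhere: idle rebind) can be sketched like this — the class, the `"t1"` key, and the boolean flags are stand-ins for illustration:

```python
import asyncio
from datetime import datetime, timedelta, timezone


class TimerSketch:
    def __init__(self, store: dict, store_lock: asyncio.Lock) -> None:
        self._store = store
        self._store_lock = store_lock
        self._end_time: datetime | None = None
        self.resumed = False
        self.recovered = False

    def _resume(self) -> None:
        self.resumed = True

    async def setup(self) -> None:
        async with self._store_lock:        # don't clear state pre-lock: a
            entry = self._store.get("t1")   # queued persist may still be due
            now = datetime.now(timezone.utc)
            if entry is not None:
                self._end_time = entry["end_time"]      # persisted entry wins
                self._resume()
            elif self._end_time is not None:
                if self._end_time <= now:
                    self.recovered = True               # expired: fire recovery,
                    self._end_time = None               # don't resurrect on disk
                else:
                    # store empty but detach() preserved state: claim it
                    self._store["t1"] = {"end_time": self._end_time}
                    self._resume()
            # else: idle rebind — nothing preserved, nothing to resume


async def demo() -> "TimerSketch":
    store: dict[str, dict] = {}             # empty: the queued persist never ran
    t = TimerSketch(store, asyncio.Lock())
    t._end_time = datetime.now(timezone.utc) + timedelta(minutes=5)  # preserved by detach()
    await t.setup()
    return t
```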
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Two more Copilot-flagged real reload races:

1. _update_lock creates a FRESH KeymasterTimer for `new`. Its setup() can rescue from disk but not from the old timer's in-memory _end_time. If the old timer's queued persist hadn't run by the time detach() ran, the disk had no entry — and the new timer's in-memory _end_time is None — so the autolock was silently lost.
   Fix: detach() now force-persists in-memory state to disk before nulling refs. Acquires the store lock, checks if the entry already exists (queued persist already ran), and writes only if missing. This guarantees the disk has the autolock state for the replacement timer's setup() to load.
2. The previous-round rescue branch in setup() didn't check whether the preserved _end_time was already in the past. If reload took longer than the autolock delay, the rescue would write an expired entry back to disk and _resume from it. A subsequent reload before the immediate-fire callback ran could replay the same overdue autolock.
   Fix: rescue branch now branches on expiry. If expired, schedule the recovery action immediately and DON'T write back. If not expired, claim and resume as before.

Test updates:
- test_keymaster_timer_detach_preserves_store: load now returns the entry that start() persisted, so detach's force-persist sees it exists and skips writing
- new test_keymaster_timer_detach_force_persists_when_store_empty: verifies the force-persist behavior when the queued persist hasn't run
- new test_keymaster_timer_setup_rescues_expired_state_without_resurrecting: verifies the expired-rescue path fires the action without writing back to disk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
```python
# Unsubscribe listeners FIRST so in-flight provider/door callbacks
# can't reach the about-to-be-detached timer (where start()/cancel()
# would silently fail or no-op). Then detach the old timer; its
# detach() awaits any in-flight _on_expired so mutations it made on
# old (e.g. pending_retry_lock) are visible to the state copy.
# Once detached, the old timer is unusable until the new timer is
# set up — wrap the rest in try/except that restores listeners and
# the autolock on whichever kmlock is current if any step fails.
try:
    await KeymasterCoordinator._unsubscribe_listeners(old)
    if old.autolock_timer:
```
Acknowledging — the in-flight _lock_locked() race is real but is fundamentally a coordinator-side ordering issue (lock event handler sets kmlock.lock_state and awaits notification dismissals BEFORE calling cancel()). Fixing it cleanly requires reorganizing the lock event handler flow so cancel happens before any awaitable work, which is meaningfully larger scope than this PR. Marking as deferred follow-up.
Three more Copilot-flagged real reload races:

1. setup() recovery actions launched via hass.async_create_task were untracked. detach() only awaited _on_expired_task, not these. A reload during recovery could copy the old kmlock's state before _timer_triggered finished mutating it, losing the recovery work.
   Fix: replace _on_expired_task with a generic _inflight_tasks set that tracks both _on_expired callbacks AND setup() recovery actions. detach() awaits all of them. setup() uses _track_recovery(), which adds the task to the set before scheduling.
2. detach()'s force-persist swallowed exceptions and continued. When the timer state existed only in memory (the exact race force-persist exists for), failure meant the replacement came up empty and autolock was silently disabled.
   Fix: propagate the exception. _update_lock's outer try/except catches it and runs the rollback path, which re-attempts setup — loud failure beats silent state loss.
3. (Deferred) The _lock_locked() in-flight race is acknowledged in a reply to the Copilot thread. Real but coordinator-side: the lock event handler sets state and awaits notification dismissals BEFORE calling cancel(). Fixing it cleanly requires reorganizing the lock event handler flow, which is meaningfully larger scope than this PR's persistence/reload fixes.

New regression tests:
- test_keymaster_timer_detach_awaits_recovery_action — verifies detach blocks for an in-flight setup() recovery action
- test_keymaster_timer_detach_propagates_force_persist_failure — verifies the exception is no longer swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
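The unified task-tracking pattern from point 1 can be sketched as follows; the class and the `recovery_action` coroutine are illustrative stand-ins:

```python
import asyncio


class TimerSketch:
    def __init__(self) -> None:
        self._inflight_tasks: set[asyncio.Task] = set()

    def _track_recovery(self, coro) -> asyncio.Task:
        # Track the task so detach() can await it; drop it on completion.
        task = asyncio.ensure_future(coro)
        self._inflight_tasks.add(task)
        task.add_done_callback(self._inflight_tasks.discard)
        return task

    async def detach(self) -> None:
        # Await every in-flight action (recovery and _on_expired alike) so
        # mutations on the old kmlock are visible to the state copy.
        for task in list(self._inflight_tasks):
            if not task.done():
                await task


async def demo() -> list[str]:
    done: list[str] = []

    async def recovery_action() -> None:
        await asyncio.sleep(0)
        done.append("recovery")

    t = TimerSketch()
    t._track_recovery(recovery_action())
    await t.detach()  # blocks until the recovery action finishes
    return done
```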
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
```python
for task in list(self._inflight_tasks):
    if not task.done():
        try:
            await task
        except Exception:
            _LOGGER.exception("[KeymasterTimer] In-flight task raised during detach")
```
```python
if self._call_action is None:
    return
action = self._call_action
try:
    await action(now)
```
Closing this PR in favor of a redesign. After 17 review iterations covering ~30 races and edge cases, it's clear the underlying design needs to change. Will be replaced by a fresh PR built on:
Original:
Summary
Hardens the keymaster autolock timer against a family of concurrency bugs around persistence, cancellation, and config-entry reload. Started as a fix for the `AttributeError` reported in #594's latest comment and grew via 17+ review iterations into a comprehensive fix for store races, cross-timer overwrites, orphaned-callback issues, reload-rollback safety, listener-ordering races, force-persistence on detach, expiry-aware rescue, tracked recovery actions, and brittle state transfer.

Proposed change
Locking & persistence (`helpers.py`):

- One `asyncio.Lock` per `_timer_store` (one lock per Store, owned by the coordinator, threaded through `KeymasterTimer.setup(store_lock=…)` as a required parameter)
- `_persist_to_store` and `_remove_from_store` serialize via the shared lock; persist uses snapshot-then-recheck to handle cancel races
- `setup()`'s entire load+resume runs under the lock so reloads can't read a stale empty store while the outgoing timer's persist is queued
- `setup()` recovery removes the expired entry atomically with detection (under lock) so two concurrent `setup()`s on the same `timer_id` can't both schedule the recovery action
- `setup()` resets `_detached` and rescues any preserved `_end_time` (from a `start()` just before reload) — branches on expiry: if not expired, claim and resume; if expired, fire as recovery without resurrecting on disk
- A missing `_store_lock` raises `RuntimeError` instead of silently no-opping (which would have lost autolock state across restarts)

`detach()` for clean reload handoff:

- `KeymasterTimer.detach()` cancels callbacks, sets `_detached=True`, nulls hass/kmlock/call_action, but preserves `_end_time`/`_duration` and the store entry so the replacement timer can resume from disk
- `detach()` awaits ALL in-flight action tasks — both `_on_expired` callbacks AND `setup()` recovery actions, tracked via a unified `_inflight_tasks` set — so any mutations on the old kmlock are visible to the state copy
- `detach()` force-persists in-memory state to disk before nulling refs and propagates any storage failure so the rollback path triggers (no silent autolock loss)
- `cancel()` is a noop on detached timers (in-flight callbacks must remove the entry directly via `_clear_timer_state`, not via cancel)

`_on_expired` semantics:

- Tracked in `_inflight_tasks` so `detach()` can await completion
- Shared `_clear_timer_state()` helper (DRY with `cancel`)

State transfer (`lock.py`):

- `KeymasterCodeSlotDayOfWeek.derive_from_existing()` and `KeymasterCodeSlot.derive_from_existing()` classmethods
- `KeymasterLock.inherit_state_from(old)` instance method centralizes the entire old→new state transfer (lock state, autolock config, retry state, code slots + DOW)
- `coordinator._update_lock` collapses to `new.inherit_state_from(old)`

Coordinator `_update_lock`:

- `_unsubscribe_listeners(old)` BEFORE `detach()` so in-flight provider/door callbacks can't reach the about-to-be-detached timer (where `start()`/`cancel()` would silently fail)
- `await detach()` on the old timer before the state-copy and kmlocks-replacement, preventing the old timer from firing during the await chain against the soon-to-be-orphaned kmlock
- Rollback restores `_setup_timer` first, then `_update_listeners` (mirror of teardown order) on whichever kmlock is current if any inner step raises (no silent autolock-disabled or listener-disabled state on partial failure)

Tests:

- `_on_expired` interactions, action-failure preservation, atomic-claim recovery, `_update_lock` rollback, listener restoration ordering, state rescue on rebind, expired-rescue, force-persist on detach, force-persist failure propagation, tracked recovery actions, ordering invariants, and `inherit_state_from`. Each new test verified to fail without its corresponding fix.

Known follow-up

Copilot identified a coordinator-side race where `_lock_locked()` sets state and awaits notification dismissals BEFORE calling `cancel()`. If reload hits during those awaits, `cancel()` noops on the detached timer, leaving the persisted entry to fire a phantom (cosmetic) autolock later. Fixing this cleanly requires reorganizing the lock event handler flow so `cancel()` happens before any awaitable work — meaningfully larger scope than this PR. Tracked as a follow-up.

Type of change
Additional information
🤖 Generated with Claude Code