Skip to content

fix: autolock timer persistence, race conditions, and reload safety#601

Closed
raman325 wants to merge 23 commits intoFutureTense:mainfrom
raman325:fix/autolock-timer-persist-race
Closed

fix: autolock timer persistence, race conditions, and reload safety#601
raman325 wants to merge 23 commits intoFutureTense:mainfrom
raman325:fix/autolock-timer-persist-race

Conversation

@raman325
Copy link
Copy Markdown
Collaborator

@raman325 raman325 commented Apr 18, 2026

Summary

Hardens the keymaster autolock timer against a family of concurrency bugs around persistence, cancellation, and config-entry reload. Started as a fix for the AttributeError reported in #594's latest comment and grew via 17+ review iterations into a comprehensive fix for store races, cross-timer overwrites, orphaned-callback issues, reload-rollback safety, listener-ordering races, force-persistence on detach, expiry-aware rescue, tracked recovery actions, and brittle state transfer.

Proposed change

Locking & persistence (helpers.py):

  • Shared asyncio.Lock per _timer_store (one lock per Store, owned by coordinator, threaded through KeymasterTimer.setup(store_lock=…) as a required parameter)
  • _persist_to_store and _remove_from_store serialize via the shared lock; persist uses snapshot-then-recheck to handle cancel-races
  • setup()'s entire load+resume runs under the lock so reloads can't read a stale empty store while the outgoing timer's persist is queued
  • setup() recovery removes the expired entry atomically with detection (under lock) so two concurrent setup()s on the same timer_id can't both schedule the recovery action
  • setup() resets _detached and rescues any preserved _end_time (from a start() just before reload) — branches on expiry: if not expired, claim and resume; if expired, fire as recovery without resurrecting on disk
  • Missing _store_lock raises RuntimeError instead of silently no-opping (would have lost autolock state across restarts)

detach() for clean reload handoff:

  • New async KeymasterTimer.detach() cancels callbacks, sets _detached=True, nulls hass/kmlock/call_action, but preserves _end_time/_duration and the store entry so the replacement timer can resume from disk
  • detach() awaits ALL in-flight action tasks — both _on_expired callbacks AND setup() recovery actions, tracked via a unified _inflight_tasks set — so any mutations on the old kmlock are visible to the state copy
  • detach() force-persists in-memory state to disk before nulling refs and propagates any storage failure so the rollback path triggers (no silent autolock loss)
  • cancel() is a noop on detached timers (in-flight callbacks must remove the entry directly via _clear_timer_state, not via cancel)

_on_expired semantics:

  • Action fires first, then store cleanup (so storage failures can't block the safety-critical lock)
  • If action raises, store entry is preserved for replay on next HA restart
  • Self-registers in _inflight_tasks so detach() can await completion
  • Cleanup uses shared _clear_timer_state() helper (DRY with cancel)

State transfer (lock.py):

  • New KeymasterCodeSlotDayOfWeek.derive_from_existing() and KeymasterCodeSlot.derive_from_existing() classmethods
  • New KeymasterLock.inherit_state_from(old) instance method centralizes the entire old→new state transfer (lock state, autolock config, retry state, code slots + DOW)
  • The 40-line manual field-by-field copy block in coordinator._update_lock collapses to new.inherit_state_from(old)
  • Business logic now lives next to the field declarations, not buried in the coordinator

Coordinator _update_lock:

  • Calls _unsubscribe_listeners(old) BEFORE detach() so in-flight provider/door callbacks can't reach the about-to-be-detached timer (where start()/cancel() would silently fail)
  • Calls await detach() on the old timer before the state-copy and kmlocks-replacement, preventing the old timer from firing during the await chain against the soon-to-be-orphaned kmlock
  • Wraps the post-detach body in try/except that re-runs _setup_timer first then _update_listeners (mirror of teardown order) on whichever kmlock is current if any inner step raises (no silent autolock-disabled or listener-disabled state on partial failure)

Tests:

  • ~32 new regression tests covering the persist/cancel race, cross-timer overwrite, lock-required setup, detach/_on_expired interactions, action-failure preservation, atomic-claim recovery, _update_lock rollback, listener restoration ordering, state rescue on rebind, expired-rescue, force-persist on detach, force-persist failure propagation, tracked recovery actions, ordering invariants, and inherit_state_from. Each new test verified to fail without its corresponding fix.

Known follow-up

Copilot identified a coordinator-side race where _lock_locked() sets state and awaits notification dismissals BEFORE calling cancel(). If reload hits during those awaits, cancel() noops on the detached timer, leaving the persisted entry to fire a phantom (cosmetic) autolock later. Fixing this cleanly requires reorganizing the lock event handler flow so cancel() happens before any awaitable work — meaningfully larger scope than this PR. Tracked as a follow-up.

Type of change

  • Bugfix (non-breaking change which fixes an issue)

Additional information

🤖 Generated with Claude Code

cancel() can null _end_time while _persist_to_store awaits async_load,
causing an AttributeError on isoformat(). Snapshot values into locals
before the first await so the persist uses captured state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 18, 2026 00:08
@github-actions github-actions Bot added the bugfix Fixes a bug label Apr 18, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 18, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 90.72848% with 14 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@c01b089). Learn more about missing BASE report.
⚠️ Report is 52 commits behind head on main.

Files with missing lines Patch % Lines
custom_components/keymaster/helpers.py 92.78% 7 Missing ⚠️
custom_components/keymaster/coordinator.py 81.25% 6 Missing ⚠️
custom_components/keymaster/lock.py 95.45% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #601   +/-   ##
=======================================
  Coverage        ?   90.17%           
=======================================
  Files           ?       28           
  Lines           ?     3614           
  Branches        ?        0           
=======================================
  Hits            ?     3259           
  Misses          ?      355           
  Partials        ?        0           
Flag Coverage Δ
python 90.17% <90.72%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in KeymasterTimer._persist_to_store() where cancel() can clear _end_time during an await, leading to AttributeError when persisting timer state (reported in #594).

Changes:

  • Snapshot _end_time and _duration into locals before the first await in _persist_to_store().
  • Add a regression test that simulates cancellation occurring during persistence.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
custom_components/keymaster/helpers.py Snapshots timer state before awaiting store I/O to avoid NoneType.isoformat() crashes.
tests/test_helpers.py Adds an async test intended to reproduce the concurrent-cancel persistence crash.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_helpers.py Outdated
Comment thread custom_components/keymaster/helpers.py Outdated
raman325 and others added 2 commits April 17, 2026 20:44
After async_load yields, verify _end_time still matches the snapshot
before saving. If cancel() ran during the await, bail out instead of
re-adding the entry. Updates test to assert cancel wins the race.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the snapshot+re-check approach with an asyncio.Lock shared
between _persist_to_store and _remove_from_store, fully preventing
interleaved store operations that could resurrect canceled timers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_helpers.py Outdated
Comment thread tests/test_helpers.py Outdated
@tykeal
Copy link
Copy Markdown
Collaborator

tykeal commented May 2, 2026

@raman325 you've got some copilot reviews to respond to.

The previous two tests didn't actually exercise the race:
- one cleared _end_time before persist, exiting at the early guard
- the other ran start()/cancel() sequentially, so the lock was untested

Replaces both with a concurrent test using asyncio.Event to coordinate:
persist holds the lock while awaiting async_load, cancel is invoked
concurrently and must wait for the lock, then runs strictly after persist.
Verified to fail without the asyncio.Lock and pass with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py Outdated
The per-instance asyncio.Lock only protected one timer at a time. Since
all KeymasterTimer instances persist into the same _timer_store dict,
two timers could load the dict concurrently, mutate their own keys, and
the later async_save would clobber the earlier write — silently dropping
another timer's persisted entry.

Hoists the lock to the coordinator (one per _timer_store) and accepts
it via setup(store_lock=...). The per-instance default is kept as a
safe fallback for tests/standalone use.

Adds a regression test that fails without the shared lock and passes
with it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/coordinator.py
The per-instance asyncio.Lock default existed only to keep tests passing
without threading a lock through every fixture. In production the
coordinator always supplies the shared lock, and a per-instance lock is
the bug we just fixed (cross-timer overwrite race). Make store_lock a
required setup() parameter so the type system enforces correct wiring,
and update existing tests to pass the lock explicitly.

Adds an assertion in test_setup_timer_passes_timer_id_and_store that
the coordinator wires its shared _timer_store_lock to setup(), so a
future refactor can't silently drop it and reintroduce the race.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/coordinator.py
raman325 and others added 2 commits May 2, 2026 18:32
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a config entry was reloaded, _update_lock replaced the old
KeymasterLock with a new instance and called _setup_timer() to wire up
a fresh KeymasterTimer. The old timer was orphaned but its
async_call_later callback stayed scheduled, holding a stale kmlock
reference. On firing it would invoke _timer_triggered with the dead
kmlock, producing duplicate or out-of-date autolock actions.

Adds KeymasterTimer.detach() — like cancel(), but skips the store
removal so the replacement timer can resume from the persisted entry on
setup(). _update_lock now detaches the old timer before _setup_timer
runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py
Comment thread custom_components/keymaster/helpers.py Outdated
…rden detach

- _persist_to_store / _remove_from_store: log error instead of silently
  no-opping when _store_lock is missing. The old behavior would mask a
  bug where setup() was skipped, silently dropping autolock state across
  restarts.
- _persist_to_store: clarify the snapshot comment — it guards against
  detach() nulling _end_time while the persist was queued behind the
  lock, not against an unprotected race (the lock handles the latter).
- detach(): wrap _cancel_callbacks() in try/finally so an unsub raise
  (e.g. during HA shutdown) doesn't leave the orphan with stale refs.
- coordinator._update_lock: trim 4-line comment to 1, since detach()'s
  docstring covers the why.
- Drop the redundant _store_lock field comment.
- Add test for _persist_to_store after detach() being a no-op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py
Comment thread custom_components/keymaster/helpers.py
Comment thread custom_components/keymaster/coordinator.py Outdated
Three real bugs caught by Copilot review of 7391c0a:

1. setup() didn't reset _detached: rollback path reuses the detached
   timer instance via _setup_timer(current); without resetting the flag
   future cancel() calls would silently no-op and the user couldn't
   stop the restored autolock.

2. setup() recovery path removed the store entry BEFORE running the
   action. If the lock action failed during startup recovery, the entry
   was gone and the autolock state was lost forever instead of being
   replayed on next restart. Mirrors the _on_expired contract: extract
   _run_recovery_action helper that only removes on success.

3. _update_lock rollback only re-attached the timer, not listeners.
   _unsubscribe_listeners runs early in the try block; if a later step
   raises, the surviving lock has no listeners until another reload.
   Add _update_listeners(current) to the rollback path.

Plus coverage gap (behavior D from review): no test was asserting
_persist_to_store / _remove_from_store raise RuntimeError when
_store_lock is missing. Add two short tests so a silent regression
back to log-and-return would be caught.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DRY the cleanup quartet (_remove_from_store / _cancel_callbacks /
_end_time / _duration) shared between cancel() and _on_expired's
success path. Per code-simplifier review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py Outdated
Comment thread custom_components/keymaster/helpers.py Outdated
…claim

Two unrelated fixes batched:

1. State transfer: Add KeymasterCodeSlotDayOfWeek.derive_from_existing()
   and KeymasterCodeSlot.derive_from_existing() classmethods, plus
   KeymasterLock.inherit_state_from() instance method that orchestrates
   the entire old→new state transfer. The brittle 40-line copy block in
   coordinator._update_lock collapses to `new.inherit_state_from(old)`.
   Per review: this business logic belongs on the dataclass, not buried
   in the coordinator.

2. setup() recovery race: revert to remove-under-lock + schedule action
   pattern. The previous "background _run_recovery_action that removes
   on success" introduced two new races (Copilot-flagged):
   - A new start() between setup-return and recovery-completion gets
     silently wiped by the recovery's _remove_from_store
   - Two setup()s on the same timer_id (rollback) both see the entry
     before the background recovery removes it → double-fire
   The atomic-claim approach (remove BEFORE scheduling) prevents both.
   Trade-off: recovery action failure loses the autolock state. Accept
   this for the startup-only edge case — user's next lock interaction
   creates a fresh autolock cycle.

   Replace the now-obsolete "preserves on action failure" test with one
   that verifies the atomic-claim semantics: two concurrent setup()s
   yield exactly one action call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/coordinator.py Outdated
Comment thread custom_components/keymaster/helpers.py
…etup

Two more Copilot-flagged real reload races:

1. Detaching the old timer BEFORE _unsubscribe_listeners exposed a
   window where in-flight provider/door callbacks (lock/door state
   change handlers) could call old.autolock_timer.start()/cancel() on
   the now-detached timer — start() would silently fail (refs nulled),
   cancel() would noop (_detached=True). A user lock/unlock racing
   with reload could either drop the autolock entirely or leave a
   stale store entry that the replacement timer replays.

   Fix: swap the order — unsubscribe listeners first, then detach.
   detach still awaits the in-flight _on_expired so the earlier
   "timer fires during await chain" concern stays addressed.

2. setup() called on a previously-detached instance (rollback path)
   only reset _detached. detach() preserves _end_time/_duration so an
   in-flight persist can still write them — but if setup's load found
   no entry in the store, those preserved fields stayed set on the
   rebound instance, making is_running falsely report True for a
   timer with no scheduled callback.

   Fix: also clear _end_time/_duration in setup(). They get repopulated
   by _resume() if there's a valid persisted entry.

Test updates:
- test_update_lock_detaches_old_timer_before_any_await — invert the
  ordering assertion to match the new correct order
- test_keymaster_timer_setup_resets_detached_and_state — extend to
  also assert _end_time/_duration are cleared and is_running is False
  after re-setup with an empty store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py Outdated
Comment thread custom_components/keymaster/coordinator.py Outdated
…lback

Two more Copilot-flagged real reload races:

1. Clearing _end_time/_duration in setup() BEFORE acquiring _store_lock
   meant a queued in-flight _persist_to_store would see None on its
   post-lock recheck and abort, dropping a start() that was issued just
   before reload.

   Fix: don't clear pre-lock. Inside the lock, after the load:
     - If the store has an entry, use it (existing behavior).
     - If the store is empty BUT _end_time is set (preserved by detach
       from a recent start()), claim it: write back + _resume.
     - If the store is empty AND _end_time is None, clear stale fields
       so is_running doesn't lie.

   This handles both Copilot concerns (don't drop pending state, don't
   leave stale fields) without losing the queued persist's data.

2. _update_lock rollback restored listeners BEFORE the timer, mirroring
   the original buggy ordering: a provider/door callback could arrive
   in the gap and hit the still-detached timer or no timer at all.

   Fix: restore in reverse order — _setup_timer first, then
   _update_listeners. Listeners only fire after both timer and
   listeners exist on the rebound kmlock.

Test updates:
- Split test_keymaster_timer_setup_resets_detached_and_state into:
    - test_keymaster_timer_setup_resets_detached_flag (just _detached)
    - test_keymaster_timer_setup_rescues_preserved_state (the new
      rescue+resume behavior)
    - test_keymaster_timer_setup_clears_state_when_no_pending (the
      idle-rebind path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py Outdated
Comment thread custom_components/keymaster/coordinator.py
Comment thread custom_components/keymaster/coordinator.py
Two more Copilot-flagged real reload races:

1. _update_lock creates a FRESH KeymasterTimer for `new`. Its setup()
   can rescue from disk but not from the old timer's in-memory
   _end_time. If the old timer's queued persist hadn't run by the
   time detach() ran, the disk had no entry — and the new timer's
   in-memory _end_time is None — so the autolock was silently lost.

   Fix: detach() now force-persists in-memory state to disk before
   nulling refs. Acquires the store lock, checks if the entry already
   exists (queued persist already ran), and writes only if missing.
   This guarantees the disk has the autolock state for the replacement
   timer's setup() to load.

2. The previous-round rescue branch in setup() didn't check whether
   the preserved _end_time was already in the past. If reload took
   longer than the autolock delay, the rescue would write an expired
   entry back to disk and _resume from it. A subsequent reload before
   the immediate-fire callback ran could replay the same overdue
   autolock.

   Fix: rescue branch now branches on expiry. If expired, schedule
   the recovery action immediately and DON'T write back. If not
   expired, claim and resume as before.

Test updates:
- test_keymaster_timer_detach_preserves_store: load now returns the
  entry that start() persisted, so detach's force-persist sees it
  exists and skips writing
- new test_keymaster_timer_detach_force_persists_when_store_empty:
  verifies the force-persist behavior when queued persist hasn't run
- new test_keymaster_timer_setup_rescues_expired_state_without_resurrecting:
  verifies the expired-rescue path fires the action without writing
  back to disk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread custom_components/keymaster/helpers.py Outdated
Comment thread custom_components/keymaster/helpers.py Outdated
Comment on lines +1082 to +1092
# Unsubscribe listeners FIRST so in-flight provider/door callbacks
# can't reach the about-to-be-detached timer (where start()/cancel()
# would silently fail or no-op). Then detach the old timer; its
# detach() awaits any in-flight _on_expired so mutations it made on
# old (e.g. pending_retry_lock) are visible to the state copy.
# Once detached, the old timer is unusable until the new timer is
# set up — wrap the rest in try/except that restores listeners and
# the autolock on whichever kmlock is current if any step fails.
try:
await KeymasterCoordinator._unsubscribe_listeners(old)
if old.autolock_timer:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledging — the in-flight _lock_locked() race is real but is fundamentally a coordinator-side ordering issue (lock event handler sets kmlock.lock_state and awaits notification dismissals BEFORE calling cancel()). Fixing it cleanly requires reorganizing the lock event handler flow so cancel happens before any awaitable work, which is meaningfully larger scope than this PR. Marking as deferred follow-up.

Comment thread custom_components/keymaster/helpers.py Outdated
Three more Copilot-flagged real reload races:

1. setup() recovery actions launched via hass.async_create_task were
   untracked. detach() only awaited _on_expired_task, not these. A
   reload during recovery could copy the old kmlock's state before
   _timer_triggered finished mutating it, losing the recovery work.

   Fix: replace _on_expired_task with a generic _inflight_tasks set
   that tracks both _on_expired callbacks AND setup() recovery
   actions. detach() awaits all of them. Setup uses _track_recovery()
   which adds the task to the set before scheduling.

2. detach()'s force-persist swallowed exceptions and continued. When
   the timer state existed only in memory (the exact race force-persist
   exists for), failure meant the replacement came up empty and
   autolock was silently disabled.

   Fix: propagate the exception. _update_lock's outer try/except
   catches it and runs the rollback path, which re-attempts setup —
   loud failure beats silent state loss.

3. (Deferred) The _lock_locked() in-flight race is acknowledged in a
   reply to the Copilot thread. Real but coordinator-side: the lock
   event handler sets state and awaits notification dismissals BEFORE
   calling cancel(). Fixing it cleanly requires reorganizing the lock
   event handler flow, which is meaningfully larger scope than this
   PR's persistence/reload fixes.

New regression tests:
- test_keymaster_timer_detach_awaits_recovery_action — verifies
  detach blocks for an in-flight setup() recovery action
- test_keymaster_timer_detach_propagates_force_persist_failure —
  verifies the exception is no longer swallowed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +272 to +277
for task in list(self._inflight_tasks):
if not task.done():
try:
await task
except Exception:
_LOGGER.exception("[KeymasterTimer] In-flight task raised during detach")
Comment on lines +272 to +277
for task in list(self._inflight_tasks):
if not task.done():
try:
await task
except Exception:
_LOGGER.exception("[KeymasterTimer] In-flight task raised during detach")
Comment on lines +326 to +330
if self._call_action is None:
return
action = self._call_action
try:
await action(now)
@raman325
Copy link
Copy Markdown
Collaborator Author

raman325 commented May 3, 2026

Closing this PR in favor of a redesign.

After 17 review iterations covering ~30 races and edge cases, it's clear that the underlying KeymasterTimer design conflates three layers with different lifetimes (persistent store, scheduled callbacks, kmlock binding) into one object with nullable fields and flags. Each Copilot pass finds a narrower interleaving because the architecture doesn't make races structurally impossible — only progressively less likely.

Will be replaced by a fresh PR built on:

  • Separated lifecycle layers (TimerStore / ScheduledCallback / KeymasterTimer)
  • Explicit state machine for the timer (Fresh → Active → Firing → Done | Detached)
  • _update_lock ownership transfer (hand timer/listeners between kmlock instances) instead of swap-and-mutate

Original AttributeError from issue #594 will be re-addressed in the redesign. Tracked as a follow-up issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ISSUE: Auto Lock Timer "Unknown"

4 participants