Add systemd watchdog notifications during sleep intervals #57

Open
assisted-by-ai wants to merge 2 commits into Kicksecure:master from assisted-by-ai:claude/investigate-sdwdate-bug-oKU75

@assisted-by-ai
Summary

Modified the wait_sleep() method to send periodic systemd watchdog notifications during sleep periods, preventing the service from being terminated due to watchdog timeout.

Key Changes

  • Replaced single time.sleep() call with a loop that breaks sleep into 60-second intervals
  • Added SDNOTIFY_OBJECT.notify("WATCHDOG=1") calls after each sleep chunk to keep systemd watchdog alive
  • Removed large block of commented-out legacy code that documented previous attempts to use shell sleep command
  • Maintains the same total sleep duration while improving service reliability under systemd supervision

Implementation Details

  • Sleep intervals are capped at 60 seconds to ensure watchdog notifications occur frequently enough
  • The loop calculates remaining sleep time and uses the minimum of the watchdog interval or remaining time for each chunk
  • This approach is compatible with the existing nanosecond precision in sleep calculations
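
The chunked sleep described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `chunked_sleep`, its parameters, and the injectable `notify` and `sleep` callables are hypothetical names chosen so the loop can be exercised without a running systemd or a real delay. In sdwdate, the notify callable would presumably be `SDNOTIFY_OBJECT.notify`.

```python
import time
from typing import Callable

# Assumed chunk size, taken from the 60-second interval in the description.
WATCHDOG_INTERVAL_SECONDS = 60.0


def chunked_sleep(
    total_seconds: float,
    notify: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    interval: float = WATCHDOG_INTERVAL_SECONDS,
) -> int:
    """Sleep for total_seconds in chunks of at most `interval` seconds.

    After each chunk, send a WATCHDOG=1 notification through `notify`.
    Returns the number of notifications sent.
    """
    remaining = total_seconds
    notifications = 0
    while remaining > 0:
        # Never sleep longer than the watchdog interval or the time left.
        chunk = min(interval, remaining)
        sleep(chunk)
        notify("WATCHDOG=1")
        notifications += 1
        remaining -= chunk
    return notifications


if __name__ == "__main__":
    # Dry run with a fake sleep so nothing actually blocks.
    slept = []
    count = chunked_sleep(150.0, notify=print, sleep=slept.append)
    print(slept, count)
```

Passing the sleep function as a parameter keeps the total duration easy to verify: the recorded chunks should sum to the requested sleep time.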

https://claude.ai/code/session_01Ps5Qw92ubEyzQQkG3Uvjdk

…eout

The wait_sleep() method sent a single WATCHDOG=1 notification before
sleeping for 60-180 minutes. If the total runtime before sleep plus the
sleep duration exceeded the 200-minute WatchdogSec, systemd killed the
process. Fix by sending WATCHDOG=1 every 60 seconds during the sleep
period.

https://claude.ai/code/session_01Ps5Qw92ubEyzQQkG3Uvjdk
@ArrayBolt3
Contributor

This is unnecessary IMO. The maximum sleep time in wait_sleep() is 180 minutes, and the WatchdogSec value in /usr/lib/systemd/system/sdwdate.service is 200 minutes. Unless sdwdate takes 20 minutes to do whatever it does between sleeps, systemd shouldn't terminate it (and if sdwdate does take 20 minutes to do whatever it does between sleeps, systemd probably should terminate it to let us know there's a bug causing extremely slow operation).
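
The margin argument can be written out as a small calculation. This is a simplified model under stated assumptions: the 180-minute and 200-minute figures come from the comment above, the helper names are hypothetical, and the model only compares the time since the last WATCHDOG=1 notification against the timeout.

```python
MAX_SLEEP_MINUTES = 180  # upper bound of wait_sleep(), per the comment above
WATCHDOG_MINUTES = 200   # WatchdogSec in sdwdate.service, per the comment above


def watchdog_margin_minutes(
    max_sleep: int = MAX_SLEEP_MINUTES,
    watchdog: int = WATCHDOG_MINUTES,
) -> int:
    """Minutes of non-sleep work that fit between two notifications."""
    return watchdog - max_sleep


def would_time_out(
    work_minutes: float,
    sleep_minutes: float,
    watchdog: int = WATCHDOG_MINUTES,
) -> bool:
    """True if work plus sleep exceeds the watchdog timeout (simplified)."""
    return work_minutes + sleep_minutes > watchdog


if __name__ == "__main__":
    print(watchdog_margin_minutes())   # margin left for work between sleeps
    print(would_time_out(5, 180))      # short work phase, longest sleep
    print(would_time_out(25, 180))     # work phase longer than the margin
```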

…atchdog timeout

The actual root cause of the watchdog timeout: nanoseconds from
secrets.choice(range(0, 999999999)) was added directly to
sleep_time_seconds without dividing by 1e9 first. This resulted in
time.sleep() being called with ~500 million seconds (~15 years) instead
of ~5640.5 seconds. The process would sleep effectively forever until
the systemd watchdog killed it at the 200-minute mark.

https://claude.ai/code/session_01Ps5Qw92ubEyzQQkG3Uvjdk
@assisted-by-ai
Author

Updated Analysis: ArrayBolt3 was right to question, but the real bug is deeper

ArrayBolt3's comment on #57 is:

"Unless sdwdate takes 20 minutes to do whatever it does between sleeps, systemd shouldn't terminate it"

They're correct that 180 min < 200 min should be safe. But that argument led me to find the actual root cause, which is much worse:

The Real Bug: wait_sleep() sleeps for ~15 years

In sdwdate.py:853:

nanoseconds = secrets.choice(self.range_nanoseconds)  # integer in [0, 999999999)

Then in the original line 883:

time.sleep(self.sleep_time_seconds + nanoseconds)  # BUG: nanoseconds not divided by 1e9

This adds an integer like 734829156 directly to 5640, making time.sleep(734834796) — that's ~23 years. The process sleeps forever until the watchdog kills it at 200 minutes.

Compare with add_or_subtract_nanoseconds() (line 466) which correctly does float(nanoseconds) / 1000000000.
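
To make the size of the error concrete, here is a small sketch that compares the buggy sum with the corrected one. The function names and constants are illustrative and are not taken from sdwdate.py; the sample values 5640 and 734829156 are the ones used in the example above.

```python
import secrets

NANOSECONDS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, for a rough conversion


def buggy_sleep_seconds(sleep_time_seconds: int, nanoseconds: int) -> int:
    """Reproduces the bug: nanoseconds are added as if they were seconds."""
    return sleep_time_seconds + nanoseconds


def fixed_sleep_seconds(sleep_time_seconds: int, nanoseconds: int) -> float:
    """Converts nanoseconds to fractional seconds before adding."""
    return sleep_time_seconds + nanoseconds / NANOSECONDS_PER_SECOND


if __name__ == "__main__":
    buggy = buggy_sleep_seconds(5640, 734829156)
    fixed = fixed_sleep_seconds(5640, 734829156)
    print(f"buggy: {buggy} s, about {buggy / SECONDS_PER_YEAR:.1f} years")
    print(f"fixed: {fixed:.9f} s")

    # With any value drawn from the same range, the corrected sleep time
    # stays within one second of sleep_time_seconds.
    nanoseconds = secrets.choice(range(0, 999999999))
    print(fixed_sleep_seconds(5640, nanoseconds))
```

With the conversion in place, the random component contributes less than one second, so the total stays well under the 200-minute (12000-second) watchdog timeout.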

The fix (2 commits)

  1. Convert nanoseconds to fractional seconds: total_sleep = self.sleep_time_seconds + (nanoseconds / 1000000000)
  2. Periodic watchdog notifications during sleep: defense-in-depth via 60-second sleep chunks with WATCHDOG=1 after each

@ArrayBolt3
Contributor

This adds an integer like 734829156 directly to 5640, making time.sleep(734834796) — that's ~23 years. The process sleeps forever until the watchdog kills it at 200 minutes.

Lol, well that would do it 🤦

We still don't need the code that periodically wakes up and sends a heartbeat to systemd, but fixing the calculation error makes a lot of sense. Fixed in ArrayBolt3@1b329e3.
