Optional watchdog for "adapter disconnected"-type events (non-node-crash) #23043

Merged Jun 19, 2024 (4 commits)


Nerivec
Collaborator

@Nerivec Nerivec commented Jun 13, 2024

Add a basic, optional watchdog to handle "soft failures" for bare Z2M users instead of requiring a process watchdog. It should handle any Z2M failure that would otherwise result in the node process being stopped (such as the "adapter disconnected" event).

npm run start => no watchdog
Z2M_WATCHDOG=default npm run start => watchdog with default delays (1min, 5min, 15min, 30min, 60min)
Z2M_WATCHDOG=minutes_csv npm run start => watchdog with custom delays (example: Z2M_WATCHDOG=0.5,3,6,15 npm run start)
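
The three forms above could be interpreted roughly as in the following minimal sketch. It is an illustration only, not the code from this PR: `parseWatchdogDelays`, `DEFAULT_DELAYS_MIN`, and the exact validation regex are hypothetical, and treating an empty value as invalid is an assumption. It returns `null` when the variable is unset, the default delays for `default`, and throws on anything that is not a CSV of numbers so the caller can refuse to start.

```typescript
// Hypothetical helper, not taken from the PR's diff.
const DEFAULT_DELAYS_MIN: readonly number[] = [1, 5, 15, 30, 60];
const MS_PER_MINUTE = 60000;

// Returns the retry delays in milliseconds, or null when the watchdog is disabled.
function parseWatchdogDelays(raw: string | undefined): number[] | null {
    if (raw === undefined) {
        return null; // Z2M_WATCHDOG not set: no watchdog
    }
    if (raw === "default") {
        return DEFAULT_DELAYS_MIN.map((minutes) => minutes * MS_PER_MINUTE);
    }
    // Accept only comma-separated integers/floats, e.g. "0.5,3,6,15".
    const numberCsv = /^\d+(\.\d+)?(,\d+(\.\d+)?)*$/;
    if (!numberCsv.test(raw)) {
        // The caller is expected to log this and stop instead of starting Z2M.
        throw new Error(`Invalid Z2M_WATCHDOG value '${raw}': expected 'default' or a CSV of minutes`);
    }
    return raw.split(",").map((minutes) => parseFloat(minutes) * MS_PER_MINUTE);
}
```

A caller would typically pass `process.env.Z2M_WATCHDOG` and exit with a non-zero code if the function throws.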

  • The number of configured delays is the de facto number of times the watchdog will retry; past that, the node process is stopped (to avoid retrying endlessly when something clearly requires the user's attention). With the default delays, the watchdog retries after 1min on the first failure, 5min on the second, 15min on the third, 30min on the fourth, and 60min on the fifth, then exits if the sixth start fails. Any successful start resets the sequence to the beginning.
  • The watchdog will only trigger on failure after the initial (manual) start is successful.
  • A problem with settings will always ignore the watchdog and stop Z2M.
  • A manual stop/restart will ignore the watchdog to comply with user intent.
  • Z2M_WATCHDOG accepts either default or a comma-separated list of integers/floats (minutes). Any other value or format is invalid and will prevent Z2M from starting.
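
The retry sequence described in the list above can be modelled as a small state machine. This is a sketch under stated assumptions, not the PR's implementation: `WatchdogSchedule`, `WatchdogAction`, and the method names are hypothetical, and the settings-error and manual stop/restart bypasses are deliberately left out.

```typescript
// Hypothetical model of the retry sequence; names are illustrative only.
type WatchdogAction = { kind: "retry"; delayMs: number } | { kind: "exit" };

class WatchdogSchedule {
    private readonly delaysMs: readonly number[];
    private failures = 0; // consecutive failures since the last successful start

    constructor(delaysMs: readonly number[]) {
        this.delaysMs = delaysMs;
    }

    // Called on each failure: returns the next delay, or "exit" once all delays are used.
    onFailure(): WatchdogAction {
        const delayMs = this.delaysMs[this.failures];
        if (delayMs === undefined) {
            return { kind: "exit" }; // more failures than configured delays
        }
        this.failures += 1;
        return { kind: "retry", delayMs };
    }

    // Any successful start resets the sequence to the first delay.
    onSuccessfulStart(): void {
        this.failures = 0;
    }
}
```

With five configured delays, the first five calls to `onFailure()` return a retry and the sixth returns `exit`, which matches the default behaviour described above.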

Note: This does not handle node crashes; those are better suited to a process watchdog (which is already widely available).

@Koenkk
Owner

Koenkk commented Jun 15, 2024

Looks good! I think the best place to document this is above here

@Nerivec Nerivec marked this pull request as ready for review June 15, 2024 18:02
@Nerivec
Collaborator Author

Nerivec commented Jun 15, 2024

That would only provide the information on the Linux page though. Do you want to duplicate the paragraph on all OS pages?

@Koenkk
Owner

Koenkk commented Jun 16, 2024

I think the way to enable this is a bit different for each setup, but I'm not sure if we want it for the Docker setup, for example.

@Nerivec
Collaborator Author

Nerivec commented Jun 16, 2024

For containerized setups, I guess it can remove some overhead by avoiding a container reset and just trying to launch the controller again.

@Koenkk
Owner

Koenkk commented Jun 17, 2024

I'm thinking that it might be easier to enable this through an env var, Z2M_WATCHDOG, which makes it easier to configure.

@Nerivec
Collaborator Author

Nerivec commented Jun 17, 2024

Z2M_WATCHDOG=default => use default delays
Z2M_WATCHDOG=1,2,3 => use 1,2,3 for delays
Something like this?

@Koenkk
Owner

Koenkk commented Jun 18, 2024

Yes, and for anything that is not CSV numbers, Z2M should refuse to start.

@Koenkk
Owner

Koenkk commented Jun 18, 2024

For the docs, I'm wondering if a new Watchdog page under Configuration makes sense.

@Nerivec
Collaborator Author

Nerivec commented Jun 18, 2024

How about under Installation? Right above "Zigbee2MQTT fails to start" looks good.

@Koenkk
Owner

Koenkk commented Jun 19, 2024

That's also fine.

@Koenkk Koenkk merged commit 2b36f74 into Koenkk:dev Jun 19, 2024
11 checks passed
@Nerivec Nerivec deleted the watchdog branch June 19, 2024 19:04