Sudden failure to startup - timeout after SRSP - ZDO - startupFromApp #765
OK, so being an unforgivably impatient sort, I backed up the coordinator and updated to 20230507. This went flawlessly, but I had exactly the same issue on starting the addon. You can see from the log that the firmware has updated: Docker Log (Same failure but updated coordinator firmware)
However, whilst I was coming here to post, the watchdog started it again, and this time it booted up as usual!!! I have restarted several times and turned off herdsman logging to see if it survives restarts, and it does. However, out of 157 devices, only 58 have reported back in so far, and it seems to have stalled at that number, so I need to give it some time to see if I can get the full network to rebuild. Docker Log (truncated after startupFromApp succeeds)
Update: Z2M ran for approx. 12 hours, mostly issuing regular warnings such as:
A number of devices remained 'offline' to Z2m but were responding to group messages. About two-thirds of the devices were communicating OK; the rest responded to group messages even while showing as 'No network route' to Z2m. After 12 hours, the errors started appearing in the logs, e.g.
About 3-4 hours later, devices stopped responding entirely and went offline, and pretty much every error and warning was the same. I noticed about 3 hours later and attempted to restart Z2m, at which point the failure described above recurred, and Z2m is back in the refusing-to-start state due to the timeout. Docker log showing failure to start
UPDATE: Again, after about 1 hour, Z2M rebooted successfully.
OK, I updated to 20230901 to see if that improved things. The first start of the addon produced the same error but, as with the last firmware update, it succeeded on the second attempt.
So, a reboot after about 2 hours caused it to go back into the failed state. After multiple unsuccessful reboot attempts, I again reflashed the firmware and restored the network backup. However, this time it failed to reboot after multiple attempts, so the 'pattern' of succeeding on the 2nd attempt after a flash was a coincidence, probably more related to time passing. I've set the HAOS addon watchdog back on so that it repeatedly attempts a reboot, to see if it eventually succeeds, and will update this comment. UPDATE: Nope, it didn't restart, so I reflashed again, and still no luck :(
Could it be that there are two Zigbee networks in your house?
Yes, as I mentioned in the description:
The primary network runs on channel 11 and has 159 devices. Z2m runs as a Home Assistant Add-on on an Intel NUC. I split the networks due to the issues caused by the 17 'noisy' presence sensors that made it unstable. It has been relatively stable for about four months with this setup. I also have a Philips Hue bridge with a few lights attached (about 4 or 5). I don't know what channel it runs on, but its coordinator is on the other side of the house. They added a 'smart' meter to my electricity supply about a month ago; however, it doesn't work due to an inability to sync to the radio network, so I haven't set up the monitor. It, too, uses Zigbee, but I don't believe it is active.
Z2m booted up again after about 9 hours this morning and has been running since, though only 100 of 159 devices are 'online'. I have ordered some new coordinators as backups and am crossing my fingers I can keep the network up in the meantime. Do you have any suggestions as to why it times out at this point?
I suggest downgrading to 20221226 (https://github.com/Koenkk/Z-Stack-firmware/tree/Z-Stack_3.x.0_coordinator_20221226/coordinator/Z-Stack_3.x.0/bin); problems have been reported with 20230507: Koenkk/Z-Stack-firmware#474
Yeah, sorry, being Autistic I'm overly verbose, so I don't blame you for not reading all the walls of text. I just wanted to give you all the info. You'll see from the details in the first comment that the problems started whilst it was running 20221226. I am running 20230507 on the second coordinator (which isn't showing this issue); I've never run it on the primary. I upgraded the primary to 20230901, as that's effectively the same as 20221226, just a newer SDK. The symptom has been the same on both 20221226 and 20230901. I've included logs showing the failure on both versions. Hope that clears things up.
If I recall correctly,
So, wrapping the adapter in tin foil and removing the antenna allows Z2M to start, but all communications to devices fail. The prospect of changing to Channel 25 and re-pairing 150+ devices is more than a little daunting, especially as the vast majority are Philips Hue bulbs, which would require unpairing first... which would be a challenge. Some of the lights are in very inaccessible places too. So, I then replaced my Sonoff dongle with a new one straight from the box, flashed with 20221226 and the network backup. It, too, failed to start. The real issue is how to identify the culprit that's stomping on my Zigbee channel...
So, I unplugged my Philips Hue hub (which is two floors away) and the second coordinator (which is supposed to be on Channel 12). I can't physically disconnect the Smart Meter, but it's in the garage and has no devices connected to it. I also can't disconnect my Google Wifi Pro 6 (which was the worst purchase decision ever, btw), and I can't see what channel it's running on, never mind change it (because it's SH*T), but I would expect Channel 12 to be stomped on too if the issue were Wi-Fi related. None of these steps made any difference. I am running the coordinator on a USB extension cord already, and, as mentioned, it had been relatively stable until 3 days ago.
So my working theory is one of:
I haven't got a sniffer; many of the CC2531 modules I can see require a separate programmer board that I'm not sure how to source or use. UPDATE: After a further 10 minutes, I started getting:
Since it is starting now but the coordinator is crashing, can you try 20221226 again? (it seems to be the most stable fw for now, based on the feedback of many users)
I'm not sure what that helps demonstrate, as I've already stated the problems started when I was on 20221226 (and had been for about six months). However, I did as asked and, sure enough, it wouldn't start up again at all until I switched off all my light circuits (unpowering about 130 devices). However, switching the circuits back on after a reboot almost immediately caused the dreaded
UPDATE: To be uber-complete, I have also downgraded my secondary network coordinator to 20221226, as I have been running 20230507 on it successfully for the last few months. Unsurprisingly, it had no impact, and Z2m on the primary coordinator doesn't start up even when both are running 20221226. I have now run every combination of coordinators on 20221226, 20230507 and 20230901, and the result is identical. The only way I can get the network to start up reliably is to switch off all the lights. That takes out about 100 of the 150+ Zigbee devices. I can then add back one floor (which adds back about 60+ devices), but I get a lot of
The vast majority of my lights are Philips Hue spots, which are usually rock solid. I haven't done any recent updates that could bring on this new turn of events.
UPDATE 2: I managed to get the network to about 133 devices by slowly switching on rooms one at a time until all the lights pinged. I occasionally got a few
UPDATE 3: After being stable overnight, I tried to add the final ~15 devices using the same room-by-room method, and I quickly hit the
After a few more days of testing, I identified a couple of Moes Thermostats that are super chatty and moved them to my secondary network. This improved things slightly, but I'm still having the same problems getting the network to start. What I'm seeing is that, when busy, Z2M continues to send messages into an already busy environment, which causes a cascade failure, resulting in
Because of this, particularly chatty devices, like sensors, are problematic. Also, the number of pings and messages spammed by Z2M on startup (like querying the device state when it first sees the device) contributes to the difficulty of starting up stably in a large network environment.
I've started looking through the code base to see if there's any throttling built into sending messages to the coordinator based on the current frames/second. I suspect such a feature would be really useful and lead to a much more stable coordinator in busy environments. What would be good is for the herdsman to track the current reception and transmission rates (preferably in kb/s, but even msg/s would be useful). Exposing these two figures would be useful for debugging. With these two rates, it would then be good to update the waitress to delay sending messages until the combined rate is below a configurable threshold. This would reduce pressure on the coordinator and avoid contributing to the saturated network environment. It is better for a message to time out higher up the stack than on the network itself. As a final stage, adding a priority to message types would allow messages such as pings, OTA updates, etc., to be placed at a lower priority.
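For illustration, the rate tracking could be as simple as an exponentially weighted moving average of messages per second. This is a hypothetical sketch (the class and method names are mine, not existing zigbee-herdsman code):

```typescript
// Hypothetical sketch: track coordinator traffic as an exponentially
// weighted moving average (EWMA) in messages per second. Not part of
// the zigbee-herdsman API; names are illustrative only.
class TrafficMeter {
  private rate = 0;         // smoothed msg/s
  private lastTick: number; // ms timestamp of the previous frame

  constructor(private readonly halfLifeMs = 5000, now = Date.now()) {
    this.lastTick = now;
  }

  // Record one transmitted or received frame.
  record(now = Date.now()): void {
    const dt = Math.max(now - this.lastTick, 1); // avoid divide-by-zero
    // Weight decays by half every halfLifeMs of elapsed time.
    const alpha = 1 - Math.pow(0.5, dt / this.halfLifeMs);
    const instantaneous = 1000 / dt;             // msg/s since last frame
    this.rate += alpha * (instantaneous - this.rate);
    this.lastTick = now;
  }

  get messagesPerSecond(): number {
    return this.rate;
  }
}
```

One meter for TX and one for RX would give the two figures described above, and their sum is the "combined rate" the waitress could compare against a threshold.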
Also, splitting the timeout into two - the timeout after the initial send (when the waitress receives the message) and the timeout after the actual transmit (when it releases it to the coordinator once network saturation is deemed low enough) - would allow messages such as availability pings to be given a low priority but a longer send timeout, whilst keeping the transmit timeout short.
Later, Z2M itself could use the network saturation data to determine optimum times to do maintenance work (like querying initial state, pinging availability, etc.), helping to balance the network load. What do you think as a way forward to support larger networks?
TL;DR
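A minimal sketch of the two-stage queue idea (all names hypothetical; this is not zigbee-herdsman code): messages wait in a priority queue until the measured rate drops below a threshold, with a separate, longer queue timeout bounding how long they may wait before the short transmit timeout would even start:

```typescript
// Hypothetical sketch of the proposed two-stage send queue. A message
// waits here (bounded by queueTimeoutMs) until the network is quiet
// enough; the usual short SRSP/transmit timeout would only apply after
// release. Names are illustrative, not zigbee-herdsman API.
type Priority = 0 | 1 | 2; // 0 = command, 1 = query, 2 = ping/OTA

interface Pending {
  priority: Priority;
  enqueuedAt: number; // ms timestamp
  send: () => void;   // hands the frame to the coordinator
}

class ThrottledQueue {
  private pending: Pending[] = [];

  constructor(
    private readonly maxRate: number,        // msg/s threshold
    private readonly queueTimeoutMs: number, // max wait for capacity
  ) {}

  enqueue(priority: Priority, send: () => void, now = Date.now()): void {
    this.pending.push({ priority, enqueuedAt: now, send });
    // Lowest priority value first; FIFO within the same priority.
    this.pending.sort(
      (a, b) => a.priority - b.priority || a.enqueuedAt - b.enqueuedAt,
    );
  }

  // Called periodically with the current measured rate. Drops messages
  // that waited past queueTimeoutMs, then releases while capacity lasts.
  // Returns the number of messages released this pass.
  drain(currentRate: number, now = Date.now()): number {
    this.pending = this.pending.filter(
      (p) => now - p.enqueuedAt <= this.queueTimeoutMs,
    );
    let released = 0;
    while (this.pending.length > 0 && currentRate < this.maxRate) {
      this.pending.shift()!.send();
      released++;
      currentRate += 1; // crude: assume each release adds ~1 msg/s
    }
    return released;
  }
}
```

Under this scheme an availability ping could carry priority 2 and a generous queue timeout, while a light-toggle command at priority 0 jumps the queue as soon as the rate allows.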
You can configure the amount of concurrent messages, see https://www.zigbee2mqtt.io/guide/configuration/adapter-settings.html#mdns-zeroconf-discovery
I'll try reducing to 8 as that is half the recommended value, and see what happens. |
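If I've understood the linked docs correctly, the setting in question is `adapter_concurrent` under `advanced` in Zigbee2MQTT's configuration.yaml (worth verifying the key name against the docs for your Z2M version):

```yaml
# Zigbee2MQTT configuration.yaml - limit concurrent requests to the adapter.
# Key name taken from the adapter-settings docs; verify for your version.
# 16 is the commonly cited default for zStack 3.x adapters, so this halves it.
advanced:
  adapter_concurrent: 8
```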
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days |
I have a similar issue Koenkk/zigbee2mqtt#21962. Did you figure out what the problem was? Thanks!
Symptom
Home Assistant Z2M addon no longer starts, reporting "Error: SRSP - ZDO - startupFromApp after 40000ms". No changes were made that could explain the sudden fatal failure. Multiple hardware reboots have failed to have an impact.
Description
I've been running two instances of Z2m for about six months on two separate machines (an Intel NUC as a HAOS addon and a Raspberry Pi as a standalone Z2m instance) to allow me to run two separate channels. Today at around 13:50, my Z2m HAOS addon instance on the Intel NUC died and refuses to restart. I had made no changes and was not interacting with Home Assistant or the hardware when it happened, and I have only been able to deduce the failure time from when various devices became 'unavailable' to HA. I did trigger a reboot around that time, though.
Sadly, my addon wasn't configured for Herdsman logging nor SSH access via 22222. By the time I got it all set up, the initial logs were gone. However, I eventually managed to grab the full log below.
I've checked, and my Sonoff Zigbee 3.0 USB Dongle Plus is still connected to the correct USB port. From the logs, there appears to be initial communication until it gets to "SRSP - ZDO - startupFromApp", at which point it times out and the Docker container is unloaded (making grabbing logs difficult).
Have you any idea what may be causing this issue? I have a very large network (~150 devices), and rebuilding it would be a nightmare.
I have a coordinator backup from last night, so I could try restoring that to the dongle and upgrading to coordinator revision 20230507, which I've been running on the second dongle for several months. However, I wanted to let you look first in case there were any logs or debug steps you'd like me to try. In the meantime, I can no longer control any of my house lights or automations, so I would appreciate any help you can provide quickly.
Data
System Properties for Intel NUC HA addon instance
Docker Log (up to timeout)
Z2M Log (on timeout)