All devices gone after power outage #11759

Open
Kodemikkel opened this issue Mar 8, 2022 · 55 comments
Labels
dont-stale problem Something isn't working

Comments

@Kodemikkel

What happened?

After waking up from a night's sleep, I noticed in HA that some devices were unavailable. After some more looking around, I found all my Z2M devices were unavailable.

Heading over to my Z2M docker container running on an Unraid host, I saw none of my devices were listed anywhere, and "Devices", "Dashboard" and "Map" were all blank, although my settings were still the same.
I do suspect that there was a power outage while I was sleeping, as all the lights had turned on and after booting my PC, there were some indications that it had lost power.
After power cycling a device I had nearby and setting Z2M to allow joining, the device appeared as previously configured without needing to change anything manually.

Now I have power cycled/reset all my devices, and they are all fully functional as they were before. What is weird is that I have, on several occasions, unintentionally removed power from my running Unraid host without any issues at all.
This leads me to my questions:

  • Does anyone know why this happened, especially when the host has lost power on several occasions earlier without any issues?
  • Has anyone else experienced this issue before?
  • And if so, how did you fix it?

I am lucky to only have about 30 devices on my network, and resetting them doesn't take that long, although it is still boring and tedious. I can only imagine what it would be like for someone with hundreds of devices on their network.

What did you expect to happen?

Not losing all my devices after a power outage.

How to reproduce it (minimal and precise)

No idea.
As mentioned my host has lost power several times earlier without issues.

Zigbee2MQTT version

1.23.0-dev commit: afe94a7

Adapter firmware version

0x26720700

Adapter

ConBee2

Debug log

07MAR22 11:20:18.txt
08MAR22 08:30:21.txt

The first log is dated 07MAR22 11:20:18 and I assume the last line is right before the power is lost. (I can't find any useful information in this)
The second log is dated 08MAR22 08:30:21 and I assume it would be from when the power came back. The shutdown at the end of this log is me trying to restart it.

@Kodemikkel Kodemikkel added the problem Something isn't working label Mar 8, 2022
@MattWestb

If you run a production system with lights, use real Zigbee light switches and bind them to Zigbee light groups, so they always keep working when the host system or the local network has a problem.
If you implement lighting the "HA way", a single problem can stop everything from working; with light groups, only one device fails rather than 100% of the system.

The reason you need to re-power the device and enable joining is that the coordinator's frame counter for the network key is out of sync, or the system restored an old backup of it after coming back from the power problem => all devices in the network block it because they think it is a replay attack (normal Zigbee security).
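
To make the replay-protection argument concrete, here is a minimal toy model in Python. It is not Zigbee stack code: the `Receiver` class and the counter values are hypothetical, and it only illustrates the general idea that a receiver rejects frames whose counter is not higher than the last one it accepted.

```python
class Receiver:
    """Toy model of a device that remembers the last frame counter it accepted."""

    def __init__(self) -> None:
        self.last_counter = -1  # nothing accepted yet

    def accept(self, frame_counter: int) -> bool:
        # Accept only frames whose counter is strictly higher than the last accepted one.
        if frame_counter <= self.last_counter:
            return False  # treated as a replayed (old) frame
        self.last_counter = frame_counter
        return True


device = Receiver()

# Before the outage the coordinator sends frames with increasing counters.
before_outage = [device.accept(c) for c in (1000, 1001, 1002)]

# After the outage the coordinator resumes from a stale counter
# (hypothetical values, e.g. restored from an old backup).
after_outage = [device.accept(c) for c in (900, 901)]

print(before_outage, after_outage)
```

In this model the device keeps ignoring the coordinator until the counter passes the last accepted value again, which would look like "all devices unreachable" from the coordinator's side.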

@Kodemikkel
Author

Kodemikkel commented Mar 8, 2022

But how come if I manually cut the power to the host, everything works fine? I've never had that issue before when my host loses and restores power.

And also, HA does not really have anything to do with this, as the Z2M docker is running completely separately and it was the Z2M docker that had the issue.

Edit: Added some more information in the reply.

@twsl

twsl commented Mar 11, 2022

I had something similar happen to my docker-based instance 3 days ago after rebooting my server.
Zigbee2MQTT version 1.24.0-dev (commit #c49f546)
zigbee-herdsman (0.14.20)

@eloo

eloo commented Mar 14, 2022

Had the same issue today :(

Really a pity that the system is not self-healing.

Zigbee2MQTT version
1.24.0-dev commit: [f7c6207](https://github.com/Koenkk/zigbee2mqtt/commit/f7c6207)
Coordinator type
ConBee2/RaspBee2
Coordinator revision
0x26720700
Coordinator IEEE Address
0x00212effff0656b9
Frontend version
0.6.77

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale issues label Apr 14, 2022
@adelaiglesia

Same issue today. Power loss on both Z2M and lights / devices, and a lot of devices were gone when Z2M rebooted. I have manually cut the power before without consequences.

@github-actions github-actions bot removed the stale Stale issues label Apr 15, 2022
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale issues label May 15, 2022
@eloo

eloo commented May 16, 2022

is this really stale?
afaik there was no fix yet?

@github-actions github-actions bot removed the stale Stale issues label May 17, 2022
@mihaiblaga89

Just had the same issue. Almost all devices were gone after a power outage; only 2 were still present in Z2M, both Hue motion sensors, and the other 20 devices were gone. I managed to get most of them back by resetting them, but I have 4 Philips Hue outdoor lights that don't want to rejoin by themselves and I can't reset them. I'll need to remove them from the wall, get the serial number, add them to the Philips Hue app, and I think I'll keep them there; I don't want to get the ladder out if a power outage happens again.

Using a zzh stick. I also tried updating to the latest coordinator firmware and keeping "Allow join" on all the time in the hope that some devices would rejoin by themselves, but those 4 lights never did. Some IKEA buttons did rejoin when I pressed them, but not all.

@github-actions
Contributor

github-actions bot commented Jul 4, 2022

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale issues label Jul 4, 2022
@eloo

eloo commented Jul 4, 2022

AFAIK not stale

@Koenkk
Owner

Koenkk commented Jul 4, 2022

What do you mean by "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are none of the devices controllable?

@Kodemikkel
Author

What do you mean by "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are none of the devices controllable?

Both

@github-actions github-actions bot removed the stale Stale issues label Jul 5, 2022
@eloo

eloo commented Jul 5, 2022

Yep for me it was also the same.. nearly all devices were neither visible in the Z2M frontend nor controllable in Home Assistant.
I also had to re-pair them.

Maybe there is some unwritten state in Z2M which creates an inconsistency when the Z2M system is killed hard?

@Koenkk
Owner

Koenkk commented Jul 5, 2022

Did anything special happen around the time of the crash? Z2M won't just empty its data/database.db file by itself

@eloo

eloo commented Jul 5, 2022

hmm i can not remember anything special except the power outage..

just checked what the database.db is.. and afaik this is just a json file?
maybe this file just gets corrupted from time to time?

as no proper database is used the "database.db" is lacking corruption prevention.
Maybe it would make sense to use something more robust here like a sqlite database?

further the last_seen is also stored in this file. so this file is going to have a lot of write operations which could lead to corruption during a power outage

@adelaiglesia

adelaiglesia commented Jul 5, 2022

What do you mean by "gone"?

  • Are the devices missing from the z2m frontend OR
  • Are none of the devices controllable?

Hi, sorry for the lack of clarity in my last response. The devices were present in the z2m frontend but unreachable (all of them, 91 devices). None of the devices or groups were controllable. The coordinator was reflashed with the same version, but with no effect. Re-pairing all devices was necessary.

Just for your info, 80/91 devices are power line operated (not battery devices). Just in case that helps. I'm going to check whether I still have the database.db so I can share it in this conversation.

The workaround I have deployed is to connect the Zigbee2MQTT machine to a UPS 🤣. At this point I think that z2m was writing at just the wrong moment and the file got corrupted. Only if I find the file will we know.

Thank you for your time

@Koenkk
Owner

Koenkk commented Jul 6, 2022

@eloo

as no proper database is used the "database.db" is lacking corruption prevention.

there is some corruption prevention, the db is first written to a temp path and then renamed (https://github.com/Koenkk/zigbee-herdsman/blob/f1c6a3887e9d7a763e9ec981543881716c75c5ff/src/controller/database.ts#L75). I agree that sqlite may be a better option but it's also more complicated (and we are not sure yet that this causes the issue).

further the last_seen is also stored in this file. so this file is going to have a lot of write operations which could lead to corruption during a power outage

not every last_seen state will rewrite the db, this is done occasionally

@adelaiglesia what did you see in the log when sending messages to the devices?

@hitokiri8x

I don't know if it's the same, but here is my situation: after a stop (maybe ungraceful) of the container, all sensors are still paired but they receive no signal.
I only have Aqara devices: window, temperature and water sensors.
Only the window sensors work again once toggled; for the temperature/water sensors to work again I need to press the button (not re-pair).

@tripplet
Contributor

I just had the same problem. After a short power outage, Z2M no longer showed any devices in the web interface; however, the log looked normal.
The new database.db only contained 3 lines, 1 for the coordinator and 2 empty groups:

{"id":1,"type":"Coordinator","ieeeAddr":"0x...." ... }
{"id":2,"type":"Group","groupID":1,"members":[],"meta":{}}
{"id":3,"type":"Group","groupID":2,"members":[],"meta":{}}

All devices were gone.
Luckily I was able to restore a backup from Home Assistant which contained the database.db.
After restoring it and restarting the addon, everything worked again, with no need for a lengthy re-pairing of all devices.
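
A quick way to spot this state before restarting anything is to count the entry types in database.db. The sketch below assumes, based on the excerpt above, that the file holds one JSON object per line with a "type" field; `count_entry_types` is a hypothetical helper, and the coordinator address in the sample is a placeholder because the original one is elided.

```python
import json
from collections import Counter


def count_entry_types(lines):
    """Count database entries by their "type" field (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line).get("type", "unknown")] += 1
    return counts


# Sample modelled on the three-line file shown above (address is a placeholder).
sample = [
    '{"id":1,"type":"Coordinator","ieeeAddr":"0x0000000000000000"}',
    '{"id":2,"type":"Group","groupID":1,"members":[],"meta":{}}',
    '{"id":3,"type":"Group","groupID":2,"members":[],"meta":{}}',
]

counts = count_entry_types(sample)
# Anything that is neither the coordinator nor a group is treated as a device here.
has_devices = any(t not in ("Coordinator", "Group") for t in counts)
print(dict(counts), "devices present:", has_devices)
```

To check a real file, pass `open("data/database.db", encoding="utf-8")` instead of `sample`; the path is an assumption about a typical setup.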

@eloo

eloo commented Aug 17, 2022

@tripplet okay.. that makes it clearer that the problem seems to be related to the database.db, as restoring it fixes it.

@Koenkk maybe as a quick fix the database.db could be duplicated on every save? so the old version would just be renamed with .bak or something like that?
so we can easily restore every time

@Koenkk
Owner

Koenkk commented Aug 17, 2022

I will check if I can come up with an easy recovery solution. Something like:

on save of db:

  • copy old db to something like database.db.bak as you suggested
  • write db to temp file with a closing mark at the end
  • copy db from temp file to database.db (if this only completes partially we get this issue)
  • if z2m starts next time it will check if it can find the closing mark such that it knows the db is complete, if not it will use the database.db.bak file if present
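
The steps above can be sketched in Python as follows. This is only an illustration of the proposed scheme, not the zigbee-herdsman implementation: `END_MARK`, `save_db`, `load_db` and the `.tmp` suffix are hypothetical names, and the temp file cleanup is an extra not mentioned in the list.

```python
import os
import shutil
import tempfile

END_MARK = "#END"  # hypothetical closing mark


def save_db(path, lines):
    # 1. Keep the previous database as a backup.
    if os.path.exists(path):
        shutil.copyfile(path, path + ".bak")
    # 2. Write the new database to a temp file, ending with the closing mark.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
        f.write(END_MARK + "\n")
    # 3. Copy the temp file over the real database (this step may be interrupted).
    shutil.copyfile(tmp, path)
    os.remove(tmp)  # tidy up; not part of the steps listed above


def _read_complete(path):
    """Return the entries if the file ends with the closing mark, else None."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        content = f.read().splitlines()
    if not content or content[-1] != END_MARK:
        return None  # missing or incomplete file
    return content[:-1]


def load_db(path):
    # 4. Prefer the main file; fall back to the backup if the mark is missing.
    for candidate in (path, path + ".bak"):
        entries = _read_complete(candidate)
        if entries is not None:
            return entries
    return []


workdir = tempfile.mkdtemp()
db = os.path.join(workdir, "database.db")
save_db(db, ['{"id":1}'])
save_db(db, ['{"id":1}', '{"id":2}'])

# Simulate a partial write caused by a power loss.
with open(db, "w", encoding="utf-8") as f:
    f.write('{"id":1}\n{"id"')

recovered = load_db(db)
print(recovered)
```

Note that the backup holds the state from the previous save, so entries added by the interrupted save are lost, but the rest of the network definition survives.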

@eloo

eloo commented Aug 17, 2022

@Koenkk sounds like a good solution.
i also like the idea of the self healing check 👍

@xit

xit commented Sep 17, 2022

Had the same issue the other night. Short power outage made my server reboot and only my Philips Hue motion sensors appeared, after they had detected motion.

Rolled back VM snapshot and everything was back to normal and I could finally turn off all the lights that turned on when the power came back. 😵‍💫

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale issues label Oct 18, 2022
@eloo

eloo commented Oct 18, 2022

not stale AFAIK

@Koenkk Koenkk added dont-stale and removed stale Stale issues labels Oct 18, 2022
@mrwiwi

mrwiwi commented Oct 26, 2022

Just got the issue after a power outage too. So stressful; luckily I had a backup from a few days back.

@mrwiwi

mrwiwi commented Oct 26, 2022

Note: my database.db.backup was only 4 KB, whereas my backed-up database.db was 43 KB!

@xit

xit commented Nov 3, 2022

Happened yet again. Sigh.

@drhirn

drhirn commented Nov 18, 2022

Just had a power outage too. In my case the database was ok, but all z2m devices were offline and couldn't be controlled.

@skinkie

skinkie commented Dec 10, 2022

Happened to me now twice too. Database was corrupted, and configuration.yaml isn't used to restore anything.

@Koenkk
Owner

Koenkk commented Dec 11, 2022

@skinkie can you provide me an example of how the data/database.db looked?

@mrwiwi

mrwiwi commented Dec 11, 2022

@skinkie can you provide me an example of how the data/database.db looked?

For me, every time it looked brand new!

@Koenkk
Owner

Koenkk commented Dec 11, 2022

I've added the fsync call before the rename as suggested by @tripplet. Let's see if it still occurs after this.

Changes will be available in the dev branch in a few hours from now. (https://www.zigbee2mqtt.io/advanced/more/switch-to-dev-branch.html)
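
For readers unfamiliar with the pattern, here is a minimal Python sketch of "write to a temp file, fsync, then rename". It is not the zigbee-herdsman code (which is TypeScript); `atomic_write` and the `.tmp` suffix are hypothetical names. The idea is that without the fsync, the rename might reach the disk before the file contents do, which could leave an empty or truncated file after a power loss.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write data to path so that readers see either the old or the new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the operating system
        os.fsync(f.fileno())  # ask the OS to write its cache to the disk
    # Replace the target only after the new contents were synced.
    os.replace(tmp_path, path)


workdir = tempfile.mkdtemp()
db = os.path.join(workdir, "database.db")
atomic_write(db, '{"id":1,"type":"Coordinator"}\n')
atomic_write(db, '{"id":1,"type":"Coordinator"}\n{"id":2,"type":"Router"}\n')

with open(db, encoding="utf-8") as f:
    print(f.read())
```

How strong the guarantee is still depends on the file system and the storage hardware, so keeping separate backups remains a sensible precaution.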

Koenkk added a commit to Koenkk/zigbee-herdsman that referenced this issue Dec 11, 2022
@jjarven

jjarven commented Dec 18, 2022

I saw this behaviour with 1.28.4 yesterday.
Migrated from ZHA and had issues: zigbee2mqtt crashed many times during device pairing (the web front end stopped acknowledging pairing attempts, and I finally noticed the backend was down).

At one point, I had around 5 devices paired and the backend crashed; when restarted, the devices were gone.
Thus the service's automatic restart function is not working either; I had to start it manually.

@maxime1992

I had a very similar situation yesterday and while I had no power outage as far as I'm aware, I start to wonder if it's not somehow related : #15868

@scottrhoyt

Hi, I just ran into a similar issue on 1.28.1 running in docker. I restarted the container and the WebUI was now depopulated of all devices and most other info (version numbers in about, map not working, etc.). Though looking at the log, it appears that devices are still paired and transmitting state and commands correctly. Here's what I tried to no avail:

  • Restart container
  • Rollback container data to known good state
  • Update Z2M to 1.30.0

@noci2012

noci2012 commented Jun 27, 2023

Current version of Z2M: 1.31.2 commit: 21f51258
Conbee II, Firmware: 0x26580700

After a restart (requested through either the web interface or a systemd restart of the Z2M process),
most devices are offline. Not forgotten, just offline, and they need to be re-paired.
Devices that report are more likely to return than others.
Devices that were unavailable (mains devices offline because they were switched off at the mains) also have a better chance of returning.

44 devices: 31 on mains, 12 on battery, and one that never reported either battery or mains (it does run on battery).

Is it possible that a single status request to some device gets lost in the traffic during startup, somehow causing the device to be disabled?
Also observed (noticed once, no complete track record): there is a failed poll in the log files BEFORE the stick has been registered.

info  2023-06-27 10:29:17: Logging to console and directory: '/opt/zigbee2mqtt/data/log/2023-06-27.10-29-17' filename: log.txt
warn  2023-06-27 10:29:17: Failed to ping 'innr_plug1' (attempt 1/1, Read 0x18fc260000051121/1 genBasic(["zclVersion"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (no response received (17)))
info  2023-06-27 10:29:17: Starting Zigbee2MQTT version 1.31.2 (commit #21f51258)
info  2023-06-27 10:29:17: Starting zigbee-herdsman (0.14.117)
info  2023-06-27 10:29:17: zigbee-herdsman started (resumed)
info  2023-06-27 10:29:17: Coordinator firmware version: '{"meta":{"maintrel":0,"majorrel":38,"minorrel":88,"product":0,"revision":"0x26580700","transportrev":0},"type":"ConBee2/RaspBee2"}'

@noci2012

Additional: database.db entry for such a device that stubbornly doesn't come back...

{
  "id": 22,
  "type": "Router",
  "ieeeAddr": "0x847127fffea9ccab",
  "nwkAddr": 10923,
  "manufId": 4644,
  "manufName": "ROBB smarrt",
  "powerSource": "Mains (single phase)",
  "modelId": "ROB_200-004-0",
  "epList": [
    1,
    242
  ],
  "endpoints": {
    "1": {
      "profId": 260,
      "epId": 1,
      "devId": 257,
      "inClusterList": [
        0,
        3,
        4,
        5,
        6,
        8,
        2821,
        4096
      ],
      "outClusterList": [
        25
      ],
      "clusters": {
        "genBasic": {
          "attributes": {
            "modelId": "ROB_200-004-0",
            "manufacturerName": "ROBB smarrt",
            "powerSource": 1,
            "zclVersion": 3,
            "appVersion": 0,
            "stackVersion": 0,
            "hwVersion": 1,
            "dateCode": "NULL",
            "swBuildId": "2.5.3_r51"
          }
        },
        "genOta": {
          "attributes": {
            "currentFileVersion": 51
          }
        },
        "genOnOff": {
          "attributes": {
            "onOff": 0
          }
        },
        "genLevelCtrl": {
          "attributes": {
            "currentLevel": 69,
            "onLevel": 255
          }
        }
      },
      "binds": [
        {
          "cluster": 6,
          "type": "endpoint",
          "deviceIeeeAddress": "0x00212effff06747e",
          "endpointID": 1
        },
        {
          "cluster": 8,
          "type": "endpoint",
          "deviceIeeeAddress": "0x00212effff06747e",
          "endpointID": 1
        }
      ],
      "configuredReportings": [
        {
          "cluster": 6,
          "attrId": 0,
          "minRepIntval": 0,
          "maxRepIntval": 3600,
          "repChange": 0
        }
      ],
      "meta": {}
    },
    "242": {
      "profId": 41440,
      "epId": 242,
      "devId": 102,
      "inClusterList": [
        33
      ],
      "outClusterList": [
        33
      ],
      "clusters": {},
      "binds": [],
      "configuredReportings": [],
      "meta": {}
    }
  },
  "appVersion": 0,
  "stackVersion": 0,
  "hwVersion": 1,
  "dateCode": "NULL",
  "swBuildId": "2.5.3_r51",
  "zclVersion": 3,
  "interviewCompleted": true,
  "meta": {
    "configured": 1461352984
  },
  "lastSeen": 1687851057212,
  "defaultSendRequestWhen": "immediate"
}
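
When comparing such an entry with the log, it can help to decode "lastSeen". The sketch below assumes the value is a Unix timestamp in milliseconds, which its magnitude suggests but the thread does not state. The result is in UTC, while the time zone of the log timestamps above is not given.

```python
from datetime import datetime, timedelta, timezone

last_seen_ms = 1687851057212  # "lastSeen" from the entry above

# Split into whole seconds and milliseconds to avoid floating-point rounding.
last_seen = datetime.fromtimestamp(last_seen_ms // 1000, tz=timezone.utc) + timedelta(
    milliseconds=last_seen_ms % 1000
)
print(last_seen.isoformat())
```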

@tripplet
Contributor

Given the lack of new reports I think the fix works and this can be closed?

@noci2012

noci2012 commented Sep 28, 2023

I avoided all updates until now, i will check next weekend, report or close whatever is appropriate.

@noci2012

noci2012 commented Oct 1, 2023

No power outage....,
update through update.sh

After update: (& waiting half an hour):
All devices that send measurements (temperature, power usage, motion) are Online
All switchable devices (lamps, relays) are gone. (Switches that give a power reading are present.)

Re-pairing some devices has issues... (of 5 lamps, 4 would reconnect after a reset; the 5th lamp would not).
One hour later no change.

@noci2012

noci2012 commented Oct 2, 2023

Plugs/Switches are reported to be online, still giving errors:

Failed to read state of 'frients_plug1' after reconnect (Read 0x0015bc002f013410/2 genOnOff(["onOff"], {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (no response received (48)))

This is equivalent for all plugs.

@noci2012

noci2012 commented Nov 6, 2023

I have experienced several (8 or so) power failures last week, some devices were lost. (about 1 each time). Different types, etc.
The non-connecting lamp probably died during reset attempts... it will not reset anymore, it now just is an expensive dumb bulb. Not sure what genius thought turning off/on a device 6 times in a short time is a sane reset method.

I consider this solved. It can be closed. (I cannot do it).

@joaquinvacas
Sponsor

I have experienced several (8 or so) power failures last week, some devices were lost. (about 1 each time). Different types, etc. The non-connecting lamp probably died during reset attempts... it will not reset anymore, it now just is an expensive dumb bulb. Not sure what genius thought turning off/on a device 6 times in a short time is a sane reset method.

I consider this solved. It can be closed. (I cannot do it).

It still happens from time to time. I'm making periodic backups so I can restore it if something happens.

@mmerickel

mmerickel commented Nov 23, 2023

Using 1.33.2 with an ezsp adapter (sonoff dongle-e), I had set up z2m for the first time and connected 22 devices. I then clicked the restart button on the z2m web ui, and when it came back every device was gone. The database.db.backup was there, and contained all of the devices. I put it back in place and z2m started showing all of the devices, but the network was non-functional and the devices were gone from HA. I had to re-pair everything to re-establish the network. Did not see any errors in the logs.txt files from before/after the restart. This is hugely concerning as a first-time user of z2m, trying it after never seeing an issue like this with ZHA over about 1.5 years.

@gorstj

gorstj commented Dec 30, 2023

I think this issue is still present. See the following bug:

See #19988

@Brachterbaek

I had the same issue yesterday (1.35.3-1). A circuit breaker tripped on the circuit that includes the socket my server is on. Lost all 32 devices once everything was back up and running. Had to manually add everything back to Z2M; sadly I had no good backup.
The second problem that occurred is that InfluxDB and Grafana no longer include new data under the device-id. The IDs of my devices haven't changed, of course, but new data is now only added under the device name and no longer under the identity-id.
