
[Feature request]: Incorporate multi-threading #20648

Closed
corporategoth opened this issue Jan 7, 2024 · 75 comments
Labels
feature request

Comments

@corporategoth

Is your feature request related to a problem? Please describe

High MQTT load (for example, when you have a LOT of HA devices, and HA configuration occurs) can cause actual zigbee traffic to be delayed/halted, and even time out (causing Z2M to crash).
This can be because of either the MQTT traffic, OR the logging of such (if logging is set to debug).

To wit, if I have logging set to debug, and HA enabled, Z2M will almost always crash on startup for me, because the amount of work being done both to log things and communicate with HA's autoconfiguration causes enough delays in Zigbee comms to cause a timeout and crash. I have to either turn off HA integration before startup, or set logging to info to get a successful startup of Z2M.

Describe the solution you'd like

Although JavaScript is normally single-threaded, this is a prime candidate for investigating/using the means to do multi-threading in JS. Either through some existing means (there are a few) or by splitting things into multiple processes.

There would have to be a pair of queues between the threads (one going each direction), so that, for example, when you get a status from the Zigbee network, you put that status onto a queue, and a separate thread is responsible for publishing it to MQTT. Or when you get a command from MQTT, it goes onto a queue that the Zigbee thread can pick up and send to the zigbee device.
This way, high load or concurrency limiting (on either side) will NOT affect the other.

The same is true for logging (though in this case only 1 queue is needed). Log messages are put on a queue, and a separate thread pulls from that queue to write the logs out (to disk, or wherever) - so that any I/O from logging doesn't delay core processing.
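The two-queue idea above can be sketched as follows. This is illustrative only, not actual Z2M code: plain arrays stand in for what would be cross-thread message ports in a real multi-threaded version, and all the names are made up.

```javascript
// Minimal sketch of the proposed decoupling: each side only touches its
// queue, so a burst on one side cannot stall the other.
class MessageQueue {
  constructor() { this.items = []; }
  push(item) { this.items.push(item); }
  drain(handler) {
    while (this.items.length > 0) handler(this.items.shift());
  }
}

const zigbeeToMqtt = new MessageQueue(); // Zigbee status -> MQTT publisher
const mqttToZigbee = new MessageQueue(); // MQTT command -> Zigbee sender

// Zigbee side: report a status without doing any MQTT I/O itself
zigbeeToMqtt.push({ topic: 'zigbee2mqtt/Foyer Lights', payload: { state: 'ON' } });

// MQTT side: enqueue an incoming command for the Zigbee side to pick up
mqttToZigbee.push({ device: 'Foyer Lights', command: { state: 'OFF' } });

// MQTT publisher drains at its own pace, independent of Zigbee timing
const published = [];
zigbeeToMqtt.drain((msg) => published.push(msg));
```

With real worker threads, `push` becomes a `postMessage()` and `drain` becomes the `'message'` handler on the other side, but the decoupling is the same.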

Describe alternatives you've considered

For now, I'm working around this by lowering log levels and disabling HA integration on startup. Not ideal, especially if a restart happens when I'm not around to change these values.

Additional context

All as described.

@corporategoth
Author

corporategoth commented Jan 7, 2024

This could probably be achieved with:
https://nodejs.org/api/worker_threads.html

Easier for the logging, but it should be able to handle all the needs. Simply start up 3 worker threads - one for MQTT comms, one for Zigbee comms, one for logging - and post messages between them for bridging.

So for logging, you might do logger_worker.postMessage({ logData });
Or to send to zigbee you might do zigbee_driver.postMessage({ requestToSend });
Or to send something to MQTT, mqtt.postMessage({ requestToSend });

I'm not a JS coder (C++ is my language), so I would just make a hash of this, but it IS possible and supported by node.

@Koenkk
Owner

Koenkk commented Jan 8, 2024

This would be a good addition indeed. But to confirm this is the issue, when setting the log level to error, are your issues solved?

@corporategoth
Author

@Koenkk My issue is 'solved' when I do EITHER of:
a. Set my log level to INFO (not debug).
OR b. disable home assistant during startup (and enable it once Z2M is active again).

This tells me that both logging AND the home assistant MQTT traffic have the ability to use up the main loop enough to delay Zigbee device traffic (or more likely the promise handling) to the point of hitting the timeout limit, which causes a crash on startup.

This is likely because my 180 zigbee devices end up being 12,938 entities in Home Assistant (in HA, there is one entity for each sensor reading or controllable element provided, and some of my devices have quite a bit of configurability).
Which means, on HA auto-config, there is a LOT of MQTT traffic as HA auto-configures the devices (and gets the entity statuses), which ALSO generates a LOT of debug-level logging.

So ensuring that MQTT traffic, logging, and zigbee device handling happens on different worker threads, which cannot interfere with each other would make Z2M significantly more robust.

I would work on this myself, but I'm a C++ coder and have very little JS exposure; I would make a complete hash of it.

@Koenkk
Owner

Koenkk commented Jan 9, 2024

@corporategoth that's a workaround indeed, it's good that I'm aware of this now. I will check if any other users are also affected by it and implement this if I have the time for it.

@mundschenk-at
Contributor

I haven't had any startup crashes, but I do notice slowdowns sometimes (in motion-triggered light automations), which might be related to logging levels (I had it all on debug to be able to use logs if necessary).

@Koenkk
Owner

Koenkk commented Jan 9, 2024

@mundschenk-at could you try to confirm if this indeed is the cause of your issue?

@mundschenk-at
Contributor

Not sure how, as the slowdowns only happened occasionally. I've changed logging to error; if it doesn't happen over a week or two, it probably was the logging, but that's still not definite proof. However, the sheer number of MQTT messages in a larger network (with HA discovery) might warrant multiple threads nonetheless.

@corporategoth
Author

@Koenkk The startup crashes are usually due to a zigbee device timeout, which brings up Koenkk/zigbee-herdsman#856 and Koenkk/zigbee-herdsman#857 - i.e. because of the main loop being overloaded (by HA traffic, or the logging thereof), and the timer being set at "request work from zigbee" time instead of "sent work to zigbee" time.

But either way, the isolation of MQTT workloads, Zigbee device workloads and logging can only help your app. It's something I'd have done in a C++ implementation a long time ago. These systems interact with each other, but are async in nature, so they don't need to interlock or starve each other.

@sjorge
Sponsor Contributor

sjorge commented Jan 9, 2024

Hmm interesting, I notice an improvement in overall responsiveness with the log set to info instead of debug when my adaptive lighting automation triggers; it causes a bunch of reporting when the group updates around 20-ish lights.

Usually briefly afterwards the system is sluggish to respond for a few seconds, but it's better with a lower log level.

It's generally not a super big issue for me personally though, as it only triggers when there are steep changes in sun-above-horizon %, so mostly on the short winter days (and I usually also rate limit it, which I removed this afternoon for testing).

@corporategoth
Author

corporategoth commented Jan 9, 2024

Note that every MQTT publish is logged at info (not just debug):

info  2024-01-09 15:27:09: MQTT publish: topic 'zigbee2mqtt/Foyer Lights', payload '{"action":null,"activeEnergyReports":250,"activePowerReports":25,"autoTimerOff":0,"auxSwitchUniqueScenes":null,"bindingOffToOnSyncLevel":null,"brightness":254,"brightnessLevelForDoubleTapDown":null,"brightnessLevelForDoubleTapUp":null,"buttonDelay":"100ms","defaultLed1ColorWhenOff":255,"defaultLed1ColorWhenOn":255,"defaultLed1IntensityWhenOff":101,"defaultLed1IntensityWhenOn":101,"defaultLed2ColorWhenOff":255,"defaultLed2ColorWhenOn":255,"defaultLed2IntensityWhenOff":101,"defaultLed2IntensityWhenOn":101,"defaultLed3ColorWhenOff":255,"defaultLed3ColorWhenOn":255,"defaultLed3IntensityWhenOff":101,"defaultLed3IntensityWhenOn":101,"defaultLed4ColorWhenOff":255,"defaultLed4ColorWhenOn":255,"defaultLed4IntensityWhenOff":101,"defaultLed4IntensityWhenOn":101,"defaultLed5ColorWhenOff":255,"defaultLed5ColorWhenOn":255,"defaultLed5IntensityWhenOff":101,"defaultLed5IntensityWhenOn":101,"defaultLed6ColorWhenOff":255,"defaultLed6ColorWhenOn":255,"defaultLed6IntensityWhenOff":101,"defaultLed6IntensityWhenOn":101,"defaultLed7ColorWhenOff":255,"defaultLed7ColorWhenOn":255,"defaultLed7IntensityWhenOff":101,"defaultLed7IntensityWhenOn":101,"defaultLevelLocal":255,"defaultLevelRemote":255,"deviceBindNumber":null,"dimmingSpeedDownLocal":10,"dimmingSpeedDownRemote":0,"dimmingSpeedUpLocal":25,"dimmingSpeedUpRemote":0,"doubleTapClearNotifications":"Disabled","doubleTapDownToParam56":null,"doubleTapUpForFullBrightness":"Button Press Event + Set Load to 100%","doubleTapUpToParam55":null,"energy":1.77,"firmwareUpdateInProgressIndicator":"Enabled","higherOutputInNonNeutral":null,"individual_led_effect":null,"internalTemperature":null,"invertSwitch":"No","last_seen":"2024-01-09T15:27:09-05:00","ledBarScaling":null,"ledColorWhenOff":null,"ledColorWhenOn":null,"ledIntensityWhenOff":1,"ledIntensityWhenOn":33,"led_effect":null,"linkquality":132,"loadLevelIndicatorTimeout":"Stay 
On","localProtection":"Disabled","maximumLevel":null,"minimumLevel":null,"onOffLedMode":"All","outputMode":null,"overheat":null,"periodicPowerAndEnergyReports":300,"power":18.4,"powerType":null,"rampRateOffToOnLocal":5,"rampRateOffToOnRemote":5,"rampRateOnToOffLocal":10,"rampRateOnToOffRemote":10,"relayClick":"Disabled (Click Sound On)","remoteProtection":null,"smartBulbMode":null,"state":"ON","stateAfterPowerRestored":0,"switchType":null,"update":{"installed_version":16908815,"latest_version":16908815,"state":"idle"},"update_available":null}'

Which means there is still a lot of logging going on, even at info.

@sjorge
Sponsor Contributor

sjorge commented Jan 9, 2024

I wonder if moving that one from info -> debug would help.

@corporategoth
Author

@sjorge It might help for cases of lots of logging, IF people don't WANT to log the MQTT publishes.

However, for those that DO want to log it, there is no reason the zigbee traffic (or indeed mqtt traffic) should be slowed down because of a higher log level.

@Koenkk
Owner

Koenkk commented Jan 18, 2024

I investigated the multi-threading a bit, but it doesn't seem too easy to do. This will also bring herdsman debug logging and z2m logging out of sync, which makes debugging harder.

I think we should first investigate some ways to reduce the logging.

@corporategoth

  • what do you think if we just stop logging the Home Assistant discovery messages?
  • Could you provide your debug log so that I can see what kind of messages get logged in your setup?

@sjorge
Sponsor Contributor

sjorge commented Jan 18, 2024

Perhaps we could introduce a 'subsystem' to the logging we can tune?

e.g. herdsman, converter, homeassistant, mqtt, ...

So we could then have these on different levels. Samba does it like this, and it's pretty nice to keep most things at a minimal level except the certain things you are interested in at the moment.
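That per-subsystem tuning could look something like this. A sketch only: the subsystem names and default levels are made up for illustration, not anything Z2M implements.

```javascript
// Log levels in decreasing severity; a message passes if its level is
// at or below the threshold configured for its subsystem.
const LEVELS = { error: 0, warning: 1, info: 2, debug: 3 };

// Hypothetical per-subsystem overrides, Samba-style
const subsystemLevels = {
  herdsman: 'debug',        // verbose only where we're investigating
  homeassistant: 'warning', // silence discovery chatter
};

function shouldLog(subsystem, level, defaultLevel = 'info') {
  const threshold = subsystemLevels[subsystem] ?? defaultLevel;
  return LEVELS[level] <= LEVELS[threshold];
}
```

So `shouldLog('homeassistant', 'info')` would be suppressed while `shouldLog('herdsman', 'debug')` still passes, keeping the overall log volume down without losing the one subsystem you're debugging.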

@corporategoth
Author

@Koenkk I actually don't care as much about the logging. I can always set the logging to WARN and have no real logs. It's annoying, but I don't look at the logs often anyway.

I'm actually much more concerned about a massive influx of MQTT traffic (i.e. when HA does its auto-configure, and as such does a config request for everything, and gets a LOT of data flowing back the other way for each entity) causing delays in zigbee device communications.

Even if it doesn't freeze out the zigbee device entirely (though it could), it could turn something that had no delay, into a few seconds delay on processing a response from the zigbee network (or sending a request to the zigbee network). Even with logging at WARN.

That's more concerning to me, as that can't just be turned off or throttled like logging can without consequence. Though I thought logging might be a quick win: at least in the case of the z2m logging that is tunable from the UI (so ignoring herdsman logging, which requires the DEBUG env var), the logging seems to be contained to a single file, so it might be a simple case of re-implementing the logging API calls to use the thread messaging structure used by worker_threads.

But as I said, I don't know JS / node / TS well enough to do anything more than simple tweaks or hack around ineffectively ;)

@Koenkk
Owner

Koenkk commented Jan 20, 2024

and as such does a config request for everything, and gets a LOT of data flowing back the other way for each entity) causing delays in zigbee device communications.

But this only happens when HA starts, right?

@mundschenk-at
Contributor

mundschenk-at commented Jan 20, 2024

Actually, mostly anything that changes the discovery structure (group modifications, scenes ...) will clear and re-publish the config messages (or at least large chunks of them).

@corporategoth
Author

Plus - if home assistant is restarted, and Z2M is not, it should trigger a re-configuration (it's not right now, which causes HA to not subscribe to some topics, but it used to, and is supposed to - I am not sure whether this was a Z2M, MQTT or HA change that broke that). That's a separate issue, but the point is, HA autoconfiguration doesn't just happen at Z2M startup.

@mundschenk-at
Contributor

mundschenk-at commented Jan 22, 2024

So it's not a replacement for multi-threading, but I just removed a whole bunch of unnecessary discovery updates for scene changes with #20952.

@corporategoth
Author

@Koenkk Here is a sample log. As you can see, my 180 devices result in more than 38k configure messages being sent to Home Assistant. This is why I am worried about the MQTT traffic blocking zigbee traffic. Logging I can turn down, but those 38k+ MQTT messages are still being published, choking out the single JS event loop thread.

This is a successful startup (I managed to get it to start up, with debug logging, without crashing). This log covers less than 4 minutes of operation, and it's already almost 35 MB uncompressed.

log.txt.gz

@Koenkk
Owner

Koenkk commented Jan 24, 2024

I see, you have quite a few Inovelli devices, which expose a lot of entities. I guess your setup is quite extreme; nevertheless this should be fixed in z2m.

@corporategoth
Author

@Koenkk The extreme cases are where most of the action is at 🤣

Thanks - one main problem is the fact that, even though there are thousands of entities (though only 180 devices), ultimately Home Assistant just subscribes (multiple times) to the same single topic - one subscription for each entity - just pulling out the relevant field from that one topic.

It seems extremely wasteful for home assistant to have to subscribe multiple times to the same topic, or do a configuration back and forth on a per-entity basis, when realistically it should only have to do 1 back and forth per device if each entity's status is coming on a single topic (which it will subscribe to, if it gets updates for ANY entity on that device).

So if there is a way to cut down that config traffic, so that the auto-configure only happens on a per-device basis (not per entity), especially if HA knows that each entity on the device gets its traffic from the same topic update (which it kind of already does), that would go a long way.

I worked with HA to optimize performance issues on their side around this - where they ended up debouncing subscriptions to the device topic, and not performing duplicate subscribes (which made MQTT stuff actually work again, and not take 10 mins to start up!). But that was more about the subscriptions that happen as a result of auto-configuration - not the auto-configuration part. (home-assistant/core#88117, home-assistant/core#88826, home-assistant/core#88862, home-assistant/core#92172, home-assistant/core#92201)

I'm not sure who is 'driving' the auto-configuration, or exactly how it works - so if there is a way to send fewer '/config' messages (i.e. don't send 1 per entity, but 1 per device) - and it's understood that 'this is the configuration for ALL these entities, no need to configure each one individually' - that would improve things drastically.

But I'm not sure if that then requires changes on BOTH the HA MQTT and Z2M side. Or violates some other mechanism in place.

Thanks though, for seeing it's a problem and recognizing it needs some kind of solution.

@mundschenk-at
Contributor

mundschenk-at commented Jan 25, 2024

Unfortunately, that would mostly be a change on the HA side, completely revamping their auto-discovery protocol. I don't think it is realistic to change that so drastically at this point (because it touches lots of third-party solutions). But even if we designed an optional v2 of the discovery protocol, it would need to get architectural buy-in from HA first (historically, HA uses entities as its basic building blocks, the device registry is a later add-on). So at best, this is a long-term goal.

@corporategoth
Author

@mundschenk-at Which leads back to the current task of fixing Z2M so that it can handle most any load (on the MQTT side) without affecting Zigbee performance - similar to what HA did (fixing to be able to handle much greater MQTT load without stalling out HA). Which is what the original ticket is about anyway.

Still, something to put on the roadmap - as all this MQTT traffic seems superfluous in the first place :)

@corporategoth
Author

corporategoth commented Jan 27, 2024

Slightly off topic, but ...

Looking at the log above ... Why does Z2M subscribe to the config topics, getting our own messages back?
I was looking at the config messages, there are 12,752 that are MQTT publish (which is still a lot from 180 devices, but still).
But ALSO, 25,831 messages RECEIVED on the /config endpoints. Specifically, I see Z2M publishing on the /config endpoint, then receiving that message twice. For example:

info  2024-01-22 20:07:36: MQTT publish: topic 'homeassistant/light/0x6c5cb1fffe56cbd6/light/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"command_topic":"zigbee2mqtt/Rec Room Lights/set","device":{"identifiers":["zigbee2mqtt_0x6c5cb1fffe56cbd6"],"manufacturer":"Inovelli","model":"Inovelli 2-in-1 switch + dimmer (VZM31-SN)","name":"Rec Room Lights","sw_version":"2.15","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"name":null,"object_id":"rec_room_lights","origin":{"name":"Zigbee2MQTT","sw":"1.35.1","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/Rec Room Lights","unique_id":"0x6c5cb1fffe56cbd6_light_zigbee2mqtt"}'
debug 2024-01-22 20:07:39: Received MQTT message on 'homeassistant/light/0x6c5cb1fffe56cbd6/light/config' with data '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"command_topic":"zigbee2mqtt/Rec Room Lights/set","device":{"identifiers":["zigbee2mqtt_0x6c5cb1fffe56cbd6"],"manufacturer":"Inovelli","model":"Inovelli 2-in-1 switch + dimmer (VZM31-SN)","name":"Rec Room Lights","sw_version":"2.15","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"name":null,"object_id":"rec_room_lights","origin":{"name":"Zigbee2MQTT","sw":"1.35.1","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/Rec Room Lights","unique_id":"0x6c5cb1fffe56cbd6_light_zigbee2mqtt"}'
debug 2024-01-22 20:07:41: Received MQTT message on 'homeassistant/light/0x6c5cb1fffe56cbd6/light/config' with data '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"command_topic":"zigbee2mqtt/Rec Room Lights/set","device":{"identifiers":["zigbee2mqtt_0x6c5cb1fffe56cbd6"],"manufacturer":"Inovelli","model":"Inovelli 2-in-1 switch + dimmer (VZM31-SN)","name":"Rec Room Lights","sw_version":"2.15","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"name":null,"object_id":"rec_room_lights","origin":{"name":"Zigbee2MQTT","sw":"1.35.1","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/Rec Room Lights","unique_id":"0x6c5cb1fffe56cbd6_light_zigbee2mqtt"}'

There should be no reason for Z2M to subscribe to the /config endpoints, just publish to them.
We should probably have a separate ticket to NOT subscribe to messages Z2M itself publishes.

@mundschenk-at
Contributor

mundschenk-at commented Jan 27, 2024

Z2M seems to subscribe to other completely unnecessary topics as well (maybe homeassistant/#? I haven't checked the code); I've seen messages from Valetudo being processed by Z2M during debugging. Restricting that to relevant topics (HA shutdown/restart, I think) would certainly free up some processing capacity.

Edit: Yeah, it's homeassistant/#:

this.mqtt.subscribe(`${this.discoveryTopic}/#`);

@corporategoth
Author

Z2M seems to subscribe to other completely unnecessary topics as well (maybe homeassistant/#? I haven't checked the code); I've seen messages from Valetudo being processed by Z2M during debugging. Restricting that to relevant topics (HA shutdown/restart, I think) would certainly free up some processing capacity.

Z2M needs to subscribe to the command topics too - but since Z2M is what tells HA what the command topics are for each entity (aka. device), it can choose what to subscribe to.
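Since Z2M generates the command_topic values it advertises to HA, it could derive an explicit subscription list instead of subscribing to the whole discovery tree. A sketch of that idea (the function name and data shapes are illustrative, not Z2M internals):

```javascript
// Build the list of topics Z2M actually needs to hear commands on,
// mirroring the command_topic values it publishes in discovery configs
function commandTopicsFor(entities, baseTopic = 'zigbee2mqtt') {
  const topics = entities.map((e) => `${baseTopic}/${e.friendlyName}/set`);
  // plus HA's birth/status topic, to re-trigger discovery when HA restarts
  topics.push('homeassistant/status');
  return topics;
}

const topics = commandTopicsFor([
  { friendlyName: 'Post 1 Light - Small' },
  { friendlyName: 'driveway_lights' },
]);
// e.g. ['zigbee2mqtt/Post 1 Light - Small/set',
//       'zigbee2mqtt/driveway_lights/set',
//       'homeassistant/status']
```

With a list like this, unrelated traffic under homeassistant/# (Valetudo, other bridges, Z2M's own /config publishes) never reaches Z2M's message handler at all.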

@Koenkk
Owner

Koenkk commented Mar 6, 2024

@corporategoth I want to make some improvements already, and want to start with not staying subscribed to the discovery topic, so the discovery messages aren't received twice. I don't know how you run z2m, but can you comment out this line and see if it makes a big difference? I can provide more specific instructions if needed.

@corporategoth
Author

@Koenkk

So I noticed with 1.36.0 that I could already keep logging at debug and have things start up. I also switched to the ember driver and updated my device's firmware accordingly. So that's a plus.

On the upside, with the line commented out, a grep for '/config' in the logs now shows only 13,290 lines, instead of the 37,106 before I commented out the line you requested.

Additionally, all '/config' messages end about 8 seconds after the start of the app now:

debug 2024-03-06 19:35:33: Loaded state from file /data/state.json
...
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/light/1221051039810110150109113116116_33/light/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"color_mode":true,"command_topic":"zigbee2mqtt/driveway_6of6/set","device":{"identifiers":["zigbee2mqtt_1221051039810110150109113116116_33"],"manufacturer":"Zigbee2MQTT","model":"Group","name":"driveway_6of6","sw_version":"Zigbee2MQTT 1.36.0","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"max_mireds":500,"min_mireds":154,"name":null,"object_id":"driveway_6of6","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/driveway_6of6","supported_color_modes":["xy","color_temp"],"unique_id":"33_light_zigbee2mqtt"}'

As opposed to almost a full minute prior to commenting it out (with 1.36.0):

debug 2024-03-06 13:40:19: Loaded state from file /data/state.json
...
debug 2024-03-06 13:41:17: Received MQTT message on 'homeassistant/light/1221051039810110150109113116116_33/light/config' with data '{"availability":[{"topic":"zigbee2mqtt/bri
dge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"color_mode":true,"command_topic":"zigbee2mqtt/driveway_6of6/set","device":{"id
entifiers":["zigbee2mqtt_1221051039810110150109113116116_33"],"manufacturer":"Zigbee2MQTT","model":"Group","name":"driveway_6of6","sw_version":"Zigbee2MQTT 1.36.0","via_device
":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"max_mireds":500,"min_mireds":154,"name":null,"object_id":"driveway_6of6","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https:
//www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/driveway_6of6","supported_color_modes":["xy","color_temp"],"unique_id":"33_light_zigbee2mqtt"}'

On the downside, something is going wrong. For example, on Z2M startup, I picked an entity ('Post 1 Light - Small'), it published these on startup:

info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/light/0xb0ce1814001d4f3c/light/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"color_mode":true,"command_topic":"zigbee2mqtt/Post 1 Light - Small/set","device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"max_mireds":500,"min_mireds":154,"name":null,"object_id":"post_1_light_-_small","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/Post 1 Light - Small","supported_color_modes":["xy","color_temp"],"unique_id":"0xb0ce1814001d4f3c_light_zigbee2mqtt"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/sensor/0xb0ce1814001d4f3c/power/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"device_class":"power","enabled_by_default":true,"entity_category":"diagnostic","object_id":"post_1_light_-_small_power","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"state_class":"measurement","state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_power_zigbee2mqtt","unit_of_measurement":"W","value_template":"{{ value_json.power }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/sensor/0xb0ce1814001d4f3c/energy/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"device_class":"energy","enabled_by_default":true,"object_id":"post_1_light_-_small_energy","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"state_class":"total_increasing","state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_energy_zigbee2mqtt","unit_of_measurement":"kWh","value_template":"{{ value_json.energy }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/sensor/0xb0ce1814001d4f3c/linkquality/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"enabled_by_default":false,"entity_category":"diagnostic","icon":"mdi:signal","name":"Linkquality","object_id":"post_1_light_-_small_linkquality","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"state_class":"measurement","state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_linkquality_zigbee2mqtt","unit_of_measurement":"lqi","value_template":"{{ value_json.linkquality }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/sensor/0xb0ce1814001d4f3c/last_seen/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"device_class":"timestamp","enabled_by_default":false,"entity_category":"diagnostic","icon":"mdi:clock","name":"Last seen","object_id":"post_1_light_-_small_last_seen","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_last_seen_zigbee2mqtt","value_template":"{{ value_json.last_seen }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/sensor/0xb0ce1814001d4f3c/update_state/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"enabled_by_default":false,"entity_category":"diagnostic","icon":"mdi:update","name":"Update state","object_id":"post_1_light_-_small_update_state","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_update_state_zigbee2mqtt","value_template":"{{ value_json['update']['state'] }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/binary_sensor/0xb0ce1814001d4f3c/update_available/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"device_class":"update","enabled_by_default":false,"entity_category":"diagnostic","name":null,"object_id":"post_1_light_-_small_update_available","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"payload_off":false,"payload_on":true,"state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_update_available_zigbee2mqtt","value_template":"{{ value_json['update']['state'] == \"available\" }}"}'
info  2024-03-06 19:35:41: MQTT publish: topic 'homeassistant/update/0xb0ce1814001d4f3c/update/config', payload '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"command_topic":"zigbee2mqtt/bridge/request/device/ota_update/update","device":{"identifiers":["zigbee2mqtt_0xb0ce1814001d4f3c"],"manufacturer":"Sengled","model":"Element color plus E12 (E1F-N5E)","name":"Post 1 Light - Small","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"device_class":"firmware","entity_category":"config","entity_picture":"https://github.com/Koenkk/zigbee2mqtt/raw/master/images/logo.png","json_attributes_template":"{\"in_progress\": {{ iif(value_json['update']['state'] == 'updating', 'true', 'false') }} }","json_attributes_topic":"zigbee2mqtt/Post 1 Light - Small","latest_version_template":"{{ value_json['update']['latest_version'] }}","latest_version_topic":"zigbee2mqtt/Post 1 Light - Small","name":null,"object_id":"post_1_light_-_small","origin":{"name":"Zigbee2MQTT","sw":"1.36.0","url":"https://www.zigbee2mqtt.io"},"payload_install":"{\"id\": \"0xb0ce1814001d4f3c\"}","state_topic":"zigbee2mqtt/Post 1 Light - Small","unique_id":"0xb0ce1814001d4f3c_update_zigbee2mqtt","value_template":"{{ value_json['update']['installed_version'] }}"}'

If I look at my MQTT server, and run mosquitto_sub -u test -P '****' -v -t 'homeassistant/+/0xb0ce1814001d4f3c/#', they are all returning correctly (both when Z2M is down, showing retained messages, and getting updated on Z2M startup).

With the line commented out, on Home Assistant the device exists but shows no entities - as if it didn't configure at all. It knows the device exists, but it completely ignored the /config entries published.

[screenshot: post-1-entities]

However, if I uncomment the line again (ie. revert to default 1.36.0 behavior), I get:

[screenshot: post-1-entities-working]

@corporategoth
Author

corporategoth commented Mar 7, 2024

One other thing (I can file a separate bug report if you like). I noticed with 1.36.0, Z2M seems to be ignoring commands from HA.

I'm using ember, so I don't know if that's part of it. But Z2M is RECEIVING commands from HA:
Device command:

debug 2024-03-07 07:33:52: Received MQTT message on 'zigbee2mqtt/Post 1 Light - Small/set' with data '{"state":"OFF"}'
debug 2024-03-07 07:33:52: Publishing 'set' 'state' to 'Post 1 Light - Small'

Group command:

debug 2024-03-07 07:36:26: Received MQTT message on 'zigbee2mqtt/driveway_lights/set' with data '{"state":"OFF"}'
debug 2024-03-07 07:36:26: Publishing 'set' 'state' to 'driveway_lights'

Just not doing anything with it: the actual device state doesn't change, meaning any automation from HA is meaningless. It IS publishing states to HA, so I can see the state of things in HA, just not control them.
This is using the stock homeassistant.js from 1.36.0.

EDIT: Submitted #21707 to track this

@Koenkk
Owner

Koenkk commented Mar 8, 2024

@corporategoth I cannot understand yet why commenting that line breaks HA. It will only prevent clear messages from being published (which shouldn't happen in your case as you don't have removed devices)

@Koenkk
Owner

Koenkk commented May 14, 2024

@corporategoth could you confirm this problem is still present with 1.37.1? @Nerivec made some great performance improvements lately.

@corporategoth
Author

I increased my log level to DEBUG (I'm on 1.37.1). For startup.

  1. no crash
  2. doing all the config stuff takes about 50 seconds:

First /config message

[2024-05-14 15:01:37] debug:    z2m: MQTT publish: topic 'homeassistant/binary_sensor/1221051039810110150109113116116_0xe0798dfffea88b0a/connection_state/config', payload '{"device":{"hw_version":"EmberZNet 7.4.1 [GA]","identifiers":["zigbee2mqtt_bridge_0xe0798dfffea88b0a"],"manufacturer":"Zigbee2MQTT","model":"Bridge","name":"Zigbee2MQTT Bridge","sw_version":"1.37.1"},"device_class":"connectivity","entity_category":"diagnostic","name":"Connection state","object_id":"zigbee2mqtt_bridge_connection_state","origin":{"name":"Zigbee2MQTT","sw":"1.37.1","url":"https://www.zigbee2mqtt.io"},"payload_off":"offline","payload_on":"online","state_topic":"zigbee2mqtt/bridge/state","unique_id":"bridge_0xe0798dfffea88b0a_connection_state_zigbee2mqtt","value_template":"{{ value_json.state }}"}'

Last /config message

[2024-05-14 15:02:25] debug:    z2m: Received MQTT message on 'homeassistant/light/1221051039810110150109113116116_33/light/config' with data '{"availability":[{"topic":"zigbee2mqtt/bridge/state","value_template":"{{ value_json.state }}"}],"brightness":true,"brightness_scale":254,"command_topic":"zigbee2mqtt/driveway_6of6/set","device":{"identifiers":["zigbee2mqtt_1221051039810110150109113116116_33"],"manufacturer":"Zigbee2MQTT","model":"Group","name":"driveway_6of6","sw_version":"Zigbee2MQTT 1.37.1","via_device":"zigbee2mqtt_bridge_0xe0798dfffea88b0a"},"max_mireds":500,"min_mireds":154,"name":null,"object_id":"driveway_6of6","origin":{"name":"Zigbee2MQTT","sw":"1.37.1","url":"https://www.zigbee2mqtt.io"},"schema":"json","state_topic":"zigbee2mqtt/driveway_6of6","supported_color_modes":["xy","color_temp"],"unique_id":"33_light_zigbee2mqtt"}'
  3. However ...
$ grep "/config" log.log | wc -l
42897
$ grep "/config" log.log | grep -c "MQTT publish"
14226
$ grep "/config" log.log | grep -c "Received MQTT message"
28671

I've not tried removing the line you asked about last time, but it seems it is still receiving everything twice (once, presumably, from the retained state in the MQTT server, and again when every message we publish is echoed back to us - though the log seems to indicate it sends all the publishes before any receives).

@Koenkk
Owner

Koenkk commented May 15, 2024

@corporategoth so the situation did improve? (the only issue is that it's a bit slow now?)

@corporategoth
Author

For the multi-threading issue? Yeah, it seems so - the only issue left relates to entity-based discovery (i.e. 12k entities) instead of device-based (180 devices), which makes discovery take a long time.

I would still recommend moving the subscribe of /config topics before the publish, processing them all (to figure out what to delete), then unsubscribing from the /config topics before publishing to them, so Z2M doesn't end up getting all the /config messages twice. But it's not causing a crash on startup anymore from what I can tell.

@jbouwh

jbouwh commented May 16, 2024

Because the config topics are retained, they will be replayed when you subscribe to them at startup. If they are received, there is no need to resend them; it is enough to just catch up. It is also recommended to subscribe to the HA MQTT birth message and use it to trigger any state updates. If you send updates anyway, do not use retained state updates, as that will result in processing 2 updates.
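The birth-message flow described above could be sketched roughly like this. This is an illustrative sketch only, not Z2M's actual code; the `publish` callback, the `stateCache` map, and `makeBirthHandler` are all assumed names standing in for a real MQTT client and state store (`homeassistant/status` is HA's default birth/will topic):

```javascript
// Hypothetical sketch: when Home Assistant announces it is online,
// re-publish cached entity states (non-retained, so HA only ever
// processes each update once).
const HA_STATUS_TOPIC = 'homeassistant/status'; // HA default birth topic

function makeBirthHandler(publish, stateCache) {
  // publish(topic, payload, retain) stands in for the real MQTT client;
  // stateCache is a Map of state topic -> last known state object.
  return (topic, payload) => {
    if (topic !== HA_STATUS_TOPIC || payload !== 'online') return;
    for (const [stateTopic, state] of stateCache) {
      publish(stateTopic, JSON.stringify(state), false); // retain = false
    }
  };
}
```

The key design point is the `retain = false` on the re-published states: HA catches up via the birth-triggered burst instead of replayed retained messages, avoiding double processing.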

@Koenkk
Owner

Koenkk commented May 21, 2024

@corporategoth I've merged #22701 to dev now, I expect this to hugely improve the situation, can you check it on your setup? Make sure to restart z2m at least twice, in case it works, you should see a lot of this +- 5 seconds after z2m startup:

Skipping discovery of 'sensor/0x00158d0001d8e1d2/action/config', already discovered

In case it works, could you provide the same stats as you did in #20648 (comment)?
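The "already discovered" skip seen in that log line could work roughly as follows. This is a minimal sketch of the idea, not the actual #22701 implementation; all function and variable names here are hypothetical:

```javascript
// Hypothetical sketch: remember the last discovery payload per /config
// topic; retained copies replayed at startup seed the cache, so identical
// payloads are not re-published.
const discovered = new Map(); // config topic -> serialized payload

function onRetainedConfig(topic, payloadString) {
  // Called for retained /config messages received right after subscribing.
  discovered.set(topic, payloadString);
}

function maybeDiscover(topic, payload, publish, log) {
  const serialized = JSON.stringify(payload);
  if (discovered.get(topic) === serialized) {
    log(`Skipping discovery of '${topic}', already discovered`);
    return false;
  }
  discovered.set(topic, serialized);
  publish(topic, serialized);
  return true;
}
```

Under this scheme, a second restart publishes almost nothing (matching the stats reported later in this thread), while a changed or brand-new config topic still goes out.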

@mundschenk-at
Contributor

home-assistant/core#109030 has finally been merged (in addition to a slew of performance and stability updates for MQTT discovery/parsing) and is tentatively scheduled for the 2024.6.0 release of Home Assistant. I would wait until that version is released, but then I think we should try to switch to device-based discovery quickly.

@Koenkk
Owner

Koenkk commented Jun 3, 2024

@corporategoth could you check the performance with 1.38.0? It would be nice if you could provide the same info as in #20648 (comment)

@mundschenk-at
Contributor

mundschenk-at commented Jun 6, 2024

Unfortunately, device-level discovery was reverted for 2024.6.0. The new draft PR is home-assistant/core#118757. There is also a corresponding documentation PR under home-assistant/home-assistant.io#33072.

@corporategoth
Author

corporategoth commented Jun 6, 2024

@mundschenk-at Yeah, that team there seems to have blinders on.

They don't care about the effect requiring 12,000 messages to configure 180 devices (my exact situation) has on other projects. Two of their maintainers (@bdraco and @jbouwh) seem on board with the fact that 1) this is blindingly obvious, and 2) reducing the overall number of messages to the MQTT broker makes sense.

One of them (@emontnemery) won't accept it unless it improves HA performance specifically by more than some arbitrary target number. Heck, even 12% or 16% is not enough (the other maintainers made that argument), I don't know what he wants, 100%? 500%? Running the integration gives you back CPU cycles?

I tried making the argument to them that, even if the performance was WORSE, causing a blast of 12000 messages for a mere 180 devices is not being a good software citizen, and should be fixed. Apparently they don't care, because "it works for my tiny setup" and "that's OTHER software, I only care about HA." And they decided to make it "club members only" from that point on.

So I've given up hope of device-level discovery ever being merged. They have no interest in correcting bad design (which entity-level discovery is). Sure, they will then have to maintain device AND entity level discovery from then on, but if entity level discovery is deprecated, then it will slowly die / not need updates from then on. That's how software works in professional development. You fix the bad design, and eat the cost of doing the bad design in the first place.

end rant ... sorry, I've never had good interactions with the MQTT implementers in HA. I had to repeatedly, over months, report issues and even write a fix myself (which they rejected both because I'm not a "club" member, and because they wanted a piecemeal fix instead of a holistic one) when their code was processing every retained message in an n^2 manner (ie. every single entity for a device, processed the retained messages n times, depending how many entities on a device) - which caused their startup to consume 100% cpu, take 10 minutes and time out, and FAIL for most entities in a large setup. At first they tried to blame Z2M, and I had to PROVE it was their code, AND write a fix. And they fought me every step of the way.

really end rant ;)

@jbouwh

jbouwh commented Jun 6, 2024

Unfortunately, device-level discovery was reverted for 2024.6.0. The new draft PR is home-assistant/core#118757. There is also a corresponding documentation PR under home-assistant/home-assistant.io#33072.

Still hope we can do some tests

@mundschenk-at
Contributor

While I get your frustration, @corporategoth, I think this is overly harsh. HA is a really large and busy (as in the number of contributions) project, so I can understand that not everything can be merged immediately. I assume some discussions went on on Discord (or some other private venue) as well, so we only got to see the end results of those discussions in the public PR, which made them seem very sudden and not deliberative.

@jbouwh Can we help in some way to move the PR forward?

@corporategoth
Author

corporategoth commented Jun 6, 2024

@mundschenk-at No, it's not overly harsh. The MQTT component only has 3 real contributors. @jbouwh did an excellent job of creating a device-based discovery MR, trying to solve a specific problem (which was NOT about performance, but about flooding other systems). It was done quite quickly and was complete, just going through review/tweaking. I was even giving constructive feedback on it, back when it was solely about device-based discovery and fixing the issue at hand.

It was only later they decided to tack on performance goals to this task, which are unrelated. This is proven, because much of the performance improvements that came out have nothing to do with device-based discovery - and they got into 2024.6.0. They could have, and should have created another ticket / MR to track performance issues in the HA MQTT implementation. But they hijacked the original MR instead, and then had it sit for months because "it doesn't give enough of a performance improvement" - despite that having nothing to do with why device-based discovery is a thing.

My previous MR to fix the n^2 retained message issue was rejected because it incorporated 3 fixes in one MR (all 3 of which were required to fix this issue properly - and were eventually incorporated months later when @jbouwh got around to re-implementing my fixes). However for the device discovery MR, all of a sudden, any performance improvement was fair game to add to it, despite being completely unrelated, delaying this MR and the fix therein. AND, now there were minimum performance metrics that had to be hit for this MR (apparently something more than 12%), despite device discovery not being about performance, but about overwhelming other products (MQTT broker, Z2M, even HA itself can't keep up with high message volumes). All of this being the executive decision of 1 of the 3 maintainers. There is a reason I refuse to submit any fixes to the MQTT HA implementation ever again.

I actually would be less upset if they had prioritized performance issues over the device-based discovery, and had let the device-based discovery sit while they were doing that. And it was still pending. But the fact they hijacked the device-based discovery MR, then locked it down to maintainers only when I pointed out that it has nothing to do with performance and continued to discuss performance issues in the device-based discovery MR is what incensed me. Especially after the previous "each MR should do only ONE small thing" debacle.

@Koenkk
Owner

Koenkk commented Jun 6, 2024

@corporategoth could you check #20648 (comment) ?

@corporategoth
Author

@Koenkk I saw - since I use Z2M on TrueNAS, I need to wait for TrueCharts to update to 1.38.0 to test new versions in a reasonable way. As soon as Z2M 1.38.0 is in TrueCharts, I will indeed test it :)

@corporategoth
Author

Weird - I see the truecharts repo is updated to have 1.38.0 (https://github.com/truecharts/charts/blob/master/charts/stable/zigbee2mqtt/Chart.yaml) - but refreshing the repo on my TrueNAS system doesn't show an upgrade available. I'll have to investigate that.

@corporategoth
Author

Ahh, I see. While the charts repo was updated, the catalog repo (which is used by truenas) was not - https://github.com/truecharts/catalog/blob/main/stable/zigbee2mqtt/12.1.5/Chart.yaml
Ugh.

@corporategoth
Author

OK, I switched to using a manual docker thing now.

Results so far:

grep '/config' log.log | grep -c "Receiv"
11804
grep '/config' log.log | grep -c "publish"
2381
grep '/config' log.log | grep -c "Skipp"
11803

First message:

[2024-06-07 01:34:52] debug:    z2m:mqtt: Received MQTT message on 'homeassistant/binary_sensor/1221051039810110150109113116116_0xe0798dfffea88b0a/connection_state/config' with data '{"device":{"hw_version":"EmberZNet 7.4.1 [GA]","identifiers":["zigbee2mqtt_bridge_0xe0798dfffea88b0a"],"manufacturer":"Zigbee2MQTT","model":"Bridge","name":"Zigbee2MQTT Bridge","sw_version":"1.38.0"},"device_class":"connectivity","entity_category":"diagnostic","name":"Connection state","object_id":"zigbee2mqtt_bridge_connection_state","origin":{"name":"Zigbee2MQTT","sw":"1.38.0","url":"https://www.zigbee2mqtt.io"},"payload_off":"offline","payload_on":"online","state_topic":"zigbee2mqtt/bridge/state","unique_id":"bridge_0xe0798dfffea88b0a_connection_state_zigbee2mqtt","value_template":"{{ value_json.state }}"}'

Last message:

[2024-06-07 01:34:58] debug:    z2m: Skipping discovery of 'light/1221051039810110150109113116116_33/light/config', already discovered

So down to 6 seconds. Impressive :)

Not sure what's going on with the message numbers there. Does this imply that it created another 2381 NEW config topics, and skipped the 11803 existing topics because they were already present in the MQTT server? Ouch.

I will try a second restart soon to see what the second run brings.

@corporategoth
Author

Yeah, confirmed on the second run:

I now have:

grep "/config" log.log | grep -ci "Recei"
14455
grep "/config" log.log | grep -ci "publish"
1
grep "/config" log.log | grep -ci "skip"
14183

It still finished in 5 seconds though. Not publishing saves a lot of time :D

So apparently I have 14k config topics (eep!) for my 174 devices.

@Koenkk
Owner

Koenkk commented Jun 8, 2024

Great, so it seems we have some acceptable performance now? It was not only the publishing, btw; for every entity it was generating the full config of the device, and this is now calculated once per device.
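The per-device optimization described above (compute the shared device config once instead of once per entity) could be sketched like this. This is illustrative only; `makeDeviceConfigCache`, `buildDeviceConfig`, and the `ieeeAddr` field are assumed names, not Z2M's actual identifiers:

```javascript
// Hypothetical sketch: memoize the (expensive) per-device config block
// by device address, so N entities on one device trigger one computation.
function makeDeviceConfigCache(buildDeviceConfig) {
  const cache = new Map(); // device address -> config object
  return function getDeviceConfig(device) {
    if (!cache.has(device.ieeeAddr)) {
      cache.set(device.ieeeAddr, buildDeviceConfig(device));
    }
    return cache.get(device.ieeeAddr);
  };
}
```

With ~80 entities per device (14k config topics for 174 devices, per the stats above), memoizing this block turns roughly 14k computations into roughly 174.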

@corporategoth
Author

With 1.38.0, I would say so. I'm sure it will be even better with device-based discovery if/when HA implements it, as that will reduce the number of config messages it receives (to check for deleted devices) too. But at 5s, and with the other changes, crashing on startup is a thing of the past. So thanks! :)

@Koenkk
Owner

Koenkk commented Jun 8, 2024

Great, I guess we can finally close this issue, from the Z2M side we did everything we could 😄

@Koenkk Koenkk closed this as completed Jun 8, 2024