Modbus TCP reconnect issue #39
Comments
How do you recover from this situation? By restarting modmqttd?
That is correct. It also recovers if I unplug the ethernet from my computer, since that drops my IP address, which forces the connection closed. I think the kernel does clean up that socket eventually if I wait tens of minutes, but I am not really sure what's going on there. I just noticed it had happened while I wasn't paying attention and a new connection had opened up (different source port in the tcp dump).
Just fyi, I tried rebuilding with libmodbus from source (v3.1.10) but had the same issue.
The packet trace really threw me for a loop. Notice how the device (192.168.2.2) keeps acking the same seq (1920) while my machine keeps resending the same segments (1640:1920). I didn't post it, but if you look at the kernel buffer for that connection in netstat/ss, the buffer just grows at a steady rate until it maxes out. Even if it did recover eventually, after a long enough drop in the network the appropriate behaviour would be to drop the connection and reopen it; otherwise we'd be treated to 15+ minutes' worth of responses as fast as the device could churn through the buffer.
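As an aside on the kernel side of this: Linux exposes a per-socket knob, TCP_USER_TIMEOUT, that caps how long unacknowledged data may sit in the send buffer before the kernel aborts the connection with ETIMEDOUT instead of retransmitting for many minutes. The sketch below is not part of modmqttd; it is a minimal illustration using only standard socket API calls. Applying it to a libmodbus connection would require fetching the underlying fd (e.g. via `modbus_get_socket()`), which is one reason an application-level watchdog is the more portable fix.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_USER_TIMEOUT
#define TCP_USER_TIMEOUT 18  // available since Linux 2.6.37
#endif

// Ask the kernel to abort the connection (ETIMEDOUT) if transmitted data
// stays unacknowledged for longer than timeout_ms, instead of retransmitting
// the same segments indefinitely while the send queue grows.
inline bool set_tcp_user_timeout(int fd, unsigned timeout_ms) {
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms)) == 0;
}
```

With this set, a read/write on the stuck socket would eventually fail instead of blocking behind retransmissions, letting the application notice and reconnect.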
I am thinking about adding some kind of watchdog to modbus threads. A first version would simply watch whether the thread can poll any data from any slave within a configurable time. If not, then the thread would restart itself: destroy the existing libmodbus context, create a new one and do an initial poll request. I had a similar problem with RTU (#9), too.
I guess there is no way to configure it right now so that we only send data to the modbus device (only writes, no reads)? I was wondering if polling was the right place, or if you really wanted your watchdog to reset on any successful action. Is there any special consideration needed for the write side? I haven't dug far enough into the message queues to see whether the modbus code maintains a list of the unsuccessful write requests and ensures those eventually get taken care of. If the context is recreated, are all previous (unsuccessful) write requests lost?
Just to be clear, is this something you would like me to take a stab at implementing? I am digging into the code as if I am going to fix it, but I certainly don't want to duplicate work if you are already planning on working on it.
I've fixed the simple part - bad response timeouts now also apply to TCP mode in ae9ab63. As for the watchdog, you are right - it should monitor reads and writes. An idea: ModbusThread maintains the incoming queue, so when the Watchdog switches to a new context, processing is paused. After a quick look I think that initialPoll is not needed, and there is no need to reset ModbusThread::mPoller. IModbusContext could store the last failed write per slave. After a reset, the watchdog could get this data from the old context and just call IModbusContext::writeModbusRegisters() for each one. If you want it fast, then make a PR. I won't be able to work on it in the next week or two. If you have any questions about the existing code, just ask here. Thanks!
I appreciate your responsiveness. I am not sure it's totally worth trying to maintain the last write per slave as part of the initial fix. In the ideal case, I would think we would want to try to maintain the QoS as specified by mqtt (as best we can, anyway): e.g., for 0 (at most once) the failed write would just be allowed to fail; for 1 or 2, we would want to be persistent with those writes (solving 2 would probably be a bit tricky, though). I am somewhat new to mqtt, so if I misunderstood something fundamental about the QoS, please excuse my ignorance. Solving that would require maintaining at least the last write for each relevant register(s), rather than just the last write per slave. Anyway, that sort of processing seems like it belongs at the higher level, not within the modbus layer itself. Solving it in the modbus layer would mean adding information about the QoS to the write requests, which feels wrong. I do see there is a message in the reverse direction informing the higher layer about the failed writes, so it does seem like this would be possible to achieve if it is indeed desirable. I could see wanting this eventually, but it seems like a much bigger can of worms. Side note: not sure if it has any effect, but it does look like you are processing RegisterWriteFailed messages as if they are RegisterReadFailed indiscriminately in MqttClient::processRegistersOperationFailed. Getting off that tangent and back to the task at hand: I will probably tackle this early next week unless something more important comes up. The codebase is clean, so I don't think understanding it will be an issue; I think any questions will mostly be about what you think is the best course of action. I know this is in regard to the part I just argued against doing, but I think the proposition to store data in the IModbusContext conflicts with the idea that we would create a new context when the watchdog triggers. However, this did get me thinking...
If the context just automatically closed the connection when no successful action was taken within the configured time, wouldn't that solve the issue? The thread loop would see that the connection is closed and automatically reopen it. That work could be done rather quickly, either:
(this one feels a little silly, but it maintains IModbusContext as an abstract interface class) edit: Making the watchdog timeout configurable with this method doesn't really work well.
Thoughts?
QoS is fulfilled when the mosquitto callback is called; it does not matter for the rest of the processing. Normally you have a state topic which the command sender can subscribe to. This way the sender checks whether the command was processed successfully. There is a new RPC-style API in MQTT5, but it is not implemented in mqmgateway yet.
I was thinking about destroying the whole context, but maybe modbus_disconnect/modbus_connect is enough. Anyway, I would rather not add this watchdog logic to ModbusThread directly, but to some ModbusThread::mWatchdog instance, and feed this instance with data about read and write errors. Then ModbusThread should ask it from its control loop whether we should reconnect. The Watchdog should maintain all the information it needs to make the decision about reconnection. The condition for reconnection is "we are unable to read from any slave for more than X seconds". You are right that it does not make sense to store last failed writes. If in the future there is some logic to retry writes X times and only then send RegisterWriteFailed, maybe the watchdog should retry the last write operation after reconnection. For now a write is tried only once, so we do not need to worry about this. This class could be extended in the future - for example, a configuration where you poll one slave every 1s and a second one every hour will not work well with the condition above. We may want to exclude some slaves, or specify a number of tries instead of a time period, etc. ... this should be read from a modbus.watchdog configuration section.
The idea behind IModbusContext and IMqttImpl is that I can use mocked versions in unit tests. This allows me to test gateway behavior without the libmodbus and mosquitto dependencies. I treat them as a stripped-down API for those external libraries, and I cannot put any internal logic there, because it would not be possible to create unit tests for it. For the watchdog behavior I would like to have unit tests too; they are required for a merge into the master branch.
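A minimal sketch of the watchdog described above might look like the following. All names here (ModbusWatchdog and its methods) are hypothetical illustrations, not the project's actual API; time is passed into every call so the class can be unit-tested without waiting on a real clock.

```cpp
#include <chrono>

// Hypothetical watchdog: ModbusThread feeds it read results and asks it
// from the control loop whether the connection should be reset.
class ModbusWatchdog {
public:
    using clock = std::chrono::steady_clock;

    explicit ModbusWatchdog(std::chrono::milliseconds maxSilence)
        : mMaxSilence(maxSilence) {}

    // Called once when polling starts, so the first timeout is measured
    // from a known point in time.
    void start(clock::time_point now) { mLastGoodRead = now; mStarted = true; }

    // Called after every successful slave read.
    void readOk(clock::time_point now) { mLastGoodRead = now; }

    // "We are unable to read from any slave for more than X seconds."
    bool shouldReconnect(clock::time_point now) const {
        return mStarted && (now - mLastGoodRead) > mMaxSilence;
    }

private:
    std::chrono::milliseconds mMaxSilence;
    clock::time_point mLastGoodRead{};
    bool mStarted = false;
};
```

Read errors do not need a dedicated callback in this sketch: the absence of `readOk()` calls is what eventually trips `shouldReconnect()`.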
ok. I am happy with that implementation plan. I should be able to start it later this afternoon.
I suppose that's one way to look at it, but I think it also makes certain combinations of QoS pointless. This is hypothetical, since I don't believe mqmgateway supports specifying the mqtt QoS, but presuming it did: what would be the point of using QoS 2 between the broker and mqmgateway if mqmgateway can drop the message? If I had an application where I wanted to guarantee that some register eventually gets set to value X through the mqtt->modbus pipeline, then in this case using QoS 2 wouldn't help at all. The only way I could think to achieve that would be to periodically send some message to the mqtt broker and monitor that register through an independent return path. It would make sense for the QoS of the return path to be 2, but for the send path, setting the register QoS to 2 or 1 would be pointless and you would get the same guarantees as with 0. Basically, not extending the QoS guarantees all the way down to the modbus device means that at the application level users wouldn't be able to depend on those guarantees. This is totally an aside from the issue at hand, so please forgive me for continuing the digression; I just enjoy these sorts of conversations.
QoS is for message delivery, not message processing. For writes you can set QoS to 2 if you want to avoid a missed write due to a message lost in transit, or possible multiple writes with QoS 1.
RPC-like calls are solved with MQTT5 request-response topic. A good description is here. |
Ok, so I got 3 possibilities sorta sketched out. Option 1: Wrapper class. The modbus_thread would wrap the constructed modbus context in a class which itself implements the IModbusContext interface. It would then pass this to the modbus_poller, and additionally use it when attempting to write registers. Within the main loop it would ask it something like:
Most of the methods are just pass-through. Its configuration could either be wrapped into ModbusNetworkConfig or made its own separate item.
Pros: the modbus_poller doesn't need to change at all; in fact, most of the code that interacts with the context doesn't need to change. Cons: to properly unit test this I'd need to mock out a clock class or something, since using a wall clock internally is not really conducive to good unit tests. Option 2: Utility class which is manually invoked. This would need to be passed into the modbus_poller in addition to living in the thread, since it would need to have the appropriate method invoked on each failure/success. Time is passed into each call here to make it easily unit testable.
Option 3: Message queue based. This is basically the same as the last one, but it works directly on the message queues. Or rather, the modbus_thread would invoke the appropriate method as part of its queue processing to inform this thing about the various events. Again, time is passed in here to assist with unit testing. To make it really work, I'd probably need to add a message to report successful register writes, since while it does send a MsgRegisterValues after a successful write, it only does so if we are also polling those registers.
Option 2 is the one I like best. But thinking about it more, I would rather add some read stats to the ModbusPoller class. Do not bother about writes for now - the whole write handling needs to be rewritten due to #40. I plan to rename ModbusPoller to ModbusExecutor and add the write logic to it. Unit tests rely on the system clock now. It would be better to use some mockable class instead of std::chrono, but I did not find a good replacement solution at the time it was designed. Anyway, for a test it is enough to count the number of fake connections in MockedModbusContext and assert on it after the test.
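The connection-counting test approach mentioned above could be sketched as follows. This is a hypothetical stand-in, not the project's actual MockedModbusContext; it only illustrates how counting `connect()` calls lets a unit test assert that a reconnect actually happened, without touching libmodbus or the system clock.

```cpp
// Simplified mock of a modbus context that records connection attempts.
struct MockedContext {
    int connectCount = 0;
    bool connected = false;

    void connect()    { ++connectCount; connected = true; }
    void disconnect() { connected = false; }
};

// Models the control loop's reaction to the watchdog's verdict: tear the
// connection down and bring it back up when the watchdog fires.
inline void maybeReconnect(MockedContext& ctx, bool watchdogFired) {
    if (watchdogFired) {
        ctx.disconnect();
        ctx.connect();
    }
}
```

A test then drives `maybeReconnect()` with simulated watchdog decisions and asserts on `connectCount` afterwards.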
Is this getting closer to what you would like?
At that point, I am not even sure ModbusWatchdog needs to be stateful if it's just acting off the stats provided to it in the moment.
Closer. I have a feeling now that my idea of monitoring every slave may not have been correct. Maybe monitoring every read within the Watchdog timeframe is better? I mean, if the Watchdog is configured for 5s, then if there is no successful read in 5s the connection should be reset.
It might be YAGNI, but I am inclined to say it would be better to have a ModbusStats object which is shared and provided to both the poller and the watchdog. The poller (or whatever it mutates into) would call into it to report events; the watchdog would access it to decide if we should disconnect. I could see those stats being useful for other things. For example, they could be periodically published via mqtt as a monitoring/diagnostics mechanism. The specifics of how the watchdog decides to disconnect could freely change, since they would be encapsulated within it. If multiple different methods of making that determination are ever needed, that could become a policy of the watchdog. A watchdog policy that doesn't care about multiple slaves certainly works best for my use cases, but then again, in the past decade I've pretty much exclusively used modbus with a single slave per tcp connection, so I might be overlooking some other use case. But even if there was some slave that we had not tried to query in a while, I can't see how resetting the connection would do any harm. If the connection can be brought back up, it will be. If it can't, then surely we wouldn't have been able to communicate with that slave anyway? Just for transparency: while I do still want to work on this as a contribution (spirit of open source and all), the time frame is not urgent, so until we decide on a design I am not putting too much time into this. Doing something as simple as this: Chance-Maritime-Technologies@40a2d60 pretty much mitigates the problem for my usage.
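The shared-stats idea above could be sketched like this. ModbusStats and silenceExceeded are hypothetical names, not mqmgateway code; the point is that the poller only reports events into the stats object, while the disconnect decision becomes a pure policy function that reads from it and could be swapped later.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical shared stats: the poller writes to it, while the watchdog
// (or a future mqtt diagnostics publisher) only reads from it.
struct ModbusStats {
    using clock = std::chrono::steady_clock;
    uint64_t readsOk = 0;
    uint64_t readsFailed = 0;
    clock::time_point lastGoodRead{};

    void reportRead(bool ok, clock::time_point now) {
        if (ok) { ++readsOk; lastGoodRead = now; }
        else    { ++readsFailed; }
    }
};

// One possible watchdog policy: reset the connection when no read has
// succeeded within the configured limit. Encapsulating the decision here
// keeps the poller unaware of how (or why) disconnects are triggered.
inline bool silenceExceeded(const ModbusStats& s,
                            ModbusStats::clock::time_point now,
                            std::chrono::milliseconds limit) {
    return s.readsOk != 0 && (now - s.lastGoodRead) > limit;
}
```

A per-slave policy, or one based on retry counts instead of elapsed time, would only change the policy function, not the stats object.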
It will be less work to add watchdog after #40 is closed, so if this hotfix works for you then I would rather postpone it. |
ok
If you don't mind, I am planning to fix both this issue and #9. I've already confirmed the TCP reconnect problem in my setup. With a switch between the gateway and the device, the connection is not resumed.
Of course I don't mind.
It is mostly done in modbus_watchdog branch. Two new features there:
Disconnect freezes the Executor state, so if the first write triggers a reconnect, the next one will be sent after the connection is established. This should improve reliability a little on RTU networks, where single read and write errors happen from time to time. I am not merging into the master branch yet due to suspicious logs related to command retries on my semi-broken RTU network. In your case it should work ok :-)
Awesome. I'll try it out. I won't be able to get to it today or tomorrow, but I should be able to try it first thing Monday morning. Thank you!
branch modbus_watchdog merged to master in a2fe853 after short tests. |
just wanted to confirm that this is indeed fixed. Thanks again |
Steps to reproduce:
1. Computer and device are both connected via a simple switch.
2. Run the configuration below.
3. Disconnect the device from the switch (or cut its power).
4. Wait for some period of time, then reconnect the ethernet or restore power.
At this point the connection doesn't recover.
What is observed:
Notice the packet trace on the right. You can see at time 1713450984.766486 I reconnected the device, but at this point the sliding window never progresses further...
What should happen:
Frankly, libmodbus should be handling this... and I do believe this is a libmodbus bug... actually now that I am typing this I am wondering if this is fixed in a newer version... however:
I was wondering if you would be opposed to me adding some application-level guards against this sort of timeout? Basically a separate, higher application-level timeout to guard against libmodbus errors: if we don't get any sort of response back within the given time, forcefully reset the whole modbus connection.
(excuse me being lazy for not making this a separate bug/pull request but...) bonus bug:
The modbus timeout parameters are not being set for TCP connections, eg:
And also, the default value of 1s is not actually a valid value for the modbus timeout parameters:
https://github.com/stephane/libmodbus/blob/master/docs/modbus_set_response_timeout.md
the max is 999999 us
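To illustrate the constraint from the libmodbus docs: `modbus_set_response_timeout()` takes the timeout as a (seconds, microseconds) pair, and the microseconds part must stay below one million, so a 1-second timeout must be passed as (1, 0), never as (0, 1000000). A small helper (hypothetical, not mqmgateway code) that splits a duration into a valid pair:

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Split a duration into the (to_sec, to_usec) pair expected by
// modbus_set_response_timeout(). The usec part is kept below 1,000,000,
// so 1s becomes (1, 0) rather than the invalid (0, 1000000).
inline std::pair<uint32_t, uint32_t>
to_modbus_timeout(std::chrono::microseconds d) {
    auto usec = static_cast<uint64_t>(d.count());
    return { static_cast<uint32_t>(usec / 1000000),
             static_cast<uint32_t>(usec % 1000000) };
}
```

The pair would then be passed straight through: `modbus_set_response_timeout(ctx, t.first, t.second)`.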
OS: ubuntu 22.04
Libmodbus version: libmodbus5-3.1.6-2
Device: https://www.advantech.com/en-us/products/a67f7853-013a-4b50-9b20-01798c56b090/adam-6066/mod_e256cafc-36dd-4740-86a0-d7e53b5c4ddf
Configuration (abbreviated):