
Modbus TCP reconnect issue #39

Closed
JayHerpin opened this issue Apr 18, 2024 · 25 comments
Labels
bug Something isn't working

Comments

@JayHerpin

JayHerpin commented Apr 18, 2024

Steps to reproduce:

Computer and device are both connected via a simple switch. Run the configuration below. Disconnect the device from the switch (or cut its power). Wait for some period of time then reconnect the ethernet or restore power.

At this point the connection doesn't recover.

What is observed:

[screenshot: modmqttd logs and packet trace]

Notice the packet trace on the right. You can see that at time 1713450984.766486 I reconnected the device, but from that point the sliding window never progresses any further.

What should Happen:

Frankly, libmodbus should be handling this, and I do believe this is a libmodbus bug. Actually, now that I am typing this, I am wondering whether it is fixed in a newer version. However:

Would you be opposed to me adding some application-level guards against this sort of timeout? Basically a separate, higher application-level timeout that guards against libmodbus errors: if we don't get any response back within the given time, the whole modbus connection is forcefully reset.

(excuse me being lazy for not making this a separate bug/pull request but...) bonus bug:

The modbus timeout parameters are not being set for TCP connections, eg:

[screenshot: code showing the timeout parameters not being applied for TCP connections]

And also, the default value of 1s is not actually a valid value for the modbus timeout parameters:

https://github.com/stephane/libmodbus/blob/master/docs/modbus_set_response_timeout.md

The maximum allowed value for the microseconds argument is 999999 us.
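For illustration, a timeout has to be split into whole seconds plus a sub-second microseconds part before it is handed to libmodbus 3.1.x. A rough sketch against plain libmodbus (not mqmgateway code, and the helper name is made up):

#include <modbus.h>   // libmodbus, found via pkg-config
#include <chrono>
#include <cstdint>

// Split e.g. 1s into (1, 0) instead of passing (0, 1000000),
// which libmodbus rejects because the usec argument must stay below 1000000.
void setResponseTimeout(modbus_t* ctx, std::chrono::milliseconds timeout) {
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(timeout);
    auto usecs = std::chrono::duration_cast<std::chrono::microseconds>(timeout - secs);
    modbus_set_response_timeout(ctx, static_cast<uint32_t>(secs.count()),
                                static_cast<uint32_t>(usecs.count()));
}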


OS: ubuntu 22.04
Libmodbus version: libmodbus5-3.1.6-2
Device: https://www.advantech.com/en-us/products/a67f7853-013a-4b50-9b20-01798c56b090/adam-6066/mod_e256cafc-36dd-4740-86a0-d7e53b5c4ddf

Configuration (abbreviated):


modmqttd:
  # converter_search_path:
  #   - /usr/lib/modmqttd/
  # converter_plugins:
  # - stdconv.so
  # - exprconv.so
modbus:
  networks:
    - name: adam
      address: 192.168.2.2
      port: 502
      response_timeout: 200ms
      response_data_timeout: 200ms
      slaves:
        - address: 0
          delay_before_first_poll: 50ms
          poll_groups:
            - register: 1
              register_type: coil
              count: 6
            - register: 17
              register_type: coil
              count: 6

mqtt:
  client_id: modbus
  refresh: 1s
  broker:
    host: localhost
    port: 1883
    keepalive: 60
  objects:
    - topic: winch/anchor/manual_ctrl
      network: adam
      slave: 0
      refresh: 200ms
      state:
        register: 1
        register_type: coil
@BlackZork added the bug label on Apr 18, 2024
@BlackZork
Owner

How do you recover from this situation? By restarting modmqttd?

@JayHerpin
Author

That is correct. It also recovers if I unplug the ethernet from my computer, since that causes my IP address to drop, which forces the connection closed.

I think if I wait tens of minutes the kernel does eventually clean up that socket, but I am not really sure what's going on there. I just noticed it had happened while I wasn't paying attention and a new connection had opened up (different source port in the tcpdump capture).

@JayHerpin
Author

Just FYI, I tried rebuilding with libmodbus built from source (v3.1.10) but had the same issue.

@JayHerpin
Author

JayHerpin commented Apr 19, 2024

The packet trace really threw me for a loop. Notice how the device (192.168.2.2) keeps acking the same seq (1920) but my machine keeps resending the same segments (1640:1920). I didn't post it, but if you look at the kernel buffer for that connection in netstat/ss the buffer just continues to grow at a steady rate until it maxes out.

Even if it did recover eventually, after a long enough network outage the appropriate behaviour would be to drop the connection and reopen it; otherwise we'd be treated to 15+ minutes' worth of responses as fast as the device could churn through the buffer.

@BlackZork
Owner

I am thinking about adding some kind of watchdog to the modbus threads. A first version would simply watch whether the thread can poll any data from any slave within a configurable time. If not, the thread would restart itself: destroy the existing libmodbus context, create a new one and do the initial poll request.

I had a similar problem for RTU (#9), too.

@JayHerpin
Author

I guess there is no way to configure it right now so that we only send data to the modbus device (only writes, no reads)? I was wondering whether polling is the right place, or whether you really want the watchdog to reset on any successful action.

Is there any special consideration needed for the write side? I haven't dug far enough into the message queues to see if the modbus stuff maintains a list of the unsuccessful write requests and ensures those eventually get taken care of. If the context is recreated are all previous (unsuccessful) write requests lost?

@JayHerpin
Author

Just to be clear, is this something you would like me to take a stab at implementing? I am digging into the code as if I am going to fix it, but I certainly don't want to duplicate work if you are already planning on working on it.

@BlackZork
Owner

I've fixed the simple part: the bad response timeouts are corrected and now applied in TCP mode, in ae9ab63.

As for the watchdog, you are right: it should monitor reads and writes.

An idea:
ModbusThread should have a new mWatchdog member that checks IModbusContext for errors in ModbusThread::handleRegisterError and ModbusThread::processWrite.

ModbusThread maintains the incoming queue, so processing is paused while the watchdog switches to a new context. After a quick look I think that the initial poll is not needed, and there is no need to reset ModbusThread::mPoller.

IModbusContext could store the last failed write per slave. After a reset, the watchdog could get this data from the old context and just call IModbusContext::writeModbusRegisters() for each one.

If you want it fast, then make a PR. I won't be able to work on it in the next week or two. If you have any questions about the existing code, just ask here.

Thanks!

@JayHerpin
Author

JayHerpin commented Apr 19, 2024

I appreciate your responsiveness.

I am not sure it's totally worth trying to maintain the last write per slave as part of the initial fix. In the ideal case I would think we would want to try to maintain the QoS as specified by mqtt (as best we can, anyway): e.g., for 0 (at most once) the failed write would just be allowed to fail; for 1 or 2 we would want to be persistent with those writes (solving 2 would probably be a bit tricky, though).

I am somewhat new to mqtt so if I misunderstood something fundamental about the QoS, please excuse my ignorance.

Solving that would require maintaining at least the last write for each relevant register, rather than just the last write per slave.

Anyway, that sort of processing seems like it belongs at the higher level, not within the modbus layer itself. Solving it in the modbus layer would mean adding information about the QoS to the write requests, which feels wrong.

I do see there is a message in the reverse direction informing the higher layer about the failed writes, so it does seem like this would be possible to achieve if it is indeed desirable. I could see wanting this eventually, but it seems like a much bigger can of worms.

Side note: not sure if it has any effect, but it does look like you are processing RegisterWriteFailed messages as if they are RegisterReadFailed indiscriminately in MqttClient::processRegistersOperationFailed.

Getting off that tangent and back to the task at hand, I will probably tackle this early next week unless something more important comes up. The codebase is clean, so I don't think understanding it will be an issue. I think any questions will mostly be about what you think is the best course of action.

I know this is in regard to the part I just argued against doing, but I think the proposal to store data in the IModbusContext conflicts with the idea that we would create a new context when the watchdog triggers. However, this did get me thinking: if the context just automatically closed the connection when no successful action was taken within the configured time, wouldn't that solve the issue? The thread loop would see that the connection is closed and automatically reopen it. That work could be done rather quickly, either:

  1. by changing IModbusContext to use a non-virtual public interface pattern and having it implement that behavior itself

  2. by utilizing a wrapper ModbusContext that uses the real modbus context internally and implements this (e.g., it would forward most methods to the internal one, but monitor the calls and results), e.g.:

virtual std::vector<uint16_t> readModbusRegisters (....) {
    try {
        auto ret = mWrappedContext->readModbusRegisters (...);
        mLastSuccessfulOperation = now ();
        return ret;
    } catch (...) {
        if (now () - mLastSuccessfulOperation > timeout) disconnect ();
        throw;
    }
}

(this one feels a little silly, but maintains the IModbusContext as an abstract interface class)

edit: Making the watchdog timeout configurable with this method doesn't really work well

  3. handle that logic in the modbus_thread, e.g.:

while (mShouldRun) {
    if (mModbus) {
        if (now () - mLastSuccessfulOperation > timeout)
            mModbus->disconnect ();

        if (!mModbus->isConnected()) {
            // ... rest of main loop

Thoughts?

@BlackZork
Owner

In the ideal case I would think we would want to try to maintain the QoS as specified by mqtt

QoS is fulfilled when the mosquitto callback is called; it does not matter for the rest of the processing. Normally you have a state topic that the command sender can subscribe to. This way the sender can check whether the command was processed successfully. There is a new RPC-style API in MQTT 5, but it is not implemented in mqmgateway yet.

If the context just automatically closed the connection when no successful action was taken within the configured time, wouldn't that solve the issue?

I was thinking about destroying the whole context, but maybe modbus_disconnect/modbus_connect is enough. Anyway, I would rather not add this watchdog logic to ModbusThread directly, but to some ModbusThread::mWatchdog instance, and feed this instance with data about read and write errors. Then ModbusThread should ask it from its control loop whether we should reconnect. The watchdog should maintain all the information it needs to make the decision about reconnection.

The condition for reconnection is "we are unable to read from any slave for more than X seconds". You are right that it does not make sense to store the last failed writes. If in the future there is some logic to retry writes X times and only then send RegisterWriteFailed, maybe the watchdog should retry the last write operation after reconnection. Right now a write is tried only once, so we do not need to worry about this.
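Something like this is what I have in mind for the check itself (just a sketch, all names here are made up):

#include <chrono>

class ModbusWatchdog {
    public:
        void init(std::chrono::milliseconds watchPeriod) {
            mWatchPeriod = watchPeriod;
            mLastSuccessfulRead = std::chrono::steady_clock::now();
        }
        // fed by ModbusThread whenever any slave answers a read
        void reportSuccessfulRead() {
            mLastSuccessfulRead = std::chrono::steady_clock::now();
        }
        // asked from the ModbusThread control loop
        bool shouldReconnect() const {
            return std::chrono::steady_clock::now() - mLastSuccessfulRead > mWatchPeriod;
        }
    private:
        std::chrono::milliseconds mWatchPeriod;
        std::chrono::steady_clock::time_point mLastSuccessfulRead;
};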

This class could be extended in the future. For example, a configuration where you poll one slave every 1s and a second one every hour will not work well with the condition above. We may want to exclude some slaves, or specify a number of tries instead of a time period, etc. This should be read from a modbus.watchdog configuration section.

by changing IModbusContext to use a non-virtual public interface pattern and having it implement that behavior itself

The idea behind IModbusContext and IMqttImpl is that I can use mocked versions in unit tests. This allows me to test gateway behavior without the libmodbus and mosquitto dependencies. I treat them as a stripped-down API for those external libraries, and I cannot put any internal logic there, because it would not be possible to create unit tests for it.

For the watchdog behavior I would like to have unit tests too; they are required for a merge into the master branch.

@JayHerpin
Author

OK, I am happy with that implementation plan. I should be able to start on it later this afternoon.

In the ideal case I would think we would want to try to maintain the QoS as specified by mqtt

QoS is fulfilled when the mosquitto callback is called; it does not matter for the rest of the processing. Normally you have a state topic that the command sender can subscribe to. This way the sender can check whether the command was processed successfully. There is a new RPC-style API in MQTT 5, but it is not implemented in mqmgateway yet.

I suppose that's one way to look at it, but I think it also makes certain combinations of QoS pointless. This is hypothetical since I don't believe mqmgateway supports specifying the MQTT QoS, but presuming it did, what would be the point of using QoS 2 between the broker and mqmgateway if mqmgateway can drop the message?

If I had an application where I wanted to guarantee that some register eventually gets set to value X through the mqtt->modbus pipeline, then using QoS 2 wouldn't help at all. The only way I could think to achieve that would be to periodically send some message to the mqtt broker and monitor that register through an independent return path. It would make sense for the QoS of the return path to be 2, but for the send path setting the register QoS to 2 or 1 would be pointless and you would get the same guarantees out of using 0.

Basically not extending the guarantees of the QoS all the way down to the modbus device means that at the application level users wouldn't be able to depend on those guarantees.

This is totally an aside from the issue at hand, so please forgive me for continuing the digression; I just enjoy these sort of conversations.

@BlackZork
Owner

It would make sense for the QoS of the return path to be 2, but for the send path setting the register QoS to 2 or 1 would be pointless and you would get the same guarantees out of using 0.

QoS is for message delivery, not message processing. For writes you can set QoS to 2 if you want to avoid a missed write due to a message lost in transit, or possible multiple writes with QoS 1.

Basically not extending the guarantees of the QoS all the way down to the modbus device means that at the application level users wouldn't be able to depend on those guarantees.

RPC-like calls are solved with the MQTT 5 request-response pattern. A good description is here.

@JayHerpin
Author

OK, so I've got three possibilities roughly sketched out.

Option 1: Wrapper class.

The modbus_thread would wrap the constructed modbus context in this class, which itself implements the IModbusContext interface. It would then pass this to the modbus_poller and additionally use it when attempting to write registers.

Within the main loop it would check it with something like:


if (! m_monitoredContext->timedOutSlaves ().empty ())
   m_monitoredContext->disconnect ();

Most of the methods are just pass-through. Its configuration could either be wrapped into ModbusNetworkConfig or made its own separate item.

class ModbusWatchdog : public IModbusContext {
    public:
        // These are straight pass through
        virtual void connect() { mWrappedContext->connect(); }
        virtual bool isConnected() const { return mWrappedContext->isConnected(); }
        virtual void disconnect() { mWrappedContext->disconnect(); }
        virtual ~ModbusWatchdog() {}
        virtual ModbusNetworkConfig::Type getNetworkType() const final { return mWrappedContext->getNetworkType(); }

        explicit ModbusWatchdog (std::shared_ptr<IModbusContext> wrappedContext);
        virtual void init(const ModbusNetworkConfig& config) final;
        virtual void writeModbusRegisters(const MsgRegisterValues &msg) final {
            try {
                auto &lastSuccess = mTimeOfLastSuccess[msg.mSlaveId];
                // initialize last success to the time of the first attempt
                if (lastSuccess == std::chrono::steady_clock::time_point ())
                    lastSuccess = std::chrono::steady_clock::now();
                mWrappedContext->writeModbusRegisters(msg);
                lastSuccess = std::chrono::steady_clock::now();
                mConsecutiveFailures[msg.mSlaveId] = 0;
            } catch (...) {
                ++mConsecutiveFailures[msg.mSlaveId];
                throw;
            }
        }
        // this would look similar to writeModbusRegisters
        virtual std::vector<uint16_t> readModbusRegisters(int slaveId, const RegisterPoll& regData) final;

        // Get the list of slaves which are presently timed out
        std::vector<int> timedOutSlaves() const {
          std::vector<int> ret;

          const auto now = std::chrono::steady_clock::now();
          for (auto &tols : mTimeOfLastSuccess)
            if (now - tols.second > mTimeout)
              ret.push_back(tols.first);

          return ret;
        }

    private:
        std::shared_ptr<IModbusContext> mWrappedContext;
        std::chrono::steady_clock::duration mTimeout;
        // Time of the last successful operation indexed per slave
        std::map<int, std::chrono::steady_clock::time_point> mTimeOfLastSuccess;
        std::map<int, int> mConsecutiveFailures;
};

Pros: the modbus_poller doesn't need to change at all; in fact, most of the code that interacts with the context doesn't need to change.

Cons: To properly unit test this I'd need to mock out a clock class or something, since using a wall clock internally is not really conducive to good unit tests.

Option 2: Utility Class which is manually invoked

This would need to be passed into the modbus_poller in addition to living in the thread since it would need to have the appropriate method invoked on each failure/success.

Time is passed into each call here to make it easily unit testable.

///////////////////////////////////////////////////////////////////////////.
class ModbusWatchdog {
    public:

        void init (const ModbusWatchdogConfig &config);
        /// Could be extended to use 
        void recordFailure (std::chrono::steady_clock::time_point time, int slaveId, ....);
        void recordSuccess (std::chrono::steady_clock::time_point time, int slaveId, ....);

        std::vector<int> timedOutSlaves () const;
        bool isTimedOut () const { return ! timedOutSlaves ().empty (); }
};

Option 3: Message Queue based

This is basically the same as the last one but works directly on the message queues. Or rather, the modbus_thread would invoke the appropriate method as part of its queue processing to inform this thing about the various events.

Again, time is passed in here to assist with unit testing.

To make it really work, I'd probably need to add a message to report successful writes of the register, since while it does send a MsgRegisterValues after successful write, it only does this if we are also polling those registers.

///////////////////////////////////////////////////////////////////////////.
class ModbusWatchdog {
    public:

        void init (const ModbusWatchdogConfig &config);

        void process(std::chrono::steady_clock::time_point time, QueueItem item) {
            if (item.isSameAs(typeid(MsgRegisterValues))) {
                recordSuccess (time, item);
            } else if (item.isSameAs(typeid(MsgMqttNetworkState))) {
                // reset all timeouts here?
            } else if (item.isSameAs(typeid(MsgRegisterReadFailed))) {
                recordFailure (time, item);
            } else if (item.isSameAs(typeid(MsgRegisterWriteFailed))) {
                recordFailure (time, item);
            } else {
                assert (false); // unhandled
            }
        }

        std::vector<int> timedOutSlaves () const;
        bool isTimedOut () const { return ! timedOutSlaves ().empty (); }
};

@BlackZork
Owner

BlackZork commented Apr 23, 2024

Option 2 is the one I like the best. But thinking about it more, I would rather add some read stats to the ModbusPoller class.

You can just do ModbusWatchdog::updateState(const ModbusPoller&) then.

Do not bother about writes for now. The whole write handling needs to be rewritten due to #40. I plan to rename ModbusPoller to ModbusExecutor and add the write logic to it.

Unit tests rely on the system clock now. It would be better to use some mockable class instead of std::chrono, but I did not find a good replacement solution at the time it was designed. Anyway, for tests it is enough to count the number of fake connections in MockedModbusContext and assert on it after the test.
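If someone wanted to add one, a minimal mockable clock could look something like this (just a sketch, not existing code):

#include <chrono>

class IClock {
    public:
        virtual ~IClock() = default;
        virtual std::chrono::steady_clock::time_point now() const = 0;
};

// real clock used in production code
class SteadyClock : public IClock {
    public:
        std::chrono::steady_clock::time_point now() const override {
            return std::chrono::steady_clock::now();
        }
};

// controllable clock for unit tests
class FakeClock : public IClock {
    public:
        std::chrono::steady_clock::time_point now() const override { return mNow; }
        void advance(std::chrono::milliseconds d) { mNow += d; }
    private:
        std::chrono::steady_clock::time_point mNow;
};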

@JayHerpin
Author

Is this getting closer to what you would like?

class ModbusSlaveStats {
    public:
        // Reset internal stats (such as after reconnecting)
        void reset();

        /// Register a read failure
        void readFailure();
        /// Register a read success
        void readSuccess();

        /// Return the amount of time elapsed since our first read failure without a successful read.
        /// This returns 0s if the most recent read request was not a failure
        /// This also returns 0s if no read requests have been made
        std::chrono::steady_clock::duration readFailureTime () const;

        /// Return the number of consecutive read failures since we started having read issues
        /// This returns 0 if the most recent read request was not a failure or no read requests have been made
        int readFailureCount() const;

    private:
...
};


class ModbusPoller {
    public:
        // not sure what to call this
        void reset () {
            if (mModbus->isConnected ())
            {
                for (auto &s : mStats)
                    s.second.reset ();
            }
        }

        const std::map<int, ModbusSlaveStats> &slaveStats () const { return mStats; } 

    private:
        std::map<int, ModbusSlaveStats> mStats;
};
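The watchdog check could then be as simple as something like this (hypothetical, using the interfaces sketched above):

// stateless check over the per-slave stats exposed by the poller
bool anySlaveTimedOut(const ModbusPoller &poller, std::chrono::steady_clock::duration limit) {
    for (const auto &entry : poller.slaveStats())
        if (entry.second.readFailureTime() > limit)
            return true;
    return false;
}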

At that point, I am not even sure ModbusWatchdog needs to be stateful if it's just acting off the stats provided to it at that instant.

@BlackZork
Owner

Closer. I have a feeling now that it may be better if std::map<int, ModbusSlaveStats> mStats were owned by the Watchdog and the Poller were able to update it somehow (1). It makes more sense to me to reset the Watchdog after a successful reconnection instead of the Poller. The mStats data fits better as Watchdog data than as Poller data (2).

I am not sure my idea of monitoring every slave was correct. Maybe monitoring every read within the watchdog timeframe is better? I mean, if the watchdog is configured for 5s, then if there is no successful read in 5s the connection should be reset.

  1. Poller.allDone() could be used as a condition somewhere in the ModbusThread control loop.
  2. The Poller already has its own dynamic slave map with info about what is left to poll, but I plan to add write queues to it, so it will maintain a static slave map with read and write queues.

@JayHerpin
Author

It might be YAGNI, but I am inclined to say it would be better to have a ModbusStats object which is shared and provided to both the poller and the watchdog. The poller (or whatever it mutates into) would call into it to report events. The watchdog would access it to decide whether we should disconnect.

I could see those stats being useful for other things. For example, they could be periodically published via mqtt as a monitoring/diagnostics mechanism.

The specifics of how the watchdog decides whether we disconnect could freely change, since they would be encapsulated within it. If multiple different methods of making that determination are ever needed, that could become a policy of the watchdog. A watchdog policy that doesn't care about multiple slaves certainly works best for my use cases, but then again, in the past decade I've pretty much exclusively used modbus with a single slave per TCP connection, so I might be overlooking some other use case. But even if there were some slave that we had not tried to query in a while, I can't see how resetting the connection would do any harm. If the connection can be brought back up, it will be. If it can't, then surely we wouldn't have been able to communicate with that slave anyway?

Just for transparency, while I still want to work on this and contribute (spirit of open source and all), the time frame is not urgent; so until we decide on a design I am not putting too much time into this.

Basically, doing something as simple as this: Chance-Maritime-Technologies@40a2d60 pretty much mitigates the problem for me (at least within my usage).

@BlackZork
Owner

Basically, doing something as simple as this: Chance-Maritime-Technologies@40a2d60 pretty much mitigates the problem for me (at least within my usage).

It will be less work to add the watchdog after #40 is closed, so if this hotfix works for you then I would rather postpone it.

@JayHerpin
Author

OK.

@BlackZork
Owner

BlackZork commented May 4, 2024

If you don't mind, I am planning to fix both this issue and #9. I've already confirmed the TCP reconnect problem in my setup. With a switch between the gateway and the device, the connection is not resumed.

@JayHerpin
Author

Of course I don't mind.

@BlackZork
Owner

It is mostly done in the modbus_watchdog branch. Two new features there:

  • a watchdog with a configurable watch_period that tracks commands sent to all slaves. If there is no successful command execution in the watch period, it calls mModbus->disconnect().
  • ModbusExecutor gets the ability to retry write and read commands, configurable at the network or slave level.

Disconnect freezes the Executor state, so if the first write triggers a reconnect, the next one will be sent after the connection is established. This should improve reliability a little on RTU networks, where single read and write errors happen from time to time.

I am not merging into the master branch yet due to suspicious logs related to command retries on my semi-broken RTU network. In your case it should work ok :-)

@JayHerpin
Author

Awesome, I'll try it out. I won't be able to get to it today or tomorrow, but should be able to try it first thing Monday morning. Thank you!

@BlackZork
Owner

Branch modbus_watchdog was merged to master in a2fe853 after short tests.

@JayHerpin
Author

Just wanted to confirm that this is indeed fixed. Thanks again!
