You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
we recently noticed some strange behaviour of mbusd in some of our more complex multimaster setups.
We use mbusd to read multiple devices one rtu bus from multiple modbus tcp masters.
After working flawlessly for some time, we noticed to mbusd started to cause a suspicious amount of cpu load.
Since this a production system and mbusd was compiled without the debug option,
I tried strace and observed the following sequence of syscalls:
$ strace -p $(pidof mbusd) -e _newselect, read
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=532417}) = 1 (in [8], left {tv_sec=0, tv_usec=532409})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=532122}) = 1 (in [8], left {tv_sec=0, tv_usec=532115})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531826}) = 1 (in [8], left {tv_sec=0, tv_usec=531819})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531533}) = 1 (in [8], left {tv_sec=0, tv_usec=531526})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531242}) = 1 (in [8], left {tv_sec=0, tv_usec=531235})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530950}) = 1 (in [8], left {tv_sec=0, tv_usec=530943})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530655}) = 1 (in [8], left {tv_sec=0, tv_usec=530648})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530364}) = 1 (in [8], left {tv_sec=0, tv_usec=530358})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530072}) = 1 (in [8], left {tv_sec=0, tv_usec=530065})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529778}) = 1 (in [8], left {tv_sec=0, tv_usec=529771})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529483}) = 1 (in [8], left {tv_sec=0, tv_usec=529476})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529190}) = 1 (in [8], left {tv_sec=0, tv_usec=529184})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528899}) = 1 (in [8], left {tv_sec=0, tv_usec=528893})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528604}) = 1 (in [8], left {tv_sec=0, tv_usec=528597})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528309}) = 1 (in [8], left {tv_sec=0, tv_usec=528302})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527991}) = 1 (in [8], left {tv_sec=0, tv_usec=527984})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527694}) = 1 (in [8], left {tv_sec=0, tv_usec=527687})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527402}) = 1 (in [8], left {tv_sec=0, tv_usec=527395})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527109}) = 1 (in [8], left {tv_sec=0, tv_usec=527102})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526815}) = 1 (in [8], left {tv_sec=0, tv_usec=526808})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526520}) = 1 (in [8], left {tv_sec=0, tv_usec=526513})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526225}) = 1 (in [8], left {tv_sec=0, tv_usec=526218})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525931}) = 1 (in [8], left {tv_sec=0, tv_usec=525924})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525640}) = 1 (in [8], left {tv_sec=0, tv_usec=525634})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525347}) = 1 (in [8], left {tv_sec=0, tv_usec=525340})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525055}) = 1 (in [8], left {tv_sec=0, tv_usec=525048})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524761}) = 1 (in [8], left {tv_sec=0, tv_usec=524754})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524470}) = 1 (in [8], left {tv_sec=0, tv_usec=524463})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524179}) = 1 (in [8], left {tv_sec=0, tv_usec=524172})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523885}) = 1 (in [8], left {tv_sec=0, tv_usec=523878})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523593}) = 1 (in [8], left {tv_sec=0, tv_usec=523586})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523300}) = 1 (in [8], left {tv_sec=0, tv_usec=523294})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523007}) = 1 (in [8], left {tv_sec=0, tv_usec=523001})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522716}) = 1 (in [8], left {tv_sec=0, tv_usec=522709})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522422}) = 1 (in [8], left {tv_sec=0, tv_usec=522416})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522128}) = 1 (in [8], left {tv_sec=0, tv_usec=522121})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521835}) = 1 (in [8], left {tv_sec=0, tv_usec=521828})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521543}) = 1 (in [8], left {tv_sec=0, tv_usec=521537})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521249}) = 1 (in [8], left {tv_sec=0, tv_usec=521243})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520957}) = 1 (in [8], left {tv_sec=0, tv_usec=520950})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520666}) = 1 (in [8], left {tv_sec=0, tv_usec=520659})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520374}) = 1 (in [8], left {tv_sec=0, tv_usec=520367})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520081}) = 1 (in [8], left {tv_sec=0, tv_usec=520074})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=519789}) = 1 (in [8], left {tv_sec=0, tv_usec=519782})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=519498}) = 1 (in [8], left {tv_sec=0, tv_usec=519491})
As you can see the fd 8 has data pending to be read but no read syscall occurs.
mbusd did not recover until I killed the process responsible for the tcp connection behind 8.
I replicated the scenario on a test system with a debug build:
The strace output looked similar to one above.
I can reliably recreate this situation regardless of the hardware used (rs485 adapter and modbus rtu devices), by connecting 4 modbus TCP clients and then causing a timeout on the rtu bus (eg. by power-cycling the rtu device).
Naturally I did some prodding and poking to see why this one socket descriptors never gets read.
I changed the logging to print conn->sd instead of the ip address and added some log messages at interesting points in the code.
Relevant for the following explanation are two of those changes:
Logging of the sds in sdsetrd and sdsetwr after the select and the state of the corresponding connection. See here
Logging of curconn->sd in the while loop servicing the tcp connections See here.
Triggering the problem with those changes in place resulted in this output:
05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083186
05 Jul 2019 14:38:40 conn_loop(): select() returns 1
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083167
05 Jul 2019 14:38:40 conn_loop(): select() returns 1
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083148
05 Jul 2019 14:38:40 conn_loop(): select() returns 1
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
As you can see the while loop iterates twice of the connection with sd 7, but never reaches the one with 8.
This means there is a control flow path through the switch statement for the connection state,
which skips the curconn = queue_next_elem(&queue, curconn); in lines 860 or 878.
Hence the sd 8 is never read and the next select statement terminates immediately.
Due to the complexity of the of the control flow in this section of the code I have been unable to determine the exact path leading to this problem.
Likely an additional break somewhere, but I'm not confident that I understand the code well enough to spot it.
However the condition of the surrounding while(len--) loop suggests that it was never intended to iterate over the same connection twice.
Therefore I would propose to move the curconn = queue_next_elem(&queue, curconn); out of the switch statement and into the while loop, since curconn has to advance regardless of the connections state.
See this commit.
After applying this fix, I have not been able to reproduce the busy loop behaviour.
This might also partially explain the behaviour described in #41
The text was updated successfully, but these errors were encountered:
Thank you for your efforts with this issue debugging! I think you're right, the connection queue handling definitely should be moved outside of the switch block, so I'm going to merge your fix.
I've deployed 0.4.0 on some staging systems and it dropped the load average from 0.8 to 0.04ish.
And it seems to stay stable at that level.
Usually we had at least one 'stuck' socket and high cpu load within a few hours.
Strace output also looks okay: The process spends the majority of its time sleeping while waiting for IO.
I'd confirm this one as fixed for now.
Hi,
we recently noticed some strange behaviour of mbusd in some of our more complex multimaster setups.
We use mbusd to read multiple devices one rtu bus from multiple modbus tcp masters.
After working flawlessly for some time, we noticed to mbusd started to cause a suspicious amount of cpu load.
Since this a production system and mbusd was compiled without the debug option,
I tried strace and observed the following sequence of syscalls:
As you can see the fd 8 has data pending to be read but no read syscall occurs.
mbusd did not recover until I killed the process responsible for the tcp connection behind 8.
I replicated the scenario on a test system with a debug build:
The strace output looked similar to one above.
I can reliably recreate this situation regardless of the hardware used (rs485 adapter and modbus rtu devices), by connecting 4 modbus TCP clients and then causing a timeout on the rtu bus (eg. by power-cycling the rtu device).
Naturally I did some prodding and poking to see why this one socket descriptors never gets read.
I changed the logging to print
conn->sd
instead of the ip address and added some log messages at interesting points in the code.Relevant for the following explanation are two of those changes:
sdsetrd
andsdsetwr
after the select and the state of the corresponding connection. See herecurconn->sd
in the while loop servicing the tcp connections See here.Triggering the problem with those changes in place resulted in this output:
As you can see the while loop iterates twice of the connection with sd 7, but never reaches the one with 8.
This means there is a control flow path through the switch statement for the connection state,
which skips the
curconn = queue_next_elem(&queue, curconn);
in lines 860 or 878.Hence the sd 8 is never read and the next select statement terminates immediately.
Due to the complexity of the of the control flow in this section of the code I have been unable to determine the exact path leading to this problem.
Likely an additional break somewhere, but I'm not confident that I understand the code well enough to spot it.
However the condition of the surrounding
while(len--)
loop suggests that it was never intended to iterate over the same connection twice.Therefore I would propose to move the
curconn = queue_next_elem(&queue, curconn);
out of the switch statement and into the while loop, since curconn has to advance regardless of the connections state.See this commit.
After applying this fix, I have not been able to reproduce the busy loop behaviour.
This might also partially explain the behaviour described in #41
The text was updated successfully, but these errors were encountered: