Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readable socket never gets read, conn_loop degrades into busy loop #52

Closed
LongHairedHacker opened this issue Jul 5, 2019 · 3 comments
Closed

Comments

@LongHairedHacker
Copy link

LongHairedHacker commented Jul 5, 2019

Hi,

we recently noticed some strange behaviour of mbusd in some of our more complex multimaster setups.
We use mbusd to read multiple devices one rtu bus from multiple modbus tcp masters.

After working flawlessly for some time, we noticed to mbusd started to cause a suspicious amount of cpu load.

Since this a production system and mbusd was compiled without the debug option,
I tried strace and observed the following sequence of syscalls:

$ strace -p $(pidof mbusd) -e _newselect, read
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=532417}) = 1 (in [8], left {tv_sec=0, tv_usec=532409})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=532122}) = 1 (in [8], left {tv_sec=0, tv_usec=532115})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531826}) = 1 (in [8], left {tv_sec=0, tv_usec=531819})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531533}) = 1 (in [8], left {tv_sec=0, tv_usec=531526})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=531242}) = 1 (in [8], left {tv_sec=0, tv_usec=531235})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530950}) = 1 (in [8], left {tv_sec=0, tv_usec=530943})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530655}) = 1 (in [8], left {tv_sec=0, tv_usec=530648})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530364}) = 1 (in [8], left {tv_sec=0, tv_usec=530358})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=530072}) = 1 (in [8], left {tv_sec=0, tv_usec=530065})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529778}) = 1 (in [8], left {tv_sec=0, tv_usec=529771})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529483}) = 1 (in [8], left {tv_sec=0, tv_usec=529476})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=529190}) = 1 (in [8], left {tv_sec=0, tv_usec=529184})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528899}) = 1 (in [8], left {tv_sec=0, tv_usec=528893})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528604}) = 1 (in [8], left {tv_sec=0, tv_usec=528597})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=528309}) = 1 (in [8], left {tv_sec=0, tv_usec=528302})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527991}) = 1 (in [8], left {tv_sec=0, tv_usec=527984})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527694}) = 1 (in [8], left {tv_sec=0, tv_usec=527687})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527402}) = 1 (in [8], left {tv_sec=0, tv_usec=527395})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=527109}) = 1 (in [8], left {tv_sec=0, tv_usec=527102})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526815}) = 1 (in [8], left {tv_sec=0, tv_usec=526808})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526520}) = 1 (in [8], left {tv_sec=0, tv_usec=526513})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=526225}) = 1 (in [8], left {tv_sec=0, tv_usec=526218})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525931}) = 1 (in [8], left {tv_sec=0, tv_usec=525924})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525640}) = 1 (in [8], left {tv_sec=0, tv_usec=525634})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525347}) = 1 (in [8], left {tv_sec=0, tv_usec=525340})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=525055}) = 1 (in [8], left {tv_sec=0, tv_usec=525048})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524761}) = 1 (in [8], left {tv_sec=0, tv_usec=524754})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524470}) = 1 (in [8], left {tv_sec=0, tv_usec=524463})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=524179}) = 1 (in [8], left {tv_sec=0, tv_usec=524172})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523885}) = 1 (in [8], left {tv_sec=0, tv_usec=523878})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523593}) = 1 (in [8], left {tv_sec=0, tv_usec=523586})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523300}) = 1 (in [8], left {tv_sec=0, tv_usec=523294})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=523007}) = 1 (in [8], left {tv_sec=0, tv_usec=523001})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522716}) = 1 (in [8], left {tv_sec=0, tv_usec=522709})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522422}) = 1 (in [8], left {tv_sec=0, tv_usec=522416})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=522128}) = 1 (in [8], left {tv_sec=0, tv_usec=522121})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521835}) = 1 (in [8], left {tv_sec=0, tv_usec=521828})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521543}) = 1 (in [8], left {tv_sec=0, tv_usec=521537})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=521249}) = 1 (in [8], left {tv_sec=0, tv_usec=521243})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520957}) = 1 (in [8], left {tv_sec=0, tv_usec=520950})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520666}) = 1 (in [8], left {tv_sec=0, tv_usec=520659})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520374}) = 1 (in [8], left {tv_sec=0, tv_usec=520367})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=520081}) = 1 (in [8], left {tv_sec=0, tv_usec=520074})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=519789}) = 1 (in [8], left {tv_sec=0, tv_usec=519782})
_newselect(9, [3 4 5 6 8], [], NULL, {tv_sec=0, tv_usec=519498}) = 1 (in [8], left {tv_sec=0, tv_usec=519491})

As you can see the fd 8 has data pending to be read but no read syscall occurs.
mbusd did not recover until I killed the process responsible for the tcp connection behind 8.

I replicated the scenario on a test system with a debug build:

04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475635 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475581 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475528 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475475 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475422 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475368 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475314 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475259 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475205 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475152 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475099 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:475045 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474990 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474937 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474883 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474829 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474773 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474720 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474666 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474612 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474558 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474503 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474449 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474394 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474341 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474286 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474233 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474179 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474126 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474072 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:474019 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473965 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473911 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473857 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473802 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473749 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473694 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473640 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473586 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 
04 Jul 2019 15:17:58 conn_loop(): select(): max_sd = 8, t_out = 000000:473533 
04 Jul 2019 15:17:58 conn_loop(): select() returns 2 

The strace output looked similar to one above.
I can reliably recreate this situation regardless of the hardware used (rs485 adapter and modbus rtu devices), by connecting 4 modbus TCP clients and then causing a timeout on the rtu bus (eg. by power-cycling the rtu device).

Naturally I did some prodding and poking to see why this one socket descriptors never gets read.
I changed the logging to print conn->sd instead of the ip address and added some log messages at interesting points in the code.
Relevant for the following explanation are two of those changes:

  1. Logging of the sds in sdsetrd and sdsetwr after the select and the state of the corresponding connection. See here
  2. Logging of curconn->sd in the while loop servicing the tcp connections See here.

Triggering the problem with those changes in place resulted in this output:

05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083186 
05 Jul 2019 14:38:40 conn_loop(): select() returns 1 
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083167 
05 Jul 2019 14:38:40 conn_loop(): select() returns 1 
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn_loop(): select(): max_sd = 8, t_out = 000000:083148 
05 Jul 2019 14:38:40 conn_loop(): select() returns 1 
05 Jul 2019 14:38:40 conn[8]: can read.
05 Jul 2019 14:38:40 conn[8]: state is CONN_HEADER
05 Jul 2019 14:38:40 conn[5] is being serviced.
05 Jul 2019 14:38:40 conn[6] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.
05 Jul 2019 14:38:40 conn[7] is being serviced.

As you can see the while loop iterates twice of the connection with sd 7, but never reaches the one with 8.
This means there is a control flow path through the switch statement for the connection state,
which skips the curconn = queue_next_elem(&queue, curconn); in lines 860 or 878.
Hence the sd 8 is never read and the next select statement terminates immediately.

Due to the complexity of the of the control flow in this section of the code I have been unable to determine the exact path leading to this problem.
Likely an additional break somewhere, but I'm not confident that I understand the code well enough to spot it.

However the condition of the surrounding while(len--) loop suggests that it was never intended to iterate over the same connection twice.
Therefore I would propose to move the curconn = queue_next_elem(&queue, curconn); out of the switch statement and into the while loop, since curconn has to advance regardless of the connections state.
See this commit.

After applying this fix, I have not been able to reproduce the busy loop behaviour.

This might also partially explain the behaviour described in #41

@3cky
Copy link
Owner

3cky commented Jul 8, 2019

Thank you for your efforts with this issue debugging! I think you're right, the connection queue handling definitely should be moved outside of the switch block, so I'm going to merge your fix.

@3cky 3cky closed this as completed in c582e6f Jul 8, 2019
@LongHairedHacker
Copy link
Author

Thanks for the quick response.

I've deployed 0.4.0 on some staging systems and it dropped the load average from 0.8 to 0.04ish.
And it seems to stay stable at that level.
Usually we had at least one 'stuck' socket and high cpu load within a few hours.
Strace output also looks okay: The process spends the majority of its time sleeping while waiting for IO.
I'd confirm this one as fixed for now.

@3cky
Copy link
Owner

3cky commented Jul 8, 2019

Great, thanks for testing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants