
OMD 5.11.20230318-labs-edition seems to freeze/block the livestatus socket #163

Open · infraweavers opened this issue Mar 29, 2023 · 6 comments


infraweavers commented Mar 29, 2023

This is unfortunately a little vague at the moment; however, it seems that when we put PG001 (a host) into downtime on 5.11.20230318, naemon ends up locking up or becoming broken in some way.

Under normal circumstances, lsof on the livestatus socket returns this:

Every 2.0s: lsof /omd/sites/default/tmp/run/live                                                                                                                                                             OMD002: Wed Mar 29 10:43:51 2023
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
naemon  444013 default   12u  unix 0x000000003aced7f1      0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
naemon  444027 default   12u  unix 0x000000003aced7f1      0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM

However, when it is broken (i.e. Thruk is timing out communicating with the socket), lsof shows:

OMD[default@OMD002]:~$ lsof /omd/sites/default/tmp/run/live
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF    NODE NAME
naemon  348464 default   12u  unix 0x00000000106e0b91      0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM
naemon  348464 default   19u  unix 0x000000003ea4cdc5      0t0 1772423 /omd/sites/default/tmp/run/live type=STREAM
naemon  348477 default   12u  unix 0x00000000106e0b91      0t0 1394514 /omd/sites/default/tmp/run/live type=STREAM

It looks like naemon has spun up another file handle to the socket, or something along those lines.

Thruk logs the following errors:

[2023/03/29 10:31:06][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.
[2023/03/29 10:31:07][OMD002][ERROR] 491: failed to connect - failed to connect to /omd/sites/default/tmp/run/live: Resource temporarily unavailable at /omd/sites/default/share/thruk/lib/Thruk/Backend/Manager.pm line 1631.

There is nothing significant, nor anything that looks like an error, in naemon.log itself or in livestatus.log.
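A quick sanity check while it is wedged, to rule out Thruk itself, is to send a minimal query straight at the socket with a timeout. A rough sketch, assuming unixcat (shipped with livestatus) is available inside the site:

printf 'GET status\nColumns: program_version\n\n' | timeout 5 unixcat /omd/sites/default/tmp/run/live || echo "no answer from livestatus within 5s"

If that also hangs, livestatus itself is stuck rather than just the Thruk backend connection.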

Our resolution for the problem is:

killall -9 naemon; omd restart naemon

We are not convinced that the downtime action is actually what is causing it; it may just be that it has correlated with the event multiple times.
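For reference, in case it helps to narrow this down, the same downtime can be scheduled straight through the external command pipe instead of via Thruk, to test whether the downtime path alone triggers it. A rough sketch, assuming the usual OMD pipe location (tmp/run/naemon.cmd; adjust if different) and the standard SCHEDULE_HOST_DOWNTIME syntax; author/comment are arbitrary:

now=$(date +%s)
# [timestamp] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
printf '[%d] SCHEDULE_HOST_DOWNTIME;PG001;%d;%d;1;0;3600;omdadmin;freeze test\n' "$now" "$now" "$((now+3600))" > /omd/sites/default/tmp/run/naemon.cmd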


infraweavers commented Mar 29, 2023

Hmm, we've just had this happen sporadically, with only the two sockets showing in lsof:

naemon  444013 default   12u  unix 0x0000000086c5cff7      0t0 9310587 /omd/sites/default/tmp/run/live type=STREAM
naemon  444027 default   12u  unix 0x000000003aced7f1      0t0 1806884 /omd/sites/default/tmp/run/live type=STREAM
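(To tell which of the two PIDs is the parent and which the child, ps with the parent PID column is enough; a quick sketch:)

ps -o pid,ppid,lstart,cmd -p 444013,444027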

strace of the two PIDs; one is very busy, the other is not:

@OMD002:~$ sudo strace --attach=444027
strace: Process 444027 attached
restart_syscall(<... resuming interrupted read ...>) = 0
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500)  = 0 (Timeout)
kill(444013, 0)                         = 0
poll([{fd=13, events=POLLIN}], 1, 500^Cstrace: Process 444027 detached


sni commented Mar 29, 2023

This might be linked to the recent changes in naemon comment/downtime handling, but it needs more investigation.

infraweavers commented:

Yeah, the other thread is absolutely hammering away at something like this, and it looks like it's the same data over and over:

[screenshot: strace output from the busy PID, showing what appears to be the same data written over and over]
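A capture over a few seconds with longer string output would make that easier to read and share; a rough sketch, with 444013 being the busy PID here:

# ~10s trace of write() calls, keeping up to 256 chars of each string
timeout 10 strace -f -p 444013 -e trace=write -s 256 -o /tmp/naemon-busy.strace
# or just a syscall count summary (the SIGINT from timeout makes strace print its counters)
timeout -s INT 10 strace -c -f -p 444013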

infraweavers commented:

> This might be linked to the recent changes in naemon comment/downtime handling, but it needs more investigation.

Do you have any suggestions for how we can gather more information?
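For instance, would full thread backtraces from the stuck process be useful? Something along these lines, assuming gdb (and ideally naemon debug symbols) is available on the box:

gdb -p 444013 -batch -ex 'thread apply all bt' > /tmp/naemon-bt.txt 2>&1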


infraweavers commented Mar 29, 2023

Also, this really doesn't look right:
[screenshot: a section of the retention file, with downtime/comment entries repeated many times]

It looks like the retention file has all of the downtime/comment data duplicated many times...
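A rough way to quantify that, assuming the usual OMD retention file location and the standard retention.dat field names (treat this as a sketch):

cd /omd/sites/default/var/naemon
grep -c '^hostdowntime' retention.dat        # number of host downtime blocks
grep -c '^hostcomment' retention.dat         # number of host comment blocks
grep '^downtime_id=' retention.dat | sort | uniq -d | head   # downtime ids that occur more than once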


sni commented Mar 29, 2023

There is something going wrong... I just updated the patch, since it wasn't the latest version of that patch anyway. You could try tomorrow's daily.
Btw, this is the PR in question:
naemon/naemon-core#420
