Check did not exit properly / Failed to register iobroker #433

Closed
19Alex opened this issue Sep 19, 2017 · 10 comments

@19Alex

19Alex commented Sep 19, 2017

I have been struggling with a very strange problem on the support forum. dwhitfield and bheden asked me to report it here on GitHub. The entire story can be read on the support forum: https://support.nagios.com/forum/viewtopic.php?f=7&t=45389

In essence, the worker process (wproc) is not catching the check output. At around the same time, Nagios fails to register iobrokers for stdout and stderr:
```
[1504231962] wproc: Core Worker 68003: Failed to register iobroker for stdout
[1504231962] wproc: Core Worker 68003: Failed to register iobroker for stderr
[1504231962] Warning: Check of host 'VM-MIJNHELICON2' did not exit properly!
[1504231962] HOST ALERT: VM-MIJNHELICON2;UNREACHABLE;SOFT;1;(Host check did not exit properly)
```

The problem only occurred on openSUSE 42.3 x64, which I installed twice because I doubted myself. Only after installing Nagios on CentOS, where it ran problem-free, did I begin to suspect the OS. I later installed it on SLES 12.3 x64, also without a glitch.

From my point of view, the move to SLES solved my problem. But I can imagine you might want to investigate the openSUSE box a little further; I still have it running.

@box293
Contributor

box293 commented Sep 19, 2017

I'll start by gathering some information.

How did you install it? What guide did you follow? Was it this one?
https://support.nagios.com/kb/article.php?id=96

In the forum thread you mentioned "container". Can you elaborate on this, please?

Comparing your working SLES 12.3 system against the openSUSE 42.3 system, what is the output of these commands on each?

ipcs -a
cat /etc/sysctl.conf
sysctl -a
cat /etc/security/limits.conf
ulimit -a
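
For anyone repeating this comparison on two hosts, the commands above can be captured into one file per machine and then diffed. This is only a minimal sketch: the `collect_diagnostics` helper and the `diag-<hostname>.txt` file name are made up for illustration and are not part of Nagios or of the request above.

```shell
#!/bin/sh
# Collect the requested diagnostics into a single file.
# ASSUMPTION: helper name and output file name are arbitrary choices for this sketch.
collect_diagnostics() {
    out="$1"
    {
        for cmd in 'ipcs -a' 'cat /etc/sysctl.conf' 'sysctl -a' \
                   'cat /etc/security/limits.conf' 'ulimit -a'; do
            echo "===== $cmd ====="
            # Keep going if a command fails or a file is missing.
            sh -c "$cmd" 2>&1 || true
            echo
        done
    } > "$out"
}

# One file per host, named after the node name reported by uname.
collect_diagnostics "diag-$(uname -n).txt"
```

Running it on both systems and comparing the two files (for example with `diff`) should make differing limits easier to spot than reading each output by eye.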

@19Alex
Author

19Alex commented Sep 19, 2017

Yes, I used https://support.nagios.com/kb/article.php?id=96#SUSE
The requested outputs are in the attachment.
github.txt

@douglasawh

@19Alex neither of your links works (the file does, though). I'm not sure if this is an encoding issue on your end or on GitHub's. Copying and pasting both links works fine, so maybe there's nothing you need to do and this is just an FYI for others on the thread.

@19Alex
Author

19Alex commented Sep 19, 2017

@douglasawh thank you for your feedback, I didn't test them. But I have fixed them now.

@hedenface
Contributor

Uh, well, heck. I wish I had asked for your limits.conf when this was in the support forum. Increase those values a bit. I think starting at 10000 for both hard and soft, for each of `*` and `root`, should do it.
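
As a rough illustration of what "hard/soft for each * and root" amounts to, the sketch below prints four lines in the `<domain> <type> <item> <value>` layout that limits.conf uses. The thread does not say which item is being limited; `nproc` is an assumption here (suggested only by the fork-bomb comment mentioned later in the thread), and `emit_limits` is a hypothetical helper. It prints to stdout and does not touch `/etc/security/limits.conf`.

```shell
#!/bin/sh
# Print limits.conf-style lines for one item and value.
# ASSUMPTION: "nproc" is the limited item; the thread does not name it.
emit_limits() {
    item="$1"
    value="$2"
    # '*' is quoted so the shell does not expand it to file names.
    for domain in '*' root; do
        for type in soft hard; do
            printf '%-8s %-6s %-8s %s\n' "$domain" "$type" "$item" "$value"
        done
    done
}

emit_limits nproc 10000
```

Limits set this way are normally applied at login via PAM, so a service or reboot cycle is usually needed before a running daemon sees them.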

@19Alex
Author

19Alex commented Sep 19, 2017

That's very interesting; the limits.conf on the new SLES box only has comment lines in it. Would that mean it has no limits?

I have changed the values as asked on the openSUSE-box and restarted it. I'll let you know in the morning how it performed through the night.

hedenface self-assigned this Sep 19, 2017

@19Alex
Author

19Alex commented Sep 20, 2017

This is all very interesting. Before I went home yesterday I saw 560 improperly exited checks in a period of 2 hours, so I immediately increased the values to 1000000. This morning I see only 52 improperly exited checks since midnight. I am now going to remove the limits altogether and see what that does.
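
Counts like these can be taken straight from the Nagios log. A minimal sketch follows, assuming the log lives at `/usr/local/nagios/var/nagios.log` (the usual location for a source install, but an assumption for this system) and using a made-up `count_bad_exits` helper. It matches only the `Warning: Check of ...` line, so the accompanying `HOST ALERT` line is not counted twice.

```shell
#!/bin/sh
# Count checks that "did not exit properly" in a Nagios log file.
# ASSUMPTION: default log path of a source install; override with NAGIOS_LOG.
LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"

count_bad_exits() {
    # grep -c prints the number of matching lines; it exits non-zero on
    # zero matches, so "|| true" keeps the function usable under "set -e".
    grep -c 'Warning: Check of .* did not exit properly' "$1" || true
}

if [ -r "$LOG" ]; then
    count_bad_exits "$LOG"
else
    echo "log not readable: $LOG" >&2
fi
```

Since every log line starts with an epoch timestamp in brackets, the input can be narrowed to a time window before counting if a per-period figure is wanted.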

@box293
Contributor

box293 commented Sep 21, 2017

> That's very interesting; the limits.conf on the new SLES box only has comment lines in it. Would that mean it has no limits?

Yes.

I checked openSUSE Leap 42.2 and 42.3 and they seem to have the same limits implemented. However, 42.1 does not, so I assume, based on the comment `# harden against fork-bombs`, that they were added by the openSUSE build team.

Good news is that it sounds like your issue is resolved. I've created the following KB article based on the information you gave us; this will help others in the future.
https://support.nagios.com/kb/article/nagios-core-failed-to-register-iobroker.html

@hedenface do you think it's possible for Nagios Core to detect that such limits are causing this issue and add some more detailed logging?

@19Alex
Author

19Alex commented Sep 21, 2017

This is not good. I removed the limits, so the file now only has comment lines, but I still see a few dozen improperly exited checks per day. It's far fewer than before, however.

@hedenface
Contributor

@box293 I think that sounds like a great idea for an addition to Core 4.4. (#434)

@19Alex Yes, but it's pretty obviously some kind of limit at this point. What do your cgroups look like? I'm not an openSUSE guy, but maybe try the following:

cd /sys/fs/cgroup/cpuset/cgroup
mkdir priority
cd priority
cat cpu.shares
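
A complementary, read-only check is to ask the kernel which limits and cgroups actually apply to the running process, since an edited limits.conf does not necessarily reach an already-running daemon. This is only a sketch under assumptions: it needs Linux with `/proc` mounted, the process name `nagios` is a guess for this install, and `show_proc_limits` is a hypothetical helper.

```shell
#!/bin/sh
# Show the effective limits and cgroup membership of a running process.
# ASSUMPTION: Linux with /proc mounted; "nagios" as the process name is a guess.
show_proc_limits() {
    target="$1"
    echo "== /proc/$target/limits =="
    cat "/proc/$target/limits"
    echo "== /proc/$target/cgroup =="
    cat "/proc/$target/cgroup"
}

# Oldest process named exactly "nagios"; fall back to this shell if none is found.
target_pid=$(pgrep -o -x nagios 2>/dev/null) || target_pid=$$
show_proc_limits "$target_pid"
```

The "Max processes" and "Max open files" rows in the limits output are the ones to compare between the openSUSE and SLES systems.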
