[question] using OMD with nagflux grafana influxdb #3
Comments
Hi,
When I disable nagflux or InfluxDB, everything is OK. I think that is because no actions are taken with Livestatus when one of them (nagflux or InfluxDB) is down. I am trying to find out how to debug this further, how to approach
Nagflux stops automatically when you stop InfluxDB. The CPU usage is strange; we haven't noticed such behaviour so far. I didn't test it with Naemon at all, but if you tried Nagios too, then it's probably not a problem with the core. Like I said, the only interaction of nagflux is:
Is your installation very large? But that's just a shot in the dark.
Hello, we also see this problem. After some debugging with strace, we found the reason. In an instance without any hosts, the CPU usage of Nagios is about 0.6%. In another instance without hosts, which had been in use before, the CPU usage is about 30%. The strace output shows that the Nagios archive is loaded at regular intervals. After deleting the Nagios history, the CPU usage stays at the normal level. @Griesbacher: Do you have any idea what is causing this behaviour? Jonathan
Hi,
Hello, the load is completely on Nagios. A test system is easy to set up if you have another Nagios/OMD instance running. I just created a new site and copied the old log files from "/opt/omd/sites/oldsite/var/nagios/archive/" to the new site ("/opt/omd/sites/newsite/var/nagios/archive/"). Do not forget to set the file ownership to the site user. Then start the instance; after a few minutes with nagflux enabled, you see the Nagios core taking the CPU. The strace output then shows the logs being read from the archive directory. After the logs have been read, the CPU load slowly goes down until the logs are read again after 2 minutes. With the older version b86c8d6, everything seems to be fine. If you provide a patch for the current version, I will test it here in my environment. Edit: I just saw that in this version the timewait is already disabled. I will try to find the time to compile and test some other versions. Jonathan
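The reproduction steps above (copy the old archive into a fresh site, then hand the files to the site user) could be sketched roughly as follows. This is only an illustration: the function name `copy_archive` and its parameters are hypothetical, the paths follow the ones quoted in the comment, and it assumes the site user and group are named after the site.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def copy_archive(old_site: str, new_site: str,
                 omd_root: str = "/opt/omd/sites",
                 owner: Optional[str] = None) -> List[Path]:
    """Copy archived Nagios logs from one OMD site to another.

    `owner` is the user/group that should own the copies (assumption:
    usually the name of the new site). Changing ownership needs root,
    so it is skipped when `owner` is None.
    """
    src = Path(omd_root) / old_site / "var" / "nagios" / "archive"
    dst = Path(omd_root) / new_site / "var" / "nagios" / "archive"
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for log in sorted(src.iterdir()):
        if not log.is_file():
            continue
        target = dst / log.name
        shutil.copy2(log, target)  # keeps timestamps of the old logs
        if owner is not None:
            shutil.chown(target, user=owner, group=owner)
        copied.append(target)
    return copied
```

Run as root, something like `copy_archive("oldsite", "newsite", owner="newsite")` would mirror the manual steps; the site still has to be started afterwards and observed for a few minutes with nagflux enabled.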
We measured the CPU usage of Nagios on a pretty big setup, and it did not go up within one hour. The core was Nagios (OMD).
Hello, the problem is indeed the negation in the query. When using the query above, I see the problem with the CPU load. I have come to two solutions:
Measurements from the system:
real 0m30.946s
real 0m0.003s
I will compile the newest version of check_mk/nagflux without using negation. @Griesbacher: What exactly was the reason to negate the query? Is the query above the one you intended?
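To compare two variants of a Livestatus query in the way the timings above do, a small helper can send a query over the Livestatus UNIX socket and measure how long the answer takes. This is a minimal sketch, not what nagflux itself does: the function name `run_livestatus_query` is hypothetical, and the actual queries from this thread are not reproduced here, so the caller has to supply them.

```python
import socket
import time
from typing import Tuple


def run_livestatus_query(socket_path: str, query: str,
                         timeout: float = 600.0) -> Tuple[str, float]:
    """Send one Livestatus query and return (response text, elapsed seconds)."""
    if not query.endswith("\n"):
        query += "\n"

    start = time.perf_counter()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        # Closing the write side tells Livestatus that the request is complete.
        sock.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # server closed the connection: response is complete
                break
            chunks.append(data)
    elapsed = time.perf_counter() - start
    return b"".join(chunks).decode("utf-8", errors="replace"), elapsed
```

As the site user, a call such as `run_livestatus_query("tmp/run/live", "GET status\nColumns: program_version")` should return quickly (the socket path is an assumption about a default OMD site layout). Running the negated and the non-negated log query through the same helper gives numbers comparable to the `real` times above.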
@jwesterholt Thanks for testing! The query is written that way because of a bug in Icinga2 Livestatus, https://dev.icinga.org/issues/10179, which is still not fixed. The negation was a workaround I found, but I didn't think it would have that much impact on Nagios (and we never faced this kind of problem). In my opinion, it has to be determined which version of Livestatus the system is using in order to avoid such problems.
@jwesterholt I added a small workaround to Nagflux. You could try it out by building it from source; if it works, I'll add it to OMD later.
This seems to be working. With the old build, the load goes up after two minutes. The new version is working fine. Adding this version to Consol-OMD would be great, as my script compiles it from there.
Hello,
I am using OMD with nagflux, Grafana and InfluxDB, and I am experiencing a very strange problem: any operation involving the log files takes a very long time to respond. For example, in Thruk, clicking Notifications, Alerts, Availability etc. takes about 400 seconds to list the notifications. When I restart naemon/nagios/icinga (it does not matter which core I use), it is back to normal for 5 minutes, and then it again takes 400 seconds or so to list Notifications or Alerts. After I disable either nagflux or InfluxDB, the response time is back to normal. How can this be related? What does nagflux do that interferes with log processing? Any thoughts would be helpful. Thanks