[question] using OMD with nagflux grafana influxdb #3

Closed
ghost opened this issue Mar 24, 2016 · 11 comments

@ghost

ghost commented Mar 24, 2016

Hello,

I am using OMD with nagflux, grafana and influxdb and I'm experiencing a very strange problem: any operation involving log files takes a very long time to respond (i.e. in Thruk, clicking Notifications, Alerts, Availability etc. takes around 400 seconds to list Notifications). When I restart naemon/nagios/icinga (it does not matter which core I use) it is back to normal for 5 minutes, and then it again takes 400 seconds or so to list Notifications or Alerts. After I disable either nagflux or influxdb, response times are back to normal. How can this be related? What does nagflux do to interfere with log processing? Any thoughts would be helpful. Thanks

@Griesbacher
Owner

Hi,
the only connection between Nagflux and the core is Livestatus; the only real files it touches are the spoolfiles. I'm not absolutely sure, but I think Thruk does not read the logfiles directly either, it also uses Livestatus.
Does it go back to normal when you just start the Influxdb?

@ghost
Author

ghost commented Mar 24, 2016

When I disable nagflux OR influxdb it is OK. I think that is because no Livestatus actions are taken when one of them (nagflux or influxdb) is down. I am trying to find out how to debug this further and how to approach it.
P.S. I have noticed that when using influxdb+nagflux, the core's (nagios/naemon) CPU usage is constantly at 100%.

@Griesbacher
Owner

Nagflux stops automatically when you stop the influxdb.

The CPU usage is strange; we haven't noticed such behaviour so far. I didn't test it with naemon at all, but if you tried nagios too, then it is probably not a problem specific to the core.

Like I said, the only interactions of Nagflux are:

  • Livestatus
  • Filehandles (spoolfiles: ~/var/pnp4nagios/spool/)
  • HTTP (InfluxDB)

Is your installation very large? But that's just a shot in the dark.

@jwesterholt

Hello,

we also see this problem. After some debugging with strace we found the reason.

In an instance without any hosts, the CPU usage of nagios is about 0.6%. In another instance, also without hosts but which had been used before, the CPU usage is about 30%.

The strace shows that the nagios archive is loaded at regular intervals. After deleting the nagios history, the CPU usage stays at the normal level.

@Griesbacher: Do you have any idea what is causing this behaviour?

Jonathan

@Griesbacher
Owner

Hi,
@jwesterholt So you mean the load is only on nagios and not on nagflux, right? I have never looked deeper into Nagios so far, so I have no real clue. Like I said, the only connection between Nagflux and Nagios is the spoolfile folder and Livestatus.
Regarding the spoolfile folder: maybe the nagios write operation did not finish. In an older version of Nagflux I waited a few seconds to avoid such errors, but we didn't encounter any problems after removing this wait; here is an old version of it: b86c8d6. If you can build Nagflux from source, you could try to build this delay back in, maybe it helps. But that's also a shot in the dark; like I said, we have never encountered such problems, so I have no setup to reproduce this behaviour.
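For what it's worth, a minimal Go sketch of such a delay, assuming the spoolfiles are read one file at a time; the function name, the constant and the path are made up for this illustration and are not Nagflux's actual API:

```go
package main

// Illustrative only: wait a moment before reading a freshly discovered
// spoolfile, so the core can finish writing it (roughly the behaviour of
// the older Nagflux version b86c8d6). Names and the path are made up.

import (
	"fmt"
	"os"
	"time"
)

const spoolfileSettleTime = 5 * time.Second // assumed value, tune as needed

// readSpoolfile sleeps briefly and then reads the whole spoolfile.
func readSpoolfile(path string) ([]byte, error) {
	time.Sleep(spoolfileSettleTime)
	return os.ReadFile(path)
}

func main() {
	// Example path only; real spoolfiles sit under ~/var/pnp4nagios/spool/.
	data, err := readSpoolfile("/omd/sites/mysite/var/pnp4nagios/spool/example")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```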
Regarding Livestatus I have no clue...
Philip

@jwesterholt

jwesterholt commented May 2, 2016

Hello,

the load is completely on nagios.

A test system is easy to set up if you have another nagios/OMD instance running. I just created a new site and copied the old log files from "/opt/omd/sites/oldsite/var/nagios/archive/" to the new site ("/opt/omd/sites/newsite/var/nagios/archive/"). Do not forget to set the permissions to the site user. After that, start the instance; after a few minutes with nagflux enabled you will see the nagios core taking the CPU.

The strace then shows the logs being read from the archive directory. After reading the logs, the CPU load slowly goes down, until the logs are read again two minutes later.

When using the older version b86c8d6 everything seems to be fine.

If you provide a patch for the current version, I will test it here in my environment.

Edit: I just saw that in this version the timewait is already disabled. I will try to find the time to compile and test some other versions.

Jonathan

@Griesbacher
Owner

We measured the CPU usage of nagios on a pretty big setup and it did not go up within one hour. The core was nagios (OMD).
@jwesterholt Given your hint about the archive and the two minutes, I think the problem is not the spoolfiles; maybe it's the Livestatus query. The reason: Nagflux queries Livestatus every two minutes, and that query (I'm not totally sure) uses the archive to search for log entries. I'm currently using this query (the negation is a workaround for an Icinga2 bug; I'll fix it in time):
`GET log
Columns: type time contact_name message
Filter: type ~ .*NOTIFICATION
Filter: time < %d
Negate:
OutputFormat: csv

`
%d is the unix timestamp of two minutes ago. You could try to execute this query against Livestatus and see if your nagios CPU usage also goes up...
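For anyone who wants to try this by hand, here is a minimal Go sketch of sending that query to the Livestatus socket, roughly what Nagflux does; the socket path is only an example and has to be adjusted to your site:

```go
package main

// Minimal sketch: send the notification log query to the Livestatus socket,
// similar to what Nagflux does every two minutes. The socket path is an example.

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

func main() {
	socket := "/omd/sites/mysite/tmp/run/live" // adjust to your site

	// The query from above, with %d filled in with "now minus two minutes".
	query := fmt.Sprintf(
		"GET log\nColumns: type time contact_name message\nFilter: type ~ .*NOTIFICATION\nFilter: time < %d\nNegate:\nOutputFormat: csv\n\n",
		time.Now().Add(-2*time.Minute).Unix())

	conn, err := net.Dial("unix", socket)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer conn.Close()

	uc := conn.(*net.UnixConn)
	if _, err := uc.Write([]byte(query)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	uc.CloseWrite() // signal end of request, like unixcat does

	// Livestatus closes the connection after answering, so read until EOF.
	if _, err := io.Copy(os.Stdout, uc); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

unixcat from the Livestatus package does the same thing from the shell, as shown in the measurements further down.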

@jwesterholt

Hello,

the problem is indeed the negation in the query. When using the query above, I see the problem with the CPU load.

I have come to two solutions:

  1. Increase max_cached_messages for Livestatus (see https://mathias-kettner.de/checkmk_livestatus.html); a sketch of the corresponding broker_module line follows after this list. This helps because we have a large environment and the test leaves us with around 10,000,000 messages in the logfile since October 2015. I tested with max_cached_messages set to 20,000,000 and did not see the problem either. The only downside is the memory usage (about 250 bytes per cached message).
  2. Use the following query:
    `GET log
    Columns: type time contact_name message
    Filter: type ~ .*NOTIFICATION
    Filter: time > %d
    OutputFormat: csv

`
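As referenced in item 1, a hedged example of where max_cached_messages goes: it is appended to the Livestatus broker_module line in the core's configuration. The module and socket paths below are only illustrative for a generic mk-livestatus install; in an OMD site the corresponding line sits in the site's Livestatus/nagios config.

```
# example broker_module line with an enlarged log message cache
# (module and socket paths are illustrative; adjust to your installation)
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /var/lib/nagios/rw/live max_cached_messages=20000000
```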

Measurements from the system:
OMD[test_nagflux]:~$ time unixcat < test.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m30.946s
user 0m0.000s
sys 0m0.000s
OMD[test_nagflux]:~$ time unixcat < test2.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m0.003s
user 0m0.000s
sys 0m0.000s
OMD[test_nagflux]:$ diff -u test.lql test2.lql
--- test.lql 2016-05-09 13:54:49.704681923 +0200
+++ test2.lql 2016-05-08 09:59:04.820232742 +0200
@@ -1,7 +1,6 @@
GET log
Columns: type time contact_name message
Filter: type ~ .*NOTIFICATION
-Filter: time < 1462457555
-Negate:
+Filter: time > 1462520867
OutputFormat: csv
OMD[test_nagflux]:~$ wc -l var/nagios/archive/*|tail -1
10549067 total

I will compile the newest version of check_mk/nagflux without using negation.

@Griesbacher: What exactly was the reason to negate the query? Is the query above the one you intended?

@Griesbacher
Owner

@jwesterholt Thanks for testing! The query is that way because of a bug in the Icinga2 Livestatus: https://dev.icinga.org/issues/10179, which is still not fixed... This was the workaround I found, but I didn't think it would have that much impact on Nagios (and we never faced this kind of problem).

In my opinion, it has to be determined which version of Livestatus the system is using, to avoid such problems...

@Griesbacher
Owner

@jwesterholt I added a little workaround in Nagflux; you could try it out by building it from source. If it works, I'll add it to OMD later.
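To make the idea concrete, a rough Go sketch of what such a core-dependent query selection could look like; this is not the actual patch, and the helper name and flag are invented for illustration:

```go
package main

import "fmt"

// buildLogQuery is an invented helper that returns the notification log
// query for everything since the unix timestamp 'since'. The negated
// "time <" form is kept only for Icinga2 (Livestatus bug #10179); every
// other core gets the plain "time >" filter, which avoided the expensive
// archive scan in the tests above.
func buildLogQuery(isIcinga2 bool, since int64) string {
	if isIcinga2 {
		return fmt.Sprintf("GET log\nColumns: type time contact_name message\n"+
			"Filter: type ~ .*NOTIFICATION\nFilter: time < %d\nNegate:\nOutputFormat: csv\n\n", since)
	}
	return fmt.Sprintf("GET log\nColumns: type time contact_name message\n"+
		"Filter: type ~ .*NOTIFICATION\nFilter: time > %d\nOutputFormat: csv\n\n", since)
}

func main() {
	// e.g. a Nagios/Naemon core: use the non-negated filter
	fmt.Print(buildLogQuery(false, 1462520867))
}
```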

@jwesterholt

This seems to be working. With the old build, the load goes up after two minutes; the new version is working fine.

Adding this version to Consol-OMD would be great, as my script compiles it from there.
