[question] using OMD with nagflux grafana influxdb #3

Closed
ghost opened this issue Mar 24, 2016 · 11 comments

@ghost

ghost commented Mar 24, 2016

Hello,

I am using OMD with nagflux, grafana and influxdb and I'm experiencing a very strange problem: any operation involving log files takes a very long time to respond (i.e. in Thruk, clicking Notifications, Alerts, Availability etc. takes around 400 seconds to list Notifications). When I restart naemon/nagios/icinga (it does not matter which core I use) it is back to normal for 5 minutes, and then it again takes 400 seconds or so to list Notifications or Alerts. After I disable either nagflux or influxdb, response times are back to normal. How can this be related? What does nagflux do to interfere with log processing? Any thoughts would be helpful. Thanks

@Griesbacher
Owner

Hi,
the only connection between Nagflux and the core is Livestatus; the only real files it touches are the spoolfiles. I'm not absolutely sure, but I think Thruk does not read the logfiles directly either, it also uses Livestatus.
Does it go back to normal when you just start the Influxdb?

@ghost
Author

ghost commented Mar 24, 2016

When I disable nagflux OR influxdb it is OK. I think that is because no Livestatus actions are taken when one of them (nagflux or influxdb) is down. I am trying to find out how to debug this further and how to approach it.
P.S. I have noticed that when using influxdb+nagflux, the core's (nagios/naemon) CPU usage is constantly at 100%.

@Griesbacher
Owner

Nagflux stops automatically when you stop the influxdb.

The CPU usage is strange; we haven't noticed such behaviour so far. I didn't test it with naemon at all, but if you tried nagios too, then it is probably not a problem specific to the core.

Like I said, the only interactions of Nagflux are:

  • Livestatus
  • Filehandles (spoolfiles: ~/var/pnp4nagios/spool/)
  • HTTP (InfluxDB)

Is your installation very large? But that's just a shot in the dark.

@jwesterholt

Hello,

we also see this problem. After some debugging with strace we found the reason.

In an instance without any hosts, the CPU usage of nagios is about 0.6%. In another instance, also without hosts but which had been used before, the CPU usage is about 30%.

The strace shows that the nagios archive is loaded at regular intervals. After deleting the nagios history, the CPU usage stays at the normal level.

@Griesbacher: Do you have any idea what is causing this behaviour?

Jonathan

@Griesbacher
Owner

Hi,
@jwesterholt So you mean the load is only on nagios and not on nagflux, right? I have never looked deeper into Nagios so far, so I have no real clue. Like I said, the only connection between Nagflux and Nagios is the spoolfile folder and Livestatus.
Regarding the spoolfile folder: maybe the nagios write operation did not finish. In an older version of Nagflux I waited a few seconds to avoid such errors, but we didn't encounter any problems after removing this wait; here is an old version of it: b86c8d6. If you can build Nagflux from source, you could try to build this delay back in, maybe it helps. But that's also a shot in the dark; like I said, we have never encountered such problems, so I have no setup to reproduce this behaviour.
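For what it's worth, a minimal Go sketch of such a delay, assuming the spoolfiles are read one file at a time; the function name, the constant and the path are made up for this illustration and are not Nagflux's actual API:

```go
package main

// Illustrative only: wait a moment before reading a freshly discovered
// spoolfile, so the core can finish writing it (roughly the behaviour of
// the older Nagflux version b86c8d6). Names and the path are made up.

import (
	"fmt"
	"os"
	"time"
)

const spoolfileSettleTime = 5 * time.Second // assumed value, tune as needed

// readSpoolfile sleeps briefly and then reads the whole spoolfile.
func readSpoolfile(path string) ([]byte, error) {
	time.Sleep(spoolfileSettleTime)
	return os.ReadFile(path)
}

func main() {
	// Example path only; real spoolfiles sit under ~/var/pnp4nagios/spool/.
	data, err := readSpoolfile("/omd/sites/mysite/var/pnp4nagios/spool/example")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```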
Regarding Livestatus I have no clue...
Philip

@jwesterholt

jwesterholt commented May 2, 2016

Hello,

the load is completely on nagios.

A test system is easy to set up if you have another nagios/OMD instance running. I just created a new site and copied the old log files from "/opt/omd/sites/oldsite/var/nagios/archive/" to the new site ("/opt/omd/sites/newsite/var/nagios/archive/"). Do not forget to set the permissions to the site user. After that, start the instance; after a few minutes with nagflux enabled you will see the nagios core taking the CPU.

The strace then shows the logs being read from the archive directory. After reading the logs, the CPU load slowly goes down, until the logs are read again two minutes later.

When using the older version b86c8d6 everything seems to be fine.

If you provide a patch for the current version, I will test it here in my environment.

Edit: I just saw that in this version the timewait is already disabled. I will try to find the time to compile and test some other versions.

Jonathan

@Griesbacher
Owner

We measured the CPU usage of nagios on a pretty big setup and it did not go up within one hour. The core was nagios (OMD).
@jwesterholt Given your hint about the archive and the two minutes, I think the problem is not the spoolfiles; maybe it's the Livestatus query. The reason: Nagflux queries Livestatus every two minutes, and that query (I'm not totally sure) uses the archive to search for log entries. I'm currently using this query (the negation is a workaround for an Icinga2 bug; I'll fix it in time):
`GET log
Columns: type time contact_name message
Filter: type ~ .*NOTIFICATION
Filter: time < %d
Negate:
OutputFormat: csv

`
%d is the unix timestamp of two minutes ago. You could try to execute this query against Livestatus and see if your nagios CPU usage also goes up...
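For anyone who wants to try this by hand, here is a minimal Go sketch of sending that query to the Livestatus socket, roughly what Nagflux does; the socket path is only an example and has to be adjusted to your site:

```go
package main

// Minimal sketch: send the notification log query to the Livestatus socket,
// similar to what Nagflux does every two minutes. The socket path is an example.

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

func main() {
	socket := "/omd/sites/mysite/tmp/run/live" // adjust to your site

	// The query from above, with %d filled in with "now minus two minutes".
	query := fmt.Sprintf(
		"GET log\nColumns: type time contact_name message\nFilter: type ~ .*NOTIFICATION\nFilter: time < %d\nNegate:\nOutputFormat: csv\n\n",
		time.Now().Add(-2*time.Minute).Unix())

	conn, err := net.Dial("unix", socket)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer conn.Close()

	uc := conn.(*net.UnixConn)
	if _, err := uc.Write([]byte(query)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	uc.CloseWrite() // signal end of request, like unixcat does

	// Livestatus closes the connection after answering, so read until EOF.
	if _, err := io.Copy(os.Stdout, uc); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

unixcat from the Livestatus package does the same thing from the shell, as shown in the measurements further down.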

@jwesterholt

Hello,

the problem is indeed the negation in the query. When using the query above, I see the problem with the CPU load.

I have come to two solutions:

  1. Increase max_cached_messages for Livestatus (see https://mathias-kettner.de/checkmk_livestatus.html); a sketch of the corresponding broker_module line follows after this list. This helps because we have a large environment and the test leaves us with around 10,000,000 messages in the logfile since October 2015. I tested with max_cached_messages set to 20,000,000 and did not see the problem either. The only downside is the memory usage (about 250 bytes per cached message).
  2. Use the following query:
    `GET log
    Columns: type time contact_name message
    Filter: type ~ .*NOTIFICATION
    Filter: time > %d
    OutputFormat: csv

`
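As referenced in item 1, a hedged example of where max_cached_messages goes: it is appended to the Livestatus broker_module line in the core's configuration. The module and socket paths below are only illustrative for a generic mk-livestatus install; in an OMD site the corresponding line sits in the site's Livestatus/nagios config.

```
# example broker_module line with an enlarged log message cache
# (module and socket paths are illustrative; adjust to your installation)
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /var/lib/nagios/rw/live max_cached_messages=20000000
```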

Measurements from the system:
OMD[test_nagflux]:~$ time unixcat < test.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m30.946s
user 0m0.000s
sys 0m0.000s
OMD[test_nagflux]:~$ time unixcat < test2.lql /opt/omd/sites/test_nagflux/tmp/run/live

real 0m0.003s
user 0m0.000s
sys 0m0.000s
OMD[test_nagflux]:$ diff -u test.lql test2.lql
--- test.lql 2016-05-09 13:54:49.704681923 +0200
+++ test2.lql 2016-05-08 09:59:04.820232742 +0200
@@ -1,7 +1,6 @@
GET log
Columns: type time contact_name message
Filter: type ~ .*NOTIFICATION
-Filter: time < 1462457555
-Negate:
+Filter: time > 1462520867
OutputFormat: csv
OMD[test_nagflux]:~$ wc -l var/nagios/archive/*|tail -1
10549067 total

I will compile the newest version of check_mk/nagflux without using negation.

@Griesbacher: What exactly was the reason to negate the query? Is the query above the one you intended?

@Griesbacher
Owner

@jwesterholt Thanks for testing! The query is that way because of a bug in the Icinga2 Livestatus: https://dev.icinga.org/issues/10179, which is still not fixed... This was the workaround I found, but I didn't think it would have that much impact on Nagios (and we never faced this kind of problem).

In my opinion, it has to be determined which version of Livestatus the system is using, to avoid such problems...

@Griesbacher
Owner

@jwesterholt I added a little workaround in Nagflux; you could try it out by building it from source. If it works, I'll add it to OMD later.
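To make the idea concrete, a rough Go sketch of what such a core-dependent query selection could look like; this is not the actual patch, and the helper name and flag are invented for illustration:

```go
package main

import "fmt"

// buildLogQuery is an invented helper that returns the notification log
// query for everything since the unix timestamp 'since'. The negated
// "time <" form is kept only for Icinga2 (Livestatus bug #10179); every
// other core gets the plain "time >" filter, which avoided the expensive
// archive scan in the tests above.
func buildLogQuery(isIcinga2 bool, since int64) string {
	if isIcinga2 {
		return fmt.Sprintf("GET log\nColumns: type time contact_name message\n"+
			"Filter: type ~ .*NOTIFICATION\nFilter: time < %d\nNegate:\nOutputFormat: csv\n\n", since)
	}
	return fmt.Sprintf("GET log\nColumns: type time contact_name message\n"+
		"Filter: type ~ .*NOTIFICATION\nFilter: time > %d\nOutputFormat: csv\n\n", since)
}

func main() {
	// e.g. a Nagios/Naemon core: use the non-negated filter
	fmt.Print(buildLogQuery(false, 1462520867))
}
```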

@jwesterholt

This seems to be working. With the old build, the load goes up after two minutes; the new version is working fine.

Adding this version to Consol-OMD would be great, as my script compiles it from there.
