Load explodes after every reload/restart of Icinga 2 #5465

Open
pgress opened this issue Aug 7, 2017 · 49 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working ref/NC TBD To be defined - We aren't certain about this yet
Projects

Comments

@pgress

pgress commented Aug 7, 2017

We run a two-master cluster that gets very high load whenever the core is reloaded.
The two nodes have 8 vCPUs and 16 GB RAM, so compute power shouldn't be a problem at all.
We currently have about 700 hosts with about 7000 services.

We have already debugged this problem a bit and found that no checks are executed until 5 to 6 minutes have passed. After that, all checks start together, which results in the high load. When we use a single node instead of a cluster, we don't have the problem: the checks start immediately after the reload.

  • Version used (icinga2 --version): r2.7.0-1
  • Operating System and version: Debian GNU/Linux 8.9 (jessie)
  • Enabled features (icinga2 feature list): api checker command graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    cube | 1.0.0
    graphite | 0.0.0.5
    monitoring | 2.4.1
  • Config validation (icinga2 daemon -C): No problems
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
host001:/root # icinga2 object list --type Endpoint
Object 'host001.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 1:0-1:47
  * __name = "host001.localdomain"
  * host = "10.0.0.133"
    % = modified in '/etc/icinga2/zones.conf', lines 2:3-2:22
  * log_duration = 86400
  * name = "host001.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 47
    * last_line = 1
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "host001.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 1:0-1:47
  * type = "Endpoint"
  * zone = ""

Object 'host002.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 5:1-5:48
  * __name = "host002.localdomain"
  * host = "10.0.0.134"
    % = modified in '/etc/icinga2/zones.conf', lines 6:3-6:22
  * log_duration = 86400
  * name = "host002.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 5
    * last_column = 48
    * last_line = 5
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "host002.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 5:1-5:48
  * type = "Endpoint"
  * zone = ""
host001:/root # icinga2 object list --type Zone
Object 'global_zone' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 13:1-13:25
  * __name = "global_zone"
  * endpoints = null
  * global = true
    % = modified in '/etc/icinga2/zones.conf', lines 14:3-14:15
  * name = "global_zone"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 13
    * last_column = 25
    * last_line = 13
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "global_zone" ]
    % = modified in '/etc/icinga2/zones.conf', lines 13:1-13:25
  * type = "Zone"
  * zone = ""

Object 'master' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 9:1-9:20
  * __name = "master"
  * endpoints = [ "host001.localdomain", "host002.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 10:3-10:84
  * global = false
  * name = "master"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 9
    * last_column = 20
    * last_line = 9
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "master" ]
    % = modified in '/etc/icinga2/zones.conf', lines 9:1-9:20
  * type = "Zone"
  * zone = ""
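
For readability, this output corresponds roughly to the following zones.conf (reconstructed from the object listing above; exact layout is assumed):

object Endpoint "host001.localdomain" {
  host = "10.0.0.133"
}

object Endpoint "host002.localdomain" {
  host = "10.0.0.134"
}

object Zone "master" {
  endpoints = [ "host001.localdomain", "host002.localdomain" ]
}

object Zone "global_zone" {
  global = true
}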
@dnsmichi
Contributor

dnsmichi commented Aug 8, 2017

Do you have a specific performance analysis including graphs from work queues and enabled features, checks, etc ? It is hard to tell what exactly could cause this without more insights.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#analyze-your-environment

@dnsmichi dnsmichi added area/checks Check execution and results needs feedback We'll only proceed once we hear from you again labels Aug 8, 2017
@pgress
Author

pgress commented Aug 8, 2017

Here is a snapshot of some of our graphs: https://snapshot.raintank.io/dashboard/snapshot/XX5gtmp2yf4nnIXJ1oIpWt1FA263Um2S?orgId=2
It shows the relationship between load and uptime. Additionally, all perfdata from the icinga check is in the third panel.
When I watched it live, I could see that all CPU cores were maxed out. Used memory was growing but did not fill up, and there was only minor disk I/O.
I've uploaded a part of the icinga2 log where you can see that nothing happened for several minutes:
icinga2-log.txt

@dnsmichi
Contributor

dnsmichi commented Aug 9, 2017

Thanks for the graphs. The last one uses the icinga check, which provides additional metrics about work queues in 2.7 - do you happen to have some stats/graphs on that too?

Logs look fine, nothing spectacular. This change in CPU load could come from the recent work queue additions for all features, e.g. Graphite.

(for future reference - I modified the URL with render/ and added the screenshot here)

[screenshot: rendered Grafana snapshot]

@dnsmichi
Contributor

dnsmichi commented Jan 8, 2018

Does this happen with 2.8 again? There were certain improvements for the cluster in this release.

Cheers,
Michael

@uffsalot

uffsalot commented Jan 17, 2018

In my setup yes.

Single Master:

  • CPU: Intel Xeon E3-1230 v5
  • RAM: 32GB DDR4 ECC
  • Storage: 2x 300GB SAS (Root FS) + 1 250GB SSD for Graphite

Our instance contains 512 hosts and 5368 services. 5 of them are Icinga 2 clients; the remaining hosts are checked via NRPE.

  • Version used (icinga2 --version): r2.8.0-1
  • Operating System and version: Debian GNU/Linux 8.10 (jessie)
  • Enabled features (icinga2 feature list): api checker command graphite ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
    businessprocess | 2.1.0
    director | 1.4.2
    fileshipper | 0.9.3
    graphite | 0.0.0.5
    monitoring | 2.5.0
    vsphere | 1.1.0
  • Config validation (icinga2 daemon -C): No Problems

[screenshot: Grafana, icinga2 load]

@Simkimdm

It happens in our setup, too. A year ago it was so bad that the master cluster could not catch up anymore, so we had to expand our cluster with some satellites.
Most of our checks are SNMP-based, like check_nwc_health.
We have 7959 hosts and 16156 services.
I tried to flatten the peaks by limiting concurrent_checks = 256, but this had no effect on version 2.8.0.
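
For reference, a minimal sketch of how that limit was set (assuming the global MaxConcurrentChecks constant; older versions exposed a concurrent_checks attribute on the checker feature instead):

/* /etc/icinga2/constants.conf - cap the number of simultaneously running checks */
const MaxConcurrentChecks = 256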

@dnsmichi
Contributor

dnsmichi commented Apr 5, 2018

How many of these services actually invoke check_nwc_health? How's the average execution time and latency for these checks?

@Thomas-Gelf
Contributor

@dnsmichi: this is a real issue; the root cause is our scheduling code in Icinga 2. Have a look at how Icinga 1.x tried to avoid such problems. It was far from perfect, and the 1.x scheduling code is a mess - but its basics were well thought out.

This issue is a combination of checks being rescheduled at restart (that's a bug, it shouldn't happen) and a rescheduling/splaying logic where every checkable (and not a central scheduling logic) decides on its own when to run the next check.

There is a related issue in our Icinga Support issue tracker. You'll find it combined with a debug log, hints on how to filter the log, and a link pointing to the place in our code responsible for the "checks are not being executed for a long time" effect explained by @pgress. Eventually talk to @gunnarbeutner, he should be aware of this issue - we discussed it quite some time ago.

The "reschedule everything when starting" issue should be easy to fix. Most of the heavy spikes shown by @uffsalot would then no longer appear. As you can see he has an environment with not too many monitored objects and not the greatest and latest but absolutely sufficient hardware. Especially given that most of his checks are NRPE-based, he should never experience what his graphs are showing.

Cheers,
Thomas

NB: Sooner or later we should consider implementing scheduler logic that takes the number of active checks, their intervals and their average execution times into account. It should try to distribute new checks fairly while respecting and preserving the current schedule of existing ones.

@paladox

paladox commented Apr 26, 2018

We were also seeing the same thing when we tried to upgrade to Icinga 2 at Miraheze, using 1 core and 1 GB of RAM.

The CPU shot up after checks started running, causing OOM errors and high CPU usage.

[screenshot]

@dnsmichi
Contributor

dnsmichi commented May 9, 2018

There are some changes to this in current git master and the snapshot packages, which will go into 2.9. This is scheduled for June.

@paladox

paladox commented May 9, 2018

@dnsmichi oh, to reduce load?

Which changes? :)

@dnsmichi
Contributor

dnsmichi commented May 9, 2018

Changes to influence check scheduling upon reload. Snapshot packages should already be available for testing.

@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label May 9, 2018
@widhalmt
Member

I have the same problem in a customer's setup. I'll try to get some tests/feedback from them as well.

@paladox

paladox commented Jun 1, 2018

I'm guessing 1a9c159 is the fix.

@Crunsher
Contributor

Crunsher commented Jun 4, 2018

1a9c159 is not about this issue, although it might redistribute the load of early checks.

NB: Sooner or later we should consider implementing a scheduler logic taking the amount of active checks, their interval and their average execution time into account. It should try to fairly distribute new checks while respecting and preserving current schedule for existing ones.

Here we have the problem that we do not know much about the checks and can't make many guesses based on that information. A high execution time does not mean high load, a check might have a long check interval precisely because it is a heavy check, and the number of active checks tells us nothing concrete either. The only thing that comes to my mind is better randomizing the execution of checks with similar check intervals.

@dnsmichi
Contributor

dnsmichi commented Jul 3, 2018

I'd suggest testing the snapshot packages and reporting back whether the problem persists or not.

@dnsmichi dnsmichi added bug Something isn't working needs feedback We'll only proceed once we hear from you again labels Jul 3, 2018
@Crunsher
Contributor

Crunsher commented Sep 4, 2018

Anycast ping: is anyone still experiencing this issue with a recent Icinga 2 version?

@dnsmichi
Contributor

This might also be related to a setup where the master actively connects to all clients, see #6517.

@paladox

paladox commented Sep 14, 2018

We have experienced less load since using 2.9. We made sure only one check runs at a time.

(This is with nagios-nrpe-server (check_nrpe); we don't use the Icinga 2 client.)

@MarcusCaepio
Contributor

I can also see this with the very latest version, 2.9.2. After a config reload, the satellites get a very high load.
[screenshot]

@dnsmichi
Contributor

Any chance you'll try the snapshot packages?

@MarcusCaepio
Contributor

Unfortunately not right at the moment, as I don't have an identical dev cluster right now. But if I can help with any further info (total checks, plugins, etc.), I would love to do it :)

@dnsmichi
Contributor

I believe that the load is caused by the reconnect timer, or many incoming connections with many separate threads being spawned. A full analysis is available in #6517.

@MarcusCaepio
Contributor

Still present in 2.10:
Master on reload:
[screenshot]

Satellites on reload:
[screenshot]
[screenshot]

@dnsmichi
Contributor

OK, then @Thomas-Gelf was right about the scheduler. I was just guessing from the recent changes, and it is good to know that the possible problem areas have been narrowed down with a recent version, thanks.

@phibos

phibos commented Feb 8, 2021

Anything we can do to help resolve this issue for the next release?

@MarcusCaepio
Contributor

I don't have this issue anymore with the latest Icinga 2 version.

@pluhin

pluhin commented Feb 22, 2021

Hi all,

I still have this issue. I have two satellites with 8 CPUs and 16 GB RAM; in the zone I have ~250 hosts and ~4000 services.

[screenshot]

# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.0-1)

Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.8 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-1127.18.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

@pluhin

pluhin commented Feb 22, 2021

I found a lot of big log files in the api/log folder; after cleaning them up, the satellites cooled down.

@phibos

phibos commented Jul 7, 2021

Anything we can do to help resolve this issue for the next release?

We were able to fix this issue with the latest version and by changing some default config values.

@Al2Klimov
Member

Which ones?

@phibos

phibos commented Oct 29, 2021

Which ones?

By default we had disabled the replay logs only on the command endpoints, but now we have also disabled the replay logs on the monitoring server for all command endpoints:

object Endpoint "icinga2-agent1.localdomain" {

  log_duration = 0
}

@phibos phibos removed their assignment Oct 29, 2021
@Al2Klimov
Member

Al2Klimov commented Oct 29, 2021

Colleagues, don't we recommend doing exactly that?

@N-o-X
Contributor

N-o-X commented Oct 29, 2021

Yes, we do. It's disabled in all our example configs using command endpoint agents in our distributed monitoring docs, and we even have a dedicated section: https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#disable-log-duration-for-command-endpoints

@Al2Klimov
Member

@pgress Please try this.

@Al2Klimov
Member

Did anyone else also get this fixed via #5465 (comment)?

@pluhin

pluhin commented Oct 12, 2022

@Al2Klimov did you add log_duration = 0 to the agents or to the satellites?

@julianbrost
Contributor

You should set that on all Endpoint objects that represent a connection from or to an agent; see also https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#disable-log-duration-for-command-endpoints
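
A minimal sketch following the linked docs section (the endpoint names are placeholders): on the master/satellite, set it on the agent Endpoint objects, and on the agent, set it on the parent Endpoint objects.

/* on the master or satellite: the agent endpoint */
object Endpoint "icinga2-agent1.localdomain" {
  log_duration = 0
}

/* on the agent: the parent (master/satellite) endpoint */
object Endpoint "icinga2-master1.localdomain" {
  log_duration = 0
}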

@pluhin

pluhin commented Oct 12, 2022

I've applied this solution; it seems it did not help for me, but that was maybe a year ago.

I had 2 HA masters and 2x4 HA satellites. I added this parameter for the satellites plus one big zone, and it began to work well, but after a few redeployments of the master config (Ansible removes /etc/icinga2 and creates it anew from the repository), the problem came back for me.

@pgress
Author

pgress commented Nov 25, 2022

Hey, I don't work for the company anymore, so I can't reproduce this issue anymore. I will unassign myself from this issue.

@pgress pgress removed their assignment Nov 25, 2022
@pluhin

pluhin commented Dec 14, 2022

Let me try:
I added log_duration = 0 to all agents and satellites in the master configuration.

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label May 16, 2023
@pluhin

pluhin commented Oct 3, 2023

Hi all,
Several updates (a config sketch follows this list):

  • I added log_duration = 0 to all agents
  • 10 minutes log duration for both satellites in HA mode
  • I updated the config on the master twice within 5 minutes
  • the satellites were overloaded again
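
Roughly, that corresponds to endpoint definitions like these on the master (a sketch only; endpoint names are placeholders for the real agent and satellite names):

/* agent endpoints */
object Endpoint "agent1.localdomain" {
  log_duration = 0
}

/* satellite endpoints in the HA zone */
object Endpoint "satellite1.localdomain" {
  log_duration = 10m
}

object Endpoint "satellite2.localdomain" {
  log_duration = 10m
}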

@Al2Klimov Al2Klimov removed the needs feedback We'll only proceed once we hear from you again label May 14, 2024