Restarting Icinga service causes lots of alerts (systemd) #6873

Closed
ghost opened this issue Jan 3, 2019 · 16 comments · Fixed by #7118

@ghost

ghost commented Jan 3, 2019

Expected Behavior

After systemctl restart icinga2, no alerts should appear.

Current Behavior

After restarting the service, many alerts appear on the dashboard. Every check that was running at the time of the restart returns critical.

Possible Solution

Do not alert on <terminated by signal 15>

Steps to Reproduce (for bugs)

  1. systemctl restart icinga2
  2. see alerts

Screenshot

screen

Context

Your Environment

  • Version used (icinga2 --version): version: r2.10.2-1
  • Operating System and version: CentOS Linux release 7.6.1810 (Core)
  • Enabled features (icinga2 feature list): api checker command compatlog graphite ido-pgsql influxdb livestatus mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.6.2
  • Config validation (icinga2 daemon -C):
output
[2019-01-03 11:23:34 +1100] information/cli: Icinga application loader (version: r2.10.2-1)
[2019-01-03 11:23:34 +1100] information/cli: Loading configuration file(s).
[2019-01-03 11:23:34 +1100] information/ConfigItem: Committing config item(s).
[2019-01-03 11:23:34 +1100] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-01-03 11:23:34 +1100] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-01-03 11:23:34 +1100] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2019-01-03 11:23:34 +1100] warning/ApiListener: Please read the upgrading documentation for v2.8: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2019-01-03 11:23:34 +1100] information/ApiListener: My API identity: HOST_FQDN_REDACTED
[2019-01-03 11:23:35 +1100] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 11:1-11:45) for type 'Notification' does not match anywhere!
[2019-01-03 11:23:35 +1100] warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 23:1-23:48) for type 'Notification' does not match anywhere!
[2019-01-03 11:23:35 +1100] warning/ApplyRule: Apply rule 'xmpp_host' (in /etc/icinga2/conf.d/notifications_xmpp.conf: 29:1-29:38) for type 'Notification' does not match anywhere!
[2019-01-03 11:23:35 +1100] warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/conf.d/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 4937 Services.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 LivestatusListener.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 367 Hosts.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 FileLogger.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 10 NotificationCommands.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 4003 Notifications.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 46 HostGroups.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 ApiListener.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 Downtime.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 15 Comments.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 4 Zones.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 3 Endpoints.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 ApiUser.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 CompatLogger.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 15 Users.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 226 CheckCommands.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 9 UserGroups.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 28 ServiceGroups.
[2019-01-03 11:23:35 +1100] information/ConfigItem: Instantiated 4 TimePeriods.
[2019-01-03 11:23:36 +1100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2019-01-03 11:23:36 +1100] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
output
icinga2 object list --type Endpoint
Object 'external_host_fqdn' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 3:1-3:45
  * __name = "external_host_fqdn"
  * host = "10.10.1.34"
    % = modified in '/etc/icinga2/zones.conf', lines 4:3-4:21
  * log_duration = 86400
  * name = "external_host_fqdn"
  * package = "_etc"
  * port = "5665"
    % = modified in '/etc/icinga2/zones.conf', lines 5:3-5:13
  * source_location
    * first_column = 1
    * first_line = 3
    * last_column = 45
    * last_line = 3
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "external_host_fqdn" ]
    % = modified in '/etc/icinga2/zones.conf', lines 3:1-3:45
  * type = "Endpoint"
  * zone = ""

Object 'primary_host_fqdn' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 13:1-13:48
  * __name = "master_host_fqdn"
  * host = "localhost"
    % = modified in '/etc/icinga2/zones.conf', lines 14:3-14:20
  * log_duration = 86400
  * name = "master_host_fqdn"
  * package = "_etc"
  * port = "5665"
    % = modified in '/etc/icinga2/zones.conf', lines 15:3-15:13
  * source_location
    * first_column = 1
    * first_line = 13
    * last_column = 48
    * last_line = 13
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "primary_host_fqdn" ]
    % = modified in '/etc/icinga2/zones.conf', lines 13:1-13:48
  * type = "Endpoint"
  * zone = ""

Object 'internal_host_fqdn' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 8:1-8:50
  * __name = "internal_host_fqdn"
  * host = "10.10.1.11"
    % = modified in '/etc/icinga2/zones.conf', lines 9:3-9:23
  * log_duration = 86400
  * name = "internal_host_fqdn"
  * package = "_etc"
  * port = "5665"
    % = modified in '/etc/icinga2/zones.conf', lines 10:3-10:13
  * source_location
    * first_column = 1
    * first_line = 8
    * last_column = 50
    * last_line = 8
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "internal_host_fqdn" ]
    % = modified in '/etc/icinga2/zones.conf', lines 8:1-8:50
  * type = "Endpoint"
  * zone = ""
@Crunsher
Contributor

Crunsher commented Jan 7, 2019

That sounds familiar. Could you provide the log output around a restart? It would be interesting to see why Icinga 2 kills these checks.

Crunsher added the needs feedback and area/checks labels on Jan 7, 2019
@ghost
Author

ghost commented Jan 8, 2019

@sammcj can you please provide that while I'm away?

@Crunsher
Contributor

Crunsher commented Jan 8, 2019

I received a report from a colleague who can reproduce this; sadly, the logs are of no help. We currently suspect systemd to be the culprit, killing child processes before we can terminate them.

The solution would be to ignore check results while we are restarting or quitting, and to reschedule those checks so they run once icinga2 is running again.
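
As an illustration of the idea above (not Icinga 2's actual implementation), the sketch below discards a result only when a shutdown is in progress and the plugin was terminated by SIGTERM. The function names should_discard_result and run_check and the "yes"/"no" shutdown flag are hypothetical. The only fact relied on is that sh and bash report a child killed by signal 15 as exit status 143 (128 + 15).

```shell
#!/bin/sh
# Hypothetical sketch, not Icinga 2 code.

# Succeeds (returns 0) when the result should be thrown away:
# the daemon is shutting down AND the plugin died from SIGTERM (128 + 15 = 143).
should_discard_result() {
    exit_status="$1"
    shutting_down="$2"   # "yes" while the daemon is stopping, anything else otherwise
    [ "$shutting_down" = "yes" ] && [ "$exit_status" -eq 143 ]
}

# Run a check command and print either "discard" or "process:<exit status>".
# First argument is the shutdown flag, the rest is the command to run.
run_check() {
    shutting_down="$1"
    shift
    "$@" >/dev/null 2>&1
    status=$?
    if should_discard_result "$status" "$shutting_down"; then
        echo "discard"
    else
        echo "process:$status"
    fi
}
```

A discarded check would then be rescheduled on the next start instead of being reported as critical. Outside of a shutdown, a SIGTERM-killed plugin is still processed as a normal (failed) result.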

Crunsher removed the needs feedback label on Jan 8, 2019
lippserd added the bug label on Feb 6, 2019
@lippserd
Member

lippserd commented Feb 6, 2019

Possible fix: #6908

lippserd added this to the 2.11.0 milestone on Feb 6, 2019
Al2Klimov self-assigned this on Feb 11, 2019
@Al2Klimov
Member

I'd even say it's the fix, as it prevents bad check results from happening in the first place instead of ignoring and re-scheduling them.

@sammcj

sammcj commented Feb 18, 2019

Any update on this? I see #6908 was merged - did that fix it?

@Al2Klimov
Member

@sammcj Please could you test our snapshot packages and report the result?

Al2Klimov assigned ghost and unassigned Al2Klimov on Feb 18, 2019
Al2Klimov added the needs feedback label on Feb 18, 2019
@ghost
Author

ghost commented Feb 18, 2019

Will do that tomorrow, as it's 23:26 in Melbourne, Australia.

@sammcj

sammcj commented Feb 19, 2019

@Al2Klimov FYI: @alexizmailov works with / near me. He's pointed out that while it has been merged, it hasn't been released yet, so once it has been released we'll install it, test it, and update this ticket.

Thanks :)

@dnsmichi
Contributor

That's a bit tricky, as we're waiting for your feedback to release this. So we'd appreciate it if you can test the snapshot package in your environment prior to any release :)

@ghost
Author

ghost commented Feb 20, 2019

No, unfortunately the problem still persists. I tested 3 times:

failure

root@dev-alex-02:~  # rpm -aq | grep icinga2
icinga2-common-2.10.2.219.ge555b2f-0.2019.02.09+1.el7.icinga.x86_64
icinga2-bin-2.10.2.219.ge555b2f-0.2019.02.09+1.el7.icinga.x86_64
icinga2-2.10.2.219.ge555b2f-0.2019.02.09+1.el7.icinga.x86_64
icinga2-ido-pgsql-2.10.2.219.ge555b2f-0.2019.02.09+1.el7.icinga.x86_64

@Al2Klimov
Member

I'm afraid that fix is only half the battle. I just installed a fresh Icinga 2 on a fresh CentOS 7: our icinga2.service doesn't specify any KillMode. The default is control-group, so systemd kills all check plugins.
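
To confirm which KillMode systemd actually applies to the unit, something like the following can be used. This is a minimal sketch: effective_kill_mode is a hypothetical helper name, and it only wraps systemctl show -p KillMode, which prints a single KillMode=... line.

```shell
#!/bin/sh
# Print the KillMode that systemd currently applies to a unit.
# "systemctl show -p KillMode UNIT" prints one line of the form "KillMode=<value>";
# sed keeps that line and strips the "KillMode=" prefix.
effective_kill_mode() {
    unit="${1:-icinga2}"   # defaults to the icinga2 unit discussed here
    systemctl show -p KillMode "$unit" | sed -n 's/^KillMode=//p'
}

# Example (run on the monitoring host):
#   effective_kill_mode icinga2
```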

Al2Klimov unassigned ghost on Feb 20, 2019
Al2Klimov removed the needs feedback label on Feb 20, 2019
@dnsmichi
Contributor

The mentioned fix also causes other problems, namely restart delays and missing Stop() calls. I will partially revert it and re-evaluate a possible fix for the kill problem.

dnsmichi self-assigned this on Feb 20, 2019
@Al2Klimov
Member

@alexizmailov Please try to change /usr/lib/systemd/system/icinga2.service as shown here and run systemctl daemon-reload. Does this help?
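
The linked change itself is not reproduced in this thread. Purely as a hedged sketch, one way to experiment with a different KillMode without editing the packaged unit file is a systemd drop-in. The value mixed, the file name killmode.conf, and the helper write_killmode_dropin are assumptions for illustration and may differ from what the link proposed.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that overrides KillMode.
# ASSUMPTION: "mixed" is only an example value. With KillMode=mixed, systemd sends
# SIGTERM to the main process only and SIGKILL later to whatever is left in the
# control group, so the daemon gets a chance to stop its own check plugins.
write_killmode_dropin() {
    dropin_dir="${1:-/etc/systemd/system/icinga2.service.d}"
    kill_mode="${2:-mixed}"
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section header and the settings being overridden.
    printf '[Service]\nKillMode=%s\n' "$kill_mode" > "$dropin_dir/killmode.conf"
}

# Example (as root), followed by the reload mentioned above:
#   write_killmode_dropin && systemctl daemon-reload && systemctl restart icinga2
```

Unlike an edit to /usr/lib/systemd/system/icinga2.service, a drop-in under /etc is not overwritten by package upgrades, which makes it easier to remove once a fixed package ships.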

Al2Klimov assigned ghost on Apr 15, 2019
Al2Klimov added the needs feedback label on Apr 15, 2019
@ghost
Author

ghost commented Apr 15, 2019

I will try this tomorrow because it's 22:45 now in Melbourne.

@ghost
Author

ghost commented Apr 16, 2019

Looks like it works, I restarted it 4 times and there were no alerts at all.

Al2Klimov assigned Al2Klimov and unassigned dnsmichi and ghost on Apr 16, 2019
Al2Klimov removed the needs feedback label on Apr 16, 2019
dnsmichi changed the title from "Restarting Icinga service causes lots of alerts" to "Restarting Icinga service causes lots of alerts (systemd)" on Apr 16, 2019