Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga2 sends notifications without logging about it and despite having a downtime #6057

Closed
baurmatt opened this issue Feb 1, 2018 · 26 comments · Fixed by #6270
Closed

Icinga2 sends notifications without logging about it and despite having a downtime #6057

baurmatt opened this issue Feb 1, 2018 · 26 comments · Fixed by #6270
Labels
area/notifications Notification events blocker Blocks a release or needs immediate attention bug Something isn't working
Milestone

Comments

@baurmatt
Copy link
Contributor

baurmatt commented Feb 1, 2018

Expected Behavior

  1. Every notifications send generates a Icinga2 log entry
  2. A downtimed host shouldn't send notifications

Current Behavior

We found that Icinga2 sends notifications without logging about it and despite having a downtime set.

Steps we do/see:

  1. Host and services are included and monitored from Icinga2 and all checks are UP/OK
  2. Host gets downtimed via API call
  3. Host gets normally stopped
  4. Icinga2 checks the host/sercices - checks are going into DOWN/CRITICAL state
  5. Icinga2 regularly gets reloaded because of config changes being synced
  6. After some time (minutes/perhaps 1-2 hours) another reload happens
  • While this reload happens, a notifications get set to the user telling about the DOWN state of the host
  • We never saw any notifications for services, only for hosts
  • This notification isn't logged by Icinga2 (.*information/Notification: Sending 'Problem' notification.*), it gets only logged by our notification script.

Notification script log:

{"date": "2018-02-01 15:58:47", "type": "email", "destination": "foobar@example.org", "notification_type": "PROBLEM", "service": "", "host": "my-fancy-hostname"}

Icinga2 log of the master which send the notification (and reloaded) from 15:58:*

[2018-02-01 15:58:00 +0100] information/HttpServerConnection: No messages for Http connection have been received in the last 10 seconds.
[2018-02-01 15:58:23 +0100] warning/TlsStream: TLS stream was disconnected.
[2018-02-01 15:58:23 +0100] warning/JsonRpcConnection: API client disconnected for identity 'master01'
[2018-02-01 15:58:23 +0100] warning/ApiListener: Removing API client for endpoint 'master01'. 0 API clients left.
[2018-02-01 15:58:28 +0100] information/ApiListener: New client connection for identity 'master01' from [89.89.89.89]:47040
[2018-02-01 15:58:28 +0100] information/ApiListener: Sending config updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for zone 'bki' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for zone 'blu' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for zone 'master' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtbki' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtblu' to endpoint 'master01'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Finished sending config file updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:28 +0100] information/ApiListener: Syncing runtime objects to endpoint 'master01'.
[2018-02-01 15:58:29 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'master01'.
[2018-02-01 15:58:29 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:29 +0100] information/ApiListener: Sending replay log for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:29 +0100] information/ApiListener: Finished sending replay log for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:29 +0100] information/ApiListener: Finished syncing endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying config update from endpoint 'master01' of zone 'master'.
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/bki//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/bki' (17 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.852829), Current timestamp '2018-02-01 15:39:11 +0100' (1517495951.020393).
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/blu//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/blu' (17 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.819657), Current timestamp '2018-02-01 15:39:11 +0100' (1517495951.050538).
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/globals.d//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/globals.d' (17 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.823386), Current timestamp '2018-02-01 15:39:10 +0100' (1517495950.980507).
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/master//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/master' (17 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.823882), Current timestamp '2018-02-01 15:39:11 +0100' (1517495951.024695).
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/mgmtbki//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/mgmtbki' (17 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.826017), Current timestamp '2018-02-01 15:39:11 +0100' (1517495951.026952).
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/mgmtblu//.timestamp
[2018-02-01 15:58:30 +0100] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/mgmtblu//_etc/hosts.conf
[2018-02-01 15:58:30 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/mgmtblu' (105135 Bytes). Received timestamp '2018-02-01 15:58:28 +0100' (1517497108.824826), Current timestamp '2018-02-01 15:39:10 +0100' (1517495950.981618).
[2018-02-01 15:58:30 +0100] information/ApiListener: Restarting after configuration change.
[2018-02-01 15:58:30 +0100] warning/TlsStream: TLS stream was disconnected.
[2018-02-01 15:58:30 +0100] warning/JsonRpcConnection: API client disconnected for identity 'satellite06'
[2018-02-01 15:58:30 +0100] warning/ApiListener: Removing API client for endpoint 'satellite06'. 0 API clients left.
[2018-02-01 15:58:31 +0100] information/WorkQueue: #17 (JsonRpcConnection, #7) items: 2365, rate: 373.167/s (22390/min 22390/5min 22390/15min);
[2018-02-01 15:58:31 +0100] information/Application: Got reload command: Starting new instance.
[2018-02-01 15:58:41 +0100] information/ApiListener: Reconnecting to endpoint 'satellite06' via host 'satellite06.example.org' and port '5665'
[2018-02-01 15:58:41 +0100] information/ApiListener: New client connection for identity 'satellite06' to [89.89.89.88]:5665
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite06' via host 'satellite06.example.org' and port '5665'
[2018-02-01 15:58:41 +0100] information/ApiListener: Sending config updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite06'.
[2018-02-01 15:58:41 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite06' signed by our CA.
[2018-02-01 15:58:41 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtblu' to endpoint 'satellite06'.
[2018-02-01 15:58:41 +0100] information/JsonRpcConnection: The certificate for CN 'satellite06' cannot be renewed yet.
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite06'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite06'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Sending replay log for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:41 +0100] information/ApiListener: Finished syncing endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:45 +0100] information/ApiListener: New client connection from [127.0.0.1]:58126 (no client certificate)
[2018-02-01 15:58:45 +0100] information/HttpServerConnection: Request: GET /v1/status (from [127.0.0.1]:58126, user: icingabeat)
[2018-02-01 15:58:46 +0100] information/Application: Received request to shut down.
[2018-02-01 15:58:46 +0100] information/Application: Shutting down...
[2018-02-01 15:58:46 +0100] information/ApiListener: 'api' stopped.
[2018-02-01 15:58:46 +0100] information/CheckerComponent: 'checker' stopped.
[2018-02-01 15:58:46 +0100] information/ExternalCommandListener: 'command' stopped.
[2018-02-01 15:58:57 +0100] information/NotificationComponent: 'notification' started.
[2018-02-01 15:58:57 +0100] information/ApiListener: Adding new listener on port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite04' via host 'satellite04.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite03' via host 'satellite03.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite06' via host 'satellite06.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite01' via host 'satellite01.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite02' via host 'satellite02.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'master01' via host 'master01.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Reconnecting to endpoint 'satellite05' via host 'mgmtsatellite01.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'satellite01' to [89.89.89.87]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite01' via host 'satellite01.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Sending config updates for endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'bki' to endpoint 'satellite01'.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite01' signed by our CA.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: The certificate for CN 'satellite01' cannot be renewed yet.
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'satellite06' to [89.89.89.88]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite06' via host 'satellite06.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Sending config updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtblu' to endpoint 'satellite06'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite06'.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite06' signed by our CA.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: The certificate for CN 'satellite06' cannot be renewed yet.
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'master01' to [89.89.89.89]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'master01' via host 'master01.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Sending config updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtbki' to endpoint 'master01'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'master' to endpoint 'master01'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'blu' to endpoint 'master01'.
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'satellite04' to [89.89.89.86]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite04' via host 'satellite04.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: Sending config updates for endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'blu' to endpoint 'satellite04'.
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'satellite02' to [89.89.89.85]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite02' via host 'satellite02.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/ApiListener: New client connection for identity 'satellite03' to [89.89.89.84]:5665
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite03' via host 'satellite03.example.org' and port '5665'
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite02' signed by our CA.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: The certificate for CN 'satellite02' cannot be renewed yet.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite04' signed by our CA.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: The certificate for CN 'satellite04' cannot be renewed yet.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite03' signed by our CA.
[2018-02-01 15:58:57 +0100] information/JsonRpcConnection: The certificate for CN 'satellite03' cannot be renewed yet.
[2018-02-01 15:58:57 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite06'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite01'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtblu' to endpoint 'master01'.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for zone 'bki' to endpoint 'master01'.
[2018-02-01 15:58:57 +0100] information/CheckerComponent: 'checker' started.
[2018-02-01 15:58:57 +0100] information/DbConnection: 'ido-mysql' started.
[2018-02-01 15:58:57 +0100] information/ExternalCommandListener: 'command' started.
[2018-02-01 15:58:57 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite04'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'master01'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite06'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite04'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite06' in zone 'mgmtblu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending config updates for endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for zone 'bki' to endpoint 'satellite02'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite02'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite01'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite02'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'master01'.
[2018-02-01 15:58:58 +0100] information/ApiListener: New client connection for identity 'satellite05' to [89.89.89.84]:5665
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished reconnecting to endpoint 'satellite05' via host 'mgmtsatellite01.example.org' and port '5665'
[2018-02-01 15:58:58 +0100] information/JsonRpcConnection: Received certificate request for CN 'satellite05' signed by our CA.
[2018-02-01 15:58:58 +0100] information/JsonRpcConnection: The certificate for CN 'satellite05' cannot be renewed yet.
[2018-02-01 15:58:58 +0100] information/ApiListener: Applying config update from endpoint 'master01' of zone 'master'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite02'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite02' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending config updates for endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for zone 'blu' to endpoint 'satellite03'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite03'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite01'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite01' in zone 'bki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending config updates for endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for zone 'mgmtbki' to endpoint 'satellite05'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing configuration files for global zone 'globals.d' to endpoint 'satellite05'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite05'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite04'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite04' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending config file updates for endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Syncing runtime objects to endpoint 'satellite03'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'master01'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Replayed 353 messages.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite05'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite05' in zone 'mgmtbki'.
[2018-02-01 15:58:58 +0100] information/DbConnection: Resuming IDO connection: ido-mysql
[2018-02-01 15:58:58 +0100] information/IdoMysqlConnection: 'ido-mysql' resumed.
[2018-02-01 15:58:58 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.14.3')
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'satellite03'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Sending replay log for endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'satellite03' in zone 'blu'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished sending replay log for endpoint 'master01' in zone 'master'.
[2018-02-01 15:58:58 +0100] information/ApiListener: Finished syncing endpoint 'master01' in zone 'master'.

We also have debuglog from this point in time but i would be really hard to anonymize. Please tell me if you think this would help debugging this and we probably find a way to get it to you.

Possible Solution

?

Steps to Reproduce (for bugs)

Not really sure yet on a small test setup. I can more or less frequently reproduce it on our setup.

Context

We're trying to avoid sending notifications to users for hosts which are normally stopped.

Your Environment

  • Version used (icinga2 --version): r2.8.1-1
  • Operating System and version: Ubuntu 16.04.3
  • Enabled features (icinga2 feature list): Enabled features: api checker command ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.5.0, monitoring/custom
  • Config validation (icinga2 daemon -C):
root@master02:/var/log/icinga2# icinga2 daemon -C
information/cli: Icinga application loader (version: r2.8.1-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
warning/ApiListener: Please read the upgrading documentation for v2.8: https://www.icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
information/ApiListener: My API identity: master02
information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 2, rate: 8214.58/s (492875/min 492875/5min 492875/15min);
information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
...
Some apply rule warnings
...
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 6 Zones.
information/ConfigItem: Instantiated 8 Endpoints.
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 90709 Notifications.
information/ConfigItem: Instantiated 2 NotificationCommands.
information/ConfigItem: Instantiated 11 CheckCommands.
information/ConfigItem: Instantiated 2595 Downtimes.
information/ConfigItem: Instantiated 444 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 2738 Hosts.
information/ConfigItem: Instantiated 259 Comments.
information/ConfigItem: Instantiated 257 UserGroups.
information/ConfigItem: Instantiated 51944 Dependencies.
information/ConfigItem: Instantiated 413 Users.
information/ConfigItem: Instantiated 4 TimePeriods.
information/ConfigItem: Instantiated 60952 Services.
information/ConfigItem: Instantiated 1 ServiceGroup.
information/ConfigItem: Instantiated 262 ScheduledDowntimes.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
object Endpoint NodeName {
  host = "10.0.0.1"
}

object Endpoint "satellite05"  {
  host = "mgmtsatellite01.example.org"
}

object Endpoint "master02"  {
  host = "master02.example.org"
}

object Endpoint "satellite05"  {
  host = "satellite06.example.org"
}

object Endpoint "satellite03"  {
  host = "satellite03.example.org"
}

object Endpoint "satellite04"  {
  host = "satellite04.example.org"
}

object Endpoint "satellite01"  {
  host = "satellite01.example.org"
}

object Endpoint "satellite02"  {
  host = "satellite02.example.org"
}

object Zone "bki"  {
  endpoints = [ "satellite01", "satellite02", ]
  parent = "master"
}

object Zone "blu"  {
  endpoints = [ "satellite03", "satellite04", ]
  parent = "master"
}

object Zone "globals.d"  {
  global = true
}

object Zone "master"  {
  endpoints = [ "master01", "master02", ]
}

object Zone "mgmtbki"  {
  endpoints = [ "satellite05", ]
  parent = "master"
}

object Zone "mgmtblu"  {
  endpoints = [ "satellite05", ]
  parent = "master"
}

@Crunsher
Copy link
Contributor

Crunsher commented Feb 2, 2018

Notifications in conjuncture with reloads are a known problem. #5521 for example.

@Crunsher Crunsher added bug Something isn't working area/notifications Notification events labels Feb 2, 2018
@baurmatt
Copy link
Contributor Author

baurmatt commented Feb 2, 2018

Is there anything we can do to solve this problem?

@Crunsher
Copy link
Contributor

Crunsher commented Feb 2, 2018

I don't know of any workaround

@baurmatt
Copy link
Contributor Author

baurmatt commented Feb 2, 2018

What i meant was, is there any information we can provide which helps you track down and fix the issue?

@Crunsher
Copy link
Contributor

Crunsher commented Feb 2, 2018

Oh! A minimal configuration example/list of steps that lets us reliably reproduce the issue is always welcome.

@baurmatt
Copy link
Contributor Author

baurmatt commented Feb 2, 2018

Still trying to realiably reproduce this on a small setup... -.- Strange thing we realized while looking through our logs is, that the NotificationComponent gets started after the config reloads but we never see any stop log message. Is this correct? It seems like it should log https://github.com/Icinga/icinga2/blob/master/lib/notification/notificationcomponent.cpp#L69

@Crunsher
Copy link
Contributor

Crunsher commented Feb 5, 2018

During a reload this does not happen, the new process takes over as the old one silently dies. I'm going to check whether two NotificationComponents can run at the same time and if this can be a problem in case it is.

@kepi
Copy link

kepi commented Feb 9, 2018

Same problem in our setup, let us know if we can provide some more info.

To provide little context how important this fix is: Almost every day it wakes me and other admins up, often after long night work. Because icinga gets reloaded at morning when changing oncall contacts. Or just some admin is adding new check when I'm sleeping. Please help us save our sleep and sanity :)

We belived it will be fixed (similar/same behaviour already mentioned in #4696, #5224, #5361) but we have icinga 2.8.1 now and problem remains.

@Mikesch-mp
Copy link
Contributor

This happens a lot if you use scheduledDowntimes.

@marcofl
Copy link

marcofl commented Feb 15, 2018

We see similar issues with notifications being sent out right after reload of icinga2 for services / hosts which have a scheduled Downtime. I'm not really able to reproduce this in a small vagrant setup. For me it seems to only happen in lager installations where restarts / reloads take some time and a lot of system resources.

information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 640 Zones.
information/ConfigItem: Instantiated 638 Endpoints.
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 8786 Notifications.
information/ConfigItem: Instantiated 8 NotificationCommands.
information/ConfigItem: Instantiated 301 CheckCommands.
information/ConfigItem: Instantiated 26239 Downtimes.
information/ConfigItem: Instantiated 38 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 696 Hosts.
information/ConfigItem: Instantiated 237 Comments.
information/ConfigItem: Instantiated 5 UserGroups.
information/ConfigItem: Instantiated 10 Users.
information/ConfigItem: Instantiated 2 TimePeriods.
information/ConfigItem: Instantiated 39544 Services.
information/ConfigItem: Instantiated 2 ServiceGroups.
information/ConfigItem: Instantiated 13054 ScheduledDowntimes.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 1 InfluxdbWriter.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 ExternalCommandListener.

@anan80
Copy link

anan80 commented Feb 20, 2018

i have same notification issue when servers are kept in scheduled downtime and icinga randomly sending notification mails. - icinga 2.8.0

@n0npax
Copy link

n0npax commented Mar 20, 2018

We have the same notification issue. Icinga 2.8.1, config 6 satellites & 2 masters. Seems to be related with config reload too.

@dnsmichi
Copy link
Contributor

Looks like a race condition. The NotificationComponent is started before the FileLogger object is instantiated, and as such the icinga2.log doesn't have this record. startup.log should hold this detail.

Notification events without a call to Log() do not exist. It is hard to reproduce in this case, as one cannot influence the object load order.

I would advise to dump all notification (and related objects) attributes in a debug dummy notification.
https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#notification

  • last_notification
  • next_notification
  • last_problem_notification
  • "now" (depending on your script language, date +%s or likewise).

https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#host

  • host state, last_state, state_type, last_in_down, downtime_depth, etc.

@winter1967
Copy link

winter1967 commented Mar 26, 2018

Notification events without a call to Log() do not exist.

It exists at our environment since Icinga 2.5.4 ;-)
It happens sporadicaly, but not all time. So we have over 900 host object it happens every day but on differents systems.
Maybe it happens more often, when we create / import new systems with icinga2 director which causes a lot of load.

@dnsmichi
Copy link
Contributor

It exists at our environment since Icinga 2.5.4 ;-)

You've missed my point. In terms of code, no such events will ever pass through the Log() function call. It may be the case that the logging file stream is not there, and this results in an event literally logging to /dev/null.

@Thomas-Gelf
Copy link
Contributor

@lippserd: this issue, #5521 and quite some other linked/closed issues all have the same root cause and terribly hurt, especially large environments. Could you please raise priority and give someone some dedicated time to fix this? During restart/reload (read: with every config deployment) we are sending out erroneous notifications because the notification component starts and fires before the core learned about currently active downtimes. I've been shown another proof for this in 2.8.1 right now.

Notification are sent during downtime with no trace in IDO and/or log files. Happens immediately after reloading/restarting Icinga.

Thanks a lot,
Thomas

@lippserd
Copy link
Member

@Thomas-Gelf Indeed, this is a big issue. We'll see what we can do to fix this as soon as possible.

@dnsmichi dnsmichi added the blocker Blocks a release or needs immediate attention label Apr 17, 2018
@dnsmichi dnsmichi added this to the 2.9.0 milestone Apr 18, 2018
@dnsmichi
Copy link
Contributor

img_0078

@Crunsher
Copy link
Contributor

Proposal

Add a virtual type to all config objects to help Icinga2 work with them.

Explanation

Currently there is not much difference to icinga between a Host and the NotificationComponent, we can't give any securities about the load order.
To remedy this add a virtual type to all objects and start them by type. Something like Logger -> Hosts, Services, etc. -> NotificationComponent. In this case three types would be required, pre-config, config, post-config ('config' refering to monitoring objects like Hosts, Services and Downtimes here)

Problems

Figuring out the correct types for everything is a big load of work, a sensible default is needed. And there a a lot of possible bugs and dependency misjudgments which might have already existed but were just not visible since we loaded/started everything at more or less the same time.

@Wintermute2k6
Copy link

Wintermute2k6 commented Apr 23, 2018

removed comment because it was related to different issues.

@Mikesch-mp
Copy link
Contributor

@Wintermute2k6 that should not belong to this issue. First issue is this one, the second one is what you copied here. Its a complete different problem.

@dnsmichi
Copy link
Contributor

Agreed. @Wintermute2k6 your problem sounds like #2844 and #4272. Please proceed there.

@Wintermute2k6
Copy link

@Mikesch-mp & @dnsmichi i agree with both of you. But its not my issue at all.
Regards

@dnsmichi
Copy link
Contributor

I've read #558349 and this issue is just one of many. ScheduledDowntime objects and their generation of runtime downtime objects is not part of this issue.

We're discussing and elaborating fixes for unwanted notifications caused by runtime race conditions, either influenced by not-yet activated downtimes/acks, or too-soon-running features like the notification component.

@Wintermute2k6
Copy link

No problem with that at all i will split it into the accordingly matching issues.

@Crunsher
Copy link
Contributor

Config to test attached, was done with a sleep in the Downtime start method to force the bug
i2_test.zip

dnsmichi pushed a commit that referenced this issue May 4, 2018
This patch ensures that specific configuration types
are pre-activated and post-activated. In general,
logging is first, then common configuration objects
like host/service, downtimes, etc.
In the end, all features are activated after to ensure
that notifications are only sent once downtimes are applied.
A similar thing happens for starting with checks too early.
The ApiListener feature runs first to allow cluster connections
at first glance.

fixes #6057
fixes #6231
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/notifications Notification events blocker Blocks a release or needs immediate attention bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.