Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga2 sends notification during reload (despite downtime) and does not log it #6846

Closed
ekeih opened this issue Dec 19, 2018 · 19 comments
Closed
Labels
area/notifications Notification events bug Something isn't working ref/NC

Comments

@ekeih
Copy link
Contributor

ekeih commented Dec 19, 2018

This is similar to #6057. It is even the exact same Icinga2 setup 馃槈

Expected Behavior

  1. Icinga2 should not send a notification for a service that is in downtime (and active checks disabled) during a reload.
  2. When Icinga2 sends a notification it should at least log it.

Current Behavior

We have two masters (master01 and master02) which are responsible for the notifications in our cluster.

  • Service has 7 configured contacts
  • Service is critical
  • All 7 contacts receive notifications
  • Service is in downtime (configured via API)
  • Active checks are disabled (configured via API)
  • Nothing changes for 30 days and Icinga2 reloads every 30 minutes during this time. I do not think that the number of days is relevant but it shows that it works most of the time.
  • Icinga2 reloads
# master01
...
[2018-12-15 20:26:05 +0100] information/ExternalCommandListener: 'command' stopped.
[2018-12-15 20:26:19 +0100] information/FileLogger: 'main-log' started.
[2018-12-15 20:26:22 +0100] information/ApiListener: 'api' started.
...

# master02
[2018-12-15 20:26:23 +0100] information/ApiListener: Restarting after configuration change.
  • 20:26:43 master02 sends a notification to one contact
    This should not happen because the service is still in downtime (and also the active checks are disabled).
    I guess it only happened for one contact because the other notification objects seem to be "scheduled" on master01.
    Icinga2 does not log this notification, but our notification handler does. The notification is also not visible in Icingaweb2.
  • 20:31:10 we acknowledge the alert
  • All 7 contacts receive a notification. This seems to be correct because all contacts received the critical alert 30 days ago.

Possible Solution

I guess that it is again some issue with the startup order or a race condition. But I have no specific idea.

Steps to Reproduce (for bugs)

It seems to happen at random times. I do not even know how often. Currently I only have this one example and no idea how to reproduce it.

Context

We have 2 masters and 6 satellites (4 zones, 2x2 and 2x1 satellites) and deploy our configuration via Puppet on master01 at xx:22. (In this case at 20:22.) Usually Icinga2 selects master02 as the IDO-DB master.

Your Environment

  • Version used (icinga2 --version): r2.10.2-1
  • Operating System and version: Ubuntu 16.04
  • Enabled features (icinga2 feature list): api checker command ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.6.1
  • Config validation (icinga2 daemon -C):
[2018-12-19 17:40:51 +0100] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-12-19 17:40:51 +0100] information/cli: Loading configuration file(s).
[2018-12-19 17:40:53 +0100] information/ConfigItem: Committing config item(s).
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'key_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'ca_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Attribute 'cert_path' for object 'api' of type 'ApiListener' is deprecated and should not be used.
[2018-12-19 17:40:53 +0100] warning/ApiListener: Please read the upgrading documentation for v2.8: https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/
[2018-12-19 17:40:53 +0100] information/ApiListener: My API identity: icinga2.live.master01
[2018-12-19 17:41:03 +0100] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 30.5333/s (1832/min 1832/5min 1832/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:03 +0100] information/WorkQueue: #7 (IdoMysqlConnection, ido-mysql) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 636 ScheduledDowntimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 63109 Services.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 2523 Hosts.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 52704 Dependencies.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 49410 Notifications.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 339 HostGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8603 Downtimes.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 522 Comments.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 6 Zones.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 Endpoints.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 8 ApiUsers.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 331 Users.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 14 CheckCommands.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 201 UserGroups.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 1 ServiceGroup.
[2018-12-19 17:41:10 +0100] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-12-19 17:41:10 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-12-19 17:41:10 +0100] information/cli: Finished validating the configuration file(s).

Let me know if you think that a detailed configuration of the zones end endpoints would be helpful or if I can do something to help.

@ekeih ekeih changed the title Icinga2 sends notification during reload and does not log it Icinga2 sends notification during reload (despite downtime) and does not log it Dec 19, 2018
@dnsmichi dnsmichi added the area/notifications Notification events label Dec 20, 2018
@widhalmt
Copy link
Member

ref/NC/597231

@ekeih
Copy link
Contributor Author

ekeih commented Jan 23, 2019

@widhalmt Any chance that the internal ticket contains any useful information to reproduce this issue or something else? Let me know if I can do anything to help with this problem.

@widhalmt
Copy link
Member

@ekeih Sorry. All we got so far was that the setup in question hit the same problem. I'm still evaluating but I already escaleted the ticket internally so it might well be that someone else will come back to you for details.

@ekeih
Copy link
Contributor Author

ekeih commented Jan 23, 2019

Thanks for the info! 馃檪

@Al2Klimov Al2Klimov self-assigned this Jan 25, 2019
@Al2Klimov
Copy link
Contributor

Whiteboxtested v2.10.2 with the following result:

NotificationComponent#NotificationTimerHandler() considers Checkable#IsInDowntime() which looks for downtimes added via Checkable#RegisterDowntime() in Downtime#Start() on downtime activation. The present downtimes always register themselves successfully as the checkables are loaded before the downtimes and the downtimes are activated after loading. In addition NotificationComponents are activated after downtimes.

(Thanks to @Crunsher for consultation.)

TL;DR: The code seems clean. CC @lippserd

@ekeih Please could you provide debug logs around the time the notification was send? And: Are you sure that Icinga 2 calls the notification script?

@Al2Klimov Al2Klimov assigned ekeih and unassigned Al2Klimov Jan 25, 2019
@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label Jan 25, 2019
@SimonHoenscheid
Copy link

We also see this behavior in our setup too.

@lippserd
Copy link
Member

@SimonHoenscheid Do you have some debug logs and information about your setup?

@ekeih
Copy link
Contributor Author

ekeih commented Jan 28, 2019

Please could you provide debug logs around the time the notification was send?

Currently not. Last time we used the debug log it grew extremely fast. So I am not able to enable it for several days in production.
I will try to gather more information in our testing environment but it's quite hard without a way to actively reproduce it.

And: Are you sure that Icinga 2 calls the notification script?

Yes, I am 100% sure that it is called by Icinga2.

@jschanz
Copy link
Contributor

jschanz commented Jan 29, 2019

Found that issue, because we had the same problem in the past.
During a reload the notification script is sometimes triggered even though a downtime is set for a service.
This happens not for every service and it's not reproducible.

[2019-01-25 18:40:41 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 3, rate: 2495.82/s (149749/min 780477/5min 2344675/15min);
[2019-01-25 18:40:51 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 1, rate: 2505.15/s (150309/min 780335/5min 2344155/15min);
[2019-01-25 18:41:01 +0100] information/WorkQueue: #34 (JsonRpcConnection, #23) items: 4, rate: 2514.7/s (150882/min 780287/5min 2347299/15min);
[2019-01-25 18:41:02 +0100] information/Application: Reload requested, letting new process take over.
[2019-01-25 18:41:02 +0100] information/ApiListener: 'api' stopped.
[2019-01-25 18:41:02 +0100] information/CheckerComponent: 'checker' stopped.
[2019-01-25 18:41:03 +0100] information/ExternalCommandListener: 'command' stopped.
[2019-01-25 18:42:01 +0100] information/FileLogger: 'main-log' started.
[2019-01-25 18:42:10 +0100] information/WorkQueue: #10 (DaemonCommand::Run) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2019-01-25 18:42:13 +0100] information/ApiListener: 'api' started.
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 3 zone configuration files for zone 'xx.yy.zz.de' to '/var/lib/icinga2/api/zones/xx.yy.zz.de'.
[2019-01-25 18:42:13 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/xx.yy.zz.de' (0 Bytes). Received timestamp '2019-01-25 18:42:13 +0100' (1548438133.171498), Current timestamp '2019-01-25 11:50:50 +0100' (1548413450.769627).
[2019-01-25 18:42:13 +0100] information/ApiListener: Copying 12 zone configuration files for zone 'director-global' to '/var/lib/icinga2/api/zones/director-global'.

Historie in Icingweb2 is empty for that timestamp ...

grafik

... SMS-Gateway shows Icinga has triggered the notification.

grafik

Systemd logs something like this

Jan 25 18:41:02 localhost.localdomain systemd[1]: icinga2.service: Supervising process 32212 which is not our child. We'll most likely not notice when it exits.

We're running a single master with several zones. So I think it's not a cluster problem ...

@jschanz
Copy link
Contributor

jschanz commented Jan 29, 2019

Another idea ... Is it possible, that the newly spawn Icinga2 process will send this notification while starting up? Because ACKs sent via API are shown in Icingaweb2:

grafik

The notification script automatically sends an ACK to the API for the triggered service:

if ( $response->is_success ) {
  print $response->decoded_content . "\n";
  my $retval = $response->decoded_content;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es wurde erfolgreich eine SMS an $smsgw_pager ($icinga2_user_display_name) versendet. Status der SMS: $retval.\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);
} else {
  my $retval = $response->decoded_content;
  my $retmsg = $response->status_line;
  $post_data = "\"type\": \"Host\", \"filter\": \"host.name==\\\"$icinga2_host_name\\\"\", \"author\": \"Notification-Manager\", \"comment\": \"Es trat ein Fehler beim versenden der SMS an $smsgw_pager ($icinga2_user_display_name) auf. Status der SMS: $retval ($retmsg).\", \"notify\": false, \"sticky\": true";
  my $command = "curl -u '$icinga2_api_user:$icinga2_api_password' -H 'Accept: application/json' -X POST -k '$endpoint_url' -d '{ $post_data}'";
  exec($command);

  print $response->decoded_content . "\n";
  die $response->status_line . "\n";
}

So the API accepts and process connections and events, but logs nothing to /var/log/icinga2/icinga2.log.

The reload on 25th at ~ 18:40 was triggered by Director:

grafik

As I understand the state of Icinga2 is written to
/var/lib/icinga2/icinga2.state during a restart.

So ... Which Icinga2-Core receives the API-Call during reload in that case? The old or the new one? Why is an API-Call persisten, but nothing is logged or written to ido?
Any ideas?

Update:

After digging through the notificationcomponent code as @Al2Klimov described, I found, that the following isn't never logged to the /var/log/icinga2/icinga2.log in our environments:

void NotificationComponent::Stop(bool runtimeRemoved)
{
	Log(LogInformation, "NotificationComponent")
		<< "'" << GetName() << "' stopped.";

	ObjectImpl<NotificationComponent>::Stop(runtimeRemoved);
}

With a

systemctl restart icinga2

only the startup of the component is logged:

[2019-01-29 15:11:45 +0100] information/NotificationComponent: 'notification' started.

A reload with systemd or director also brings up the NotificationComponent message in the logfile.
At January, 25th, the NotificationComponent was also started about a minute after the "notification" was triggered.

[2019-01-25 18:42:14 +0100] information/NotificationComponent: 'notification' started.

An "NotificationComponent ... stopped" event, which should be thrown during shutdown as I understand, wasn't also logged in that case.

I assume something goes wrong during shutdown, which leads to a trigger that sends notifications out.

@Al2Klimov Al2Klimov removed the needs feedback We'll only proceed once we hear from you again label Jan 29, 2019
@Al2Klimov
Copy link
Contributor

During reload the daemon starts a copy of itself and then shuts down. There's a small period of time during which any of them could receive the request. At the moment log messages about API requests may be lost, but #6827 would fix that.

@SimonHoenscheid
Copy link

@SimonHoenscheid Do you have some debug logs and information about your setup?

I can sit down next week and put everything together. regarding debug logs, I need to have a look of the impacts this will have on our system.

@marcofl
Copy link

marcofl commented Feb 11, 2019

During a reload the notification script is sometimes triggered even though a downtime is set for a service.

We see the same issue:
screen shot 2019-02-11 at 09 41 47
At 03:08 there was a Icinga2 reload, but nothing was logged about a notification:

[2019-02-11 03:09:50 +0100] information/FileLogger: 'main-log' started.
[2019-02-11 03:09:56 +0100] information/ApiListener: 'api' started.
[2019-02-11 03:09:57 +0100] information/ExternalCommandListener: 'command' started.
[2019-02-11 03:09:57 +0100] information/GelfWriter: 'gelf' started.
[2019-02-11 03:09:57 +0100] information/InfluxdbWriter: 'influxdb' started.
[2019-02-11 03:09:57 +0100] information/CheckerComponent: 'checker' started.
[2019-02-11 03:09:57 +0100] information/NotificationComponent: 'notification' started.
[2019-02-11 03:09:57 +0100] information/DbConnection: 'ido-mysql' started.

what is even more strange:

  • graylog logged a notification at 02:04

This does not make any sense, right? when the service was critical, graylog logged the notification but it was not actually triggered. during the reload at 03:08 the / a notification was triggered but not logged to graylog?

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Feb 11, 2019
@Al2Klimov
Copy link
Contributor

@ekeih You could auto-trucate the debug log by a cron script not to run out of disk space.

@MaBauMeBad
Copy link

Same here. If any addtional Logs or Information are required, please feel free to ask.

Thanks

@ekeih
Copy link
Contributor Author

ekeih commented Jun 18, 2019

@MaBauMeBad I think the main reason that there is no progress is that nobody provided debug logs. I currently do not have the resources to do so.
So if you can provide debug logs of this issue, this would probably help ;)

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Jun 19, 2019
@MarcusCaepio
Copy link
Contributor

Hi all,
just wanted to confirm this. Had it right now. Scenario:
Scheduled Downtimes of several Hosts, which are off. Edited the Icinga2 Config of other hosts, reloaded the config and got notifications of the downtimed hosts.

Cheers,
Marcus

@dnsmichi
Copy link
Contributor

Please test this with 2.11 RC - https://icinga.com/2019/07/25/icinga-2-11-release-candidate/

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Aug 27, 2019
@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Sep 17, 2019
@dnsmichi
Copy link
Contributor

I consider this fixed with 2.11 from the many patches we've implemented in this region.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/notifications Notification events bug Something isn't working ref/NC
Projects
None yet
Development

No branches or pull requests

10 participants