
When the notification's times.begin is not 0, the first notification has an extra delay #5561

Closed
silenceJI opened this issue Sep 7, 2017 · 5 comments · Fixed by #6896
Labels
area/notifications Notification events bug Something isn't working
@silenceJI

Expected Behavior

Current Behavior

There are three kinds of notifications configured for each of my monitored services:

mail-notification:

object Notification "mail-notification" {
	import "mail-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 0.000000
	}
	users = [ "mika.liu", "yufeng.ji", "kai.duan", "liangliang.pan", "zhangj.zhang", "jiwei.zhao", "yinan.gao", "yankun.zhu" ]
	version = 1504145701.727930
	zone = "master"
}

push-notification:

object Notification "push-notification" {
	import "push-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 0.000000
	}
	users = [ "qunaropspush" ]
	vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
	version = 1504145701.562974
	zone = "master"
}

ivr-notification:

object Notification "ivr-notification" {
	import "ivr-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 900.000000
	}
	users = [ "qunaropsivr" ]
	vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
	version = 1504145701.873894
	zone = "master"
}

What we have observed: the first ivr-notification is sent with an extra delay, i.e. later than the time configured in "times.begin".

Possible Solution

With "times.begin = 900", the first notification is currently delayed by more than a minute. If there is no way to fix this, we may have to work around it by configuring the begin time as "times.begin = 900" - 60s.

Steps to Reproduce (for bugs)

As the attached screenshot of the notification history shows:

[screenshot: notification history]

Theoretically, the first "ivr-notification" should have been sent at 19:39:51 + 15min = 19:54:51, but it was actually sent at 19:56:10.

Delay: 19:56:10 - 19:54:51 = 1min 19s

Context

Your Environment

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.7.0-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-514.26.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown
  • Operating System and version:
CentOS Linux release 7.3.1611 (Core)
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog gelf graphite ido-pgsql influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
  • Config validation (icinga2 daemon -C):
icinga2 daemon -C
information/cli: Icinga application loader (version: r2.7.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/Zone: The Zone object 'zone-cn0' has more than two endpoints. Due to a known issue this type of configuration is strongly discouraged and may cause Icinga to use excessive amounts of CPU time.
information/ApiListener: My API identity: l-cn0qicingam1.ops.cn2
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 3 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 5 Endpoints.
information/ConfigItem: Instantiated 1 UserGroup.
information/ConfigItem: Instantiated 41614 Notifications.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 274 CheckCommands.
information/ConfigItem: Instantiated 124 Downtimes.
information/ConfigItem: Instantiated 2 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 5581 Hosts.
information/ConfigItem: Instantiated 10458 Users.
information/ConfigItem: Instantiated 3 TimePeriods.
information/ConfigItem: Instantiated 51407 Services.
information/ConfigItem: Instantiated 4 ServiceGroups.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
/*
 * Generated by Icinga 2 node setup commands
 * on 2017-01-09 17:12:48 +0800
 */

object Endpoint "l-cn0qicingam1.ops.cn2" {
	host = "10.90.0.81"
	log_duration = 5m
}

object Endpoint "l-cn0qicingam2.ops.cn2" {
	host = "10.90.0.82"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan1.ops.cn2" {
	host = "10.90.0.85"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan2.ops.cn2" {
	host = "10.90.0.86"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan3.ops.cn2" {
	host = "10.90.0.84"
	log_duration = 5m
}

object Zone "master" {
	endpoints = [ "l-cn0qicingam1.ops.cn2", "l-cn0qicingam2.ops.cn2" ]
}

object Zone "zone-cn0" {
	endpoints = [ "l-cn0qicingan1.ops.cn2", "l-cn0qicingan2.ops.cn2", "l-cn0qicingan3.ops.cn2" ]
	parent = "master"
}

/* sync global commands */
object Zone "global-templates" {
	global = true
}
@dnsmichi (Contributor) commented Sep 7, 2017

Please fetch the runtime information for the ivr-notification from the REST API /v1/objects/notifications/ at several points in time:

  • when the service is OK
  • when the service is NOT-OK and in its first hard state
  • after the notification was actually sent

@dnsmichi dnsmichi added needs feedback We'll only proceed once we hear from you again area/notifications Notification events labels Sep 7, 2017
@mdetrano

I am seeing the same issue with initial notification delays (later than "begin" time). I used the REST API to monitor the notification objects and I believe I understand at least what it's doing.

  • when a service is OK, the notification is still calculating a next_notification time based on the notification interval.

  • when a service has a problem, IF the next_notification at that moment comes AFTER the begin time, then the next_notification is updated according to the begin time, and the notification goes out on time, as expected.

  • However, IF the next_notification comes BEFORE the begin time, then next_notification is not changed until that time arrives, at which point it is set to that moment plus the begin time.

So, if begin time is 10 minutes, the interval is 15 minutes, and a service goes critical 14 minutes before its currently listed "next_notification"...then the notification goes out at 10 minutes.

If that same service is 5 minutes from the "next" notification when it goes critical, then the notification goes out at 15 minutes instead of 10 minutes. (5 remaining minutes + the 10 minutes "begin" time).
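The scheduling behaviour in the bullets above can be condensed into a small arithmetic sketch. This is my own simplified model of the logic, not Icinga's actual code, and the function name is invented; all times are minutes relative to the moment the service goes critical ("now" = 0):

```cpp
#include <cassert>

// Simplified model of the buggy scheduling described above -- NOT Icinga's
// actual implementation. Returns the delay (in minutes) until the first
// notification, given where next_notification currently sits and times.begin.
double firstNotificationDelay(double nextNotification, double begin) {
    double proposed = 0.0 + begin;    // now + times.begin
    if (nextNotification > proposed)  // the scheduler only moves the
        return proposed;              // notification time *earlier*
    // Otherwise next_notification is left untouched; once that moment
    // arrives, the begin delay is applied again on top of it.
    return nextNotification + begin;
}
```

With begin = 10 and next_notification 14 minutes away, this yields 10 (on time); with next_notification only 5 minutes away it yields 15, matching the delayed behaviour described above.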

@mdetrano

I dug into the source code a bit and believe I have found where the problem is located. In lib/icinga/notification.cpp, in the function void Notification::BeginExecuteNotification(...), starting around line 293 or so, there are these instructions for setting up the next notification based on times.begin:

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I have tweaked this a bit and could at least make it work with either of a couple of changes. I could remove the conditional entirely:

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
SetNextNotification(nextProposedNotification);

Or I could offset the begin time against the last state change instead of "now" (which could be "later", if the condition prevented the SetNextNotification on the first try).

/* we need to adjust the next notification time
* to now + begin delaying the first notification  (well, maybe not "now" after all)
*/

double nextProposedNotification = checkable->GetLastHardStateChange() + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I am not sure why the condition (GetNextNotification() > nextProposedNotification) is taken into consideration, but suspect there is a good reason for that. (re-notifications for previous events?)

In any case, though, shouldn't the "begin" time of a notification be referencing the object's hard state change time, instead of "now" ?

@mdetrano

I have researched this issue a bit more and just submitted a pull request for some changes which I think better address the problem.

The last hard state change time isn't necessarily the best moment to base the begin and end times on: a notification set up for, say, warning and critical states will have those times pushed back whenever the service flips between critical and warning before they are reached. This could delay notifications indefinitely and lead to some uncomfortable questions from admins who were never notified of the problem.

In the pull request, I implemented a field to track the actual trigger time of the notification (when it first goes into an applicable hard problem state). That trigger time is reset on recovery or on changing to a non-applicable state. It's a small change in the behaviour of the system, so it should be carefully reviewed, but I believe most users would expect and want the Notifications to act like this, especially if they want a specific layered escalation system for handling problems.
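Roughly, the idea reads like this (an illustrative sketch of the behaviour described above, not the actual pull-request code; the type and member names are invented):

```cpp
// Illustrative model of the proposed fix -- not the actual PR code.
// The notification remembers when the checkable first entered a hard
// problem state it applies to, and times.begin is anchored on that
// trigger time instead of on "now".
struct NotificationModel {
    double triggerTime = -1.0;     // -1 means "not triggered yet"

    void onHardStateChange(double now, bool applicableProblemState) {
        if (!applicableProblemState)
            triggerTime = -1.0;    // recovery / non-applicable state: reset
        else if (triggerTime < 0.0)
            triggerTime = now;     // first applicable hard problem state
        // flipping between applicable problem states (e.g. critical ->
        // warning) keeps the original trigger time
    }

    double firstNotificationTime(double begin) const {
        return triggerTime + begin;
    }
};
```

A service that goes critical at t = 0 and flips to warning at t = 3 (both applicable) still gets its first notification at 0 + begin; only a recovery (or a non-applicable state) resets the clock.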

@dnsmichi (Contributor) commented Jan 8, 2018

refs #5890

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels May 9, 2018
@Al2Klimov Al2Klimov self-assigned this Jan 11, 2019
@dnsmichi dnsmichi added this to the 2.11.0 milestone Feb 22, 2019
@dnsmichi dnsmichi modified the milestones: 2.11.0, 2.10.4 Mar 19, 2019
dnsmichi pushed a commit that referenced this issue Mar 19, 2019