
When the notification's times.begin is not 0, the first notification has an extra delay #5561

Closed
silenceJI opened this issue Sep 7, 2017 · 5 comments · Fixed by #6896
Labels
area/notifications Notification events bug Something isn't working
@silenceJI

Expected Behavior

Current Behavior

There are three kinds of notifications configured for each of my monitored services:

mail-notification:

object Notification "mail-notification" {
	import "mail-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 0.000000
	}
	users = [ "mika.liu", "yufeng.ji", "kai.duan", "liangliang.pan", "zhangj.zhang", "jiwei.zhao", "yinan.gao", "yankun.zhu" ]
	version = 1504145701.727930
	zone = "master"
}

push-notification:

object Notification "push-notification" {
	import "push-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 0.000000
	}
	users = [ "qunaropspush" ]
	vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
	version = 1504145701.562974
	zone = "master"
}

ivr-notification:

object Notification "ivr-notification" {
	import "ivr-service-notification"

	host_name = "l-jyftest1.ops.beta.cn0"
	interval = 300.000000
	service_name = "check_load"
	times = {
		begin = 900.000000
	}
	users = [ "qunaropsivr" ]
	vars["userlist"] = "mika.liu,yufeng.ji,kai.duan,liangliang.pan,zhangj.zhang,jiwei.zhao,yinan.gao,yankun.zhu"
	version = 1504145701.873894
	zone = "master"
}

What we have observed: the first ivr-notification is sent with an extra delay, i.e. later than the time configured in "times.begin".

Possible Solution

With "times.begin = 900", the first notification is currently delayed by more than a minute. If there is no way to fix this, we may have to work around it by configuring the begin time as "times.begin = 900" - 60s.

Steps to Reproduce (for bugs)

As the attached screenshot of the notification history shows:

[screenshot: notification history]

Theoretically, the first "ivr-notification" should have been sent at 19:39:51 + 15min = 19:54:51, but it was actually sent at 19:56:10.

Delay: 19:56:10 - 19:54:51 = 1min 19s

Context

Your Environment

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.7.0-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-514.26.2.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown
  • Operating System and version:
CentOS Linux release 7.3.1611 (Core)
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog gelf graphite ido-pgsql influxdb livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql mainlog notification
  • Icinga Web 2 version and modules (System - About):
  • Config validation (icinga2 daemon -C):
icinga2 daemon -C
information/cli: Icinga application loader (version: r2.7.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
warning/Zone: The Zone object 'zone-cn0' has more than two endpoints. Due to a known issue this type of configuration is strongly discouraged and may cause Icinga to use excessive amounts of CPU time.
information/ApiListener: My API identity: l-cn0qicingam1.ops.cn2
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 3 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 5 Endpoints.
information/ConfigItem: Instantiated 1 UserGroup.
information/ConfigItem: Instantiated 41614 Notifications.
information/ConfigItem: Instantiated 4 NotificationCommands.
information/ConfigItem: Instantiated 274 CheckCommands.
information/ConfigItem: Instantiated 124 Downtimes.
information/ConfigItem: Instantiated 2 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 5581 Hosts.
information/ConfigItem: Instantiated 10458 Users.
information/ConfigItem: Instantiated 3 TimePeriods.
information/ConfigItem: Instantiated 51407 Services.
information/ConfigItem: Instantiated 4 ServiceGroups.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
/*
 * Generated by Icinga 2 node setup commands
 * on 2017-01-09 17:12:48 +0800
 */

object Endpoint "l-cn0qicingam1.ops.cn2" {
	host = "10.90.0.81"
	log_duration = 5m
}

object Endpoint "l-cn0qicingam2.ops.cn2" {
	host = "10.90.0.82"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan1.ops.cn2" {
	host = "10.90.0.85"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan2.ops.cn2" {
	host = "10.90.0.86"
	log_duration = 5m
}

object Endpoint "l-cn0qicingan3.ops.cn2" {
	host = "10.90.0.84"
	log_duration = 5m
}

object Zone "master" {
	endpoints = [ "l-cn0qicingam1.ops.cn2", "l-cn0qicingam2.ops.cn2" ]
}

object Zone "zone-cn0" {
	endpoints = [ "l-cn0qicingan1.ops.cn2", "l-cn0qicingan2.ops.cn2", "l-cn0qicingan3.ops.cn2" ]
	parent = "master"
}

/* sync global commands */
object Zone "global-templates" {
	global = true
}
@dnsmichi (Contributor) commented Sep 7, 2017

Please fetch the runtime information for the ivr-notification from the REST API /v1/objects/notifications/ at several points in time:

  • when the service is OK
  • when the service is NOT-OK and in its first hard state
  • after the notification was actually sent

@dnsmichi dnsmichi added needs feedback We'll only proceed once we hear from you again area/notifications Notification events labels Sep 7, 2017
@mdetrano

I am seeing the same issue with initial notification delays (later than "begin" time). I used the REST API to monitor the notification objects and I believe I understand at least what it's doing.

  • when a service is OK, the notification is still calculating a next_notification time based on the notification interval.

  • when a service has a problem, IF the next_notification at that moment comes AFTER the begin time, then the next_notification is updated according to the begin time, and the notification goes out on time, as expected.

  • However, IF the next_notification comes BEFORE the begin time, then next_notification is not changed until that time arrives, at which point it is set to that moment plus the begin time.

So, if begin time is 10 minutes, the interval is 15 minutes, and a service goes critical 14 minutes before its currently listed "next_notification"...then the notification goes out at 10 minutes.

If that same service is 5 minutes from the "next" notification when it goes critical, then the notification goes out at 15 minutes instead of 10 minutes. (5 remaining minutes + the 10 minutes "begin" time).
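The scheduling behaviour in the bullets above can be condensed into a small arithmetic sketch. This is my own simplified model of the logic, not Icinga's actual code, and the function name is invented; all times are minutes relative to the moment the service goes critical ("now" = 0):

```cpp
#include <cassert>

// Simplified model of the buggy scheduling described above -- NOT Icinga's
// actual implementation. Returns the delay (in minutes) until the first
// notification, given where next_notification currently sits and times.begin.
double firstNotificationDelay(double nextNotification, double begin) {
    double proposed = 0.0 + begin;    // now + times.begin
    if (nextNotification > proposed)  // the scheduler only moves the
        return proposed;              // notification time *earlier*
    // Otherwise next_notification is left untouched; once that moment
    // arrives, the begin delay is applied again on top of it.
    return nextNotification + begin;
}
```

With begin = 10 and next_notification 14 minutes away, this yields 10 (on time); with next_notification only 5 minutes away it yields 15, matching the delayed behaviour described above.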

@mdetrano

I dug into the source code a bit and believe I have found where the problem is located. In lib/icinga/notification.cpp, in the function void Notification::BeginExecuteNotification(...), starting around line 293 or so, there are these instructions for setting up the next notification based on times.begin:

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I have tweaked this a bit and could at least make it work with either of a couple of changes. I could remove the conditional entirely:

/* we need to adjust the next notification time
* to now + begin delaying the first notification
*/

double nextProposedNotification = now + timesBegin + 1.0;
SetNextNotification(nextProposedNotification);

Or I could offset the begin time against the last state change instead of "now" (which could be "later", if the condition prevented the SetNextNotification on the first try).

/* we need to adjust the next notification time
* to now + begin delaying the first notification  (well, maybe not "now" after all)
*/

double nextProposedNotification = checkable->GetLastHardStateChange() + timesBegin + 1.0;
if (GetNextNotification() > nextProposedNotification)
    SetNextNotification(nextProposedNotification);

I am not sure why the condition (GetNextNotification() > nextProposedNotification) is taken into consideration, but suspect there is a good reason for that. (re-notifications for previous events?)

In any case, though, shouldn't the "begin" time of a notification be referencing the object's hard state change time, instead of "now" ?

@mdetrano

I have researched this issue a bit more and just submitted a pull request for some changes which I think better address the problem.

The last hard state change time isn't necessarily the best moment to base the begin and end times on: a notification set up for, say, warning and critical states will have those times pushed back whenever the service flips between critical and warning before they are reached. This could delay notifications indefinitely and lead to some uncomfortable questions from admins who were never notified of the problem.

In the pull request, I implemented a field to track the actual trigger time of the notification (when it first goes into an applicable hard problem state). That trigger time is reset on recovery or on changing to a non-applicable state. It's a small change in the behaviour of the system, so it should be carefully reviewed, but I believe most users would expect and want the Notifications to act like this, especially if they want a specific layered escalation system for handling problems.
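Roughly, the idea reads like this (an illustrative sketch of the behaviour described above, not the actual pull-request code; the type and member names are invented):

```cpp
// Illustrative model of the proposed fix -- not the actual PR code.
// The notification remembers when the checkable first entered a hard
// problem state it applies to, and times.begin is anchored on that
// trigger time instead of on "now".
struct NotificationModel {
    double triggerTime = -1.0;     // -1 means "not triggered yet"

    void onHardStateChange(double now, bool applicableProblemState) {
        if (!applicableProblemState)
            triggerTime = -1.0;    // recovery / non-applicable state: reset
        else if (triggerTime < 0.0)
            triggerTime = now;     // first applicable hard problem state
        // flipping between applicable problem states (e.g. critical ->
        // warning) keeps the original trigger time
    }

    double firstNotificationTime(double begin) const {
        return triggerTime + begin;
    }
};
```

A service that goes critical at t = 0 and flips to warning at t = 3 (both applicable) still gets its first notification at 0 + begin; only a recovery (or a non-applicable state) resets the clock.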

@dnsmichi (Contributor) commented Jan 8, 2018

refs #5890

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels May 9, 2018
@Al2Klimov Al2Klimov self-assigned this Jan 11, 2019
@dnsmichi dnsmichi added this to the 2.11.0 milestone Feb 22, 2019
@dnsmichi dnsmichi modified the milestones: 2.11.0, 2.10.4 Mar 19, 2019
dnsmichi pushed a commit that referenced this issue Mar 19, 2019