[dev.icinga.com #3441] wrong escalation notification due to state based escalation range behaviour changes #1165

Closed
icinga-migration opened this issue Nov 13, 2012 · 14 comments

@icinga-migration

commented Nov 13, 2012

This issue has been migrated from Redmine: https://dev.icinga.com/issues/3441

Created by fmbiete on 2012-11-13 17:18:07 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2012-11-28 15:11:20 +00:00)
Target Version: 1.8.2
Last Update: 2012-11-28 15:11:20 +00:00 (in Redmine)

Icinga Version: 1.8.1
OS Version: Debian Squeeze

Hi,

I am running some tests with notification escalation.

1st notification goes to a dummy contact
3rd notification goes to a level 1 contact
10th notification goes to a level 2 contact
15th notification goes to a level 3 contact

We are seeing a warning notification sent to level 1.
Before the level 2 limit is reached, the problem gets resolved.
A recovery notification is then sent to level 1, level 2 and level 3.

Why??

Icinga 1.7.2 didn't have that problem.

Specs:
Debian Squeeze 32 bits
Icinga 1.8.1 + IDOUtils
Mysql 5.5

Changesets

2012-11-28 14:37:56 +00:00 by mfriedrich a881643

core: fix wrong escalation notification due to state based escalation range behaviour changes

re-enabling the state based escalation ranges led to a weird
behavioural change, as the general "is the escalation valid for a
notification" condition was met, but another filter was added (the state
checks and their counters).
Since the default users do not use state based escalation ranges, there
is no other way of revoking that behaviour change than making this fully
optional, and reverting to the old known default behaviour by
introducing a new config option, which remains disabled by default.

enable_state_based_escalation_ranges=0

this may not be the best idea within a bugfix release either, but still
it allows those actually wanting to use the state based escalation
ranges to use it without recompiling as we had the request to change
within #2878 already.

reverting to the old known behaviour will probably fix #3441 as well, as
it turns out to be the possible root cause for the faulty condition
checks when an escalation is valid for a notification.

refs #2878
refs #3441

@icinga-migration

commented Nov 13, 2012

Updated by fmbiete on 2012-11-13 17:21:18 +00:00

host-name service-name OK 13-11-2012 18:04:27 fmbiete-n1 service-notify-email 80.54
host-name service-name OK 13-11-2012 18:04:25 fmbiete-n2 service-notify-xmpp 80.54
host-name service-name OK 13-11-2012 18:04:19 fmbiete-n3 service-notify-gtalk 80.54
host-name service-name CRITICAL 13-11-2012 18:03:24 contacto-dummy service-notify-dummy 100.32
host-name service-name WARNING 13-11-2012 17:58:09 contacto-dummy service-notify-dummy 99.57

@icinga-migration

commented Nov 25, 2012

Updated by mfriedrich on 2012-11-25 12:45:34 +00:00

  • Status changed from New to Feedback

any test configs and/or debug logs for that? it possibly requires more debugging, so don't expect it to be fixed within 1.8.2 as this is already in the release cycle.

#2878 might be related to this one. i will try to debug it myself next week once i have a better connection.
@icinga-migration

commented Nov 25, 2012

Updated by fmbiete on 2012-11-25 17:19:55 +00:00

Config: add some hosts to the hostgroup domain-routers-cpds, and contacts to the contact groups.

# Service Templates
define service {
        name       pnp-svc
        register   0
        action_url /pnp4nagios/graph?host=$HOSTNAME$&srv=$SERVICEDESC$' class='tips' rel='/pnp4nagios/popup?host=$HOSTNAME$&srv=$SERVICEDESC$
}

define service {
        name                            domain-service
        check_interval                  5
        retry_interval                  1
        max_check_attempts              3
        notification_interval           5
        notification_options            w,c,r,s
        contact_groups                  dummy-group ;notification: dummy echo
        process_perf_data               1
        register                        0
}

define service {
        name                            domain-service-pnp
        use                             pnp-svc
        process_perf_data               1
        register                        0
}
define service {
        name                            domain-service-24x7
        use                             domain-service
        check_period                    24x7
        notification_period             24x7
        register                        0
}

define service {
        name                            domain-service-passive-24x7
        use                             domain-service-24x7
        active_checks_enabled           0
        passive_checks_enabled          1
        initial_state                   o
        max_check_attempts              1
        check_command                   check_dummy!0 ;always return ok
        check_freshness                 0
        register                        0
}

define service {
        name                            domain-service-24x7-5
        use                             domain-service-24x7
        check_interval                  5
        notification_interval           5
        register                        0
}
define service {
        name                            domain-service-passive-24x7-5
        use                             domain-service-24x7-5,domain-service-passive-24x7
        register                        0
}

define service {
        name                            domain-service-passive-pnp-24x7-5
        use                             domain-service-passive-24x7-5,domain-service-pnp
        register                        0
}


# Service itself
define service {
        use                             domain-service-passive-pnp-24x7-5
        hostgroup_name                  domain-routers-cpds
        servicegroups                   domain-escalation
        service_description             Service Name
}


# Service group for escalation
define servicegroup {
        servicegroup_name       domain-escalation
        alias                   ServiceGroup Name
        register                0
}


# Escalation steps
define serviceescalation {
        servicegroup_name       domain-escalation
        first_notification      3
        last_notification       0
        contact_groups          domain-level1
}

define serviceescalation {
        servicegroup_name       domain-escalation
        first_notification      10
        last_notification       0
        contact_groups          domain-level2
}

define serviceescalation {
        servicegroup_name       domain-escalation
        first_notification      20
        last_notification       0
        contact_groups          domain-level3
}
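
For reference, the count-based ranges above can be sketched as follows. This is a simplified model, not Icinga's actual implementation: the `Escalation` class and the `matching_groups` helper are hypothetical names, and the sketch assumes the usual meaning of `first_notification` and `last_notification`, where `last_notification 0` means no upper limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    """Hypothetical stand-in for one serviceescalation definition."""
    first_notification: int
    last_notification: int  # 0 means "no upper limit"
    contact_group: str


# The three escalation steps from the config above.
ESCALATIONS = [
    Escalation(3, 0, "domain-level1"),
    Escalation(10, 0, "domain-level2"),
    Escalation(20, 0, "domain-level3"),
]


def matching_groups(notification_number, escalations=ESCALATIONS):
    """Return the contact groups whose range covers notification_number."""
    groups = []
    for esc in escalations:
        if notification_number < esc.first_notification:
            continue  # this escalation starts later
        if esc.last_notification != 0 and notification_number > esc.last_notification:
            continue  # this escalation has already ended
        groups.append(esc.contact_group)
    return groups


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 20):
        print(n, matching_groups(n))
```

Under this model no escalation matches at notification number 1 or 2, so only the service's own dummy-group would be expected to be notified at that point.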
@icinga-migration

commented Nov 25, 2012

Updated by fmbiete on 2012-11-25 17:29:21 +00:00

Log: Warning is set to 95, Critical to 100

[1353690732] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;0;93.32|mbps=93.32;95;100
[1353690805] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;0;93.32
[1353690846] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;1;97.76|mbps=97.76;95;100
[1353690853] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;1;97.76
[1353690853] SERVICE ALERT: hostname.domain;Bandwidth;WARNING;SOFT;1;97.76
[1353690966] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;1;98.86|mbps=98.86;95;100
[1353690966] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;1;98.86
[1353690966] SERVICE ALERT: hostname.domain;Bandwidth;WARNING;SOFT;2;98.86
[1353691086] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;2;102.40|mbps=102.40;95;100
[1353691086] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;2;102.40
[1353691086] SERVICE ALERT: hostname.domain;Bandwidth;CRITICAL;HARD;3;102.40
[1353691087] SERVICE NOTIFICATION: contacto-dummy;hostname.domain;Bandwidth;CRITICAL;domain-service-notify-dummy;102.40
[1353691219] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;2;101.09|mbps=101.09;95;100
[1353691281] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;2;101.09
[1353691339] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;2;102.29|mbps=102.29;95;100
[1353691398] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;2;102.29
[1353691398] SERVICE NOTIFICATION: contacto-dummy;hostname.domain;Bandwidth;CRITICAL;domain-service-notify-dummy;102.29
[1353691448] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;2;100.29|mbps=100.29;95;100
[1353691448] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;2;100.29
[1353691567] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;0;91.25|mbps=91.25;95;100
[1353691567] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;0;91.25
[1353691567] SERVICE ALERT: hostname.domain;Bandwidth;OK;HARD;3;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact1-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact2-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact3-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact4-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact5-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact1-n2;hostname.domain;Bandwidth;OK;domain-service-notify-gtalk;91.25
[1353691569] SERVICE NOTIFICATION: domain-contact2-n2;hostname.domain;Bandwidth;OK;domain-service-notify-gtalk;91.25
[1353691570] SERVICE NOTIFICATION: domain-contact3-n2;hostname.domain;Bandwidth;OK;domain-service-notify-xmpp;91.25
[1353691572] SERVICE NOTIFICATION: domain-contact1-n3;hostname.domain;Bandwidth;OK;domain-service-notify-gtalk;91.25
[1353691683] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;0;92.91|mbps=92.91;95;100
[1353691683] PASSIVE SERVICE CHECK: hostname.domain;Bandwidth;0;92.91
[1353691819] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hostname.domain;Bandwidth;0;84.85|mbps=84.85;95;100
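
To make the pattern in a log like this easier to see, the notification lines can be grouped by state with a small script. This is only a sketch: `parse_notifications` and `contacts_by_state` are hypothetical helpers, and the field order (contact;host;service;state;command;output) is assumed from the lines above.

```python
import re
from collections import defaultdict

# Matches "[timestamp] SERVICE NOTIFICATION: <semicolon-separated fields>".
LINE_RE = re.compile(r"^\[(\d+)\] SERVICE NOTIFICATION: (.*)$")


def parse_notifications(lines):
    """Yield (timestamp, contact, host, service, state, command) tuples."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a service notification line
        fields = match.group(2).split(";")
        if len(fields) < 5:
            continue  # not the field layout assumed here
        contact, host, service, state, command = fields[:5]
        yield int(match.group(1)), contact, host, service, state, command


def contacts_by_state(lines):
    """Map each notified state to its contacts, in order of first appearance."""
    result = defaultdict(list)
    for _ts, contact, _host, _service, state, _command in parse_notifications(lines):
        if contact not in result[state]:
            result[state].append(contact)
    return dict(result)


# A few lines copied from the log above.
SAMPLE = """\
[1353691087] SERVICE NOTIFICATION: contacto-dummy;hostname.domain;Bandwidth;CRITICAL;domain-service-notify-dummy;102.40
[1353691398] SERVICE NOTIFICATION: contacto-dummy;hostname.domain;Bandwidth;CRITICAL;domain-service-notify-dummy;102.29
[1353691567] SERVICE ALERT: hostname.domain;Bandwidth;OK;HARD;3;91.25
[1353691567] SERVICE NOTIFICATION: domain-contact1-n1;hostname.domain;Bandwidth;OK;domain-service-notify-email;91.25
[1353691572] SERVICE NOTIFICATION: domain-contact1-n3;hostname.domain;Bandwidth;OK;domain-service-notify-gtalk;91.25
""".splitlines()

if __name__ == "__main__":
    for state, contacts in contacts_by_state(SAMPLE).items():
        print(state, contacts)
```

On the sample lines, the CRITICAL notifications go only to contacto-dummy, while the OK notification also reaches contacts with the -n1 and -n3 suffixes.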
@icinga-migration

commented Nov 25, 2012

Updated by mfriedrich on 2012-11-25 23:16:14 +00:00

  • File added 0001-add-enable_state_based_escalation_ranges-and-disable.patch

can you test the attached git patch? it applies on top of 'mfriedrich/core' and likely 'next' too. it reverts some changes and makes the state based escalation ranges an optional filter within the "is the escalation valid for a notification" checks. i haven't tested it yet, as i am lacking the time to do so.

@icinga-migration

commented Nov 25, 2012

Updated by mfriedrich on 2012-11-25 23:25:11 +00:00

  • Status changed from Feedback to Assigned
  • Assigned to set to mfriedrich
@icinga-migration

commented Nov 26, 2012

Updated by fmbiete on 2012-11-26 09:52:27 +00:00

I have applied the patch without enabling the new parameter

enable_state_based_escalation_ranges=0

I will post the results.

Thank you very much

@icinga-migration

commented Nov 26, 2012

Updated by mfriedrich on 2012-11-26 13:08:15 +00:00

  • Subject changed from Wrong escalation notification to wrong escalation notification due to state based escalation range behaviour changes
  • Category set to Escalations
  • Target Version set to 1.8.2

it's likely the state based behaviour change, so the revert to the disabled default should fix it. but as usual, test it until tuesday night, so it can be added to 1.8.2

@icinga-migration

commented Nov 26, 2012

Updated by alexbrueckel on 2012-11-26 14:39:22 +00:00

I've applied the patch to a dev system and now the original problem seems to be solved.

But: as far as i can see, if notifications escalate, only the last escalation group gets the recovery, the normal contact and escalation groups in between get nothing after critical.

@icinga-migration

commented Nov 26, 2012

Updated by fmbiete on 2012-11-26 17:47:06 +00:00

That seems to fix the problem in my system.

The recovery is sent only to the last level in the escalation.

Thank you very much

@icinga-migration

commented Nov 28, 2012

Updated by mfriedrich on 2012-11-28 14:25:31 +00:00

  • File deleted 0001-add-enable_state_based_escalation_ranges-and-disable.patch
@icinga-migration

commented Nov 28, 2012

Updated by mfriedrich on 2012-11-28 14:35:41 +00:00

alexbrueckel wrote:

But: as far as i can see, if notifications escalate, only the last escalation group gets the recovery, the normal contact and escalation groups in between get nothing after critical.

that should be reproduced and tracked in a separate issue; please report it there and add all relevant test configs, logs, tests, etc.

i've added 3441.cfg to the issue, which contains part of the reporter's config but works with the default 'make install-testconfig'.

thanks for the tests; the fix will then go into 1.8.2

@icinga-migration

commented Nov 28, 2012

Updated by mfriedrich on 2012-11-28 14:41:37 +00:00

  • Status changed from Assigned to 7
  • Done % changed from 0 to 100
@icinga-migration

commented Nov 28, 2012

Updated by mfriedrich on 2012-11-28 15:11:20 +00:00

  • Status changed from 7 to Resolved