Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga 2.12.3 is sending notifications twice for BPM Checks if timeperiod is starting #8667

Closed
tuxracer1337 opened this issue Mar 5, 2021 · 8 comments · Fixed by #8708
Closed
Assignees
Milestone

Comments

@tuxracer1337
Copy link

Describe the bug

Since we upgraded from icinga 2.11.5 to Icinga 2.12.3 we have a new behavior for notifications by BPM Checks.

If the check changes from Ok to CRIT outside of time period and then the notification time period beginns, Icinga send all notifications twice in less than 1 minute for the BPM checks.

In the logs we can see one notification is triggert as "Sending reminder 'Problem'"....
and the second notification is triggert as "Sending 'Problem'"....
This is new in icinga 2.12.3, beacuse we verified it with older logs and there is only "Sending reminder 'Problem'....." and not the second one.

The following double log entries we have for each user we want to send notifications for this check.

[2021-03-03 07:15:00 +0100] information/Notification: Sending reminder 'Problem' notification 'DUMMYHOST!BPM-CHECK!N-TICKETSYSTEM-SERVICES' for user 'ticket'
[2021-03-03 07:15:00 +0100] information/Notification: Completed sending 'Problem' notification ''DUMMYHOST!BPM-CHECK!N-TICKETSYSTEM-SERVICES' for checkable ''DUMMYHOST!BPM-CHECK!N-TICKETSYSTEM-SERVICES' for user 'ticket' using command 'NC-TICKET-SERVICES'.
[2021-03-03 07:15:16 +0100] information/Notification: Sending 'Problem' notification 'DUMMYHOST!BPM-CHECK!N-TICKETSYSTEM-SERVICES' for user 'ticket'
[2021-03-03 07:15:16 +0100] information/Notification: Completed sending 'Problem' notification 'DUMMYHOST!BPM-CHECK!N-TICKETSYSTEM-SERVICES' for user 'ticket' using command 'NC-TICKET-SERVICES''.

And it isn't also related to a notification command, because all defined notification commands have the same behavior.

To Reproduce

'/usr/bin/icingacli' 'businessprocess' 'process' 'check' 'BPM-CHECK' '--blame' '--config' 'BPM_CHECK-CONFIG' '--downtime-is-ok' '--root-cause' '--state-type' 'hard'
object Service "BPM-CHECK" {
    host_name = "DUMMYHOST"
    import "SN-generic-service-template"
    check_command = "CC-icingacli-businessprocess"
    max_check_attempts = "16"
    check_interval = 1m
    retry_interval = 1m
    enable_perfdata = false
    command_endpoint = "icingawebsystem"
    vars.icingacli_businessprocess_config = "MY-CONFIG"
    vars.icingacli_businessprocess_process = "MY-PROCESS"
    vars.icingacli_businessprocess_statetype = "hard"
    vars.notification_ticket = "TICKET_SYSTEM_NOTIFICATION_GROUP"
    vars.notification_ticket_newtickets = true
    vars.notification_mail = [ "USER_GROUP1", "USER_GROUP2" ]
    vars.notification_period = {
        cism = "T-SERVICETIMES"
        mail = "T-SERVICETIMES"
        sms = "T-SERVICETIMES"
    }
    vars.service_info = "Check fancy BPM stuff"
}
object TimePeriod "T-SERVICETIMES" {
  import "legacy-timeperiod"
  excludes = [ "T-FeiertageBRD" ]
  display_name = "Workinghours of ServiceCenter"
  ranges = {
    "monday"    = "07:15-20:00"
    "tuesday"   = "07:15-20:00"
    "wednesday" = "07:15-20:00"
    "thursday"  = "07:15-20:00"
    "friday"    = "07:15-20:00"
    "saturday"  = "07:15-16:00"
  }
}

Environment

  • Version used (icinga2 --version): 2.12.3
  • Operating System and version: Masters and Satellites run on SLES12.5
  • Enabled features (icinga2 feature list): api checker ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About):
  Icingaweb | 2.7.4
  businessprocess | 2.2.0
  cube | 1.0.1
  director | 1.5.1
  doc | 2.7.4
  fileshipper | 1.0.1 
  grafana | 1.2.5
  ipl | v0.1.1 
  monitoring | 2.7.4
  nagvis | 1.1.1
@julianbrost julianbrost self-assigned this Mar 8, 2021
@julianbrost
Copy link
Contributor

I tried to reproduce this and did not succeed so far. Can you please also share the corresponding notification object/apply rule?

@tuxracer1337
Copy link
Author

tuxracer1337 commented Mar 10, 2021

Sure - here we are.

{
    "results": [
        {
            "attrs": {
                "__name": "DUMMYHOST!BPM-CHECK!NOTIFICATION-TICKET-SERVICES",
                "active": true,
                "command": "NC-TICKET -SERVICES",
                "command_endpoint": "",
                "ha_mode": 0.0,
                "host_name": "DUMMYHOST",
                "interval": 0.0,
                "last_notification": 1614840674.895214,
                "last_problem_notification": 1614838508.164547,
                "name": "NOTIFICATION-TICKET-SERVICES",
                "next_notification": 1615384635.623886,
                "no_more_notifications": false,
                "notification_number": 0.0,
                "notified_problem_users": [],
                "original_attributes": null,
                "package": "director",
                "paused": false,
                "period": "T-SERVICETIMES",
                "service_name": "BPM-CHECK",
                "source_location": {
                    "first_column": 0.0,
                    "first_line": 1.0,
                    "last_column": 46.0,
                    "last_line": 1.0,
                    "path": "/var/lib/icinga2/api/packages/director/db9e3cf4-d828-46b3-bc33-caabd55334f9/zones.d/ZONE/notifications/N-TICKET-SERVICES.conf"
                },
                "states": [
                    "OK",
                    "Critical"
                ],
                "templates": [
                    "N-TICKET-SERVICES",
                    "NT-TICKET-SERVICES"
                ],
                "times": null,
                "type": "Notification",
                "types": [
                    "Problem",
                    "Recovery"
                ],
                "user_groups": null,
                "users": [
                    "TICKET"
                ],
                "vars": {
                    "notification_ticket_newtickets": true,
                    "notification_ticketgroup": "FANCY_TICKET_GROUP",
                    "notification_priority": "1"
                },
                "version": 0.0,
                "zone": "PROD-ZONE"
            },
            "joins": {},
            "meta": {},
            "name": "DUMMYHOST!BPM-CHECK!N-TICKET-SERVICES",
            "type": "Notification"
        },
        {
            "attrs": {
                "__name": "DUMMYHOST!BPM-CHECK!N-SERVICES4USERS-MAIL",
                "active": true,
                "command": "NC-htmlmail-service",
                "command_endpoint": "",
                "ha_mode": 0.0,
                "host_name": "DUMMYHOST",
                "interval": 0.0,
                "last_notification": 1614840674.894563,
                "last_problem_notification": 1614838508.170979,
                "name": "N-SERVICES4USERS-MAIL",
                "next_notification": 1615384635.629311,
                "no_more_notifications": false,
                "notification_number": 0.0,
                "notified_problem_users": [],
                "original_attributes": null,
                "package": "director",
                "paused": false,
                "period": "T-SERVICECENTER",
                "service_name": "BPM-CHECK",
                "source_location": {
                    "first_column": 1.0,
                    "first_line": 9.0,
                    "last_column": 53.0,
                    "last_line": 9.0,
                    "path": "/var/lib/icinga2/api/packages/director/db9e3cf4-d828-46b3-bc33-caabd55334f9/zones.d/ZONE/notifications/N-Services-Users.conf"
                },
                "states": [
                    "OK",
                    "Critical"
                ],
                "templates": [
                    "N-SERVICES4USERS-MAIL",
                    "NT-MAIL-SERVICE"
                ],
                "times": {
                    "begin": 30.0
                },
                "type": "Notification",
                "types": [
                    "Problem",
                    "Recovery"
                ],
                "user_groups": [
                    "USER_GROUP1",
                    "USER_GROUP2"
                ],
                "users": null,
                "vars": {
                    "notification_ticket": true,
                    "notification_name": "N-TEST-SERVICES4USERS-MAIL"
                },
                "version": 0.0,
                "zone": "PROD-ZONE"
            },
            "joins": {},
            "meta": {},
            "name": "DUMMYHOST!BPM-CHECK!N-SERVICES4USERS-MAIL",
            "type": "Notification"
        },
        {
            "attrs": {
                "__name": "DUMMYHOST!BPM-CHECK!N-SERVICES4SMS",
                "active": true,
                "command": "NC-SMS-SERVICE",
                "command_endpoint": "",
                "ha_mode": 0.0,
                "host_name": "DUMMYHOST",
                "interval": 0.0,
                "last_notification": 1614840674.894931,
                "last_problem_notification": 1614838508.162016,
                "name": "N-SERVICES4SMS",
                "next_notification": 1615384635.638462,
                "no_more_notifications": false,
                "notification_number": 0.0,
                "notified_problem_users": [],
                "original_attributes": null,
                "package": "director",
                "paused": false,
                "period": "T-SERVICETIMES",
                "service_name": "BPM-CHECK",
                "source_location": {
                    "first_column": 1.0,
                    "first_line": 9.0,
                    "last_column": 46.0,
                    "last_line": 9.0,
                    "path": "/var/lib/icinga2/api/packages/director/db9e3cf4-d828-46b3-bc33-caabd55334f9/zones.d/ZONE/notifications/N-Services-SMS.conf"
                },
                "states": [
                    "OK",
                    "Critical"
                ],
                "templates": [
                    "N-SERVICES4SMS",
                    "NT-SMS-SERVICE"
                ],
                "times": null,
                "type": "Notification",
                "types": [
                    "Problem",
                    "Recovery"
                ],
                "user_groups": [
                    "USER_GROUP1"
                ],
                "users": null,
                "vars": null,
                "version": 0.0,
                "zone": "PROD-ZONE"
            },
            "joins": {},
            "meta": {},
            "name": "DUMMYHOST!BPM-CHECK!N-SERVICES4SMS",
            "type": "Notification"
        }
    ]
}

@julianbrost
Copy link
Contributor

I gave this another attempt, still no success with reproducing this.

But there's something that surprises me: all the notification objects you posted have interval = 0, so this should actually disable reminder notifications altogether.

Can you reproduce the issue with Icinga logging at level notice? The notification code should log some additional information at this level that should be helpful to figure out why it sends this notification.

@tuxracer1337
Copy link
Author

@julianbrost

I tried to find something. I enabled the debug log too, but i can't find a helpfull informations.
But i found something light that for many other Checks:

debug.log:[2021-03-18 13:10:24 +0100] notice/NotificationComponent: Reminder notification 'fancyhost!fancycheck!fancynotification': Notification was sent out once and interval=0 disables reminder notifications.

And this one I missed for the check I reported with the first posting.

Maybe it's possible for you to give me a call (Eric Lippmann knows my phone number) to have a look inside.

@julianbrost
Copy link
Contributor

Can you share the debug log if I give you an upload link to our Nextcloud? Looking at these would probably be the first thing I'd to if we had a call anyways. If you can't, can you please instead share all log messages containing the name of an affected notification?

Also, can you answer the following question, then I'd have another look at the source to see if I can figure something out and depending on the result, decide if it makes sense to have a look at this together.

  • The notifications for which you received the log message saying "Notification was sent out once and interval=0 disables reminder notifications": do these use the same configuration as the one(s) that generated the reminder notification in error (i.e. are generated by the same apply rule)?
  • Just to be sure: the service that generated the notification, it has interval = 0 set but did not generate that log message?
  • How often does this happen? Can you reproduce this reliably? Do you have the same notification config for many services and does this happen to some or all of them? If it does not happen always, how often does it and do you notice any patterns, like for example does it always happen for the same few services?

@julianbrost
Copy link
Contributor

And one more question: what's the zone of the service object? From the config snippets I can see that it has command_endpoint set, so is it in the master zone and checks are executed on icingawebsystem which is a direct child of the master zone?

@N-o-X
Copy link
Contributor

N-o-X commented Mar 30, 2021

Still wasn't able to reproduce this, but found a very helpful clue on what is happening:

[2021-03-30 14:40:51 +0200] notice/NotificationComponent: Attempting to send reminder notification 'notification-test-host!mail-testgroups-host'.

I'm pretty sure I've found the cause with #8708.

Before sending a reminder notification (and the mentioned log message), we check if problem notifications are suppressed:

/* Don't send reminder notifications before initial ones. */
if (checkable->GetSuppressedNotifications() & NotificationProblem)
continue;

But here we only check if problem notifications are suppressed on the checkable and not on the notification object itself.

I found only one possible race condition that would run into this issue.
If NotificationComponent::NotificationTimerHandler() gets triggered a few milliseconds before the time period begins, FireSuppressedNotifications() would get the time and not send the notifications, because it's not in the time period. A few lines after that, BeginExecuteNotification() would be called, which again gets the time and checks the time period. If this check happens a few milliseconds after the first one, the reminder notification will be send.

This should be quite hard to hit, but could happen in a very large environment.

@tuxracer1337 I've build custom packages for SLES 12.5, so you can test if the fix resolves your issue:
https://git.icinga.com/packaging/rpm-icinga2/-/jobs/97920/artifacts/browse/build/RPMS/x86_64/

@N-o-X N-o-X assigned N-o-X and unassigned julianbrost Mar 31, 2021
@N-o-X
Copy link
Contributor

N-o-X commented Apr 29, 2021

@tuxracer1337 any news on this? Did the fix resolve it for you?

@icinga-probot icinga-probot bot added this to the 2.13.0 milestone May 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants