WIP: Merge adjacent downtime segments #6579

Closed
wants to merge 7 commits

Conversation

@efuss (Contributor) commented Aug 29, 2018

Put an already running downtime in effect immediately:
If Icinga2 was restarted with a newly configured downtime that should
be in effect at the time of restart, the should-be-running segment of
it was not put into effect.

Merge adjacent downtime segments:
As legacy time periods can't span midnight, a configured downtime
spanning midnight is technically two (immediately adjacent) segments.
As segments were queued individually, at midnight, the downtime
technically ended (sending a DowntimeEnd) only to start immediately
again (sending a DowntimeStart notification).
With this fix, an immediately following segment is merged into the
current one in case the current one is ending soon (where "soon" is
defined as "12 hours or less"). The time limit is arbitrary, but
necessary to prevent endless merging in case of a 7*24 downtime.
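
To make the merge rule concrete, here is a minimal standalone sketch. The Segment alias, the MergeAdjacentSegment() helper and the MergeWindow constant are illustrative assumptions for this sketch only, not the actual Icinga 2 code.

#include <ctime>
#include <iostream>
#include <utility>

using Segment = std::pair<time_t, time_t>; // (begin, end)

constexpr time_t MergeWindow = 12 * 60 * 60; // "soon" = 12 hours or less

// Returns the (possibly extended) end time of the current segment.
time_t MergeAdjacentSegment(const Segment& current, const Segment& next, time_t now)
{
	bool endsSoon = current.second - now <= MergeWindow;
	bool adjacent = next.first == current.second;

	if (endsSoon && adjacent)
		return next.second; // extend instead of ending and restarting

	return current.second; // keep the original end; the next segment is queued separately
}

int main()
{
	time_t now = 0;
	Segment current{now - 3600, now + 3600};              // running, ends in one hour
	Segment next{current.second, current.second + 7200};  // starts exactly at that end

	std::cout << "merged end: " << MergeAdjacentSegment(current, next, now) << "\n"; // prints 10800
}

The 12-hour window mirrors the arbitrary limit mentioned above that prevents endless merging of a 7*24 downtime.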

Note that the diff of scheduleddowntime.cpp looks weird because it
inserts a new FindRunningSegment() function in front of FindNextSegment()
and the initial lines of both functions look sufficiently similar to make
diff believe FindNextSegment() got changed.

Commit: Put running downtimes in effect and merge segments

@mcktr (Member) commented Sep 17, 2018

Test

Create a new ScheduledDowntime which is already running. Ensure that the downtime is in effect immediately.

object Host "test-host-17" {
	address = "127.0.0.1"

	check_command = "icmp"

	check_interval = 1s
	retry_interval = 1s
}

object ScheduledDowntime "test-time-102" {
	host_name = "test-host-17"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-09-17" = "21:12-23:00"
	}
}

To verify that the downtime is immediately in effect I queried the API.

API: https://127.0.0.1:5665/v1/objects/downtimes

The Downtime is created.

{
  "results": [
    {
      "attrs": {
        "__name": "test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684",
        "active": true,
        "author": "icingaadmin",
        "comment": "Some comment",
        "config_owner": "test-host-17!test-time-102",
        "duration": 0.0,
        "end_time": 1537218000.0,
        "entry_time": 1537211501.8049991131,
        "fixed": true,
        "ha_mode": 0.0,
        "host_name": "test-host-17",
        "legacy_id": 1.0,
        "name": "de62b38e-4701-4912-b7e8-4b2ee3ca3684",
        "original_attributes": null,
        "package": "_api",
        "paused": false,
        "scheduled_by": "test-host-17!test-time-102",
        "service_name": "",
        "source_location": {
          "first_column": 0.0,
          "first_line": 1.0,
          "last_column": 69.0,
          "last_line": 1.0,
          "path": "/usr/local/icinga2/var/lib/icinga2/api/packages/_api/af31cc1a-a1fe-4cd2-9295-414903dc953f/conf.d/downtimes/test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684.conf"
        },
        "start_time": 1537211520.0,
        "templates": [
          "de62b38e-4701-4912-b7e8-4b2ee3ca3684"
        ],
        "trigger_time": 1537211521.8073070049,
        "triggered_by": "",
        "triggers": [
          
        ],
        "type": "Downtime",
        "version": 1537211501.805038929,
        "was_cancelled": false,
        "zone": ""
      },
      "joins": {
        
      },
      "meta": {
        
      },
      "name": "test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684",
      "type": "Downtime"
    }
  ]
}

API: https://127.0.0.1:5665/v1/objects/hosts

Notice the downtime_depth attribute. The host has 1 Downtime which is in effect.

{
  "results": [
    {
      "attrs": {
        "__name": "test-host-17",
        "acknowledgement": 0.0,
        "acknowledgement_expiry": 0.0,
        "action_url": "",
        "active": true,
        "address": "127.0.0.1",
        "address6": "",
        "check_attempt": 1.0,
        "check_command": "icmp",
        "check_interval": 1.0,
        "check_period": "",
        "check_timeout": null,
        "command_endpoint": "",
        "display_name": "test-host-17",
        "downtime_depth": 1.0,
        "enable_active_checks": true,
        "enable_event_handler": true,
        "enable_flapping": false,
        "enable_notifications": true,
        "enable_passive_checks": true,
        "enable_perfdata": true,
        "event_command": "",
        "flapping": false,
        "flapping_current": 0.0,
        "flapping_last_change": 0.0,
        "flapping_threshold": 0.0,
        "flapping_threshold_high": 30.0,
        "flapping_threshold_low": 25.0,
        "force_next_check": false,
        "force_next_notification": false,
        "groups": [
          
        ],
        "ha_mode": 0.0,
        "icon_image": "",
        "icon_image_alt": "",
        "last_check": 1537211631.3775560856,
        "last_check_result": {
          "active": true,
          "check_source": "metis",
          "command": [
            "/usr/lib/nagios/plugins/check_icmp",
            "-c",
            "200,15%",
            "-w",
            "100,5%",
            "-H",
            "127.0.0.1"
          ],
          "execution_end": 1537211631.3773798943,
          "execution_start": 1537211631.3726069927,
          "exit_status": 0.0,
          "output": "OK - 127.0.0.1: rta 0.025ms, lost 0%",
          "performance_data": [
            "rta=0.025ms;100.000;200.000;0;",
            "pl=0%;5;15;;",
            "rtmax=0.075ms;;;;",
            "rtmin=0.011ms;;;;"
          ],
          "schedule_end": 1537211631.3775560856,
          "schedule_start": 1537211631.3709609509,
          "state": 0.0,
          "ttl": 0.0,
          "type": "CheckResult",
          "vars_after": {
            "attempt": 1.0,
            "reachable": true,
            "state": 0.0,
            "state_type": 1.0
          },
          "vars_before": {
            "attempt": 1.0,
            "reachable": true,
            "state": 0.0,
            "state_type": 1.0
          }
        },
        "last_hard_state": 0.0,
        "last_hard_state_change": 1537211502.6487550735,
        "last_reachable": true,
        "last_state": 0.0,
        "last_state_change": 1537211502.6487550735,
        "last_state_down": 0.0,
        "last_state_type": 1.0,
        "last_state_unreachable": 0.0,
        "last_state_up": 1537211631.3776059151,
        "max_check_attempts": 3.0,
        "name": "test-host-17",
        "next_check": 1537211632.3776650429,
        "notes": "",
        "notes_url": "",
        "original_attributes": null,
        "package": "_etc",
        "paused": false,
        "retry_interval": 1.0,
        "severity": 1.0,
        "source_location": {
          "first_column": 0.0,
          "first_line": 1.0,
          "last_column": 25.0,
          "last_line": 1.0,
          "path": "/usr/local/icinga2/etc/icinga2/devel/test.conf"
        },
        "state": 0.0,
        "state_type": 1.0,
        "templates": [
          "test-host-17"
        ],
        "type": "Host",
        "vars": null,
        "version": 0.0,
        "volatile": false,
        "zone": ""
      },
      "joins": {
        
      },
      "meta": {
        
      },
      "name": "test-host-17",
      "type": "Host"
    }
  ]
}

Problem

After a while there are multiple Downtimes created.

API: https://127.0.0.1:5665/v1/objects/downtimes

There are multiple entries.

{
  "results": [
    {
      "attrs": {
        "__name": "test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684",
        "active": true,
        "author": "icingaadmin",
        "comment": "Some comment",
        "config_owner": "test-host-17!test-time-102",
        "duration": 0.0,
        "end_time": 1537218000.0,
        "entry_time": 1537211501.8049991131,
        "fixed": true,
        "ha_mode": 0.0,
        "host_name": "test-host-17",
        "legacy_id": 1.0,
        "name": "de62b38e-4701-4912-b7e8-4b2ee3ca3684",
        "original_attributes": null,
        "package": "_api",
        "paused": false,
        "scheduled_by": "test-host-17!test-time-102",
        "service_name": "",
        "source_location": {
          "first_column": 0.0,
          "first_line": 1.0,
          "last_column": 69.0,
          "last_line": 1.0,
          "path": "/usr/local/icinga2/var/lib/icinga2/api/packages/_api/af31cc1a-a1fe-4cd2-9295-414903dc953f/conf.d/downtimes/test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684.conf"
        },
        "start_time": 1537211520.0,
        "templates": [
          "de62b38e-4701-4912-b7e8-4b2ee3ca3684"
        ],
        "trigger_time": 1537211521.8073070049,
        "triggered_by": "",
        "triggers": [
          
        ],
        "type": "Downtime",
        "version": 1537211501.805038929,
        "was_cancelled": false,
        "zone": ""
      },
      "joins": {
        
      },
      "meta": {
        
      },
      "name": "test-host-17!de62b38e-4701-4912-b7e8-4b2ee3ca3684",
      "type": "Downtime"
    },
    {
      "attrs": {
        "__name": "test-host-17!fa08e2ad-8d9d-4de6-977f-b55d12f75b93",
        "active": true,
        "author": "icingaadmin",
        "comment": "Some comment",
        "config_owner": "test-host-17!test-time-102",
        "duration": 0.0,
        "end_time": 1537218000.0,
        "entry_time": 1537211561.805934906,
        "fixed": true,
        "ha_mode": 0.0,
        "host_name": "test-host-17",
        "legacy_id": 2.0,
        "name": "fa08e2ad-8d9d-4de6-977f-b55d12f75b93",
        "original_attributes": null,
        "package": "_api",
        "paused": false,
        "scheduled_by": "test-host-17!test-time-102",
        "service_name": "",
        "source_location": {
          "first_column": 0.0,
          "first_line": 1.0,
          "last_column": 69.0,
          "last_line": 1.0,
          "path": "/usr/local/icinga2/var/lib/icinga2/api/packages/_api/af31cc1a-a1fe-4cd2-9295-414903dc953f/conf.d/downtimes/test-host-17!fa08e2ad-8d9d-4de6-977f-b55d12f75b93.conf"
        },
        "start_time": 1537211520.0,
        "templates": [
          "fa08e2ad-8d9d-4de6-977f-b55d12f75b93"
        ],
        "trigger_time": 1537211561.8158910275,
        "triggered_by": "",
        "triggers": [
          
        ],
        "type": "Downtime",
        "version": 1537211561.8060190678,
        "was_cancelled": false,
        "zone": ""
      },
      "joins": {
        
      },
      "meta": {
        
      },
      "name": "test-host-17!fa08e2ad-8d9d-4de6-977f-b55d12f75b93",
      "type": "Downtime"
    },
    {
      "attrs": {
        "__name": "test-host-17!3176080d-a1f0-4772-8ce1-968cc3e37fba",
        "active": true,
        "author": "icingaadmin",
        "comment": "Some comment",
        "config_owner": "test-host-17!test-time-102",
        "duration": 0.0,
        "end_time": 1537218000.0,
        "entry_time": 1537211621.8196120262,
        "fixed": true,
        "ha_mode": 0.0,
        "host_name": "test-host-17",
        "legacy_id": 3.0,
        "name": "3176080d-a1f0-4772-8ce1-968cc3e37fba",
        "original_attributes": null,
        "package": "_api",
        "paused": false,
        "scheduled_by": "test-host-17!test-time-102",
        "service_name": "",
        "source_location": {
          "first_column": 0.0,
          "first_line": 1.0,
          "last_column": 69.0,
          "last_line": 1.0,
          "path": "/usr/local/icinga2/var/lib/icinga2/api/packages/_api/af31cc1a-a1fe-4cd2-9295-414903dc953f/conf.d/downtimes/test-host-17!3176080d-a1f0-4772-8ce1-968cc3e37fba.conf"
        },
        "start_time": 1537211520.0,
        "templates": [
          "3176080d-a1f0-4772-8ce1-968cc3e37fba"
        ],
        "trigger_time": 1537211626.8038098812,
        "triggered_by": "",
        "triggers": [
          
        ],
        "type": "Downtime",
        "version": 1537211621.8197479248,
        "was_cancelled": false,
        "zone": ""
      },
      "joins": {
        
      },
      "meta": {
        
      },
      "name": "test-host-17!3176080d-a1f0-4772-8ce1-968cc3e37fba",
      "type": "Downtime"
    },
[...]
  ]
}

(output truncated)

Querying the host via the API, I can confirm that there are multiple downtimes in effect.

API: https://127.0.0.1:5665/v1/objects/hosts

Notice the downtime_depth attribute: there are 12 downtimes active.

{
  "results": [
    {
      "attrs": {
        "__name": "test-host-17",
        "acknowledgement": 0.0,
        "acknowledgement_expiry": 0.0,
        "action_url": "",
        "active": true,
        "address": "127.0.0.1",
        "address6": "",
        "check_attempt": 1.0,
        "check_command": "icmp",
        "check_interval": 1.0,
        "check_period": "",
        "check_timeout": null,
        "command_endpoint": "",
        "display_name": "test-host-17",
        "downtime_depth": 12.0,
        "enable_active_checks": true,
        "enable_event_handler": true,
        "enable_flapping": false,
[...]
    }
  ]
}

(output truncated)

Looking into the debug log I can see multiple log messages like the following. All log messages are for the same ScheduledDowntime object.

[2018-09-17 21:34:42 +0200] debug/ScheduledDowntime: Creating new Downtime for ScheduledDowntime "test-host-17!test-time-102"

Short summary: the downtime is re-created every minute, so we end up with multiple downtime objects after a while.
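
The entry_time values in the API output above are 60 seconds apart, which matches a per-minute re-evaluation. The following is a minimal sketch of that failure mode; the HasMatchingDowntime() check is an assumed simplification for illustration, not the real Icinga 2 code.

#include <iostream>
#include <vector>

struct Downtime { double start, end; };

// Assumed faulty check for this sketch: only downtimes ending strictly after
// minEnd count, so a downtime ending exactly at minEnd is never recognised.
bool HasMatchingDowntime(const std::vector<Downtime>& list, double minEnd)
{
	for (const auto& dt : list)
		if (dt.end > minEnd)
			return true;
	return false;
}

int main()
{
	std::vector<Downtime> downtimes;
	const double segStart = 0, segEnd = 6000;

	for (double now = 0; now < 720; now += 60)        // timer fires once per minute
		if (!HasMatchingDowntime(downtimes, segEnd))  // boundary case always fails
			downtimes.push_back({segStart, segEnd});  // re-created every minute

	std::cout << downtimes.size() << " downtimes created\n"; // 12, one per minute
}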

@mcktr (Member) left a review comment

See my test.

Commits:

  • In ScheduledDowntime::FindRunningSegment(), only regard downtimes that last longer than minEnd, not at least as long. Otherwise, a running downtime with a fixed start date will be queued over and over again.
  • Revert commit 406e5f2 as the patched file was from the wrong branch.
  • In ScheduledDowntime::FindRunningSegment(), only regard downtimes that last longer than minEnd, not at least as long. Otherwise, a running downtime with a fixed start date will be queued over and over again.
  • Revert commit 2e721ba since it put scheduleddowntime.cpp in the wrong place.
  • In ScheduledDowntime::FindRunningSegment(), only regard downtimes that last longer than minEnd, not at least as long. Otherwise, a running downtime with a fixed start date will be queued over and over again.
@efuss (Contributor, Author) commented Sep 18, 2018

Thanks for testing and reporting the failure.

There was a < comparison in ScheduledDowntime::FindRunningSegment() that should have been a <=.
The problem didn't show up for me because there was always an adjacent segment to merge. I'm unsure why it didn't show up before I implemented the merging.
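
For clarity, here is a standalone sketch of that comparison. The structure and names are illustrative; only the < vs. <= point is taken from the description above and the commit messages.

#include <ctime>
#include <iostream>
#include <utility>
#include <vector>

using Segment = std::pair<time_t, time_t>; // (begin, end)

// Sketch of a FindRunningSegment-style lookup: return the running segment
// that extends coverage beyond minEnd, or (0, 0) if there is none.
Segment FindRunningSegmentSketch(const std::vector<Segment>& segments,
                                 time_t now, time_t minEnd)
{
	Segment best{0, 0};
	for (const auto& seg : segments) {
		if (seg.first > now)
			continue;             // not running yet
		if (seg.second <= minEnd)
			continue;             // was "<" before the fix: a segment ending exactly
			                      // at minEnd kept being found and re-queued
		if (seg.second > best.second)
			best = seg;
	}
	return best;
}

int main()
{
	std::vector<Segment> segments{{0, 100}};
	// minEnd equals the segment's own end: with "<=" nothing is found (0 -> 0),
	// while the old "<" would return this same segment over and over.
	Segment s = FindRunningSegmentSketch(segments, 50, 100);
	std::cout << "found: " << s.first << " -> " << s.second << "\n";
}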

I'm also unsure as to whether GitHub automagically updated my Pull Request.
It took me less than 5 minutes to understand the problem from your report, 5 to 10 minutes to find the flaw in my change, seconds to correct it and about an hour fighting with GitHub to hopefully get the Pull Request right.

@dnsmichi (Contributor) commented

@mcktr Can you have a look again, please? I'd like to include this in 2.11.

@dnsmichi added the "bug" and "area/notifications" labels on Oct 10, 2018
@mcktr (Member) commented Oct 10, 2018

Yep, I will have a look in the next few days.

@mcktr (Member) commented Oct 15, 2018

Thanks for updating the PR.

Tests

Put running downtime in effect immediately

object Host "devel-host-001" {
	import "devel-host-template"

	address = "127.0.0.1"
}

object ScheduledDowntime "devel-downtime-001" {
	host_name = "devel-host-001"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-10-15" = "10:00-12:00"
	}
}

The downtime is in effect immediately after restart. After waiting for 10 minutes, no additional downtimes are created -> good. This part of the patch works.

Merge adjacent downtime segments

object Host "devel-host-001" {
	import "devel-host-template"

	address = "127.0.0.1"
}

object ScheduledDowntime "devel-downtime-001" {
	host_name = "devel-host-001"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-10-15" = "10:00-11:30"
	}
}

object ScheduledDowntime "devel-downtime-002" {
	host_name = "devel-host-001"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-10-15" = "11:30-12:00"
	}
}

Watching the log to verify the merging.

[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Try merge
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: By us, ends soon (Mon Oct 15 11:30:00 2018)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Finding next scheduled downtime segment for time 1539595447 (minBegin Mon Oct 15 11:30:00 2018)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Evaluating segment: 2018-10-15: 10:00-11:30
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Next Segment doesn't fit: Thu Jan  1 01:00:00 1970 != Mon Oct 15 11:30:00 2018
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Not by us (devel-host-001!devel-downtime-002 != devel-host-001!devel-downtime-001)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: No merge
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Creating new Downtime for ScheduledDowntime "devel-host-001!devel-downtime-001"
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Finding running scheduled downtime segment for time 1539595447 (minEnd Mon Oct 15 12:00:00 2018)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Evaluating (running?) segment: 2018-10-15: 10:00-11:30
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Considering (running?) segment: Mon Oct 15 10:00:00 2018 -> Mon Oct 15 11:30:00 2018
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: ending too early.
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Finding next scheduled downtime segment for time 1539595447 (minBegin -)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Evaluating segment: 2018-10-15: 10:00-11:30
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Try merge
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Not by us (devel-host-001!devel-downtime-001 != devel-host-001!devel-downtime-002)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: By us, ends soon (Mon Oct 15 12:00:00 2018)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Finding next scheduled downtime segment for time 1539595447 (minBegin Mon Oct 15 12:00:00 2018)
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Evaluating segment: 2018-10-15: 11:30-12:00
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Considering segment: Mon Oct 15 11:30:00 2018 -> Mon Oct 15 12:00:00 2018
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: beginning to early.
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: Next Segment doesn't fit: Thu Jan  1 01:00:00 1970 != Mon Oct 15 12:00:00 2018
[2018-10-15 11:24:07 +0200] debug/ScheduledDowntime: No merge

The first merge attempt is against the downtime itself; I guess we can safely skip this attempt.

The second merge attempt is against the adjacent downtime, which starts right after the initial downtime. But due to a date mismatch (Thu Jan 1 01:00:00 1970 != Mon Oct 15 12:00:00 2018) it doesn't fit, and no merge happens. I assume that some default date value is used, since the date is equal to Unix time 0.
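
For reference, the epoch timestamp in the log is simply what a zero/unset time value looks like once formatted; a quick illustration, assuming a zero time_t is used as the "no next segment found" sentinel:

#include <ctime>
#include <iostream>

int main()
{
	std::time_t unset = 0; // assumed "no matching segment" sentinel
	char buf[64];
	std::strftime(buf, sizeof(buf), "%c", std::localtime(&unset));
	std::cout << buf << "\n"; // "Thu Jan  1 01:00:00 1970" with TZ=Europe/Berlin
}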

Please have a look into the second part of your patch.

@efuss (Contributor, Author) commented Oct 15, 2018 via email

@mcktr (Member) commented Oct 15, 2018

Thanks for the clarification. 👍

So the above example is correctly not merged, since these are two different downtimes.

More Tests

object ScheduledDowntime "devel-downtime-001" {
	host_name = "devel-host-001"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-10-15" = "17:55-18:00,18:00-18:05"
	}
}

If I understand you correctly, the second segment should be merged into the first, since this is the same downtime with two adjacent segments, correct?

It does.

[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: Try merge
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: By us, ends soon (Mon Oct 15 18:00:00 2018)
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: Finding next scheduled downtime segment for time 1539618933 (minBegin Mon Oct 15 18:00:00 2018)
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: Evaluating segment: 2018-10-15: 17:55-18:00,18:00-18:05
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: Considering segment: Mon Oct 15 18:00:00 2018 -> Mon Oct 15 18:05:00 2018
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: (best match yet)
[2018-10-15 17:55:33 +0200] debug/ScheduledDowntime: Next Segment fits, extending end time Mon Oct 15 18:00:00 2018 to Mon Oct 15 18:05:00 2018

One thing I noticed in Icinga Web 2 during the tests:

[Screenshot: Icinga Web 2 downtime list showing a negative "expires in" value]

Notice the negative "expires in" value. After all segments of the downtime have expired and the downtime is finally over, it doesn't get cleared in Icinga Web 2. When the transition from the first segment to the second segment happens, a second downtime shows up, but that one is cleared on expiration. The downtime with the negative "expires in" value stays there until I restart the Icinga 2 daemon.

My guess at what happens: the downtime is initially created with the original end time of the first segment; after a while it is merged with the adjacent segment (you can verify this via the API). But the updated end time does not find its way into the database that Icinga Web 2 uses. So in Icinga Web 2 we see the old end time of the first segment for the downtime, which results in a negative "expires in" value.
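
A quick sketch of that arithmetic, assuming "expires in" is computed as end time minus now from the stale database row (timestamps are example values):

#include <iostream>

int main()
{
	double staleEndTime = 1539619200; // first segment's end, still stored in the IDO database
	double now          = 1539619320; // two minutes after that end has passed
	std::cout << "expires in: " << staleEndTime - now << " s\n"; // -120 s, shown as a negative value
}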

I'm looking forward to your feedback.

@mcktr added the "needs feedback" label on Oct 15, 2018
@efuss (Contributor, Author) commented Oct 16, 2018 via email

@mcktr (Member) commented Oct 17, 2018

Are you simply not meant to extend a downtime? Is Icinga Web 2 at fault for not re-checking it? Is it simply me needing to call notify_api_of_downtime_change() or something?

There is currently no way to extend an existing downtime, except to remove it and re-create it with the new end time. For configuration objects such as hosts, services, users, downtimes, etc., Icinga Web 2 relies on the IDO database. The downtime object you can fetch via the API is current, so updating the API wouldn't help here.

I took pen and paper and drew some pictures to illustrate the problem a bit more.

We have the following ScheduledDowntime object.

object ScheduledDowntime "devel-downtime-001" {
	host_name = "devel-host-001"

	author = "icingaadmin"
	comment = "Some comment"

	ranges = {
		"2018-10-17" = "14:30-14:35,14:35-14:40"
	}
}

The expected behavior is a downtime start notification at 14:30 and a downtime end notification at 14:40.

The current behavior looks like the following:

[Diagram 1: downtime created]

The downtime for the first segment (14:30-14:35) is created. The downtime object is written to the IDO database and exposed via API. A downtime start notification will be sent at 14:30.

[Diagram 2: downtime extended]

The initially created downtime is merged with the adjacent segment (14:35-14:40). This is done by setting the downtime ending time to the ending time of the adjacent segment (14:40).

if (segment.first == current_end) {
	Log(LogDebug, "ScheduledDowntime") << "Next Segment fits, extending end time "
		<< Utility::FormatDateTime("%c", current_end) << " to "
		<< Utility::FormatDateTime("%c", segment.second);
	downtime->SetEndTime(segment.second, false); // only the in-memory/API object is updated here
	return;
}
This works for the API but the IDO database never sees the updated ending time.

[Diagram 3: second downtime created]

Another downtime is created for the second segment (14:35-14:40). The downtime object is written to the IDO database and exposed via API. A downtime start notification will be sent at 14:35.

[Diagram 4: result]

In the end we have two downtimes, and four notifications are sent out.

Summary:

Extending the downtime's end time by setting it to the end time of the second segment is not sufficient.

  • the downtime object in the IDO database must be updated when merging the second segment

Looking into downtime.cpp, we have Downtime::AddDowntime and Downtime::RemoveDowntime implemented, but unfortunately no Downtime::UpdateDowntime (or similar).

  • the downtime object for the second segment must not be created

If an adjacent segment is merged, there shouldn't be another downtime created for the second segment (see the sketch after this list).
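
Here is a hypothetical sketch of the flow the two points above ask for. UpdateDowntimeEverywhere(), HandleSegment() and the commented-out IDO call are placeholders for illustration, not existing Icinga 2 API.

#include <iostream>

struct DowntimeSketch {
	double begin;
	double end;
};

// Placeholder for the missing "update" path: besides the in-memory/API object,
// the change would also have to be written to the IDO database.
void UpdateDowntimeEverywhere(DowntimeSketch& dt, double newEnd)
{
	dt.end = newEnd;       // what SetEndTime() already does for the API view
	// UpdateIdoRow(dt);   // hypothetical: propagate the new end time to the IDO database
}

// Returns true if a new downtime object is needed for this segment.
bool HandleSegment(DowntimeSketch* running, double segBegin, double segEnd)
{
	if (running && segBegin == running->end) {
		UpdateDowntimeEverywhere(*running, segEnd);
		return false;      // merged: do NOT create a second downtime for this segment
	}
	return true;
}

int main()
{
	DowntimeSketch dt{0, 300};
	bool createNew = HandleSegment(&dt, 300, 600); // adjacent segment 300-600
	std::cout << "create new downtime? " << std::boolalpha << createNew
	          << ", end is now " << dt.end << "\n"; // false, 600
}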

Since the first part of the patch (put an already running downtime in effect immediately) works flawlessly, could you split it out into a separate PR? This way we can focus here on getting the second part working. :)

@efuss (Contributor, Author) commented Oct 17, 2018 via email

Commit: Suppress a misleading debug message stating that the next segment won't fit because its start time (the epoch) didn't match. Instead, log that no next segment exists.
@efuss (Contributor, Author) commented Oct 18, 2018 via email

@mcktr changed the title from "Put running downtimes in effect and merge segments" to "WIP: Merge adjacent downtime segments" on Oct 19, 2018
@efuss (Contributor, Author) commented Oct 19, 2018 via email

@efuss (Contributor, Author) commented Oct 23, 2018 via email

@dnsmichi (Contributor) commented

@efuss Can you please move this into a new issue for better discussion with involved developers and users? The PR unfortunately went stale in this regard. Thanks.

@dnsmichi closed this on Nov 14, 2019
Labels: area/notifications, bug, needs feedback