
[dev.icinga.com #11382] Downtimes are not always activated/expired on restart #4031

Closed
icinga-migration opened this issue Mar 14, 2016 · 19 comments

commented Mar 14, 2016

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11382

Created by delgaty on 2016-03-14 19:22:33 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-04-07 08:16:14 +00:00)
Target Version: 2.4.5
Last Update: 2016-05-02 13:27:26 +00:00 (in Redmine)

Icinga Version: 2.4.3-1.fc23
Backport?: Already backported
Include in Changelog: 1

Hello,

I am running icinga-2.4.3-1 with icinga-web 1.13.1 on Fedora 23 in a distributed environment. I am able to execute most external commands, such as acknowledgements, scheduling checks, and disabling notifications, without incident. But I am having trouble with downtimes. I am able to schedule a downtime and see it in the downtime pane fine. The problem is that the downtime isn't always activating. If it is fixed for 2 hours and the service goes down within that timeframe, alerts are not suppressed and the downtime icon does not appear for the service. I am also seeing downtimes that should have expired still in the downtime pane. Deleting them is not working either. I have run a debug log and do not see any errors.

When deleting the downtime that did not activate this is the entry in the log:

[2016-03-13 13:35:16 -0600] information/ExternalCommandListener: Executing external command: [1457897716] DEL_SVC_DOWNTIME;1

[2016-03-13 13:35:16 -0600] debug/IdoMysqlConnection: Query: INSERT INTO icinga_externalcommands (command_args, command_name, command_type, endpoint_object_id, entry_time, instance_id) VALUES ('1', 'DEL_SVC_DOWNTIME', '79', 7626, FROM_UNIXTIME(1457897716), 1)

But the entry in the database never actually gets deleted.

Nor does the corresponding file in /var/lib/icinga2/api/packages/_api/xxx.xxx.xxx-1457480846-0/conf.d/downtimes.

/var/lib/icinga2/api/packages/_api/xxx.xxx.xxx-1457480846-0/conf.d/downtimes
-rw-r--r-- 1 icinga icinga 344 Mar 13 13:22 cloudweb07xxx.xxx.xxx-1457896976-0.conf
object Downtime "xxx.xxx.xxx-1457896976-0" ignore_on_error {
author = "jen"
comment = "test 1"
config_owner = ""
duration = 7200.000000
end_time = 1457904211.000000
fixed = true
host_name = "cloudweb07"
scheduled_by = ""
service_name = "CPU"
start_time = 1457897011.000000
triggered_by = ""
version = 1457896976.122614
}

Sometimes, however, the downtimes do work. This is the info from one that did activate:

-rw-r--r-- 1 icinga icinga 350 Mar 14 12:36 LinuxSMTPxxx.xxx.xxx-1457980581-0.conf

object Downtime "xxx.xxx.xxxl-1457980581-0" ignore_on_error {
author = "jen"
comment = "test"
config_owner = ""
duration = 7200.000000
end_time = 1457981430.000000
fixed = true
host_name = "LinuxSMTP"
scheduled_by = ""
service_name = "procs"
start_time = 1457980590.000000
triggered_by = ""
version = 1457980581.536455
}

I cannot find anything in common with the downtimes that are not working correctly. Thanks for any insight.

Changesets

2016-03-23 13:05:09 +00:00 by mfriedrich 0447e81

Ensure that the legacy_id attribute for comment/downtime objects is restored on restart

This has been changed when moving comments/downtimes to config objects
in 050c520b2a3df0477f2eec71ec2d874b6dd36aa5

It may cause trouble with DB IDO DELETE and UPDATE statements
depending on the old value.

refs #11382

2016-04-20 08:07:22 +00:00 by mfriedrich 521580f

Ensure that the legacy_id attribute for comment/downtime objects is restored on restart

This has been changed when moving comments/downtimes to config objects
in 050c520b2a3df0477f2eec71ec2d874b6dd36aa5

It may cause trouble with DB IDO DELETE and UPDATE statements
depending on the old value.

refs #11382

Relations:


Updated by mfriedrich on 2016-03-15 07:55:33 +00:00

  • Parent Id set to 11312

Updated by mfriedrich on 2016-03-15 07:55:46 +00:00

  • Category changed from Packages to libicinga

Updated by mfriedrich on 2016-03-15 10:41:55 +00:00

  • Priority changed from Normal to High

Updated by mfriedrich on 2016-03-15 10:42:45 +00:00

NMS & #458875


Updated by mfriedrich on 2016-03-15 11:06:51 +00:00

  • Status changed from New to Assigned
  • Assigned to set to mfriedrich

Updated by mfriedrich on 2016-03-15 15:13:36 +00:00

Some questions:

  • How many hosts and services are being checked?
  • What does the cluster look like (2 master nodes with HA)? Best to show zones.conf.
  • How many downtimes are currently scheduled?
  • Are you using the ScheduledDowntime configuration objects from Icinga2 itself, or are you manually creating these downtimes through Icinga Web 2 / the command pipe / the API?

Updated by mfriedrich on 2016-03-15 15:14:26 +00:00

  • Relates set to 11012

Updated by ClemensBW on 2016-03-15 15:38:28 +00:00

Same problem here:

dnsmichi wrote:

Some questions:
* How many hosts and services are being checked?
11860 Hosts / 49741 Services

* What does the cluster look like (2 master nodes with HA)? Best to show zones.conf.
[root@$master ~]# cat /etc/icinga2/zones.conf

object Endpoint "$master" {
}

object Zone "master" {
//this is the local node master named = "master"
endpoints = [ "$master" ]
}

object Endpoint "987" {
host = "10.x.x.987"
}

object Zone "987" {
endpoints = [ "987" ]
parent = "master"
}

////////////////////////

object Endpoint "123" {
host = "10.x.x.123"
}

object Zone "456" {
endpoints = [ "567" ]
parent = "master"
}

* How many downtimes are currently scheduled?
MariaDB [icinga]> select count(scheduleddowntime_id) from icinga_scheduleddowntime;
+-----------------------------+
| count(scheduleddowntime_id) |
+-----------------------------+
|                         383 |
+-----------------------------+
1 row in set (0.00 sec)

* Are you using the ScheduledDowntime configuration objects from Icinga2 itself, or are you manually creating these downtimes through Icinga Web 2 / the command pipe / the API?
With Icingaweb2 over command pipe


Updated by delgaty on 2016-03-15 18:31:39 +00:00

dnsmichi wrote:

Some questions:

* How many hosts and services are being checked?

463/1453

* What does the cluster look like (2 master nodes with HA)? Best to show zones.conf.

object Endpoint "i2master" {
}

object Zone "master" {
//this is the local node master named = "master"
endpoints = [ "i2master" ]
}

object Zone "i2slave" {
endpoints = [ "i2slave" ]
}

object Zone "global-templates" {
global = true
}

* How many downtimes are currently scheduled?
I currently have less than 15 downtimes scheduled.

* Are you using the ScheduledDowntime configuration objects from Icinga2 itself, or are you manually creating these downtimes through Icinga Web 2 / the command pipe / the API?

I have removed my recurring downtime file until I resolve this issue. I have been scheduling them through icinga-web and icingaweb2 with the same result.


Updated by mfriedrich on 2016-03-18 10:21:20 +00:00

  • Relates set to 11353

Updated by jcarterch on 2016-03-21 21:13:33 +00:00

I am encountering what seems to be the same problem.

A way I was able to consistently reproduce this: all active downtimes (scheduled via the API) are lost/disabled when a secondary master restarts while the primary configuration master is down (zone of two masters total).

Downtime events remain present in the database (icinga_scheduleddowntime), but are not removed by API requests, even though DEL_HOST_DOWNTIME and DEL_SVC_DOWNTIME commands are logged in the debug log as written to icinga_externalcommands. No errors are logged by Icinga for a failed external command. Downtimes are visible in icingaweb2, but are not "in effect", and do not return to being in effect after forced service checks are executed. Bringing the primary up again does not return the downtimes to an active state.


Updated by mfriedrich on 2016-03-22 21:32:24 +00:00

Notes:

  • This does not necessarily happen inside a cluster
  • It is a timing issue with many checkables and downtimes and cannot be triggered easily
  • My guess is that the reload takes "longer". Stopping the core also stops all objects. Downtime::Stop() will unregister the downtime from its checkable but not call OnDowntimeRemoved (that only happens when RemoveDowntime() calls DeleteObject() from API actions or the external command processor).
  • We changed the way downtimes/comments are handled in 2.4 and also changed the way these objects are added/removed. That includes clearing the tables on restart. I suspect that on start the adding of downtimes "fails", leaving the old entries partially there.
void DbEvents::AddDowntime(const Downtime::Ptr& downtime)
{
        std::vector<DbQuery> queries;
        RemoveDowntimeInternal(queries, downtime);
        AddDowntimeInternal(queries, downtime, false);
        DbObject::OnMultipleQueries(queries);
}

RemoveDowntimeInternal() not only deletes the entry but also updates the downtime history and status tables (downtime_depth).

  • check whether this might influence the queries.
  • the per-downtime delete calls are expensive (each one is a delete query on an index); we should find a better way, perhaps deleting them globally and re-inserting them on startup.
  • 5b48559 disables the global downtime/comment table clearing on restart

It certainly would help to get a debug log containing all "IdoMysql" entries before and after the restart to dig deeper.


Updated by mfriedrich on 2016-03-23 09:15:41 +00:00

Question to all the reporters here:

Once you've added a downtime or a comment you should see its legacy_id (note that one down from your database dump). The table entry also includes the object_id (note that one as well).

Better: enable the core REST API, which is independent of the faulty DB IDO updates. Then you'll know the current downtime values.

Pick a legacy_id which is fairly high, not the first or tenth one:
https://localhost:5665/v1/objects/downtimes?filter=downtime.legacy_id>50

Note the __name field and write it down, along with the legacy_id.

Then fire a restart.

Call the REST API again and look for the __name. Is the legacy_id the same or did it change?

Please post your findings here, including all outputs.

(OTOH: I guess the DELETE query using object_id and internal_downtime_id in the WHERE clause is incorrect because the downtime_id (legacy_id) changed over restarts. In 2.4 we changed the way downtimes are inserted (delete up front) but never looked into the DELETE statement. Icinga 1.x omits the legacy_id as it could change over time.)
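As an illustrative sketch (not part of Icinga), the before/after comparison above can be automated once both API responses are saved as JSON; `index_by_name` and `changed_legacy_ids` are hypothetical helper names:

```python
# Illustrative helpers (not Icinga code) for the check described above:
# save the /v1/objects/downtimes JSON before and after the restart
# (e.g. via curl with your API credentials), then diff the legacy_ids.

def index_by_name(api_result):
    # Map each downtime's __name to its legacy_id.
    return {obj["attrs"]["__name"]: obj["attrs"]["legacy_id"]
            for obj in api_result["results"]}

def changed_legacy_ids(before, after):
    # Return {__name: (old_id, new_id)} for downtimes whose legacy_id changed.
    old = index_by_name(before)
    new = index_by_name(after)
    return {name: (old[name], nid)
            for name, nid in new.items()
            if name in old and old[name] != nid}
```

Feed it the parsed output of the downtimes query from before and after the restart; any non-empty result confirms that legacy_ids were renumbered.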


Updated by mfriedrich on 2016-03-23 13:08:20 +00:00

  • Subject changed from Some downtimes not activating or expiring to Downtimes are not always activated/expired on restart
  • Target Version set to 2.5.0
  • Add downtime with comment "abc"
    -> legacy_id=75
  • Add downtime for a different checkable with comment "def"
    -> legacy_id=76
  • Delete downtime with comment "abc"
    -> legacy_id=75 is gone
  • The downtime with legacy_id=76 still exists:
"attrs": {
  "__name": "mbmif.int.netways.de!disk /!mbmif.int.netways.de-1458726179-1",
  "active": true,
  "author": "icingaadmin",
  "comment": "def",
  "config_owner": "",
  "duration": 0,
  "end_time": 1461404573,
  "entry_time": 1458726179.129743,
  "fixed": true,
  "ha_mode": 0,
  "host_name": "mbmif.int.netways.de",
  "legacy_id": 76,
  "name": "mbmif.int.netways.de-1458726179-1",
  "original_attributes": null,
  "package": "_api",
  "paused": false,
  "scheduled_by": "",
  "service_name": "disk /",
  "start_time": 1458726173,
  "templates": [
    "mbmif.int.netways.de-1458726179-1"
  ],
  "trigger_time": 1458726179.132432,
  "triggered_by": "",
  "triggers": [ ],
  "type": "Downtime",
  "version": 1458726179.128099,
  "was_cancelled": false,
  "zone": ""
},
"joins": { },
"meta": { },
"name": "mbmif.int.netways.de!disk /!mbmif.int.netways.de-1458726179-1",
"type": "Downtime"
}
  • Restart the icinga2 process
  • Fetch the downtime with comment "def"
"attrs": {
  "__name": "mbmif.int.netways.de!disk /!mbmif.int.netways.de-1458726179-1",
  "active": true,
  "author": "icingaadmin",
  "comment": "def",
  "config_owner": "",
  "duration": 0,
  "end_time": 1461404573,
  "entry_time": 1458726179.129743,
  "fixed": true,
  "ha_mode": 0,
  "host_name": "mbmif.int.netways.de",
  "legacy_id": 52,
  "name": "mbmif.int.netways.de-1458726179-1",
  "original_attributes": null,
  "package": "_api",
  "paused": false,
  "scheduled_by": "",
  "service_name": "disk /",
  "start_time": 1458726173,
  "templates": [
    "mbmif.int.netways.de-1458726179-1"
  ],
  "trigger_time": 1458726179.132432,
  "triggered_by": "",
  "triggers": [ ],
  "type": "Downtime",
  "version": 1458726179.128099,
  "was_cancelled": false,
  "zone": ""
},
"joins": { },
"meta": { },
"name": "mbmif.int.netways.de!disk /!mbmif.int.netways.de-1458726179-1",
"type": "Downtime"
},
  • downtime has a different legacy_id -> 52

This could be the culprit: the DB IDO delete-then-insert happens with the new legacy_id, leaving the old row behind. I haven't found a clean way to reproduce it yet (I always get the same number of downtimes).
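The stale-row mechanism can be modeled in a few lines (an illustrative Python sketch, not Icinga code; the table layout and `restart_downtime` helper are assumptions for demonstration):

```python
# Illustrative model of why a restart leaves stale rows behind:
# the IDO "delete then insert" keys its DELETE on object_id plus the
# *renumbered* legacy_id, so a row stored under the old id is never removed.

def restart_downtime(table, downtime, new_legacy_id):
    # Re-add a downtime after a restart; the DELETE uses the new id.
    downtime = dict(downtime, internal_downtime_id=new_legacy_id)
    # Simulates: DELETE ... WHERE object_id = ? AND internal_downtime_id = ?
    table[:] = [row for row in table
                if not (row["object_id"] == downtime["object_id"]
                        and row["internal_downtime_id"]
                            == downtime["internal_downtime_id"])]
    table.append(downtime)

table = [{"object_id": 7626, "internal_downtime_id": 76, "comment": "def"}]
# After the restart the in-memory legacy_id was renumbered from 76 to 52:
restart_downtime(table, table[0], 52)
# The old row (id 76) survives alongside the re-inserted one (id 52).
```

This matches the observation that deleting a downtime whose legacy_id changed over a restart never removes the original database entry.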

But I've found a possible fix by re-adding the legacy_id field to the state file.

Please test the current snapshot packages from git master.

$ icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: v2.4.4-273-g98e1d70; debug)

Updated by delgaty on 2016-04-01 17:08:23 +00:00

I have been running the snapshot in my test environment for over a week and have had no issues with downtime. Both manual and recurring downtimes are activating and expiring as expected. The issue seems to be resolved in this snapshot. Thanks.


Updated by mfriedrich on 2016-04-04 14:05:28 +00:00

  • Target Version changed from 2.5.0 to 2.4.5

OK, thanks for your kind feedback on your tests. I'd be happy to see the other reporters on this issue also fire up their test stages.

Kind regards,
Michael


Updated by mfriedrich on 2016-04-07 08:16:15 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Updated by gbeutner on 2016-04-20 08:15:54 +00:00

  • Backport? changed from Not yet backported to Already backported

Updated by mfriedrich on 2016-05-02 13:27:26 +00:00

  • Parent Id deleted 11312