[dev.icinga.com #2951] fix deleting too old check result files #1066

Closed
icinga-migration opened this issue Aug 5, 2012 · 1 comment

@icinga-migration
commented Aug 5, 2012

This issue has been migrated from Redmine: https://dev.icinga.com/issues/2951

Created by mfriedrich on 2012-08-05 11:20:01 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2012-08-31 11:13:38 +00:00)
Target Version: 1.8
Last Update: 2012-08-31 11:13:37 +00:00 (in Redmine)

Icinga Version: 1.7.1
OS Version: Debian

This is a rather common issue: the checkresult directory is not cleaned up after the core reaps the files, and the leftover files slow down overall processing.

As the original diff describes, the root of the problem is the spooling scheme: "write the checkresult to a tmp dir, then move it to the checkresult queue, and put a .ok file there as well, telling the core's checkresult reaper that the file is complete and safe to read". With frequent reloads, a lot of "not yet finished" checks stay in the queue without ever getting their .ok file.
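The hand-off described above can be sketched roughly as follows. This is an illustration in Python, not the core's actual C code; the function name `submit_check_result`, the `check` file prefix, and the directory parameters are assumptions made for the example.

```python
import os
import tempfile


def submit_check_result(spool_dir, tmp_dir, payload):
    """Write a result in tmp_dir, move it into spool_dir, then add the '.ok' marker.

    Hypothetical helper: names and layout are illustrative only.
    """
    # 1. Write the complete result to a private temp file first.
    fd, tmp_path = tempfile.mkstemp(prefix="check", dir=tmp_dir)
    with os.fdopen(fd, "w") as fh:
        fh.write(payload)

    # 2. Move it into the spool (queue) directory.
    #    Assumes tmp_dir and spool_dir are on the same filesystem.
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.replace(tmp_path, final_path)

    # 3. Create the empty '.ok' marker last, so a reader never sees a
    #    marker for a file that is still being written.
    with open(final_path + ".ok", "w"):
        pass
    return final_path
```

If the process is interrupted between steps 2 and 3 (for example by a reload), the result file stays in the spool directory without its marker, which is the leftover case this issue is about.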

To decide whether a file is ok, the core has to loop over all files and stat() the matching .ok file. Most of those lookups are misses, because the old checkresult files will never be processed anymore. And who runs a manual cronjob to clean that up when the core should do it?
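To make the cost visible, here is a minimal sketch of such a scan, again in Python rather than the core's C. The function name `find_ready_results` and the miss counter are assumptions for illustration; the point is that every abandoned file costs one failed stat() on every pass.

```python
import os


def find_ready_results(spool_dir):
    """Return (ready_paths, misses) for one pass over the spool directory.

    Hypothetical helper: counts how many '.ok' lookups fail.
    """
    ready = []
    misses = 0
    for name in sorted(os.listdir(spool_dir)):
        if name.endswith(".ok"):
            continue  # markers are only looked up, never processed themselves
        path = os.path.join(spool_dir, name)
        try:
            os.stat(path + ".ok")
        except FileNotFoundError:
            misses += 1  # abandoned or unfinished result: skipped, but not removed
            continue
        ready.append(path)
    return ready, misses
```

Because skipped files are never removed in this scheme, the number of misses only grows with each reload that abandons running checks.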

The patch should be cherry-picked into the 1.8.x trees as well once testing is done.

core: Fix deleting too old check result files
Even under pretty normal circumstances, the check result spool dir
can fill up with a tremendous amount of check result files, which kills
Nagios' performance completely.

The problem is reloads, where old checks may be abandoned in case
they take too long to finish. In that case, half the check result file
is stashed in the spool directory (the other half is only written as
the check returns). With a huge amount of checks and semi-frequent
restarts, the checks will start to accumulate and Nagios will spend
more and more time scanning a huge directory of files where very few of
the check result files have ".ok" files accompanying them, leading to
a ton of cache-misses when we try to stat() the ".ok" file.

This patch fixes it by using the mtime from the stat call earlier in
the chain so even check results without an ".ok" file can be deleted.

Signed-off-by: Andreas Ericsson 
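The idea of the fix in the commit message above, deciding on deletion from the mtime of the result file itself instead of requiring an ".ok" file first, might look like this in a simplified Python sketch. It is not the actual patch; `reap_spool`, `max_age_seconds`, and the `now` parameter are names assumed for the example.

```python
import os
import stat
import time


def reap_spool(spool_dir, max_age_seconds, now=None):
    """Return (ready_paths, deleted_paths) for one pass over the spool directory.

    Hypothetical helper: files older than max_age_seconds are deleted whether
    or not they have an '.ok' marker.
    """
    now = time.time() if now is None else now
    ready, deleted = [], []
    for name in sorted(os.listdir(spool_dir)):
        if name.endswith(".ok"):
            continue
        path = os.path.join(spool_dir, name)
        try:
            st = os.stat(path)  # this stat result is reused for the age check
        except FileNotFoundError:
            continue
        if not stat.S_ISREG(st.st_mode):
            continue
        if now - st.st_mtime > max_age_seconds:
            # Too old: remove the result and its marker, if one exists.
            for victim in (path, path + ".ok"):
                try:
                    os.unlink(victim)
                except FileNotFoundError:
                    pass
            deleted.append(path)
            continue
        if os.path.exists(path + ".ok"):
            ready.append(path)
    return ready, deleted
```

The age check runs before the marker lookup, so abandoned files are removed after one expiry period instead of being re-scanned on every pass.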

Changesets

2012-08-05 11:24:55 +00:00 by mfriedrich 13b11a984d715516414dde3bb706b8e4a6535972

core: fix deleting too old check result files #2951

refs #2951

2012-08-07 13:30:33 +00:00 by mfriedrich f63541d

core: fix deleting too old check result files #2951

refs #2951

2012-08-19 17:42:11 +00:00 by mfriedrich e06dadc

core: fix deleting too old check result files #2951

refs #2951

Relations:

@icinga-migration

commented Aug 31, 2012

Updated by mfriedrich on 2012-08-31 11:13:38 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100
  • Icinga Version set to 1
  • OS Version set to Debian