
[process] prettify & cache AccessDenied failures #1595

Merged · 2 commits into master · May 19, 2015
Conversation

LeoCavaille
Member

Refactor this check a bit; two notable changes:

  1. On Windows especially, we randomly get AccessDenied exceptions when
    trying to look at some processes. Because we iterate over all the
    processes at every run, this would trigger as many failures every run.
    Once we get this error and ignore_denied_access is set to true, we
    won't look at that process again until DEFAULT_AD_CACHE_DURATION (or
    access_denied_cache_duration in the init config) has elapsed.
  2. Instead of scattering sketchy try/except blocks over the code, all the
    psutil-related code that actually gets metric values is wrapped in
    psutil_wrapper, which knows how to handle exceptions and known
    failures.

Last remark: we previously set some metrics to 0 even when no values were
parsed; this is no longer the case.
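The two changes above can be sketched roughly as follows. This is a dependency-free illustration, not the check's actual code: `AccessDeniedCache`, `is_denied`, and the `denied_exc` parameter are made-up names, and `PermissionError` stands in for `psutil.AccessDenied` so the sketch runs without psutil installed.

```python
import time

DEFAULT_AD_CACHE_DURATION = 120  # seconds; illustrative default


class AccessDeniedCache(object):
    """Sketch of wrapping psutil calls and caching AccessDenied failures."""

    def __init__(self, ignore_denied_access=True,
                 cache_duration=DEFAULT_AD_CACHE_DURATION,
                 denied_exc=PermissionError):
        self.ignore_denied_access = ignore_denied_access
        self.cache_duration = cache_duration
        self.denied_exc = denied_exc  # stand-in for psutil.AccessDenied
        self.ad_cache = {}  # pid -> timestamp of the last denied access

    def is_denied(self, pid):
        """True while a cached failure for this pid is still fresh."""
        ts = self.ad_cache.get(pid)
        return ts is not None and time.time() - ts < self.cache_duration

    def psutil_wrapper(self, pid, fn):
        """Run fn(); on a denied access, cache the failure and return None."""
        if self.ignore_denied_access and self.is_denied(pid):
            return None  # cached failure: skip this pid until the TTL expires
        try:
            return fn()
        except self.denied_exc:
            if self.ignore_denied_access:
                self.ad_cache[pid] = time.time()
            return None
```

Callers get `None` instead of an exception, which also matches the "no value instead of 0" behavior mentioned above.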

@remh
Contributor

remh commented May 11, 2015

Nice cleanup!

We should probably cache the pid list too; I'm not sure it's useful to refresh it at every iteration. What do you think?

According to the dev mode (👍 @talwai), this is by far the most intensive operation of the check.

We could maybe cache it for a few iterations?

@LeoCavaille
Member Author

@remh it's a bit hard to make this design choice, since people could potentially put no-data monitors on top of those metrics. Or maybe hide it behind a feature flag?

@remh
Contributor

remh commented May 12, 2015

+1 for the feature flag / a TTL parameter for the pid list.

@remh remh self-assigned this May 12, 2015
@LeoCavaille
Member Author

@remh implemented, let me know what you think.

self.last_pid_cache_ts = {}
self.pid_cache = {}
self.pid_cache_duration = int(
    # snippet truncated in the original; the key matches the new init_config
    # option, the default constant's name is assumed
    init_config.get('pid_cache_duration', DEFAULT_PID_CACHE_DURATION))
Contributor

Shouldn't that be an instance parameter instead?

Member Author
I think in most cases this shouldn't be changed, so I don't think it's worth the trouble of an instance parameter.

The new init_config option `pid_cache_duration` is the period during which
the agent won't try to refresh the "matching" PID list for each instance.
It avoids having to iterate over all the processes at every run; the cache
is invalidated when:
* the deadline expires
* or one of the cached processes can't be read anymore

You would want to lower the cache duration if you want to alert on
no-data.
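The caching scheme described above can be sketched as follows. `PidCacheSketch`, `get_pids`, and `search_fn` are illustrative names under assumed behavior, not the check's actual API; the real check builds the matching PID list per configured instance.

```python
import time


class PidCacheSketch(object):
    """Sketch of a per-instance PID cache with a TTL and early invalidation."""

    def __init__(self, pid_cache_duration=120):
        self.pid_cache_duration = pid_cache_duration
        self.pid_cache = {}          # instance name -> list of matching pids
        self.last_pid_cache_ts = {}  # instance name -> last refresh timestamp

    def get_pids(self, name, search_fn):
        """Return cached pids for `name`, rescanning only after the TTL."""
        now = time.time()
        last = self.last_pid_cache_ts.get(name, 0)
        if name not in self.pid_cache or now - last > self.pid_cache_duration:
            self.pid_cache[name] = search_fn()  # the expensive full-process scan
            self.last_pid_cache_ts[name] = now
        return self.pid_cache[name]

    def invalidate(self, name):
        """Drop the cache early, e.g. when a cached pid can't be read anymore."""
        self.pid_cache.pop(name, None)
        self.last_pid_cache_ts.pop(name, None)
```

Between refreshes, every check run reuses the cached list, which is why a long `pid_cache_duration` can delay detection of a vanished process (the no-data trade-off discussed above).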
@LeoCavaille LeoCavaille merged commit cc5a67b into master May 19, 2015
@LeoCavaille LeoCavaille deleted the leo/processcheck branch May 19, 2015 16:02
@coveralls

Coverage Status

Changes unknown when pulling cc5a67b on leo/processcheck into master.
