-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
recatch the metrics not found error #228
Conversation
@miq-bot add_label metrics looks ok now , waiting for logs to fill up ... @cben @moolitayer @Ladas @gtanzillo please review |
@cben @moolitayer @Ladas @gtanzillo works for me, without it I have the errors :-) |
p.s. |
@yaacov so if the Pods are failing, maybe we are still good with .debug? I am thinking that nobody will scan logs for 'why the failing pod misses metrics'? We should see the failing pod in inventory and we should also get events about that fact? So those might be enough. |
If we can't decide, consider also log.info - non-judgemental but present in logs by default :) |
LGTM |
Please set gaprindashvili/{yes,no}, I lost track of what was merged/backported/reverted :) |
My vote is yes, @gtanzillo @moolitayer ? |
@cben right so, the thing was it was flooding the log for @gtanzillo, but the might have been caused only by the exception. @yaacov how often do you see the message in log? @yaacov I think we want this also for Fine, right @gtanzillo? |
In a system with failing pods it will log one line per pod each 50 min: |
@yaacov ok, that looks fine, we can switch to warn or info I guess. Could you check without this PR, was the exception causing this to be repeated more? (e.g. if the job was performed over and over and it always failed) Otherwise I am not sure what could have caused the log flood, even if the backtrace has like 200 lines. |
looking at it now |
13822ef
to
873a30b
Compare
@Ladas Update about number of log lines:
It looks like we have a line for each node, pod and container that have no metrics in this cycle:
|
Checked commits yaacov/manageiq-providers-kubernetes@833e43e~...873a30b with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0 |
@Ladas can we merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cben yeah, looks good. Going forward, we should find a way to alert users about failing pods, since that causes also an event storm, among others.
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1537195 |
Note: |
recatch the metrics not found error (cherry picked from commit 9769743) https://bugzilla.redhat.com/show_bug.cgi?id=1552314
Gaprindashvili backport details:
|
Description
In #227 I forgot to re-catch the
NoMetricsFoundError
and prevent it from escalating toCollectionFailure
.Fix for PR #227
BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1530627
https://bugzilla.redhat.com/show_bug.cgi?id=1537195