recatch the metrics not found error #228

Merged (2 commits) on Jan 31, 2018

Conversation

Member

@yaacov yaacov commented Jan 30, 2018

Description

In #227 I forgot to re-catch the NoMetricsFoundError and prevent it from escalating to CollectionFailure.
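The re-catch described above can be sketched as follows. This is a minimal illustration, not the provider's actual code: the two error classes are stand-ins declared locally (their real superclasses may differ), and `collect_metrics` is a hypothetical helper. The point is only the ordering: the specific rescue for `NoMetricsFoundError` comes before the generic one that escalates to `CollectionFailure`.

```ruby
require "logger"
require "stringio"

# Stand-ins for the classes named in this PR (assumed hierarchy).
class CollectionFailure < RuntimeError; end
class NoMetricsFoundError < RuntimeError; end

# Hypothetical collector: the block performs the actual metrics query.
def collect_metrics(target, log)
  yield
rescue NoMetricsFoundError => e
  # Re-catch: missing metrics are logged and treated as an empty result,
  # so they never reach the generic handler below.
  log.warn("Metrics missing: [#{target}] #{e.message}")
  {}
rescue StandardError => e
  # Any other error is escalated as a collection failure.
  raise CollectionFailure, e.message
end

log_output = StringIO.new
log = Logger.new(log_output)

# Missing metrics: logged, and an empty result is returned.
empty = collect_metrics("ContainerGroup(333)", log) do
  raise NoMetricsFoundError, "no gauge metrics found for [httpd-1-deploy]"
end

# Any other error: escalated to CollectionFailure.
escalated = begin
  collect_metrics("ContainerGroup(333)", log) { raise IOError, "connection reset" }
  false
rescue CollectionFailure
  true
end
```

Ruby tries rescue clauses in order, so swapping the two clauses would send every missing-metrics case down the escalation path.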

Fix for PR #227

BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1530627
https://bugzilla.redhat.com/show_bug.cgi?id=1537195

Member Author

yaacov commented Jan 30, 2018

@miq-bot add_label metrics

Looks OK now, waiting for the logs to fill up...

@cben @moolitayer @Ladas @gtanzillo please review

Member Author

yaacov commented Jan 30, 2018

@cben @moolitayer @Ladas @gtanzillo works for me; without it I still get the errors :-)

Member Author

yaacov commented Jan 30, 2018

P.S.
We use _log.debug, but metrics can be missing because of a real problem.
Do we want to switch to _log.warn?

Contributor

Ladas commented Jan 30, 2018

@yaacov so if the pods are failing, maybe we are still fine with .debug? I doubt anybody will scan the logs for "why is the failing pod missing metrics". We should see the failing pod in inventory, and we should also get events about it, so those might be enough.

Contributor

cben commented Jan 31, 2018

If we can't decide, consider also log.info - non-judgemental but present in logs by default :)
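The trade-off between the three levels can be seen with Ruby's standard `Logger`. This is a small sketch, assuming the logger threshold is INFO (as the remark about info being "present in logs by default" suggests); the messages are made up, and the real `_log` object may be configured differently.

```ruby
require "logger"
require "stringio"

out = StringIO.new
log = Logger.new(out)
log.level = Logger::INFO # assumed threshold; the real configuration may differ

log.debug("Metrics missing (debug)") # below the threshold, dropped
log.info("Metrics missing (info)")   # written
log.warn("Metrics missing (warn)")   # written

written = out.string.lines
puts written
```

With this threshold only the info and warn lines are written, so staying on debug keeps missing metrics invisible unless someone lowers the level.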

Contributor

cben commented Jan 31, 2018

LGTM

Contributor

cben commented Jan 31, 2018

Please set gaprindashvili/{yes,no}, I lost track of what was merged/backported/reverted :)

Member Author

yaacov commented Jan 31, 2018

Please set gaprindashvili/{yes,no}

My vote is yes, @gtanzillo @moolitayer ?

Contributor

Ladas commented Jan 31, 2018

@cben right, so the thing was that it was flooding the log for @gtanzillo, but that might have been caused only by the exception. @yaacov how often do you see the message in the log?

@yaacov I think we want this also for Fine, right @gtanzillo?

Member Author

yaacov commented Jan 31, 2018

@yaacov how often do you see the message in log?

In a system with failing pods it will log one line per pod every 50 min:
4 failing pods * ~28.8 intervals per day (1440 min / 50 min) ≈ 115 new lines per day
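The estimate follows from a day holding 24 * 60 / 50 = 28.8 collection intervals of 50 minutes each. A quick check of the arithmetic, using only the numbers given in the comment (the variable names are illustrative):

```ruby
MINUTES_PER_DAY = 24 * 60
interval_minutes = 50 # one log line per failing pod per interval
failing_pods = 4

intervals_per_day = MINUTES_PER_DAY / interval_minutes.to_f
lines_per_day = failing_pods * intervals_per_day

puts format("%.1f intervals/day, about %d lines/day", intervals_per_day, lines_per_day.round)
```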

Contributor

Ladas commented Jan 31, 2018

@yaacov OK, that looks fine, we can switch to warn or info I guess. Could you check, without this PR, whether the exception was causing the message to be repeated more often (e.g. if the job was retried over and over and always failed)? Otherwise I am not sure what could have caused the log flood, even if the backtrace has around 200 lines.

Member Author

yaacov commented Jan 31, 2018

Could you check without this PR

looking at it now

Member Author

yaacov commented Jan 31, 2018

Could you check without this PR

@Ladas Update on the number of log lines
for one cycle (on the RADAR + 2 clusters I have with failing pods):

cat log/evm.log | grep WARN.*perf_collect_metrics | wc -l
28

It looks like we get one line for each node, pod and container that has no metrics in this cycle:

11:19 $ cat log/evm.log | grep WARN.*perf_collect_metrics
[----] W, [2018-01-31T11:16:47.238934 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerNode(85)] network/rx_rate missing while query metrics
[----] W, [2018-01-31T11:16:50.881737 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(333)] no gauge metrics found for [httpd-1-deploy]
[----] W, [2018-01-31T11:16:56.826146 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(316)] no gauge metrics found for [registry-console-1-deploy]
[----] W, [2018-01-31T11:17:00.008950 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(335)] no gauge metrics found for [memcached-1-deploy]
[----] W, [2018-01-31T11:17:00.889283 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(323)] no gauge metrics found for [jboss-eap-70-1-build]
[----] W, [2018-01-31T11:17:02.758937 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(328)] cpu/usage_rate missing while query metrics
[----] W, [2018-01-31T11:17:06.372275 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(318)] no gauge metrics found for [router-1-deploy]
[----] W, [2018-01-31T11:17:09.014359 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(336)] no gauge metrics found for [postgresql-1-deploy]
[----] W, [2018-01-31T11:17:09.062079 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(317)] no gauge metrics found for [registry-console-2-deploy]
[----] W, [2018-01-31T11:17:12.915836 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(313)] no gauge metrics found for [docker-registry-1-deploy]
[----] W, [2018-01-31T11:17:14.520492 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(332)] no gauge metrics found for [ansible-1-deploy]
[----] W, [2018-01-31T11:17:15.652845 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(324)] no gauge metrics found for [jboss-eap-70-2-build]
[----] W, [2018-01-31T11:17:17.253626 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(319)] network/rx_rate missing while query metrics
[----] W, [2018-01-31T11:17:23.049514 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(321)] no gauge metrics found for [eap-app-1-build]
[----] W, [2018-01-31T11:17:23.477521 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(325)] no gauge metrics found for [jboss-eap-70-3-build]
[----] W, [2018-01-31T11:17:26.266991 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2460)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:26.267749 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2441)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:32.260782 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2463)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:33.729404 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2456)] cpu/usage_rate missing while query metrics
[----] W, [2018-01-31T11:17:43.418998 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2451)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:46.931836 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2444)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:47.498166 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2452)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:49.625242 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2453)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:50.224540 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2449)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:52.949753 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2464)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:58.748367 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2446)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:58.748603 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2445)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:18:01.498400 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2461)] no gauge metrics found for [deployment]
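To see how the warnings split across nodes, pods, and containers, the lines can be tallied by the entity type in brackets. This is a minimal sketch, assuming the message format shown in the log above; the sample embeds three of those lines verbatim, while the pattern and the `tally_by_type` helper are illustrative and not part of ManageIQ.

```ruby
# Three lines copied verbatim from the log output above.
SAMPLE_LOG = <<~'LOG'
  [----] W, [2018-01-31T11:16:47.238934 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerNode(85)] network/rx_rate missing while query metrics
  [----] W, [2018-01-31T11:16:50.881737 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(333)] no gauge metrics found for [httpd-1-deploy]
  [----] W, [2018-01-31T11:17:26.266991 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2460)] no gauge metrics found for [deployment]
LOG

# Hypothetical pattern: captures the class name in front of the numeric id.
ENTITY_PATTERN = /Metrics missing: \[(\w+)\(\d+\)\]/

# Count matching lines per entity type; lines without a match are ignored.
def tally_by_type(text)
  text.each_line.each_with_object(Hash.new(0)) do |line, counts|
    match = ENTITY_PATTERN.match(line)
    counts[match[1]] += 1 if match
  end
end

counts = tally_by_type(SAMPLE_LOG)
counts.each { |type, n| puts "#{type}: #{n}" }
```

Counting the 28 lines above by hand gives 1 ContainerNode, 14 ContainerGroup, and 13 Container entries, which is what the same tally should report for the full output.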

Member

miq-bot commented Jan 31, 2018

Checked commits yaacov/manageiq-providers-kubernetes@833e43e~...873a30b with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0
3 files checked, 0 offenses detected
Everything looks fine. ⭐

Contributor

cben commented Jan 31, 2018

@Ladas can we merge?

Contributor

@Ladas Ladas left a comment

@cben yeah, looks good. Going forward, we should find a way to alert users about failing pods, since they also cause an event storm, among other things.

@cben cben merged commit 9769743 into ManageIQ:master Jan 31, 2018
@cben cben added this to the Sprint 78 Ending Jan 29, 2018 milestone Jan 31, 2018
@cben cben self-assigned this Jan 31, 2018
Member Author

yaacov commented Jan 31, 2018

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1537195
target: 5.9.1

Member Author

yaacov commented Jan 31, 2018

Note:
When backporting, this must go together with #227.

simaishi pushed a commit that referenced this pull request Mar 6, 2018

simaishi commented Mar 6, 2018

Gaprindashvili backport details:

$ git log -1
commit cdc323e74a8efe921f8c5133b6c4518722b8ea82
Author: Beni Cherniavsky-Paskin <cben@redhat.com>
Date:   Wed Jan 31 17:11:42 2018 +0200

    Merge pull request #228 from yaacov/recatch-no-metrics
    
    recatch the metrics not found error
    (cherry picked from commit 9769743ea2243943ee7f9a01b078676ac75ea109)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1552314
