recatch the metrics not found error #228

yaacov · 2018-01-30T17:48:03Z

Description

In #227 I forgot to re-catch the NoMetricsFoundError and prevent it from escalating to CollectionFailure.

Fix for PR #227

BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1530627
https://bugzilla.redhat.com/show_bug.cgi?id=1537195
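
A minimal sketch of the re-catch described above, not the provider's actual code. The error class definitions, `fetch_metrics`, and `perf_collect_metrics_sketch` are hypothetical stand-ins; only the names `NoMetricsFoundError` and `CollectionFailure` come from this PR. The idea is that the specific "no metrics" case is rescued first and logged, so it never reaches the generic handler that escalates to `CollectionFailure`.

```ruby
# Hypothetical stand-ins: the real classes live in the provider and may
# be defined differently.
class CollectionFailure < StandardError; end
class NoMetricsFoundError < StandardError; end

# Hypothetical collector: raises NoMetricsFoundError for targets without data.
def fetch_metrics(target)
  if target.start_with?("failing")
    raise NoMetricsFoundError, "no gauge metrics found for [#{target}]"
  end
  { "cpu/usage_rate" => 0.5 }
end

def perf_collect_metrics_sketch(target, log)
  fetch_metrics(target)
rescue NoMetricsFoundError => e
  # Expected for failing pods: record one line and return empty results.
  log << "Metrics missing: [#{target}] #{e.message}"
  {}
rescue StandardError => e
  # Anything else is a real problem and is escalated.
  raise CollectionFailure, e.message
end

log = []
perf_collect_metrics_sketch("failing-pod", log) # returns {} and adds one log line
perf_collect_metrics_sketch("healthy-pod", log) # returns the metrics hash
```

Ruby tries `rescue` clauses in order, so the specific clause has to come before the generic one.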

yaacov · 2018-01-30T17:56:22Z

@miq-bot add_label metrics

Looks OK now, waiting for the logs to fill up ...

@cben @moolitayer @Ladas @gtanzillo please review

yaacov · 2018-01-30T18:04:54Z

@cben @moolitayer @Ladas @gtanzillo works for me, without it I have the errors :-)

yaacov · 2018-01-30T18:20:45Z

P.S.
We use _log.debug, but metrics can be missing because of a real problem.
Do we want to switch to _log.warn?

Ladas · 2018-01-30T20:32:50Z

@yaacov so if the Pods are failing, maybe we are still good with .debug? I am thinking that nobody will scan logs for 'why the failing pod misses metrics'? We should see the failing pod in inventory and we should also get events about that fact? So those might be enough.

cben · 2018-01-31T05:43:24Z

If we can't decide, consider also log.info - non-judgemental but present in logs by default :)
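
A small sketch of the trade-off being discussed, using Ruby's standard `Logger` rather than the application's `_log`. It assumes the logger runs at `INFO` level, which is what "present in logs by default" implies; the message texts are made up for illustration.

```ruby
require "logger"
require "stringio"

io = StringIO.new
log = Logger.new(io)
log.level = Logger::INFO # assumed level for this sketch

log.debug("metrics missing (debug)") # below INFO, so it is dropped
log.info("metrics missing (info)")   # written
log.warn("metrics missing (warn)")   # written, tagged WARN

puts io.string
```

At this level a `debug` message never reaches the log, while `info` and `warn` both do and differ only in the severity tag.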

cben · 2018-01-31T05:44:00Z

LGTM

cben · 2018-01-31T05:45:56Z

Please set gaprindashvili/{yes,no}, I lost track of what was merged/backported/reverted :)

yaacov · 2018-01-31T08:07:34Z

Please set gaprindashvili/{yes,no}

My vote is yes, @gtanzillo @moolitayer ?

Ladas · 2018-01-31T08:11:02Z

@cben right, so the thing was that it was flooding the log for @gtanzillo, but that might have been caused only by the exception. @yaacov how often do you see the message in the log?

@yaacov I think we want this also for Fine, right @gtanzillo?

yaacov · 2018-01-31T08:23:03Z

@yaacov how often do you see the message in log?

In a system with failing pods it will log one line per pod every 50 min:
4 failing pods * ~28.8 intervals per day (1440 min / 50 min) ≈ 115 new lines per day
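
The estimate above can be checked with a few lines of Ruby. The 50-minute interval and the 4 failing pods are taken from the comment; the variable names are only illustrative.

```ruby
MINUTES_PER_DAY = 24 * 60 # 1440

interval_minutes = 50
failing_pods = 4

# One log line per failing pod per collection interval.
intervals_per_day = MINUTES_PER_DAY / interval_minutes.to_f
lines_per_day = failing_pods * intervals_per_day

puts "intervals per day: #{intervals_per_day.round(1)}"
puts "log lines per day: #{lines_per_day.round}"
```

The interval count is 28.8 rather than a whole 28, which is why the total comes out near 115 and not 112.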

Ladas · 2018-01-31T08:31:29Z

@yaacov ok, that looks fine, we can switch to warn or info I guess. Could you check without this PR, was the exception causing this to be repeated more? (e.g. if the job was performed over and over and it always failed) Otherwise I am not sure what could have caused the log flood, even if the backtrace has like 200 lines.

yaacov · 2018-01-31T08:43:34Z

Could you check without this PR

looking at it now

yaacov · 2018-01-31T09:23:56Z

Could you check without this PR

@Ladas Update about number of log lines:
for one cycle on my system (the RADAR + 2 clusters I have with failing pods):

cat log/evm.log | grep WARN.*perf_collect_metrics | wc -l
28

It looks like we get one line for each node, pod and container that has no metrics in this cycle:

11:19 $ cat log/evm.log | grep WARN.*perf_collect_metrics
[----] W, [2018-01-31T11:16:47.238934 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerNode(85)] network/rx_rate missing while query metrics
[----] W, [2018-01-31T11:16:50.881737 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(333)] no gauge metrics found for [httpd-1-deploy]
[----] W, [2018-01-31T11:16:56.826146 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(316)] no gauge metrics found for [registry-console-1-deploy]
[----] W, [2018-01-31T11:17:00.008950 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(335)] no gauge metrics found for [memcached-1-deploy]
[----] W, [2018-01-31T11:17:00.889283 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(323)] no gauge metrics found for [jboss-eap-70-1-build]
[----] W, [2018-01-31T11:17:02.758937 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(328)] cpu/usage_rate missing while query metrics
[----] W, [2018-01-31T11:17:06.372275 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(318)] no gauge metrics found for [router-1-deploy]
[----] W, [2018-01-31T11:17:09.014359 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(336)] no gauge metrics found for [postgresql-1-deploy]
[----] W, [2018-01-31T11:17:09.062079 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(317)] no gauge metrics found for [registry-console-2-deploy]
[----] W, [2018-01-31T11:17:12.915836 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(313)] no gauge metrics found for [docker-registry-1-deploy]
[----] W, [2018-01-31T11:17:14.520492 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(332)] no gauge metrics found for [ansible-1-deploy]
[----] W, [2018-01-31T11:17:15.652845 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(324)] no gauge metrics found for [jboss-eap-70-2-build]
[----] W, [2018-01-31T11:17:17.253626 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(319)] network/rx_rate missing while query metrics
[----] W, [2018-01-31T11:17:23.049514 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(321)] no gauge metrics found for [eap-app-1-build]
[----] W, [2018-01-31T11:17:23.477521 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [ContainerGroup(325)] no gauge metrics found for [jboss-eap-70-3-build]
[----] W, [2018-01-31T11:17:26.266991 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2460)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:26.267749 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2441)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:32.260782 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2463)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:33.729404 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2456)] cpu/usage_rate missing while query metrics
[----] W, [2018-01-31T11:17:43.418998 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2451)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:46.931836 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2444)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:47.498166 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2452)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:49.625242 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2453)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:50.224540 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2449)] no gauge metrics found for [sti-build]
[----] W, [2018-01-31T11:17:52.949753 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2464)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:58.748367 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2446)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:17:58.748603 #14937:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2445)] no gauge metrics found for [deployment]
[----] W, [2018-01-31T11:18:01.498400 #14946:2ae86497af50]  WARN -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) Metrics missing: [Container(2461)] no gauge metrics found for [deployment]
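
To see how the 28 lines split by entity type, a short Ruby sketch can tally the class name inside the brackets. The sample array holds the message part of four of the lines above; the regular expression and the variable names are assumptions made for this illustration.

```ruby
# Message tails copied from four of the log lines above.
sample_lines = [
  "Metrics missing: [ContainerNode(85)] network/rx_rate missing while query metrics",
  "Metrics missing: [ContainerGroup(333)] no gauge metrics found for [httpd-1-deploy]",
  "Metrics missing: [ContainerGroup(316)] no gauge metrics found for [registry-console-1-deploy]",
  "Metrics missing: [Container(2460)] no gauge metrics found for [deployment]"
]

# Capture the entity class that precedes the numeric id.
ENTITY_PATTERN = /Metrics missing: \[(\w+)\(\d+\)\]/

counts = sample_lines.each_with_object(Hash.new(0)) do |line, acc|
  entity = line[ENTITY_PATTERN, 1]
  acc[entity] += 1 if entity
end

counts.each { |entity, count| puts "#{entity}: #{count}" }
```

Counted by hand over the full listing above, the 28 lines are 1 ContainerNode, 14 ContainerGroup and 13 Container entries.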

miq-bot · 2018-01-31T09:24:29Z

Checked commits yaacov/manageiq-providers-kubernetes@833e43e~...873a30b with ruby 2.3.3, rubocop 0.52.0, haml-lint 0.20.0, and yamllint 1.10.0
3 files checked, 0 offenses detected
Everything looks fine. ⭐

cben · 2018-01-31T14:00:00Z

@Ladas can we merge?

Ladas

@cben yeah, looks good. Going forward, we should find a way to alert users about failing pods, since they also cause an event storm, among other things.

yaacov · 2018-01-31T16:43:57Z

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1537195
target: 5.9.1

yaacov · 2018-01-31T16:45:28Z

Note:
When backporting, this must go together with #227.

simaishi · 2018-03-06T22:51:40Z

Gaprindashvili backport details:

$ git log -1
commit cdc323e74a8efe921f8c5133b6c4518722b8ea82
Author: Beni Cherniavsky-Paskin <cben@redhat.com>
Date:   Wed Jan 31 17:11:42 2018 +0200

    Merge pull request #228 from yaacov/recatch-no-metrics
    
    recatch the metrics not found error
    (cherry picked from commit 9769743ea2243943ee7f9a01b078676ac75ea109)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1552314

recatch the metrics not found error

833e43e

miq-bot added the metrics label Jan 30, 2018

change debug to warn

873a30b

yaacov force-pushed the recatch-no-metrics branch from 13822ef to 873a30b January 31, 2018 09:16

Ladas approved these changes Jan 31, 2018

cben merged commit 9769743 into ManageIQ:master Jan 31, 2018

cben added this to the Sprint 78 Ending Jan 29, 2018 milestone Jan 31, 2018

cben self-assigned this Jan 31, 2018

cben added the gaprindashvili/yes label Jan 31, 2018

cben mentioned this pull request Jan 31, 2018

Do not raise an error when metrics are missing for one object #227

Merged

simaishi pushed a commit that referenced this pull request Mar 6, 2018

Merge pull request #228 from yaacov/recatch-no-metrics

cdc323e

recatch the metrics not found error (cherry picked from commit 9769743) https://bugzilla.redhat.com/show_bug.cgi?id=1552314

simaishi added the gaprindashvili/backported label Mar 6, 2018

simaishi removed the gaprindashvili/yes label Mar 6, 2018
