
Make capture interval a multiple of 30s #187

Merged · 1 commit into ManageIQ:master from yaacov/metric-capture-interval-60s · Dec 14, 2017

Conversation

@yaacov (Member) commented Dec 10, 2017

Description

TL;DR: This PR changes the query rate from Hawkular/Prometheus to 60s.

Metrics are scraped from k8s/OpenShift once every 30s into Hawkular/Prometheus.
Currently we query data from Hawkular/Prometheus every 20s, which creates responses with missing data.

After discussion we decided to query Hawkular/Prometheus every 60s, because it is a multiple of the base rate (30s) and allows for a missing heartbeat once in a while.
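
To see why, here is a minimal sketch (illustrative only, not project code) counting how many 30s-spaced samples land in each query window: some 20s windows contain none, while every 60s window contains at least one.

    samples = (0..300).step(30).to_a # sample timestamps, in seconds

    # Count how many samples fall into each window of the given width.
    def samples_per_window(samples, width, span = 300)
      (0...span).step(width).map do |start|
        samples.count { |t| t >= start && t < start + width }
      end
    end

    samples_per_window(samples, 20) # => [1, 1, 0, 1, 1, 0, ...] some windows are empty
    samples_per_window(samples, 60) # => [2, 2, 2, 2, 2]         every window has data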

Removed from #159

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1524183

@yaacov (Member, Author) commented Dec 10, 2017

@miq-bot add_label bug, metrics

@cben @moolitayer @Ladas @agrare please review

@bazulay commented Dec 10, 2017

@Ladas @agrare @zeari
Are we sure this change does not mess up the rollups?

@cben (Contributor) commented Dec 10, 2017

SGTM based on verbal explanation of expected effects, but that doesn't mean much :-)

Would love if we had any tests:

  • showing metric rows written to the DB?
  • an integration test showing a user-visible problem fixed?

@yaacov (Member, Author) commented Dec 10, 2017

@cben 👍 about specs:

Would love if we had any tests:
showing metric rows written to DB?

The rspec is updated for the new interval. It does not check the actual rows written to the DB, but it does check the methods that write the data to the DB, and this spec will fail if a different interval is used.

integration test showing a user-visible problem fixed?

We do not have an integration test framework (?), but you can check that this is fixed by running commands in a rails console:

Before the fix, we get 40s and 20s intervals, with a missing interval between the 1st and 2nd data points:

prov = ExtManagementSystem.find(<ID OF CONTAINER PROVIDER>)
container = prov.containers.find(<ID OF A CONTAINER>)
manager = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(container)
manager.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)

  { 2017-12-10 14:29:46 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:30:26 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:30:46 UTC=>{"cpu/usage_rate"=>...},

After the fix, we get 60s intervals consistently:

prov = ExtManagementSystem.find(<ID OF CONTAINER PROVIDER>)
container = prov.containers.find(<ID OF A CONTAINER>)
manager = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(container)
manager.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)

  { 2017-12-10 14:45:24 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:46:24 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:47:24 UTC=>{"cpu/usage_rate"=>...},

@cben (Contributor) commented Dec 10, 2017

What I had in mind is precisely these gaps in metrics that previously resulted from some 20s intervals not containing any 30s-spaced samples.

I think the existing test structure is good; it doesn't have to be more of an "integration" test, it just needs more data.
The input data, both before and after, is only 2 samples, and the expected output is only 1 timestamp — that didn't demonstrate the gaps and doesn't (convincingly) demonstrate that we now have no gaps.

  • The new test input data is 60s-spaced. Is that what we get in reality when we ask Hawkular for 60s resolution?

P.S. The tests do some stubbing on HawkularLegacyCaptureContext. Is that what the tests actually run, or HawkularCaptureContext? Should they test both (or maybe all 3, including Prometheus)?

@yaacov (Member, Author) commented Dec 10, 2017

both before and after, is only 2 samples, and expected output is only 1 timestamp — didn't demonstrate the gaps and doesn't (convincingly) demonstrate we now have no gaps.

This is simulated data to show that we calculate the rate correctly over a ΔT of 60s instead of 20s. We compute

ΔV / ΔT

We need to check this calculation because ΔT always equals the interval (even if we have no data), so if we have missing values we get ΔV == 0 while ΔT still equals the interval, which results in a wrong rate value; more simulated data would not change the test of the calculation.
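
A hypothetical worked example of that failure mode (the numbers and names are illustrative, not from the codebase): with a cumulative counter sampled every 30s, a 20s window can contain no fresh sample, so ΔV is 0 while ΔT is still 20s.

    INTERVAL = 20 # the old query interval, in seconds

    def rate(curr_value, prev_value, interval)
      (curr_value - prev_value) / interval.to_f
    end

    # A window with no fresh sample: the counter appears unchanged.
    rate(100, 100, INTERVAL) # => 0.0  wrong, usage is reported as zero
    # The next window then "catches up" with an inflated delta:
    rate(160, 100, INTERVAL) # => 3.0  too high, the delta really spans ~40s
    # With a 60s interval every window holds at least one fresh sample,
    # so the delta and the interval stay consistent.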

New test input data is 60s-spaced. Is that what we get in reality when we ask hawkular for 60s resolution?

Yes, you can run the example code from #187 (comment) on a Hawkular or Prometheus provider; we also have a VCR recording in the HawkularCaptureContext rspec [1] that shows no gaps with a 60s interval.

tests do some stubbing on HawkularLegacyCaptureContext. Is that what tests actually run, or HawkularCaptureContext? Should they test both (or maybe all 3 including prometheus)?

No, only HawkularLegacyCaptureContext does the calculation; the new Hawkular and Prometheus code gets the diff values pre-calculated, so it is immune to the missing-data calculation error.

[1] The rspec in the metrics capture subclasses checks that we do not have gaps:
the current VCR recording (after #159) uses a 60s interval and checks that we have ~10 samples over a 10m time span.
https://github.com/ManageIQ/manageiq-providers-kubernetes/blob/master/spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_spec.rb#L91

@Ladas (Contributor) commented Dec 11, 2017

@yaacov can we align the data samples to always be at the beginning of the minute? So 12:20:00, 12:21:00, etc. There is aligning code that will store them at the closest 20s, and that might lead to different values, which could theoretically break the rollups.

@agrare @Fryguy I haven't found any code that requires the data to be 20s-aligned; have I missed anything? (The OpenShift provider has been breaking this condition for a while.) But there is code relying on the alignment, like when we put multiple metrics into one sample (cpu, mem, net) and when we sum samples into 1 value (network inbound + outbound).

So from what I see, alignment to :00 should serve us the same as 20s alignment.
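
A quick sanity check of that claim (a throwaway sketch, not project code): any timestamp on a whole minute is also on a 20s boundary, because 60 is a multiple of 20.

    t = Time.utc(2017, 12, 11, 12, 21, 0) # a :00-aligned sample
    t.to_i % 60 # => 0, minute-aligned
    t.to_i % 20 # => 0, therefore also 20s-aligned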


Also, using rates, we should not have a problem with accuracy. But using the old way, I think we need to calculate with the 'max' sample instead of 'avg'. Otherwise the values could again be bigger than 100% for e.g. a 1-CPU-core machine (a visible bug in the data), since we will have 2 samples in the 60s window, and the distance between the avg of the current window and the avg of the previous one is not 60s.

@yaacov (Member, Author) commented Dec 11, 2017

can we align the data samples to be always at beginning of the minute?

@Ladas sure :-) done

example:

reload!; c = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(ExtManagementSystem.find(89).containers.last)
c.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)
  { 2017-12-11 08:09:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:10:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:11:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:12:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:13:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:14:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>14995456.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07713315158685252} }

@yaacov (Member, Author) commented Dec 11, 2017

But using the old way, I think we need to calculate with 'max' sample, instead of 'avg'.

This is not really a problem, as avg == min == max when we have one sample :-)
But you are right, fixing 👍

@@ -51,6 +51,9 @@ def log_severity
   }

   def capture_context(ems, target, start_time, end_time)
+    # make start_time align with INTERVAL
+    start_time = (start_time.to_i / INTERVAL.to_i) * INTERVAL.to_i

@Ladas (Contributor) commented on the diff:

will this result in the same?

    start_time = start_time.beginning_of_minute

looks more readable

@yaacov (Member, Author) replied:

@Ladas nice!

will fix the specs and then use this 👍
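
For what it's worth, the two forms agree on the epoch seconds when INTERVAL is 60 (a quick check assuming ActiveSupport is loaded, as it is in ManageIQ; note the integer arithmetic yields an Integer while beginning_of_minute yields a Time):

    require 'active_support/core_ext/time'

    INTERVAL = 60
    start_time = Time.utc(2017, 12, 11, 12, 21, 42)

    a = (start_time.to_i / INTERVAL.to_i) * INTERVAL.to_i # Integer epoch seconds
    b = start_time.beginning_of_minute.to_i
    a == b # => true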

@@ -110,7 +110,7 @@ def compute_derivative(counters)
   {
     'start' => n['start'],
     'end'   => n['end'],
-    'avg'   => n['avg'] - prv['avg']
+    'max'   => n['max'] - prv['max']

@Ladas (Contributor) commented Dec 11, 2017:

is n['start'] the same as prv['end']?

@yaacov (Member, Author) replied:

not if we are missing a window ... but usually yes

@Ladas (Contributor) commented Dec 11, 2017:

right, so is it safer to use

    'start' => prv['end'] ?

although we do not use this information in the calculation. Should we rather use

    (n['max'] - prv['max']) / (1.kilobyte.to_f * (n['end'] - prv['end']))

here, instead of using @interval above?

@yaacov (Member, Author) replied:

@Ladas 👍 nice catch, I saw that a couple of times, didn't fix it immediately, and forgot about it :-)
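
A sketch of the idea under discussion (hypothetical shape and names, not the merged code): compute the rate from the buckets' own timestamps rather than a fixed @interval, so a missing bucket stretches the denominator instead of inflating the rate.

    # counters: buckets sorted by time, each with 'end' (epoch seconds) and 'max'
    def derivative(counters)
      counters.each_cons(2).map do |prv, n|
        span = n['end'] - prv['end'] # real time span; survives a missing bucket
        {
          'start' => prv['end'],
          'end'   => n['end'],
          'rate'  => (n['max'] - prv['max']) / span.to_f
        }
      end
    end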

      @ts_values[timestamp]['mem_usage_absolute_average'] = avg_usage
    end
  end

  def process_net_counters_rate(counters_rate)
    @metrics |= ['net_usage_rate_average'] unless counters_rate.empty?
    counters_rate.each do |x|
      interval = (x['end'] - x['start']) / 1.in_milliseconds

@Ladas (Contributor) commented on the diff:

right, this is true only if n['start'] is the same as prv['end'], right?

otherwise we would need to move the calculation below, where we have the n and prv elements available?

@yaacov (Member, Author) commented Dec 11, 2017:

@Ladas 👍 fixed, now x['end'] is defined as next['start'] on line 114.

@yaacov force-pushed the metric-capture-interval-60s branch 8 times, most recently from f593931 to 282e287 on December 11, 2017 17:56
@yaacov (Member, Author) commented Dec 11, 2017

@miq-bot add_label compute/containers, gaprindashvili/yes

BZ target 5.9

@miq-bot (Member) commented Dec 11, 2017

@yaacov Cannot apply the following label because they are not recognized: compute/containers

-    'avg'   => n['avg'] - prv['avg']
+    'start' => prv['start'],
+    'end'   => n['start'],
+    'max'   => n['max'] - prv['max']

@Ladas (Contributor) commented Dec 12, 2017:

hm, if we do <prv['start'], n['start']>, shouldn't we use n['min'] - prv['min'], since the samples are on the start timestamp?

Also, it's still 'avg' => n['max'] - prv['max'], right? Since we are computing the avg between 2 samples?

the data look like this, right?

prv_start---------------(prv_end, n_start)---------------n_end
prv_min-----------------(prv_max, n_min)-----------------n_max

or probably the actual min & max samples are not aligned?

prv_start---------------(prv_end, n_start)---------------n_end
----prv_min--------prv_max------------n_min--------n_max--

Can we actually get the real start time of each sample? Because if we are using the bucket start/end, that would explain why I saw bad values (like more than 100% cpu on a 1-core machine).

@yaacov (Member, Author) replied:

hm, if we do <prv['start'], n['start']> shouldn't we use n['min'] - prv['min'], since samples are on the start timestamp?

Right 👍 will update to ['min'].

probably the actual min & max samples are not aligned?

Correct, but we are talking about 30s samples over a 60s window:
a. we have 1..2 samples per window => min ~= max ~= avg
b. the maximum sample-time error is < 30s

Can we actually get the real start time of each sample?

Hawkular:
Not without changing too much of the code; we use Hawkular to sort the samples (and it drops the real sample time info), and if we ask Hawkular to give us raw data, we would have to add code to do the sorting ourselves.

Prometheus:
No way to know the raw sample time, by design (the max sample time error is 30s :-) )

P.S.
This code is obsolete. It is only a fallback in case we cannot get the rate already calculated from Hawkular; OpenShift v3.5 and above gives us the already-calculated rate, and this code is not called any more. See #159.

@yaacov (Member, Author) commented Dec 12, 2017

@Ladas @moolitayer @cben @bazulay

Points to note when reviewing:
a. This code is only relevant for OpenShift v3.4 and below; OpenShift v3.5 and above does not use this code.
b. We do not support v3.4 any more.
c. This code will be removed soon, when we decide to drop support for old OpenShifts.

@Ladas (Contributor) left a review:

Looks good to me now. The legacy code is as precise as it can be now.

@yaacov (Member, Author) commented Dec 12, 2017

@Ladas thanks 👍

cc @cben @moolitayer

# ^ (real sample time) ^ (real sample time) ^ (real sample time)
# ^ (real sample value) ^ (real sample value) ^ (real sample value)
# we use:
# (T = start of window timestamp, V = min value of winodw samples)
A reviewer (Contributor) commented on the diff:

s/winodw/window

@yaacov (Member, Author) replied:

replaced only once :-)

@yaacov force-pushed the metric-capture-interval-60s branch 2 times, most recently from 7cbe3b8 to fb03951 on December 12, 2017 21:33
@@ -97,20 +99,29 @@ def compute_summation(data)
   # Add min, median, max, percentile95th, etc. if needed
   {
     'start' => k,
-    'end'   => [sum['end'], n['end']].max,
-    'avg'   => sum['avg'] + n['avg']
+    'end'   => [sum['end'], n['end']].min,

A reviewer (Contributor) commented on the new 'end' line:

this one I'm not sure about — if end is supposed to be the last timestamp of the summation period, this should remain max?

@yaacov (Member, Author) replied:

r/max/min/g :-)

-    'end'   => [sum['end'], n['end']].max,
-    'avg'   => sum['avg'] + n['avg']
+    'end'   => [sum['end'], n['end']].min,
+    'min'   => sum['min'] + n['min']

A reviewer (Contributor) commented on the new 'min' line:

this is weird both before and after.
the average over many windows is not the sum of the sub-window averages (it should be divided by the count)
the minimum over many windows is not the sum of the sub-window minimums
?

@yaacov (Member, Author) replied:

r/max/min/g :-) again ...

this is weird both before and after.

place sad emoji here :-)

the average over many windows is not the sum of the sub-window averages (it should be divided by the count)
the minimum over many windows is not the sum of the sub-window minimums
?

This function just sums the tx + rx values to give net_total; there is no diff here, so summation is OK.

A reviewer (Contributor) replied:

ah, that's what I missed, I thought it was summing along the time axis. 👍
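
A minimal illustration of that point (hypothetical data, not project code): the summation merges the rx and tx series at each timestamp into a single net_total sample, so adding the per-counter values is the right operation.

    rx = [{ 'start' => 0, 'end' => 60, 'min' => 10.0 }]
    tx = [{ 'start' => 0, 'end' => 60, 'min' => 4.0 }]

    net_total = (rx + tx).group_by { |s| s['start'] }.map do |start, samples|
      {
        'start' => start,
        'end'   => samples.map { |s| s['end'] }.min,
        'min'   => samples.map { |s| s['min'] }.inject(:+) # rx + tx at this timestamp
      }
    end
    # => [{"start"=>0, "end"=>60, "min"=>14.0}]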

@miq-bot (Member) commented Dec 12, 2017

Checked commit yaacov@e56db5d with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
6 files checked, 0 offenses detected
Everything looks fine. 🍪

@cben (Contributor) left a review:

LGTM

@moolitayer left a review:

LGTM 👍

@moolitayer moolitayer merged commit 9be0e66 into ManageIQ:master Dec 14, 2017
@moolitayer moolitayer added this to the Sprint 76 Ending Jan 1, 2018 milestone Dec 14, 2017
simaishi pushed a commit that referenced this pull request Dec 18, 2017
@simaishi commented:

Gaprindashvili backport details:

$ git log -1
commit 767ed8106f6808a2d83175b9773f191a86893cbd
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Thu Dec 14 14:00:16 2017 +0200

    Merge pull request #187 from yaacov/metric-capture-interval-60s
    
    Make capture interval a multiple of 30s
    (cherry picked from commit 9be0e668b23e822aeaf3eef960fcb3a0f1fa04cb)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1527094
