
Make capture interval a multiple of 30s #187

Merged · 1 commit into ManageIQ:master from yaacov/metric-capture-interval-60s · Dec 14, 2017

Conversation

@yaacov (Member) commented Dec 10, 2017

Description

TL;DR: This PR changes the query rate from Hawkular/Prometheus to 60s.

Metrics are scraped from k8s/OpenShift once every 30s into Hawkular/Prometheus.
Currently we query data from Hawkular/Prometheus every 20s, which creates responses with missing data.

After discussion we decided to query Hawkular/Prometheus every 60s, because it is a multiple of the base rate (30s) and allows for a missing heartbeat once in a while.
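
To see why, here is a minimal sketch (illustrative only, not project code) counting how many 30s-spaced samples land in each query window: some 20s windows contain none, while every 60s window contains at least one.

    samples = (0..300).step(30).to_a # sample timestamps, in seconds

    # Count how many samples fall into each window of the given width.
    def samples_per_window(samples, width, span = 300)
      (0...span).step(width).map do |start|
        samples.count { |t| t >= start && t < start + width }
      end
    end

    samples_per_window(samples, 20) # => [1, 1, 0, 1, 1, 0, ...] some windows are empty
    samples_per_window(samples, 60) # => [2, 2, 2, 2, 2]         every window has data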

Removed from #159

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1524183

@yaacov (Member, Author) commented Dec 10, 2017

@miq-bot add_label bug, metrics

@cben @moolitayer @Ladas @agrare please review

@bazulay commented Dec 10, 2017

@Ladas @agrare @zeari
Are we sure this change does not mess up the rollups?

@cben (Contributor) commented Dec 10, 2017

SGTM based on verbal explanation of expected effects, but that doesn't mean much :-)

Would love if we had any tests:

  • showing metric rows written to the DB?
  • an integration test showing a user-visible problem fixed?

@yaacov (Member, Author) commented Dec 10, 2017

@cben 👍 about specs:

Would love if we had any tests:
showing metric rows written to DB?

The rspec is updated for the new interval. It does not check the actual rows written to the DB, but it does check the methods that write the data to the DB, and this spec will fail if a different interval is used.

integration test showing a user-visible problem fixed?

We do not have an integration test framework (?), but you can check that this is fixed by running commands in a rails console:

Before the fix, we get 40s and 20s intervals, with a missing interval between the 1st and 2nd data points:

prov = ExtManagementSystem.find(<ID OF CONTAINER PROVIDER>)
container = prov.containers.find(<ID OF A CONTAINER>)
manager = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(container)
manager.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)

  { 2017-12-10 14:29:46 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:30:26 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:30:46 UTC=>{"cpu/usage_rate"=>...},

After the fix, we get 60s intervals consistently:

prov = ExtManagementSystem.find(<ID OF CONTAINER PROVIDER>)
container = prov.containers.find(<ID OF A CONTAINER>)
manager = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(container)
manager.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)

  { 2017-12-10 14:45:24 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:46:24 UTC=>{"cpu/usage_rate"=>...},
    2017-12-10 14:47:24 UTC=>{"cpu/usage_rate"=>...},

@cben (Contributor) commented Dec 10, 2017

What I had in mind is precisely these gaps in metrics that previously resulted from some 20s intervals not containing any 30s-spaced samples.

I think the existing test structure is good; it doesn't have to be more of an "integration" test, it just needs more data.
The input data, both before and after, is only 2 samples, and the expected output is only 1 timestamp — that didn't demonstrate the gaps and doesn't (convincingly) demonstrate that we now have no gaps.

  • The new test input data is 60s-spaced. Is that what we get in reality when we ask Hawkular for 60s resolution?

P.S. The tests do some stubbing on HawkularLegacyCaptureContext. Is that what the tests actually run, or HawkularCaptureContext? Should they test both (or maybe all 3, including Prometheus)?

@yaacov (Member, Author) commented Dec 10, 2017

both before and after, is only 2 samples, and expected output is only 1 timestamp — didn't demonstrate the gaps and doesn't (convincingly) demonstrate we now have no gaps.

This is simulated data to show that we calculate the rate correctly over a ΔT of 60s instead of 20s. We compute

ΔV / ΔT

We need to check this calculation because ΔT always equals the interval (even if we have no data), so if we have missing values we get ΔV == 0 while ΔT still equals the interval, which results in a wrong rate value; more simulated data would not change the test of the calculation.
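
A hypothetical worked example of that failure mode (the numbers and names are illustrative, not from the codebase): with a cumulative counter sampled every 30s, a 20s window can contain no fresh sample, so ΔV is 0 while ΔT is still 20s.

    INTERVAL = 20 # the old query interval, in seconds

    def rate(curr_value, prev_value, interval)
      (curr_value - prev_value) / interval.to_f
    end

    # A window with no fresh sample: the counter appears unchanged.
    rate(100, 100, INTERVAL) # => 0.0  wrong, usage is reported as zero
    # The next window then "catches up" with an inflated delta:
    rate(160, 100, INTERVAL) # => 3.0  too high, the delta really spans ~40s
    # With a 60s interval every window holds at least one fresh sample,
    # so the delta and the interval stay consistent.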

New test input data is 60s-spaced. Is that what we get in reality when we ask hawkular for 60s resolution?

Yes, you can run the example code from #187 (comment) on a Hawkular or Prometheus provider; we also have a VCR recording in the HawkularCaptureContext rspec [1] that shows no gaps with a 60s interval.

tests do some stubbing on HawkularLegacyCaptureContext. Is that what tests actually run, or HawkularCaptureContext? Should they test both (or maybe all 3 including prometheus)?

No, only HawkularLegacyCaptureContext does the calculation; the new Hawkular and Prometheus code gets the diff values pre-calculated, so it is immune to the missing-data calculation error.

[1] The rspec in the metrics capture subclasses checks that we do not have gaps:
the current VCR recording (after #159) uses a 60s interval and checks that we have ~10 samples over a 10m time span.
https://github.com/ManageIQ/manageiq-providers-kubernetes/blob/master/spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context_spec.rb#L91

@Ladas (Contributor) commented Dec 11, 2017

@yaacov can we align the data samples to always be at the beginning of the minute? So 12:20:00, 12:21:00, etc. There is aligning code that will store them at the closest 20s, and that might lead to different values, which could theoretically break the rollups.

@agrare @Fryguy I haven't found any code that requires the data to be 20s-aligned; have I missed anything? (The OpenShift provider has been breaking this condition for a while.) But there is code relying on the alignment, like when we put multiple metrics into one sample (cpu, mem, net) and when we sum samples into 1 value (network inbound + outbound).

So from what I see, alignment to :00 should serve us the same as 20s alignment.
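
A quick sanity check of that claim (a throwaway sketch, not project code): any timestamp on a whole minute is also on a 20s boundary, because 60 is a multiple of 20.

    t = Time.utc(2017, 12, 11, 12, 21, 0) # a :00-aligned sample
    t.to_i % 60 # => 0, minute-aligned
    t.to_i % 20 # => 0, therefore also 20s-aligned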


Also, using rates, we should not have a problem with accuracy. But using the old way, I think we need to calculate with the 'max' sample instead of 'avg'. Otherwise the values could again be bigger than 100% for e.g. a 1-CPU-core machine (a visible bug in the data), since we will have 2 samples in the 60s window, and the distance between the avg of the current window and the avg of the previous one is not 60s.

@yaacov (Member, Author) commented Dec 11, 2017

can we align the data samples to be always at beginning of the minute?

@Ladas sure :-) done

example:

reload!; c = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture.new(ExtManagementSystem.find(89).containers.last)
c.perf_collect_metrics('realtime', 5.minutes.ago, 1.minutes.ago)
  { 2017-12-11 08:09:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:10:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:11:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:12:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:13:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>15003648.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07717528933696827},
    2017-12-11 08:14:00 UTC=>{"cpu/usage_rate"=>0.0, "memory/usage"=>14995456.0, "cpu_usage_rate_average"=>0.0, "mem_usage_absolute_average"=>0.07713315158685252} }

@yaacov (Member, Author) commented Dec 11, 2017

But using the old way, I think we need to calculate with 'max' sample, instead of 'avg'.

This is not really a problem, as avg == min == max when we have one sample :-)
But you are right, fixing 👍

@@ -51,6 +51,9 @@ def log_severity
   }

   def capture_context(ems, target, start_time, end_time)
+    # make start_time align with INTERVAL
+    start_time = (start_time.to_i / INTERVAL.to_i) * INTERVAL.to_i

@Ladas (Contributor) commented on the diff:

will this result in the same?

    start_time = start_time.beginning_of_minute

looks more readable

@yaacov (Member, Author) replied:

@Ladas nice!

will fix the specs and then use this 👍
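
For what it's worth, the two forms agree on the epoch seconds when INTERVAL is 60 (a quick check assuming ActiveSupport is loaded, as it is in ManageIQ; note the integer arithmetic yields an Integer while beginning_of_minute yields a Time):

    require 'active_support/core_ext/time'

    INTERVAL = 60
    start_time = Time.utc(2017, 12, 11, 12, 21, 42)

    a = (start_time.to_i / INTERVAL.to_i) * INTERVAL.to_i # Integer epoch seconds
    b = start_time.beginning_of_minute.to_i
    a == b # => true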

@@ -110,7 +110,7 @@ def compute_derivative(counters)
   {
     'start' => n['start'],
     'end'   => n['end'],
-    'avg'   => n['avg'] - prv['avg']
+    'max'   => n['max'] - prv['max']

@Ladas (Contributor) commented Dec 11, 2017:

is n['start'] the same as prv['end']?

@yaacov (Member, Author) replied:

not if we are missing a window ... but usually yes

@Ladas (Contributor) commented Dec 11, 2017:

right, so is it safer to use

    'start' => prv['end'] ?

although we do not use this information in the calculation. Should we rather use

    (n['max'] - prv['max']) / (1.kilobyte.to_f * (n['end'] - prv['end']))

here, instead of using @interval above?

@yaacov (Member, Author) replied:

@Ladas 👍 nice catch, I saw that a couple of times, didn't fix it immediately, and forgot about it :-)
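
A sketch of the idea under discussion (hypothetical shape and names, not the merged code): compute the rate from the buckets' own timestamps rather than a fixed @interval, so a missing bucket stretches the denominator instead of inflating the rate.

    # counters: buckets sorted by time, each with 'end' (epoch seconds) and 'max'
    def derivative(counters)
      counters.each_cons(2).map do |prv, n|
        span = n['end'] - prv['end'] # real time span; survives a missing bucket
        {
          'start' => prv['end'],
          'end'   => n['end'],
          'rate'  => (n['max'] - prv['max']) / span.to_f
        }
      end
    end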

      @ts_values[timestamp]['mem_usage_absolute_average'] = avg_usage
    end
  end

  def process_net_counters_rate(counters_rate)
    @metrics |= ['net_usage_rate_average'] unless counters_rate.empty?
    counters_rate.each do |x|
      interval = (x['end'] - x['start']) / 1.in_milliseconds

@Ladas (Contributor) commented on the diff:

right, this is true only if n['start'] is the same as prv['end'], right?

otherwise we would need to move the calculation below, where we have the n and prv elements available?

@yaacov (Member, Author) commented Dec 11, 2017:

@Ladas 👍 fixed, now x['end'] is defined as next['start'] on line 114.

@yaacov force-pushed the metric-capture-interval-60s branch 8 times, most recently from f593931 to 282e287 on December 11, 2017 17:56
@yaacov (Member, Author) commented Dec 11, 2017

@miq-bot add_label compute/containers, gaprindashvili/yes

BZ target 5.9

@miq-bot (Member) commented Dec 11, 2017

@yaacov Cannot apply the following label because they are not recognized: compute/containers

-    'avg'   => n['avg'] - prv['avg']
+    'start' => prv['start'],
+    'end'   => n['start'],
+    'max'   => n['max'] - prv['max']

@Ladas (Contributor) commented Dec 12, 2017:

hm, if we do <prv['start'], n['start']>, shouldn't we use n['min'] - prv['min'], since the samples are on the start timestamp?

Also, it's still 'avg' => n['max'] - prv['max'], right? Since we are computing the avg between 2 samples?

the data look like this, right?

prv_start---------------(prv_end, n_start)---------------n_end
prv_min-----------------(prv_max, n_min)-----------------n_max

or probably the actual min & max samples are not aligned?

prv_start---------------(prv_end, n_start)---------------n_end
----prv_min--------prv_max------------n_min--------n_max--

Can we actually get the real start time of each sample? Because if we are using the bucket start/end, that would explain why I saw bad values (like more than 100% cpu on a 1-core machine).

@yaacov (Member, Author) replied:

hm, if we do <prv['start'], n['start']> shouldn't we use n['min'] - prv['min'], since samples are on the start timestamp?

Right 👍 will update to ['min'].

probably the actual min & max samples are not aligned?

Correct, but we are talking about 30s samples over a 60s window:
a. we have 1..2 samples per window => min ~= max ~= avg
b. the maximum sample-time error is < 30s

Can we actually get the real start time of each sample?

Hawkular:
Not without changing too much of the code; we use Hawkular to sort the samples (and it drops the real sample time info), and if we ask Hawkular to give us raw data, we would have to add code to do the sorting ourselves.

Prometheus:
No way to know the raw sample time, by design (the max sample time error is 30s :-) )

P.S.
This code is obsolete. It is only a fallback in case we cannot get the rate already calculated from Hawkular; OpenShift v3.5 and above gives us the already-calculated rate, and this code is not called any more. See #159.

@yaacov (Member, Author) commented Dec 12, 2017

@Ladas @moolitayer @cben @bazulay

Points to note when reviewing:
a. This code is only relevant for OpenShift v3.4 and below; OpenShift v3.5 and above does not use this code.
b. We do not support v3.4 any more.
c. This code will be removed soon, when we decide to drop support for old OpenShifts.

@Ladas (Contributor) left a review:

Looks good to me now. The legacy code is as precise as it can be now.

@yaacov (Member, Author) commented Dec 12, 2017

@Ladas thanks 👍

cc @cben @moolitayer

# ^ (real sample time) ^ (real sample time) ^ (real sample time)
# ^ (real sample value) ^ (real sample value) ^ (real sample value)
# we use:
# (T = start of window timestamp, V = min value of winodw samples)
A reviewer (Contributor) commented on the diff:

s/winodw/window

@yaacov (Member, Author) replied:

replaced only once :-)

@yaacov force-pushed the metric-capture-interval-60s branch 2 times, most recently from 7cbe3b8 to fb03951 on December 12, 2017 21:33
@@ -97,20 +99,29 @@ def compute_summation(data)
   # Add min, median, max, percentile95th, etc. if needed
   {
     'start' => k,
-    'end'   => [sum['end'], n['end']].max,
-    'avg'   => sum['avg'] + n['avg']
+    'end'   => [sum['end'], n['end']].min,

A reviewer (Contributor) commented on the new 'end' line:

this one I'm not sure about — if end is supposed to be the last timestamp of the summation period, this should remain max?

@yaacov (Member, Author) replied:

r/max/min/g :-)

-    'end'   => [sum['end'], n['end']].max,
-    'avg'   => sum['avg'] + n['avg']
+    'end'   => [sum['end'], n['end']].min,
+    'min'   => sum['min'] + n['min']

A reviewer (Contributor) commented on the new 'min' line:

this is weird both before and after.
the average over many windows is not the sum of the sub-window averages (it should be divided by the count)
the minimum over many windows is not the sum of the sub-window minimums
?

@yaacov (Member, Author) replied:

r/max/min/g :-) again ...

this is weird both before and after.

place sad emoji here :-)

the average over many windows is not the sum of the sub-window averages (it should be divided by the count)
the minimum over many windows is not the sum of the sub-window minimums
?

This function just sums the tx + rx values to give net_total; there is no diff here, so summation is OK.

A reviewer (Contributor) replied:

ah, that's what I missed, I thought it was summing along the time axis. 👍
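
A minimal illustration of that point (hypothetical data, not project code): the summation merges the rx and tx series at each timestamp into a single net_total sample, so adding the per-counter values is the right operation.

    rx = [{ 'start' => 0, 'end' => 60, 'min' => 10.0 }]
    tx = [{ 'start' => 0, 'end' => 60, 'min' => 4.0 }]

    net_total = (rx + tx).group_by { |s| s['start'] }.map do |start, samples|
      {
        'start' => start,
        'end'   => samples.map { |s| s['end'] }.min,
        'min'   => samples.map { |s| s['min'] }.inject(:+) # rx + tx at this timestamp
      }
    end
    # => [{"start"=>0, "end"=>60, "min"=>14.0}]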

@miq-bot (Member) commented Dec 12, 2017

Checked commit yaacov@e56db5d with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
6 files checked, 0 offenses detected
Everything looks fine. 🍪

@cben (Contributor) left a review:

LGTM

@moolitayer left a review:

LGTM 👍

@moolitayer moolitayer merged commit 9be0e66 into ManageIQ:master Dec 14, 2017
@moolitayer moolitayer added this to the Sprint 76 Ending Jan 1, 2018 milestone Dec 14, 2017
simaishi pushed a commit that referenced this pull request Dec 18, 2017
@simaishi commented:

Gaprindashvili backport details:

$ git log -1
commit 767ed8106f6808a2d83175b9773f191a86893cbd
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Thu Dec 14 14:00:16 2017 +0200

    Merge pull request #187 from yaacov/metric-capture-interval-60s
    
    Make capture interval a multiple of 30s
    (cherry picked from commit 9be0e668b23e822aeaf3eef960fcb3a0f1fa04cb)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1527094
