
Use Hawkular memory tag working-set instead of usage #305

Merged
merged 2 commits on Nov 20, 2018

Conversation

yaacov
Member

@yaacov yaacov commented Nov 18, 2018

Description

ManageIQ includes cached memory when showing used memory in the dashboard and reports; some users want to see only the working-set value as memory usage in the dashboard.

Definitions of memory tags used in Hawkular:
memory/usage - Total memory usage, including cache.
memory/working_set - Total working set usage. Working set is the memory being used and not easily dropped by the kernel.
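The relationship between the two tags can be checked against the sample numbers in the Testing section below. A toy sketch (the cache value here is simply the difference of the two sampled tags, shown for illustration only):

```ruby
# Toy illustration of the two Hawkular memory tags, using the byte values
# from the "Before" sample in the Testing section.
usage       = 7_406_342_144.0 # memory/usage: working set + page cache
working_set = 5_029_109_760.0 # memory/working_set: not easily reclaimable
cache       = usage - working_set # the part the kernel can drop

# The working set is what users expect to see as "memory usage".
```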

This PR replaces the memory value used in reports and dashboards.

Testing
In rails c:

Before:

2.4.0 :001 > prov = ExtManagementSystem.find(6);node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);
2.4.0 :002 > pp context.collect_metrics

...
2018-11-18 14:14:37 UTC=>
  {"cpu/usage_rate"=>142.0,
   "memory/usage"=>7406342144.0,
   "memory/working_set"=>5029109760.0,
   "network/rx_rate"=>2128.1669921875,
   "network/tx_rate"=>12143.740234375,
   "cpu_usage_rate_average"=>7.1,
   "mem_usage_absolute_average"=>38.09846781893004,
   "net_usage_rate_average"=>14.2719072265625}}
...

After:

2.4.0 :001 > prov = ExtManagementSystem.find(6);node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);
2.4.0 :002 > pp context.collect_metrics

...
2018-11-18 14:30:30 UTC=>
  {"cpu/usage_rate"=>135.0,
   "memory/usage"=>7422861312.0,
   "memory/working_set"=>5045604352.0,
   "network/rx_rate"=>2566.94287109375,
   "network/tx_rate"=>10321.9169921875,
   "cpu_usage_rate_average"=>6.75,
   "mem_usage_absolute_average"=>25.95475489711934,
   "net_usage_rate_average"=>12.88885986328125}}
...

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1650351
BZ (old): https://bugzilla.redhat.com/show_bug.cgi?id=1456856

@yaacov
Member Author

yaacov commented Nov 18, 2018

@cben @Ladas @agrare @oourfali please review

@yaacov yaacov force-pushed the optional-working-set-as-memory-tag branch from 6d53366 to edd145c on November 18, 2018 15:03
@cben
Contributor

cben commented Nov 18, 2018

Technically I'd prefer adding a new metric column, adding a new dashboard widget using it, and letting the customer choose which widget they want to see.

  • One concrete reason overloading a single metric is bad: flipping this setting mid-way would make rollups a meaningless mix of old and new meaning.

But I don't know what it takes to add a new metric, and I'm not the one that'd implement it :-)

@yaacov yaacov force-pushed the optional-working-set-as-memory-tag branch 2 times, most recently from aeb74d4 to 3f63a90 on November 18, 2018 15:44
@agrare
Member

agrare commented Nov 18, 2018

Hey @yaacov, I realize you're trying to get this done in a way that can be backported but I'm also 👎 on overloading an existing column.

@yaacov yaacov force-pushed the optional-working-set-as-memory-tag branch from 3f63a90 to 23ec957 on November 18, 2018 15:53
@yaacov
Member Author

yaacov commented Nov 18, 2018

@agrare @cben Hi,

👎 on overloading an existing column.

As @agrare mentioned in the BZ, the "correct" way would be to add a new column to the metrics table. I agree with that observation.

My issues with the "correct" solution are:

  • It's not what users asked for in the BZ; they want the memory usage to be the portion of memory occupied by a process that is held in main memory (RAM). To them the current behaviour (usage = working_set + cache) is a bug.
  • OpenShift shows memory usage using working-set, so to be consistent with OpenShift we need to change the value used for memory-usage rather than add a new field.
  • We already have ~40 fields in the metrics table, 7 of which are dedicated to real-time memory and none to RSS or working-set, so we would have to add a new field called mem_working_set_absolute_average; this means a schema change plus rollup and aggregation changes in ManageIQ core.
  • If we decide to add a new metrics field, we will also need to change / add reports.
  • We will also need to update the dashboards to show the new metrics and document what this new values mean.

p.s.
This PR's purpose is to [ add an option to ] show in the ManageIQ dashboard / reports what users expect to see as "memory-usage"; it may not be the "correct" value, but it is what users expect.

EDIT NOV-19: usage = working_set + cache not ram + swap

@Ladas
Contributor

Ladas commented Nov 19, 2018

@yaacov I get what you are saying, but as @cben says, we can't do this since it would mess up rollups, historic reports, etc. @agrare @Fryguy did we deal with something like this before?

We already have a field for swap: mem_swapped_absolute_average.

For memory usage it depends: for OpenStack infra the customer expected the used memory to be minus cache and minus buffers. Normally it would have RSS in there, but since we didn't put it there, it'll be very hard (impossible) to migrate the existing data.

@yaacov
Member Author

yaacov commented Nov 19, 2018

we can't do this since it would mess up rollups, historic reports

Correct; as @cben and @Ladas mentioned, overloading _usage_ is wrong, we all agree about that, but we need to give the user a fix. Adding a new field is the best solution, but it also has its cons.

@Ladas, how fast do you think you or @adam can push the necessary schema changes and the necessary aggregation, roll-up, and report changes?

Depending on the state of the BZ, I am for waiting with this PR until schema changes get merged.

My two cents on waiting for schema changes:

  • If we close the BZ as "not a bug" because we show what we say we show (e.g. "usage"), then we do not need this PR at all.
  • If we decide this BZ can wait for schema changes, then we can wait with this PR until then.
  • Otherwise, we need to get creative; I think this PR is a good compromise for this case.

@Ladas
Contributor

Ladas commented Nov 19, 2018

@yaacov so, most important question, how does this align to https://github.com/Ladas/manageiq-providers-kubernetes/blob/5565f3738fdbd49979eb2da202ee52dedf23a805/app/models/manageiq/providers/kubernetes/container_manager/refresh_parser.rb#L727

Does that align with memory/working_set or memory/usage? Since we show a % value of the ^ inventory value, right? So is it correct now? Or are you saying that having all pods at 100% memory will again be more than 100% of the node capacity? (That would be a bug.)

(there is also unused code that should be aligned https://github.com/Ladas/manageiq-providers-kubernetes/blob/833e43eaaa0938f3e60d7ec8952a5ed7ecde0176/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/hawkular_capture_context.rb#L264 ? )

@yaacov
Member Author

yaacov commented Nov 19, 2018

Hi all, to sum up discussions on and off line -

Common wisdom now is to add a new field to the schema; @agrare will try to push this forward.

I will change this PR to push data to a not-yet-existent field called mem_working_set_absolute_average; if needed we will push data to one of the unused fields in the metrics table.

@Ladas @agrare @cben WDYT ?


Unused fields in the metrics table:

 mem_vmmemctl_absolute_average: nil,
 mem_vmmemctltarget_absolute_average: nil,
 mem_swapin_absolute_average: nil,
 mem_swapout_absolute_average: nil,
 mem_swapped_absolute_average: nil,
 mem_swaptarget_absolute_average: nil,

We will also try to see if we can add more memory-related fields along the way.

The metrics we can get from Hawkular in OpenShift 3.7.27 are:

memory/limit                    Memory hard limit in bytes.
memory/major_page_faults        Number of major page faults.
memory/major_page_faults_rate   Number of major page faults per second.
memory/node_capacity            Memory capacity of a node.
memory/node_allocatable         Memory allocatable of a node.
memory/node_reservation         Share of memory that is reserved on the node allocatable.
memory/node_utilization         Memory utilization as a share of memory allocatable.
memory/page_faults              Number of page faults.
memory/page_faults_rate         Number of page faults per second.
memory/request                  Memory request (the guaranteed amount of resources) in bytes.
memory/usage                    Total memory usage.
memory/working_set              Total working set usage. Working set is the memory being used and not easily dropped by the kernel.

@Fryguy
Member

Fryguy commented Nov 19, 2018

I am -1 on the option... more options are bad for the product, and in this case it would be a confusing option for customers, as they a) would not understand why they would switch it and b) probably would never switch it (because I would choose the setting's default to be the more accurate working_set).

I'm also -1 on a new column to solve this issue. In my opinion, the original choice of "usage" was just a bug/oversight, and the new choice of "working set" is the correct value and is just a bug fix.

@yaacov
Member Author

yaacov commented Nov 19, 2018

@Fryguy

👍 on just reading working_set instead of usage

Do we need to support pre-3.7.27 OpenShifts that do not have the working_set value?

@agrare
Member

agrare commented Nov 19, 2018

@yaacov so you mentioned that this was missing on 3.7.0-alpha.1, can we find out if it was available on 3.7.0 GA?

@yaacov
Member Author

yaacov commented Nov 19, 2018

@yaacov so you mentioned that this was missing on 3.7.0-alpha.1, can we find out if it was available on 3.7.0 GA?

I will check ... it may take a little time :-)

@Ladas
Contributor

Ladas commented Nov 19, 2018

@yaacov please respond to #305 (comment)

You are changing the used memory, but you are keeping mem_node_capacity, which means one of the two options for used memory must be wrong. Which one is it?

@yaacov
Member Author

yaacov commented Nov 19, 2018

@agrare it looks like it was in 3.7.0 GA (*); it was in Heapster 1.3, which is used in OpenShift 3.7.

(*) kubernetes-retired/heapster#1074

@yaacov
Member Author

yaacov commented Nov 19, 2018

please respond to #305 (comment)

@Ladas I do not understand your question.

Why should fixing a wrong metric have an effect on mem_node_capacity?

We are changing the value from usage (i.e. working_set + cache) to just working_set; it's a smaller number and should have no effect on anything other than fixing the value shown in dashboards and reports.

@Ladas
Contributor

Ladas commented Nov 19, 2018

@yaacov well, no, you are computing the mem_usage_absolute_average as https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/305/files#diff-8e531dadfa2ffca2566fdd1332c47531R174

So it uses used/total, and if you lower the used value, you will get a lower mem_usage_absolute_average.

So should the mem_usage_absolute_average be lower? Meaning we were doing the wrong computation before, and you would see you are using over 100% of your memory if all your resources are consumed?

Or was it correct, and now we'll never see 100% even if all your resources are consumed?

@yaacov
Member Author

yaacov commented Nov 19, 2018

@Ladas I see.

So should the mem_usage_absolute_average be lower?

Sometimes:
For example, if we have 100 units of RAM total:
a.

50 used  (working set + cache)
50 working set
0   cache
mem_usage_absolute_average - will stay the same.

b.

50 used  (working set + cache)
40 working set
10   cache
mem_usage_absolute_average - will be a bit lower

c.

50 used  (working set + cache)
0 working set
50   cache
mem_usage_absolute_average - will be zero
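The three cases above can be sketched numerically; mem_usage_percent is an illustrative helper for this thread, not the actual ManageIQ HawkularCaptureContext computation:

```ruby
# Minimal sketch of the three cases above, assuming 100 units of RAM.
# mem_usage_percent is a hypothetical helper: percent = used / total.
def mem_usage_percent(used, total)
  100.0 * used / total
end

total = 100

# a. usage 50, working set 50, cache 0 -> value stays the same
a_before = mem_usage_percent(50, total) # => 50.0
a_after  = mem_usage_percent(50, total) # => 50.0

# b. usage 50, working set 40, cache 10 -> a bit lower
b_before = mem_usage_percent(50, total) # => 50.0
b_after  = mem_usage_percent(40, total) # => 40.0

# c. usage 50, working set 0, cache 50 -> drops to zero
c_before = mem_usage_percent(50, total) # => 50.0
c_after  = mem_usage_percent(0, total)  # => 0.0
```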

Meaning we were doing the wrong computation before, and you would see you are using over 100% of your memory if all your resources are consumed?

No, we used usage, which is a correct value to use; working set is better.
You cannot get over 100% with usage.
You cannot get over 100% with working set.

mem_usage_absolute_average will never be more than 100%.
mem_usage_absolute_average can be 100% if all available RAM is used for the working set.
mem_usage_absolute_average will never be less than 0%.

Or was it correct and now we'll never see 100%, even if all your resources are consumed?

If all usage is because of working_set, cache is zero, and working_set is all of the total memory, we will get 100% (and the kernel will kill the responsible process before we can measure it :-) )

Using working set instead of usage should have no effect on the top and bottom limits of the mem_usage_absolute_average value.

Does that make sense to you?

@Ladas
Contributor

Ladas commented Nov 19, 2018

Can containers share cache (is this without shared/buffers)? If they can, the sum could go over 100%.

Otherwise, does this mean we can never reach 100% utilization if there is any cache now?

@yaacov
Member Author

yaacov commented Nov 19, 2018

Can containers share cache (is this without shared/buffer)? If they would it could go over 100%, when summed.

AFAIK containers should not share any used memory, cache or working set. But it's a good question.

Otherwise this mean we can't never reach 100% util, if there is a cache now?

We get our numbers from cAdvisor, which is used by many projects; most use usage, and some use working_set, to calculate utilization (e.g. used / total). If the ones using usage for utilization got above 100%, I think we would find some open issues about that in some project ...

@agrare
Member

agrare commented Nov 19, 2018

Okay @yaacov, so let's drop the config option. Since we support versions older than when this was added, we'll need to either add a version check on the provider or do something like working_set || usage.
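The fallback idea can be sketched as follows; effective_memory_metric is a hypothetical helper, not the actual HawkularCaptureContext API, shown only to illustrate the working_set || usage behaviour with Hawkular-style tag keys:

```ruby
# Hypothetical sketch of the fallback discussed above: prefer
# memory/working_set (available on OpenShift >= 3.7) and fall back to
# memory/usage on older versions where working_set is not reported.
def effective_memory_metric(metrics)
  # Hash#[] returns nil for a missing key, so || falls through to usage.
  metrics["memory/working_set"] || metrics["memory/usage"]
end
```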

@agrare agrare self-assigned this Nov 19, 2018
@cben
Contributor

cben commented Nov 19, 2018 via email

@agrare
Member

agrare commented Nov 19, 2018

How will users know which of the 2 they're seeing?

We'll need to document that on OpenShift 3.6 you get memory/usage and on 3.7+ you get memory/working_set.

@agrare
Member

agrare commented Nov 19, 2018

@yaacov actually, since we are dropping support for pre-3.7 on master and back to hammer, you can just switch to working_set and we can deal with the version check when we backport to gaprindashvili.

@yaacov yaacov force-pushed the optional-working-set-as-memory-tag branch from 23ec957 to ae07ba5 on November 20, 2018 08:23
Contributor

@Ladas Ladas left a comment


Looks good 👍

@cben
Contributor

cben commented Nov 20, 2018

Code LGTM 👍

What about Prometheus?

@oourfali

@cben Prometheus isn't supported.
We can change that, but it's not mandatory.

@agrare agrare merged commit aeda847 into ManageIQ:master Nov 20, 2018
@agrare agrare added this to the Sprint 99 Ending Nov 19, 2018 milestone Nov 20, 2018
@agrare
Member

agrare commented Nov 20, 2018

@yaacov @simaishi we can backport this cleanly to hammer, but we will need to modify it for gaprindashvili to handle older versions. @yaacov can you start preparing that PR?

@yaacov
Member Author

yaacov commented Nov 20, 2018

gaprindashvili to handle older versions. @yaacov can you start preparing that PR ?

yes :-)

@yaacov
Member Author

yaacov commented Nov 20, 2018

@yaacov can you start preparing that PR ?

@agrare @simaishi, submitted a PR on gaprindashvili.

That PR relies on falling back to the legacy code that still uses usage: #306

simaishi pushed a commit that referenced this pull request Nov 20, 2018
Use Hawkular memory tag working-set instead of usage

(cherry picked from commit aeda847)

https://bugzilla.redhat.com/show_bug.cgi?id=1650351
@simaishi

Hammer backport details:

$ git log -1
commit 746ae7d41cb2f634bd09b548373eea965b07f7be
Author: Adam Grare <agrare@redhat.com>
Date:   Tue Nov 20 06:59:20 2018 -0500

    Merge pull request #305 from yaacov/optional-working-set-as-memory-tag
    
    Use Hawkular memory tag working-set instead of usage
    
    (cherry picked from commit aeda8479fe7d8caa9e17e325a14be4022d77b213)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1650351

@agrare agrare modified the milestones: Sprint 99 Ending Nov 19, 2018, Sprint 100 Ending Dec 3, 2018 Nov 20, 2018
@simaishi

Backported to Gaprindashvili via #306

agrare added a commit to agrare/manageiq-providers-kubernetes that referenced this pull request Jan 3, 2019
…emory-tag

Use Hawkular memory tag working-set instead of usage

(cherry picked from commit aeda847)
simaishi added a commit that referenced this pull request Jan 4, 2019
…-set

[GAPRINDASHVILI] Revert #306 in favor of clean backport of #305
simaishi pushed a commit that referenced this pull request Jan 4, 2019
Use Hawkular memory tag working-set instead of usage

(cherry picked from commit aeda847)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1663520
@simaishi

simaishi commented Jan 4, 2019

Gaprindashvili backport details:

$ git log -1
commit 06f9d3c70fe1252eb0adf9b213a15c6d41a0de72
Author: Adam Grare <agrare@redhat.com>
Date:   Tue Nov 20 06:59:20 2018 -0500

    Merge pull request #305 from yaacov/optional-working-set-as-memory-tag
    
    Use Hawkular memory tag working-set instead of usage
    
    (cherry picked from commit aeda8479fe7d8caa9e17e325a14be4022d77b213)
    
    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1663520

8 participants