
[WIP] Batch collect gap metrics collection #22221

Open · wants to merge 7 commits into master

Conversation

kbrock
Member

@kbrock kbrock commented Nov 9, 2022

Overview

When we detect that metrics capture has not run for a while (or has never run), we submit gap/historical metrics requests.

followup to #20071

Problem

Since queue entries are split out by day, we could end up putting hundreds of requests onto the queue for only a dozen VMs.

Solution

When a gap is detected for inventory items, send the gap collection request in batches. This change is minimal, since the work to handle batches of inventory items has already been implemented.

For a coordinator with a troublesome C&U installation, gaps are by far the biggest issue we've had in the past in terms of overextending our queue.

Now it puts a dozen messages on the queue for hundreds of VMs.
The backend has already been changed to handle batched collection. This changes the coordinator to send the collection requests in batches.
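The batching described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`gap_collection_messages`, `BATCH_SIZE`), not the actual ManageIQ API: one queue message per (day, batch of ids) instead of one per (day, VM).

```ruby
# Hypothetical sketch of batched gap collection. Assumes a batch size of
# 250 and a list of VM ids; the real coordinator code differs.
require "date"

BATCH_SIZE = 250

# Returns one [start_day, end_day, ids] triple per day per batch of ids.
def gap_collection_messages(vm_ids, gap_start, gap_end)
  messages = []
  vm_ids.sort.each_slice(BATCH_SIZE) do |batch|
    (gap_start..gap_end).each do |day|
      messages << [day, day + 1, batch]
    end
  end
  messages
end

msgs = gap_collection_messages((1..500).to_a,
                               Date.new(2022, 11, 1),
                               Date.new(2022, 11, 3))
# 500 ids => 2 batches; 3 days => 6 messages instead of 1,500
```

With per-VM messages, the same input would have produced 500 × 3 = 1,500 queue entries.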

This PR is split up into many improvements. They build on each other a little but are relatively independent:

testing:

  • refactor perf_process to make profiling easier
  • tests ignore batching and date ranges to make sporadic test failures less likely
  • drop unused metrics test helpers
  • use ids instead of objects for metrics to make test failures easier to read
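The last testing bullet (ids instead of objects) can be sketched as follows. This is an illustrative example with a hypothetical `Vm` struct, not the project's models: comparing ids yields short, readable diffs on failure, while comparing whole objects dumps every attribute.

```ruby
# Hypothetical sketch: assert on ids rather than full objects so a test
# failure message shows "[1, 2] != [1, 3]" instead of two walls of
# attribute dumps.
Vm = Struct.new(:id, :name, :power_state)

vms = [Vm.new(1, "a", "on"), Vm.new(2, "b", "off")]

actual_ids = vms.map(&:id).sort
```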

performance enhancements:

  • use same last_perf_capture_on for the batch to make batching easier in the future
  • sort batches so ids are sent more consistently
  • send historical gaps as batches of ids (but still batch by date)
    this is a big performance improvement because it produces roughly 250 times fewer queue entries (# vms * # days / 250)
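The queue-entry arithmetic in the last bullet can be checked with a quick calculation. The VM and day counts below are assumed numbers for illustration; only the batch size of 250 comes from the description.

```ruby
# Rough queue-entry arithmetic for the batching change (assumed workload).
vms   = 500
days  = 30
batch = 250

before = vms * days                       # one message per VM per day
after  = (vms / batch.to_f).ceil * days   # one message per id-batch per day

reduction = before / after                # 15,000 vs 60 => 250x fewer
```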

@kbrock
Member Author

kbrock commented Nov 9, 2022

update:

  • fix cops
  • fix failing test (missed a rename)

@Fryguy Fryguy changed the title Batch collect gap metrics collection [WIP] Batch collect gap metrics collection Dec 7, 2022
@Fryguy Fryguy self-assigned this Dec 7, 2022
@miq-bot miq-bot added the wip label Dec 7, 2022
@kbrock kbrock removed the unmergeable label Dec 8, 2022
@kbrock
Member Author

kbrock commented Dec 9, 2022

@Fryguy The commits look good; I just left this as WIP because I wanted to discuss which of these commits we want to pull out into separate PRs.

Let me know if you want to discuss them (or if some of the commits need better documentation)

@miq-bot
Member

miq-bot commented Apr 10, 2023

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s)

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.


@miq-bot
Member

miq-bot commented Jun 6, 2023

This pull request is not mergeable. Please rebase and repush.

@miq-bot
Member

miq-bot commented Jun 12, 2023

This pull request has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this pull request if these changes are still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@kbrock kbrock reopened this Jun 12, 2023
Only method movement and indentation changes.

Cutting up perf_process to make it easier to profile:
 - consolidate multiple process_counter_values calls into a single call
 - extract perf_capture_states for the multiple perf_capture calls
 - extract update_last_perf_capture_on
 - extract raise_perf_complete_alerts
 - extract rollup_to_parent_metrics
We run rollups for a single ems and a single zone, but we were looking up
an ems and my_zone for every resource rollup.

Optimization: pass the zone to miq_queue, since it is known and we already
have the ems/zone in memory. We have had to make this optimization for a
number of other miq_queue calls, so it makes sense here too.
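The shape of this optimization can be sketched with a counter that stands in for the per-resource database lookup. All names here (`lookup_zone`, `enqueue_rollup`, `LOOKUPS`) are hypothetical; the real code uses MiqQueue and ActiveRecord associations.

```ruby
# Hypothetical sketch: the caller already knows the zone, so it passes
# the zone into the queue call instead of letting every resource look it
# up again.
LOOKUPS = {count: 0}

def lookup_zone(ems)            # stands in for a per-resource DB query
  LOOKUPS[:count] += 1
  "zone-for-#{ems}"
end

def enqueue_rollup(resource, zone: nil)
  zone ||= lookup_zone(resource[:ems])  # only query when not passed in
  {resource: resource[:id], zone: zone}
end

resources = (1..22).map { |i| {id: i, ems: "ems-1"} }

# Before: every resource triggers its own zone lookup (22 queries).
resources.each { |r| enqueue_rollup(r) }
lookups_before = LOOKUPS[:count]

# After: look the zone up once and pass it through (1 query).
LOOKUPS[:count] = 0
zone = lookup_zone("ems-1")
resources.each { |r| enqueue_rollup(r, zone: zone) }
lookups_after = LOOKUPS[:count]
```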

This code uses the parent object for the ems; any one would do.
If this method gets converted to a static method,
then use any one of the resources' ems.

numbers
-------

Running capture for only 16 VMs generates:

ManageIQ::Providers::Vmware::InfraManager::HostEsx#perf_capture_realtime => 6
ManageIQ::Providers::Vmware::InfraManager::Vm#perf_capture_realtime => 16

So this saves 22 ems and 22 zone lookups (the total number of VMs + hosts).
Minor savings, but it removes queries and makes the capture easier to profile.

|        ms |        bytes |    objects |query |   qry ms |     rows |`comments`
|       ---:|          ---:|        ---:|  ---:|      ---:|      ---:|  ---
|  19,312.2 | 138,947,654* | 32,019,635 |1,151 |  6,993.9 |      262 |`capture-vc91-250-before`
|  19,192.5 | 108,353,470* | 31,971,845 |1,117 |  6,869.5 |      228 |`capture-vc91-250-after`
|           |              |            |   3% |          |      13% | delta
Since this is an instance method, just use the common ems from the parent;
all of the targets have the same ems.
- Use a common last_perf_capture_on so the targets can be grouped together.
- This also sets the value from before contacting the server, so we don't
  introduce micro gaps when collecting right around the hour.
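The two bullets above can be sketched as follows. This is an illustrative example with plain hashes standing in for the models (`capture_batch` and the `:last_perf_capture_on` key are hypothetical here): take one timestamp before contacting the provider and stamp the whole batch with it, so no micro gap opens between collection and bookkeeping.

```ruby
# Hypothetical sketch: a single timestamp, taken before the provider is
# contacted, is shared by every target in the batch.
def capture_batch(targets)
  capture_started = Time.now.utc      # one timestamp for the whole batch
  targets.each do |t|
    # ... contact provider, collect metrics for t ...
    t[:last_perf_capture_on] = capture_started
  end
  targets
end

vms = [{name: "vm1"}, {name: "vm2"}]
capture_batch(vms)
```

Because every target shares the same `last_perf_capture_on`, the next run can group them into one batch instead of splitting on slightly different timestamps.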
Sort targets for metrics collection so consistent records are batched together.
Entries are sent in batches of up to 250 records.
Currently, they are still batched by day.
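The sort-then-slice step can be sketched like this. `target_batches` is a hypothetical name; only the sort-for-consistency idea and the batch size of 250 come from the commit message.

```ruby
# Hypothetical sketch: sorting first means the same ids land in the same
# batches run-to-run, then slice into batches of up to 250.
BATCH_SIZE = 250

def target_batches(targets)
  targets.sort_by { |t| t[:id] }.each_slice(BATCH_SIZE).to_a
end

targets = (1..600).map { |i| {id: i} }.shuffle
batches = target_batches(targets)
# 600 targets => batches of 250, 250, and 100, each with ascending ids
```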
@miq-bot
Member

miq-bot commented Sep 18, 2023

This pull request has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this pull request if these changes are still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@kbrock
Member Author

kbrock commented Dec 15, 2023

we still have more work to do here

@kbrock kbrock reopened this Dec 15, 2023
@miq-bot
Member

miq-bot commented Dec 15, 2023

Checked commits kbrock/manageiq@e70ca4e~...fa6f72b with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
5 files checked, 0 offenses detected
Everything looks fine. 🍰

@miq-bot
Member

miq-bot commented Mar 18, 2024

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).
