
[WIP] Batch collect gap metrics collection #22221

Open · wants to merge 7 commits into master

Conversation

kbrock
Member

@kbrock kbrock commented Nov 9, 2022

Overview

When we detect that metrics capture has not run for a while (or has never run), we submit gap/historical metrics requests.

followup to #20071

Problem

Since queue entries are split out by day, we could end up putting hundreds of requests onto the queue for only a dozen VMs.

Solution

When a gap is detected for inventory items, send the gap collection request in batches. This change is minimal, since the work to handle batches of inventory items has already been implemented.

For a coordinator with a troublesome C&U installation, gaps are by far the biggest issue we've had in the past in terms of overextending our queue.

Now it puts a dozen messages on the queue for hundreds of VMs.
The backend has already been changed to handle batched collection. This changes the coordinator to send the collection requests in batches.
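The batching described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`gap_collection_messages`, `BATCH_SIZE`), not the actual ManageIQ API: one queue message per (day, batch of ids) instead of one per (day, VM).

```ruby
# Hypothetical sketch of batched gap collection. Assumes a batch size of
# 250 and a list of VM ids; the real coordinator code differs.
require "date"

BATCH_SIZE = 250

# Returns one [start_day, end_day, ids] triple per day per batch of ids.
def gap_collection_messages(vm_ids, gap_start, gap_end)
  messages = []
  vm_ids.sort.each_slice(BATCH_SIZE) do |batch|
    (gap_start..gap_end).each do |day|
      messages << [day, day + 1, batch]
    end
  end
  messages
end

msgs = gap_collection_messages((1..500).to_a,
                               Date.new(2022, 11, 1),
                               Date.new(2022, 11, 3))
# 500 ids => 2 batches; 3 days => 6 messages instead of 1,500
```

With per-VM messages, the same input would have produced 500 × 3 = 1,500 queue entries.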

This PR is split up into many improvements. They build on each other a little but are relatively independent:

testing:

  • refactor perf_process to make profiling easier
  • tests ignore batching and date ranges to make sporadic test failures less likely
  • drop unused metrics test helpers
  • use ids instead of objects for metrics to make test failures easier to read
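The last testing bullet (ids instead of objects) can be sketched as follows. This is an illustrative example with a hypothetical `Vm` struct, not the project's models: comparing ids yields short, readable diffs on failure, while comparing whole objects dumps every attribute.

```ruby
# Hypothetical sketch: assert on ids rather than full objects so a test
# failure message shows "[1, 2] != [1, 3]" instead of two walls of
# attribute dumps.
Vm = Struct.new(:id, :name, :power_state)

vms = [Vm.new(1, "a", "on"), Vm.new(2, "b", "off")]

actual_ids = vms.map(&:id).sort
```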

performance enhancements:

  • use same last_perf_capture_on for the batch to make batching easier in the future
  • sort batches so ids are sent more consistently
  • send historical gaps as batches of ids (but still batch by date)
    this is a big performance improvement because it produces roughly 250 times fewer queue entries (# vms * # days / 250)
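The queue-entry arithmetic in the last bullet can be checked with a quick calculation. The VM and day counts below are assumed numbers for illustration; only the batch size of 250 comes from the description.

```ruby
# Rough queue-entry arithmetic for the batching change (assumed workload).
vms   = 500
days  = 30
batch = 250

before = vms * days                       # one message per VM per day
after  = (vms / batch.to_f).ceil * days   # one message per id-batch per day

reduction = before / after                # 15,000 vs 60 => 250x fewer
```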

@kbrock
Member Author

kbrock commented Nov 9, 2022

update:

  • fix cops
  • fix failing test (missed a rename)

@Fryguy Fryguy changed the title Batch collect gap metrics collection [WIP] Batch collect gap metrics collection Dec 7, 2022
@Fryguy Fryguy self-assigned this Dec 7, 2022
@miq-bot miq-bot added the wip label Dec 7, 2022
@kbrock kbrock removed the unmergeable label Dec 8, 2022
@kbrock
Member Author

kbrock commented Dec 9, 2022

@Fryguy The commits look good; I just left this as WIP because I wanted to discuss which of these commits we want to pull out into separate PRs.

Let me know if you want to discuss them (or if some of the commits need better documentation)

@miq-bot
Member

miq-bot commented Apr 10, 2023

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s)

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.


@miq-bot
Member

miq-bot commented Jun 6, 2023

This pull request is not mergeable. Please rebase and repush.

@miq-bot
Member

miq-bot commented Jun 12, 2023

This pull request has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this pull request if these changes are still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@kbrock kbrock reopened this Jun 12, 2023
Only method movement and indentation changes.

Cutting up perf_process to make it easier to profile:
 - consolidate multiple process_counter_values calls into a single call
 - extract perf_capture_states for the multiple perf_capture calls
 - extract update_last_perf_capture_on
 - extract raise_perf_complete_alerts
 - extract rollup_to_parent_metrics
We run rollups for a single ems and a single zone, but we were looking up
an ems and my_zone for every resource rollup.

Optimization: pass the zone to miq_queue, since it is known and we already
have the ems/zone in memory. We have had to make this optimization for a
number of other miq_queue calls, so it makes sense here too.
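The shape of this optimization can be sketched with a counter that stands in for the per-resource database lookup. All names here (`lookup_zone`, `enqueue_rollup`, `LOOKUPS`) are hypothetical; the real code uses MiqQueue and ActiveRecord associations.

```ruby
# Hypothetical sketch: the caller already knows the zone, so it passes
# the zone into the queue call instead of letting every resource look it
# up again.
LOOKUPS = {count: 0}

def lookup_zone(ems)            # stands in for a per-resource DB query
  LOOKUPS[:count] += 1
  "zone-for-#{ems}"
end

def enqueue_rollup(resource, zone: nil)
  zone ||= lookup_zone(resource[:ems])  # only query when not passed in
  {resource: resource[:id], zone: zone}
end

resources = (1..22).map { |i| {id: i, ems: "ems-1"} }

# Before: every resource triggers its own zone lookup (22 queries).
resources.each { |r| enqueue_rollup(r) }
lookups_before = LOOKUPS[:count]

# After: look the zone up once and pass it through (1 query).
LOOKUPS[:count] = 0
zone = lookup_zone("ems-1")
resources.each { |r| enqueue_rollup(r, zone: zone) }
lookups_after = LOOKUPS[:count]
```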

This code uses the parent object for the ems; any one would do.
If this method gets converted to a static method,
then use any one of the resources' ems.

numbers
-------

Running capture for only 16 VMs generates:

ManageIQ::Providers::Vmware::InfraManager::HostEsx#perf_capture_realtime => 6
ManageIQ::Providers::Vmware::InfraManager::Vm#perf_capture_realtime => 16

So this saves 22 ems and 22 zone lookups (the total number of VMs + hosts).
Minor savings, but it removes queries and makes the capture easier to profile.

|        ms |        bytes |    objects |query |   qry ms |     rows |`comments`
|       ---:|          ---:|        ---:|  ---:|      ---:|      ---:|  ---
|  19,312.2 | 138,947,654* | 32,019,635 |1,151 |  6,993.9 |      262 |`capture-vc91-250-before`
|  19,192.5 | 108,353,470* | 31,971,845 |1,117 |  6,869.5 |      228 |`capture-vc91-250-after`
|           |              |            |   3% |          |      13% | delta
Since this is an instance method, just use the common ems from the parent;
all of the targets have the same ems.
- Use a common last_perf_capture_on so the targets can be grouped together.
- This also sets the value from before contacting the server, so we don't
  introduce micro gaps when collecting right around the hour.
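The two bullets above can be sketched as follows. This is an illustrative example with plain hashes standing in for the models (`capture_batch` and the `:last_perf_capture_on` key are hypothetical here): take one timestamp before contacting the provider and stamp the whole batch with it, so no micro gap opens between collection and bookkeeping.

```ruby
# Hypothetical sketch: a single timestamp, taken before the provider is
# contacted, is shared by every target in the batch.
def capture_batch(targets)
  capture_started = Time.now.utc      # one timestamp for the whole batch
  targets.each do |t|
    # ... contact provider, collect metrics for t ...
    t[:last_perf_capture_on] = capture_started
  end
  targets
end

vms = [{name: "vm1"}, {name: "vm2"}]
capture_batch(vms)
```

Because every target shares the same `last_perf_capture_on`, the next run can group them into one batch instead of splitting on slightly different timestamps.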
Sort targets for metrics collection so consistent records are batched together.
Entries are sent in batches of up to 250 records.
Currently, they are still batched by day.
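The sort-then-slice step can be sketched like this. `target_batches` is a hypothetical name; only the sort-for-consistency idea and the batch size of 250 come from the commit message.

```ruby
# Hypothetical sketch: sorting first means the same ids land in the same
# batches run-to-run, then slice into batches of up to 250.
BATCH_SIZE = 250

def target_batches(targets)
  targets.sort_by { |t| t[:id] }.each_slice(BATCH_SIZE).to_a
end

targets = (1..600).map { |i| {id: i} }.shuffle
batches = target_batches(targets)
# 600 targets => batches of 250, 250, and 100, each with ascending ids
```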
@miq-bot
Member

miq-bot commented Sep 18, 2023

This pull request has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this pull request if these changes are still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

@kbrock
Member Author

kbrock commented Dec 15, 2023

we still have more work to do here

@kbrock kbrock reopened this Dec 15, 2023
@miq-bot
Member

miq-bot commented Dec 15, 2023

Checked commits kbrock/manageiq@e70ca4e~...fa6f72b with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
5 files checked, 0 offenses detected
Everything looks fine. 🍰

@miq-bot
Member

miq-bot commented Mar 18, 2024

This pull request has been automatically marked as stale because it has not been updated for at least 3 months.

If these changes are still valid, please remove the stale label, make any changes requested by reviewers (if any), and ensure that this issue is being looked at by the assigned/reviewer(s).
