
[WIP] Very rough timeline / event streams optimizations for faster API requests with many tens of thousands of rows #22304

Closed
wants to merge 5 commits

Conversation

@jrafanie (Member) commented Jan 12, 2023

This was a temporary WIP PR to try a few things:

It included PRs that were already opened, as well as changes that were extracted into their own PRs.

Additionally, this PR sparked the idea to redo how the event groups were calculated and cached, so an alternative PR was created.

jrafanie and others added 4 commits January 12, 2023 12:00
For an API request with 10k event_streams, the request times before and after:

initial (cold, no prior requests): 48 => 11 seconds

http://localhost:3000/api/event_streams?limit=100&offset=0&expand=resources&attributes=group,group_level,group_name,id,event_type,message,ems_id,type,timestamp,created_on,host.name,source,ems_id,ext_management_system.name&filter[]=type=EmsEvent&filter[]=group=other&filter[]=group_level=detail
We don't need to classify and constantize a param's model name over
and over again just to see if it's valid (a sketch of the idea follows the timings below).

For this API request with 10k event_streams, the request times before and after:

initial (cold, no prior requests): ~11 => ~9 seconds
subsequent (previously requested): ~8 => ~5 seconds

http://localhost:3000/api/event_streams?limit=100&offset=0&expand=resources&attributes=group,group_level,group_name,id,event_type,message,ems_id,type,timestamp,created_on,host.name,source,ems_id,ext_management_system.name&filter[]=type=EmsEvent&filter[]=group=other&filter[]=group_level=detail
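A minimal sketch of that idea (hypothetical names; this commit was later reverted in favor of caching valid? directly, see below):

```ruby
# Hypothetical sketch: memoize the classify + constantize lookup so each
# param model name is resolved at most once per process, instead of once
# per row or per request. Assumes ActiveSupport's String#classify and
# String#constantize are available, as in a Rails app.
MODEL_NAME_CACHE = Hash.new do |cache, name|
  cache[name] = begin
                  name.classify.constantize
                rescue NameError
                  nil # remember failures too, so bad names aren't retried
                end
end

def valid_model_name?(name)
  !MODEL_NAME_CACHE[name].nil?
end
```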
For this API request with 10k event_streams, the request times before and after:

initial (cold, no prior requests): ~9 => ~7.5 seconds
subsequent (previously requested): ~5 => ~3.5 seconds

http://localhost:3000/api/event_streams?limit=100&offset=0&expand=resources&attributes=group,group_level,group_name,id,event_type,message,ems_id,type,timestamp,created_on,host.name,source,ems_id,ext_management_system.name&filter[]=type=EmsEvent&filter[]=group=other&filter[]=group_level=detail

9 seconds -> 6 seconds
Before:

Some filtering can only be done in Ruby.
For these pages, we download all records and run the filtering locally.
We were filtering every record and then limiting the records for display.
This generated a lot of extra work, since we only need to filter the
records that can end up on the page.

After:

When we are returning counts, it is not possible to avoid filtering
every record, but in the typical case (i.e., `skip_counts = true`)
we only need to filter records up to the end of the requested page.
So we introduced filtering with a lazy enumerator (see the sketch below).
In the count case, we use `to_a` to force the lazy enumeration once,
which avoids running it multiple times.
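A minimal sketch of the lazy-enumerator approach, with hypothetical names (`records`, `expression`, `skip_counts`, `offset`, and `limit` stand in for the real search internals):

```ruby
# Filter lazily so we stop evaluating records once the page is full.
filtered = records.lazy.select { |rec| expression.evaluate(rec) }

if skip_counts
  # Typical case: only enough records to fill the requested page are filtered.
  page = filtered.drop(offset).first(limit)
else
  # Count case: force the enumeration exactly once with to_a, then page
  # from the realized array so the lazy chain never runs twice.
  matches = filtered.to_a
  total   = matches.size
  page    = matches.drop(offset).first(limit)
end
```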

We've just added yet another type that the targets variable can be.
There is a risk that not every type will be covered by the test cases
and that a future change won't take this into account. I tried to add
some comments showing where I felt the issues could arise.

We are down to one caller that does not use skip_counts.
This is the main view page in the UI, so it is popular by call volume.
Removing it would remove a number of touch points in this search code.
@Fryguy (Member) commented Jan 12, 2023

Cool. That groups_and_levels cache makes a ton of sense, but it needs a small change. What I think we should do is cache it as you did at the class level, but bust that cache if someone changes the settings. We already have hooks for this for things like blacklisted_events, so it shouldn't be a big change.
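A hedged sketch of that pattern (method names are hypothetical; the actual settings-changed hook is wired elsewhere, as it is for blacklisted_events):

```ruby
class EventStream < ApplicationRecord
  # Class-level cache of group/level lookups, as in this PR.
  def self.group_and_level_cache
    @group_and_level_cache ||= {}
  end

  # Hypothetical bust hook: call this from the existing settings-changed
  # callback so editing the settings invalidates the cache.
  def self.clear_group_and_level_cache!
    @group_and_level_cache = nil
  end
end
```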

@Fryguy (Member) commented Jan 12, 2023

Separately, I'm looking into making a scope for the groups and levels, so that it can be filtered in SQL. We know ahead of time which events are in which category, so it should be a relatively simple thing. That would probably eliminate the problem entirely by avoiding all of the rows to begin with. That being said, the group_and_level cache should stay also.
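A rough sketch of what such a scope could look like (names hypothetical):

```ruby
class EventStream < ApplicationRecord
  # Since the group membership of each event type is known ahead of time,
  # filter by group in SQL instead of classifying rows in Ruby.
  # event_types_for_group is a hypothetical helper returning the
  # event_type strings configured for the given group.
  scope :in_group, ->(group) { where(:event_type => event_types_for_group(group)) }
end

# e.g. EventStream.in_group(:other).limit(100) never loads the other rows.
```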

@Fryguy (Member) commented Jan 12, 2023

@jrafanie I'm reviewing this locally and I think you get some more bang for the buck if you cache the event_groups as well (or perhaps instead). That is the thing that's unlikely to change and it's very expensive with the deep_merge. This brings a single call to group_and_level from 21ms to 0.5ms (and all subsequent calls will continue to be 0.5 ms regardless of event_type). With this change in this PR only, each unique event will still separately be 21ms, though calling with the same event will drop to 0.005ms.
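A minimal sketch of caching event_groups itself (hypothetical names; the real defaults and settings sources differ):

```ruby
# Memoize the merged group definitions once per process: the deep_merge of
# the defaults with the configured overrides is the expensive part, so pay
# it once rather than on every group_and_level call.
def self.event_groups
  @event_groups ||= default_event_groups.deep_merge(configured_event_groups)
end
```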

@jrafanie (Member, Author) commented:

> @jrafanie I'm reviewing this locally and I think you get some more bang for the buck if you cache the event_groups as well (or perhaps instead). That is the thing that's unlikely to change and it's very expensive with the deep_merge.

Thanks, I had both originally and started eliminating things that were no longer needed once one thing was "fixed"; I picked this one because that's what the profiling SVG showed. I had meant to go back to it because, yes, that deep_merge is quite expensive.

This reverts commit 13eb095.

Note: the commit "When caching _to_ruby, also cache valid?" removes the need
for this commit.
@jrafanie (Member, Author) commented:

I pushed a revert commit... we don't need to cache the result of the classify + constantize if we correctly cache valid? when we cache the result of _to_ruby (in other words, commit 3 removes the need for commit 2). A sketch of the idea follows.
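A hypothetical sketch of commit 3's idea, with made-up helper names:

```ruby
# Cache validity alongside the compiled ruby so repeated valid? calls
# stop re-running the expensive classify + constantize path.
def to_ruby
  @ruby ||= _to_ruby
end

def valid?
  return @valid unless @valid.nil? # memoizing a boolean, so test for nil
  @valid = compute_validity        # hypothetical helper; runs once
end
```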

@miq-bot (Member) commented Jan 12, 2023

Checked commits jrafanie/manageiq@f521cc4~...123cbe4 with ruby 2.6.10, rubocop 1.28.2, haml-lint 0.35.0, and yamllint
5 files checked, 4 offenses detected

app/models/event_stream.rb

lib/miq_expression.rb

@Fryguy (Member) commented Jan 12, 2023

@jrafanie I opened #22305 to do the caching a different way. Even with this PR, the sample code there took forever, but after that PR, the group* lookups are negligible.

@kbrock (Member) commented Jan 13, 2023

Have to say it again: the valid? optimization is my favorite here. Not the biggest win, but a great fix.

@Fryguy self-assigned this Jan 18, 2023
@kbrock (Member) commented Jan 19, 2023

@jrafanie can you pull out the valid? and other low hanging fruit so we can get a few of these merged?

@jrafanie (Member, Author) commented:

> @jrafanie can you pull out the valid? and other low hanging fruit so we can get a few of these merged?

Yeah, I have to revisit this, apply @Fryguy's change, and keep only the commits we need from this branch. I do think the valid? optimization is good.

@Fryguy (Member) commented Jan 19, 2023

My change is still WIP due to the need to broadcast a cache bust. Feel free to add a commit to that PR if you have the cycles.

@jrafanie (Member, Author) commented:

> @jrafanie can you pull out the valid? and other low hanging fruit so we can get a few of these merged?

@kbrock I opened #22318 for the valid? optimization... note my comment there about what I changed from what we wrote.

@miq-bot (Member) commented Jan 27, 2023

This pull request is not mergeable. Please rebase and repush.

@jrafanie (Member, Author) commented:

I think everything from this PR was either extracted or had a replacement found, so I'll close it for now.

@jrafanie closed this Jan 31, 2023