Speed up `keyword_search` by storing pre-processed data #3604
base: master
b865501 to 9863bc4
@PaulWay Is it possible to add tests related to this?
@psachin - not really sure; the `lru_cache` decorator does provide `cache_info()` and `cache_clear()` methods to check how the cache is being used and to reset it if necessary. But the real test is: does everything work the same way it always has, only faster? I'm not sure there's any point in testing the 'faster' assertion 😄 but all the existing tests make sure it works the same way.
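As a minimal sketch of what such a cache check could look like: the function `key_match_sketch` below is a hypothetical stand-in, not the project's actual `key_match()`, and only the standard `functools.lru_cache` introspection calls are used.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def key_match_sketch(row_key, keyword):
    # Hypothetical stand-in for the real matching logic: treat spaces in
    # the row key as underscores before comparing with the keyword.
    return row_key.replace(" ", "_") == keyword


# Two identical calls: the first is a cache miss, the second a cache hit.
key_match_sketch("command name", "command_name")
key_match_sketch("command name", "command_name")
info = key_match_sketch.cache_info()
print(info.hits, info.misses)  # 1 1

# Reset the cache, for example between tests.
key_match_sketch.cache_clear()
```

A test built this way would confirm that the cache is being hit, not that the code is faster.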
It seems the CI environments are broken, hence the pipelines failed. However, I do get the following test errors in my local environment:
@PaulWay - could you please have a look and fix the flake8 errors as well? And this change will break the rules that depend on these
Hi @xiangce - yep, I see similar errors. I'm working on this now. Dicts being unhashable and seeming to change their
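The "unhashable" part of that problem can be reproduced in isolation. This is a sketch under assumptions, not the code from this PR: `count_fields` and `freeze_row` are hypothetical names, and the workaround shown only suits flat dictionaries with sortable keys and hashable values.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_fields(row):
    # Hypothetical cached function; lru_cache must hash its arguments.
    return len(row)


def freeze_row(row):
    # Hypothetical helper: turn a flat dict into a hashable tuple of pairs.
    return tuple(sorted(row.items()))


row = {"pid": "1", "command name": "systemd"}

try:
    count_fields(row)
    raised = False
except TypeError:
    # Dictionaries are mutable and therefore unhashable, so they cannot
    # be used as lru_cache arguments.
    raised = True

# Passing a hashable snapshot of the row works.
frozen_count = count_fields(freeze_row(row))
```

The snapshot is a different object from the original row, so a cache keyed this way cannot hand back the caller's own row.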
Are we still having problems with the CI environment?
Yep, this may relate to this issue: actions/setup-python#162. @chenlizhong mentioned it in #3606, and I asked him to raise a new PR to fix it. Please wait for a while.
OK, finally we get the tests mostly running successfully?
It looks good to me. But since this is a kind of basic level change, I'd like to wait for feedback from @bfahr or @ryan-blakley.
@xiangce I'm going to run the times and errors report using this change so that we can try to measure any performance improvement for this change.
ea0529a to 7f09b47
After a lot of testing we've discovered that an internal cache dictionary doesn't work. Sometimes the `rows` parameter is a list, which we can't add attributes to; sometimes it's a list-like object with dictionaries which we can't add attributes to; sometimes it's a list-like object with dict-like objects. We have to return the original row, rather than any transformed row.

Testing has shown it's better to incur the up-front cost of transforming the rows in the outer loop. If we can attach that list of transformed rows to an object - the `rows` object, or a parent object if `rows` is a list - then that speeds up all future searches. We add the optional `parent` parameter to the `keyword_search()` function, so that parsers can supply that to attach the transformed rows to.

This introduces two further speed improvements. The first is that instead of constructing a new list of dictionaries for the transformed rows, we can simply construct a dictionary mapping the search term keyword (e.g. `command_name`) to the row data key (e.g. `command name`). This is much smaller, involves less of a full table scan, and is easy to substitute into our search process.

The second is that we can actually pre-prepare the search terms into a list of tuples containing the row data key, the matcher function name and function, and the value sought. Instead of doing this for every row in the data, we do it ahead of time.

A corollary of both of these is that we can check if the search term does not occur anywhere in the row data keys, before we even do a search of the data. While this would be rare, it could occur with some data that has optional fields (e.g. CPU info for different architectures).

Signed-off-by: Paul Wayper <paulway@redhat.com>
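The approach described in this commit message could be sketched roughly as follows. This is an illustration under assumptions, not the code in the commit: the names `keyword_search_sketch`, `build_key_map`, `MATCHERS` and `_key_map`, and the `keyword__matcher` suffix convention, are hypothetical choices made for the example.

```python
def build_key_map(rows):
    # Map the search-keyword form of each key (spaces replaced by
    # underscores) to the key actually used in the row data.
    key_map = {}
    for row in rows:
        for key in row:
            key_map.setdefault(str(key).replace(" ", "_"), key)
    return key_map


# Hypothetical matcher table: matcher name -> comparison function.
MATCHERS = {
    "default": lambda value, sought: value == sought,
    "contains": lambda value, sought: sought in value,
    "startswith": lambda value, sought: value.startswith(sought),
}


def keyword_search_sketch(rows, parent=None, **kwargs):
    # Attach the key map to `parent` if given, otherwise to `rows` itself.
    holder = parent if parent is not None else rows
    key_map = getattr(holder, "_key_map", None)
    if key_map is None:
        key_map = build_key_map(rows)
        try:
            holder._key_map = key_map
        except AttributeError:
            pass  # e.g. a plain list, which cannot take new attributes

    # Pre-prepare the search terms once, outside the per-row loop.
    terms = []
    for term, sought in kwargs.items():
        keyword, _, matcher_name = term.partition("__")
        matcher_name = matcher_name or "default"
        if keyword not in key_map:
            return []  # the key occurs in no row, so nothing can match
        terms.append((key_map[keyword], matcher_name,
                      MATCHERS[matcher_name], sought))

    # Return the original row objects, never transformed copies.
    return [
        row for row in rows
        if all(key in row and matcher(row[key], sought)
               for key, _name, matcher, sought in terms)
    ]


class ParserSketch(object):
    """Hypothetical parent object that can carry the cached key map."""


rows = [
    {"pid": "1", "command name": "systemd"},
    {"pid": "42", "command name": "sshd -D"},
]
parent = ParserSketch()
exact = keyword_search_sketch(rows, parent=parent, command_name="systemd")
partial = keyword_search_sketch(rows, parent=parent, command_name__contains="ssh")
absent = keyword_search_sketch(rows, parent=parent, missing_field="x")
```

In this sketch the first search pays the cost of building the key map, and later searches against the same `parent` reuse it. When `rows` is a plain list and no `parent` is supplied, the map is rebuilt on each call.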
7f09b47 to df4639b
Just squashed everything down into a single commit for neatness (there were several rewrites of the code that confused the rebasing).
@RedHatInsights/qe - what's the problem with this test? I can't get any information on it.
Can one of the admins verify this patch?
Signed-off-by: Paul Wayper <paulway@redhat.com>
Complete Description of Additions/Changes:
Call analysis by @dkuc has shown the `key_match()` function is called many times, often with the same values. This work attempts to speed up the function by:

These changes speed up `keyword_search` from 1.693 seconds of total time to 0.301 seconds when processing a small archive.