
Speed up keyword_search by storing pre-processed data #3604

Open
PaulWay wants to merge 2 commits into master

Conversation

PaulWay
Contributor

@PaulWay PaulWay commented Nov 22, 2022

Signed-off-by: Paul Wayper <paulway@redhat.com>

All Pull Requests:

Check all that apply:

  • Have you followed the guidelines in our Contributing document, including the instructions about commit messages?
  • Is this PR to correct an issue?
  • Is this PR an enhancement?

Complete Description of Additions/Changes:

Call analysis by @dkuc has shown that the `key_match()` function is called many times, often with the same values. This work attempts to speed up the function by:

  • transforming the row keys explicitly in a dict, rather than implicitly by reconstructing the entire row data with transformed keys
  • caching the transformed row keys
  • preparing the keyword arguments ahead of time into a list of search terms
  • evaluating equality of row data quickly, since that is the common case
  • returning no rows early if a sought key does not exist in the row data (after transformation)

Together these changes speed up `keyword_search` from 1.693 seconds of total time to 0.301 seconds when processing a small archive; the ideas are sketched together below.
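
Combined, those ideas look roughly like this minimal sketch (hypothetical code, not the PR's exact implementation; it assumes for simplicity that every row shares the first row's keys, and it ignores matcher suffixes such as `__contains`):

    def keyword_search(rows, **kwargs):
        if not rows:
            return []
        # Transform the row keys once, into a small mapping from search
        # keyword (e.g. 'command_name') to row data key (e.g. 'command name').
        key_map = {k.replace(' ', '_').replace('-', '_'): k for k in rows[0]}
        # Prepare the search terms ahead of time: (row data key, sought value).
        terms = []
        for keyword, value in kwargs.items():
            if keyword not in key_map:
                return []  # the sought key can never match: return no rows early
            terms.append((key_map[keyword], value))
        # Plain equality is the common case, so test it directly per row.
        return [row for row in rows if all(row.get(k) == v for k, v in terms)]

So a call such as keyword_search(rows, command_name='bash') matches rows whose 'command name' value equals 'bash', and pays the key-transformation cost once per search rather than once per row.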

@PaulWay PaulWay self-assigned this Nov 22, 2022
@psachin
Contributor

psachin commented Nov 22, 2022

@PaulWay Is it possible to add tests related to this?

@PaulWay
Contributor Author

PaulWay commented Nov 22, 2022

@psachin - not really sure; functions wrapped with the lru_cache decorator do provide cache_info() and cache_clear() methods to check how the cache is being used and to reset it if necessary. But the real test is: does everything work the same way it always has - only faster? I'm not sure there's any point in testing the 'faster' assertion 😄 but all the existing tests make sure it works the same way.
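
For reference, inspecting an lru_cache in a test looks roughly like this (standard library behaviour; the transform function here is only illustrative):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def transform(key):
        return key.replace(' ', '_')

    transform('command name')
    transform('command name')      # second call is served from the cache
    print(transform.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)
    transform.cache_clear()        # reset between tests if needed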

@xiangce
Contributor

xiangce commented Nov 23, 2022

It seems the CI environments are broken, hence the pipelines failed. However, I do get the following test errors locally:

========================================================================= test session starts =========================================================================
platform linux -- Python 3.9.14, pytest-7.1.3, pluggy-1.0.0
...
test_lspci_vmmkn ___________________________________________________________________________

    def test_lspci_vmmkn():
        lspci_vmmkn = LsPciVmmkn(context_wrap(LSPCI_VMMKN))
        lspci = LsPci(None, lspci_vmmkn)
        assert sorted(lspci.pci_dev_list) == ['00:00.0', '00:02.0', '00:03.0', '00:16.0', '00:19.0', '00:1b.0']
        assert lspci.search(Device='155a', Vendor='8086') == [
            {
                'Slot': '00:19.0', 'Class': '0200', 'Vendor': '8086',
                'Device': '155a', 'SVendor': '17aa', 'SDevice': '2214',
                'Rev': '04', 'Driver': 'e1000e', 'Module': ['e1000e'],
            }
        ]
>       assert lspci.search(Dev_Details__contains='I218') == []
E       AssertionError: assert [{'Class': '0...': '09', ...}] == []
E         Left contains one more item: {'Class': '0600', 'Device': '0a04', 'Driver': 'hsw_uncore', 'Rev': '09', ...}
E         Use -v to get more diff

@PaulWay - could you please have a look and fix the flake8 errors as well?

Note also that this change will break the rules that depend on these search interfaces.

@PaulWay
Contributor Author

PaulWay commented Nov 23, 2022

Hi @xiangce - yep, I see similar errors. I'm working on this now. Dicts being unhashable and seeming to change their id()s is not helping me here 🥲
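
A tiny illustration of that problem, which is standard Python behaviour rather than anything specific to this PR:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def cached_match(row):
        return row

    try:
        cached_match({'command name': 'bash'})
    except TypeError as err:
        print(err)   # unhashable type: 'dict'

    # Identity-based caching fares no better: equal dicts built separately
    # are distinct objects, so their id()s differ.
    a = {'command name': 'bash'}
    b = {'command name': 'bash'}
    assert a == b and id(a) != id(b)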

@PaulWay
Contributor Author

PaulWay commented Nov 24, 2022

Are we still having problems with the CI environment?

@xiangce
Contributor

xiangce commented Nov 25, 2022

Are we still having problems with the CI environment?

Yeah, this may be related to actions/setup-python#162 - @chenlizhong mentioned it in #3606, and I asked him to raise a new PR to fix it. Please wait for a while.

@PaulWay
Contributor Author

PaulWay commented Dec 6, 2022

OK, finally we have the tests mostly running successfully?

Contributor

@xiangce xiangce left a comment


It looks good to me. But since this is a change at quite a basic level of the code, I'd like to wait for feedback from @bfahr or @ryan-blakley.

@bfahr
Contributor

bfahr commented Dec 13, 2022

@xiangce I'm going to run the times and errors report using this change so that we can try to measure any performance improvement.

@xiangce
Contributor

xiangce commented Jan 13, 2023

@PaulWay - @bfahr is now testing this PR with the Insights data and comparing it to the original code to check for performance improvements. But recently DataHub has had some issues, possibly due to an outage during the holidays, so the test is not complete yet. Sorry for the late reply.

@PaulWay PaulWay changed the title Cache key_match - called many many times, often on the same data Speed up keyword_search by storing pre-processed data Feb 6, 2023
After a lot of testing we've discovered that an internal cache dictionary
doesn't work.  Sometimes the 'rows' parameter is a list, which we can't add
attributes to; sometimes it's a list-like object holding dictionaries, which
we also can't add attributes to; sometimes it's a list-like object holding
dict-like objects.  And we have to return the original row, rather than any
transformed row.

Testing has shown it's better to incur the up-front cost of transforming the
rows in the outer loop.  If we can attach that list of transformed rows to
an object - the `rows` object, or a parent object if `rows` is a plain list -
then all future searches are faster.  We add an optional `parent` parameter
to the `keyword_search()` function so that parsers can supply an object to
attach the transformed rows to.
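
A hedged sketch of the `parent` idea, with illustrative attribute and helper names (not the PR's actual code):

    from itertools import chain

    def _transform(key):
        # e.g. 'command name' -> 'command_name' (illustrative rule only)
        return key.replace(' ', '_').replace('-', '_')

    def _get_key_map(rows, parent=None):
        owner = parent if parent is not None else rows
        key_map = getattr(owner, '_keyword_search_key_map', None)
        if key_map is None:
            all_keys = set(chain.from_iterable(rows))
            key_map = {_transform(k): k for k in all_keys}
            try:
                # attach to the rows object, or to the parent if rows is a list
                owner._keyword_search_key_map = key_map
            except AttributeError:
                pass  # a plain list rejects attributes; recompute next search
        return key_map

A parser would then call something like keyword_search(self.data, parent=self, command_name='bash'), and a second search on the same parser pays none of the transformation cost.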

This introduces two further speed improvements.  The first is that instead of
constructing a new list of dictionaries for the transformed rows, we can
simply construct a dictionary mapping the search term keyword (e.g.
`command_name`) to the row data key (e.g. `command name`).  This is much
smaller, involves far less scanning of the full table, and is easy to
substitute into our search process.
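
In sketch form (illustrative only), the difference between the two approaches on one example row:

    row = {'command name': 'bash', 'exe path': '/usr/bin/bash'}

    # Before: every row was rebuilt with transformed keys on every search
    transformed = {k.replace(' ', '_'): v for k, v in row.items()}

    # After: one small mapping from search keyword to original row key,
    # shared by all rows and reused across searches
    key_map = {k.replace(' ', '_'): k for k in row}
    assert row[key_map['command_name']] == 'bash'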

The second is that we can prepare the search terms ahead of time, as a list
of tuples containing the row data key, the matcher function name and function,
and the value sought.  Instead of doing this for every row in the data, we
do it once per search.
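
A sketch of that preparation step, with a deliberately tiny matcher table (hypothetical names; the real code supports more matchers than shown here):

    import operator

    MATCHERS = {
        '': operator.eq,
        'contains': lambda value, sought: sought in value,
        'startswith': lambda value, sought: value.startswith(sought),
    }

    def _prepare_terms(key_map, kwargs):
        terms = []
        for keyword, sought in kwargs.items():
            name, _, matcher_name = keyword.partition('__')
            if name not in key_map:
                return None  # sought key absent from every row: no matches
            matcher = MATCHERS.get(matcher_name, operator.eq)
            terms.append((key_map[name], matcher, sought))
        return terms

With the terms prepared once, the per-row check reduces to all(matcher(row[key], sought) for key, matcher, sought in terms), and a search on a key that never appears returns no rows before touching any row data - which is the corollary described next.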

A corollary of both of these is that we can check whether a search term
occurs anywhere in the row data keys before we even search the data, and
return no rows early if it does not.  While this would be rare, it could
occur with data that has optional fields (e.g. CPU information that differs
between architectures).

Signed-off-by: Paul Wayper <paulway@redhat.com>
@PaulWay
Contributor Author

PaulWay commented Jul 4, 2023

Just squashed everything down into a single commit for neatness (there were several rewrites of the code that confused the rebasing).

@PaulWay
Contributor Author

PaulWay commented Jul 5, 2023

@RedHatInsights/qe - what's the problem with this test? I can't get any information on it.

@candlepin-bot

Can one of the admins verify this patch?
