
Speed up keyword_search by storing pre-processed data #3604

Open
PaulWay wants to merge 2 commits into master

Conversation

PaulWay
Contributor

@PaulWay PaulWay commented Nov 22, 2022

Signed-off-by: Paul Wayper <paulway@redhat.com>

All Pull Requests:

Check all that apply:

  • Have you followed the guidelines in our Contributing document, including the instructions about commit messages?
  • Is this PR to correct an issue?
  • Is this PR an enhancement?

Complete Description of Additions/Changes:

Call analysis by @dkuc has shown that the `key_match()` function is called many times, often with the same values. This work attempts to speed up the function by:

  • transforming the row keys explicitly in a dict, rather than implicitly by reconstructing the entire row data with transformed keys
  • caching the transformed row keys
  • preparing the keyword arguments ahead of time into a list of search terms
  • evaluating equality of row data quickly, since that is the common case
  • returning no rows early if a sought key does not exist in the row data (after transformation)

Together these changes speed up `keyword_search` from 1.693 seconds of total time to 0.301 seconds when processing a small archive; the ideas are sketched together below.
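
Combined, those ideas look roughly like this minimal sketch (hypothetical code, not the PR's exact implementation; it assumes for simplicity that every row shares the first row's keys, and it ignores matcher suffixes such as `__contains`):

    def keyword_search(rows, **kwargs):
        if not rows:
            return []
        # Transform the row keys once, into a small mapping from search
        # keyword (e.g. 'command_name') to row data key (e.g. 'command name').
        key_map = {k.replace(' ', '_').replace('-', '_'): k for k in rows[0]}
        # Prepare the search terms ahead of time: (row data key, sought value).
        terms = []
        for keyword, value in kwargs.items():
            if keyword not in key_map:
                return []  # the sought key can never match: return no rows early
            terms.append((key_map[keyword], value))
        # Plain equality is the common case, so test it directly per row.
        return [row for row in rows if all(row.get(k) == v for k, v in terms)]

So a call such as keyword_search(rows, command_name='bash') matches rows whose 'command name' value equals 'bash', and pays the key-transformation cost once per search rather than once per row.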

@PaulWay PaulWay self-assigned this Nov 22, 2022
@psachin
Contributor

psachin commented Nov 22, 2022

@PaulWay Is it possible to add tests related to this?

@PaulWay
Contributor Author

PaulWay commented Nov 22, 2022

@psachin - not really sure; functions wrapped with the lru_cache decorator do provide cache_info() and cache_clear() methods to check how the cache is being used and to reset it if necessary. But the real test is: does everything work the same way it always has - only faster? I'm not sure there's any point in testing the 'faster' assertion 😄 but all the existing tests make sure it works the same way.
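
For reference, inspecting an lru_cache in a test looks roughly like this (standard library behaviour; the transform function here is only illustrative):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def transform(key):
        return key.replace(' ', '_')

    transform('command name')
    transform('command name')      # second call is served from the cache
    print(transform.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)
    transform.cache_clear()        # reset between tests if needed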

@xiangce
Contributor

xiangce commented Nov 23, 2022

It seems the CI environments are broken, hence the pipelines failed. However, I do get the following test errors locally:

========================================================================= test session starts =========================================================================
platform linux -- Python 3.9.14, pytest-7.1.3, pluggy-1.0.0
...
test_lspci_vmmkn ___________________________________________________________________________

    def test_lspci_vmmkn():
        lspci_vmmkn = LsPciVmmkn(context_wrap(LSPCI_VMMKN))
        lspci = LsPci(None, lspci_vmmkn)
        assert sorted(lspci.pci_dev_list) == ['00:00.0', '00:02.0', '00:03.0', '00:16.0', '00:19.0', '00:1b.0']
        assert lspci.search(Device='155a', Vendor='8086') == [
            {
                'Slot': '00:19.0', 'Class': '0200', 'Vendor': '8086',
                'Device': '155a', 'SVendor': '17aa', 'SDevice': '2214',
                'Rev': '04', 'Driver': 'e1000e', 'Module': ['e1000e'],
            }
        ]
>       assert lspci.search(Dev_Details__contains='I218') == []
E       AssertionError: assert [{'Class': '0...': '09', ...}] == []
E         Left contains one more item: {'Class': '0600', 'Device': '0a04', 'Driver': 'hsw_uncore', 'Rev': '09', ...}
E         Use -v to get more diff

@PaulWay - could you please have a look and fix the flake8 errors as well?

Note also that this change will break the rules that depend on these search interfaces.

@PaulWay
Contributor Author

PaulWay commented Nov 23, 2022

Hi @xiangce - yep, I see similar errors. I'm working on this now. Dicts being unhashable and seeming to change their id()s is not helping me here 🥲
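
A tiny illustration of that problem, which is standard Python behaviour rather than anything specific to this PR:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def cached_match(row):
        return row

    try:
        cached_match({'command name': 'bash'})
    except TypeError as err:
        print(err)   # unhashable type: 'dict'

    # Identity-based caching fares no better: equal dicts built separately
    # are distinct objects, so their id()s differ.
    a = {'command name': 'bash'}
    b = {'command name': 'bash'}
    assert a == b and id(a) != id(b)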

@PaulWay
Contributor Author

PaulWay commented Nov 24, 2022

Are we still having problems with the CI environment?

@xiangce
Contributor

xiangce commented Nov 25, 2022

Are we still having problems with the CI environment?

Yeah, this may be related to actions/setup-python#162 - @chenlizhong mentioned it in #3606, and I asked him to raise a new PR to fix it. Please wait for a while.

@PaulWay
Contributor Author

PaulWay commented Dec 6, 2022

OK, finally we have the tests mostly running successfully?

Contributor

@xiangce xiangce left a comment


It looks good to me. But since this is a change at quite a basic level of the code, I'd like to wait for feedback from @bfahr or @ryan-blakley.

@bfahr
Contributor

bfahr commented Dec 13, 2022

@xiangce I'm going to run the times and errors report using this change so that we can try to measure any performance improvement.

@xiangce
Contributor

xiangce commented Jan 13, 2023

@PaulWay - @bfahr is now testing this PR with the Insights data and comparing it to the original code to check for performance improvements. But recently DataHub has had some issues, possibly due to an outage during the holidays, so the test is not complete yet. Sorry for the late reply.

@PaulWay PaulWay changed the title Cache key_match - called many many times, often on the same data Speed up keyword_search by storing pre-processed data Feb 6, 2023
After a lot of testing we've discovered that an internal cache dictionary
doesn't work.  Sometimes the 'rows' parameter is a list, which we can't add
attributes to; sometimes it's a list-like object holding dictionaries, which
we also can't add attributes to; sometimes it's a list-like object holding
dict-like objects.  And we have to return the original row, rather than any
transformed row.

Testing has shown it's better to incur the up-front cost of transforming the
rows in the outer loop.  If we can attach that list of transformed rows to
an object - the `rows` object, or a parent object if `rows` is a plain list -
then all future searches are faster.  We add an optional `parent` parameter
to the `keyword_search()` function so that parsers can supply an object to
attach the transformed rows to.
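
A hedged sketch of the `parent` idea, with illustrative attribute and helper names (not the PR's actual code):

    from itertools import chain

    def _transform(key):
        # e.g. 'command name' -> 'command_name' (illustrative rule only)
        return key.replace(' ', '_').replace('-', '_')

    def _get_key_map(rows, parent=None):
        owner = parent if parent is not None else rows
        key_map = getattr(owner, '_keyword_search_key_map', None)
        if key_map is None:
            all_keys = set(chain.from_iterable(rows))
            key_map = {_transform(k): k for k in all_keys}
            try:
                # attach to the rows object, or to the parent if rows is a list
                owner._keyword_search_key_map = key_map
            except AttributeError:
                pass  # a plain list rejects attributes; recompute next search
        return key_map

A parser would then call something like keyword_search(self.data, parent=self, command_name='bash'), and a second search on the same parser pays none of the transformation cost.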

This introduces two further speed improvements.  The first is that instead of
constructing a new list of dictionaries for the transformed rows, we can
simply construct a dictionary mapping the search term keyword (e.g.
`command_name`) to the row data key (e.g. `command name`).  This is much
smaller, involves far less scanning of the full table, and is easy to
substitute into our search process.
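
In sketch form (illustrative only), the difference between the two approaches on one example row:

    row = {'command name': 'bash', 'exe path': '/usr/bin/bash'}

    # Before: every row was rebuilt with transformed keys on every search
    transformed = {k.replace(' ', '_'): v for k, v in row.items()}

    # After: one small mapping from search keyword to original row key,
    # shared by all rows and reused across searches
    key_map = {k.replace(' ', '_'): k for k in row}
    assert row[key_map['command_name']] == 'bash'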

The second is that we can prepare the search terms ahead of time, as a list
of tuples containing the row data key, the matcher function name and function,
and the value sought.  Instead of doing this for every row in the data, we
do it once per search.
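
A sketch of that preparation step, with a deliberately tiny matcher table (hypothetical names; the real code supports more matchers than shown here):

    import operator

    MATCHERS = {
        '': operator.eq,
        'contains': lambda value, sought: sought in value,
        'startswith': lambda value, sought: value.startswith(sought),
    }

    def _prepare_terms(key_map, kwargs):
        terms = []
        for keyword, sought in kwargs.items():
            name, _, matcher_name = keyword.partition('__')
            if name not in key_map:
                return None  # sought key absent from every row: no matches
            matcher = MATCHERS.get(matcher_name, operator.eq)
            terms.append((key_map[name], matcher, sought))
        return terms

With the terms prepared once, the per-row check reduces to all(matcher(row[key], sought) for key, matcher, sought in terms), and a search on a key that never appears returns no rows before touching any row data - which is the corollary described next.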

A corollary of both of these is that we can check whether a search term
occurs anywhere in the row data keys before we even search the data, and
return no rows early if it does not.  While this would be rare, it could
occur with data that has optional fields (e.g. CPU information that differs
between architectures).

Signed-off-by: Paul Wayper <paulway@redhat.com>
@PaulWay
Contributor Author

PaulWay commented Jul 4, 2023

Just squashed everything down into a single commit for neatness (there were several rewrites of the code that confused the rebasing).

@PaulWay
Contributor Author

PaulWay commented Jul 5, 2023

@RedHatInsights/qe - what's the problem with this test? I can't get any information on it.

@candlepin-bot

Can one of the admins verify this patch?
