Performance improvement 1: queryplans #191

Merged
merged 9 commits into from Dec 14, 2022

@msm-code (Contributor) commented Dec 13, 2022

While "Query Plans" sounds like an advanced feature, this is actually a regression from QueryGraphs. What's more, after this PR query graphs will be unused, and we could in principle remove them.

This PR reduces the number of reads by between 10% and 300%, as seen in this screenshot (ignore the time in ms: it depends on disk load and cache contents at the moment of the benchmark. The read count is what matters):

image

Since reading is the most time-consuming part of ursadb's work, especially on HDDs, this may speed up the database significantly.

In this PR I rewrite the matching algorithm to use a minimal covering set of necessary ngrams. For example, to match:

abcde?ghi

(where ? is an unknown/missing character), we will check the following ngrams (assuming all indexes are available):

abcd
 bcde
      ghi

While in the past we would check:

abcd
 bcde
abc
 bcd
  cde
abcd (hash4)
 bcde (hash4)
      ghi

and AND the results later. Most of these are unnecessary - for example, there's no point checking the 3grams abc and bcd, since they're implied by the 4gram abcd. The same goes for the pseudo-4gram abcd provided by the hash4 index - it's implied by the text4 one.
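
The difference between the two lists can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code from this PR: the helper names (`literal_runs`, `covering_ngrams`, `legacy_ngrams`, `implied`) are hypothetical, only three index kinds are modelled (labelled `text4`, `gram3` and `hash4` here), and the per-run rule "4grams if the run is long enough, otherwise 3grams" is a stand-in for whatever selection the real query plan makes.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed marker for an unknown byte, as in the example above

Ngram = Tuple[int, str]      # (offset in pattern, ngram text)
Read = Tuple[str, int, str]  # (index label, offset in pattern, ngram text)


def literal_runs(pattern: str) -> List[Ngram]:
    """Split the pattern into maximal wildcard-free runs with their offsets."""
    runs, start = [], None
    for i, ch in enumerate(pattern + WILDCARD):  # sentinel flushes the last run
        if ch == WILDCARD:
            if start is not None:
                runs.append((start, pattern[start:i]))
                start = None
        elif start is None:
            start = i
    return runs


def windows(offset: int, run: str, n: int) -> List[Ngram]:
    """All length-n windows of a run, tagged with their pattern offset."""
    return [(offset + i, run[i:i + n]) for i in range(len(run) - n + 1)]


def covering_ngrams(pattern: str) -> List[Ngram]:
    """Per run: 4grams if the run is long enough, otherwise 3grams."""
    plan = []
    for offset, run in literal_runs(pattern):
        n = 4 if len(run) >= 4 else 3
        plan.extend(windows(offset, run, n))
    return plan


def legacy_ngrams(pattern: str) -> List[Read]:
    """Old behaviour as described above: query every index, AND everything."""
    reads = []
    for offset, run in literal_runs(pattern):
        for index, n in (("text4", 4), ("gram3", 3), ("hash4", 4)):
            reads.extend((index, o, g) for o, g in windows(offset, run, n))
    return reads


def implied(read: Read, plan: List[Ngram]) -> bool:
    """True if the read's ngram lies entirely inside some ngram of the plan."""
    _, off, gram = read
    return any(
        p <= off and off + len(gram) <= p + len(g)
        and g[off - p:off - p + len(gram)] == gram
        for p, g in plan
    )


if __name__ == "__main__":
    plan = covering_ngrams("abcde?ghi")
    old = legacy_ngrams("abcde?ghi")
    print(len(plan), "reads instead of", len(old))
    print(all(implied(r, plan) for r in old))
```

For `abcde?ghi` the sketch selects the three ngrams listed above, while the legacy enumeration produces the eight reads from the second list, each of which is contained in one of the selected ngrams.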

The downside is that it's hard to combine this with querygraphs. As mentioned in the PR introducing querygraphs, they're a nice data structure that enables a lot of optimisations - but I never followed up, so they're quite slow in practice. The only thing we use them for right now is nocase strings. In that case this PR introduces a regression (nocase strings are ignored again). I think this is worth it for much better performance in the general case.
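
To give a feel for why nocase strings do not fit a flat list of ngrams that are simply ANDed together: every letter of a case-insensitive ngram may appear in either case, so a single position in the pattern turns into an OR over several concrete ngrams. The snippet below is a minimal sketch assuming ASCII letters; `nocase_variants` is a hypothetical helper for illustration and is not part of ursadb.

```python
from itertools import product
from typing import Set


def nocase_variants(ngram: str) -> Set[str]:
    """Return every upper/lower-case spelling of the given ngram."""
    # Non-letters have identical lower() and upper(), so they yield one choice.
    choices = [{ch.lower(), ch.upper()} for ch in ngram]
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    variants = nocase_variants("abc")
    # A nocase match at this position needs (abc OR abC OR aBc OR ...).
    print(len(variants), sorted(variants))
```

A 3gram made only of letters expands to 8 alternatives and a 4gram to 16, and a query has to OR these before ANDing with the neighbouring positions.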

Full benchmark results here: https://raw.githubusercontent.com/msm-code/ursa-bench/master/results/hdd_all.html (download and open with a browser of your choice).

@msm-code msm-code merged commit bbb2986 into master Dec 14, 2022
@msm-code msm-code deleted the fix/performance1-queryplan branch December 14, 2022 17:54
@msm-code msm-code added this to the v1.6.0 milestone Dec 18, 2022