Performance improvement 1: queryplans #191

msm-code · 2022-12-13T21:28:26Z

While "Query Plans" sound like an advanced feature, this is actually a regression from QueryGraphs. What's more, after this PR query graphs will be unused, and we could remove them in principle.

This PR reduces the number of reads by between 10% and 300%, as seen in the screenshot (ignore the time in ms: it depends on disk load and cache contents at the moment of the benchmark; read count is what matters).

Since reading is the most time-consuming part of ursadb's work, especially on HDDs, this may speed up the database significantly.

In this PR I rewrite the matching algorithm to use a minimal covering set of necessary ngrams. For example, to match:

abcde?ghi

(where ? is an unknown/missing character), we will check the following ngrams (assuming all indexes are available):

abcd
 bcde
      ghi

While in the past we would check:

abcd
 bcde
abc
 bcd
  cde
abcd (hash4)
 bcde (hash4)
      ghi

and AND the results later. Most of these are unnecessary: for example, there's no point checking the 3grams abc and bcd, since they're implied by the 4gram abcd. The same goes for the pseudo-4gram abcd provided by the hash4 index: it's implied by the text4 one.
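The selection rule from the example above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the function names `literal_runs` and `covering_ngrams` are hypothetical, `?` is assumed to be the only wildcard, and both a 4gram (text4) and a 3gram index are assumed to be available.

```python
def literal_runs(pattern, wildcard="?"):
    """Yield (offset, text) for each maximal run of known characters."""
    start = None
    # A trailing sentinel wildcard closes the last run.
    for i, ch in enumerate(pattern + wildcard):
        if ch == wildcard:
            if start is not None:
                yield start, pattern[start:i]
                start = None
        elif start is None:
            start = i


def covering_ngrams(pattern, wildcard="?"):
    """Return (offset, ngram) pairs: 4grams where a run is long enough,
    a single 3gram for runs of exactly three characters."""
    plan = []
    for offset, run in literal_runs(pattern, wildcard):
        if len(run) >= 4:
            # Every 3gram inside this run is implied by some 4gram,
            # so only the 4grams are looked up.
            plan.extend((offset + i, run[i:i + 4]) for i in range(len(run) - 3))
        elif len(run) == 3:
            plan.append((offset, run))
        # Shorter runs cannot be looked up in either assumed index.
    return plan


# Print the plan aligned under the pattern, as in the example above.
for offset, ngram in covering_ngrams("abcde?ghi"):
    print(" " * offset + ngram)
```

For `abcde?ghi` this yields three lookups (abcd, bcde, ghi) instead of the eight listed for the old approach.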

The downside is that it's hard to combine this with querygraphs. As mentioned in the PR introducing querygraphs, they're a nice data structure that enables a lot of optimisation, but I never followed up, so they're quite slow in practice. The only thing we use them for right now is nocase strings, so in that case this PR introduces a regression (nocase strings are ignored again). I think this is worth it for much better performance in the general case.

Full benchmark results here: https://raw.githubusercontent.com/msm-code/ursa-bench/master/results/hdd_all.html (download and open with a browser of your choice).

msm-code mentioned this pull request Dec 13, 2022

[META] Ursadb performance improvements #190

msm-code requested a review from nazywam December 13, 2022 21:51

nazywam approved these changes Dec 14, 2022

msm-code added 8 commits December 14, 2022 18:34

Use simple ngram list instead of QueryGraphs

24744f0

Reformat the code

434cdc0

Make the github formatter happy

b6f67a6

Small improvement to make clang-tidy happier

0d4922a

Fix the tests and a single regression

f6e125f

Fix db config passing

e58fe80

Fix param name

594304b

Fix a test

b3c0917

msm-code force-pushed the fix/performance1-queryplan branch from 432f1b4 to b3c0917 December 14, 2022 17:34

Remove clang tidy for good

535115b

msm-code merged commit bbb2986 into master Dec 14, 2022

msm-code deleted the fix/performance1-queryplan branch December 14, 2022 17:54

msm-code added this to the v1.6.0 milestone Dec 18, 2022
