Performance improvement 1: queryplans #191
Merged
While "Query Plans" sounds like an advanced feature, this is actually a regression from QueryGraphs. What's more, after this PR query graphs will be unused, so in principle we could remove them.
This PR reduces the number of reads by between 10% and 300%, as seen in the benchmark screenshot (ignore the time in ms: it depends on disk load and cache contents at the moment of the benchmark - the read count is what matters).
Since reading is the most time-consuming part of ursadb's work, especially on HDD, this may speed up the database significantly.
In this PR I rewrite the matching algorithm to use a minimal covering set of necessary ngrams. For example, to match:
(where ? is an unknown/missing character), we will check the following ngrams (assuming all indexes are available):
While in the past we would check:
and AND the results later. Most of these are unnecessary - for example, there's no point checking the 3grams abc and bcd, since they're implied by the 4gram abcd. The same goes for the pseudo-4gram abcd provided by the hash4 index - it's implied by the text4 one.
The downside is that it's hard to combine this with querygraphs. As mentioned in the PR introducing querygraphs, they're a nice data structure that enables a lot of optimisations - but I never followed up, so they're quite slow in practice. The only thing we use them for right now is nocase strings. In that case this PR introduces a regression (nocase strings are ignored again). I think this is worth it for much better performance in the general case.
Full benchmark results here: https://raw.githubusercontent.com/msm-code/ursa-bench/master/results/hdd_all.html (download and open with a browser of your choice).
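
To illustrate the covering-set idea, here is a minimal Python sketch. It is not the code from this PR: the pattern "abcd?efg", the function names, and the simplification to plain 3grams and 4grams (the hash4 index is not modelled) are all assumptions made for the example. It drops every 3gram that is a substring of a 4gram we already have to look up, since a file containing the 4gram necessarily contains the 3gram.

```python
WILDCARD = "?"  # assumed marker for an unknown/missing character


def ngrams(pattern, n):
    """Return the set of wildcard-free windows of length n in the pattern."""
    windows = (pattern[i:i + n] for i in range(len(pattern) - n + 1))
    return {w for w in windows if WILDCARD not in w}


def naive_ngrams(pattern):
    """Old behaviour as described above: look up every 3gram and 4gram."""
    return sorted(ngrams(pattern, 3) | ngrams(pattern, 4))


def covering_ngrams(pattern):
    """Keep all 4grams, plus only the 3grams not implied by any of them."""
    four = ngrams(pattern, 4)
    kept = set(four)
    for gram in ngrams(pattern, 3):
        # A 3gram contained in a required 4gram adds no new constraint.
        if not any(gram in longer for longer in four):
            kept.add(gram)
    return sorted(kept)


if __name__ == "__main__":
    pattern = "abcd?efg"  # hypothetical pattern, not the one from the PR
    print("naive:   ", naive_ngrams(pattern))
    print("covering:", covering_ngrams(pattern))
```

For this hypothetical pattern the naive approach needs four index lookups (abc, abcd, bcd, efg), while the covering set needs two (abcd, efg), because abc and bcd are implied by abcd.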