Performance improvement 3: implement ngram cache #197
In this PR I introduce an ngram cache per dataset query. This means that ngrams (where it makes sense) won't be read from disk twice. This causes a small problem, though: the cache can get really big pretty quickly.
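To make the idea concrete, here is a minimal sketch of such a per-query cache. All names here (`TriGram`, `FileId`, `read_ngram_from_disk`, `QueryNgramCache`) are invented for illustration and are not the actual ursadb types:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real ursadb types.
using TriGram = std::uint32_t;
using FileId = std::uint32_t;

// Stand-in for the expensive part: seek, read and decode one posting list.
std::vector<FileId> read_ngram_from_disk(TriGram /*ngram*/) { return {}; }

// Per-query cache: it lives only as long as a single dataset query, so a
// repeated ngram is read and decoded from disk at most once per query.
class QueryNgramCache {
   public:
    const std::vector<FileId> &get(TriGram ngram) {
        auto it = cache_.find(ngram);
        if (it == cache_.end()) {
            // Cache miss: pay the disk read + decode cost once, then reuse.
            it = cache_.emplace(ngram, read_ngram_from_disk(ngram)).first;
        }
        return it->second;
    }

   private:
    // Nothing is ever evicted, which is exactly why the cache can grow large.
    std::unordered_map<TriGram, std::vector<FileId>> cache_;
};
```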
After #195 and #196, I thought we finally had something worth merging.
Just looking at https://github.com/msm-code/ursa-bench/tree/master/results (3_ngramcache_hdd_all.txt, http://65.21.130.153:8000/hdd_all.html), it looks like we have an almost 100% speedup - at least as measured by the number of reads. But in a real-world case all other metrics are basically irrelevant for ursadb (ANDs on the CPU are super quick compared to HDD reads; even on an SSD there's an order of magnitude difference). Just by the raw numbers as reported, there's roughly a 50% speedup.
Just one technicality though - since ursadb doesn't return the wall-clock time of the whole request, we don't actually have it. This means I had to resort to using good old `time`:

Nice! Looks like we are onto something - still the same ~45% speedup. We've reduced the number of reads by roughly 45%, and query time by the same amount. All checks out so far.
But I hear you asking, why does it work? Doesn't Linux already cache disk reads for us? Shouldn't the disk cache for the second read always be warm anyway (since we're reading the same data a second time)? Well yes, but clearly it's doing a lousy job. By caching reads ourselves we also avoid unnecessarily moving data around the kernel and decoding ngrams again. Overall this results in a ~45% speedup.
Or does it? All our tests are on a warm disk cache, so reads would never actually hit a physical disk. Just in case, let's test after dropping pagecache:
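For reference, dropping the page cache on Linux boils down to `sync; echo 3 > /proc/sys/vm/drop_caches` (as root); a tiny equivalent helper, included here only as an illustration and not necessarily how the benchmark did it:

```cpp
#include <fstream>
#include <unistd.h>

// Drop the Linux page cache before a cold-cache benchmark run.
// Equivalent to `sync; echo 3 > /proc/sys/vm/drop_caches` (requires root).
int main() {
    sync();  // flush dirty pages first, so dropping actually empties the cache
    std::ofstream drop("/proc/sys/vm/drop_caches");
    drop << "3\n";  // 3 = free pagecache, dentries and inodes
    return drop.good() ? 0 : 1;
}
```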
That 30 seconds doesn't look impressive now, does it?
What a total disaster. Looks like I've been optimising the wrong number. We don't really care about the number of reads, we care about the number of unique reads - cached disk reads are almost as fast as reading data straight from RAM. And we don't have to worry about RAM usage etc. - all of that is handled by the system.
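A rough sketch of the metric we should have been tracking instead (hypothetical instrumentation, not actual ursadb or ursa-bench code):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Counts total reads versus unique reads. Repeated reads of the same offset
// are served from the OS page cache (or our ngram cache) and are almost free;
// only the first, unique read pays the real disk cost.
struct ReadStats {
    std::size_t total_reads = 0;
    std::size_t unique_reads = 0;
    std::unordered_set<std::uint64_t> seen_offsets;

    void record(std::uint64_t file_offset) {
        ++total_reads;
        if (seen_offsets.insert(file_offset).second) {
            ++unique_reads;  // first touch of this offset: a real disk read
        }
    }
};
```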
Takeaways:

For a query like `("AAA" & "BBB" & "CCC") | ("DDD" & "AAA" & "BBB")`, when evaluating the second `or` parameter we should first `and` "AAA" and "BBB" (which are already cached), in the hope that this will return an empty set and we can return early - see the sketch below. A more advanced option is to have some "ngram profile" and look for rare ngrams first.
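A rough sketch of that first idea - hypothetical names and helpers, not the actual ursadb query evaluation code:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <unordered_map>
#include <vector>

using TriGram = std::uint32_t;
using FileId = std::uint32_t;

// Stand-in for the real disk read + ngram decoding.
std::vector<FileId> load_ngram(TriGram /*ngram*/) { return {}; }

// Intersection of two sorted posting lists.
std::vector<FileId> intersect(const std::vector<FileId> &a,
                              const std::vector<FileId> &b) {
    std::vector<FileId> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Per-query cache, filled while evaluating earlier parts of the query.
std::unordered_map<TriGram, std::vector<FileId>> cache;

// Evaluate an `and` of ngrams: look at already-cached ngrams first and stop
// as soon as the running intersection becomes empty.
std::vector<FileId> evaluate_and(std::vector<TriGram> ngrams) {
    std::stable_sort(ngrams.begin(), ngrams.end(), [](TriGram a, TriGram b) {
        return cache.count(a) > cache.count(b);  // cached ngrams go first
    });
    std::vector<FileId> result;
    bool first = true;
    for (TriGram ngram : ngrams) {
        auto it = cache.find(ngram);
        if (it == cache.end()) {
            it = cache.emplace(ngram, load_ngram(ngram)).first;
        }
        result = first ? it->second : intersect(result, it->second);
        first = false;
        if (result.empty()) {
            break;  // early exit: no need to read the remaining ngrams at all
        }
    }
    return result;
}
```

With the example query above, by the time the second `or` branch is evaluated, "AAA" and "BBB" are already cached, so they get intersected first and "DDD" is only read from disk if that intersection is non-empty.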
"AAA" and "BBB" (which are already cached), in hope that this will return an empty set and we will return early. More advanced option is to have some "ngram profile" and look for rare ngrams first.Thanks for reading my dev blog.