Performance improvement 3: implement ngram cache #197

Draft · wants to merge 1 commit into master
Conversation
@msm-code commented Dec 17, 2022

In this PR I introduce an ngram cache per dataset query. This means that ngrams (where it makes sense) won't be read from disk twice. This causes a small problem, because the cache can get really big pretty quickly.
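The core idea, as a minimal sketch (the names here - QueryNgramCache, PostingList, read_from_disk - are illustrative assumptions, not ursadb's actual types):

// Per-query cache: each distinct ngram's posting list is read and
// decoded from disk at most once while a single dataset query runs.
#include <cstdint>
#include <unordered_map>
#include <vector>

using FileId = uint32_t;
using PostingList = std::vector<FileId>;

class QueryNgramCache {
public:
    // Returns the posting list for `ngram`, paying the disk read and
    // decode cost only on the first request for that ngram.
    template <typename ReadFn>
    const PostingList &get(uint32_t ngram, ReadFn &&read_from_disk) {
        auto it = cache_.find(ngram);
        if (it == cache_.end()) {
            it = cache_.emplace(ngram, read_from_disk(ngram)).first;
        }
        return it->second;
    }

private:
    // This is the "can get really big" part: one decoded posting list
    // is kept alive per distinct ngram touched by the query.
    std::unordered_map<uint32_t, PostingList> cache_;
};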

After #195 and #196, I thought we finally had something worth merging.

Just looking at https://github.com/msm-code/ursa-bench/tree/master/results (3_ngramcache_hdd_all.txt, http://65.21.130.153:8000/hdd_all.html), it looks like we have almost a 100% speedup. At least, as measured by the number of reads - but in a real-world case all other metrics are basically irrelevant for ursadb (ANDs on the CPU are super quick compared to HDD reads; even on an SSD there's an order of magnitude difference). Just by the raw numbers as reported, there's roughly a 50% speedup.

One technicality though - since ursadb doesn't report wall-clock time for the whole request, we don't actually have that number. This means I had to resort to good old time:

# old version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    1m26.660s
user    0m14.244s
sys     0m0.813s
# new version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    1m0.512s
user    0m13.857s
sys     0m0.806s

Nice! Looks like we are onto something - still the same ~45% speedup. We've reduced the number of reads by ~45%, and the query time dropped by the same amount. All checks out so far.

But I hear you asking: why does it work? Doesn't Linux already cache disk reads for us? Shouldn't the disk cache always be warm for the second read anyway (since we're reading the same data a second time)? Well, yes, but clearly it's doing a lousy job. By caching reads ourselves we also avoid unnecessarily moving the data through the kernel, and decoding the ngrams again. Overall this results in a ~45% speedup.

Or does it? All our tests ran on a warm disk cache, so reads never actually hit the physical disk. Just in case, let's test again after dropping the pagecache (echo 3 > /proc/sys/vm/drop_caches):

# old version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    71m23.290s
user    0m17.739s
sys     0m0.796s

# new version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    70m53.808s
user    0m17.651s
sys     0m0.801s

That 30 seconds doesn't look impressive now, does it?

What a total disaster. Looks like I've been optimising the wrong number. We don't really care about the number of reads; we care about the number of unique reads - cached disk reads are almost as fast as reading data straight from RAM. And we don't have to worry about RAM usage either - it's all handled by the system.

Takeaways:

  • The most important (and maybe only) thing we should optimise is the number of unique ngram reads. Reading the same data multiple times in a row is fast.
  • Disk cache is serious business. Thanks, Linux.
  • We can also drop Performance improvement 3: implement string cache #196. Although that one at least saves us some CPU work. I think I should actually continue working on Performance improvement 3: simplify queries #195 - it actually reduces the number of reads we have to do. Maybe we can use the lessons learned here to prioritise already-cached ngrams - so we maximise the chance that we don't have to do another read. For example, in ("AAA" & "BBB" & "CCC") | ("DDD" & "AAA" & "BBB"), when evaluating the second OR branch we should first AND "AAA" and "BBB" (which are already cached), in the hope that this returns an empty set and we can return early (see the sketch after this list). A more advanced option is to have some "ngram profile" and look for rare ngrams first.
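A rough sketch of that prioritisation idea (NgramTerm, has_files and and_may_match are hypothetical names; real evaluation would intersect posting lists rather than just check for emptiness):

// Evaluate an AND of ngram terms, looking at already-cached terms first:
// if a cached term's posting list is empty, the whole AND is empty and we
// never pay for the remaining (uncached) disk reads.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct NgramTerm {
    uint32_t ngram;
    bool cached;  // true if the per-query cache already holds this ngram
};

bool and_may_match(std::vector<NgramTerm> terms,
                   const std::function<bool(uint32_t)> &has_files) {
    // Move cached terms to the front; evaluating them is nearly free.
    std::stable_sort(terms.begin(), terms.end(),
                     [](const NgramTerm &a, const NgramTerm &b) {
                         return a.cached > b.cached;
                     });
    for (const auto &term : terms) {
        if (!has_files(term.ngram)) {
            return false;  // empty set: short-circuit, skip further reads
        }
    }
    return true;
}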

Thanks for reading my dev blog.
