Performance improvement 3: implement ngram cache #197

Draft · wants to merge 1 commit into master
Conversation
@msm-code commented Dec 17, 2022

In this PR I introduce an ngram cache per dataset query. This means that ngrams (where it makes sense) won't be read from disk twice. This causes a small problem, because the cache can get really big pretty quickly.
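The core idea, as a minimal sketch (the names here - QueryNgramCache, PostingList, read_from_disk - are illustrative assumptions, not ursadb's actual types):

// Per-query cache: each distinct ngram's posting list is read and
// decoded from disk at most once while a single dataset query runs.
#include <cstdint>
#include <unordered_map>
#include <vector>

using FileId = uint32_t;
using PostingList = std::vector<FileId>;

class QueryNgramCache {
public:
    // Returns the posting list for `ngram`, paying the disk read and
    // decode cost only on the first request for that ngram.
    template <typename ReadFn>
    const PostingList &get(uint32_t ngram, ReadFn &&read_from_disk) {
        auto it = cache_.find(ngram);
        if (it == cache_.end()) {
            it = cache_.emplace(ngram, read_from_disk(ngram)).first;
        }
        return it->second;
    }

private:
    // This is the "can get really big" part: one decoded posting list
    // is kept alive per distinct ngram touched by the query.
    std::unordered_map<uint32_t, PostingList> cache_;
};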

After #195 and #196, I thought we finally had something worth merging.

Just looking at https://github.com/msm-code/ursa-bench/tree/master/results (3_ngramcache_hdd_all.txt, http://65.21.130.153:8000/hdd_all.html), it looks like we have almost a 100% speedup. At least, as measured by the number of reads - but in a real-world case all other metrics are basically irrelevant for ursadb (ANDs on the CPU are super quick compared to HDD reads; even on an SSD there's an order of magnitude difference). Just by the raw numbers as reported, there's roughly a 50% speedup.

One technicality though - since ursadb doesn't report wall-clock time for the whole request, we don't actually have that number. This means I had to resort to good old time:

# old version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    1m26.660s
user    0m14.244s
sys     0m0.813s
# new version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    1m0.512s
user    0m13.857s
sys     0m0.806s

Nice! Looks like we are onto something - still the same ~45% speedup. We've reduced the number of reads by ~45%, and the query time dropped by the same amount. All checks out so far.

But I hear you asking: why does it work? Doesn't Linux already cache disk reads for us? Shouldn't the disk cache always be warm for the second read anyway (since we're reading the same data a second time)? Well, yes, but clearly it's doing a lousy job. By caching reads ourselves we also avoid unnecessarily moving the data through the kernel, and decoding the ngrams again. Overall this results in a ~45% speedup.

Or does it? All our tests ran on a warm disk cache, so reads never actually hit the physical disk. Just in case, let's test again after dropping the pagecache (echo 3 > /proc/sys/vm/drop_caches):

# old version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    71m23.290s
user    0m17.739s
sys     0m0.796s

# new version
(venv) root@mquery ~/ursa-bench # time python3 ursabench.py rules/signature-base/yara/* > /dev/null
real    70m53.808s
user    0m17.651s
sys     0m0.801s

That 30 seconds doesn't look impressive now, does it?

What a total disaster. Looks like I've been optimising the wrong number. We don't really care about the number of reads; we care about the number of unique reads - cached disk reads are almost as fast as reading data straight from RAM. And we don't have to worry about RAM usage either - it's all handled by the system.

Takeaways:

  • The most important (and maybe only) thing we should optimise is the number of unique ngram reads. Reading the same data multiple times in a row is fast.
  • Disk cache is serious business. Thanks, Linux.
  • We can also drop Performance improvement 3: implement string cache #196. Although that one at least saves us some CPU work. I think I should actually continue working on Performance improvement 3: simplify queries #195 - it actually reduces the number of reads we have to do. Maybe we can use the lessons learned here to prioritise already-cached ngrams - so we maximise the chance that we don't have to do another read. For example, in ("AAA" & "BBB" & "CCC") | ("DDD" & "AAA" & "BBB"), when evaluating the second OR branch we should first AND "AAA" and "BBB" (which are already cached), in the hope that this returns an empty set and we can return early (see the sketch after this list). A more advanced option is to have some "ngram profile" and look for rare ngrams first.
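A rough sketch of that prioritisation idea (NgramTerm, has_files and and_may_match are hypothetical names; real evaluation would intersect posting lists rather than just check for emptiness):

// Evaluate an AND of ngram terms, looking at already-cached terms first:
// if a cached term's posting list is empty, the whole AND is empty and we
// never pay for the remaining (uncached) disk reads.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct NgramTerm {
    uint32_t ngram;
    bool cached;  // true if the per-query cache already holds this ngram
};

bool and_may_match(std::vector<NgramTerm> terms,
                   const std::function<bool(uint32_t)> &has_files) {
    // Move cached terms to the front; evaluating them is nearly free.
    std::stable_sort(terms.begin(), terms.end(),
                     [](const NgramTerm &a, const NgramTerm &b) {
                         return a.cached > b.cached;
                     });
    for (const auto &term : terms) {
        if (!has_files(term.ngram)) {
            return false;  // empty set: short-circuit, skip further reads
        }
    }
    return true;
}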

Thanks for reading my dev blog.
