hashtable: implement different hashing functions #4816
Conversation
Djb2, Murmur3 and CRC32; use only Djb2 so far.
Sorry, semi-random drive-by commenting as I was browsing through Suricata tickets... This is a follow-up on the "benchmark main" discussion from v1. Did not want to comment on a closed ticket.

Since your hash is uint32 you can sum up the calculated hashes in the loop and then use that sum somehow after the loop, or look at how micro-benchmarking frameworks such as Google Benchmark or (Facebook) folly benchmark do it. I.e. create a helper function that tricks the compiler into not optimizing a variable away. You essentially need some inline asm with read/global memory barriers. However, the optimizer might still cause havoc on the benchmark. Example: Facebook folly benchmark: doNotOptimizeAway(). Don't get scared off by folly's C++ machinery. Btw, I have not seen the benchmarking main in this PR, only in comments on v1.
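The "sum plus barrier" trick described above can be sketched as follows. This is a hedged illustration, not folly's or Suricata's actual code: `BenchDoNotOptimize`, `HashDjb2`, and `BenchHashes` are hypothetical names, and the inline asm assumes GCC/Clang.

```c
#include <stdint.h>
#include <stddef.h>

/* Compiler barrier in the spirit of folly's doNotOptimizeAway():
 * tells the optimizer the value is "used", so the hash loop cannot
 * be elided as dead code. GCC/Clang only; name is hypothetical. */
static void BenchDoNotOptimize(uint32_t value)
{
    __asm__ volatile("" : : "g"(value) : "memory");
}

/* Djb2, matching the `hash * 33 + c` loop shown in this PR's diff. */
static uint32_t HashDjb2(const uint8_t *data, size_t len)
{
    uint32_t hash = 5381;
    for (size_t i = 0; i < len; i++)
        hash = ((hash << 5) + hash) + data[i];
    return hash;
}

/* Benchmark loop: sum the uint32 hashes, then hand the sum to the
 * barrier after the loop so the compiler cannot drop the work. */
static uint32_t BenchHashes(const uint8_t *keys, size_t key_len, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += HashDjb2(keys + i * key_len, key_len);
    BenchDoNotOptimize(sum);
    return sum;
}
```

As noted above, even with the barrier the optimizer can still reshape the loop itself, so results should be sanity-checked against the generated assembly.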
Thanks for the anti-optimization-sum tip @roligugus. Any thoughts on the collisions?
No worries. Please see the update about not copy&pasting code from benchmark libs, but rather re-implementing, because of incompatible licenses.
Not much. It does not seem out-of-line that murmur3 has more collisions than crc; makes sense, imho. You can also check that old, but "valuable for available hashes and overview" stackexchange link you're also referring to in the bug, see https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. IIRC, it also mentions collision stats in many of the comments, and I think it confirms what you see in your tests.

The only thing you can do is test some other hashes and see if you prefer those. I think I did mention some alternative hashes in the bug. E.g. Google uses cityhash as their "general purpose hash" in abseil. However, it's really a question of how much time you want to spend analysing and benchmarking hashes. The good thing is you have done the leg work to be able to compare hashes, so that you could rerun benchmarks etc. once a new hash needs testing. I'd recommend resurrecting the benches/ directory for your benchmarks so that they are checked in. ;)

FYI, we also use Murmur3 in internal projects and live with the consequences and resulting slow paths in case of collisions. This does not mean that Murmur3 is the best, it's just what we chose years ago as a default. And no, there was no elaborate testing done before deciding on that besides some superficial comparisons. ;) It was also before smhasher became available, see https://github.com/rurban/smhasher for the latest, with some further links as well.

I am not a hash specialist by any means, just a curious dev, but my take has always been: use the best generic approach and then replace it with something more specific where it becomes a problem. Unless you know better from the get-go for a particular problem and have the time to properly spend on the engineering. I call it the whack-a-mole approach, i.e. "this is slow, let's analyse it - oh, it's the lookup - let's improve that". I find that often best for maintenance given a diverse set of developers, albeit not always ideal from an engineering point of view. Anyhow, I am digressing. Not sure if that helps at all. Sorry for the detour. ;)
As for the collisions... Generally, the fewer the better, but it really is a balance of speed vs collisions to which there is no definitive answer. The real impact of hash collisions is seen where the hash is used; the hash function itself only provides part of the picture. You counted your collisions from the calculated 32-bit hash value, which is good. So the collision numbers can give you an idea about clusters in the actual hash table, but to do a full assessment you would also need to know the bucket lengths, load factor, etc. Ideally, the buckets are shallow, as the deeper they get the slower lookups become. It also depends on whether thash uses open or closed addressing. Anyhow, this is kind of digressing into a thash discussion and greatly depends on the params used to create the thash, and I don't think you want to measure thash with this ticket.

So I would recommend also calculating the collision rates caused by thash, i.e. calculate mod hash_size collisions as well, since the mod hash_size will have an impact if the hash used is not perfectly uniform. Unfortunately, this is configurable, so do the stats on the default value and "some common/typical values" maybe? Sidenote to the sidenote: fibonacci hashing instead of mod hash_size might help a bit with the distribution in thash. See e.g. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

Once those thash collision numbers are calculated, you will have more info, but probably still no definitive answer. Ultimately, it's the "speed to calculate the hash" for non-collision cases vs the "speed of hash calculation plus linear search or so required to find the element in the bucket" for collisions. Benchmark inserts/gets/updates/removes on the thash table to get a full picture of all use cases. Last but not least, this should probably be done on the major supported/typical platforms/CPU architectures. This is before throwing in different compiler versions. This will get time-consuming very quickly.

Personally, I don't think this is all worthwhile unless you really care and have the time to spend. I think getting the performance of the different hashes plus the collision rates is a good start, and you've already done that. I recommend doing that on at least the main supported platforms/architectures and compiler versions. Adding thash collision rates would be an improvement on top of that, but I am not sure it's worth the effort. I know all this is easy to say as a bystander.
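The fibonacci-hashing aside can be sketched as below. This is an illustrative comparison, not thash's actual bucketing code; the function names are made up, and the fibonacci variant assumes a power-of-two table size.

```c
#include <stdint.h>

/* Plain modulo bucketing: when hash_size is a power of two, only
 * the low bits of the hash influence the bucket index. */
static uint32_t BucketMod(uint32_t hash, uint32_t hash_size)
{
    return hash % hash_size;
}

/* Fibonacci hashing: multiply by 2^32 / phi (2654435769) and keep
 * the TOP bits, which mixes all input bits into the bucket index.
 * Assumes the table has 2^shift_bits buckets. */
static uint32_t BucketFibonacci(uint32_t hash, unsigned shift_bits)
{
    return (hash * 2654435769u) >> (32 - shift_bits);
}
```

The practical difference: two hashes that agree in their low bits collide under `BucketMod` but usually not under `BucketFibonacci`, which is why the latter can smooth out a not-quite-uniform hash.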
hash = ((hash << 5) + hash) + str->md5[i]; /* hash * 33 + c */
}
return hash;
#if defined(TEST_HASH_MURMUR)
You went from an inlined hash calculation, to a non-inlined hash calculation through the function call.
This is not necessarily bad, it's just different.
If you absolutely must have the same behaviour as the previous code, you could make HashFunctionDjb2() an inline function and hope the compiler inlines it.
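The reviewer's suggestion might look like the sketch below: placing the function as `static inline` in a header so the compiler can fold it back into each call site, approximating the old open-coded loop. This is a sketch, not the PR's actual code; whether inlining happens is still up to the compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* In a header: `static inline` gives every translation unit its own
 * inlinable definition. The name HashFunctionDjb2 follows the PR
 * discussion; the signature here is assumed for illustration. */
static inline uint32_t HashFunctionDjb2(const uint8_t *data, size_t len)
{
    uint32_t hash = 5381;
    for (size_t i = 0; i < len; i++) {
        hash = ((hash << 5) + hash) + data[i]; /* hash * 33 + c */
    }
    return hash;
}
```

With link-time optimization enabled, the same effect can often be had without moving the function into a header at all.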
Sorry, kid break... From a practical point of view, if you want to change, pick the fastest one, as the collision rates for the given test data seem to be "somewhat in the same ballpark"? Just make sure you test on more than one arch/compiler version. This will cover the "get/lookup" use case, which I assume is the most common one in Suricata? Use different hashes for different use cases if it's worthwhile. Always use release builds for measurements, or better put: use the compile flags the installed Suricata uses.

Btw, your test is looking at 1 million elements for three hashes, that is 3 * 4 bytes * 1 million -> 12 MB. You should be able to easily keep that in memory. Fill in the array when calculating hashes, and do the stats (collisions, etc.) and file printing afterwards with the in-memory data.
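The in-memory stats pass described above can be sketched as follows; a hedged example assuming the benchmark stores all computed hashes in an array and counts duplicates afterwards by sorting (the function names are illustrative, not from the PR).

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint32_t that avoids overflow from
 * subtracting unsigned values. */
static int CompareU32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort the stored hash values, then count every element equal to
 * its predecessor: each such element is one collision. The array
 * is sorted in place; for 1M uint32 hashes this is ~4 MB per hash
 * function, comfortably in memory as noted above. */
static size_t CountCollisions(uint32_t *hashes, size_t n)
{
    size_t collisions = 0;
    qsort(hashes, n, sizeof(uint32_t), CompareU32);
    for (size_t i = 1; i < n; i++) {
        if (hashes[i] == hashes[i - 1])
            collisions++;
    }
    return collisions;
}
```

Doing the counting after the timed loop also keeps the stats work from polluting the hash-speed measurements.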
Closing, waiting for a QA lab for now.
In emergency mode the time set to wait could be in the past. Bug: OISF#4816.
Link to redmine ticket:
https://redmine.openinfosecfoundation.org/issues/3322
Describe changes:
This is a draft PR not to be merged.
Cf
#4371 (comment)
You can have 3 different builds using:
- `CFLAGS=-DTEST_HASH_MURMUR ./configure && make`
- `CFLAGS=-DTEST_HASH_CRC32 ./configure && make`
- or simply `./configure && make`