
hashtable: implement different hashing functions #4816

Closed
wants to merge 1 commit

Conversation

catenacyber
Contributor

Link to redmine ticket:
https://redmine.openinfosecfoundation.org/issues/3322

Describe changes:

  • Implements other hashing functions: Murmur3, CRC32, Djb2
  • Selects one implementation at build time via C preprocessor define flags

This is a draft PR not to be merged.

Cf
#4371 (comment)

I can run my pcap-based bench on a rebased version of this.

You can get three different builds:

  • `CFLAGS=-DTEST_HASH_MURMUR ./configure && make` (Murmur3)
  • `CFLAGS=-DTEST_HASH_CRC32 ./configure && make` (CRC32)
  • `./configure && make` (default: Djb2)

Djb2, Murmur3 and CRC32
Use only Djb2 so far
@roligugus
Contributor

roligugus commented May 21, 2020

Sorry, semi-random drive-by commenting as I was browsing through Suricata tickets...

This is a follow-up on the "benchmark main" discussion from v1. I did not want to comment on a closed ticket.

@catenacyber:

Indeed it would be better, but how do you make sure the compiler does not optimize everything away?

Since your hash is a uint32, you can sum the calculated hashes in the loop and then use that sum somehow after the loop. Alternatively, look at how micro-benchmarking frameworks such as Google Benchmark or (Facebook) folly Benchmark do it: they provide a helper function that tricks the compiler into not optimizing a variable away, essentially some inline asm with read/global memory barriers. Even then, the optimizer can still wreak havoc on the benchmark.

Example:
Google benchmark: DoNotOptimize() and possibly ClobberMemory()
https://github.com/google/benchmark/blob/master/README.md#preventing-optimization
https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L307

Facebook folly benchmark: doNotOptimizeAway()
https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md#donotoptimizeaway
https://github.com/facebook/folly/blob/master/folly/Benchmark.h#L268

Don't get scared off by folly's C++ machinery.
UPDATE: It would be easy enough to copy and paste, but the Google/folly benchmark libs are APL2-licensed, which is incompatible with GPL2, so don't copy the code. You can still take inspiration from what they do and re-implement it in your own benchmarking framework/test.

Btw, I have not seen the benchmarking main in this PR, only in the comments on v1.
It could be prudent to add these hash micro-benchmark tests to the Suricata makefiles somewhere. You could even use one of the micro-benchmark frameworks; I doubt you'd want to add a dependency on any C++ framework, but rust has micro-benchmarking capability as well, or at least a crate available.

@catenacyber
Contributor Author

Thanks for the anti-optimization-sum tip @roligugus

Any thoughts on the collisions ?
#4371 (comment)

@roligugus
Contributor

roligugus commented May 28, 2020

Thanks for the anti-optimization-sum tip @roligugus

No worries. Please see the update about not copying code from the benchmark libs, but rather re-implementing it, because of the incompatible licenses.

Any thoughts on the collisions ?
#4371 (comment)

Not much. It does not seem out of line that Murmur3 has more collisions than CRC32. Makes sense, imho.

You can also check that old but valuable StackExchange overview of available hashes that you also refer to in the bug: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. IIRC it mentions collision stats in many of the comments, and I think it confirms what you see in your tests.

The only thing you can do is test some other hashes and see if you prefer them. I think I mentioned some alternative hashes in the bug; e.g. google uses cityhash as their "general purpose hash" in abseil. However, it's really a question of how much time you want to spend analysing and benchmarking hashes. The good thing is you have done the legwork to compare hashes, so you can rerun the benchmarks once a new hash needs testing. I'd recommend resurrecting the benches/ directory for your benchmarks so that they are checked in. ;)

FYI, we also use Murmur3 in internal projects and live with the consequences and the resulting slow paths on collisions. This does not mean Murmur3 is the best; it's just what we chose years ago as the default. And no, there was no elaborate testing done before deciding on that besides some superficial comparisons. ;) It was also before smhasher became available; see https://github.com/rurban/smhasher for the latest, with some further links as well.
Having said that, we've replaced our generic hashes/hashmaps, based on Murmur3 or C++'s default unordered/ordered maps (but it could be anything really), with custom hashes/maps/lookups where we expected too many collisions or felt we could do better than a generic hash for a problem space. IP/port lookups and some others come to mind. Suricata uses a trie lookup for IPs as well, afaik.

I am not a hash specialist by any means, just a curious dev, but my take has always been: use the best generic approach and then replace it with something more specific where it becomes a problem, unless you know better from the get-go for a particular problem and have the time to properly spend on the engineering. I call it the whack-a-mole approach, i.e. "this is slow, let's analyse it - oh, it's the lookup - let's improve that". I find that often best for maintenance given a diverse set of developers, albeit not always ideal from an engineering point of view.

Anyhow, I am digressing. Not sure if that helps at all. Sorry for the detour. ;)

@roligugus
Contributor

roligugus commented May 28, 2020

As for the collisions...

Generally, the fewer the better, but it really is a balance of speed vs collisions, to which there is no definitive answer. The real impact of hash collisions is seen where the hash is used; the hash function itself only provides part of the picture.

You counted your collisions from the calculated 32-bit hash value, which is good.
From what I can tell, the hash is actually used in thash, which looks like a regular bucketed hash table: it does a mod hash_size on top of your chosen hash to map the hash to the actual slot in the table. I see that hash_size is configurable; I have not checked whether the table can also grow dynamically. Anyhow, this is what defines the "real" collision numbers. I have not looked too deeply into the code, but it seems a collision degrades to a linear search through the bucket, which is your main slow-down on retrieval or insertion.

So the thash collision numbers can give you an idea about clusters in the actual hash table, but for a full assessment you would also need to know the bucket lengths, load factor, etc. Ideally the buckets are shallow, as the deeper they get the slower the lookup. It also depends on whether thash uses open or closed addressing. Anyhow, this is digressing into a thash discussion and greatly depends on the parameters used to create the thash; I don't think you want to measure thash in this ticket.

So I would recommend also calculating the collision rates caused by thash, i.e. the mod hash_size collisions, since the mod hash_size step will have an impact if the hash is not perfectly uniform. Unfortunately this is configurable, so do the stats on the default value and maybe on "some common/typical values"?

Sidenote to the sidenote: Fibonacci hashing instead of mod hash_size might help a bit with the distribution in thash. See e.g. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

Once those thash collisions numbers are calculated, you will have more info, but probably still no definitive answer.

Ultimately, it's the "speed of calculating the hash" in the non-collision case vs the "speed of the hash calculation plus the linear search or so required to find the element in the bucket" on collisions. Benchmark inserts/gets/updates/removes on the thash table to get a full picture of all use cases.

Last but not least, this should probably be done on the major supported/typical platforms and CPU architectures, and that is before throwing in different compiler versions.

This will get time-consuming very quickly. Personally, I don't think it is all worthwhile unless you really care and have the time to spend.

I personally think getting the performance of the different hashes plus the collision rates is a good start, and you've already done that. I recommend doing that on at least the main supported platforms/architectures and compiler versions. Adding thash collision rates would be an improvement on top of that, but I am not sure it's worth the effort.

I know all this is easy to say as a bystander.

hash = ((hash << 5) + hash) + str->md5[i]; /* hash * 33 + c */
}
return hash;
#if defined(TEST_HASH_MURMUR)
Contributor


You went from an inlined hash calculation to a non-inlined one through the function call.
This is not necessarily bad, it's just different.
If you absolutely must keep the same behaviour as the previous code, you could make HashFunctionDjb2() an inline function and hope the compiler inlines it.

@roligugus
Contributor

Sorry, kid break...

From a practical point of view, if you want to change, pick the fastest one, as the collision rates on the given test data seem "somewhat in the same ballpark"? Just make sure you test on more than one arch/compiler version. This covers the get/lookup use case, which I assume is the most common one in Suricata. Use different hashes for different use cases if it's worthwhile.

Always use release builds for measurements, or better put: use the compile flags an installed Suricata uses. Btw, your test looks at 1 million elements for three hashes; that is 3 * 4 bytes * 1 million = 12 MB, which you can easily keep in memory. Fill in the array while calculating the hashes, and do the stats (collisions, etc.) and the file printing afterwards from the in-memory data.

@catenacyber
Contributor Author

Closing while waiting for a QA lab for now.
To be followed in https://redmine.openinfosecfoundation.org/issues/3322

victorjulien added a commit to victorjulien/suricata that referenced this pull request Nov 12, 2021
In emergency mode the time set to wait could be in the past.

Bug: OISF#4816.