Change default hash function for consigned structure collections. #8
Awesome PR, loved the pun, thanks 😺 I agree with your analysis and will tell you what I think we should do in a bit. Let me first explain one thing you may find interesting. The main reason I wrote this trivial hasher is that it is deterministic and will yield the same ordering on any machine, as long as term creation is sequential. In my case, this is quite crucial for support in general and bug fixing in particular. Many algorithms I work on can, sadly, behave very differently if you change the ordering. So, I think the most sensible thing to do is to expose the hashers for the factory and the collections. Then we can provide (probably feature-gated) convenience Set/Maps with the hashers you suggest. What do you think?
There's also the question of what the default should be. I think it must be the trivial hasher as some people might be like me and depend on its properties. Then hopefully we can do some analytics magic, see which one is the most popular, and make it the default for
Thanks for looking at this so quickly, Adrien!
This makes sense, and is a good reason to avoid
What kind of collection are you referring to? If I understand correctly, Rust's hash table makes no guarantees about iteration order. Does it currently happen to use hash-order? Are you thinking of some other collection that does iteration in hash-order?
A pretty reasonable idea. Happy to modify the PR to do this.
I'm definitely with you on the importance of preserving expected properties. In my mind the necessity of cross-table hash (and iteration order) stability eliminates
Well generally in this case I would use However I seem to recall that Forget I said anything :) Edit: well, actually, even though they would not necessarily respect the ordering, iteration order on |
Note that it's much better to open collections so that users can decide which hasher they want (and provide feature-gated-or-not defaults), regardless of the discussion on the properties of the trivial hasher I mentioned.
Manifested as
* a BuildHasher parameter for HConMap/HConSet
* with `with_hasher` and `with_capacity_and_hasher` functions
* a build_hasher argument to the consign macro
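As a rough illustration of what a `BuildHasher` parameter with `with_hasher` and `with_capacity_and_hasher` constructors can look like, here is a minimal sketch built on `std`'s `HashMap`. The names `IdMap` and `FixedSip` are hypothetical stand-ins and are not this crate's `HConMap`/`HConSet`; keys are plain `u64` identifiers for brevity.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Hypothetical stand-in for a map keyed by consigned-structure identifiers.
/// The hasher builder `S` is a type parameter, as in `std`'s `HashMap`.
pub struct IdMap<V, S = RandomState> {
    inner: HashMap<u64, V, S>,
}

impl<V, S: BuildHasher> IdMap<V, S> {
    /// Creates an empty map that hashes keys with `build_hasher`.
    pub fn with_hasher(build_hasher: S) -> Self {
        IdMap { inner: HashMap::with_hasher(build_hasher) }
    }

    /// Same as `with_hasher`, but pre-allocates room for `capacity` entries.
    pub fn with_capacity_and_hasher(capacity: usize, build_hasher: S) -> Self {
        IdMap { inner: HashMap::with_capacity_and_hasher(capacity, build_hasher) }
    }

    /// Inserts a value, returning the previous value for `id`, if any.
    pub fn insert(&mut self, id: u64, value: V) -> Option<V> {
        self.inner.insert(id, value)
    }

    /// Looks up the value stored for `id`.
    pub fn get(&self, id: u64) -> Option<&V> {
        self.inner.get(&id)
    }

    /// Number of entries in the map.
    pub fn len(&self) -> usize {
        self.inner.len()
    }
}

/// `std`'s default hasher built with fixed keys instead of a random seed.
pub type FixedSip = BuildHasherDefault<DefaultHasher>;
```

A caller would then pick the hasher at construction time, for example `IdMap::<u32, FixedSip>::with_capacity_and_hasher(16, FixedSip::default())` or `IdMap::with_hasher(RandomState::new())`.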
a05676a to f246b54
Should have tested, not built the docs...
Yep! While this is not promised by the documentation, iteration order is most likely deterministic and platform-agnostic for a fixed version of However, between
I've modified this PR to expose the hash builder from I've also changed the default hash to |
I think I would prefer something that does not silence the change for unsuspecting users. I was thinking of something like
That way, we do not change anything for average users, but we notify them that they should check out the new hash-related module, forcing them to read about the issues you pointed out in this PR. That way they can make an informed decision. Does this sound reasonable to you? I can definitely help if you need/want me to.
That sounds like a great plan to me. I've put up a commit attempting to follow this plan, but feel free to make any edits you desire.
* move the configurable collections to hash_coll.
* keep prime-hash as the default hash
* keep the old coll module.
* deprecate the coll module.
902833d to 17bb9f1
Sorry it took me so long to take care of this. I'm merging your contribution in a Thank you again! |
@alex-ozdemir in case you can donate more time to this library, I would love to have your opinion on #9!
It turns out that the collections provided in this crate perform somewhat poorly, because of an unfortunate collision (forgive my pun) between Rust's hash-table implementation and this crate's custom hash function. I suggest we fix this. Details follow.
Background
This library ascribes unique 64-bit identifiers to all consigned structures. These identifiers are generally consecutive: 0, 1, 2, ... and so on. The `coll` module exposes hash maps from (and hash sets of) consigned structures. The module uses the Rust standard library's hash table, but to reduce hashing costs the table is configured to use the raw identifiers as hashes. The idea (probably) is that since the identifiers are unique, they are a perfectly collision-resistant hash, which preserves hash-table performance.

Problem
The problem is that Rust's hash table uses linear probing, which relies on more than just collision resistance for good performance. While incrementing identifiers are collision-resistant, they are also adjacent: systems that use them tend to produce sets of elements whose identifiers are consecutive. This is a particularly bad case for a linear-probing table. Items with consecutive identifiers end up adjacent in the hash table; when a collision does occur (because of wrapping), long linear scans are required to get past the consecutive occupied table entries.
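To make the clustering effect concrete, here is a toy simulation, not the standard library's actual table: a fixed-size, power-of-two, linear-probing table that counts how many occupied slots insertions have to skip. The constant `PRIME` (2^64 - 59) and all function names are assumptions chosen for illustration.

```rust
/// Hypothetical 64-bit prime (2^64 - 59), chosen only for this illustration.
const PRIME: u64 = 0xFFFF_FFFF_FFFF_FFC5;

/// Inserts distinct `keys` into a toy linear-probing table with `capacity`
/// slots (a power of two) and returns the total number of occupied slots
/// that were skipped while searching for a free one.
pub fn total_probes(keys: &[u64], capacity: usize, hash: fn(u64) -> u64) -> usize {
    assert!(capacity.is_power_of_two() && keys.len() <= capacity);
    let mask = capacity - 1;
    let mut occupied = vec![false; capacity];
    let mut probes = 0;
    for &key in keys {
        // "Wrap" the hash to the table size by keeping its low bits.
        let mut slot = (hash(key) as usize) & mask;
        while occupied[slot] {
            probes += 1;
            slot = (slot + 1) & mask;
        }
        occupied[slot] = true;
    }
    probes
}

/// Identity hash: the identifier is used directly as its hash.
pub fn id_hash(id: u64) -> u64 {
    id
}

/// Prime hash: the identifier is multiplied by a large prime (wrapping).
pub fn prime_hash(id: u64) -> u64 {
    id.wrapping_mul(PRIME)
}

/// Two runs of consecutive identifiers that coincide once wrapped to a
/// 2048-slot table: 0..512 and 2048..2560.
pub fn colliding_runs() -> Vec<u64> {
    (0..512).chain(2048..2560).collect()
}
```

With `colliding_runs()` and a capacity of 2048, the identity hash makes each of the 512 keys in the second run skip 512 occupied slots, for `512 * 512` skipped slots in total. The multiplied hash still collides on the same low bits, but the first run is scattered across the table, so each collision is resolved after far fewer steps.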
Solutions
We should probably modify the `coll` module to use a hash which performs better on increasing identifiers in a linear-probing table. If we call the current system `id-hash` ("identity hash"), the alternatives are:
* `sip-hash` ("Sip hash"): use `std`'s default hash implementation
* `p-hash` ("prime hash"): hash identifiers by multiplying them by a 64-bit prime, ensuring good distribution when wrapped by a power of two.
* `a-hash` ("ahash"): use the `ahash` crate.

The first commit of this PR adds a benchmark for which the hash table is the bottleneck (post-order DAG traversal with a visit set). I used this benchmark to measure the cost of traversing a random 10,000 sub-term DAG (sub-terms counted as if the DAG were a tree), under each of the aforementioned schemes. I can give machine details if you'd like.
id-hash
sip-hash
p-hash
a-hash
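For reference, the `p-hash` idea can be sketched as a small `Hasher` that plugs into `std`'s `HashMap`. This is a minimal sketch under assumptions: the names `PrimeHasher` and `PrimeMap` and the constant `PRIME` (2^64 - 59) are hypothetical and are not necessarily what the commits in this PR use.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hypothetical 64-bit prime (2^64 - 59), chosen only for this illustration.
pub const PRIME: u64 = 0xFFFF_FFFF_FFFF_FFC5;

/// Hasher that multiplies a `u64` identifier by a large prime.
#[derive(Default, Clone, Copy)]
pub struct PrimeHasher {
    state: u64,
}

impl Hasher for PrimeHasher {
    fn finish(&self) -> u64 {
        self.state
    }

    /// Generic fallback for non-`u64` input: fold the bytes in one at a time.
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.state = (self.state.rotate_left(8) ^ u64::from(byte)).wrapping_mul(PRIME);
        }
    }

    /// Fast path taken when the key is a `u64`: a single wrapping multiply.
    fn write_u64(&mut self, id: u64) {
        self.state = id.wrapping_mul(PRIME);
    }
}

/// A `std` hash map from `u64` identifiers that uses the prime hasher.
pub type PrimeMap<V> = HashMap<u64, V, BuildHasherDefault<PrimeHasher>>;
```

Because `BuildHasherDefault` carries no seed, every `PrimeMap` hashes a given identifier to the same value, which keeps hashing deterministic across tables.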
Discussion
So, what path should we take?
* `a-hash`: this introduces a new dependency (bad), gets this library out of the business of hashing (good?), and is the best performing
* `p-hash`: this introduces no dependencies (good), keeps this library in the business of hashing (bad?), and performs almost as well
* `sip-hash`: this introduces no dependencies (good), gets this library out of the business of hashing (good?), and performs quite a bit worse than `a-hash` and `p-hash`, but still much better than `id-hash`.

What option is best aligned with your goals for this crate? This PR implements `sip-hash`, but I have commits for the other two as well, and can add them to the PR.

Alternatives Discussion
* `p-hash` where the identifiers are `0, p, 2p, 3p, ...` for large prime `p`, which essentially front-loads the cost of the multiplication to identifier-generation time. This de-duplicates work, but mangles identifiers. I figure that this isn't worth the code complexity, since multiplication is fast.
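For comparison, the pre-multiplied alternative could look roughly like the sketch below: the generator hands out `0, p, 2p, 3p, ...` (wrapping at 2^64), and the table then uses those values directly as hashes. `PremultipliedIds` and `PRIME` (2^64 - 59) are hypothetical names and values, not part of this crate.

```rust
/// Hypothetical 64-bit prime (2^64 - 59), chosen only for this illustration.
pub const PRIME: u64 = 0xFFFF_FFFF_FFFF_FFC5;

/// Identifier generator that produces 0, p, 2p, 3p, ... (wrapping), so the
/// multiplication is paid once per identifier rather than once per lookup.
#[derive(Default)]
pub struct PremultipliedIds {
    next: u64,
}

impl PremultipliedIds {
    pub fn new() -> Self {
        PremultipliedIds { next: 0 }
    }

    /// Returns a fresh identifier. Since `PRIME` is odd, the first 2^64
    /// identifiers produced are all distinct.
    pub fn fresh(&mut self) -> u64 {
        let id = self.next;
        self.next = self.next.wrapping_add(PRIME);
        id
    }
}
```

The cost of this variant is visible in the output: the second identifier is already `PRIME` rather than `1`, which is the "mangling" mentioned above.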