Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default hash function for consigned structure collections. #8

Merged
merged 7 commits into from Apr 19, 2021

Conversation

alex-ozdemir
Copy link
Contributor

@alex-ozdemir alex-ozdemir commented Mar 31, 2021

It turns out that the collections provided in this crate perform somewhat poorly, because of an unfortunate collision (forgive my pun) between Rust's hash-table implementation and this crate's custom hash function. I suggest we fix this. Details follow.

Background

This library ascribes unique 64 bit identifiers to all consigned structures. These identifiers are generally consecutive: 0, 1, 2, ... and so on. The coll module exposes hash maps from (and hash sets of) consigned structures. The module uses the Rust standard library's hash-table, but to reduce hashing costs the table is configured to use the raw identifiers as hashes. The idea (probably) is that since the identifiers are unique, they are a perfectly collision-resistant hash, which preserve hash-table performance.

Problem

The problem is that the Rust's hash-table uses linear-probing, which relies on more than just collision-resistance for good performance. While incrementing identifiers are collision resistant, they are also adjacent: systems that use them tend to produce sets of elements whose identifiers are consecutive. This is a particularly bad case of a linear-probing table. Items with consecutive identifiers end up being adjacent in the hash table; when a collision does occur (because of wrapping), long linear scans are required to get past the consecutive occupied table entries.

Solutions

We should probably modify the coll module to use a hash which performs better on increasing identifies in a linear-probing table. If we call the current system id-hash ("identity hash"), the alternatives are:

  • sip-hash ("Sip hash"): use std's default hash implementation
  • p-hash ("prime hash"): hash identifiers by multiplying them by a 64-bit prime, ensuring good distribution when wrapped by a power of two.
  • a-hash ("ahash"): use the ahash crate.

The first commit of this PR adds a benchmark for which the hash table is the bottleneck (post-order DAG traversal with a visit set). I used this benchmark to measure the cost of traversing a random 10,000 sub-term DAG (sub-terms counted as if the DAG were a tree), under each of the aforementioned schemes. I can give machine details if you'd like.

Scheme Traversal Time (ns) Speedup
id-hash 866,761 ns/iter (+/- 20,670) 1.00x
sip-hash 613,830 ns/iter (+/- 8,522) 1.41x
p-hash 408,546 ns/iter (+/- 5,738) 2.12x
a-hash 385,035 ns/iter (+/- 5,215) 2.25x

Discussion

So, what path should we take?

  • a-hash: this introduces a new dependency (bad), gets this library out of the business of hashing (good?), and is the best performing
  • p-hash: this introduces no dependencies (good), keeps this library in the business of hashing (bad?), and performs almost as well
  • sip-hash: this introduces no dependencies (good), gets this library out of the business of hashing (good?), and performs quite a bit worse than a-hash and p-hash, but still much better than id-hash.

What option is best aligned with your goals for this crate? This PR implements sip-hash, but I have commits for the other two as well, and can add them to the PR.

Alternatives Discussion

  • one could rig up a modified version of p-hash where the identifiers are 0, p, 2p, 3p, ... for large prime p, which essentially front-loads the cost of the multiplication to identifier-generation time, which de-duplicates work, but mangles identifiers. I figure that this isn't worth the code complexity, since multiplication is fast.
  • we could change to a different data-structure (e.g., a separate-chaining hash-table, or something fancier). I think that a separate-chaining hash table would be a bad idea.

@AdrienChampion
Copy link
Owner

Awesome PR, loved the pun, thanks 😺

I agree with your analysis and will tell you what I think we should do in a bit. Let me first explain one thing you may find interesting.

The main reason I wrote this trivial hasher is because it is deterministic and will yield the same ordering on any machine, as long as term creation is sequential. In my case, this is quite crucial for support in general and bug fixing in particular. Many algorithms I work on can, sadly, behave very differently if you change the ordering.
Also, the ordering has the property that whenever you iterate on a collection (potential) sub-terms come before their super-terms, which can be useful sometimes.
But yes, I did feel smart because I assumed it was quite fast: I was mostly ignorant of what you just explained. Thanks!

So, I think the most sensible thing to do is to expose the hashers for the factory and the collections. Then we can provide (probably feature-gated) convenience Set/Maps with the hashers you suggest.

What do you think?

@AdrienChampion
Copy link
Owner

There's also the question of what the default should be. I think it must be the trivial hasher as some people might be like me and depend on its properties.

Then hopefully we can do some analytics magic, see which one is the most popular, and make it the default for 2.0.

@alex-ozdemir
Copy link
Contributor Author

Thanks for looking at this so quickly, Adrien!

The main reason I wrote this trivial hasher is because it is deterministic and will yield the same ordering on any machine, as long as term creation is sequential. In my case, this is quite crucial for support in general and bug fixing in particular. Many algorithms I work on can, sadly, behave very differently if you change the ordering.

This makes sense, and is a good reason to avoid sip-hash and a-hash, which both randomize the hash function on a per-table basis.

Also, the ordering has the property that whenever you iterate on a collection (potential) sub-terms come before their super-terms, which can be useful sometimes.

What kind of collection are you referring to? If I understand correctly, Rust's hash table makes no guarantees about iteration order. Does it currently happen to use hash-order? Are you thinking of some other collection that does iteration in hash-order?

So, I think the most sensible thing to do is to expose the hashers for the factory and the collections. Then we can provide (probably feature-gated) convenience Set/Maps with the hashers you suggest.

A pretty reasonable idea. Happy to modify the PR to do this.

There's also the question of what the default should be. I think it must be the trivial hasher as some people might be like me and depend on its properties.

I'm definitely with you on the importance of preserving expected properties. In my mind the necessity of cross-table hash (and iteration order) stability eliminates a-hash (at least in its default configuration---one can configure it to be deterministic) and sip-hash (which, if I understand correctly, can not be de-randomized in its std implementation). p-hash still seems like a reasonable default to me, but perhaps your answer to my above question will convince me otherwise :).

@AdrienChampion
Copy link
Owner

AdrienChampion commented Apr 1, 2021

What kind of collection are you referring to? If I understand correctly, Rust's hash table makes no guarantees about iteration order. Does it currently happen to use hash-order? Are you thinking of some other collection that does iteration in hash-order?

Well generally in this case I would use BTreeMaps/Sets so what I said was not very relevant. (It was late my time, sorry.)

However I seem to recall that HashMaps/Sets did respect the ordering, but might very well be wrong about this.

Forget I said anything :)

Edit: well, actually, even though they would not necessarily respect the ordering, iteration order on HashMaps/Sets should still be deterministic and platform-agnostic. So half my point applies! I think... right?

@AdrienChampion
Copy link
Owner

AdrienChampion commented Apr 1, 2021

Note that it's much better to open collections so that users can decide which hasher they want (and provide feature-gated-or-not defaults), regardless of the discussion on the properties of the trivial hasher I mentioned.
So we should do it!

Manifested as
* a BuildHasher parameter for HConMap/HConSet
   * with `with_hasher` and `with_capacity_and_hasher` functions
* a build_hasher argument to the consign macro
@alex-ozdemir
Copy link
Contributor Author

Edit: well, actually, even though they would not necessarily respect the ordering, iteration order on HashMaps/Sets should still be deterministic and platform-agnostic. So half my point applies! I think... right?

Yep! While this is not promised by the documentation, iteration order is most likely deterministic and platform-agnostic for a fixed version of std and a deterministic BuildHasher implementation. This is because the design of the hash-table module is to push all non-determinism into BuildHasher.

However, between std versions, I believe all bets are off. While I don't know the details of SwissTable's iteration implementation, hash table iteration orders are generally very sensitive to any change in table parameters, implementation etc. Unless the maintainers have committed to cross-version consistent ordering (which I believe they have not, since it is not documented), I would expect it to change sometimes.

Note that it's much better to open collections so that users can decide which hasher they want (and provide feature-gated-or-not defaults), regardless of the discussion on the properties of the trivial hasher I mentioned.

I've modified this PR to expose the hash builder from HConMap, HConSet, and consign!.

I've also changed the default hash to p-hash. If you buy my argument above, changing the default hash to p-hash is fine in a minor hashconsing release, since it only seems to effect iteration order, which may already change between minor std releases. If you'd prefer to keep the original default, we can do that too.

@AdrienChampion
Copy link
Owner

I've also changed the default hash to p-hash. If you buy my argument above, changing the default hash to p-hash is fine in a minor hashconsing release, since it only seems to effect iteration order, which may already change between minor std releases. If you'd prefer to keep the original default, we can do that too.

I think I would prefer something that does not silence the change for unsuspecting users. I was thinking of something like

  • put the hash-collections in a new module colls or hash_coll or something similar, with documentation briefly discussing what we discussed here (probably with a link to this PR);
  • change what's in coll so that it
    • uses the collections with the trivial hasher from the new module,
    • exposes exactly the same API as the current version,
    • is marked as #[deprecated], with a link to the new collection module.

That way, we do not change anything for average users, but we notify them that they should check out the new hash-related module, forcing them to read about the issues you pointed out in this PR. That way they can make an informed decision.

Does this sound reasonable to you? I can definitely help if you need/want me to.

@alex-ozdemir
Copy link
Contributor Author

Does this sound reasonable to you? I can definitely help if you need/want me to.

That sounds like a great plan to me. I've put up a commit attempting to follow this plan, but feel free to make any edits you desire.

* move the configurable collections to hash_coll.
  * keep prime-hash as the default hash
* keep the old coll module.
* deprecate the coll module.
@AdrienChampion AdrienChampion changed the base branch from master to hash April 19, 2021 12:44
@AdrienChampion
Copy link
Owner

Sorry it took me so long to take care of this.

I'm merging your contribution in a hash branch, there's a few things I want to check in particular what the consign macro is doing.

Thank you again!

@AdrienChampion AdrienChampion merged commit 0caaf8c into AdrienChampion:hash Apr 19, 2021
@AdrienChampion
Copy link
Owner

@alex-ozdemir in case you can donate more time to this library, I would love to have your opinion on #9 !

@alex-ozdemir alex-ozdemir deleted the default-hash branch April 21, 2021 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants