Change default hash function for consigned structure collections. #8

alex-ozdemir · 2021-03-31T21:37:18Z

It turns out that the collections provided in this crate perform somewhat poorly, because of an unfortunate collision (forgive my pun) between Rust's hash-table implementation and this crate's custom hash function. I suggest we fix this. Details follow.

Background

This library ascribes unique 64 bit identifiers to all consigned structures. These identifiers are generally consecutive: 0, 1, 2, ... and so on. The coll module exposes hash maps from (and hash sets of) consigned structures. The module uses the Rust standard library's hash-table, but to reduce hashing costs the table is configured to use the raw identifiers as hashes. The idea (probably) is that since the identifiers are unique, they are a perfectly collision-resistant hash, which preserve hash-table performance.

Problem

The problem is that the Rust's hash-table uses linear-probing, which relies on more than just collision-resistance for good performance. While incrementing identifiers are collision resistant, they are also adjacent: systems that use them tend to produce sets of elements whose identifiers are consecutive. This is a particularly bad case of a linear-probing table. Items with consecutive identifiers end up being adjacent in the hash table; when a collision does occur (because of wrapping), long linear scans are required to get past the consecutive occupied table entries.

Solutions

We should probably modify the coll module to use a hash which performs better on increasing identifies in a linear-probing table. If we call the current system id-hash ("identity hash"), the alternatives are:

sip-hash ("Sip hash"): use std's default hash implementation
p-hash ("prime hash"): hash identifiers by multiplying them by a 64-bit prime, ensuring good distribution when wrapped by a power of two.
a-hash ("ahash"): use the ahash crate.

The first commit of this PR adds a benchmark for which the hash table is the bottleneck (post-order DAG traversal with a visit set). I used this benchmark to measure the cost of traversing a random 10,000 sub-term DAG (sub-terms counted as if the DAG were a tree), under each of the aforementioned schemes. I can give machine details if you'd like.

Scheme	Traversal Time (ns)	Speedup
`id-hash`	866,761 ns/iter (+/- 20,670)	1.00x
`sip-hash`	613,830 ns/iter (+/- 8,522)	1.41x
`p-hash`	408,546 ns/iter (+/- 5,738)	2.12x
`a-hash`	385,035 ns/iter (+/- 5,215)	2.25x

Discussion

So, what path should we take?

a-hash: this introduces a new dependency (bad), gets this library out of the business of hashing (good?), and is the best performing
p-hash: this introduces no dependencies (good), keeps this library in the business of hashing (bad?), and performs almost as well
sip-hash: this introduces no dependencies (good), gets this library out of the business of hashing (good?), and performs quite a bit worse than a-hash and p-hash, but still much better than id-hash.

What option is best aligned with your goals for this crate? This PR implements sip-hash, but I have commits for the other two as well, and can add them to the PR.

Alternatives Discussion

one could rig up a modified version of p-hash where the identifiers are 0, p, 2p, 3p, ... for large prime p, which essentially front-loads the cost of the multiplication to identifier-generation time, which de-duplicates work, but mangles identifiers. I figure that this isn't worth the code complexity, since multiplication is fast.
we could change to a different data-structure (e.g., a separate-chaining hash-table, or something fancier). I think that a separate-chaining hash table would be a bad idea.

AdrienChampion · 2021-03-31T22:49:35Z

Awesome PR, loved the pun, thanks 😺

I agree with your analysis and will tell you what I think we should do in a bit. Let me first explain one thing you may find interesting.

The main reason I wrote this trivial hasher is because it is deterministic and will yield the same ordering on any machine, as long as term creation is sequential. In my case, this is quite crucial for support in general and bug fixing in particular. Many algorithms I work on can, sadly, behave very differently if you change the ordering.
Also, the ordering has the property that whenever you iterate on a collection (potential) sub-terms come before their super-terms, which can be useful sometimes.
But yes, I did feel smart because I assumed it was quite fast: I was mostly ignorant of what you just explained. Thanks!

So, I think the most sensible thing to do is to expose the hashers for the factory and the collections. Then we can provide (probably feature-gated) convenience Set/Maps with the hashers you suggest.

What do you think?

AdrienChampion · 2021-03-31T22:54:05Z

There's also the question of what the default should be. I think it must be the trivial hasher as some people might be like me and depend on its properties.

Then hopefully we can do some analytics magic, see which one is the most popular, and make it the default for 2.0.

alex-ozdemir · 2021-04-01T05:18:51Z

Thanks for looking at this so quickly, Adrien!

The main reason I wrote this trivial hasher is because it is deterministic and will yield the same ordering on any machine, as long as term creation is sequential. In my case, this is quite crucial for support in general and bug fixing in particular. Many algorithms I work on can, sadly, behave very differently if you change the ordering.

This makes sense, and is a good reason to avoid sip-hash and a-hash, which both randomize the hash function on a per-table basis.

Also, the ordering has the property that whenever you iterate on a collection (potential) sub-terms come before their super-terms, which can be useful sometimes.

What kind of collection are you referring to? If I understand correctly, Rust's hash table makes no guarantees about iteration order. Does it currently happen to use hash-order? Are you thinking of some other collection that does iteration in hash-order?

So, I think the most sensible thing to do is to expose the hashers for the factory and the collections. Then we can provide (probably feature-gated) convenience Set/Maps with the hashers you suggest.

A pretty reasonable idea. Happy to modify the PR to do this.

There's also the question of what the default should be. I think it must be the trivial hasher as some people might be like me and depend on its properties.

I'm definitely with you on the importance of preserving expected properties. In my mind the necessity of cross-table hash (and iteration order) stability eliminates a-hash (at least in its default configuration---one can configure it to be deterministic) and sip-hash (which, if I understand correctly, can not be de-randomized in its std implementation). p-hash still seems like a reasonable default to me, but perhaps your answer to my above question will convince me otherwise :).

AdrienChampion · 2021-04-01T16:22:08Z

What kind of collection are you referring to? If I understand correctly, Rust's hash table makes no guarantees about iteration order. Does it currently happen to use hash-order? Are you thinking of some other collection that does iteration in hash-order?

Well generally in this case I would use BTreeMaps/Sets so what I said was not very relevant. (It was late my time, sorry.)

However I seem to recall that HashMaps/Sets did respect the ordering, but might very well be wrong about this.

Forget I said anything :)

Edit: well, actually, even though they would not necessarily respect the ordering, iteration order on HashMaps/Sets should still be deterministic and platform-agnostic. So half my point applies! I think... right?

AdrienChampion · 2021-04-01T16:29:36Z

Note that it's much better to open collections so that users can decide which hasher they want (and provide feature-gated-or-not defaults), regardless of the discussion on the properties of the trivial hasher I mentioned.
So we should do it!

Manifested as * a BuildHasher parameter for HConMap/HConSet * with `with_hasher` and `with_capacity_and_hasher` functions * a build_hasher argument to the consign macro

Should have tested, not built the docs...

alex-ozdemir · 2021-04-02T15:44:01Z

Edit: well, actually, even though they would not necessarily respect the ordering, iteration order on HashMaps/Sets should still be deterministic and platform-agnostic. So half my point applies! I think... right?

Yep! While this is not promised by the documentation, iteration order is most likely deterministic and platform-agnostic for a fixed version of std and a deterministic BuildHasher implementation. This is because the design of the hash-table module is to push all non-determinism into BuildHasher.

However, between std versions, I believe all bets are off. While I don't know the details of SwissTable's iteration implementation, hash table iteration orders are generally very sensitive to any change in table parameters, implementation etc. Unless the maintainers have committed to cross-version consistent ordering (which I believe they have not, since it is not documented), I would expect it to change sometimes.

Note that it's much better to open collections so that users can decide which hasher they want (and provide feature-gated-or-not defaults), regardless of the discussion on the properties of the trivial hasher I mentioned.

I've modified this PR to expose the hash builder from HConMap, HConSet, and consign!.

I've also changed the default hash to p-hash. If you buy my argument above, changing the default hash to p-hash is fine in a minor hashconsing release, since it only seems to effect iteration order, which may already change between minor std releases. If you'd prefer to keep the original default, we can do that too.

AdrienChampion · 2021-04-05T07:49:40Z

I've also changed the default hash to p-hash. If you buy my argument above, changing the default hash to p-hash is fine in a minor hashconsing release, since it only seems to effect iteration order, which may already change between minor std releases. If you'd prefer to keep the original default, we can do that too.

I think I would prefer something that does not silence the change for unsuspecting users. I was thinking of something like

put the hash-collections in a new module colls or hash_coll or something similar, with documentation briefly discussing what we discussed here (probably with a link to this PR);
change what's in coll so that it
- uses the collections with the trivial hasher from the new module,
- exposes exactly the same API as the current version,
- is marked as #[deprecated], with a link to the new collection module.

That way, we do not change anything for average users, but we notify them that they should check out the new hash-related module, forcing them to read about the issues you pointed out in this PR. That way they can make an informed decision.

Does this sound reasonable to you? I can definitely help if you need/want me to.

alex-ozdemir · 2021-04-06T16:35:32Z

Does this sound reasonable to you? I can definitely help if you need/want me to.

That sounds like a great plan to me. I've put up a commit attempting to follow this plan, but feel free to make any edits you desire.

* move the configurable collections to hash_coll. * keep prime-hash as the default hash * keep the old coll module. * deprecate the coll module.

AdrienChampion · 2021-04-19T12:45:09Z

Sorry it took me so long to take care of this.

I'm merging your contribution in a hash branch, there's a few things I want to check in particular what the consign macro is doing.

Thank you again!

AdrienChampion · 2021-04-19T15:10:47Z

@alex-ozdemir in case you can donate more time to this library, I would love to have your opinion on #9 !

alex-ozdemir added 2 commits March 31, 2021 12:10

Add traversal benchmark

4f33484

finalize hash with multiplication

49f1bd4

Expose optional hash builder.

f246b54

Manifested as * a BuildHasher parameter for HConMap/HConSet * with `with_hasher` and `with_capacity_and_hasher` functions * a build_hasher argument to the consign macro

alex-ozdemir force-pushed the default-hash branch from a05676a to f246b54 Compare April 2, 2021 15:22

alex-ozdemir added 3 commits April 2, 2021 08:29

typo: should have tested...

b8b4f0a

Add custom hash to HConsign.

347e40c

Should have tested, not built the docs...

cargo fmt

bbb3385

Move new collections to new mod. Deprecate old.

17bb9f1

* move the configurable collections to hash_coll. * keep prime-hash as the default hash * keep the old coll module. * deprecate the coll module.

alex-ozdemir force-pushed the default-hash branch from 902833d to 17bb9f1 Compare April 6, 2021 17:20

AdrienChampion changed the base branch from master to hash April 19, 2021 12:44

AdrienChampion merged commit 0caaf8c into AdrienChampion:hash Apr 19, 2021

AdrienChampion mentioned this pull request Apr 19, 2021

Improve Efficiency for Hash-Collections #9

Merged

alex-ozdemir deleted the default-hash branch April 21, 2021 18:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change default hash function for consigned structure collections. #8

Change default hash function for consigned structure collections. #8

alex-ozdemir commented Mar 31, 2021 •

edited

AdrienChampion commented Mar 31, 2021

AdrienChampion commented Mar 31, 2021

alex-ozdemir commented Apr 1, 2021

AdrienChampion commented Apr 1, 2021 •

edited

AdrienChampion commented Apr 1, 2021 •

edited

alex-ozdemir commented Apr 2, 2021

AdrienChampion commented Apr 5, 2021

alex-ozdemir commented Apr 6, 2021

AdrienChampion commented Apr 19, 2021

AdrienChampion commented Apr 19, 2021

Change default hash function for consigned structure collections. #8

Change default hash function for consigned structure collections. #8

Conversation

alex-ozdemir commented Mar 31, 2021 • edited

Background

Problem

Solutions

Discussion

Alternatives Discussion

AdrienChampion commented Mar 31, 2021

AdrienChampion commented Mar 31, 2021

alex-ozdemir commented Apr 1, 2021

AdrienChampion commented Apr 1, 2021 • edited

AdrienChampion commented Apr 1, 2021 • edited

alex-ozdemir commented Apr 2, 2021

AdrienChampion commented Apr 5, 2021

alex-ozdemir commented Apr 6, 2021

AdrienChampion commented Apr 19, 2021

AdrienChampion commented Apr 19, 2021

alex-ozdemir commented Mar 31, 2021 •

edited

AdrienChampion commented Apr 1, 2021 •

edited

AdrienChampion commented Apr 1, 2021 •

edited