Implement a universal hash function #528

Merged

bosilca merged 3 commits into ICLDisco:master from devreal:universal_hash on Apr 27, 2023

devreal commented on Apr 26, 2023

The rehashing function in the hash table does a poor job of scrambling bits on structured keys. At least in TTG, keys are typically generated from a 2D or 3D index, e.g., by shifting the key dimensions into a 64-bit value. This PR implements a variation of [1] that avoids 128-bit arithmetic by first folding the key into 32 bits. This loses some information about the key, but I expect it to perform better than the original rehash.
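A minimal sketch of the idea, for readers who want to see the moving parts. It assumes an XOR fold of the upper half onto the lower half (the PR does not say which fold is used) and the hypothetical names `fold64_to_32`, `universal_hash`, `a`, and `b`; it is an illustration of the multiply-add-shift form, not the code merged into PaRSEC.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: fold by XOR-ing the upper 32 bits onto the lower 32 bits.
 * The PR does not state which fold PaRSEC uses. */
static inline uint32_t fold64_to_32(uint64_t key)
{
    return (uint32_t)(key ^ (key >> 32));
}

/* Multiply-add-shift hash in the spirit of [1]:
 *   h(x) = ((a * fold(x) + b) mod 2^64) div 2^(64 - nb_bits)
 * a (odd) and b are random 64-bit values drawn once per table; these
 * parameter names are hypothetical. Returns a bucket in [0, 2^nb_bits). */
static inline uint64_t universal_hash(uint64_t key, uint64_t a, uint64_t b,
                                      int nb_bits)
{
    /* Keep the output no wider than the folded 32-bit input. */
    assert(nb_bits >= 1 && nb_bits <= 32);
    uint64_t x = fold64_to_32(key);
    /* Unsigned arithmetic wraps, so this is already reduced mod 2^64. */
    uint64_t h = a * x + b;
    /* div 2^(64 - nb_bits): keep the top bits. High bits of the product
     * depend on all input bits; low bits only on the low input bits. */
    return h >> (64 - nb_bits);
}
```

In such a scheme, `a` and `b` would be drawn once when the table is created (for example from a seeded PRNG), with `a` forced odd.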

This PR also extends the hash benchmark with options to provide an initial seed for random keys and to force 3D key generation, which shifts three counters into a single 64-bit value.
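The 3D key generation can be pictured as follows. This sketch assumes 21 bits per dimension and `k` varying fastest; the benchmark's actual bit widths and loop order are not given in the PR, and `make_key3d` / `gen_keys3d` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 21 bits per dimension, i in the highest bits. */
#define DIM_BITS 21
#define DIM_MASK ((UINT64_C(1) << DIM_BITS) - 1)

/* Shift three counters into one 64-bit key. */
static uint64_t make_key3d(uint64_t i, uint64_t j, uint64_t k)
{
    assert(i <= DIM_MASK && j <= DIM_MASK && k <= DIM_MASK);
    return (i << (2 * DIM_BITS)) | (j << DIM_BITS) | k;
}

/* Generate n keys over a cube of side `side`: k fastest, then j, then i. */
static void gen_keys3d(uint64_t *keys, uint64_t n, uint64_t side)
{
    for (uint64_t idx = 0; idx < n; idx++) {
        uint64_t k = idx % side;
        uint64_t j = (idx / side) % side;
        uint64_t i = idx / (side * side);
        keys[idx] = make_key3d(i, j, k);
    }
}
```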

Some measurements

In all cases, we generate 100k keys, with 5 repetitions and a single thread.
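Each `table ... level N` line in the output reports the number of lists at that level, the shortest and longest list, the average length, and a variance. The numbers look consistent with average = keys / lists (100000 / 32768 = 3.05176) and with a sample variance divided by (lists − 1): 16 lists of length 15 to 17 with an average of exactly 16 must have an integer sum of squared deviations, and the reported 0.133333 is 2/15. The sketch below computes these statistics under that assumption; it is not copied from tests/class/hash.c, and `level_stats` / `list_length_stats` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct level_stats {
    size_t min, max;
    double mean;
    double variance;   /* sample variance: divided by (nlists - 1) */
};

static struct level_stats list_length_stats(const size_t *len, size_t nlists)
{
    assert(nlists >= 2);
    struct level_stats s = { len[0], len[0], 0.0, 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < nlists; i++) {
        if (len[i] < s.min) s.min = len[i];
        if (len[i] > s.max) s.max = len[i];
        sum += (double)len[i];
    }
    s.mean = sum / (double)nlists;
    double ss = 0.0;   /* sum of squared deviations from the mean */
    for (size_t i = 0; i < nlists; i++) {
        double d = (double)len[i] - s.mean;
        ss += d * d;
    }
    s.variance = ss / (double)(nlists - 1);
    return s;
}
```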

Random keys

Current Hash

mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 100000 -s 42
Time to do 100000 insertions on thread 0: 5544000 ns
table 0x1006e9278 level 0: 8192 lists, of length 0 to 16 average length: 5.41272 and variance 5.50551
table 0x1006e9278 level 1: 4096 lists, of length 0 to 17 average length: 6.40479 and variance 6.15015
table 0x1006e9278 level 2: 2048 lists, of length 0 to 17 average length: 7.54199 and variance 7.26155
table 0x1006e9278 level 3: 1024 lists, of length 0 to 17 average length: 6.17383 and variance 5.78403
table 0x1006e9278 level 4: 512 lists, of length 1 to 17 average length: 7.16211 and variance 6.52749
table 0x1006e9278 level 5: 256 lists, of length 0 to 17 average length: 7.38281 and variance 8.37837
table 0x1006e9278 level 6: 128 lists, of length 1 to 17 average length: 7.48437 and variance 7.44857
table 0x1006e9278 level 7: 64 lists, of length 3 to 17 average length: 9 and variance 9.1746
table 0x1006e9278 level 8: 32 lists, of length 3 to 17 average length: 9.59375 and variance 9.02319
table 0x1006e9278 level 9: 16 lists, of length 4 to 17 average length: 10.9375 and variance 19.1292
table 0x1006e9278 level 10: 8 lists, of length 4 to 17 average length: 10.5 and variance 21.4286
Time to do 100000 removals on thread 0: 19465000 ns
Time to do 100000 insertions on thread 0: 5790000 ns
table 0x1006e9278 level 0: 16384 lists, of length 0 to 14 average length: 3.35303 and variance 3.37485
table 0x1006e9278 level 1: 8192 lists, of length 0 to 17 average length: 5.50098 and variance 5.50043
Time to do 100000 removals on thread 0: 13986000 ns
Time to do 100000 insertions on thread 0: 6034000 ns
table 0x1006e9278 level 0: 32768 lists, of length 0 to 4 average length: 0.139374 and variance 0.139057
table 0x1006e9278 level 1: 16384 lists, of length 0 to 17 average length: 5.82477 and variance 5.77354
Time to do 100000 removals on thread 0: 21896000 ns
Time to do 100000 insertions on thread 0: 7226000 ns
table 0x1006e9278 level 0: 32768 lists, of length 0 to 13 average length: 3.05176 and variance 3.03983
Time to do 100000 removals on thread 0: 10903000 ns
Time to do 100000 insertions on thread 0: 6973000 ns
table 0x1006e9278 level 0: 32768 lists, of length 0 to 13 average length: 3.05176 and variance 3.03983
Time to do 100000 removals on thread 0: 11053000 ns
1 threads 21896000 nanosecond max_coll 16 max_table_depth 24

Universal Hash

mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 100000 -s 42
Time to do 100000 insertions on thread 0: 6044000 ns
table 0x10e1bb278 level 0: 16384 lists, of length 0 to 6 average length: 0.625061 and variance 0.624168
table 0x10e1bb278 level 1: 8192 lists, of length 0 to 17 average length: 5.28809 and variance 5.18351
table 0x10e1bb278 level 2: 4096 lists, of length 0 to 17 average length: 4.81738 and variance 4.81304
table 0x10e1bb278 level 3: 2048 lists, of length 0 to 17 average length: 5.35352 and variance 5.16954
table 0x10e1bb278 level 4: 1024 lists, of length 1 to 17 average length: 7.62793 and variance 8.00121
table 0x10e1bb278 level 5: 512 lists, of length 1 to 17 average length: 7.28516 and variance 7.28448
table 0x10e1bb278 level 6: 256 lists, of length 2 to 17 average length: 7.91406 and variance 8.19651
table 0x10e1bb278 level 7: 128 lists, of length 2 to 17 average length: 8.28906 and variance 7.56145
table 0x10e1bb278 level 8: 64 lists, of length 3 to 17 average length: 8.23437 and variance 9.00769
table 0x10e1bb278 level 9: 32 lists, of length 4 to 17 average length: 9.4375 and variance 10.9637
table 0x10e1bb278 level 10: 16 lists, of length 9 to 17 average length: 12.5 and variance 5.2
table 0x10e1bb278 level 11: 8 lists, of length 7 to 17 average length: 10.75 and variance 13.9286
Time to do 100000 removals on thread 0: 20860000 ns
Time to do 100000 insertions on thread 0: 6052000 ns
table 0x10e1bb278 level 0: 32768 lists, of length 0 to 6 average length: 0.546234 and variance 0.544083
table 0x10e1bb278 level 1: 16384 lists, of length 0 to 17 average length: 5.01105 and variance 5.02857
Time to do 100000 removals on thread 0: 19475000 ns
Time to do 100000 insertions on thread 0: 6864000 ns
table 0x10e1bb278 level 0: 32768 lists, of length 0 to 13 average length: 3.05176 and variance 3.04746
Time to do 100000 removals on thread 0: 11192000 ns
Time to do 100000 insertions on thread 0: 6583000 ns
table 0x10e1bb278 level 0: 32768 lists, of length 0 to 13 average length: 3.05176 and variance 3.04746
Time to do 100000 removals on thread 0: 13619000 ns
Time to do 100000 insertions on thread 0: 6863000 ns
table 0x10e1bb278 level 0: 32768 lists, of length 0 to 13 average length: 3.05176 and variance 3.04746
Time to do 100000 removals on thread 0: 10057000 ns
1 threads 20860000 nanosecond max_coll 16 max_table_depth 24

The universal hash function seems to use one more level, although the average list length is roughly comparable. Performance seems slightly lower in some cases, but nothing I would consider concerning. Since the keys are already random, this might be a result of the folding from 64 bits to 32 bits that we apply to the input key. I don't think our keys are ever truly random, though.
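The folding step is lossy by construction: 2^64 keys map onto 2^32 folded values, and two keys whose folded values match land in the same bucket no matter which random parameters the hash draws. Assuming again an XOR fold (not confirmed by the PR), two keys collide exactly when their halves differ by the same 32-bit pattern. The helpers below (`fold_collides`, `colliding_partner`) are hypothetical and only illustrate that property.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: XOR-fold the upper half onto the lower half. */
static uint32_t fold64_to_32(uint64_t key)
{
    return (uint32_t)(key ^ (key >> 32));
}

/* Keys collide after folding iff (hi1 ^ hi2) == (lo1 ^ lo2). */
static bool fold_collides(uint64_t k1, uint64_t k2)
{
    return fold64_to_32(k1) == fold64_to_32(k2);
}

/* Build a different key with the same folded value by flipping the
 * same non-zero bit pattern `delta` in both halves of `key`. */
static uint64_t colliding_partner(uint64_t key, uint32_t delta)
{
    assert(delta != 0);
    return key ^ (((uint64_t)delta << 32) | delta);
}
```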

Structured keys

Instead of a random key generator, here we generate keys in a structured 3D key space. This should be closer to what we see in applications.
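To see why structured keys are hard on a weak bit mixer, consider the extreme case of taking the low bits of the packed key directly. This is a stand-in for illustration, not PaRSEC's previous rehash function. With a hypothetical 21-bit-per-dimension layout, the low 10 bits hold only `k`, so a 16×16×16 cube of keys lands in just 16 of 1024 buckets, with 256 keys each.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIM_BITS 21                 /* ASSUMPTION: bits per dimension */
#define NB_BITS  10                 /* 1024 buckets */
#define NBUCKETS (1u << NB_BITS)

static uint64_t make_key3d(uint64_t i, uint64_t j, uint64_t k)
{
    return (i << (2 * DIM_BITS)) | (j << DIM_BITS) | k;
}

/* Naive stand-in: bucket = low NB_BITS bits of the key. */
static uint32_t low_bits_bucket(uint64_t key)
{
    return (uint32_t)(key & (NBUCKETS - 1));
}

/* Count keys per bucket for a side^3 cube; return the non-empty buckets. */
static unsigned fill_buckets(unsigned side, unsigned counts[NBUCKETS])
{
    assert(side <= (1u << DIM_BITS));
    memset(counts, 0, NBUCKETS * sizeof counts[0]);
    for (uint64_t i = 0; i < side; i++)
        for (uint64_t j = 0; j < side; j++)
            for (uint64_t k = 0; k < side; k++)
                counts[low_bits_bucket(make_key3d(i, j, k))]++;
    unsigned used = 0;
    for (unsigned b = 0; b < NBUCKETS; b++)
        used += counts[b] != 0;
    return used;
}
```

A mixing hash such as multiply-add-shift spreads these keys over far more buckets; how evenly depends on the random multiplier, so no fixed count is claimed here.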

Current Hash

% mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 100000 -3 
Time to do 100000 insertions on thread 0: 5867000 ns
table 0x100637280 level 0: 32768 lists, of length 0 to 12 average length: 0.169708 and variance 1.83774
table 0x100637280 level 1: 16384 lists, of length 0 to 17 average length: 0.130005 and variance 1.64482
table 0x100637280 level 2: 8192 lists, of length 4 to 17 average length: 9.06299 and variance 3.89397
table 0x100637280 level 3: 4096 lists, of length 0 to 17 average length: 0.535645 and variance 6.69421
table 0x100637280 level 4: 2048 lists, of length 0 to 17 average length: 0.696289 and variance 9.91553
table 0x100637280 level 5: 1024 lists, of length 0 to 17 average length: 2.82617 and variance 32.0245
table 0x100637280 level 6: 512 lists, of length 2 to 17 average length: 10.8164 and variance 21.2226
table 0x100637280 level 7: 256 lists, of length 7 to 17 average length: 11.3203 and variance 12.5088
table 0x100637280 level 8: 128 lists, of length 8 to 17 average length: 11.8594 and variance 6.86196
table 0x100637280 level 9: 64 lists, of length 9 to 17 average length: 12.3438 and variance 8.35615
table 0x100637280 level 10: 32 lists, of length 10 to 17 average length: 13.3125 and variance 6.93145
table 0x100637280 level 11: 16 lists, of length 15 to 17 average length: 16 and variance 0.133333
table 0x100637280 level 12: 8 lists, of length 14 to 17 average length: 15.625 and variance 1.125
Time to do 100000 removals on thread 0: 10951000 ns
Time to do 100000 insertions on thread 0: 5113000 ns
table 0x100637280 level 0: 32768 lists, of length 0 to 12 average length: 3.05176 and variance 25.1254
Time to do 100000 removals on thread 0: 6729000 ns
Time to do 100000 insertions on thread 0: 5167000 ns
table 0x100637280 level 0: 32768 lists, of length 0 to 12 average length: 3.05176 and variance 25.1254
Time to do 100000 removals on thread 0: 7266000 ns
Time to do 100000 insertions on thread 0: 5175000 ns
table 0x100637280 level 0: 32768 lists, of length 0 to 12 average length: 3.05176 and variance 25.1254
Time to do 100000 removals on thread 0: 7175000 ns
Time to do 100000 insertions on thread 0: 5054000 ns
table 0x100637280 level 0: 32768 lists, of length 0 to 12 average length: 3.05176 and variance 25.1254
Time to do 100000 removals on thread 0: 7175000 ns
1 threads 10951000 nanosecond max_coll 16 max_table_depth 24

Universal Hash


% mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 100000 -3   
Time to do 100000 insertions on thread 0: 5367000 ns
table 0x10aacc280 level 0: 8192 lists, of length 0 to 1 average length: 0.281006 and variance 0.202066
table 0x10aacc280 level 1: 4096 lists, of length 6 to 17 average length: 11.5959 and variance 3.23719
table 0x10aacc280 level 2: 2048 lists, of length 8 to 17 average length: 12.8018 and variance 2.00856
table 0x10aacc280 level 3: 1024 lists, of length 6 to 17 average length: 10.7041 and variance 5.11373
table 0x10aacc280 level 4: 512 lists, of length 9 to 17 average length: 12.7676 and variance 2.22572
table 0x10aacc280 level 5: 256 lists, of length 9 to 17 average length: 12.4844 and variance 2.20368
table 0x10aacc280 level 6: 128 lists, of length 9 to 17 average length: 12.7812 and variance 2.17224
table 0x10aacc280 level 7: 64 lists, of length 8 to 17 average length: 12.7969 and variance 4.48189
table 0x10aacc280 level 8: 32 lists, of length 12 to 17 average length: 14.5937 and variance 1.60383
table 0x10aacc280 level 9: 16 lists, of length 13 to 17 average length: 15.1875 and variance 1.3625
table 0x10aacc280 level 10: 8 lists, of length 15 to 17 average length: 15.5 and variance 0.571429
Time to do 100000 removals on thread 0: 17301000 ns
Time to do 100000 insertions on thread 0: 5592000 ns
table 0x10aacc280 level 0: 8192 lists, of length 7 to 16 average length: 12.207 and variance 1.70979
Time to do 100000 removals on thread 0: 8188000 ns
Time to do 100000 insertions on thread 0: 5983000 ns
table 0x10aacc280 level 0: 8192 lists, of length 7 to 16 average length: 12.207 and variance 1.70979
Time to do 100000 removals on thread 0: 8387000 ns
Time to do 100000 insertions on thread 0: 5471000 ns
table 0x10aacc280 level 0: 8192 lists, of length 7 to 16 average length: 12.207 and variance 1.70979
Time to do 100000 removals on thread 0: 8097000 ns
Time to do 100000 insertions on thread 0: 5910000 ns
table 0x10aacc280 level 0: 8192 lists, of length 7 to 16 average length: 12.207 and variance 1.70979
Time to do 100000 removals on thread 0: 8338000 ns
1 threads 17301000 nanosecond max_coll 16 max_table_depth 24

Note that the universal hash needs two fewer levels (up to level 10 instead of level 12) and generally has a higher minimum and average list length. Insertion and removal times are naturally slower than before because the lists are longer, i.e., there are more elements to traverse. However, the tighter packing means we are less likely to run into the maximum number of buckets, at which point lists can become really long.
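The level-0 averages in the steady-state repetitions are exactly keys divided by lists, which makes the tighter packing concrete (the variances are copied from the same log lines):

```latex
\[
\bar{L} = \frac{n}{m}:\qquad
\underbrace{\frac{100\,000}{32\,768} \approx 3.05,\ \sigma^2 \approx 25.13}_{\text{current hash}}
\qquad
\underbrace{\frac{100\,000}{8\,192} \approx 12.21,\ \sigma^2 \approx 1.71}_{\text{universal hash}}
\]
```

The universal hash uses a quarter of the lists, so each list is about four times longer on average, but the list lengths are far more uniform.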

Structured 1M Keys

Current Hash

% mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 1000000 -3         
Time to do 1000000 insertions on thread 0: 52854000 ns
table 0x104dd4280 level 0: 131072 lists, of length 0 to 7 average length: 2.36941 and variance 9.03189
table 0x104dd4280 level 1: 65536 lists, of length 9 to 17 average length: 9.78784 and variance 1.95899
table 0x104dd4280 level 2: 32768 lists, of length 0 to 17 average length: 0.0974731 and variance 1.29059
table 0x104dd4280 level 3: 16384 lists, of length 0 to 17 average length: 0.192444 and variance 2.95246
table 0x104dd4280 level 4: 8192 lists, of length 0 to 17 average length: 2.69751 and variance 27.9432
table 0x104dd4280 level 5: 4096 lists, of length 0 to 17 average length: 0.394531 and variance 5.89705
table 0x104dd4280 level 6: 2048 lists, of length 0 to 17 average length: 0.773438 and variance 11.4787
table 0x104dd4280 level 7: 1024 lists, of length 0 to 17 average length: 3.09375 and variance 38.7498
table 0x104dd4280 level 8: 512 lists, of length 8 to 17 average length: 12.3516 and variance 13.7666
table 0x104dd4280 level 9: 256 lists, of length 12 to 17 average length: 13.0703 and variance 1.11661
table 0x104dd4280 level 10: 128 lists, of length 12 to 17 average length: 14.6953 and variance 1.66234
table 0x104dd4280 level 11: 64 lists, of length 9 to 17 average length: 12.6719 and variance 9.17634
table 0x104dd4280 level 12: 32 lists, of length 12 to 17 average length: 13.375 and variance 1.98387
table 0x104dd4280 level 13: 16 lists, of length 15 to 17 average length: 15.8125 and variance 0.295833
table 0x104dd4280 level 14: 8 lists, of length 14 to 17 average length: 15.625 and variance 1.125
Time to do 1000000 removals on thread 0: 89970000 ns
W@-0001 tests/class/hash.c:87 -- Hash table has 17 collisions in bucket 216, but it already spans over 8388608 buckets. Performance might get very bad if more elements continue to stack in this bucket. Consider allowing larger resize with the MCA parameter parsec_hash_table_max_table_nb_bits
Time to do 1000000 insertions on thread 0: 187238000 ns
table 0x104dd4280 level 0: 8388608 lists, of length 0 to 27 average length: 0.00959897 and variance 0.230266
table 0x104dd4280 level 1: 4194304 lists, of length 0 to 17 average length: 0.00133705 and variance 0.0162531
table 0x104dd4280 level 2: 2097152 lists, of length 0 to 17 average length: 0.00151062 and variance 0.0235735
table 0x104dd4280 level 3: 1048576 lists, of length 0 to 17 average length: 0.0115185 and variance 0.171527
table 0x104dd4280 level 4: 524288 lists, of length 0 to 17 average length: 0.0929031 and variance 1.41071
table 0x104dd4280 level 5: 262144 lists, of length 0 to 17 average length: 0.72472 and variance 10.3417
table 0x104dd4280 level 6: 131072 lists, of length 0 to 17 average length: 5.0349 and variance 40.4407
Time to do 1000000 removals on thread 0: 1180434000 ns
Time to do 1000000 insertions on thread 0: 50675000 ns
table 0x104dd4280 level 0: 8388608 lists, of length 0 to 315 average length: 0.119209 and variance 35.397
Time to do 1000000 removals on thread 0: 3181104000 ns
Time to do 1000000 insertions on thread 0: 50988000 ns
table 0x104dd4280 level 0: 8388608 lists, of length 0 to 315 average length: 0.119209 and variance 35.397
Time to do 1000000 removals on thread 0: 3468603000 ns
Time to do 1000000 insertions on thread 0: 53840000 ns
table 0x104dd4280 level 0: 8388608 lists, of length 0 to 315 average length: 0.119209 and variance 35.397
Time to do 1000000 removals on thread 0: 3408925000 ns
1 threads 3468603000 nanosecond max_coll 16 max_table_depth 24

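Note that the current hash function exhausts the default maximum bucket count (8M, i.e., 8388608 buckets), which triggers the warning above. This likely hurts performance: the later removal passes take more than 3 seconds each.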
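A quick calculation from the last repetitions shows that the problem is clustering rather than capacity: at the 2^23-bucket cap, the average list is far shorter than one element, yet the longest list holds 315.

```latex
\[
\bar{L} = \frac{10^6}{2^{23}} = \frac{1\,000\,000}{8\,388\,608} \approx 0.119
\qquad\text{vs.}\qquad
L_{\max} = 315 \approx 2640\,\bar{L}
\]
```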

Universal Hash

mpirun -n 1 ./tests/class/hash -p -c 1 -r 5 -# 1000000 -3 
Time to do 1000000 insertions on thread 0: 60520000 ns
table 0x108608280 level 0: 65536 lists, of length 0 to 7 average length: 3.9245 and variance 0.982993
table 0x108608280 level 1: 32768 lists, of length 8 to 17 average length: 11.9218 and variance 1.42783
table 0x108608280 level 2: 16384 lists, of length 7 to 17 average length: 10.8092 and variance 1.87948
table 0x108608280 level 3: 8192 lists, of length 6 to 17 average length: 10.6249 and variance 2.90126
table 0x108608280 level 4: 4096 lists, of length 6 to 17 average length: 10.8027 and variance 3.28025
table 0x108608280 level 5: 2048 lists, of length 5 to 17 average length: 10.0566 and variance 4.39542
table 0x108608280 level 6: 1024 lists, of length 4 to 17 average length: 9.55957 and variance 4.2291
table 0x108608280 level 7: 512 lists, of length 9 to 17 average length: 13.1641 and variance 1.66775
table 0x108608280 level 8: 256 lists, of length 10 to 17 average length: 12.8633 and variance 2.05574
table 0x108608280 level 9: 128 lists, of length 9 to 17 average length: 13.2031 and variance 3.18676
table 0x108608280 level 10: 64 lists, of length 8 to 17 average length: 13.2969 and variance 4.81523
table 0x108608280 level 11: 32 lists, of length 11 to 17 average length: 14 and variance 2.51613
table 0x108608280 level 12: 16 lists, of length 13 to 17 average length: 15.125 and variance 1.05
table 0x108608280 level 13: 8 lists, of length 13 to 17 average length: 14.5 and variance 2.57143
Time to do 1000000 removals on thread 0: 276213000 ns
Time to do 1000000 insertions on thread 0: 66114000 ns
table 0x108608280 level 0: 131072 lists, of length 0 to 4 average length: 1.67895 and variance 0.480784
table 0x108608280 level 1: 65536 lists, of length 7 to 17 average length: 11.9009 and variance 1.82469
Time to do 1000000 removals on thread 0: 169802000 ns
Time to do 1000000 insertions on thread 0: 65822000 ns
table 0x108608280 level 0: 131072 lists, of length 4 to 12 average length: 7.62939 and variance 1.28913
Time to do 1000000 removals on thread 0: 112347000 ns
Time to do 1000000 insertions on thread 0: 66539000 ns
table 0x108608280 level 0: 131072 lists, of length 4 to 12 average length: 7.62939 and variance 1.28913
Time to do 1000000 removals on thread 0: 111142000 ns
Time to do 1000000 insertions on thread 0: 65047000 ns
table 0x108608280 level 0: 131072 lists, of length 4 to 12 average length: 7.62939 and variance 1.28913
Time to do 1000000 removals on thread 0: 112833000 ns
1 threads 276213000 nanosecond max_coll 16 max_table_depth 24

No excessive collisions here, and decent average length.

[1] https://en.wikipedia.org/wiki/Universal_hashing#Avoiding_modular_arithmetic

devreal added 2 commits April 21, 2023 11:40
A slight variation of https://en.wikipedia.org/wiki/Universal_hashing#Avoiding_modular_arithmetic
without the need for 128bit arithmetic. The 64bit hash is
folded into 32bit, expanded into 64bit using two random numbers
and compressed into nb_bits using fast power-of-2 modulo and div.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
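Written out, the commit message describes the map below, where fold is the 64-to-32-bit fold, a and b are the two random 64-bit numbers, and M is `nb_bits` (how the fold is done and how a and b are drawn are not specified in the commit message). Both reductions are free on 64-bit unsigned integers: mod 2^64 is ordinary wraparound and div 2^(64−M) is a right shift. Keeping the input at 32 bits is what lets a 64-bit accumulator suffice; applying the scheme in [1] to the full 64-bit key would need a wider, 128-bit product.

```latex
\[
h_{a,b}(x) = \Big( \big( a \cdot \mathrm{fold}(x) + b \big) \bmod 2^{64} \Big)
             \operatorname{div} 2^{64-M}
\]
```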
devreal requested a review from a team as a code owner April 26, 2023 03:02

therault left a comment

LGTM. There is a line I think we can remove (see comment below), but otherwise, I think it should be merged.

Review comment on tests/class/hash.c (outdated)

bosilca commented on Apr 26, 2023

Removals are consistently slower, and not marginally so. This is counter-intuitive, as the average length of the buckets decreases, and so does the number of collisions. It would seem natural for removal (and insertion) times to decrease, so why is that not the case?

devreal commented on Apr 26, 2023

For random keys, the difference is marginal: sometimes faster, sometimes slower. For 100k structured keys, the universal hash is slower because buckets are more tightly packed, so there are longer lists to traverse on average. For 1M structured keys, the universal hash is an order of magnitude faster in removals because the lists grow longer with the current hash (up to 27 elements, and 6 more levels than with the universal hash).
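A rough cost model is consistent with this explanation. Assume finding the p-th element of a list costs p comparisons and each key is looked up once. Summing over m lists of lengths L_i holding n = m·L̄ keys, the average cost depends on the variance of the list lengths, not only on their mean, so a lower average length alone does not guarantee faster removals. Plugging in the steady-state level-0 statistics from the logs (treating the reported variance as σ², which matters little for thousands of lists):

```latex
\[
C = \frac{1}{n}\sum_{i=1}^{m}\frac{L_i(L_i+1)}{2}
  = \frac{1}{2}\left(\frac{\sigma^2 + \bar{L}^2}{\bar{L}} + 1\right)
\]
\[
\begin{array}{lccc}
                       & \bar{L} & \sigma^2 & C \\
\text{100k, current}   & 3.052   & 25.13    & \approx 6.14 \\
\text{100k, universal} & 12.21   & 1.71     & \approx 6.67 \\
\text{1M, current}     & 0.119   & 35.40    & \approx 149 \\
\text{1M, universal}   & 7.629   & 1.29     & \approx 4.40
\end{array}
\]
```

The model predicts roughly 9% more work per lookup with the universal hash for 100k structured keys and roughly 34 times less for 1M keys. The measured steady-state removal times move in the same direction (about 7 ms vs. 8 ms, and about 3.2 to 3.5 s vs. 0.11 s), although the model ignores caching and the way removals shorten the lists as they proceed.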

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
bosilca merged commit 6ed3dab into ICLDisco:master on Apr 27, 2023