`<functional>`: better hash function #2360

AlexGuteniev · 2021-11-26T11:25:15Z

@lhecker observes FNV1a is bad for integers.

https://discord.com/channels/737189251069771789/737189251497721887/913600700846596147

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.

Also tracked by VSO-371547 / AB#371547 .

AlexGuteniev · 2021-11-26T15:46:41Z

#686 (comment) mentions SipHash-2-4

lhecker · 2021-11-26T16:07:20Z

I did some benchmarks based on https://github.com/rurban/smhasher
The test involves putting numbers 0 to 127 into a std::unordered_map<uint32_t, uint32_t>, choosing 128 random keys and calling find() for at least 100ms for each of those and computing the average time per random key. The table below contains the mean/median/stddev between those 128 randomly chosen keys.

Benchmark	Time	CPU
FNV1a mean median stddev	1.660 ns 1.650 ns 0.035 ns	1.670 ns 1.560 ns 0.168 ns
Murmur3F finalizer mean median stddev	1.640 ns 1.480 ns 0.274 ns	1.640 ns 1.400 ns 0.314 ns
XXH3 mean median stddev	2.340 ns 2.040 ns 0.411 ns	2.340 ns 2.090 ns 0.500 ns
SipHash-2-4 mean median stddev	8.810 ns 8.550 ns 0.488 ns	8.820 ns 9.340 ns 1.020 ns
SipHash-1-3 mean median stddev	7.030 ns 6.680 ns 0.524 ns	7.030 ns 6.250 ns 0.980 ns

While SipHash seems promising in preventing hash flood attacks, the actual paper describes it as fast in relation to other MACs and not other PRFs. The claim that it's fast for short messages in general appears to be an overstatement by easily 4x. Considering things like VSO-653642 (complaint about 2x std::hash performance regression) it might be difficult to introduce SipHash into a language like C++.

StephanTLavavej · 2021-12-01T22:43:23Z

Thanks - the SipHash perf data is definitely news to me, so we'll need to consider the available options carefully.

cbezault · 2021-12-01T22:55:52Z

I would consider also looking at SipHash13 rust-lang/rust#29754 (comment)

AlexGuteniev · 2021-12-01T23:01:27Z

@BillyONeal also mentioned a security consideration:

the hash values are not the same in successive runs of the program to make unordered_Xxx safe to use with untrusted input out of the box.

But I'm not sure if this can be implemented with insignificant perf impact.

BillyONeal · 2021-12-01T23:31:36Z

I would argue that the default, user doesn't know any better, option, should be the not exploitable one, even at a performance cost. Moreover, the resulting strength of the hash may help some programs by reducing collisions vs. other hash functions; although it will not for very small keys in relatively small containers like 127 int32s.

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

AlexGuteniev · 2021-12-01T23:51:29Z

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

std::hash is not neccessarily for unordered_map, a proper hash table may use it either

lhecker · 2021-12-02T00:03:20Z

@cbezault For short inputs the performance difference between SipHash-2-4 and 1-3 is minimal. I've updated the table above.

@BillyONeal Good point. I agree that a better, but slower hash function should be preferred as the default one for the stdlib. However SipHash isn't a stronger hash function than alternatives and it's rate of collisions is the same as for any other one that more or less passes the smhasher suite.

But personally I'm still against using SipHash as the default for this STL so far - at least according to what I know at the moment. I believe the absolute majority of code written with this STL are end-user, desktop applications and not publicly exposed web servers. In fact I'd explicitly argue that the number of web servers written in C++ and running on Windows is quite low. While the safety promises of SipHash are nice, it doesn't excuse the increase in computing power consumption for applications that don't require any such safety at all.

Before using a new hash function I believe there's a large number of alternatives we could first consider.
smhasher has useful reference code, results and benchmarks on lower end hardware, which could be used as a point of reference for us.

StephanTLavavej · 2021-12-08T22:44:45Z

I agree that we'll need to explore the increasingly large space of alternatives (when vNext work resumes). On the STL Discord, I observed that C++'s concerns are somewhat different from Python's; notably, we don't use dictionaries as our universal data structure (so the impact of a hash function choice is dramatically smaller), and map is available as an alternative to unordered_map and although map is typically somewhat slower, it also offers absolute immunity against hash-denial-of-service attacks.

AlexGuteniev changed the title ~~<functional>: better has function~~ <functional>: better hash function Nov 26, 2021

StephanTLavavej added performance Must go faster vNext Breaks binary compatibility labels Dec 1, 2021

lhecker mentioned this issue Dec 7, 2021

Introduce til::hasher for sucessive hashing of structs microsoft/terminal#11887

Merged

2 tasks

AlexGuteniev mentioned this issue Jun 9, 2022

Implementation of unsorted_map for strings could be optimized (performance) #2777

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<functional>`: better hash function #2360

`<functional>`: better hash function #2360

AlexGuteniev commented Nov 26, 2021 •

edited by StephanTLavavej

AlexGuteniev commented Nov 26, 2021

lhecker commented Nov 26, 2021 •

edited

StephanTLavavej commented Dec 1, 2021

cbezault commented Dec 1, 2021

AlexGuteniev commented Dec 1, 2021

BillyONeal commented Dec 1, 2021

AlexGuteniev commented Dec 1, 2021

lhecker commented Dec 2, 2021 •

edited

StephanTLavavej commented Dec 8, 2021

<functional>: better hash function #2360

<functional>: better hash function #2360

Comments

AlexGuteniev commented Nov 26, 2021 • edited by StephanTLavavej

AlexGuteniev commented Nov 26, 2021

lhecker commented Nov 26, 2021 • edited

StephanTLavavej commented Dec 1, 2021

cbezault commented Dec 1, 2021

AlexGuteniev commented Dec 1, 2021

BillyONeal commented Dec 1, 2021

AlexGuteniev commented Dec 1, 2021

lhecker commented Dec 2, 2021 • edited

StephanTLavavej commented Dec 8, 2021

`<functional>`: better hash function #2360

`<functional>`: better hash function #2360

AlexGuteniev commented Nov 26, 2021 •

edited by StephanTLavavej

lhecker commented Nov 26, 2021 •

edited

lhecker commented Dec 2, 2021 •

edited