Reduce string hash collisions #168

bungle · 2016-04-28T10:10:33Z

I started this discussion first on LuaJIT mailing list:
http://www.freelists.org/post/luajit/Slowness-due-to-memory-allocation-related-problems

This was also discussed on Hacker News a few years ago:
https://news.ycombinator.com/item?id=3401900

Since then it looks like many of the programming languages have fixed this issue using e.g. randomization. PUC Lua 5.3.2 also seems to have fix for this (don't know exactly when it was fixed there, though). LuaJIT has no fix for this, but there are at least a few unofficial fixes that DO work:

How to reproduce:

Download this: https://github.com/bungle/lua-resty-template
Run luajit -e 'require "resty.template.microbenchmark".run(1000)' (in lib dir)
Run luajit -e 'require "resty.template.microbenchmark".run(10000)' (in lib dir)

Now compare the results with and without the patches mentioned above. Without patches the 3. will be about 10x slower than the patched run on the same test (you can try also luajit -e 'require "resty.template.microbenchmark".run(100000)' – 1000000 will start emitting out of memory when 2 GB limit is hit if you have not patched LuaJIT with some bigger memory space patches).

The text was updated successfully, but these errors were encountered:

DemiMarie · 2016-05-02T21:13:03Z

The best strong hash function is SipHash. An optimized implementation that runs quite quickly can be found here. SipHash is considered to be cryptographically secure, so it avoids all hash collision attacks provided that the 128-bit key is chosen at random and kept secret.

If SipHash is too slow (it might be) another option is to switch the layout of tables to use a balanced binary tree to handle collisions. This mostly mitigates the problem since the degradation is only to O(log n).

funny-falcon · 2016-05-04T12:38:22Z

SipHash is too slow (for LuaJIT).

Look to the page you've linked: every 7byte string you push to Lua will consume 135 cycle with SSE2 on 32bit platform and 61 cycle on 64bit platform - it is huge amount of CPU time.

So it should be smarter. In #174 I'm proposing "smart hashing":

use current hash most of time
fallback to "harder" hash on collisions

With this scheme, SipHash could be used as fallback function. In fact, I've add an option to use 32bit cousin to SipHash, cause it is faster on 32bit platform, and I wish not use SSE.

But even with this "fallback" approach, it is measurable performance hit to use "strong cryptographic" function instead of "fast and dump" simple whole string hash. And larger function is, larger the hit is.

Perhaps, hardware assisted functions (using aesenc hardware instruction) may give good quality function with affordable performance. But, given Mike doesn't want to use platform dependent functions in this place, and even multiplication is forbidden, it will be hard to push it into code base.

And do not forget, when you will suggest SipHash as a function for hash table next time:

SipHash13 is just enough for this use case, Jean-Philippe Aumasson (author of SipHash) confirms that
SipHash24 is recommended for MAC - Message Authentication Code - ie when you sign a message, and attacker may see this sign.

MikePall · 2016-05-06T11:56:42Z

The discussion mixes two different issues:

Cryptographically secure hashing.
Improving the string hash quality.

In detail:

Cryptographically secure hashing is an explicit non-goal.
1. It's simply unattainable for the core data structure of LuaJIT.
2. A weakened variant is pointless, since it can still be exploited.
3. Corollary: if you're operating in a context where maliciously crafted data might affect your service, then make sure this doesn't end up as an interned string.
Improving the hash quality may be a potentially interesting goal.
1. But it has to be weighed against the resulting performance impact. Changing a sparse, near constant-time hash to a dense, linear-time hash is a drastic step.
2. NB: hashing has to be deterministic, due to a couple of internal shortcuts.
3. To evaluate such a change, it's not helpful and likely misleading to use benchmarks that only employ synthetic strings. It's easy to construct collisions for any non-cryptographically secure hash. But that doesn't say anything about the impact on the average case or the likeliness of the worst case (still) occuring.
There are more options:
1. Don't intern long strings. This has wider reaching consequences for the whole VM and is a non-trivial change.
2. Use a different data structure, such as a parallel binary search. Needs very careful evaluation.
3. Combine both options.

daurnimator · 2016-05-06T11:59:19Z

Don't intern long strings. This has wider reaching consequences for the whole VM and is a non-trivial change.

This option is notable as PUC-Rio lua has taken this route.

funny-falcon · 2016-05-06T13:46:26Z

@MikePall look at #174 - it keeps using current sparse hash until long collision chain is detected. So that, usual workload is not affected.

When long collision chain is detected, only strings that are not covered by sparse hash (ie longer than 12bytes) and only for this collisioned chain are created with full hash (for default case).

"internal shortcuts" are handled by computed "sparse hash" for a string, that has "full hash" computed. It looks like, all "shortcuts" are for small strings, so that, their performance is not affected.

Benchmark shows no effect on benchmark with no "bad" strings (case 1: unpatched 0.196ms , patched 0.192ms - statistically same) (it could be even faster cause of first commit - compare hashsum while traversing collision chain).

As a bonus, last commit adds option "SMART_STRINGS=2": strong hash function used for all strings that fall in long collision chain - it will cover all paranoid needs. Used function is as resistant to "Seed Independent Collisions" as SipHash, but it is not crypto-safe, cause uses less state and produce just 32bit output. It is slower than SipHash on 64bit platform, but faster on 32bit platform.

DemiMarie · 2016-05-27T23:17:38Z

A couple suggestions:

Instead of a weakened SipHash, just use SipHash-1-3 and truncate the output if necessary. Crypto algorithms are brittle – you can't change them and expect any guarantees unless you are a cryptographer yourself.
The table I am most worried about is the string intern table. While it might be okay for LuaJIT's ordinary tables to not be resistant to HashDOS, if the intern table is vulnerable then all strings must be stored in FFI data structures. That means that – unless one is willing to risk bugs leading to arbitrary code execution due to violations of memory safety – one must pay an extra indirection (to keep the FFI data structures in the metatable hidden), and one loses compatibility with most libraries.
Judy arrays would be a good choice. Unfortunately the only implementation is LGPL and it doesn't allow strings with NUL bytes in them.

funny-falcon · 2016-05-28T10:35:10Z

SipHash has awful performance on 32bit platform. This 32bit version has all SipHash's properties valuable for hash table, and it has same perfornance either on 32bit or 64bit.

DemiMarie · 2016-06-26T22:15:09Z

@funny-falcon Yes, SipHash is slow on 32 bit. The solution is to write it using SIMD instructions, either using compiler intrinsics or just placing it in the assembler part of LuaJIT.

The reason why a 32-bit version of SipHash doesn't have SipHash's problems is that one does not get as much nonlinearity, which is critical for security. The best 32-bit crypto hash function is Chaskey, which is not really any faster than SipHash.

The best solution is to use SipHash-1-3, but to avoid computing hashes at run time as much as possible. If a table key is a literal in the source code, its hash should be computed during bytecode compilation, not after. Neither x.a = b, y = x.c, nor x['alpha'] = b, nor q = x['beta'] should ever cause a hash recomputation at runtime. String interning should be lazy: a string's hash code should not be computed until it is used as a hashtable key and/or compared to another string – and, in the latter case, the strings should be compared first; if they are equal, only one needs to be hashed.

Saying "don't let untrusted data in an interned string" is a very bad idea, since it means that the obvious code (that almost everyone writing Lua code will use) is vulnerable to a denial of service attack. In fact, it goes even further: it means that one must put all untrusted data (which, in a web app, is pretty much everything not in the source code itself) in FFI datastructures. That means manual memory management and no bounds checks. This is very bad. Security must always be the default.

One alternative is to switch to hash tables backed by trees or tries. This prevents the quadratic time DoS attack, since the degradation is only to O(n log n).

@MikePall If hash tables cannot be randomized, then it is game over anyway – finding collisions ahead of time is trivial. Unless you want to use 160-bit or beyond hash codes.

For a universal hash function, it is probably better to use a fresh key for every single table. The stream cipher ChaCha20 is a sufficiently fast PRNG.

Finally, note that none of this is needed for hashing pointers, at least ones that are not numbers. Since they represent machine addresses, they can be assumed to not be under an attacker's control. Therefore, a very simple hash function can be used that is very fast.

funny-falcon · 2016-06-27T06:38:31Z

The reason why a 32-bit version of SipHash doesn't have SipHash's problems is that one does not get as much nonlinearity, which is critical for security.

@DemiMarie But SipHash's author (Jean-Philippe Aumasson) doesn't agree with you. He says, 32bit version actually mixes bits faster than 64bit version so gives better non-linearity. The single reason why 32bit version is not "secure" is smaller state and 32bit result. That is what he said.

Don't mix crypto-security and hash tables, please. Hash tables doesn't need cryptographically secure functions. They need protection from hash flood (seed independent collisions). Cryptographically secure hash provides protection from hash flood, but it is really overkill.

All your other suggestions are huge modifications. You are free to make a pull request. Lets write meaningfull code, not meaningless words. Ok?

DemiMarie · 2016-06-28T04:55:13Z

@funny-falcon sorry!

DemiMarie · 2016-08-16T02:13:12Z

Here are two concrete proposals. Both keep the current data structure in the fast path, while falling back to a slower (but more robust) alternative in the slow path.

Instead of using linear probing to back the hash table, use a tree. This solves most of the DoS problem – a slowdown to O(n log(n)) is likely to be unexploitable.
Use a 2 level hash, with a (cryptographically) strong hash function as the backup hash. This solves the problem entirely.

strrchr · 2016-08-19T10:02:24Z

I found some hard coded hash values in lj_cparse.c and lib_ffi.c.
If I change the hash algorithm for string interning, I must change these hard coded hash values too.
Horrible!
I suggest use strcmp, simple!
Then the string hash algorithm will be simply configured.

funny-falcon · 2016-08-19T10:58:57Z

My patch has work around ( see #174 )

funny-falcon · 2016-08-19T11:00:48Z

@DemiMarie
One proposal has implementation and bechmark ( #174 ).
Other is just theoretical suggestion.

DemiMarie · 2016-08-19T18:13:17Z

I like your proposal and hope that it will be merged.

On Aug 19, 2016 7:00 AM, "Sokolov Yura" notifications@github.com wrote:

@DemiMarie https://github.com/DemiMarie
One proposal has implementation and bechmark ( #174
#174 ).
Other is just theoretical suggestion.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#168 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AGGWBwqZR_bFATLgc9rxcvpVEbmQ_cPWks5qhYzmgaJpZM4IRxfJ
.

corsix · 2016-08-19T22:31:05Z

An alternative workaround would be to extend GG_State to include all of
these strings, and then identify said strings based on their offset
relative to g rather than based on their (fast) hash value. The same
inclusion-in-GG_State/offset trick could be used for lexically reserved
identifiers instead of the current GCstr::reserved field, which could be
beneficial in the context of the New GC (as abolishing the GCHeader would
leave no convenient space for the GCstr::reserved field, and inclusion in
the GG_State block would be a nice alternative to the LJ_GC_FIXED flag).

On Fri, Aug 19, 2016 at 11:58 AM, Sokolov Yura notifications@github.com
wrote:

My patch has work around.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#168 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAAGhknUo4HKaIQtizAQMzcvJ3BgGYE1ks5qhYxxgaJpZM4IRxfJ
.

DemiMarie · 2016-08-20T03:07:48Z

Do you (or anyone else) plan on implementing the new GC?

On Aug 19, 2016 6:31 PM, "Peter Cawley" notifications@github.com wrote:

An alternative workaround would be to extend GG_State to include all of
these strings, and then identify said strings based on their offset
relative to g rather than based on their (fast) hash value. The same
inclusion-in-GG_State/offset trick could be used for lexically reserved
identifiers instead of the current GCstr::reserved field, which could be
beneficial in the context of the New GC (as abolishing the GCHeader would
leave no convenient space for the GCstr::reserved field, and inclusion in
the GG_State block would be a nice alternative to the LJ_GC_FIXED flag).

On Fri, Aug 19, 2016 at 11:58 AM, Sokolov Yura notifications@github.com
wrote:

My patch has work around.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#168 (comment),
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/
AAAGhknUo4HKaIQtizAQMzcvJ3BgGYE1ks5qhYxxgaJpZM4IRxfJ>
.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#168 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AGGWB75esqLDvTuabT_i_O3hd27YLtPfks5qhi6vgaJpZM4IRxfJ
.

fsfod · 2016-08-20T05:32:56Z

Maybe someone 😉 has already working partial implementation of the new GC. That also allocates the GG_State in the arena and was thinking about the a similar idea to Corsix's for built-in strings allocating them after it you could use the strings cellId in place of reserved id.
I'm not the biggest fan of losing the gct field from objects though because it will no longer be possible to enumerate the objects in an arena based on block bits and for fixed strings the ctype system also creates them you would have to keep them all in a big hash table instead of a dense cellId list per arena.

DemiMarie · 2016-08-21T16:55:12Z

@fsfod are you that someone 😉 ?

thegrb93 · 2018-07-09T22:05:00Z

Apparently HashDOS is being used against luajit powered games now? Facepunch/garrysmod-issues#3526

gonzalezjo · 2018-07-09T22:08:04Z

Hey, that's me. Yeah, here was the code that I used to generate collisions: https://github.com/gonzalezjo/ljhashdos/blob/master/gen.c

MikePall · 2020-06-23T01:39:01Z

Fixed.

bungle mentioned this issue Apr 28, 2016

"different template cached" is more slowy than "different template, different context cached" bungle/lua-resty-template#18

Closed

funny-falcon mentioned this issue Apr 28, 2016

Simple string hash for large strings #169

Closed

bungle mentioned this issue May 1, 2016

performance tests between lua-resty-template and others bungle/lua-resty-template#3

Closed

funny-falcon mentioned this issue May 4, 2016

Smart string hashing #174

Closed

MikePall added the enhancement label May 6, 2016

MikePall changed the title ~~String hash collision~~ Reduce string hash collisions May 6, 2016

MikePall mentioned this issue Sep 1, 2016

Will LUAI_MAXSHORTLEN in Lua-5.3 be supported in LuaJIT? #210

Closed

yanxurui mentioned this issue Nov 10, 2017

feature: support pass table to send_query openresty/lua-resty-mysql#68

Open

agentzh mentioned this issue Apr 25, 2018

process exit with code 132 openresty/docker-openresty#39

Closed

thegrb93 mentioned this issue Jul 9, 2018

Crash the server w/a malformed table Facepunch/garrysmod-issues#3526

Open

daurnimator mentioned this issue Jun 13, 2019

Add support for Timing attack safe string comparison wahern/luaossl#170

Closed

MikePall mentioned this issue Feb 2, 2020

Pathological string performance #555

Closed

MikePall closed this as completed Jun 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce string hash collisions #168

Reduce string hash collisions #168

bungle commented Apr 28, 2016 •

edited

Loading

DemiMarie commented May 2, 2016

funny-falcon commented May 4, 2016

MikePall commented May 6, 2016

daurnimator commented May 6, 2016

funny-falcon commented May 6, 2016

DemiMarie commented May 27, 2016

funny-falcon commented May 28, 2016

DemiMarie commented Jun 26, 2016 •

edited

Loading

funny-falcon commented Jun 27, 2016 •

edited

Loading

DemiMarie commented Jun 28, 2016

DemiMarie commented Aug 16, 2016

strrchr commented Aug 19, 2016 •

edited

Loading

funny-falcon commented Aug 19, 2016 •

edited

Loading

funny-falcon commented Aug 19, 2016

DemiMarie commented Aug 19, 2016

corsix commented Aug 19, 2016

DemiMarie commented Aug 20, 2016

fsfod commented Aug 20, 2016

DemiMarie commented Aug 21, 2016

thegrb93 commented Jul 9, 2018

gonzalezjo commented Jul 9, 2018

MikePall commented Jun 23, 2020

Reduce string hash collisions #168

Reduce string hash collisions #168

Comments

bungle commented Apr 28, 2016 • edited Loading

DemiMarie commented May 2, 2016

funny-falcon commented May 4, 2016

MikePall commented May 6, 2016

daurnimator commented May 6, 2016

funny-falcon commented May 6, 2016

DemiMarie commented May 27, 2016

funny-falcon commented May 28, 2016

DemiMarie commented Jun 26, 2016 • edited Loading

funny-falcon commented Jun 27, 2016 • edited Loading

DemiMarie commented Jun 28, 2016

DemiMarie commented Aug 16, 2016

strrchr commented Aug 19, 2016 • edited Loading

funny-falcon commented Aug 19, 2016 • edited Loading

funny-falcon commented Aug 19, 2016

DemiMarie commented Aug 19, 2016

corsix commented Aug 19, 2016

DemiMarie commented Aug 20, 2016

fsfod commented Aug 20, 2016

DemiMarie commented Aug 21, 2016

thegrb93 commented Jul 9, 2018

gonzalezjo commented Jul 9, 2018

MikePall commented Jun 23, 2020

bungle commented Apr 28, 2016 •

edited

Loading

DemiMarie commented Jun 26, 2016 •

edited

Loading

funny-falcon commented Jun 27, 2016 •

edited

Loading

strrchr commented Aug 19, 2016 •

edited

Loading

funny-falcon commented Aug 19, 2016 •

edited

Loading