ustring hash collision instrumentation #2786

lgritz · 2020-12-14T17:59:10Z

OSL has some concerns about how ustring hash collisions could impact its
strategy for strings on GPU.

This patch adds some instrumentation to ustring to detect hash
collisions. Added static ustring methods total_ustrings() and
hash_collisions(). The latter tells you the number of collision pairs,
including optionally enumerating them into a vector of ustrings. Also,
getstats(verbose=true), in Debug builds only, will print all of the
colliding pairs. These are both somewhat expensive if there are lots
of strings (my laptop takes about 4 seconds to find and print all
collisions for 10M ustrings). So if you are concerned about
performance in debug builds, call getstats(false) instead if you have
a truly enormous ustring table.

OSL has some concerns about how ustring hash collisions could impact its strategy for strings on GPU. This patch adds some instrumentation to ustring to detect hash collisions. Added static ustring methods total_ustrings() and hash_collisions(). The latter tells you the number of collision pairs, including optionally enumerating them into a vector of ustrings. Also, `getstats(verbose=true)`, in Debug builds only, will print all of the colliding pairs. These are both somewhat expensive if there are lots of strings (my laptop takes about 4 seconds to find and print all collisions for 10M ustrings). So if you are concerned about performance in debug builds, call getstats(false) instead if you have a truly enormous ustring table.

lgritz · 2020-12-15T17:48:21Z

@etheory

etheory · 2020-12-17T08:51:18Z

Thanks @lgritz I'll take a look!
Out of curiosity, how many collisions did you get in your 10 million test?

lgritz · 2020-12-17T17:09:20Z

None. In my BILLION string test, I still got no collisions. I tried running 10 billion, but it wedged my workstation after the strings ran out of memory (I think I have 192 GB on that machine?).

There are a lot of distinct values in a uint64!

TBH, I was really hoping that I could find some pair that collided, then I could make a unit test including those two strings, just to prove that the "catch duplicates" code really worked. So here is the question -- are the chances higher that there were no collisions, or that there were but my code has a subtle bug making it impossible to find collisions?

OSL has some concerns about how ustring hash collisions could impact its strategy for strings on GPU. This patch adds some instrumentation to ustring to detect hash collisions. Added static ustring methods total_ustrings() and hash_collisions(). The latter tells you the number of collision pairs, including optionally enumerating them into a vector of ustrings. Also, `getstats(verbose=true)`, in Debug builds only, will print all of the colliding pairs. These are both somewhat expensive if there are lots of strings (my laptop takes about 4 seconds to find and print all collisions for 10M ustrings). So if you are concerned about performance in debug builds, call getstats(false) instead if you have a truly enormous ustring table.

etheory · 2021-01-12T03:57:23Z

Hi @lgritz - I managed to find an example of a collision.
Here is the first:

ustring a("yerkn11ke1");
ustring b("SKXs0SVGGx");
CPPUNIT_ASSERT(a.hash() == b.hash());

I'm not suggesting this is a likely filename, but I'll find some more-filename-like examples for you.

My point is simply that I found a counter-example.

I found some great info here: https://stackoverflow.com/questions/53543035/reverse-jenkins-one-at-a-time-hash

lgritz · 2021-01-12T04:25:58Z

wow, cool. Er, sorta.

It's hard to know what to make of this.

In some sense, this is not unexpected. We knew that strings could collide, we just weren't sure that we could proactively generate short-ish strings that would collide.

So, are you saying, this is totally not gonna work and we need to find some other plan for OSL strings before there's a disaster?

Or do you think the chances of a collision of things that matter is low enough that we can punt on this for right now, knowing that eventually we're going to need an alternate collision-free string strategy, but we can stall for a while to try to hash out (excuse the pun) a solution that isn't going to bite us in the PTX cache?

etheory · 2021-01-12T05:16:56Z

wow, cool. Er, sorta.

😃

It's hard to know what to make of this.

Well, IMO, it's proof that using a hash and assuming it won't (or "can't") collide is not a good idea.

In some sense, this is not unexpected.

😃 - I kind got the idea it sort of was unexpected from the TSC. Maybe just not THIS easily.

We knew that strings could collide, we just weren't sure that we could proactively generate short-ish strings that would collide.

Having seen this stuff in the wild, I knew it was possible, but it's good to now finally have at least semi-reasonable proof of my seemingly-impossible claims.

So, are you saying, this is totally not gonna work and we need to find some other plan for OSL strings before there's a disaster?

That is the pessimistic way of stating it, but basically, yes.
I think you could still use statically-compiled hashes, but there must be some way of dealing with collisions.
Would it be possible to provide some kind of mechanism to leave checking the hash collisions up to the user?
i.e. you have a hash, and then you manually check the strings to see if they are the same. But that string comparison is up to user code?

Or do you think the chances of a collision of things that matter is low enough that we can punt on this for right now, knowing that eventually we're going to need an alternate collision-free string strategy, but we can stall for a while to try to hash out (excuse the pun) a solution that isn't going to bite us in the PTX cache?

I wouldn't suggest this, since test code has a habit of become production code so quickly that you can't back out. But that's just my opinion. It might make sense knowing this to take it to the next TSC and get some other opinions. But I think that the take-away is that whatever the system is, it MUST allow for collisions, since they will happen.

lgritz · 2021-01-12T05:21:52Z

Let's follow these up on AcademySoftwareFoundation/OpenShadingLanguage#1309 rather than in this old OIIO ticket, so that the right people are following the discussion.

lgritz merged commit 6b6d455 into AcademySoftwareFoundation:master Dec 25, 2020

lgritz deleted the lg-ustringcollision branch December 25, 2020 20:17

lgritz mentioned this pull request Jan 12, 2021

change how strings are handled n OptiX 7 AcademySoftwareFoundation/OpenShadingLanguage#1309

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ustring hash collision instrumentation #2786

ustring hash collision instrumentation #2786

lgritz commented Dec 14, 2020

lgritz commented Dec 15, 2020

etheory commented Dec 17, 2020

lgritz commented Dec 17, 2020

etheory commented Jan 12, 2021 •

edited

lgritz commented Jan 12, 2021

etheory commented Jan 12, 2021

lgritz commented Jan 12, 2021

ustring hash collision instrumentation #2786

ustring hash collision instrumentation #2786

Conversation

lgritz commented Dec 14, 2020

lgritz commented Dec 15, 2020

etheory commented Dec 17, 2020

lgritz commented Dec 17, 2020

etheory commented Jan 12, 2021 • edited

lgritz commented Jan 12, 2021

etheory commented Jan 12, 2021

lgritz commented Jan 12, 2021

etheory commented Jan 12, 2021 •

edited