Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ustring hash collision instrumentation #2786

Merged
merged 1 commit into from Dec 25, 2020

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Dec 14, 2020

OSL has some concerns about how ustring hash collisions could impact its
strategy for strings on GPU.

This patch adds some instrumentation to ustring to detect hash
collisions. Added static ustring methods total_ustrings() and
hash_collisions(). The latter tells you the number of collision pairs,
including optionally enumerating them into a vector of ustrings. Also,
getstats(verbose=true), in Debug builds only, will print all of the
colliding pairs. These are both somewhat expensive if there are lots
of strings (my laptop takes about 4 seconds to find and print all
collisions for 10M ustrings). So if you are concerned about
performance in debug builds, call getstats(false) instead if you have
a truly enormous ustring table.

OSL has some concerns about how ustring hash collisions could impact
its strategy for strings on GPU.

This patch adds some instrumentation to ustring to detect hash
collisions.  Added static ustring methods total_ustrings() and
hash_collisions(). The latter tells you the number of collision pairs,
including optionally enumerating them into a vector of ustrings. Also,
`getstats(verbose=true)`, in Debug builds only, will print all of the
colliding pairs.  These are both somewhat expensive if there are lots
of strings (my laptop takes about 4 seconds to find and print all
collisions for 10M ustrings). So if you are concerned about
performance in debug builds, call getstats(false) instead if you have
a truly enormous ustring table.
@lgritz
Copy link
Collaborator Author

lgritz commented Dec 15, 2020

@etheory

@etheory
Copy link
Contributor

etheory commented Dec 17, 2020

Thanks @lgritz I'll take a look!
Out of curiosity, how many collisions did you get in your 10 million test?

@lgritz
Copy link
Collaborator Author

lgritz commented Dec 17, 2020

None. In my BILLION string test, I still got no collisions. I tried running 10 billion, but it wedged my workstation after the strings ran out of memory (I think I have 192 GB on that machine?).

There are a lot of distinct values in a uint64!

TBH, I was really hoping that I could find some pair that collided, then I could make a unit test including those two strings, just to prove that the "catch duplicates" code really worked. So here is the question -- are the chances higher that there were no collisions, or that there were but my code has a subtle bug making it impossible to find collisions?

@lgritz lgritz merged commit 6b6d455 into AcademySoftwareFoundation:master Dec 25, 2020
@lgritz lgritz deleted the lg-ustringcollision branch December 25, 2020 20:17
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Jan 4, 2021
OSL has some concerns about how ustring hash collisions could impact
its strategy for strings on GPU.

This patch adds some instrumentation to ustring to detect hash
collisions.  Added static ustring methods total_ustrings() and
hash_collisions(). The latter tells you the number of collision pairs,
including optionally enumerating them into a vector of ustrings. Also,
`getstats(verbose=true)`, in Debug builds only, will print all of the
colliding pairs.  These are both somewhat expensive if there are lots
of strings (my laptop takes about 4 seconds to find and print all
collisions for 10M ustrings). So if you are concerned about
performance in debug builds, call getstats(false) instead if you have
a truly enormous ustring table.
@etheory
Copy link
Contributor

etheory commented Jan 12, 2021

Hi @lgritz - I managed to find an example of a collision.
Here is the first:

ustring a("yerkn11ke1");
ustring b("SKXs0SVGGx");
CPPUNIT_ASSERT(a.hash() == b.hash());

I'm not suggesting this is a likely filename, but I'll find some more-filename-like examples for you.

My point is simply that I found a counter-example.

I found some great info here: https://stackoverflow.com/questions/53543035/reverse-jenkins-one-at-a-time-hash

@lgritz
Copy link
Collaborator Author

lgritz commented Jan 12, 2021

wow, cool. Er, sorta.

It's hard to know what to make of this.

In some sense, this is not unexpected. We knew that strings could collide, we just weren't sure that we could proactively generate short-ish strings that would collide.

So, are you saying, this is totally not gonna work and we need to find some other plan for OSL strings before there's a disaster?

Or do you think the chances of a collision of things that matter is low enough that we can punt on this for right now, knowing that eventually we're going to need an alternate collision-free string strategy, but we can stall for a while to try to hash out (excuse the pun) a solution that isn't going to bite us in the PTX cache?

@etheory
Copy link
Contributor

etheory commented Jan 12, 2021

wow, cool. Er, sorta.

😃

It's hard to know what to make of this.

Well, IMO, it's proof that using a hash and assuming it won't (or "can't") collide is not a good idea.

In some sense, this is not unexpected.

😃 - I kind got the idea it sort of was unexpected from the TSC. Maybe just not THIS easily.

We knew that strings could collide, we just weren't sure that we could proactively generate short-ish strings that would collide.

Having seen this stuff in the wild, I knew it was possible, but it's good to now finally have at least semi-reasonable proof of my seemingly-impossible claims.

So, are you saying, this is totally not gonna work and we need to find some other plan for OSL strings before there's a disaster?

That is the pessimistic way of stating it, but basically, yes.
I think you could still use statically-compiled hashes, but there must be some way of dealing with collisions.
Would it be possible to provide some kind of mechanism to leave checking the hash collisions up to the user?
i.e. you have a hash, and then you manually check the strings to see if they are the same. But that string comparison is up to user code?

Or do you think the chances of a collision of things that matter is low enough that we can punt on this for right now, knowing that eventually we're going to need an alternate collision-free string strategy, but we can stall for a while to try to hash out (excuse the pun) a solution that isn't going to bite us in the PTX cache?

I wouldn't suggest this, since test code has a habit of become production code so quickly that you can't back out. But that's just my opinion. It might make sense knowing this to take it to the next TSC and get some other opinions. But I think that the take-away is that whatever the system is, it MUST allow for collisions, since they will happen.

@lgritz
Copy link
Collaborator Author

lgritz commented Jan 12, 2021

Let's follow these up on AcademySoftwareFoundation/OpenShadingLanguage#1309 rather than in this old OIIO ticket, so that the right people are following the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants