New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ustring hash collision instrumentation #2786
ustring hash collision instrumentation #2786
Conversation
OSL has some concerns about how ustring hash collisions could impact its strategy for strings on GPU. This patch adds some instrumentation to ustring to detect hash collisions. Added static ustring methods total_ustrings() and hash_collisions(). The latter tells you the number of collision pairs, including optionally enumerating them into a vector of ustrings. Also, `getstats(verbose=true)`, in Debug builds only, will print all of the colliding pairs. These are both somewhat expensive if there are lots of strings (my laptop takes about 4 seconds to find and print all collisions for 10M ustrings). So if you are concerned about performance in debug builds, call getstats(false) instead if you have a truly enormous ustring table.
Thanks @lgritz I'll take a look! |
None. In my BILLION string test, I still got no collisions. I tried running 10 billion, but it wedged my workstation after the strings ran out of memory (I think I have 192 GB on that machine?). There are a lot of distinct values in a uint64! TBH, I was really hoping that I could find some pair that collided, then I could make a unit test including those two strings, just to prove that the "catch duplicates" code really worked. So here is the question -- are the chances higher that there were no collisions, or that there were but my code has a subtle bug making it impossible to find collisions? |
OSL has some concerns about how ustring hash collisions could impact its strategy for strings on GPU. This patch adds some instrumentation to ustring to detect hash collisions. Added static ustring methods total_ustrings() and hash_collisions(). The latter tells you the number of collision pairs, including optionally enumerating them into a vector of ustrings. Also, `getstats(verbose=true)`, in Debug builds only, will print all of the colliding pairs. These are both somewhat expensive if there are lots of strings (my laptop takes about 4 seconds to find and print all collisions for 10M ustrings). So if you are concerned about performance in debug builds, call getstats(false) instead if you have a truly enormous ustring table.
Hi @lgritz - I managed to find an example of a collision. ustring a("yerkn11ke1"); I'm not suggesting this is a likely filename, but I'll find some more-filename-like examples for you. My point is simply that I found a counter-example. I found some great info here: https://stackoverflow.com/questions/53543035/reverse-jenkins-one-at-a-time-hash |
wow, cool. Er, sorta. It's hard to know what to make of this. In some sense, this is not unexpected. We knew that strings could collide, we just weren't sure that we could proactively generate short-ish strings that would collide. So, are you saying, this is totally not gonna work and we need to find some other plan for OSL strings before there's a disaster? Or do you think the chances of a collision of things that matter is low enough that we can punt on this for right now, knowing that eventually we're going to need an alternate collision-free string strategy, but we can stall for a while to try to hash out (excuse the pun) a solution that isn't going to bite us in the PTX cache? |
😃
Well, IMO, it's proof that using a hash and assuming it won't (or "can't") collide is not a good idea.
😃 - I kind got the idea it sort of was unexpected from the TSC. Maybe just not THIS easily.
Having seen this stuff in the wild, I knew it was possible, but it's good to now finally have at least semi-reasonable proof of my seemingly-impossible claims.
That is the pessimistic way of stating it, but basically, yes.
I wouldn't suggest this, since test code has a habit of become production code so quickly that you can't back out. But that's just my opinion. It might make sense knowing this to take it to the next TSC and get some other opinions. But I think that the take-away is that whatever the system is, it MUST allow for collisions, since they will happen. |
Let's follow these up on AcademySoftwareFoundation/OpenShadingLanguage#1309 rather than in this old OIIO ticket, so that the right people are following the discussion. |
OSL has some concerns about how ustring hash collisions could impact its
strategy for strings on GPU.
This patch adds some instrumentation to ustring to detect hash
collisions. Added static ustring methods total_ustrings() and
hash_collisions(). The latter tells you the number of collision pairs,
including optionally enumerating them into a vector of ustrings. Also,
getstats(verbose=true)
, in Debug builds only, will print all of thecolliding pairs. These are both somewhat expensive if there are lots
of strings (my laptop takes about 4 seconds to find and print all
collisions for 10M ustrings). So if you are concerned about
performance in debug builds, call getstats(false) instead if you have
a truly enormous ustring table.