Try to trap concurrent hash accesses. #1492
Conversation
@jnthn: Your example does this:
Previously it went:
Certainly an improvement, if nothing else because the stack trace points at where to start looking in order to fix the problem, whereas a SEGV gives no such clue.
Certainly not something for this PR, but it'd be interesting to consider whether it's possible to make this situation recoverable. So far as I understand, this only really detects cases where a conflict leads to the hash growing in size, and the control structure is replaced at these grow points, which is why we can mark the old one as stale?
Otherwise anything oversize that has been queued to free at a safepoint can leak on --full-cleanup.
… free. The control structure now has a flag which is set to 0 when it is allocated. When the hash expands (and needs a new larger allocation) the flag is set to 1 in the soon-to-be-freed memory, and the memory is scheduled to be released at the next safe point. This way we avoid C-level use-after-free if threads attempt to mutate the same hash concurrently, and hopefully we can spot at least some cases and fault them, often enough for bugs to be noticed.
Previously MoarVM could well SEGV if two user threads access the same hash without locking. Now if it detects concurrent access it will oops instead with a diagnostic. It's not clear to me that we can really recover if we detect this, as concurrent write accesses might have already corrupted the hash structures too far for us to resolve problems.
User threads are responsible for ensuring that they don't access MoarVM hashes concurrently, with mutexes or other concurrency primitives. If they ignore this there is the potential for VM-level memory corruption and crashes (or worse). Growing hashes is rare, hence we afford the small performance hit of using atomic operations to detect if two threads attempt to reallocate the same hash at the same time, and exit with an informative message if so. (At the point we detect this we can't actually recover as we have a double free.)
Rebasing onto #1493
Yes, your understanding is correct. As implemented, it can only detect a subset of data races: if two threads concurrently add and then delete keys but never trigger a grow, this won't get spotted. And I'm not sure that it's possible to detect and recover in the general case (ie to do anything more sophisticated than this) without actual locking on every write.

The worst case that I can think of is already pretty bad: two threads both intend to insert a fresh key, and both keys hash into the middle of a run of occupied slots. So both threads need to walk along from the ideal position, find where their key belongs (based on the Robin Hood offset invariants), and then potentially move some of the later keys "to the right". If one thread has already decided to move a block of memory at the same time that the second thread is reading in the middle of it, then all bets are off as to what the second thread actually reads. Hence I can't see how you can recover from that, so you'd have to lock before every write. (And unlike a hash based on chained buckets, you'd have to hold a global lock for the whole hash, instead of being able to have a finer-grained lock for each chain, or some CAS approach to inserting into the chains.) Even the "less bad" failure cases can, I suspect, end up breaking some invariants, such as two threads adding keys at the same time exceeding the maximum probe distance.