Try to trap concurrent hash accesses. #1492

Merged
jnthn merged 4 commits into master from hash-stale on May 26, 2021

Conversation

nwc10
Contributor

@nwc10 nwc10 commented May 16, 2021

Previously MoarVM could well SEGV if two user threads access the same hash without locking. Now if it detects concurrent access it will oops instead with a diagnostic. It's not clear to me that we can really recover if we detect this, as concurrent write accesses might have already corrupted the hash structures too far for us to resolve problems.

@nwc10 nwc10 requested a review from jnthn May 16, 2021 10:19
@nwc10
Contributor Author

nwc10 commented May 16, 2021

@jnthn: Your example does this:

   at -e:1  (<ephemeral file>:)
 from -e:1  (<ephemeral file>:)
 from SETTING::src/core.c/Promise.pm6:263  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:882  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:251  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:245  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:242  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:run-one)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:284  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/Thread.pm6:54  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:THREAD-ENTRY)

Previously it would SEGV, or occasionally malloc would abort after detecting a double free.

@jnthn
Member

jnthn commented May 16, 2021

> Now if it detects concurrent access it will oops instead with a diagnostic.

Certainly an improvement, if nothing else because the stack trace points at where to start looking in order to fix the problem, whereas a SEGV gives no such clue.

> It's not clear to me that we can really recover if we detect this, as concurrent write accesses might have already corrupted the hash structures too far for us to resolve problems.

Certainly not something for this PR, but it'd be interesting to consider if it's possible to make this situation recoverable. So far as I understand, this only really detects cases where we have a conflict that leads to the hash growing in size, and the control structure is replaced at these grow points, which is why we can mark the old one stale?

nwc10 added 4 commits May 16, 2021 19:19
Otherwise anything oversize that has been queued to free at a safepoint
can leak on --full-cleanup.
… free.

The control structure now has a flag which is set to 0 when it is
allocated. When the hash expands (and needs a new, larger allocation) the
flag is set to 1 in the soon-to-be-freed memory, and that memory is
scheduled to be released at the next safe point. This way we avoid a
C-level use-after-free if threads attempt to mutate the same hash
concurrently, and hopefully we can spot at least some cases and fault
them, often enough for bugs to be noticed. (See the sketch after this
commit list.)
Previously MoarVM could well SEGV if two user threads access the same hash
without locking. Now if it detects concurrent access it will oops instead
with a diagnostic. It's not clear to me that we can really recover if we
detect this, as concurrent write accesses might have already corrupted the
hash structures too far for us to resolve problems.
User threads are responsible for ensuring that they don't access MoarVM
hashes concurrently, with mutexes or other concurrency primitives.

If they ignore this there is the potential for VM-level memory corruption
and crashes (or worse). Growing hashes is rare, hence we afford the small
performance hit of using atomic operations to detect if two threads attempt
to reallocate the same hash at the same time, and exit with an informative
message if so. (At the point we detect this we can't actually recover as
we have a double free.)
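
To make that mechanism concrete, here is a minimal sketch of grow-time detection along the lines the commit messages describe. It is illustrative only: the structure, field, and function names are hypothetical, not MoarVM's actual internals.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical control structure; MoarVM's real layout differs. */
struct hash_control {
    atomic_uint stale;   /* 0 when allocated, set to 1 once superseded */
    /* ... bucket metadata, entries, ... */
};

static struct hash_control *grow(struct hash_control *old) {
    struct hash_control *bigger = calloc(1, sizeof *bigger);
    /* ... rehash every entry from `old` into `bigger` ... */

    /* Atomically flag the old block as superseded. If the flag was already
     * set, a second thread raced us through the same grow path; by now we
     * effectively have a double free, so all we can do is diagnose. */
    if (atomic_exchange(&old->stale, 1) != 0) {
        fprintf(stderr, "oops: concurrent access to a hash detected\n");
        abort();
    }
    /* `old` is scheduled for release at the next safepoint rather than
     * freed immediately, so a racing reader still sees valid memory. */
    return bigger;
}
```

Deferring the free to a safepoint is what turns the failure from a SEGV (or heap corruption) into something reportable: a thread still probing the old block reads stale but valid memory, and a second thread growing the same hash trips over the flag.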
@nwc10
Contributor Author

nwc10 commented May 16, 2021

Rebasing onto #1493

@nwc10
Contributor Author

nwc10 commented May 16, 2021

> Certainly not something for this PR, but it'd be interesting to consider if it's possible to make this situation recoverable. So far as I understand, this only really detects cases where we have a conflict that leads to the hash growing in size, and the control structure is replaced at these grow points, which is why we can mark the old one stale?

Yes, your understanding is correct. For what it is doing, it can only detect a subset of data races. If two threads concurrently add and then delete keys but never increase the size, this won't get spotted (and oopsed), but MoarVM likely could still go wrong in all sorts of strange ways (possibly up to a SEGV). But I realised that the approach would add very little overhead, and would probably catch most naughtiness eventually.

I'm not sure that it's possible to detect and recover in the general case (i.e. to do anything more sophisticated than this) without actual locking (on every write). In particular, the worst case I can think of is already pretty bad: you have two threads both intending to insert a fresh key, and both keys hash into the middle of a run of occupied slots. So both threads need to walk along from the ideal position, find where their key belongs (based on the Robin Hood offset invariants), and then potentially move some of the later keys "to the right". If one thread has already decided to move a block of memory at the same time that the second thread is reading in the middle of it, then all bets are off as to what the second thread actually reads.
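
To picture that scenario, here is a toy Robin Hood insert (purely illustrative, not MoarVM's code). It shows the displacement step with per-slot swaps rather than one memmove, but the effect on a concurrent reader is the same: while one thread is partway through displacing a run, another thread probing that run can see a mixture of old and new entries.

```c
#include <stddef.h>

struct entry {
    int      used;            /* 0 = empty slot */
    unsigned probe_distance;  /* how far the entry sits from its ideal slot */
    int      key, value;
};

/* Toy insert: assumes non-negative keys, no duplicate keys, and a table
 * that never fills (a real implementation grows before that happens). */
static void rh_insert(struct entry *slots, size_t capacity, int key, int value) {
    struct entry incoming = { 1, 0, key, value };
    size_t i = (size_t)key % capacity;        /* ideal slot */
    for (;;) {
        if (!slots[i].used) {                 /* empty slot: claim it */
            slots[i] = incoming;
            return;
        }
        /* Robin Hood rule: if the resident entry is closer to its ideal
         * slot than we are to ours, evict it and carry it onwards. */
        if (slots[i].probe_distance < incoming.probe_distance) {
            struct entry evicted = slots[i];
            slots[i] = incoming;
            incoming = evicted;
        }
        incoming.probe_distance++;
        i = (i + 1) % capacity;               /* keep walking right */
    }
}
```

A single insert can therefore rewrite a whole run of slots, and a reader scanning that run mid-update can no longer trust the offset invariants it is navigating by.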

Hence I can't see how you can recover from that, so you'd have to lock before every write. (And unlike a hash based on chained buckets, you'd have to hold a global lock for the whole hash, instead of being able to have a finer-grained lock for each chain, or some CAS approach to inserting into the chains.)
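
A rough sketch of that locking contrast, again with hypothetical names: with separate chaining each bucket can carry its own lock (or have its chain head updated via CAS), whereas an open-addressing table whose writes may displace entries far from their own slot realistically needs one table-wide lock for every write.

```c
#include <pthread.h>
#include <stddef.h>

/* Chained buckets: writers in different chains never contend. */
struct chain_node;                 /* node type left opaque for the sketch */
struct chain_bucket {
    pthread_mutex_t    lock;       /* fine-grained: one lock per chain */
    struct chain_node *head;       /* or update the head with a CAS loop */
};

/* Open addressing (e.g. Robin Hood): an insert may shuffle entries well
 * away from their ideal slots, so every writer (and any reader that could
 * race a writer) takes the single table-wide lock. */
struct rh_entry;                   /* entry type left opaque for the sketch */
struct open_table {
    pthread_mutex_t  lock;         /* coarse: guards the whole table */
    struct rh_entry *slots;
    size_t           capacity;
};
```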

Even the "less bad" failure cases can (I suspect) end up breaking some invariants; for example, two threads trying to add keys at the same time might exceed the maximum probe distance.

@jnthn jnthn merged commit 6c5adb6 into master May 26, 2021
@jnthn jnthn deleted the hash-stale branch May 26, 2021 20:55