Try to trap concurrent hash accesses. #1492

Merged
jnthn merged 4 commits into master from hash-stale on May 26, 2021

Conversation

nwc10
Contributor

@nwc10 nwc10 commented May 16, 2021

Previously MoarVM could well SEGV if two user threads access the same hash without locking. Now if it detects concurrent access it will oops instead with a diagnostic. It's not clear to me that we can really recover if we detect this, as concurrent write accesses might have already corrupted the hash structures too far for us to resolve problems.

@nwc10 nwc10 requested a review from jnthn May 16, 2021 10:19
@nwc10
Contributor Author

nwc10 commented May 16, 2021

@jnthn: Your example does this:

   at -e:1  (<ephemeral file>:)
 from -e:1  (<ephemeral file>:)
 from SETTING::src/core.c/Promise.pm6:263  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:882  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:251  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:245  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:242  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:run-one)
 from SETTING::src/core.c/ThreadPoolScheduler.pm6:284  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:)
 from SETTING::src/core.c/Thread.pm6:54  (/home/nick/Perl/rakudo/blib/CORE.c.setting.moarvm:THREAD-ENTRY)

Previously it would SEGV, or occasionally malloc would abort after detecting a double free.

@jnthn
Member

jnthn commented May 16, 2021

> Now if it detects concurrent access it will oops instead with a diagnostic.

Certainly an improvement, if nothing else because the stack trace points at where to start looking in order to fix the problem, whereas a SEGV gives no such clue.

> It's not clear to me that we can really recover if we detect this, as concurrent write accesses might have already corrupted the hash structures too far for us to resolve problems.

Certainly not something for this PR, but it'd be interesting to consider if it's possible to make this situation recoverable. So far as I understand, this only really detects cases where we have a conflict that leads to the hash growing in size, and the control structure is replaced at these grow points, which is why we can mark the old one stale?

nwc10 added 4 commits May 16, 2021 19:19
Otherwise anything oversize that has been queued to free at a safepoint
can leak on --full-cleanup.
… free.

The control structure now has a flag which is set to 0 when it is
allocated. When the hash expands (and needs a new, larger allocation) the
flag is set to 1 in the soon-to-be-freed memory, and that memory is
scheduled to be released at the next safe point. This way we avoid a
C-level use-after-free if threads attempt to mutate the same hash
concurrently, and hopefully we can spot at least some cases and fault
them, often enough for bugs to be noticed. (See the sketch after this
commit list.)
Previously MoarVM could well SEGV if two user threads access the same hash
without locking. Now if it detects concurrent access it will oops instead
with a diagnostic. It's not clear to me that we can really recover if we
detect this, as concurrent write accesses might have already corrupted the
hash structures too far for us to resolve problems.
User threads are responsible for ensuring that they don't access MoarVM
hashes concurrently, with mutexes or other concurrency primitives.

If they ignore this there is the potential for VM-level memory corruption
and crashes (or worse). Growing hashes is rare, hence we afford the small
performance hit of using atomic operations to detect if two threads attempt
to reallocate the same hash at the same time, and exit with an informative
message if so. (At the point we detect this we can't actually recover as
we have a double free.)
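
To make that mechanism concrete, here is a minimal sketch of grow-time detection along the lines the commit messages describe. It is illustrative only: the structure, field, and function names are hypothetical, not MoarVM's actual internals.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical control structure; MoarVM's real layout differs. */
struct hash_control {
    atomic_uint stale;   /* 0 when allocated, set to 1 once superseded */
    /* ... bucket metadata, entries, ... */
};

static struct hash_control *grow(struct hash_control *old) {
    struct hash_control *bigger = calloc(1, sizeof *bigger);
    /* ... rehash every entry from `old` into `bigger` ... */

    /* Atomically flag the old block as superseded. If the flag was already
     * set, a second thread raced us through the same grow path; by now we
     * effectively have a double free, so all we can do is diagnose. */
    if (atomic_exchange(&old->stale, 1) != 0) {
        fprintf(stderr, "oops: concurrent access to a hash detected\n");
        abort();
    }
    /* `old` is scheduled for release at the next safepoint rather than
     * freed immediately, so a racing reader still sees valid memory. */
    return bigger;
}
```

Deferring the free to a safepoint is what turns the failure from a SEGV (or heap corruption) into something reportable: a thread still probing the old block reads stale but valid memory, and a second thread growing the same hash trips over the flag.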
@nwc10
Contributor Author

nwc10 commented May 16, 2021

Rebasing onto #1493

@nwc10
Contributor Author

nwc10 commented May 16, 2021

> Certainly not something for this PR, but it'd be interesting to consider if it's possible to make this situation recoverable. So far as I understand, this only really detects cases where we have a conflict that leads to the hash growing in size, and the control structure is replaced at these grow points, which is why we can mark the old one stale?

Yes, your understanding is correct. For what it is doing, it can only detect a subset of data races. If two threads concurrently add and then delete keys but never increase the size, this won't get spotted (and oopsed), but MoarVM likely could still go wrong in all sorts of strange ways (possibly up to a SEGV). But I realised that the approach would add very little overhead, and would probably catch most naughtiness eventually.

I'm not sure that it's possible to detect and recover in the general case (i.e. to do anything more sophisticated than this) without actual locking (on every write). In particular, the worst case I can think of is already pretty bad: you have two threads both intending to insert a fresh key, and both keys hash into the middle of a run of occupied slots. So both threads need to walk along from the ideal position, find where their key belongs (based on the Robin Hood offset invariants), and then potentially move some of the later keys "to the right". If one thread has already decided to move a block of memory at the same time that the second thread is reading in the middle of it, then all bets are off as to what the second thread actually reads.
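
To picture that scenario, here is a toy Robin Hood insert (purely illustrative, not MoarVM's code). It shows the displacement step with per-slot swaps rather than one memmove, but the effect on a concurrent reader is the same: while one thread is partway through displacing a run, another thread probing that run can see a mixture of old and new entries.

```c
#include <stddef.h>

struct entry {
    int      used;            /* 0 = empty slot */
    unsigned probe_distance;  /* how far the entry sits from its ideal slot */
    int      key, value;
};

/* Toy insert: assumes non-negative keys, no duplicate keys, and a table
 * that never fills (a real implementation grows before that happens). */
static void rh_insert(struct entry *slots, size_t capacity, int key, int value) {
    struct entry incoming = { 1, 0, key, value };
    size_t i = (size_t)key % capacity;        /* ideal slot */
    for (;;) {
        if (!slots[i].used) {                 /* empty slot: claim it */
            slots[i] = incoming;
            return;
        }
        /* Robin Hood rule: if the resident entry is closer to its ideal
         * slot than we are to ours, evict it and carry it onwards. */
        if (slots[i].probe_distance < incoming.probe_distance) {
            struct entry evicted = slots[i];
            slots[i] = incoming;
            incoming = evicted;
        }
        incoming.probe_distance++;
        i = (i + 1) % capacity;               /* keep walking right */
    }
}
```

A single insert can therefore rewrite a whole run of slots, and a reader scanning that run mid-update can no longer trust the offset invariants it is navigating by.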

Hence I can't see how you can recover from that, so you'd have to lock before every write. (And unlike a hash based on chained buckets, you'd have to hold a global lock for the whole hash, instead of being able to have a finer-grained lock for each chain, or some CAS approach to inserting into the chains.)
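
A rough sketch of that locking contrast, again with hypothetical names: with separate chaining each bucket can carry its own lock (or have its chain head updated via CAS), whereas an open-addressing table whose writes may displace entries far from their own slot realistically needs one table-wide lock for every write.

```c
#include <pthread.h>
#include <stddef.h>

/* Chained buckets: writers in different chains never contend. */
struct chain_node;                 /* node type left opaque for the sketch */
struct chain_bucket {
    pthread_mutex_t    lock;       /* fine-grained: one lock per chain */
    struct chain_node *head;       /* or update the head with a CAS loop */
};

/* Open addressing (e.g. Robin Hood): an insert may shuffle entries well
 * away from their ideal slots, so every writer (and any reader that could
 * race a writer) takes the single table-wide lock. */
struct rh_entry;                   /* entry type left opaque for the sketch */
struct open_table {
    pthread_mutex_t  lock;         /* coarse: guards the whole table */
    struct rh_entry *slots;
    size_t           capacity;
};
```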

Even the "less bad" failure cases can (I suspect) end up breaking some invariants; for example, two threads trying to add keys at the same time might exceed the maximum probe distance.

@jnthn jnthn merged commit 6c5adb6 into master May 26, 2021
@jnthn jnthn deleted the hash-stale branch May 26, 2021 20:55