
Theoretical data race in non-raw mode, due to incorrect atomic orderings #14

Closed
nyanpasu64 opened this issue Jan 2, 2021 · 5 comments

@nyanpasu64

While writing my own version of this crate (I independently came across the same concept of a triple buffer plus an atomic bitflag), I tested my code with Loom to check for atomic ordering bugs. (Unfortunately, Loom's version of UnsafeCell removes the APIs which return &T or &mut T and only supplies APIs which take callbacks, so I had to cripple my API on Loom-enabled builds and use the crippled API for Loom testing.)

If I wrote Input::publish() with only Release ordering, or Output::update() with only Acquire ordering (instead of AcqRel), then Loom would detect a data race when accessing the UnsafeCells. After turning on Loom debugging and adding debug prints, I traced how the data race occurred, and translated my types/variables/methods to the triple-buffer equivalents.

Trace

Output thread runs:

  • Output::output_buffer() returns &mut into SharedState::buffers[2], Output::read() casts it to &, and it's subsequently read (a)
    • I'm not sure if even constructing a &mut is UB if you're not using AcqRel on both the reader and writer... But this is moot, because you can get a data race even if Output only reads.
  • Output::update() swaps atomic SharedState::back_info: store 2 with Relaxed ordering, load 1 with Acquire ordering (b)

Input thread runs:

  • Input::publish() swaps atomic SharedState::back_info: store 0 with Release ordering, load 2 with Relaxed ordering (c)
  • Input::input_buffer() returns &mut into buffers[2], and Input::write() writes into it (d)

In Rust, for each atomic memory location, all threads see all reads/writes in the same order (because C++ and Rust only expose LLVM's Monotonic and stronger orderings). But that order need not be consistent between different memory locations. As I understand it, thread 1 reads memory location A before writing to B, and thread 2 reads the new value of B before writing to A, yet thread 2's write to A can still happen before thread 1's read of A (due to hardware reordering or caching).
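
As a self-contained illustration of that last point (not code from either crate), the classic store-buffer litmus test shows the allowed cross-location reordering with just two Relaxed atomics:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(0);

fn main() {
    // Store-buffer litmus test: each thread stores to one location, then loads
    // the other, all with Relaxed ordering.
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::Relaxed);
        Y.load(Ordering::Relaxed)
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::Relaxed);
        X.load(Ordering::Relaxed)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    // r1 == 0 && r2 == 0 is impossible in any interleaving of the two threads,
    // but it is allowed by the memory model and observable on real hardware.
    println!("r1 = {r1}, r2 = {r2}");
}
```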

Explained differently, (a) and (d) form a data race (in spite of (b) and (c)), because the output's Relaxed store (b) does not synchronize-with the input's Relaxed load (c), so it does not force (a) to happen-before (d) (terminology from Preshing on Programming).
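
Here is a minimal Loom model of that trace, using illustrative names rather than the actual triple-buffer internals (run under --cfg loom); with these orderings, Loom reports the (a)/(d) race on buffers[2]:

```rust
use std::sync::Arc;

use loom::cell::UnsafeCell;
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::thread;

// Illustrative stand-in for SharedState: three buffers plus the back-buffer index.
struct Shared {
    buffers: [UnsafeCell<u32>; 3],
    back_info: AtomicUsize,
}

// Safety: buffer accesses are (supposedly) coordinated through back_info.
unsafe impl Send for Shared {}
unsafe impl Sync for Shared {}

#[test]
fn mixed_orderings_race() {
    loom::model(|| {
        // Initial state: the back buffer is 1, the output owns 2, the input owns 0.
        let shared = Arc::new(Shared {
            buffers: [UnsafeCell::new(0), UnsafeCell::new(0), UnsafeCell::new(0)],
            back_info: AtomicUsize::new(1),
        });

        let input = {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // (c) Input::publish(): store 0 with Release, load with Relaxed
                let new_input = shared.back_info.swap(0, Ordering::Release);
                // (d) Input::write(): write into the buffer we just acquired
                shared.buffers[new_input].with_mut(|p| unsafe { *p = 42 });
            })
        };

        // (a) Output::read(): read the current output buffer, index 2
        shared.buffers[2].with(|p| unsafe { *p });
        // (b) Output::update(): store 2 with Relaxed, load with Acquire
        let _new_output = shared.back_info.swap(2, Ordering::Acquire);

        input.join().unwrap();
    });
}
```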

Does it happen in practice? (no)

Looking at Wikipedia's article on memory ordering, on ARM devices (a) can be reordered with (b), and likewise (c) with (d). I ran triple-buffer's unit tests on an ARM32 and an ARM64 Android device (using Termux's rust package), but they didn't detect any reorderings in practice. The contended_concurrent_read_write test (append --exact tests::contended_concurrent_read_write to the command line) operates on a RaceCell, which points to both a stack and a heap element. However, it never found any inconsistencies.

(a) and (b) happen in different function calls, and so do (c) and (d), which probably limits the ability for compilers to reorder them. I'm not sure if hardware can reorder them, since I haven't learned about atomic operations at the assembly/hardware/cache level.

Solution

I switched to Ordering::AcqRel for both atomic accesses, to silence Loom (which also fixed the underlying bug). I don't know how much performance you lose by doing that regardless of "raw" mode, though.
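
For concreteness, the fix boils down to doing both swaps with AcqRel; a sketch with a hypothetical helper (not the crate's actual code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Swap our buffer index with the shared back-buffer index (hypothetical helper).
fn swap_with_back(back_info: &AtomicUsize, my_index: usize) -> usize {
    // AcqRel on both sides:
    //  - the Release half orders our earlier accesses to my_index's buffer
    //    before the index becomes visible to the other thread,
    //  - the Acquire half orders the other thread's earlier accesses to the
    //    buffer we receive before our own upcoming accesses to it.
    back_info.swap(my_index, Ordering::AcqRel)
}
```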

@HadrienG2 added the bug label on Jan 2, 2021
@HadrienG2 (Owner) commented Jan 10, 2021

Ah, I see, the theoretical data race that I overlooked is that at the time of a buffer swap, hardware can do this:

  1. The reader thread requests a read from the buffer. Since reading data takes time, the associated CPU asynchronously schedules the read and tries to speculatively execute compatible program instructions coming after that while waiting for the results of the read to come back.
  2. For some reason, the reader thread immediately schedules an update, so it checks the back-buffer index, notices that new data has been written to the back-buffer, and swaps the back-buffer index. These reads and writes target a different memory location from the previous buffer read and there's no Read-Write barrier around (that would require Release ordering), so in the eyes of the CPU manipulating the back-buffer index is a compatible operation that can be executed concurrently. Hence those two streams of memory instructions are scheduled in parallel.
  3. The back-buffer swap completes before the buffer read has been carried out. This is actually quite unlikely to happen in practice, because the back-buffer index write has been requested a few instructions later and that data should not usually be warmer in cache than the back-buffer data. If anything, it should be less warm, since the write from the writer thread "stole" that cache line from the L1 cache of the reader thread's CPU core...
  4. But anyway, let's assume that this does happen. Meanwhile, the writer thread fetches a new back-buffer, and gets the buffer that the reader thread has just released (which is a cache miss, since the reader "stole" the cache line of the back-buffer index).
  5. It writes into it very quickly (which again is a cache miss, since the reader was reading from that buffer and should therefore have it in its L1 cache), and as it is in the process of doing so...
  6. ...the reader thread finally finishes reading from the buffer, and gets corrupt data.

I cannot think of a practical hardware implementation in which all of this can happen in the short time span between the moment where the reader's back-buffer index write has been committed and the moment where the reader's old back-buffer read is performed, which is most likely why you did not observe it on real hardware. But that behavior is indeed allowed by the memory model, and so a Sufficiently Evil Optimizer (which is what Loom tries to model) would be within its rights to generate code that makes this data race much more likely to happen.

I guess I need to give up on the Acquire/Release optimization and use AcqRel in all operating modes, then...

Since this was the main motivation for putting the "raw" interface behind a feature flag, I could use that bugfix as an opportunity to question whether I should keep the feature flag or not. Getting rid of it would simplify the code, but would arguably be a semver-breaking change, although I can make a semver-compatible version of it by keeping a "raw" feature flag that does nothing for a while, and not renaming methods right away.

@HadrienG2 (Owner)

There is now a fix for this on master. I discovered that I did not have benchmarks for the low-level API along the way, so I'm going to fix that, and then I'll tag a bugfix release.

@nyanpasu64 (Author)

I'm not very familiar with how atomics map onto generated code (I've only really spent time studying high-level memory models, not CPU coherence and barriers), but do you think memory fences would be faster or slower than atomic swaps? (I don't know fences well, but Mintomic, an older atomic library, only offers fences, not read/write orderings.)

@HadrienG2 (Owner) commented Jan 16, 2021

The swap must be atomic in any case, because otherwise you would have this race condition:

  • Reader reads index of the back buffer (let's call that i1)
  • Writer swaps its current buffer (let's call it i2) with the back buffer (which was i1)
    • Now writer writes to i1 and back buffer is i2
  • Reader writes index of its former buffer (let's call that i3) as the new back buffer index and starts reading from i1, thinking that this was still the back buffer at the time where it performed the "swap".
    • Now reader and writer are both accessing i1 in a racy fashion (a sketch of this broken variant follows below).
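
A sketch of that broken, non-atomic variant (hypothetical helper, not real code from the crate); the gap between the load and the store is exactly where the other side's swap can sneak in:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Non-atomic "swap" of the back-buffer index, split into a load and a store.
fn broken_swap(back_info: &AtomicUsize, my_index: usize) -> usize {
    let other = back_info.load(Ordering::Acquire); // reader observes i1...
    // ...the writer's swap can land right here and take i1 for itself...
    back_info.store(my_index, Ordering::Release);  // ...yet the reader still returns i1
    other
}
```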

However, as an atomic swap, it could be a Relaxed atomic operation, surrounded by fences (you need to put a fence before a write for Release ordering and a fence after a read for Acquire ordering).
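
In code, the fence-based variant would look roughly like this (a sketch under the assumptions above, not the crate's code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Relaxed swap sandwiched between fences: the Release fence orders earlier
// accesses before the swap, the Acquire fence orders the swap before later
// accesses, so together they provide (at least) AcqRel semantics.
fn fenced_swap(back_info: &AtomicUsize, my_index: usize) -> usize {
    fence(Ordering::Release);
    let other = back_info.swap(my_index, Ordering::Relaxed);
    fence(Ordering::Acquire);
    other
}
```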

This should be slower than an AcqRel swap on any hardware architecture where Acquire and Release require actual synchronization (i.e. basically any modern architecture other than x86), especially if the CPU architecture has load-acquire and store-release instructions (see e.g. ARMv8), because...

  • You need two fences and a swap instead of just an acqrel swap, which is 3x more instructions and probably 3x more synchronization unless the CPU hardware is super clever.
  • The compiler will not replace the fence/swap/fence sandwich with an acqrel operation during optimization as...
    • The semantics are probably not the same (e.g. a Read/Read fence prevents all reads before from being reordered with all reads after, whereas a load-acquire allows reads and writes coming before the load-acquire to be reordered after it... not sure if similar considerations apply to acq-rel though, since it's a 2-way barrier).
    • Even if it were legal, LLVM is very poor at optimizing atomics-based code, so it probably wouldn't do it anyway.

@HadrienG2 (Owner)

By the way, I just tagged the bugfix, so this bug is now officially resolved ;)
