Documentation for the algorithm implemented by this crate.

This file is not part of the crate's code or public API surface; it is just documentation for the underlying algorithm.

General Design

The purpose behind this concurrent map implementation is to make reads as cheap as possible. This presents an obvious challenge, since readers cannot access the map while it is being written to. We solve this problem by keeping two copies of the underlying map - one for the writer to modify, and one for the readers to read from. When the writer wants to publish its modifications, we atomically swap the two maps so that new readers see the writer's changes. We then re-apply those changes to the old map when it is safe to do so.

This approach is already implemented by the crate evmap; however, its performance suffers due to high-latency writes and a shared atomic pointer to the readable map. flashmap's implementation of this approach ensures that readers access no shared state with each other (other than the actual map) on their critical path, and moreover that sufficiently infrequent writes are effectively wait-free.

The trade-offs are that this map is eventually consistent, so existing readers can still access stale data, and there is only a single writer, so concurrent modification is not supported.

Map State

This section briefly describes where state lives in this data structure.

Shared State

The two maps are stored in a length-2 array in memory. Currently the backing map implementation is provided by hashbrown, but this might be made generic in the future. The pointer to this array is shared with all readers and the writer.

Readers - or more specifically ReadGuards - are tracked via atomic reference counting. Each read handle has its own individual refcount, but these refcounts are shared with the writer.

Because of the design of this map, we are able to have the writer block when it cannot make progress and be unparked when the last reader gets out of its way. Thus the shared state of the map includes the necessary bookkeeping to accomplish this, which is explained in greater detail later.
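
To make this layout concrete, the shared state might look roughly like the sketch below. The names (Shared, writer_thread, and so on) and the exact field types are illustrative assumptions, not the crate's actual definitions.

use std::cell::UnsafeCell;
use std::sync::Mutex;
use std::sync::atomic::{AtomicIsize, AtomicUsize};

// Placeholder for the hashbrown-backed map type.
type Map<K, V> = hashbrown::HashMap<K, V>;

// Simplified stand-in; the real RefCount type is shown later in this document.
struct RefCount {
    value: AtomicUsize,
}

// Hypothetical sketch of the state shared by the writer and all readers.
struct Shared<K, V> {
    // The two copies of the map: at any moment one is readable, one writable.
    maps: [UnsafeCell<Map<K, V>>; 2],
    // One refcount per read handle; the array itself is guarded by a mutex.
    refcounts: Mutex<slab::Slab<*const RefCount>>,
    // Residual readers still on the old map, plus a "writer is parked" bit.
    residual: AtomicIsize,
    // Used to unpark the writing thread when the last residual reader leaves.
    writer_thread: Mutex<Option<std::thread::Thread>>,
}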

Reader State

Currently, readers store no additional state other than what is described above. Each read handle does hold onto a refcount_key which is used to deallocate that handle's refcount on drop, but this state is immutable.

Writer State

Writers, in addition to an Arc of the aforementioned shared state, must store an operation log of changes. These changes are re-applied to the second map at the beginning of the next write.
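
As a rough sketch, building on the hypothetical Shared and RefCount types above, the two handle types might hold state like this (field names other than refcount_key are assumptions):

use std::sync::Arc;

// Hypothetical operation log entry; the crate's real representation may differ.
enum Operation<K, V> {
    Insert(K, V),
    Remove(K),
}

struct ReadHandle<K, V> {
    shared: Arc<Shared<K, V>>,
    // Points at this handle's refcount inside the shared refcount array.
    refcount: *const RefCount,
    // Key used to remove and deallocate the refcount on drop.
    refcount_key: usize,
}

struct WriteHandle<K, V> {
    shared: Arc<Shared<K, V>>,
    // Index (0 or 1) of the map the writer is currently allowed to modify.
    writable_index: usize,
    // Operations applied during the current write, replayed against the
    // other map at the start of the next write.
    operations: Vec<Operation<K, V>>,
}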

The Algorithm

There are multiple components to this algorithm, each of which is explained individually below.

Ref Counts

A core component of this algorithm is reference counting. Each read handle has its own reference count (refcount) which it uses to track the number of outstanding guards which need to be dropped. In addition to the number of outstanding guards, the refcount also stores the current readable map (the map which new read guards should point to). Since there are only two maps, this information is stored in the high bit of the refcount, and the actual guard count is stored in the lower usize::BITS - 2 bits (one bit is deliberately left unused so that overflow can be detected).

A refcount has three principal operations: increment, decrement, and swap maps. Part of the implementation is shown below:

pub struct RefCount {
    value: CachePadded<AtomicUsize>,
}

impl RefCount {
    /* The implementation of some helpers has been omitted for brevity */

    pub fn increment(&self) -> MapIndex {
        let old_value = self.value.fetch_add(1, Ordering::Acquire);
        Self::check_overflow(old_value);
        Self::to_map_index(old_value)
    }

    pub fn decrement(&self) -> MapIndex {
        let old_value = self.value.fetch_sub(1, Ordering::Release);
        Self::to_map_index(old_value)
    }

    pub fn swap_maps(&self) -> usize {
        let old_value = self
            .value
            .fetch_add(Self::MAP_INDEX_FLAG, Ordering::Relaxed);
        Self::to_refcount(old_value)
    }
}

increment and decrement both change the refcount as expected, but they also make use of the return value of the fetch_* operations. The old value of the refcount (before the RMW) tells us, in its high bit, which map the writer expects us to be reading from. This is especially useful in increment, since the reader can guard its access to the map and learn which map to read from in a single atomic operation. Why we also need the map index from decrement will be explained later.

The fetch_add in swap_maps has the same effect as toggling the high bit. This could be done with fetch_xor, but I've seen that compile to a CAS loop on most architectures. The return value is the number of active guards at the time of swapping the maps. This quantity will be referred to as "residual" since it's the number of residual readers still accessing the old map.
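
For reference, the omitted helpers could look roughly like this. The constants follow the bit layout described above, but the exact definitions and the MapIndex variants shown here are assumptions, not the crate's actual code.

impl RefCount {
    // Highest bit: which map new read guards should use.
    const MAP_INDEX_FLAG: usize = 1 << (usize::BITS - 1);
    // The bit just below it is kept clear; if it ever becomes set, the
    // guard count has overflowed.
    const OVERFLOW_BIT: usize = 1 << (usize::BITS - 2);

    fn to_map_index(value: usize) -> MapIndex {
        if value & Self::MAP_INDEX_FLAG == 0 {
            MapIndex::First
        } else {
            MapIndex::Second
        }
    }

    fn to_refcount(value: usize) -> usize {
        // Mask off the map-index bit, leaving only the guard count.
        value & !Self::MAP_INDEX_FLAG
    }

    fn check_overflow(old_value: usize) {
        if old_value & Self::OVERFLOW_BIT != 0 {
            // Too many outstanding guards; abort before the count can
            // spill into the map-index bit.
            std::process::abort();
        }
    }
}

// Hypothetical stand-in for the crate's MapIndex type.
#[derive(Clone, Copy, PartialEq, Eq)]
enum MapIndex {
    First = 0,
    Second = 1,
}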

Creating and Dropping Read Handles

Each read handle has its own refcount, and all the refcounts for every read handle are stored in a single contiguous array (currently backed by slab). This array is wrapped in a Mutex, so any access to the entire array involves acquiring a lock. Specifically, when a read handle is constructed, a new refcount is allocated on the heap, the refcount array is locked, and a pointer to that refcount is added to the array. Similarly, when a read handle is dropped, the array is locked, the pointer is removed, and the refcount deallocated.
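
In code, this registration might look something like the following sketch; it reuses the hypothetical Shared type from earlier, and RefCount::new along with the function names are assumptions for illustration.

impl<K, V> Shared<K, V> {
    // Called when constructing a new read handle.
    fn register_reader(&self) -> (usize, *const RefCount) {
        // Allocate the refcount on the heap so its address stays stable even
        // if the slab reallocates its storage.
        let refcount: *const RefCount = Box::into_raw(Box::new(RefCount::new()));
        let key = self.refcounts.lock().unwrap().insert(refcount);
        (key, refcount)
    }

    // Called when a read handle is dropped.
    fn unregister_reader(&self, key: usize) {
        let refcount = self.refcounts.lock().unwrap().remove(key);
        // Safe to deallocate: no guard from this handle can still exist.
        unsafe { drop(Box::from_raw(refcount as *mut RefCount)) };
    }
}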

The Read Algorithm

The read algorithm is described below:

  1. Increment the refcount.
  2. Take the returned MapIndex, convert it to an offset into the length-2 map array, and get a reference to the map to read from. Store the map index in the read guard.
  3. When the read guard is dropped, decrement the refcount. If the returned map index matches the one stored on the guard, we are done. If they do not match, proceed to step 4.
  4. Since the map indexes do not match, this is a residual guard. Call release_residual to signal that we are done reading.
  5. If we see from the atomic decrement in release_residual that we were the last residual reader, wake the writer (this portion of the algorithm will be explained later).

release_residual is a function that operates on an atomic counter called residual which tracks the number of read guards blocking the writer from making progress. This quantity will be explained with the write algorithm.
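
Putting the steps above together, the read path might look roughly like the sketch below, reusing the hypothetical types from the earlier sketches; release_residual is only stubbed here and is sketched in the next section.

struct ReadGuard<'a, K, V> {
    handle: &'a ReadHandle<K, V>,
    map: &'a Map<K, V>,
    map_index: MapIndex,
}

impl<K, V> ReadHandle<K, V> {
    fn guard(&self) -> ReadGuard<'_, K, V> {
        // Steps 1-2: a single atomic increment both guards the map and
        // reports which of the two maps this guard should read from.
        let map_index = unsafe { &*self.refcount }.increment();
        let map = unsafe { &*self.shared.maps[map_index as usize].get() };
        ReadGuard { handle: self, map, map_index }
    }
}

impl<K, V> Drop for ReadGuard<'_, K, V> {
    fn drop(&mut self) {
        // Step 3: the decrement reports the currently readable map.
        let current_index = unsafe { &*self.handle.refcount }.decrement();
        if current_index != self.map_index {
            // Steps 4-5: the maps were swapped while we were reading, so
            // this is a residual guard and must signal that it is done.
            release_residual(&self.handle.shared);
        }
    }
}

// Placeholder; the residual-release logic is sketched in the next section.
fn release_residual<K, V>(_shared: &Shared<K, V>) {}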

Parking/Unparking the Writer

Parking and unparking the writer is also managed through the residual quantity. The residual count is stored in the lower bits (the lower 63 bits on a 64-bit architecture), and the highest bit signals whether or not the writer is waiting to be unparked.

When the writer prepares to park, it sets the high bit of the residual count to 1. If the return value of this RMW operation signals that no residual readers were left, then the count is set to zero and the writer does not park. Otherwise, it parks.

When the last residual reader decides to unpark the writer, it will atomically set the residual count to 0 and then unpark the writing thread.
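
A sketch of both sides of this protocol is shown below, using the sign bit of an AtomicIsize as the parked flag; the counter type, the memory orderings, and the names are assumptions rather than the crate's actual code.

use std::sync::atomic::{AtomicIsize, Ordering};
use std::thread::Thread;

// High (sign) bit of `residual`: the writer is waiting to be unparked.
const WRITER_PARKED: isize = isize::MIN;

// Writer side: called when residual readers still remain on the old map.
// Assumes the writer's Thread handle was stored in the shared state earlier.
fn park_writer(residual: &AtomicIsize) {
    // Announce the intent to park by setting the high bit.
    let old = residual.fetch_or(WRITER_PARKED, Ordering::Acquire);
    if old == 0 {
        // Every residual reader finished before we could park: clear the
        // flag and continue without blocking.
        residual.store(0, Ordering::Relaxed);
        return;
    }
    // Block until the last residual reader resets the counter to 0.
    while residual.load(Ordering::Acquire) != 0 {
        std::thread::park();
    }
}

// Reader side: called when dropping a guard that was reading the old map.
fn release_residual(residual: &AtomicIsize, writer: &Thread) {
    let old = residual.fetch_sub(1, Ordering::Release);
    if old & isize::MAX == 1 && old & WRITER_PARKED != 0 {
        // We were the last residual reader and the writer is parked:
        // reset the counter and wake the writing thread.
        residual.store(0, Ordering::Release);
        writer.unpark();
    }
}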

The Write Algorithm

The write algorithm is split into two parts: synchronize + start write, and publish. When a new write guard is created, start_write is called, and when that guard is dropped, publish is called.

synchronize + start write:

  1. If there are no residual readers, we are done.
  2. If there are residual readers, then park as described above.
  3. Once unblocked, obtain a reference to the writable map.
  4. Apply changes from previous write.

After these steps but before publish, the changes to the writable map are made and stored in the operation log.
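
As a sketch, reusing the hypothetical types from earlier and the park_writer function from the parking section (names and details are assumptions), synchronize + start write might look like:

use std::hash::Hash;
use std::sync::atomic::Ordering;

impl<K: Eq + Hash, V> WriteHandle<K, V> {
    fn start_write(&mut self) {
        // Steps 1-2: wait, parking if necessary, until no residual readers
        // remain on the map we are about to modify.
        if self.shared.residual.load(Ordering::Acquire) != 0 {
            park_writer(&self.shared.residual);
        }

        // Step 3: obtain a reference to the writable map (the one readers
        // are not currently using).
        let writable = unsafe { &mut *self.shared.maps[self.writable_index].get() };

        // Step 4: replay the operation log from the previous write so the
        // two maps converge again.
        for op in self.operations.drain(..) {
            match op {
                Operation::Insert(key, value) => {
                    writable.insert(key, value);
                }
                Operation::Remove(key) => {
                    writable.remove(&key);
                }
            }
        }
    }
}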

publish:

  1. Acquire the lock on the refcount array.
  2. Swap out the value of the field storing the writable map with the old map.
  3. Call swap_maps on each refcount in the array, and (non-atomically) accumulate a sum of the returned residual count. Call this sum initial_residual.
  4. Release the lock on the refcount array.
  5. Atomically add initial_residual to the actual residual counter the readers will be decrementing. If there were no residual readers, or they finished reading while performing 3, then the residual count is now 0, signaling that we don't need to park on the next call to synchronize.

Note that the read handles are not all swapped to the new map at the same time; this is done one by one.

An important invariant of this algorithm is that residual == 0 whenever publish is called. That way, either the writing thread will see residual == 0 after swapping all the maps, or one of the residual readers will see residual & isize::MAX == 1 as the old value when it performs its atomic decrement. In either case, this provides a definite signal that there are no more readers lingering on the old map.
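
A sketch of publish under the same assumptions as the previous sketches:

use std::sync::atomic::Ordering;

impl<K, V> WriteHandle<K, V> {
    fn publish(&mut self) {
        // Step 1: lock the refcount array so handles cannot be created or
        // dropped while the maps are being swapped.
        let refcounts = self.shared.refcounts.lock().unwrap();

        // Step 2: the map we just wrote to becomes readable, and the old
        // readable map becomes writable for the next write.
        self.writable_index = 1 - self.writable_index;

        // Step 3: move each reader over to the new map, accumulating how
        // many guards were still reading the old map at the moment of the
        // swap.
        let mut initial_residual: isize = 0;
        for (_key, &refcount) in refcounts.iter() {
            initial_residual += unsafe { &*refcount }.swap_maps() as isize;
        }

        // Step 4: release the lock on the refcount array.
        drop(refcounts);

        // Step 5: publish the residual count. If the residual readers have
        // already finished (or there were none), this brings the counter
        // back to 0 and the next synchronize will not need to park.
        self.shared
            .residual
            .fetch_add(initial_residual, Ordering::AcqRel);
    }
}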

Recap

Creating a read guard corresponds to an atomic increment, dropping a read guard corresponds to an atomic decrement, and at most one reader will execute the logic to unpark the writer. So overall, the readers' critical path is wait-free if this crate is compiled for an architecture which supports wait-free atomic fetch-adds; even if it isn't, or the unparking logic is engaged, those operations are still fairly cheap compared to hash map operations.

Creating a write guard involves executing the synchronize + start write algorithm and then flushing the operations from the previous write. When the write guard is dropped, the maps are swapped, but no modifications to either map are made until the next write guard is created. This means that the maps (after the first write) are technically always out of sync, but this is by design. If we assume that there is some non-negligible amount of time between writes, and that readers don't have large critical sections, then the writer will almost never have to wait! The time between writes gives the residual readers a chance to move to the new map, so when the writer goes to write again, it won't have to wait for them.