feat(sync): high performance channel implementation + other sync utilities #127

Closed
wants to merge 22 commits

Conversation


@ultd commented Apr 15, 2024

This PR adds new synchronization utilities, including a new channel implementation based on Rust's crossbeam but significantly simplified. Changes include:

  • sync/chanx.zig: a new Channel(T) enum implementation that makes it easy to swap out the underlying channel implementation if needed for fine-tuning, etc.
  • sync/backoff.zig: a simple exponential-backoff utility for busy-wait spin loops, emitting spin-loop hints to the processor and yielding once certain thresholds are reached (see the sketch after this list)
  • sync/bounded.zig: a bounded channel implementation based on Rust's crossbeam crate but simplified and tuned for better performance (~15-20% better than crossbeam)
  • sync/waker.zig: a thread-safe sleeper waker for waking up sleeping threads :)
  • sync/parker.zig: a thread parking utility using libc pthread mutexes and condition variables
  • sync/thread_context.zig: a thread-local context object that allows thread state to be shared with the Waker and holds the Parker
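
A minimal sketch of the backoff pattern described above (the constants and method names are illustrative assumptions, not the exact sync/backoff.zig API):

const std = @import("std");

// Illustrative sketch; constants and names are not the actual backoff.zig values.
pub const Backoff = struct {
    step: u6 = 0,

    const spin_limit: u6 = 6; // spin-hint up to 2^6 times per call
    const yield_limit: u6 = 10; // past this, blocking is likely cheaper

    /// Busy-wait briefly with processor spin-loop hints; once spinning is
    /// no longer productive, yield the thread instead.
    pub fn snooze(self: *Backoff) void {
        if (self.step <= spin_limit) {
            var i: usize = 0;
            while (i < (@as(usize, 1) << self.step)) : (i += 1) {
                std.atomic.spinLoopHint();
            }
        } else {
            std.Thread.yield() catch {};
        }
        if (self.step <= yield_limit) self.step += 1;
    }

    /// True once backoff has escalated far enough that the caller should
    /// block (e.g. park via the Parker) rather than keep spinning.
    pub fn isCompleted(self: *const Backoff) bool {
        return self.step > yield_limit;
    }
};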

Benchmarks:

Below is a benchmark run comparing the two channel implementations, Simple (old/original) and Bounded (new); the results are very positive:


Benchmark (Original/Old)                                               Iterations    Min(us)    Max(us)   Variance   Mean(us)
-----------------------------------------------------------------------------------------------------------------------------
benchmarkSimpleUsizeChannel(  10k_items,   1_senders,   1_receivers )          10       1312       2406     192967       1946
benchmarkSimpleUsizeChannel( 100k_items,   4_senders,   4_receivers )          10       7639      12252    1379970       9128
benchmarkSimpleUsizeChannel( 500k_items,   8_senders,   8_receivers )          10      36237      42131    2369551      38927
benchmarkSimpleUsizeChannel(   1m_items,  16_senders,  16_receivers )          10      74344      80740    4831451      76765
benchmarkSimpleUsizeChannel(   5m_items,   4_senders,   4_receivers )          10     416965     505078 1018181618     467792
benchmarkSimpleUsizeChannel(   5m_items,  16_senders,  16_receivers )          10     367941     392741   55627865     385650
---
benchmarkSimplePacketChannel(  10k_items,   1_senders,   1_receivers )         10       1592       3149     285884       2665
benchmarkSimplePacketChannel( 100k_items,   4_senders,   4_receivers )         10      48470      53722    2636601      50924
benchmarkSimplePacketChannel( 500k_items,   8_senders,   8_receivers )         10     220709     275109  200987985     262009
benchmarkSimplePacketChannel(   1m_items,  16_senders,  16_receivers )         10     507506     544079   94580709     531636
benchmarkSimplePacketChannel(   5m_items,   4_senders,   4_receivers )         10    2578779    2698791 1644020628    2636010
benchmarkSimplePacketChannel(   5m_items,  16_senders,  16_receivers )         10    2601328    2741971 2020243775    2676189

Benchmark (Bounded/New)                                                Iterations    Min(us)    Max(us)   Variance   Mean(us)
------------------------------------------------------------------------------------------------------------------------------
benchmarkBoundedUsizeChannel(  10k_items,   1_senders,   1_receivers )          10        375        454        539        423
benchmarkBoundedUsizeChannel( 100k_items,   4_senders,   4_receivers )          10       5098       8107    1018721       6790
benchmarkBoundedUsizeChannel( 500k_items,   8_senders,   8_receivers )          10      29823      47701   29292545      34609
benchmarkBoundedUsizeChannel(   1m_items,  16_senders,  16_receivers )          10      62282      80598   28132118      68877
benchmarkBoundedUsizeChannel(   5m_items,   4_senders,   4_receivers )          10     274868     299741   76854475     288690
benchmarkBoundedUsizeChannel(   5m_items,  16_senders,  16_receivers )          10     306616     337070   75109840     317430
---
benchmarkBoundedPacketChannel(  10k_items,   1_senders,   1_receivers )         10       1577       2643     175760       1898
benchmarkBoundedPacketChannel( 100k_items,   4_senders,   4_receivers )         10      10975      13987     714882      12273
benchmarkBoundedPacketChannel( 500k_items,   8_senders,   8_receivers )         10      54353      70342   16948254      59328
benchmarkBoundedPacketChannel(   1m_items,  16_senders,  16_receivers )         10     111428     125724   17477537     121133
benchmarkBoundedPacketChannel(   5m_items,   4_senders,   4_receivers )         10     524387     617334  768445809     544122
benchmarkBoundedPacketChannel(   5m_items,  16_senders,  16_receivers )         10     560125     609936  270490903     580478

One benchmark in particular to observe is benchmarkBoundedUsizeChannel( 5m_items, 4_senders, 4_receivers ), which runs at an impressive 288690 us, or 0.288 s, on average when compared to the analogous Rust crossbeam benchmark results below (+22% speedup):

    Finished release [optimized] target(s) in 0.02s
     Running `/Users/unlimited/crossbeam/target/release/crossbeam-channel`
bounded_mpmc              Rust crossbeam-channel   0.352 sec

The most meaningful speedups occur in the benchmarks with larger items, such as PacketChannel (~1.2 KB per item). Some stats below:

                                             Speedup:
  10k_items,   1_senders,   1_receivers  --  🟢 +40%
 100k_items,   4_senders,   4_receivers  --  🟢 +414%
 500k_items,   8_senders,   8_receivers  --  🟢 +441%
   1m_items,  16_senders,  16_receivers  --  🟢 +438%
   5m_items,   4_senders,   4_receivers  --  🟢 +484%
   5m_items,  16_senders,  16_receivers  --  🟢 +461%

That is, on average, a 450%+ speedup across the different combinations of items, senders, and receivers.

Another thing to note: the SimpleChannel (old/original) pre-allocates half the items in its benchmark, whereas the BoundedChannel (new) allocates a total of only 4096 items across all the tests, significantly reducing the channel's memory footprint (e.g., for the 5m-item packet runs, 5,000,000 × ~1.2 KB × ½ ≈ 2.9 GB versus 4096 × ~1.2 KB ≈ 5 MB). See memory footprint below:

                                             Old:  vs New:
  10k_items,   1_senders,   1_receivers  --  5.8MB vs 5.1MB
 100k_items,   4_senders,   4_receivers  --  58MB  vs 5.1MB
 500k_items,   8_senders,   8_receivers  --  293MB vs 5.1MB
   1m_items,  16_senders,  16_receivers  --  587MB vs 5.1MB
   5m_items,   4_senders,   4_receivers  --  2.9GB vs 5.1MB
   5m_items,  16_senders,  16_receivers  --  2.9GB vs 5.1MB

In addition to the above, the SimpleChannel receives by calling pop() on its underlying std.ArrayList(T), keeping the operation O(1) rather than O(N), which means ordering is not preserved. The BoundedChannel, by contrast, is backed by a ring buffer, so receiving is both ordered and always O(1).
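
To illustrate why a ring-buffer receive stays O(1) while preserving FIFO order, here is a minimal single-threaded sketch (illustrative only; the real Bounded(T) wraps this idea with atomic head/tail indices plus the backoff/waker machinery):

const std = @import("std");

// Illustrative sketch, not the actual bounded.zig code.
fn RingBuffer(comptime T: type) type {
    return struct {
        buf: []T,
        head: usize = 0,
        len: usize = 0,

        const Self = @This();

        // Write at (head + len); O(1), no shifting of existing items.
        pub fn send(self: *Self, item: T) bool {
            if (self.len == self.buf.len) return false; // full
            self.buf[(self.head + self.len) % self.buf.len] = item;
            self.len += 1;
            return true;
        }

        // Read at head and advance it; O(1) and FIFO, unlike ArrayList pop(),
        // which returns the most recently pushed item (LIFO).
        pub fn receive(self: *Self) ?T {
            if (self.len == 0) return null;
            const item = self.buf[self.head];
            self.head = (self.head + 1) % self.buf.len;
            self.len -= 1;
            return item;
        }
    };
}

test "ring buffer receive is FIFO" {
    var storage: [4]u32 = undefined;
    var ring = RingBuffer(u32){ .buf = &storage };
    _ = ring.send(1);
    _ = ring.send(2);
    try std.testing.expectEqual(@as(?u32, 1), ring.receive());
    try std.testing.expectEqual(@as(?u32, 2), ring.receive());
}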

Remaining todos:

  • Write comprehensive Channel(T) tests (with backing Bounded(T) implementation)
  • Benchmark other timeout methods like sendTimeout() and receiveTimeout(), etc.
  • Test channel operations thoroughly to look for deadlocks or data inconsistency/races
  • Test other utilities like Waker and Parker more thoroughly

@ultd force-pushed the ultd/high-perf-channel-impl branch from 0d0265f to 29e9dd2 on April 15, 2024 13:12
@ultd self-assigned this Apr 15, 2024
@ultd requested review from 0xNineteen and dnut on April 20, 2024 15:12
parker.unpark();
}

test "parker untimed" {

can we fix the tests to follow the sync.parker: format

const State = enum {
empty,
parked,
notified,

is there a better name for this? - maybe unpark_requested, or keep notified but add a comment above stating that this is the state after a request to unpark has been sent (similar to how you have comments on each state in thread_context)

return &thread_local_parker;
}

pub fn assertEq(left: anytype, right: @TypeOf(left)) void {

inline?
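
For reference, a sketch of the suggested inline variant (the body here is an illustrative assumption, since only the signature appears in this diff):

const std = @import("std");

pub inline fn assertEq(left: anytype, right: @TypeOf(left)) void {
    // Body is illustrative; only the signature is shown in the diff.
    // Marked inline so the comparison can be folded into the caller.
    std.debug.assert(left == right);
}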

}

pub fn park(self: *Self) void {
// if we were previously notified, we return early

would be nice to add a comment that we do this check first so that, when possible, we can return without taking the lock

// read from the write it made to `state`.
var old: State = @enumFromInt(self.state.swap(@intFromEnum(State.empty), .SeqCst));
assertEq(c.pthread_mutex_unlock(&self.lock), SUCCESS);
assertEq(old, State.notified);

couldn't this be false if park() is called multiple times, because the first compareAndSwap on line 58 doesn't use the mutex?

i.e., state = notified at the start of this block (switch(other_state)), but before self.state.swap() is called, park() is called again, which changes it to empty on line 58, which leads to old = empty and the assert being false?

}
}

test "sync.chanx.bounded works" {

can we keep consistent with the other test formats => ie, sync.chanx: bounded works

}
};

test "thread state conversion to/from usize" {

fix test format - sync.thread_context:

}

pub inline fn notify(self: *Self) void {
if (!self.is_empty.load(.SeqCst)) {

why do we do this check, then lock, then check again?

why not lock, then check?
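
For context, the shape in question is a check-lock-recheck fast path; a sketch under the assumption that the waker guards its sleepers with a mutex/condition pair (mutex and cond are assumed field names here; only is_empty comes from the snippet above):

pub inline fn notify(self: *Self) void {
    // Unlocked fast path: a single atomic load lets callers skip the mutex
    // entirely when nothing is sleeping.
    if (!self.is_empty.load(.SeqCst)) {
        // `mutex`/`cond` are assumed fields for this sketch.
        self.mutex.lock();
        defer self.mutex.unlock();
        // Re-check under the lock: the state may have changed between the
        // unlocked load above and acquiring the mutex.
        if (!self.is_empty.load(.SeqCst)) {
            self.cond.signal();
        }
    }
}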

}
}

if (timeout != null and (std.time.Instant.now() catch unreachable).order(timeout.?) == .gt)

would probably be clearer with an if (timeout) |time_o| { ... } block
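
For reference, a sketch of that suggestion applied to the line above (the body is elided in the diff, so a placeholder comment is used):

if (timeout) |time_o| {
    if ((std.time.Instant.now() catch unreachable).order(time_o) == .gt) {
        // ... timeout handling elided in the diff above
    }
}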

Comment on lines 364 to 368
if (self.n_receivers.load(.seq_cst) == 0)
return false;
if (self.n_receivers.fetchSub(1, .seq_cst) == 1) {
self.disconnect();
}

nit: can we be consistent with if statement brackets here

i.e., there are no brackets here, but two lines down there are brackets

on a more general note: idk if it's worth a discussion, but I would vote we always use brackets in if statements across the entire codebase - imo the option to bracket or not bracket single-line blocks goes against the simplicity/'one way to do it' of Zig, which I do not like


In this specific code example, I agree that it's best to use braces for both. Typically, I prefer to use braces in if statements.

In if expressions (when an if-else evaluates to a non-void value), I would argue to only use braces if it is actually required. For example:

const max = if (a > b) a else b;

Using braces, it would need to be written like this:

const max = if (a > b) blk: {
    break :blk a;
} else blk: {
    break :blk b;
};

Braces would only be required when a branch needs to execute multiple statements, since those statements have to be wrapped in a block.

@0xNineteen

this PR has been stale for a while now (merge conflicts will likely be expensive to resolve) - gonna close it and move it to this ticket, which contains more context and a ref to this PR:

https://github.com/orgs/Syndica/projects/2/views/1?pane=issue&itemId=71252009

@0xNineteen closed this Jul 17, 2024