feat(sync): high performance channel implementation + other sync utilities #127

Closed
wants to merge 22 commits

Conversation


@ultd commented Apr 15, 2024

This PR adds new synchronization utilities, including a new channel implementation based on Rust's crossbeam but significantly simplified. Changes include:

  • sync/chanx.zig: a new Channel(T) enum implementation that makes it easy to swap out the underlying channel implementation if needed for fine-tuning, etc.
  • sync/backoff.zig: a simple exponential-backoff utility for busy-wait spin loops, emitting spin-loop hints to the processor and yielding once certain thresholds are reached (see the sketch after this list)
  • sync/bounded.zig: a bounded channel implementation based on Rust's crossbeam crate but simplified and tuned for better performance (~15-20% better than crossbeam)
  • sync/waker.zig: a thread-safe sleeper waker for waking up sleeping threads :)
  • sync/parker.zig: a thread parking utility using libc pthread mutexes and condition variables
  • sync/thread_context.zig: a thread-local context object that allows thread state to be shared with the Waker and holds the Parker
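
A minimal sketch of the backoff pattern described above (the constants and method names are illustrative assumptions, not the exact sync/backoff.zig API):

const std = @import("std");

// Illustrative sketch; constants and names are not the actual backoff.zig values.
pub const Backoff = struct {
    step: u6 = 0,

    const spin_limit: u6 = 6; // spin-hint up to 2^6 times per call
    const yield_limit: u6 = 10; // past this, blocking is likely cheaper

    /// Busy-wait briefly with processor spin-loop hints; once spinning is
    /// no longer productive, yield the thread instead.
    pub fn snooze(self: *Backoff) void {
        if (self.step <= spin_limit) {
            var i: usize = 0;
            while (i < (@as(usize, 1) << self.step)) : (i += 1) {
                std.atomic.spinLoopHint();
            }
        } else {
            std.Thread.yield() catch {};
        }
        if (self.step <= yield_limit) self.step += 1;
    }

    /// True once backoff has escalated far enough that the caller should
    /// block (e.g. park via the Parker) rather than keep spinning.
    pub fn isCompleted(self: *const Backoff) bool {
        return self.step > yield_limit;
    }
};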

Benchmarks:

Below is a benchmark run comparing the two channel implementations, Simple (old/original) and Bounded (new); the results are very positive:


Benchmark (Original/Old)                                               Iterations    Min(us)    Max(us)   Variance   Mean(us)
-----------------------------------------------------------------------------------------------------------------------------
benchmarkSimpleUsizeChannel(  10k_items,   1_senders,   1_receivers )          10       1312       2406     192967       1946
benchmarkSimpleUsizeChannel( 100k_items,   4_senders,   4_receivers )          10       7639      12252    1379970       9128
benchmarkSimpleUsizeChannel( 500k_items,   8_senders,   8_receivers )          10      36237      42131    2369551      38927
benchmarkSimpleUsizeChannel(   1m_items,  16_senders,  16_receivers )          10      74344      80740    4831451      76765
benchmarkSimpleUsizeChannel(   5m_items,   4_senders,   4_receivers )          10     416965     505078 1018181618     467792
benchmarkSimpleUsizeChannel(   5m_items,  16_senders,  16_receivers )          10     367941     392741   55627865     385650
---
benchmarkSimplePacketChannel(  10k_items,   1_senders,   1_receivers )         10       1592       3149     285884       2665
benchmarkSimplePacketChannel( 100k_items,   4_senders,   4_receivers )         10      48470      53722    2636601      50924
benchmarkSimplePacketChannel( 500k_items,   8_senders,   8_receivers )         10     220709     275109  200987985     262009
benchmarkSimplePacketChannel(   1m_items,  16_senders,  16_receivers )         10     507506     544079   94580709     531636
benchmarkSimplePacketChannel(   5m_items,   4_senders,   4_receivers )         10    2578779    2698791 1644020628    2636010
benchmarkSimplePacketChannel(   5m_items,  16_senders,  16_receivers )         10    2601328    2741971 2020243775    2676189

Benchmark (Bounded/New)                                                Iterations    Min(us)    Max(us)   Variance   Mean(us)
------------------------------------------------------------------------------------------------------------------------------
benchmarkBoundedUsizeChannel(  10k_items,   1_senders,   1_receivers )          10        375        454        539        423
benchmarkBoundedUsizeChannel( 100k_items,   4_senders,   4_receivers )          10       5098       8107    1018721       6790
benchmarkBoundedUsizeChannel( 500k_items,   8_senders,   8_receivers )          10      29823      47701   29292545      34609
benchmarkBoundedUsizeChannel(   1m_items,  16_senders,  16_receivers )          10      62282      80598   28132118      68877
benchmarkBoundedUsizeChannel(   5m_items,   4_senders,   4_receivers )          10     274868     299741   76854475     288690
benchmarkBoundedUsizeChannel(   5m_items,  16_senders,  16_receivers )          10     306616     337070   75109840     317430
---
benchmarkBoundedPacketChannel(  10k_items,   1_senders,   1_receivers )         10       1577       2643     175760       1898
benchmarkBoundedPacketChannel( 100k_items,   4_senders,   4_receivers )         10      10975      13987     714882      12273
benchmarkBoundedPacketChannel( 500k_items,   8_senders,   8_receivers )         10      54353      70342   16948254      59328
benchmarkBoundedPacketChannel(   1m_items,  16_senders,  16_receivers )         10     111428     125724   17477537     121133
benchmarkBoundedPacketChannel(   5m_items,   4_senders,   4_receivers )         10     524387     617334  768445809     544122
benchmarkBoundedPacketChannel(   5m_items,  16_senders,  16_receivers )         10     560125     609936  270490903     580478

One benchmark in particular to observe is benchmarkBoundedUsizeChannel( 5m_items, 4_senders, 4_receivers ), which runs at an impressive 288690 us, or 0.288 s, on average when compared to the analogous Rust crossbeam benchmark results below (+22% speedup):

    Finished release [optimized] target(s) in 0.02s
     Running `/Users/unlimited/crossbeam/target/release/crossbeam-channel`
bounded_mpmc              Rust crossbeam-channel   0.352 sec

The most meaningful speedups occur in the benchmarks with larger items, such as PacketChannel (~1.2 KB per item). Some stats below:

                                             Speedup:
  10k_items,   1_senders,   1_receivers  --  🟢 +40%
 100k_items,   4_senders,   4_receivers  --  🟢 +414%
 500k_items,   8_senders,   8_receivers  --  🟢 +441%
   1m_items,  16_senders,  16_receivers  --  🟢 +438%
   5m_items,   4_senders,   4_receivers  --  🟢 +484%
   5m_items,  16_senders,  16_receivers  --  🟢 +461%

That is, on average, a 450%+ speedup across the different combinations of items, senders, and receivers.

Another thing to note: the SimpleChannel (old/original) pre-allocates half the items in its benchmark, whereas the BoundedChannel (new) allocates a total of only 4096 items across all the tests, significantly reducing the channel's memory footprint (e.g., for the 5m-item packet runs, 5,000,000 × ~1.2 KB × ½ ≈ 2.9 GB versus 4096 × ~1.2 KB ≈ 5 MB). See memory footprint below:

                                             Old:  vs New:
  10k_items,   1_senders,   1_receivers  --  5.8MB vs 5.1MB
 100k_items,   4_senders,   4_receivers  --  58MB  vs 5.1MB
 500k_items,   8_senders,   8_receivers  --  293MB vs 5.1MB
   1m_items,  16_senders,  16_receivers  --  587MB vs 5.1MB
   5m_items,   4_senders,   4_receivers  --  2.9GB vs 5.1MB
   5m_items,  16_senders,  16_receivers  --  2.9GB vs 5.1MB

In addition to the above, the SimpleChannel receives by calling pop() on its underlying std.ArrayList(T), keeping the operation O(1) rather than O(N), which means ordering is not preserved. The BoundedChannel, by contrast, is backed by a ring buffer, so receiving is both ordered and always O(1).
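
To illustrate why a ring-buffer receive stays O(1) while preserving FIFO order, here is a minimal single-threaded sketch (illustrative only; the real Bounded(T) wraps this idea with atomic head/tail indices plus the backoff/waker machinery):

const std = @import("std");

// Illustrative sketch, not the actual bounded.zig code.
fn RingBuffer(comptime T: type) type {
    return struct {
        buf: []T,
        head: usize = 0,
        len: usize = 0,

        const Self = @This();

        // Write at (head + len); O(1), no shifting of existing items.
        pub fn send(self: *Self, item: T) bool {
            if (self.len == self.buf.len) return false; // full
            self.buf[(self.head + self.len) % self.buf.len] = item;
            self.len += 1;
            return true;
        }

        // Read at head and advance it; O(1) and FIFO, unlike ArrayList pop(),
        // which returns the most recently pushed item (LIFO).
        pub fn receive(self: *Self) ?T {
            if (self.len == 0) return null;
            const item = self.buf[self.head];
            self.head = (self.head + 1) % self.buf.len;
            self.len -= 1;
            return item;
        }
    };
}

test "ring buffer receive is FIFO" {
    var storage: [4]u32 = undefined;
    var ring = RingBuffer(u32){ .buf = &storage };
    _ = ring.send(1);
    _ = ring.send(2);
    try std.testing.expectEqual(@as(?u32, 1), ring.receive());
    try std.testing.expectEqual(@as(?u32, 2), ring.receive());
}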

Remaining todos:

  • Write comprehensive Channel(T) tests (with backing Bounded(T) implementation)
  • Benchmark other timeout methods like sendTimeout() and receiveTimeout(), etc.
  • Test channel operations thoroughly to look for deadlocks or data inconsistency/races
  • Test other utilities like Waker and Parker more thoroughly

@ultd force-pushed the ultd/high-perf-channel-impl branch from 0d0265f to 29e9dd2 on April 15, 2024 13:12
@ultd self-assigned this Apr 15, 2024
@ultd requested review from 0xNineteen and dnut on April 20, 2024 15:12
parker.unpark();
}

test "parker untimed" {

can we fix the tests to follow the sync.parker: format

const State = enum {
empty,
parked,
notified,

is there a better name for this? - maybe unpark_requested, or keep notified but add a comment above stating that this is the state after a request to unpark has been sent (similar to how you have comments on each state in thread_context)

return &thread_local_parker;
}

pub fn assertEq(left: anytype, right: @TypeOf(left)) void {

inline?
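
For reference, a sketch of the suggested inline variant (the body here is an illustrative assumption, since only the signature appears in this diff):

const std = @import("std");

pub inline fn assertEq(left: anytype, right: @TypeOf(left)) void {
    // Body is illustrative; only the signature is shown in the diff.
    // Marked inline so the comparison can be folded into the caller.
    std.debug.assert(left == right);
}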

}

pub fn park(self: *Self) void {
// if we were previously notified, we return early

would be nice to add a comment that we do this check first so that, when possible, we can return without taking the lock

// read from the write it made to `state`.
var old: State = @enumFromInt(self.state.swap(@intFromEnum(State.empty), .SeqCst));
assertEq(c.pthread_mutex_unlock(&self.lock), SUCCESS);
assertEq(old, State.notified);

couldn't this be false if park() is called multiple times, because the first compareAndSwap on line 58 doesn't use the mutex?

i.e., state = notified at the start of this block (switch(other_state)), but before self.state.swap() is called, park() is called again, which changes it to empty on line 58, which leads to old = empty and the assert being false?

}
}

test "sync.chanx.bounded works" {

can we keep consistent with the other test formats => ie, sync.chanx: bounded works

}
};

test "thread state conversion to/from usize" {

fix test format - sync.thread_context:

}

pub inline fn notify(self: *Self) void {
if (!self.is_empty.load(.SeqCst)) {

why do we do this check, then lock, then check again?

why not lock, then check?
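
For context, the shape in question is a check-lock-recheck fast path; a sketch under the assumption that the waker guards its sleepers with a mutex/condition pair (mutex and cond are assumed field names here; only is_empty comes from the snippet above):

pub inline fn notify(self: *Self) void {
    // Unlocked fast path: a single atomic load lets callers skip the mutex
    // entirely when nothing is sleeping.
    if (!self.is_empty.load(.SeqCst)) {
        // `mutex`/`cond` are assumed fields for this sketch.
        self.mutex.lock();
        defer self.mutex.unlock();
        // Re-check under the lock: the state may have changed between the
        // unlocked load above and acquiring the mutex.
        if (!self.is_empty.load(.SeqCst)) {
            self.cond.signal();
        }
    }
}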

}
}

if (timeout != null and (std.time.Instant.now() catch unreachable).order(timeout.?) == .gt)

would probably be clearer with an if (timeout) |time_o| { ... } block
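
For reference, a sketch of that suggestion applied to the line above (the body is elided in the diff, so a placeholder comment is used):

if (timeout) |time_o| {
    if ((std.time.Instant.now() catch unreachable).order(time_o) == .gt) {
        // ... timeout handling elided in the diff above
    }
}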

Comment on lines 364 to 368
if (self.n_receivers.load(.seq_cst) == 0)
return false;
if (self.n_receivers.fetchSub(1, .seq_cst) == 1) {
self.disconnect();
}

nit: can we be consistent with if statement brackets here

i.e., there are no brackets here, but two lines down there are brackets

on a more general note: idk if it's worth a discussion, but I would vote we always use brackets in if statements across the entire codebase - imo the option to bracket or not bracket single-line blocks goes against the simplicity/'one way to do it' of Zig, which I do not like


In this specific code example, I agree that it's best to use braces for both. Typically, I prefer to use braces in if statements.

In if expressions (when an if-else evaluates to a non-void value), I would argue to only use braces if it is actually required. For example:

const max = if (a > b) a else b;

Using braces, it would need to be written like this:

const max = if (a > b) blk: {
    break :blk a;
} else blk: {
    break :blk b;
};

Braces would only be required when a branch needs to execute multiple statements, since those statements have to be wrapped in a block.

@0xNineteen

this PR has been stale for a while now (merge conflicts will likely be expensive to resolve) - gonna close it and move it to this ticket, which contains more context and a ref to this PR:

https://github.com/orgs/Syndica/projects/2/views/1?pane=issue&itemId=71252009

@0xNineteen closed this Jul 17, 2024