
client-api: Move websocket sender to its own tokio task #2906


Open

wants to merge 3 commits into master

Conversation

kim
Contributor

@kim kim commented Jun 26, 2025

Split the websocket stream into send and receive halves, and spawn a
new tokio task to handle the sending. Also move message serialization and
compression to a blocking task if the message appears to be large (a rough
sketch of this arrangement follows the list below).

This addresses two issues:

  1. The select! loop is not blocked on sending messages, and can thus
    react to auxiliary events. Namely, when a module exits, we want to
    terminate the connection as soon as possible in order to release any
    database handles.

  2. Large outgoing messages should not occupy tokio worker threads, in
    particular when there are a large number of clients receiving large
    initial updates.
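
A minimal sketch of this arrangement, using tokio-tungstenite types for illustration (the function, channel size, and ping handling below are hypothetical, not the actual client-api code):

```rust
use futures_util::{SinkExt, StreamExt};
use tokio::{net::TcpStream, sync::mpsc};
use tokio_tungstenite::{tungstenite::Message, WebSocketStream};

/// Rough sketch: the socket is split, a dedicated tokio task owns the send
/// half, and the receive loop only ever pushes into a channel, so it stays
/// responsive to auxiliary events.
async fn run_connection(ws: WebSocketStream<TcpStream>) {
    let (mut ws_tx, mut ws_rx) = ws.split();

    // Outgoing messages are queued here; only the sender task touches the sink.
    let (out_tx, mut out_rx) = mpsc::channel::<Message>(64);

    let sender = tokio::spawn(async move {
        while let Some(msg) = out_rx.recv().await {
            if ws_tx.send(msg).await.is_err() {
                break; // connection is gone, stop draining the queue
            }
        }
    });

    // The receive/select loop never awaits a send, so it can react promptly
    // to events such as a module exit and tear the connection down.
    while let Some(Ok(msg)) = ws_rx.next().await {
        if let Message::Ping(payload) = msg {
            let _ = out_tx.send(Message::Pong(payload)).await;
        }
        // ... handle other incoming messages; replies go through `out_tx` ...
    }

    drop(out_tx); // closing the queue lets the sender task exit
    let _ = sender.await;
}
```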

Expected complexity level and risk

4 - The state transitions remain hard to follow.

Testing

  • Ran a stress test with many clients and large initial updates,
    and observed no hangs or delays (which I did observe before this patch).
    In reconnection scenarios, all clients were disconnected promptly, but
    could reconnect almost immediately.

@kim kim requested review from Centril, gefjon and jsdt June 26, 2025 18:02
Contributor

@gefjon gefjon left a comment

I'd like to figure out what's going on with the SerializeBuffer and fix it before merging, but otherwise this looks good to me.

kim added 2 commits June 27, 2025 10:07
Also close the messages queue after the close went through.
Accordingly, closed and exited are the same -- we can just drop incoming
messages when closed.
@kim
Contributor Author

kim commented Jun 27, 2025

Updated to:

  • Reclaim the serialize buffer
  • Not send any more data after sending a Close frame (as mandated by the RFC)

I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
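
To illustrate the second point, a hypothetical sender-task body (tokio-tungstenite's `Message` stands in for the real message type): per RFC 6455, no further data frames may be sent after our own Close frame, so anything still queued is simply dropped while we wait for the peer's Close.

```rust
use futures_util::sink::{Sink, SinkExt};
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

/// Hypothetical sender-task body: after our Close frame goes out, nothing
/// else may be sent, so queued messages are dropped until the peer closes.
async fn sender_task<S>(mut ws_tx: S, mut out_rx: mpsc::Receiver<Message>)
where
    S: Sink<Message> + Unpin,
{
    let mut close_sent = false;
    while let Some(msg) = out_rx.recv().await {
        if close_sent {
            continue; // nothing may follow our Close frame
        }
        let is_close = matches!(msg, Message::Close(_));
        if ws_tx.send(msg).await.is_err() {
            break; // connection is gone
        }
        close_sent = is_close;
    }
}
```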

Contributor

@jsdt jsdt left a comment

I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, but I'd also like to try writing some tests for this so we can trigger some of these tricky cases.

    message: impl ToProtocol<Encoded = SwitchedServerMessage> + Send + 'static,
) -> (SerializeBuffer, Result<(), WsError>) {
    let (workload, num_rows) = metrics_metadata.unzip();
    let start_serialize = Instant::now();
Contributor

It would be nice to time serialization inside the blocking task, since the time spent switching to a blocking thread can be significant (especially if we use up all of our blocking threads).
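
A minimal sketch of that suggestion (the helper below is hypothetical, not the existing `asyncify`): start the timer inside the blocking closure so that time spent waiting for a blocking thread is not attributed to serialization.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: run a CPU-bound closure on tokio's blocking pool and
/// measure only the time spent inside it, excluding the wait for a thread.
async fn run_blocking_timed<T, F>(f: F) -> (T, Duration)
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    tokio::task::spawn_blocking(move || {
        let start = Instant::now(); // started *inside* the blocking task
        let out = f();
        (out, start.elapsed())
    })
    .await
    .expect("blocking serialization task panicked")
}
```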

// as serialization and compression can take a long time.
// The threshold of 1024 rows is arbitrary, and may need to be refined.
let (msg_alloc, msg_data) = if num_rows.is_some_and(|n| n > 1024) {
    asyncify(move || serialize(serialize_buf, message, config)).await
Contributor

Adding a blocking task here feels risky. I'd like to remove blocking tasks generally, and I think this would be the first place where the number of blocking tasks isn't tied to the number of requests per second.

What do you think about starting by adding the timing metric, and maybe a warning log message any time that serialization takes longer than some threshold?

An alternative would be sending these to rayon instead of blocking threads, which would limit the number of threads working on serialization, and put the work on pinned cores.
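
A sketch of that alternative, assuming rayon's global pool plus a tokio oneshot channel to hand the result back to the async side (the helper name is illustrative):

```rust
use tokio::sync::oneshot;

/// Run a CPU-bound closure (e.g. serialization + compression) on rayon's
/// bounded thread pool instead of tokio's unbounded blocking pool, and await
/// the result from async code via a oneshot channel.
async fn run_on_rayon<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // If the receiver is gone the connection was dropped; discard the result.
        let _ = tx.send(f());
    });
    rx.await.expect("rayon worker dropped without sending a result")
}
```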

Contributor Author

The time this takes can be >1 sec in pathological cases; I only added the `asyncify` after measuring that.

Using rayon instead seems fine.

    log::warn!("error sending ping: {e:#}");
}
// If the sender is already gone,
// we'll time out the connection eventually.
Contributor

Why not break from the loop here instead of waiting for a timeout?

Contributor Author

`unordered_tx.send` fails if either the connection is bad or we have already sent a close frame ourselves. Without more rework, I can't distinguish those cases. In the latter case we need to keep polling the recv end until the other end responds with a close.
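
To illustrate the latter case (a hypothetical shape, not the actual loop): after our Close frame has gone out, the receive half still has to be polled until the peer's Close arrives, so a failed `unordered_tx.send` by itself isn't enough reason to break.

```rust
use futures_util::stream::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

/// Keep polling the receive half until the peer answers our Close frame,
/// the stream ends, or an error occurs, completing the close handshake.
async fn drain_until_close<S, E>(mut ws_rx: S)
where
    S: Stream<Item = Result<Message, E>> + Unpin,
{
    while let Some(Ok(msg)) = ws_rx.next().await {
        if matches!(msg, Message::Close(_)) {
            break; // handshake complete; the connection can now be torn down
        }
        // Other frames arriving after our Close are ignored.
    }
}
```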

    .expect("should have a unique referent to `msg_alloc`");

// Ignoring send errors is apparently fine in this case.
let _ = unordered_tx.send(err.into());
Contributor

Similar question here. It feels like we should always break if this fails.

Contributor Author

This I don't know. It was like this before.

enum UnorderedWsMessage {
    Close(CloseFrame),
    Ping(Bytes),
    Error(MessageExecutionError),
Contributor

If this includes errors like a reducer failing, then we do want it to be ordered with subscription updates, since the reason for the reducer failing could be data-dependent.

Contributor Author

It is not an error result of a reducer call (that will appear on the `MeteredReceiver`), but an error calling the reducer in the first place, for example if the reducer does not exist or the arguments are wrong. In other words, the reducer wasn't actually called.

I'm less sure about the other message types, e.g. subscribe commands. But I'd like to point out that it was not ordered in the code before this patch.
