
Asynchronous send: stop game thread blocking on slow clients #10535

Merged
tool4ever merged 1 commit into Card-Forge:master from MostCromulent:NetworkPlay/async-send
Apr 27, 2026

Conversation

@MostCromulent
Contributor

What's the issue

In network multiplayer, server writes move at the speed of the slowest client. The cost of one bad connection (high latency, low effective throughput, slow client-side processing, anything that delays TCP acknowledgments) is paid by every player in the game, because the game thread is shared.

This is because the server's game thread cannot fire a new event to remote clients until every connected client has received the previous one. Each event goes through channel.writeAndFlush(event).sync() per client, in sequence — and .sync() blocks the calling thread until the write has been accepted by the OS socket buffer, which can only happen after the receiver's TCP stack has acknowledged enough prior bytes. Slow clients hold up the chain.
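For concreteness, a minimal sketch of that pattern in Netty terms (loop and field names are illustrative, not the exact Forge source; channel is an io.netty.channel.Channel):

    // Pre-change shape: one synchronous write per connected client, in
    // sequence, all on the shared game thread.
    for (final Channel channel : clientChannels) {
        try {
            // sync() parks the game thread until this client's write completes;
            // one slow connection delays every write that follows.
            channel.writeAndFlush(event).sync();
        } catch (final Exception e) {
            System.err.println("Unable to send event to client: " + e);
        }
    }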

Bursts of game activity that produce many events in a short time multiply the effect — each event in the burst waits for the full slowest-client round-trip before the next can fire.

What's the history

The synchronous behaviour does not appear to be an intentional design choice. Git history shows .sync() was added in February 2018 (802036f693e, "print a message when unable to send event") so write exceptions would flow into a surrounding try/catch for logging. Without .sync(), writeAndFlush returns a ChannelFuture whose failure is reported asynchronously via a listener, which the 2018 patch did not use. The original Forge networking before Feb 2018 was async. The proper fix in 2018 would have been to attach an error listener.

What's the fix

Replace the synchronous channel.writeAndFlush(event).sync() calls in RemoteClient with asynchronous channel.writeAndFlush(event) plus a ChannelFuture listener for error reporting. The game thread fires the write and continues; success/failure is observed off-thread.
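A hedged sketch of what the new send() looks like under that change (the channel and username fields and the NetEvent parameter type are illustrative; Netty imports are elided; only the writeAndFlush-plus-listener pattern is the point):

    // Post-change shape: fire the write and return immediately; the outcome
    // is observed by the listener on a Netty I/O thread, not the game thread.
    public void send(final NetEvent event) {
        channel.writeAndFlush(event).addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                // Inline the cause so it is visible even if the logging
                // configuration doesn't render throwables.
                System.err.println("Unable to send " + event + " to " + username
                        + ": " + future.cause());
            }
        });
    }

writeAndFlush() already returns a ChannelFuture, so attaching a ChannelFutureListener is the standard Netty way to observe success or failure without blocking the caller.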

User-facing effect

  • The host (running the server): their local UI is driven by the same game thread that today parks on .sync(). With the change, the host's UI updates in real time regardless of any remote client's network. This is often the most visible benefit: the host is usually actively playing, and today their game freezes whenever a remote opponent's link stalls.
  • Other remote players in a game with one slow participant: their UI no longer freezes when a different opponent has a bad connection. Each remote client experiences only their own connection's quality, not the slowest player's.
  • The slow remote participant themselves: unchanged in feel — their UI updates as fast as their own connection allows. They simply stop dragging everyone else along when it isn't their decision to make.
  • Whichever player the engine is currently waiting on for a response (priority decisions, target selection, mulligan, blocker declarations, choosing modes, ordering triggers; anything routed through sendAndWait): the game still waits for their reply, by design. The engine cannot progress without the answer, before or after the change. If the player on the bad connection happens to be the one being prompted, they're the bottleneck, but that's current behaviour, not a regression.

How it works

Three changes in RemoteClient:

  • send() writes asynchronously; errors are logged via a listener. The old "send() blocked X ms" telemetry is removed: once the call is async the calling thread doesn't block, so blocked-ms is meaningless. The equivalent signal (a client falling behind) now comes from saturation logging; see below.
  • sendAndWait() no longer delegates to send(); it does its own async writeAndFlush() with a listener that completes the reply future with null on write failure, so the game thread doesn't park for up to 45 s waiting for the heartbeat to detect a dead connection.
  • write() is unchanged in send semantics — it gains a saturation-counter increment, nothing more.

Client synchronisation is preserved:

The sendAndWait() path is the mechanism for "server fires a request and waits for the client's answer" — the engine genuinely cannot progress past a prompt (target selection, mulligan, ordering triggers, paying X) without the response. The reply-future wait inside sendAndWait() is what enforces that synchronisation, and it is unchanged. The async refactor only removes the additional block on the write itself flushing, which today happens on top of the reply wait, contributing nothing to correctness.

Per-request wait goes from (write_flush + client_processing + reply) to (client_processing + reply). The "waiting for the client's answer" part is preserved exactly. If a write actually fails (channel closing, encoder error), the listener now completes the reply future with null immediately, and callers already handle null as "client did not answer" — strictly faster than today, where they parked until the 45s heartbeat noticed.
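As a hedged illustration of that contract, here is a sketch of the sendAndWait() shape under this change, with a CompletableFuture standing in for whatever reply-tracking the real code uses (the pendingReplies registry and getId() call are hypothetical; Netty imports are elided):

    public Object sendAndWait(final NetEvent event) {
        final CompletableFuture<Object> reply = new CompletableFuture<>();
        pendingReplies.put(event.getId(), reply);   // hypothetical reply registry

        channel.writeAndFlush(event).addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                // The request never reached the client: answer with null now
                // rather than letting the caller park until the heartbeat timeout.
                reply.complete(null);
            }
        });

        // The synchronisation that matters is unchanged: the game thread still
        // blocks here until the client answers (or null arrives on write failure).
        return reply.join();
    }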

Saturation error logging (replaces send() blocked X ms)

With async writes, per-call blocking time is no longer a meaningful signal — but we still want to know when a client can't keep up. The replacement observes Netty's outbound buffer directly:

  • WriteBufferWaterMark(64 KB low, 1 MB high) configured on each client channel. The thresholds are hard-coded constants, applied identically to every client (see the sketch after this list).

  • New SaturationLoggingHandler listens for channelWritabilityChanged events and logs entry/recovery as a paired set of lines:

    [WARN ] RemoteClient: Outbound buffer saturated for PlayerName (pending 1.2 MB)
    [INFO ] RemoteClient: PlayerName recovered after 1.3s outbound buffer saturation (47 server sends)
    

    The "47 server sends" counter is incremented inside send()/write()/sendAndWait() whenever isWritable() is false, so the recovery line shows how many writes piled up during the saturation window.

  • NetworkLogAnalyzer swaps its old send() blocked X ms parser for one that reads the new recovery lines, reporting episode count and total / average / max duration in the network performance section.
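A hedged sketch of the watermark configuration and handler described above. SaturationLoggingHandler, the watermark values, the log wording, and the send counter come from this description; the field names and the use of bytesBeforeWritable() for the pending-bytes figure are illustrative:

    import java.util.concurrent.atomic.AtomicInteger;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // In FServerManager's bootstrap, per accepted client channel:
    //   childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
    //               new WriteBufferWaterMark(64 * 1024, 1024 * 1024));

    final class SaturationLoggingHandler extends ChannelInboundHandlerAdapter {
        private final String playerName;
        private final AtomicInteger sendsDuringSaturation; // bumped by send()/write()/sendAndWait()
        private long saturatedSinceMillis = -1;

        SaturationLoggingHandler(final String playerName, final AtomicInteger sendsDuringSaturation) {
            this.playerName = playerName;
            this.sendsDuringSaturation = sendsDuringSaturation;
        }

        @Override
        public void channelWritabilityChanged(final ChannelHandlerContext ctx) {
            if (!ctx.channel().isWritable()) {
                // Outbound buffer crossed the 1 MB high-water mark: entering saturation.
                saturatedSinceMillis = System.currentTimeMillis();
                sendsDuringSaturation.set(0);
                System.err.printf("Outbound buffer saturated for %s (pending %d bytes)%n",
                        playerName, ctx.channel().bytesBeforeWritable());
            } else if (saturatedSinceMillis > 0) {
                // Drained back below the 64 KB low-water mark: recovered.
                final long elapsedMillis = System.currentTimeMillis() - saturatedSinceMillis;
                saturatedSinceMillis = -1;
                System.out.printf("%s recovered after %.1fs outbound buffer saturation (%d server sends)%n",
                        playerName, elapsedMillis / 1000.0, sendsDuringSaturation.get());
            }
            ctx.fireChannelWritabilityChanged();
        }
    }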


🤖 Generated with Claude Code

…mers

Replaces channel.writeAndFlush(event).sync() in send() and sendAndWait()
with addListener on the returned ChannelFuture: the game thread fires
and continues, errors are reported via listener (cause inlined since
tinylog config doesn't render throwables), and sendAndWait completes
its reply with null on write failure so it doesn't park for the 45 s
heartbeat timeout.

FServerManager configures WriteBufferWaterMark(64 KB low / 1 MB high)
on each client channel and adds a SaturationLoggingHandler. When a
channel's outbound buffer crosses the high mark and isWritable() flips
false, the handler logs an entry warning with pending bytes; on
recovery it logs a single line with the elapsed duration and the count
of server sends attempted during the window (tracked via an
AtomicInteger on RemoteClient, incremented inside send/write/
sendAndWait whenever isWritable() is false).

Network log analyzer replaces the now-dead "send() blocked X ms"
parsing with parsing for the new saturation recovery lines, exposing
episode count, total / average / max duration, and total queued sends
in the network performance section of analysis reports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tool4ever
Contributor

seems good (as long as we don't switch to UDP :D)

@MostCromulent MostCromulent marked this pull request as ready for review April 27, 2026 09:32
@tool4ever tool4ever merged commit 136890f into Card-Forge:master Apr 27, 2026
2 checks passed
@MostCromulent MostCromulent deleted the NetworkPlay/async-send branch April 27, 2026 20:41