Skip to content

Fix encoder/game thread race; add ID fallback in event refs#10561

Merged
tool4ever merged 4 commits intoCard-Forge:masterfrom
MostCromulent:fix-encoder-race
Apr 30, 2026
Merged

Fix encoder/game thread race; add ID fallback in event refs#10561
tool4ever merged 4 commits intoCard-Forge:masterfrom
MostCromulent:fix-encoder-race

Conversation

@MostCromulent
Copy link
Copy Markdown
Contributor

Background

Investigating user report of mid-game network freezes/disconnects on Android multiplayer.

User logs from master showed EncoderException: ArrayIndexOutOfBoundsException: length=247; index=247 on setGameView messages, surfaced by PR #10535's Network send error listener. These errors are caught in the encoder pipeline and the game continues, so they don't directly explain the game freeze. But they pointed at a real concurrency bug worth investigating.

To dig further, PR #10558 added supplementary instrumentation. Running manual and batch tests with that instrumentation in place revealed a second pattern that wasn't visible in either master or the user logs: Could not resolve IdRef warnings followed by NPEs in EventVisualizer/GameLogFormatter. The NPE is caught by GameProtocolHandler and the game continues — the user-visible cost is a transient UI issue such as a missing sound or game-log entry, not a crash.

Root cause of the user-reported freeze is still unknown, and these bugs aren't proven to cause them — but ruling them out either way is a correctness fix.

What was wrong

1. Encoder/game-thread race on the TrackableObject EnumMap

PR #10535 (Async writes in RemoteClient: stop game thread blocking on slow consumers) replaced synchronous .sync() writes with async listener-based ones.

The change was correct but had a side effect: the Netty MessageToByteEncoder runs on the I/O event-loop thread, and pre-PR-10535 the .sync() implicitly guaranteed the game thread was parked while it ran. Post-PR-10535 the game thread continues mutating TrackableObject state via set() while the encoder serializes the same EnumMap. EnumMap.writeObject uses a size-then-iterate sequence; a concurrent put/remove makes iteration walk past the array bounds → AIOOBE.

Compounding: RemoteClient.write (used by the event-batching code path) had no failure listener — only send did. When the encoder threw on a setGameView queued via write, the failure was silent; the message was dropped on the wire, leaving the client's tracker missing the just-mutated state.

2. Event refs assume the receiver's tracker has every id

The wrap-events mechanism replaces nested CardView/PlayerView refs inside each GameEvent with compact wire markers (IdRef/StaleCardRef). This design predates the current code: the IdRef marker was introduced by PR #10018 (in GameEventProxy), StaleCardRef was added later by PR #9642 (delta sync), and PR #10343 consolidated both into TrackableSerializer.wrapEvents while removing GameEventProxy. Resolution requires the receiver's tracker to have the referenced id.

That assumption is solid for top-level protocol method args (selectCard(cv) — the server only invokes the method after registering the object) but not for refs inside events: events can carry CardViews that aren't reachable from GameView (transient zone-change copies, off-graph refs). On a tracker miss, resolution returns null and downstream visitors throw NPE on card.getCurrentState(). The NPE is caught by GameProtocolHandler and the game continues — the user-visible cost is a missing sound or game-log entry, not a crash.

StaleCardRef already carried name + imageKey for the specific case where the server's tracker held a different incarnation of the same id, but the general "tracker doesn't have this id at all" path was uncovered. PR #10343 didn't introduce this gap — it inherited it from the older GameEventProxy design.

The fix

1. Encode on the calling thread

CompatibleObjectEncoder exposes encodeToBuf(msg, alloc) — runs the encode logic synchronously on the calling thread and returns a fresh ByteBuf. RemoteClient.send/write/sendAndWait and FGameClient.send now call this before channel.writeAndFlush(byteBuf). The network I/O remains async — only the encoding step moves.

The thread race becomes structurally impossible: the calling thread is the only mutator of the data being serialized, so reads are sequential with writes by definition. RemoteClient.write also gains a failure listener so any future encode failures surface instead of silently dropping the message.

Alternative option considered but rejected was to override TrackableObject.writeObject to take a brief synchronized snapshot of the props EnumMap and serialize the snapshot. That removes the EnumMap-level race without changing where encoding runs. However on testing the snapshot was shallow, and EnumMap values that are themselves mutable lists aren't covered. Snapshotting "deep enough" reduces to re-implementing serialization, with a fresh defect at every nested data structure. Solving the root cause of the race is the simpler architectural solution.

2. Event refs carry display fallback data

Replaced IdRef/StaleCardRef for event-internal refs with EventCardRef(id, name, imageKey, preserveSnapshot) and EventPlayerRef(id, name, preserveSnapshot). resolve() tries the tracker first; on a miss, constructs a detached view from the embedded fields. The preserveSnapshot=true flag handles the previous StaleCardRef case (server's tracker holds a different incarnation than the event captured). IdRef is retained for top-level protocol method args, where tracker presence is guaranteed by construction.


🤖 Generated with Claude Code

The encoder previously ran on the Netty I/O thread, reading TrackableObject
state concurrently with the game thread's set() calls. EnumMap.writeObject
would intermittently fail with AIOOBE, and because RemoteClient.write had
no failure listener, the dropped setGameView left the client tracker in an
inconsistent state. Move encoding onto the calling thread (which is also the
only mutator) and hand the resulting ByteBuf to the channel for async write,
preserving non-blocking sends.

Inside wrapped events, replace IdRef/StaleCardRef with a unified EventCardRef/
EventPlayerRef carrying name and imageKey. resolve() prefers the tracker but
falls back to a detached view when the id isn't there — handles CardViews
referenced by events that aren't reachable from GameView, without relying on
cross-message ordering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MostCromulent MostCromulent requested a review from tool4ever April 29, 2026 11:13
@MostCromulent MostCromulent added BUG Something isn't working Netplay labels Apr 29, 2026
Comment thread forge-gui/src/main/java/forge/gamemodes/net/TrackableSerializer.java Outdated
Comment thread forge-gui/src/main/java/forge/gamemodes/net/TrackableSerializer.java Outdated
MostCromulent and others added 2 commits April 30, 2026 05:45
Address review on PR <!-- -->Card-Forge#10561:

- Drop EventPlayerRef and route PlayerView through IdRef in event mode.
  PlayerView is stable in the tracker for the lifetime of a game
  (Player.getView() returns a stable instance, no zone-change copies),
  so the snapshot/fallback path was guarding a case that doesn't fire.
  Top-level protocol args have always trusted the tracker for PlayerView;
  events now do the same. Saves the embedded name on the wire.

- Restore the JavaDoc on shouldReplaceTrackables that was over-trimmed,
  including the rationale for why applyDelta is not in the exclusion
  list.

- Convert IdRef and EventCardRef to records. Same semantics, less
  boilerplate, free toString() for net log output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review on PR <!-- -->Card-Forge#10561 (the JavaDoc-removal thread).

Restores the seven JavaDoc blocks that were dropped in the original
fix commit: class header, replace(), resolve(), measureSize(),
WrappedEvent, wrapEvents(), unwrapEvents(). Four come back verbatim;
three pick up StaleCardRef -> EventCardRef renames, and replace()
adds a paragraph for the new eventMode/preserveSnapshot semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MostCromulent MostCromulent marked this pull request as ready for review April 29, 2026 23:24
@tool4ever tool4ever merged commit ec6f1a6 into Card-Forge:master Apr 30, 2026
2 checks passed
@MostCromulent MostCromulent deleted the fix-encoder-race branch April 30, 2026 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BUG Something isn't working Netplay

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants