fix(broker): preserve split multi-byte UTF-8 in worker_stream (#922) #923

Merged: willwashburn merged 2 commits into main from claude/fix-issue-922-4REVi on May 20, 2026

@willwashburn
Member

Fixes #922.

Problem

crates/broker/src/pty_worker.rs decoded each raw PTY chunk with String::from_utf8_lossy(&chunk) in two places (the steady-state read path and the watchdog drain). When a multi-byte UTF-8 codepoint (box-drawing glyphs, emoji, CJK characters) landed across a PTY read boundary, the partial byte sequences on either side of the boundary were replaced with U+FFFD. worker_stream consumers then rendered � or ? even though the broker's VT snapshot path stayed correct, which matches what Pear was seeing in Claude Code box UI output.
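To make the failure mode concrete, here is a minimal sketch of per-chunk lossy decoding. The helper `decode_chunks_lossy` is hypothetical and only imitates the pattern described above; it is not code from the broker.

```rust
/// Imitates the pre-fix pattern: every chunk is decoded on its own with
/// `String::from_utf8_lossy`, so no state survives between chunks.
/// (Hypothetical helper for illustration; not part of the broker.)
fn decode_chunks_lossy(chunks: &[&[u8]]) -> String {
    let mut out = String::new();
    for chunk in chunks {
        // A lone lead byte or a stray continuation byte is invalid in
        // isolation, so each one becomes U+FFFD here.
        out.push_str(&String::from_utf8_lossy(chunk));
    }
    out
}
```

Feeding U+2500 (E2 94 80) as one chunk returns the glyph intact. Feeding it as [E2] followed by [94 80] returns three U+FFFD characters: one for the truncated lead byte and one for each orphaned continuation byte.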

Fix

New crates/broker/src/util/utf8_stream.rs exposes Utf8StreamDecoder:

  • Holds incomplete trailing UTF-8 bytes across decode calls and combines them with the next chunk.
  • Emits U+FFFD only for byte sequences std::str::from_utf8 reports as definitively invalid (non-zero error_len), never for incomplete-but-recoverable sequences.
  • flush() drains any remaining buffered bytes at end-of-stream and substitutes U+FFFD for the now-definitive truncation, so we don't silently drop bytes on child exit.
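The three behaviors listed above can be sketched roughly as follows. This is an assumption-laden illustration, not the contents of utf8_stream.rs: the `pending` field name comes from the review comments in this thread, while the method bodies are one plausible way to build the described behavior on `std::str::from_utf8` and `Utf8Error::error_len`.

```rust
use std::str;

/// Illustrative sketch only; the real decoder in utf8_stream.rs may differ.
#[derive(Default)]
pub struct Utf8StreamDecoder {
    /// Incomplete trailing bytes carried over from the previous chunk.
    pending: Vec<u8>,
}

impl Utf8StreamDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Decodes `chunk`, holding back any incomplete trailing sequence.
    pub fn decode(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        let mut out = String::with_capacity(self.pending.len());
        let mut pos = 0;
        while pos < self.pending.len() {
            match str::from_utf8(&self.pending[pos..]) {
                Ok(valid) => {
                    out.push_str(valid);
                    pos = self.pending.len();
                }
                Err(err) => {
                    let valid_end = pos + err.valid_up_to();
                    // The prefix up to `valid_up_to` is valid UTF-8, so the
                    // lossy call copies it through unchanged.
                    out.push_str(&String::from_utf8_lossy(&self.pending[pos..valid_end]));
                    match err.error_len() {
                        // Definitively invalid bytes: replace and skip them.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            pos = valid_end + bad;
                        }
                        // Incomplete tail: keep it for the next chunk.
                        None => {
                            pos = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..pos);
        out
    }

    /// Emits one U+FFFD for a truncated tail at end-of-stream, else nothing.
    pub fn flush(&mut self) -> String {
        if self.pending.is_empty() {
            return String::new();
        }
        self.pending.clear();
        "\u{FFFD}".to_string()
    }
}
```

In this sketch `decode` returns an empty string while a codepoint is still incomplete, and a second `flush` call returns an empty string because the first one cleared the buffer.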

pty_worker.rs owns one decoder for the worker's lifetime and routes both the main pty_rx.recv() arm and the watchdog try_recv drain through it. Raw bytes are still fed to the VT parser unchanged; only the worker_stream text payload changes.

Tests

12 new unit tests in util::utf8_stream::tests cover:

  • Plain ASCII passthrough across multiple decode calls.
  • U+2500 (E2 94 80) split into [E2] then [94 80] — the case from the issue.
  • Every possible byte-boundary split of U+2588 (3-byte), U+4E2D 中 (3-byte CJK), and U+1F600 😀 (4-byte emoji).
  • A mixed string sent one byte at a time, then flushed, with no replacement characters.
  • Mid-stream invalid bytes (0x41 0xFF 0x42 → A\u{FFFD}B) and invalid continuation after a valid lead byte.
  • flush() emits U+FFFD for a truncated tail and is idempotent.
  • Result matches String::from_utf8_lossy for fully-formed inputs (no behavior change when chunks are already complete).

cargo test -p agent-relay-broker --lib → 625 passed, 0 failed. cargo clippy -p agent-relay-broker --no-deps → clean.

Test plan

  • New unit tests pass (cargo test -p agent-relay-broker --lib utf8_stream)
  • Full broker lib test suite still passes (625 tests)
  • cargo clippy -p agent-relay-broker --no-deps clean
  • Manual verification with a CLI that emits box-drawing characters (Claude Code) over worker_stream

Generated by Claude Code

PTY reads can land between bytes of a multi-byte UTF-8 codepoint, and
`String::from_utf8_lossy` on each chunk replaces the partial sequence
with `U+FFFD`. Downstream `worker_stream` consumers (Pear, attached
CLIs) then render box/line glyphs, emoji, and CJK characters as `?`.

Introduce `Utf8StreamDecoder`, a small stateful decoder that holds the
incomplete trailing bytes for the next chunk and only emits
replacement characters for byte sequences that are definitively
invalid. Use it for both the steady-state PTY read path and the
watchdog drain at child exit, and flush any leftover incomplete bytes
when the stream finally closes.

Unit tests cover ASCII, every byte-boundary split for U+2500, U+2588,
U+4E2D (CJK), and U+1F600 (emoji), full byte-by-byte streaming of a
mixed string, mid-stream invalid bytes, and end-of-stream truncation.
@willwashburn willwashburn requested a review from khaliqgant as a code owner May 20, 2026 01:30
@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5e3d581b-69b6-431b-8b20-29abd6b6c8b8

📥 Commits

Reviewing files that changed from the base of the PR and between 08e0e4e and 444edd8.

📒 Files selected for processing (1)
  • crates/broker/src/pty_worker.rs

Walkthrough

Adds a stateful Utf8StreamDecoder, exports it, and replaces lossy per-chunk decoding in PTY startup and watchdog paths with streaming decoding and a final flush so multi-byte UTF-8 characters split across reads are preserved.

Changes

Streaming UTF-8 Decoder for PTY Output

| Layer / File(s) | Summary |
| --- | --- |
| Streaming UTF-8 Decoder Implementation: crates/broker/src/util/utf8_stream.rs | New Utf8StreamDecoder with decode() and flush(). Buffers incomplete trailing bytes across calls, emits complete decoded text, and substitutes U+FFFD only for invalid sequences. Includes unit tests for ASCII, split multi-byte codepoints (box drawing, CJK, emoji), byte-by-byte streaming, invalid bytes, truncation via flush(), and equivalence to String::from_utf8_lossy on complete inputs. |
| Module Export and Decoder Initialization: crates/broker/src/util/mod.rs, crates/broker/src/pty_worker.rs | Exports utf8_stream from util, imports Utf8StreamDecoder in pty_worker.rs, and initializes a decoder instance used by PTY read paths. |
| Startup Output Decoding: crates/broker/src/pty_worker.rs | Replaces String::from_utf8_lossy with utf8_decoder.decode(&chunk) during startup collection. If decode() returns an empty string (incomplete trailing prefix), emission is skipped while prompt-detection/buffering continues; raw byte counters remain unchanged. |
| Stream Close: Flush into Buffers: crates/broker/src/pty_worker.rs | On PTY reader close, calls utf8_decoder.flush() and appends any flushed tail into stream_buffer and echo_buffer before the final worker_stream flush to preserve output ordering. |
| Watchdog Late Output Decoding: crates/broker/src/pty_worker.rs | In the watchdog late-output path, drains PTY chunks and decodes via utf8_decoder.decode() (skipping empty results). On stream close, flushes the decoder, appends flushed tail to late_output, and forwards it via extra worker_stream frames before sending agent_exit. |
| Changelog Entry: CHANGELOG.md | Documents that PTY worker_stream events now preserve multi-byte UTF-8 characters split across read chunks, preventing U+FFFD replacement glyphs. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • khaliqgant

Poem

🐰 I nibble bytes in careful rows,
I stitch split codepoints where wind blows.
No rogue � will mar the stream,
Each glyph returns to join the theme.
Hooray—box art and emoji gleam! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(broker): preserve split multi-byte UTF-8 in worker_stream (#922)' clearly and concisely summarizes the main change: fixing UTF-8 corruption in worker_stream output. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the problem, fix, tests, and test plan with detailed explanations matching the template expectations. |
| Linked Issues check | ✅ Passed | The PR fully implements all requirements from issue #922: a stateful UTF-8 streaming decoder [utf8_stream.rs], integration into pty_worker.rs for both main and watchdog paths, proper handling of split multi-byte characters, and comprehensive unit tests covering split codepoint scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing issue #922: new UTF-8 decoder module, pty_worker integration, module export, and CHANGELOG update. No out-of-scope changes present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

🟡 Missing utf8_decoder.flush() in pty_closed termination path silently drops trailing bytes

When the PTY channel closes normally (pty_rx.recv() returns None at line 871), the utf8_decoder is not flushed, so any incomplete multi-byte codepoint held in utf8_decoder.pending is silently dropped. This contradicts the PR's intent: the watchdog exit path (crates/broker/src/pty_worker.rs:1134-1143) correctly calls utf8_decoder.flush() to emit the remaining bytes as U+FFFD, but the pty_closed path does not. When the PTY reader closes, no further chunks will arrive to complete the pending bytes, so they are permanently lost — neither emitted via worker_stream nor included in the echo_buffer/last_output.

(Refers to lines 871-875)

Prompt for agents
In the `pty_closed` branch of `pty_rx.recv() => None` (around line 871 in crates/broker/src/pty_worker.rs), the utf8_decoder is not flushed before the stream_buffer flush and exit payload construction. The watchdog path (lines 1134-1143) shows the correct pattern: after draining all chunks through utf8_decoder.decode(), call utf8_decoder.flush() and, if non-empty, append the result to the stream_buffer (so it gets emitted via worker_stream) and to the echo_buffer (so it's included in last_output). The fix should add a utf8_decoder.flush() call before flush_stream_buffer!() in the None branch, appending any non-empty result to stream_buffer and echo_buffer.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08e0e4e3da

Comment on lines +1136 to +1137
let tail = utf8_decoder.flush();
if !tail.is_empty() {

P2: Flush UTF-8 decoder on normal PTY channel close

Utf8StreamDecoder is flushed only in the watchdog exit path, but not when pty_rx.recv() returns None (the normal reader-closed path). If the last PTY chunk ends with an incomplete multi-byte sequence, decode() buffers those bytes and this branch never emits them, so the final output is silently dropped instead of producing the expected replacement character. This can truncate worker_stream/last_output tails on clean exits where the watchdog branch is not the one that runs.

When pty_rx.recv() returns None (the PTY reader closed cleanly), the
streaming decoder was not flushed, so any incomplete trailing UTF-8
bytes were silently dropped instead of being emitted to worker_stream
and echo_buffer. The watchdog exit path already handled this; mirror
the same flush-and-append pattern here so clean exits don't truncate
output that the watchdog branch would have surfaced.
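
The flush-on-close pattern this commit describes can be sketched with a plain channel loop. Everything here is a stand-in: `std::sync::mpsc` replaces the broker's async `pty_rx`, `TailDecoder` is a hypothetical minimal decoder, and `collect_frames` is an invented helper that returns the text frames instead of sending `worker_stream` events.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical minimal decoder that holds back an incomplete UTF-8 tail.
#[derive(Default)]
struct TailDecoder {
    pending: Vec<u8>,
}

impl TailDecoder {
    fn decode(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        let mut out = String::new();
        let mut pos = 0;
        while pos < self.pending.len() {
            match std::str::from_utf8(&self.pending[pos..]) {
                Ok(valid) => {
                    out.push_str(valid);
                    pos = self.pending.len();
                }
                Err(err) => {
                    let valid_end = pos + err.valid_up_to();
                    out.push_str(&String::from_utf8_lossy(&self.pending[pos..valid_end]));
                    match err.error_len() {
                        // Definitively invalid: replace and move on.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            pos = valid_end + bad;
                        }
                        // Incomplete tail: wait for more bytes.
                        None => {
                            pos = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..pos);
        out
    }

    fn flush(&mut self) -> String {
        if self.pending.is_empty() {
            return String::new();
        }
        self.pending.clear();
        "\u{FFFD}".to_string()
    }
}

/// Drains `rx` until every sender is dropped, returning text frames in order.
fn collect_frames(rx: Receiver<Vec<u8>>) -> Vec<String> {
    let mut decoder = TailDecoder::default();
    let mut frames = Vec::new();
    loop {
        match rx.recv() {
            Ok(chunk) => {
                let text = decoder.decode(&chunk);
                // Skip empty results: the chunk ended mid-codepoint.
                if !text.is_empty() {
                    frames.push(text);
                }
            }
            Err(_) => {
                // Channel closed: no more bytes can complete the tail, so
                // flush it before reporting exit rather than dropping it.
                let tail = decoder.flush();
                if !tail.is_empty() {
                    frames.push(tail);
                }
                break;
            }
        }
    }
    frames
}
```

The point of the sketch is the `Err(_)` arm: without the flush there, a final chunk ending in a lone lead byte would produce no output at all for those bytes.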
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/broker/src/pty_worker.rs (1)

871-900: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing decoder flush in normal EOF path causes data loss.

When pty_rx.recv() returns None (PTY reader closed), any incomplete trailing UTF-8 bytes buffered in utf8_decoder are silently dropped. The watchdog path (lines 1134–1143) correctly calls utf8_decoder.flush() before exit, but this path does not.

If the last PTY chunk contained an incomplete multi-byte sequence (e.g., process crashed mid-output), those bytes will be lost rather than emitted as U+FFFD replacement characters. This violates the PR objective: "flushes remaining buffered bytes at end-of-stream."

🔧 Proposed fix: flush decoder before agent_exit
 None => {
     // PTY reader closed — child likely exited. Flush
     // any buffered stream output before sending
     // agent_exit to preserve output ordering.
     flush_stream_buffer!();
+    // Flush any incomplete trailing bytes from the decoder.
+    let tail = utf8_decoder.flush();
+    if !tail.is_empty() {
+        echo_buffer.push_str(&tail);
+        let _ = send_frame(&out_tx, "worker_stream", None, json!({
+            "stream": "stdout",
+            "chunk": tail,
+        })).await;
+    }
     // Emit agent_exit with any echo_buffer tail so the
     // dashboard can surface the CLI's last output.
     let clean = strip_ansi(&echo_buffer);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/pty_worker.rs` around lines 871 - 900, When handling the
PTY EOF branch where pty_rx.recv() returns None (inside the None => block),
flush the UTF-8 decoder (utf8_decoder.flush()) and append any
replacement-decoded text to echo_buffer before building exit_payload and calling
send_frame("agent_exit", ...); specifically, replicate the watchdog logic that
calls utf8_decoder.flush() to push any pending bytes (as U+FFFD) into
echo_buffer so trailing incomplete multi-byte sequences are not dropped prior to
trimming and sending last_output; update the None => path around
echo_buffer/trimmed/exit_payload accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e74ff2df-792f-4f34-9f57-5998e639237c

📥 Commits

Reviewing files that changed from the base of the PR and between cf82bf5 and 08e0e4e.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • crates/broker/src/pty_worker.rs
  • crates/broker/src/util/mod.rs
  • crates/broker/src/util/utf8_stream.rs

@willwashburn willwashburn merged commit fc709cf into main May 20, 2026
46 of 47 checks passed
@willwashburn willwashburn deleted the claude/fix-issue-922-4REVi branch May 20, 2026 02:08

Development

Successfully merging this pull request may close these issues.

PTY worker_stream corrupts split multi-byte UTF-8 characters

2 participants