Fix tail UTF-8 chunk decoding #402

Merged

RtlZeroMemory merged 2 commits into main from fix/tail-utf8-decoding on Apr 26, 2026
Conversation

RtlZeroMemory (Owner) commented Apr 26, 2026

Summary

  • Decode tailed file slices with a streaming UTF-8 decoder.
  • Add a regression test for a multi-byte character split across the 64 KiB read boundary.

Rationale

The tail source decoded each read chunk independently. If a UTF-8 character crossed the chunk boundary, Node inserted replacement characters and the yielded line was corrupted.

Validation

  • npm run build
  • node scripts/run-tests.mjs --filter tailSource
  • npm run typecheck -- --pretty false

Summary by CodeRabbit

  • Bug Fixes

    • Resolved UTF-8 character corruption in tail streaming when characters span internal read boundaries.
  • Tests

    • Added unit tests ensuring multi-byte UTF-8 characters are preserved across read/poll chunk boundaries.
  • Chores

    • Minor template formatting cleanup and exposed a streaming chunk-size constant for external use.

coderabbitai Bot (Contributor) commented Apr 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43702263-63a1-4a6f-87c7-69ea9e5d0af7

📥 Commits

Reviewing files that changed from the base of the PR and between 61310d8 and 95afe6e.

📒 Files selected for processing (3)
  • packages/create-rezi/src/scaffold.ts
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/create-rezi/src/scaffold.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts

📝 Walkthrough

Implements incremental UTF-8 decoding in the tail stream reader using a persistent StringDecoder, exports READ_CHUNK_BYTES, and adds tests that verify multi-byte UTF-8 characters (emoji) are preserved when their bytes are split across internal read chunks or poll intervals.

Changes

Tail stream implementation (packages/node/src/streams/tail.ts)
  Replace per-chunk Buffer.toString("utf8", ...) with an incremental StringDecoder kept by the iterator; reset the decoder on truncation/rotation; export READ_CHUNK_BYTES.

Tail source tests (packages/node/src/__tests__/tailSource.test.ts)
  Add tests that write emoji bytes split across internal read chunk boundaries and across poll intervals, assert the emitted lines contain the full UTF-8 character and no \uFFFD, and check the iterator completes after source.close().

Scaffold formatting (packages/create-rezi/src/scaffold.ts)
  Reformat the long starship template description string into a multi-line expression without changing content.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes both near and far,

split emojis — I stitch each part.
No � in my garden bed,
StringDecoder keeps them fed,
Hopping code with a happy heart. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title 'Fix tail UTF-8 chunk decoding' accurately summarizes the main change: fixing UTF-8 decoding when multi-byte characters span read chunk boundaries in the tail implementation.
Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@RtlZeroMemory RtlZeroMemory marked this pull request as ready for review April 26, 2026 05:53
RtlZeroMemory (Owner, Author) commented:

@CodeRabbit review

coderabbitai Bot (Contributor) commented Apr 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/node/src/streams/tail.ts (1)

24-45: ⚠️ Potential issue | 🟠 Major

Decoder state is not preserved across polling iterations — multi-byte chars can still be corrupted at slice boundaries.

readUtf8Slice constructs a fresh StringDecoder per call and finalizes it with decoder.end() before returning. The slice ends at fileSize from the most recent stat() (line 91/112), which is not guaranteed to align with a UTF‑8 character boundary if a writer is mid-flush when we snapshot the size. In that case:

  1. decoder.end() flushes the partial leading byte(s) as \uFFFD, and
  2. the next poll starts reading at offset = fileSize (line 113), so the trailing continuation bytes are decoded by a fresh decoder and produce additional \uFFFD.

The fix that is currently in place only addresses splits across the internal 64 KiB chunks within a single slice; splits across readUtf8Slice invocations remain corrupted, which is the same failure mode described in the PR rationale. The new regression test does not cover this case because the file is fully written before iteration begins.

Recommend hoisting the decoder into the iterator’s scope (alongside carry) so multibyte sequences that straddle a poll boundary are buffered until the next read, and only finalizing on close or on truncation/rotation.

♻️ Proposed refactor: per-iterator decoder, finalized only on close/truncation
-async function readUtf8Slice(filePath: string, start: number, end: number): Promise<string> {
-  if (end <= start) return "";
-
-  const handle = await open(filePath, "r");
-  let offset = start;
-  let output = "";
-  const decoder = new StringDecoder("utf8");
-
-  try {
-    while (offset < end) {
-      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
-      const buffer = Buffer.allocUnsafe(bytesToRead);
-      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
-      if (bytesRead <= 0) break;
-      output += decoder.write(buffer.subarray(0, bytesRead));
-      offset += bytesRead;
-    }
-    return output + decoder.end();
-  } finally {
-    await handle.close();
-  }
-}
+async function readUtf8Slice(
+  filePath: string,
+  start: number,
+  end: number,
+  decoder: StringDecoder,
+): Promise<string> {
+  if (end <= start) return "";
+
+  const handle = await open(filePath, "r");
+  let offset = start;
+  let output = "";
+
+  try {
+    while (offset < end) {
+      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
+      const buffer = Buffer.allocUnsafe(bytesToRead);
+      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
+      if (bytesRead <= 0) break;
+      output += decoder.write(buffer.subarray(0, bytesRead));
+      offset += bytesRead;
+    }
+    return output;
+  } finally {
+    await handle.close();
+  }
+}

And in the iterator (around lines 82–124):

   async function* iterator(): AsyncGenerator<string> {
     let initialized = false;
     let offset = 0;
     let carry = "";
+    let decoder = new StringDecoder("utf8");
@@
       if (fileSize < offset) {
         // File was truncated/rotated.
         offset = 0;
         carry = "";
+        decoder = new StringDecoder("utf8");
       }

       if (fileSize > offset) {
-        const delta = await readUtf8Slice(filePath, offset, fileSize);
+        const delta = await readUtf8Slice(filePath, offset, fileSize, decoder);
         offset = fileSize;
         const segments = `${carry}${delta}`.split(/\r?\n/);
         carry = segments.pop() ?? "";
         for (const line of segments) {
           if (closed) return;
           yield line;
         }
       }

       await sleep(pollMs);
     }
+    // Flush any pending bytes as replacement chars on terminal close.
+    const tail = decoder.end();
+    if (tail.length > 0) {
+      const segments = `${carry}${tail}`.split(/\r?\n/);
+      carry = segments.pop() ?? "";
+      for (const line of segments) yield line;
+    }
+    if (carry.length > 0) yield carry;
   }

(Whether to yield the trailing carry on close is a separate behavior decision — the existing code does not, so you may want to omit the final block.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/node/src/streams/tail.ts` around lines 24 - 45, The bug is that
readUtf8Slice creates and finalizes a new StringDecoder per call so multi-byte
UTF‑8 sequences that cross poll boundaries get double-decoded as replacement
chars; hoist a single StringDecoder instance into the tail iterator scope (next
to the existing carry buffer) and reuse it across readUtf8Slice invocations so
partial sequences are buffered between polls, call decoder.end() only on
iterator close or when truncation/rotation requires flushing, and update
readUtf8Slice to accept and use the shared decoder (or return raw bytes to be
fed into the shared decoder) instead of creating its own.
🧹 Nitpick comments (1)
packages/node/src/__tests__/tailSource.test.ts (1)

68-69: Optional: couple the test to READ_CHUNK_BYTES to keep the boundary assertion meaningful.

The split is engineered against the current value of READ_CHUNK_BYTES (64 KiB), but the constant is private to tail.ts. If anyone changes READ_CHUNK_BYTES, this test will silently stop exercising a chunk-boundary split (the emoji will fit inside one chunk again) while still passing. Consider exporting READ_CHUNK_BYTES from ../streams/tail.js (e.g., export const READ_CHUNK_BYTES = ...) and computing the padding from it so the test remains a true boundary case.

♻️ Suggested change
-import { createNodeTailSource } from "../streams/tail.js";
+import { createNodeTailSource, READ_CHUNK_BYTES } from "../streams/tail.js";
@@
-    const line = `${"a".repeat(64 * 1024 - 1)}😀`;
+    // Position the emoji so its 4 UTF-8 bytes straddle the read-chunk boundary.
+    const line = `${"a".repeat(READ_CHUNK_BYTES - 1)}😀`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/node/src/__tests__/tailSource.test.ts` around lines 68 - 69, The
test constructs a line sized to the current private READ_CHUNK_BYTES (64 KiB) so
it only tests the chunk-boundary split as long as that internal constant stays
unchanged; export the constant from the tail implementation (e.g., export const
READ_CHUNK_BYTES) and update the test to compute the padding using that exported
READ_CHUNK_BYTES (use the exported READ_CHUNK_BYTES from ../streams/tail.js when
building the line variable) so the emoji remains positioned at the boundary
regardless of future changes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f78fa1f-9287-497d-afbc-834e4119019b

📥 Commits

Reviewing files that changed from the base of the PR and between a8f2c3b and 61310d8.

📒 Files selected for processing (2)
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts

@RtlZeroMemory RtlZeroMemory merged commit 846c6af into main Apr 26, 2026
14 of 15 checks passed
