Fix tail UTF-8 chunk decoding #402

Merged

RtlZeroMemory merged 2 commits into main from fix/tail-utf8-decoding on Apr 26, 2026
Conversation

RtlZeroMemory (Owner) commented Apr 26, 2026

Summary

  • Decode tailed file slices with a streaming UTF-8 decoder.
  • Add a regression test for a multi-byte character split across the 64 KiB read boundary.

Rationale

The tail source decoded each read chunk independently. If a UTF-8 character crossed the chunk boundary, Node inserted replacement characters and the yielded line was corrupted.

Validation

  • npm run build
  • node scripts/run-tests.mjs --filter tailSource
  • npm run typecheck -- --pretty false

Summary by CodeRabbit

  • Bug Fixes

    • Resolved UTF-8 character corruption in tail streaming when characters span internal read boundaries.
  • Tests

    • Added unit tests ensuring multi-byte UTF-8 characters are preserved across read/poll chunk boundaries.
  • Chores

    • Minor template formatting cleanup and exposed a streaming chunk-size constant for external use.

coderabbitai Bot (Contributor) commented Apr 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43702263-63a1-4a6f-87c7-69ea9e5d0af7

📥 Commits

Reviewing files that changed from the base of the PR and between 61310d8 and 95afe6e.

📒 Files selected for processing (3)
  • packages/create-rezi/src/scaffold.ts
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/create-rezi/src/scaffold.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts

📝 Walkthrough

Implements incremental UTF-8 decoding in the tail stream reader using a persistent StringDecoder, exports READ_CHUNK_BYTES, and adds tests that verify multi-byte UTF-8 characters (emoji) are preserved when their bytes are split across internal read chunks or poll intervals.

Changes

Tail stream implementation (packages/node/src/streams/tail.ts)
  Replace per-chunk Buffer.toString("utf8", ...) with an incremental StringDecoder kept by the iterator; reset the decoder on truncation/rotation; export READ_CHUNK_BYTES.

Tail source tests (packages/node/src/__tests__/tailSource.test.ts)
  Add tests that write emoji bytes split across internal read chunk boundaries and across poll intervals, assert the emitted lines contain the full UTF-8 character and no \uFFFD, and check the iterator completes after source.close().

Scaffold formatting (packages/create-rezi/src/scaffold.ts)
  Reformat the long starship template description string into a multi-line expression without changing content.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes both near and far,

split emojis — I stitch each part.
No � in my garden bed,
StringDecoder keeps them fed,
Hopping code with a happy heart. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The title 'Fix tail UTF-8 chunk decoding' accurately summarizes the main change: fixing UTF-8 decoding when multi-byte characters span read chunk boundaries in the tail implementation.
Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@RtlZeroMemory RtlZeroMemory marked this pull request as ready for review April 26, 2026 05:53
RtlZeroMemory (Owner, Author) commented:

@CodeRabbit review

coderabbitai Bot (Contributor) commented Apr 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot (Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/node/src/streams/tail.ts (1)

24-45: ⚠️ Potential issue | 🟠 Major

Decoder state is not preserved across polling iterations — multi-byte chars can still be corrupted at slice boundaries.

readUtf8Slice constructs a fresh StringDecoder per call and finalizes it with decoder.end() before returning. The slice ends at fileSize from the most recent stat() (line 91/112), which is not guaranteed to align with a UTF‑8 character boundary if a writer is mid-flush when we snapshot the size. In that case:

  1. decoder.end() flushes the partial leading byte(s) as \uFFFD, and
  2. the next poll starts reading at offset = fileSize (line 113), so the trailing continuation bytes are decoded by a fresh decoder and produce additional \uFFFD.

The fix that is currently in place only addresses splits across the internal 64 KiB chunks within a single slice; splits across readUtf8Slice invocations remain corrupted, which is the same failure mode described in the PR rationale. The new regression test does not cover this case because the file is fully written before iteration begins.

Recommend hoisting the decoder into the iterator’s scope (alongside carry) so multibyte sequences that straddle a poll boundary are buffered until the next read, and only finalizing on close or on truncation/rotation.

♻️ Proposed refactor: per-iterator decoder, finalized only on close/truncation
-async function readUtf8Slice(filePath: string, start: number, end: number): Promise<string> {
-  if (end <= start) return "";
-
-  const handle = await open(filePath, "r");
-  let offset = start;
-  let output = "";
-  const decoder = new StringDecoder("utf8");
-
-  try {
-    while (offset < end) {
-      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
-      const buffer = Buffer.allocUnsafe(bytesToRead);
-      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
-      if (bytesRead <= 0) break;
-      output += decoder.write(buffer.subarray(0, bytesRead));
-      offset += bytesRead;
-    }
-    return output + decoder.end();
-  } finally {
-    await handle.close();
-  }
-}
+async function readUtf8Slice(
+  filePath: string,
+  start: number,
+  end: number,
+  decoder: StringDecoder,
+): Promise<string> {
+  if (end <= start) return "";
+
+  const handle = await open(filePath, "r");
+  let offset = start;
+  let output = "";
+
+  try {
+    while (offset < end) {
+      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
+      const buffer = Buffer.allocUnsafe(bytesToRead);
+      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
+      if (bytesRead <= 0) break;
+      output += decoder.write(buffer.subarray(0, bytesRead));
+      offset += bytesRead;
+    }
+    return output;
+  } finally {
+    await handle.close();
+  }
+}

And in the iterator (around lines 82–124):

   async function* iterator(): AsyncGenerator<string> {
     let initialized = false;
     let offset = 0;
     let carry = "";
+    let decoder = new StringDecoder("utf8");
@@
       if (fileSize < offset) {
         // File was truncated/rotated.
         offset = 0;
         carry = "";
+        decoder = new StringDecoder("utf8");
       }

       if (fileSize > offset) {
-        const delta = await readUtf8Slice(filePath, offset, fileSize);
+        const delta = await readUtf8Slice(filePath, offset, fileSize, decoder);
         offset = fileSize;
         const segments = `${carry}${delta}`.split(/\r?\n/);
         carry = segments.pop() ?? "";
         for (const line of segments) {
           if (closed) return;
           yield line;
         }
       }

       await sleep(pollMs);
     }
+    // Flush any pending bytes as replacement chars on terminal close.
+    const tail = decoder.end();
+    if (tail.length > 0) {
+      const segments = `${carry}${tail}`.split(/\r?\n/);
+      carry = segments.pop() ?? "";
+      for (const line of segments) yield line;
+    }
+    if (carry.length > 0) yield carry;
   }

(Whether to yield the trailing carry on close is a separate behavior decision — the existing code does not, so you may want to omit the final block.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/node/src/streams/tail.ts` around lines 24 - 45, The bug is that
readUtf8Slice creates and finalizes a new StringDecoder per call so multi-byte
UTF‑8 sequences that cross poll boundaries get double-decoded as replacement
chars; hoist a single StringDecoder instance into the tail iterator scope (next
to the existing carry buffer) and reuse it across readUtf8Slice invocations so
partial sequences are buffered between polls, call decoder.end() only on
iterator close or when truncation/rotation requires flushing, and update
readUtf8Slice to accept and use the shared decoder (or return raw bytes to be
fed into the shared decoder) instead of creating its own.
🧹 Nitpick comments (1)
packages/node/src/__tests__/tailSource.test.ts (1)

68-69: Optional: couple the test to READ_CHUNK_BYTES to keep the boundary assertion meaningful.

The split is engineered against the current value of READ_CHUNK_BYTES (64 KiB), but the constant is private to tail.ts. If anyone changes READ_CHUNK_BYTES, this test will silently stop exercising a chunk-boundary split (the emoji will fit inside one chunk again) while still passing. Consider exporting READ_CHUNK_BYTES from ../streams/tail.js (e.g., export const READ_CHUNK_BYTES = ...) and computing the padding from it so the test remains a true boundary case.

♻️ Suggested change
-import { createNodeTailSource } from "../streams/tail.js";
+import { createNodeTailSource, READ_CHUNK_BYTES } from "../streams/tail.js";
@@
-    const line = `${"a".repeat(64 * 1024 - 1)}😀`;
+    // Position the emoji so its 4 UTF-8 bytes straddle the read-chunk boundary.
+    const line = `${"a".repeat(READ_CHUNK_BYTES - 1)}😀`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/node/src/__tests__/tailSource.test.ts` around lines 68 - 69, The
test constructs a line sized to the current private READ_CHUNK_BYTES (64 KiB) so
it only tests the chunk-boundary split as long as that internal constant stays
unchanged; export the constant from the tail implementation (e.g., export const
READ_CHUNK_BYTES) and update the test to compute the padding using that exported
READ_CHUNK_BYTES (use the exported READ_CHUNK_BYTES from ../streams/tail.js when
building the line variable) so the emoji remains positioned at the boundary
regardless of future changes.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f78fa1f-9287-497d-afbc-834e4119019b

📥 Commits

Reviewing files that changed from the base of the PR and between a8f2c3b and 61310d8.

📒 Files selected for processing (2)
  • packages/node/src/__tests__/tailSource.test.ts
  • packages/node/src/streams/tail.ts

@RtlZeroMemory RtlZeroMemory merged commit 846c6af into main Apr 26, 2026
14 of 15 checks passed
