Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Implements incremental UTF-8 decoding in the tail stream reader using a persistent StringDecoder, exports READ_CHUNK_BYTES, and adds tests that verify multi-byte UTF-8 characters (emoji) are preserved when their bytes are split across internal read chunks or poll intervals.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
@CodeRabbit review

✅ Actions performed: Review triggered.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/node/src/streams/tail.ts (1)
**24-45:** ⚠️ Potential issue | 🟠 Major

**Decoder state is not preserved across polling iterations — multi-byte chars can still be corrupted at slice boundaries.**

`readUtf8Slice` constructs a fresh `StringDecoder` per call and finalizes it with `decoder.end()` before returning. The slice ends at `fileSize` from the most recent `stat()` (lines 91/112), which is not guaranteed to align with a UTF-8 character boundary if a writer is mid-flush when we snapshot the size. In that case:

- `decoder.end()` flushes the partial leading byte(s) as `\uFFFD`, and
- the next poll starts reading at `offset = fileSize` (line 113), so the trailing continuation bytes are decoded by a fresh decoder and produce additional `\uFFFD`.

The fix that is currently in place only addresses splits across the internal 64 KiB chunks within a single slice; splits across `readUtf8Slice` invocations remain corrupted, which is the same failure mode described in the PR rationale. The new regression test does not cover this case because the file is fully written before iteration begins.

Recommend hoisting the decoder into the iterator's scope (alongside `carry`) so multi-byte sequences that straddle a poll boundary are buffered until the next read, and only finalizing on close or on truncation/rotation.

♻️ Proposed refactor: per-iterator decoder, finalized only on close/truncation
```diff
-async function readUtf8Slice(filePath: string, start: number, end: number): Promise<string> {
-  if (end <= start) return "";
-
-  const handle = await open(filePath, "r");
-  let offset = start;
-  let output = "";
-  const decoder = new StringDecoder("utf8");
-
-  try {
-    while (offset < end) {
-      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
-      const buffer = Buffer.allocUnsafe(bytesToRead);
-      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
-      if (bytesRead <= 0) break;
-      output += decoder.write(buffer.subarray(0, bytesRead));
-      offset += bytesRead;
-    }
-    return output + decoder.end();
-  } finally {
-    await handle.close();
-  }
-}
+async function readUtf8Slice(
+  filePath: string,
+  start: number,
+  end: number,
+  decoder: StringDecoder,
+): Promise<string> {
+  if (end <= start) return "";
+
+  const handle = await open(filePath, "r");
+  let offset = start;
+  let output = "";
+
+  try {
+    while (offset < end) {
+      const bytesToRead = Math.min(READ_CHUNK_BYTES, end - offset);
+      const buffer = Buffer.allocUnsafe(bytesToRead);
+      const { bytesRead } = await handle.read(buffer, 0, bytesToRead, offset);
+      if (bytesRead <= 0) break;
+      output += decoder.write(buffer.subarray(0, bytesRead));
+      offset += bytesRead;
+    }
+    return output;
+  } finally {
+    await handle.close();
+  }
+}
```

And in the iterator (around lines 82–124):
```diff
 async function* iterator(): AsyncGenerator<string> {
   let initialized = false;
   let offset = 0;
   let carry = "";
+  let decoder = new StringDecoder("utf8");
@@
     if (fileSize < offset) {
       // File was truncated/rotated.
       offset = 0;
       carry = "";
+      decoder = new StringDecoder("utf8");
     }
     if (fileSize > offset) {
-      const delta = await readUtf8Slice(filePath, offset, fileSize);
+      const delta = await readUtf8Slice(filePath, offset, fileSize, decoder);
       offset = fileSize;
       const segments = `${carry}${delta}`.split(/\r?\n/);
       carry = segments.pop() ?? "";
       for (const line of segments) {
         if (closed) return;
         yield line;
       }
     }
     await sleep(pollMs);
   }
+  // Flush any pending bytes as replacement chars on terminal close.
+  const tail = decoder.end();
+  if (tail.length > 0) {
+    const segments = `${carry}${tail}`.split(/\r?\n/);
+    carry = segments.pop() ?? "";
+    for (const line of segments) yield line;
+  }
+  if (carry.length > 0) yield carry;
 }
```

(Whether to yield the trailing `carry` on close is a separate behavior decision — the existing code does not, so you may want to omit the final block.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/node/src/streams/tail.ts` around lines 24 - 45, The bug is that readUtf8Slice creates and finalizes a new StringDecoder per call so multi-byte UTF‑8 sequences that cross poll boundaries get double-decoded as replacement chars; hoist a single StringDecoder instance into the tail iterator scope (next to the existing carry buffer) and reuse it across readUtf8Slice invocations so partial sequences are buffered between polls, call decoder.end() only on iterator close or when truncation/rotation requires flushing, and update readUtf8Slice to accept and use the shared decoder (or return raw bytes to be fed into the shared decoder) instead of creating its own.
🧹 Nitpick comments (1)
packages/node/src/__tests__/tailSource.test.ts (1)
**68-69:** Optional: couple the test to `READ_CHUNK_BYTES` to keep the boundary assertion meaningful.

The split is engineered against the current value of `READ_CHUNK_BYTES` (64 KiB), but the constant is private to `tail.ts`. If anyone changes `READ_CHUNK_BYTES`, this test will silently stop exercising a chunk-boundary split (the emoji will fit inside one chunk again) while still passing. Consider exporting `READ_CHUNK_BYTES` from `../streams/tail.js` (e.g., `export const READ_CHUNK_BYTES = ...`) and computing the padding from it so the test remains a true boundary case.

♻️ Suggested change
```diff
-import { createNodeTailSource } from "../streams/tail.js";
+import { createNodeTailSource, READ_CHUNK_BYTES } from "../streams/tail.js";
@@
-  const line = `${"a".repeat(64 * 1024 - 1)}😀`;
+  // Position the emoji so its 4 UTF-8 bytes straddle the read-chunk boundary.
+  const line = `${"a".repeat(READ_CHUNK_BYTES - 1)}😀`;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/node/src/__tests__/tailSource.test.ts` around lines 68 - 69, The test constructs a line sized to the current private READ_CHUNK_BYTES (64 KiB) so it only tests the chunk-boundary split as long as that internal constant stays unchanged; export the constant from the tail implementation (e.g., export const READ_CHUNK_BYTES) and update the test to compute the padding using that exported READ_CHUNK_BYTES (use the exported READ_CHUNK_BYTES from ../streams/tail.js when building the line variable) so the emoji remains positioned at the boundary regardless of future changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f78fa1f-9287-497d-afbc-834e4119019b
📒 Files selected for processing (2)
- packages/node/src/__tests__/tailSource.test.ts
- packages/node/src/streams/tail.ts
Summary
Rationale
The tail source decoded each read chunk independently. If a UTF-8 character crossed the chunk boundary, Node inserted replacement characters and the yielded line was corrupted.
Validation
- `npm run build`
- `node scripts/run-tests.mjs --filter tailSource`
- `npm run typecheck -- --pretty false`

Summary by CodeRabbit
- Bug Fixes
- Tests
- Chores