
reader (Rust): stream JSONL hot loops; use memchr crate #372

Merged
willwashburn merged 2 commits into main from claude/resolve-issue-323-yR9X9 on May 8, 2026

Conversation

@willwashburn
Member

Summary

Tackles the two adjacent wins called out in #323. The full typed-shell refactor (the issue's primary "single biggest reader perf win" item) is not in this PR — the existing Claude/Codex parsers thread &serde_json::Value and &Map<String, Value> deeply through their helpers, so a meaningful shell would either need the helper signatures rewritten or a deserialize / re-serialize round trip that's a perf regression for matched lines. The streaming + memchr changes here are independently valuable and unblock the larger refactor.

Multi-GB up-front allocations → BufReader streaming

The Claude incremental and prescan paths built a (size - start_offset)-byte buffer up front:

// run_incremental — multi-GB Vec for a multi-GB session
let mut buf: Vec<u8> = Vec::with_capacity((size - start_offset) as usize);
file.read_to_end(&mut buf)?;
// prescan_nodes — same pattern
let mut buf = vec![0u8; length as usize];
file.read_exact(&mut buf)?;

Codex incremental had the same shape (vec![0u8; (size - start_offset) as usize] + read_exact). All three switch to BufReader + read_until(b'\n', &mut line_buf) into a reused line buffer, so only the longest single line stays resident regardless of file size. Cursor offsets are tracked by accumulating n returned by read_until; partial trailing lines (no \n) break the loop so the next incremental call resumes from a committed \n boundary, matching the prior semantics.
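The streaming pattern described above can be sketched as follows. This is illustrative only (the helper name and signature are not the actual relayburn-sdk code): read one line at a time into a reused buffer, advance the byte cursor only past \n-terminated lines, and leave a partial trailing line uncommitted so the next incremental call resumes from the last committed \n boundary.

```rust
use std::io::{BufRead, BufReader, Cursor};

// Hypothetical helper sketching the streaming loop: only the longest single
// line stays resident, and the returned offset is a committed '\n' boundary.
fn scan_committed_lines<R: BufRead>(
    mut reader: R,
    start_offset: u64,
) -> std::io::Result<(Vec<String>, u64)> {
    let mut line_buf: Vec<u8> = Vec::new(); // reused; capacity grows to the longest line only
    let mut cursor_offset = start_offset;
    let mut lines = Vec::new();
    loop {
        line_buf.clear(); // keeps capacity, drops contents
        let n = reader.read_until(b'\n', &mut line_buf)?;
        if n == 0 {
            break; // EOF exactly at a committed boundary
        }
        if line_buf.last() != Some(&b'\n') {
            break; // partial trailing line: do not advance the cursor past it
        }
        cursor_offset += n as u64;
        let line = line_buf.strip_suffix(b"\n").unwrap_or(&line_buf);
        lines.push(String::from_utf8_lossy(line).into_owned());
    }
    Ok((lines, cursor_offset))
}

fn main() -> std::io::Result<()> {
    // Two complete JSONL records followed by a truncated tail.
    let data = b"{\"a\":1}\n{\"b\":2}\n{\"partial\"";
    let reader = BufReader::new(Cursor::new(&data[..]));
    let (lines, end) = scan_committed_lines(reader, 0)?;
    assert_eq!(lines, vec!["{\"a\":1}", "{\"b\":2}"]);
    assert_eq!(end, 16); // resume offset sits just past the last committed '\n'
    println!("{} committed lines, resume at {}", lines.len(), end);
    Ok(())
}
```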

parse_codex_buffer switched from buf: &[u8] to <R: BufRead>(mut reader: R, ...). Only parse_codex_session_incremental calls it.

BufReader::lines() → reused String

The main parse_claude_session_with_counter loop allocated a fresh String per line via BufReader::lines(). Replaced with read_line(&mut line) into a single reused buffer.
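The difference can be sketched like this (the function name is illustrative, not the actual parser): BufReader::lines() hands back a fresh String per iteration, while read_line into one reused String keeps allocation bounded by the longest line.

```rust
use std::io::{BufRead, BufReader, Cursor};

// Illustrative stand-in for the loop described above: one String, reused.
fn collect_lines_reused<R: BufRead>(mut reader: R) -> std::io::Result<Vec<String>> {
    let mut line = String::new(); // allocated once; capacity is reused
    let mut out = Vec::new();
    loop {
        line.clear(); // keeps capacity, drops contents
        if reader.read_line(&mut line)? == 0 {
            break; // EOF
        }
        out.push(line.trim_end_matches('\n').to_owned());
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let lines = collect_lines_reused(BufReader::new(Cursor::new("alpha\nbeta\ngamma\n")))?;
    assert_eq!(lines, ["alpha", "beta", "gamma"]);
    println!("{} lines", lines.len());
    Ok(())
}
```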

memchr_newline actually uses memchr

reader/codex.rs:1231 was named for memchr but did buf.iter().position(|&b| b == b'\n'). Now wired to memchr::memchr(b'\n', buf). The memchr crate is already pulled in transitively through regex; added at the workspace root and depended on directly from relayburn-sdk so we own the version.

Test plan

  • cargo test --workspace — all 618 SDK unit tests + 2 integration tests + bindings + doctests pass on the streaming paths (incremental cursor handling, prescan, full-file parse all exercised by existing tests).
  • Smoke a real multi-GB Claude session through burn ingest to confirm RSS no longer scales with file size (manual; reviewer can defer).

Out of scope

Typed #[derive(Deserialize)] shells for the per-line shape — see issue body. Worth a follow-up that also refactors the &Value-taking helpers, since the win comes from skipping the BTreeMap build, not from the deserialize itself.

Closes #323.

https://claude.ai/code/session_01Mu7DKodN2MX4qWD55FW5zg


Generated by Claude Code

@coderabbitai

coderabbitai Bot commented May 8, 2026


📝 Walkthrough

Performance optimization of relayburn-sdk JSONL reader hot paths: adds memchr dependency, refactors Claude and Codex session readers to stream lines incrementally with reusable buffers instead of pre-allocating multi-GB allocations, and accelerates newline detection via memchr.

Changes

Reader streaming and memchr optimization

Layer / File(s) — Summary

  • Workspace dependencies (Cargo.toml, crates/relayburn-sdk/Cargo.toml): memchr = "2" added to the workspace and explicitly declared in relayburn-sdk for SIMD-accelerated newline detection.
  • Claude reader streaming, non-incremental (crates/relayburn-sdk/src/reader/claude.rs): parse_claude_session_with_counter switches from BufReader::lines() to read_line with a reused String.
  • Claude reader streaming, prescan & incremental (crates/relayburn-sdk/src/reader/claude.rs): prescan_nodes and run_incremental replace whole-file buffering with BufReader::take/read_until streaming using a reusable Vec<u8>, computing line_start_offset/line_end_offset and skipping trailing partial (non-\n-terminated) lines.
  • Codex reader streaming (crates/relayburn-sdk/src/reader/codex.rs): parse_codex_session_incremental seeks to start_offset and uses a generic parse_codex_buffer<R: BufRead> that reads records with read_until(b'\n'), tracks current_offset per line for committed filtering, trims the newline before serde parsing, and stops advancement on unterminated final lines.
  • Documentation (CHANGELOG.md): added an Unreleased entry describing JSONL streaming via reusable buffers, the memchr optimization, and removal of the BufReader::lines() per-line allocation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes


Poem

🐰
I nibble lines in tidy rows,
Reuse my bowls where memory flows,
Memchr hops to mark each break,
No bulky heaps left in my wake,
Readers sprint — hooray, we go!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check — ✅ Passed: The title 'reader (Rust): stream JSONL hot loops; use memchr crate' directly and clearly describes the main changes: streaming JSONL parsing and memchr integration for newline scanning.
  • Description check — ✅ Passed: The description relates directly to the changeset, detailing the streaming refactor from up-front allocations to BufReader, the switch from BufReader::lines() to reused String buffers, and the memchr integration, all of which match the file-level changes.
  • Linked Issues check — ✅ Passed: The PR addresses core objectives from #323: it replaces multi-GB up-front buffer allocations with BufReader streaming [claude.rs, codex.rs], replaces BufReader::lines() per-line allocations with reused buffers, and integrates memchr for newline scans; the typed-shell refactor is explicitly deferred as out of scope.
  • Out of Scope Changes check — ✅ Passed: All changes are directly scoped to the streaming refactor, buffer reuse, and memchr integration as outlined in #323; the typed-shell refactor for per-line deserialize is correctly identified as out of scope and not included in this PR.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab56ca69ad


Comment on lines 398 to 402
let n = match reader.read_until(b'\n', &mut line_buf) {
Ok(0) => break,
Ok(n) => n,
Err(_) => break,
};

P2: Propagate incremental read failures instead of truncating

The new streaming loop swallows read_until I/O errors by breaking and returning a normal parse result, which can silently drop the unread tail of a session if a mid-file read fails (for example on transient filesystem/network errors). Because parse_codex_session_incremental returns std::io::Result, callers expect real read failures to surface as errors rather than as partial success with an advanced resume state.


Member Author


Good catch — fixed in 8df59ba. parse_codex_buffer now returns std::io::Result<ParseCodexIncrementalResult> and the loop uses ? to propagate, matching what claude.rs already did. Transient mid-file read failures now surface as errors instead of silently advancing the resume cursor.


Generated by Claude Code


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/relayburn-sdk/src/reader/claude.rs (1)

2502-2502: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Return last_user_text from the committed boundary, not the scan tail.

end_offset can back up before the last lines scanned, but the result always returns the final current_user_text from the whole read window. If a later plain-text user line lands after that backed-up boundary, the next incremental call is seeded with prompt text from beyond its start cursor and can classify the resumed turn against the wrong prompt. Please snapshot the latest plain-text user text at the same boundary you commit to end_offset.

Suggested approach
+    let mut user_text_updates: Vec<(u64, String)> = Vec::new();
     let mut current_user_text = options.last_user_text.clone().unwrap_or_default();
...
                 if let Some(text) = extract_plain_user_text_from_obj(&obj) {
                     if !text.is_empty() {
-                        current_user_text = text;
+                        current_user_text = text.clone();
+                        user_text_updates.push((line_start_offset, text));
                     }
                 }
...
     let end_offset = earliest_incomplete.unwrap_or(cursor_offset);
+    let committed_last_user_text = user_text_updates
+        .iter()
+        .rfind(|(off, _)| *off < end_offset)
+        .map(|(_, text)| text.clone())
+        .unwrap_or_else(|| options.last_user_text.clone().unwrap_or_default());
...
         end_offset,
-        last_user_text: current_user_text,
+        last_user_text: committed_last_user_text,
         evidence,

Also applies to: 2615-2618, 2847-2848

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/relayburn-sdk/src/reader/claude.rs` at line 2502, The code currently
seeds current_user_text from options.last_user_text or the scan tail, which can
include text beyond the committed end_offset; instead capture (snapshot) the
latest plain-text user text at the same boundary where you commit end_offset and
use that snapshot to set current_user_text. Concretely, when you compute and
commit end_offset, also compute committed_last_user_text (from the scan results
up to that boundary) and replace the assignment to current_user_text (and the
analogous spots referenced at lines ~2615-2618 and ~2847-2848) to use
committed_last_user_text.unwrap_or_default() so incremental calls are seeded
only with text within the committed window.
crates/relayburn-sdk/src/reader/codex.rs (1)

337-342: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate read_until failures instead of treating them as EOF.

The new streaming loop turns any mid-stream read error into a clean break and returns a partial parse as success. That can silently truncate turns/events after transient FS errors while still advancing callers with a bogus end_offset.

Suggested fix
-fn parse_codex_buffer<R: BufRead>(
+fn parse_codex_buffer<R: BufRead>(
     mut reader: R,
     start_offset: u64,
     options: &ParseCodexIncrementalOptions,
     project_resolver: &ProjectResolver,
-) -> ParseCodexIncrementalResult {
+) -> std::io::Result<ParseCodexIncrementalResult> {
...
     loop {
         line_buf.clear();
-        let n = match reader.read_until(b'\n', &mut line_buf) {
-            Ok(0) => break,
-            Ok(n) => n,
-            Err(_) => break,
-        };
+        let n = reader.read_until(b'\n', &mut line_buf)?;
+        if n == 0 {
+            break;
+        }
...
-    ParseCodexIncrementalResult {
+    Ok(ParseCodexIncrementalResult {
         turns,
         content: content_out,
         events: events_out,
         user_turns: user_turns_out,
         relationships: relationships_out,
         tool_result_events: tool_events_out,
         end_offset: committed_end_offset,
         resume,
-    }
+    })
 }

And at the call site:

-    Ok(parse_codex_buffer(
+    parse_codex_buffer(
         reader,
         start_offset,
         options,
         &project_resolver,
-    ))
+    )

Also applies to: 398-401

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/relayburn-sdk/src/reader/codex.rs` around lines 337 - 342, In
parse_codex_buffer, don’t treat read_until failures as EOF—propagate the I/O
error back to the caller instead of breaking the streaming loop and returning a
partial success; locate the loop that calls reader.read_until(...) inside
parse_codex_buffer (and the similar block around lines referenced 398-401) and
change the handling so any Err(e) from read_until is returned (or converted into
the ParseCodexIncrementalResult error variant) rather than being interpreted as
Ok(0) or a normal loop break, ensuring end_offset is only advanced on successful
reads and remains unchanged on I/O errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: bfaff264-5c75-4410-a430-f9744fff00be

📥 Commits

Reviewing files that changed from the base of the PR and between 7f7d9d2 and ab56ca6.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • CHANGELOG.md
  • Cargo.toml
  • crates/relayburn-sdk/Cargo.toml
  • crates/relayburn-sdk/src/reader/claude.rs
  • crates/relayburn-sdk/src/reader/codex.rs

willwashburn pushed a commit that referenced this pull request May 8, 2026
Address review feedback on #372: the codex streaming loop was
swallowing `read_until` failures via `Err(_) => break`, which would
silently truncate the parse at a transient mid-file read error and
advance the resume cursor as if the bytes had been processed. The
claude.rs equivalents already use `?` to propagate.

Bubble the error by changing `parse_codex_buffer` to return
`std::io::Result<ParseCodexIncrementalResult>` and using `?` in the
loop, matching `parse_codex_session_incremental`'s outer signature.
@willwashburn
Member Author

@copilot resolve the merge conflicts in this pull request

claude added 2 commits May 8, 2026 01:42
The Claude / Codex incremental and prescan paths previously built a
multi-GB up-front buffer (`vec![0u8; (size - start_offset) as usize]`,
`Vec::with_capacity((size - start_offset) as usize)` + `read_to_end`)
to scan a session log. Switch to BufReader + `read_until(b'\n', ...)`
into a reused line buffer so only the longest single line stays
resident, regardless of file size.

The main `parse_claude_session` loop also moves off
`BufReader::lines()` (a fresh `String` per line) onto `read_line` into
a reused `String`, keeping per-line allocation bounded by the longest
line for sessions with tens of thousands of turns.

`memchr_newline` in the codex parser was named for `memchr` but did
`buf.iter().position(|&b| b == b'\n')`. Wire the actual `memchr` crate
(already a transitive dep through `regex`) for SIMD-accelerated line
splits; pinned at the workspace root and depended on directly from
`relayburn-sdk` so we own the version.

Closes #323.
Address review feedback on #372: the codex streaming loop was
swallowing `read_until` failures via `Err(_) => break`, which would
silently truncate the parse at a transient mid-file read error and
advance the resume cursor as if the bytes had been processed. The
claude.rs equivalents already use `?` to propagate.

Bubble the error by changing `parse_codex_buffer` to return
`std::io::Result<ParseCodexIncrementalResult>` and using `?` in the
loop, matching `parse_codex_session_incremental`'s outer signature.
@willwashburn willwashburn force-pushed the claude/resolve-issue-323-yR9X9 branch from 8df59ba to 459e1cf Compare May 8, 2026 01:44
Contributor

Copilot AI commented May 8, 2026

@copilot resolve the merge conflicts in this pull request

Resolved — this branch now merges cleanly with main (no conflicts). Verified at commit 459e1cf.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
crates/relayburn-sdk/src/reader/claude.rs (1)

2310-2332: ⚡ Quick win

Add a regression fixture for newline-boundary EOF behavior.

These loops now define the incremental commit boundary. A fixture that distinguishes a complete final JSON object without \n from a genuinely truncated tail would make later refactors much safer, especially since the sync and incremental entry points currently take different EOF paths.

Also applies to: 2521-2545
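A minimal sketch of the fixture the reviewer suggests (all names hypothetical, not the relayburn-sdk API): the full parse may accept a complete final JSON object even without a trailing \n, while a genuinely truncated tail must stay uncommitted. The "is this JSON complete" check is stubbed with a closing-brace test here; the real fixture would parse with serde_json.

```rust
use std::io::{BufRead, BufReader, Cursor};

// Hypothetical helper: split input into committed lines and an uncommitted
// truncated tail, distinguishing "complete final object, no newline" from
// "truncated tail" at EOF. Completeness is stubbed via a '}' suffix check.
fn split_eof_tail(data: &[u8]) -> (Vec<String>, Option<String>) {
    let mut reader = BufReader::new(Cursor::new(data));
    let mut buf = Vec::new();
    let mut complete = Vec::new();
    let mut tail = None;
    loop {
        buf.clear();
        let n = reader.read_until(b'\n', &mut buf).unwrap();
        if n == 0 {
            break;
        }
        let terminated = buf.last() == Some(&b'\n');
        let line =
            String::from_utf8_lossy(buf.strip_suffix(b"\n").unwrap_or(&buf)).into_owned();
        if terminated || line.trim_end().ends_with('}') {
            complete.push(line); // stand-in completeness check
        } else {
            tail = Some(line); // genuinely truncated: leave for the next read
        }
    }
    (complete, tail)
}

fn main() {
    // Complete final object, no trailing newline: accepted by the full parse.
    let (lines, tail) = split_eof_tail(b"{\"a\":1}\n{\"b\":2}");
    assert_eq!(lines.len(), 2);
    assert!(tail.is_none());
    // Truncated tail: must not be committed.
    let (lines, tail) = split_eof_tail(b"{\"a\":1}\n{\"b\":");
    assert_eq!(lines.len(), 1);
    assert_eq!(tail.as_deref(), Some("{\"b\":"));
    println!("fixture behaves as expected");
}
```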

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/relayburn-sdk/src/reader/claude.rs` around lines 2310 - 2332, Create a
regression fixture that verifies EOF handling at newline boundaries for the
prescan loops that use reader.read_until and line_buf: write tests that feed a
complete JSON object without a trailing '\n' and assert it is treated as a
complete final entry (not truncated), and also feed a genuinely truncated tail
and assert it is rejected/ignored; add assertions around the incremental commit
behavior (using the same variables/logic as in the loop that references reader,
line_buf, last_assistant_message_id, next_event_index) and duplicate the fixture
for the similar loop later in the file that uses the same read_until<<newline>>
logic so future refactors won’t change EOF semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9487cdd0-c160-413c-9656-9c337de6cf24

📥 Commits

Reviewing files that changed from the base of the PR and between 8df59ba and 459e1cf.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • CHANGELOG.md
  • Cargo.toml
  • crates/relayburn-sdk/Cargo.toml
  • crates/relayburn-sdk/src/reader/claude.rs
  • crates/relayburn-sdk/src/reader/codex.rs
✅ Files skipped from review due to trivial changes (3)
  • Cargo.toml
  • crates/relayburn-sdk/Cargo.toml
  • CHANGELOG.md

@willwashburn willwashburn merged commit dc35146 into main May 8, 2026
11 checks passed
@willwashburn willwashburn deleted the claude/resolve-issue-323-yR9X9 branch May 8, 2026 03:28


Development

Successfully merging this pull request may close these issues.

Rust perf: reader hot paths parse every line into serde_json::Value

3 participants