Skip to content

perf(save): skip backward-byte scan on hot path#1376

Merged
mikasenghaas merged 3 commits into
mainfrom
perf/save-truncate-fast-path
May 14, 2026
Merged

perf(save): skip backward-byte scan on hot path#1376
mikasenghaas merged 3 commits into
mainfrom
perf/save-truncate-fast-path

Conversation

@mikasenghaas
Copy link
Copy Markdown
Member

@mikasenghaas mikasenghaas commented May 14, 2026

Summary

save_new_outputs calls truncate_malformed_trailing_line on every append. The helper walks the file backwards one byte at a time (seek(); read(1) per byte) to find the previous \n and validate the trailing row.

That walk only matters once per run — to clean up a partial row a crashed prior write may have left behind. After the first successful append the file ends in \n and the walk does nothing.

On a networked filesystem (BeeGFS / NFS) each read(1) is a network round trip. For an ~80 KB multi-turn bfcl-v3 rollout the scan does ~80,000 syscalls per save → tens of seconds of wall-time. Because save_new_outputs is awaited inline via asyncio.to_thread inside the as_completed loop, the client's main coroutine parks while the scan runs and no new dispatches happen.

Change

Hoist the truncate call out of save_new_outputs and run it exactly once, at the resume site in Environment.generate, right after load_outputs. Steady-state appends no longer pay the scan cost; crash recovery still works because the resume path runs it before any append.

🤖 Generated with Claude Code


Note

Medium Risk
Changes when malformed trailing JSONL cleanup occurs (now only on resume), which could affect correctness if callers append to an existing partially-written results.jsonl without going through the resume path. Otherwise the change is localized and primarily performance-motivated.

Overview
Speeds up incremental result saving by removing the backward byte-scan/truncation step from the hot append path in save_new_outputs, so steady-state results.jsonl appends no longer pay an O(file-size) tail check.

The malformed-trailing-line cleanup helper is promoted to public (truncate_malformed_trailing_line) and is now invoked once during Environment.generate resume (after load_outputs) to repair any partial final row before future appends; tests are updated to reflect that the caller performs the truncation explicitly.

Reviewed by Cursor Bugbot for commit b7f4df0. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Move malformed trailing-line truncation out of save_new_outputs onto the call site

  • truncate_malformed_trailing_line is renamed from a private helper to a public function in save_utils.py, and save_new_outputs no longer calls it internally.
  • environment.py now calls truncate_malformed_trailing_line explicitly on results.jsonl during the resume-evaluation path, before appending new outputs.
  • Behavioral Change: callers of save_new_outputs are now responsible for ensuring a valid JSONL boundary before appending; the function itself no longer truncates.

Macroscope summarized b7f4df0.

mikasenghaas and others added 3 commits May 15, 2026 04:02
`save_new_outputs` runs `_truncate_malformed_trailing_line` on every
append. That function walks the file backwards one byte at a time
(`seek(); read(1)` per byte) to find the previous `\n` so it can
validate the trailing JSONL row.

That's only relevant once per run — to clean up a partial row left
behind by a crashed previous write. Once the eval is appending complete
lines, every subsequent file state already ends in `\n` and the scan
does nothing.

On a networked filesystem (BeeGFS / NFS), each `read(1)` is a network
round trip. For a multi_turn bfcl-v3 rollout (~80 KB trailing line),
the scan runs ~80k syscalls per save — tens of seconds wall-time. The
client's main asyncio coroutine awaits `asyncio.to_thread(save_new_outputs,
...)` inline in the `as_completed` loop, so dispatch stalls until the
save returns. Observed in practice: `vllm.running` drains to single
digits, `start_time` of newly-saved rollouts ages out to >30 min while
results.jsonl keeps growing from the in-memory backlog.

Fix: cheap precheck — if the last byte is `\n`, skip the expensive
scan. Common path becomes 3 syscalls (stat / seek / read 1 byte). The
original backward scan still runs in the rare crash-recovery case
when the file does NOT end in `\n`, preserving correctness.

Added `test_well_formed_file_skips_backward_scan` to lock in the
perf invariant: a results.jsonl ending in `\n` must not trigger the
scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review: rather than guarding `_truncate_malformed_trailing_line`
with a cheap last-byte precheck inside `save_new_outputs`, drop the
call from the hot path entirely and run it ONCE on the resume site,
where it's the only place a partial trailing row could exist.

- `save_new_outputs` is now a thin wrapper around `save_outputs` in
  append mode. Its docstring spells out the new contract: callers
  resuming from a possibly-crashed prior run must invoke
  `_truncate_malformed_trailing_line(results_path / "results.jsonl")`
  once before the first append.
- `Environment.evaluate`'s resume branch now does exactly that, right
  after `load_outputs`.
- Updated the existing crash-recovery test to match the new contract
  (explicit truncate before append). All 60 save_utils tests pass.

Only in-tree caller of `save_new_outputs` is the eval's resume path,
so this is a localized contract change with no callsites left to
update.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title perf(save): skip backward-byte scan when results.jsonl ends cleanly perf(save): skip backward-byte scan on hot path May 14, 2026
@mikasenghaas mikasenghaas marked this pull request as ready for review May 14, 2026 22:59
@mikasenghaas mikasenghaas merged commit 4f3516e into main May 14, 2026
15 checks passed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b7f4df0. Configure here.

outputs = load_outputs(results_path)
# Drop any partial trailing row left by a crashed prior write
# so subsequent appends start from a valid JSONL boundary.
truncate_malformed_trailing_line(results_path / "results.jsonl")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truncation skipped when crash precedes metadata creation

Medium Severity

truncate_malformed_trailing_line now only runs inside the resume branch, which requires is_valid_eval_results_path to return True (both results.jsonl and metadata.json must exist). If a crash occurs after save_new_outputs partially writes to results.jsonl but before save_metadata creates metadata.json, a retry with the same results_path takes the fresh-start else branch, skips truncation, and appends new rows directly after the partial data — creating an unrecoverable mid-file corruption that load_outputs will raise on rather than skip.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b7f4df0. Configure here.

@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 14, 2026

Approvability

Verdict: Needs human review

This PR changes crash-recovery behavior by moving truncation logic to only run in the resume branch. An open review comment identifies a potential data corruption scenario when crashes occur before metadata creation. The behavioral change and unresolved bug concern warrant human review.

You can customize Macroscope's approvability policy. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants