perf(save): skip backward-byte scan on hot path#1376
Conversation
`save_new_outputs` runs `_truncate_malformed_trailing_line` on every append. That function walks the file backwards one byte at a time (`seek(); read(1)` per byte) to find the previous `\n` so it can validate the trailing JSONL row. That's only relevant once per run — to clean up a partial row left behind by a crashed previous write. Once the eval is appending complete lines, every subsequent file state already ends in `\n` and the scan does nothing. On a networked filesystem (BeeGFS / NFS), each `read(1)` is a network round trip. For a multi_turn bfcl-v3 rollout (~80 KB trailing line), the scan runs ~80k syscalls per save — tens of seconds wall-time. The client's main asyncio coroutine awaits `asyncio.to_thread(save_new_outputs, ...)` inline in the `as_completed` loop, so dispatch stalls until the save returns. Observed in practice: `vllm.running` drains to single digits, `start_time` of newly-saved rollouts ages out to >30 min while results.jsonl keeps growing from the in-memory backlog. Fix: cheap precheck — if the last byte is `\n`, skip the expensive scan. Common path becomes 3 syscalls (stat / seek / read 1 byte). The original backward scan still runs in the rare crash-recovery case when the file does NOT end in `\n`, preserving correctness. Added `test_well_formed_file_skips_backward_scan` to lock in the perf invariant: a results.jsonl ending in `\n` must not trigger the scan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review: rather than guarding `_truncate_malformed_trailing_line` with a cheap last-byte precheck inside `save_new_outputs`, drop the call from the hot path entirely and run it ONCE on the resume site, where it's the only place a partial trailing row could exist. - `save_new_outputs` is now a thin wrapper around `save_outputs` in append mode. Its docstring spells out the new contract: callers resuming from a possibly-crashed prior run must invoke `_truncate_malformed_trailing_line(results_path / "results.jsonl")` once before the first append. - `Environment.evaluate`'s resume branch now does exactly that, right after `load_outputs`. - Updated the existing crash-recovery test to match the new contract (explicit truncate before append). All 60 save_utils tests pass. Only in-tree caller of `save_new_outputs` is the eval's resume path, so this is a localized contract change with no callsites left to update. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b7f4df0. Configure here.
| outputs = load_outputs(results_path) | ||
| # Drop any partial trailing row left by a crashed prior write | ||
| # so subsequent appends start from a valid JSONL boundary. | ||
| truncate_malformed_trailing_line(results_path / "results.jsonl") |
There was a problem hiding this comment.
Truncation skipped when crash precedes metadata creation
Medium Severity
truncate_malformed_trailing_line now only runs inside the resume branch, which requires is_valid_eval_results_path to return True (both results.jsonl and metadata.json must exist). If a crash occurs after save_new_outputs partially writes to results.jsonl but before save_metadata creates metadata.json, a retry with the same results_path takes the fresh-start else branch, skips truncation, and appends new rows directly after the partial data — creating an unrecoverable mid-file corruption that load_outputs will raise on rather than skip.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b7f4df0. Configure here.
ApprovabilityVerdict: Needs human review This PR changes crash-recovery behavior by moving truncation logic to only run in the resume branch. An open review comment identifies a potential data corruption scenario when crashes occur before metadata creation. The behavioral change and unresolved bug concern warrant human review. You can customize Macroscope's approvability policy. Learn more. |


Summary
save_new_outputscallstruncate_malformed_trailing_lineon every append. The helper walks the file backwards one byte at a time (seek(); read(1)per byte) to find the previous\nand validate the trailing row.That walk only matters once per run — to clean up a partial row a crashed prior write may have left behind. After the first successful append the file ends in
\nand the walk does nothing.On a networked filesystem (BeeGFS / NFS) each
read(1)is a network round trip. For an ~80 KB multi-turn bfcl-v3 rollout the scan does ~80,000 syscalls per save → tens of seconds of wall-time. Becausesave_new_outputsis awaited inline viaasyncio.to_threadinside theas_completedloop, the client's main coroutine parks while the scan runs and no new dispatches happen.Change
Hoist the truncate call out of
save_new_outputsand run it exactly once, at the resume site inEnvironment.generate, right afterload_outputs. Steady-state appends no longer pay the scan cost; crash recovery still works because the resume path runs it before any append.🤖 Generated with Claude Code
Note
Medium Risk
Changes when malformed trailing JSONL cleanup occurs (now only on resume), which could affect correctness if callers append to an existing partially-written
results.jsonlwithout going through the resume path. Otherwise the change is localized and primarily performance-motivated.Overview
Speeds up incremental result saving by removing the backward byte-scan/truncation step from the hot append path in
save_new_outputs, so steady-stateresults.jsonlappends no longer pay an O(file-size) tail check.The malformed-trailing-line cleanup helper is promoted to public (
truncate_malformed_trailing_line) and is now invoked once duringEnvironment.generateresume (afterload_outputs) to repair any partial final row before future appends; tests are updated to reflect that the caller performs the truncation explicitly.Reviewed by Cursor Bugbot for commit b7f4df0. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Move malformed trailing-line truncation out of
save_new_outputsonto the call sitetruncate_malformed_trailing_lineis renamed from a private helper to a public function in save_utils.py, andsave_new_outputsno longer calls it internally.truncate_malformed_trailing_lineexplicitly onresults.jsonlduring the resume-evaluation path, before appending new outputs.save_new_outputsare now responsible for ensuring a valid JSONL boundary before appending; the function itself no longer truncates.Macroscope summarized b7f4df0.