Skip to content

Traces overwritten/corrupted in long-running sessions (multiple concurrency vectors) #865

@sahrizvi

Description

@sahrizvi

Summary

In long-running altimate-code sessions, the local trace file at ~/.local/share/altimate-code/traces/<sessionId>.json can end up:

  • Missing the most recent events (stale content), or
  • Showing a stale terminating status (e.g. "completed" after a crash that should have written "crashed")

Concurrency bugs in the trace pipeline (packages/opencode/src/altimate/observability/tracing.ts). At least three distinct mechanisms have been confirmed; two of them fire 100% of the time under realistic conditions with no synthetic timing injection.

This is the umbrella issue for the bug class. M1 (the one originally found via CI flake and partially mitigated in commit 38463876b) is documented here as one vector; M2 and M3 are the load-bearing production failure paths.

Reproductions

All tests live on branch investigate/trace-corruption at packages/opencode/test/altimate/tracing-rename-race.test.ts.

# Mechanism Setup Trials Hit rate Notes
1 M1 — flushSync content clobbered by in-flight snapshot rename fs.rename patched to 200 ms 1 1/1 Deterministic only
2 M1 natural (basic) 2k preloaded spans + N-1 worker_threads busy loops 5 0/5 crashed flag from commit 38463876b protects
3 M1 natural (aggressive) 5k spans × 5 KB payload (~25 MB JSON) + 2× CPU oversubscription + 200 concurrent 1 MB I/O noise writes 10 0/10 Still doesn't fire — kernel rename on APFS/local SSD is too fast
4 M2 — debounce drop + no follow-up snapshot fs.rename patched to 150 ms 1 1/1 Deterministic
5 M2 natural 800 preloaded bulky spans, 8-event burst, no fs patch 20 20/20 Every trial loses 7 of 8 burst events
6 M3FileExporter.export() race vs flushSync (no crashed guard at all) fs.rename patched to 200 ms 1 1/1 Deterministic
7 M3 natural 5k × 5 KB preload, no patch, ~52 MB JSON 10 10/10 Every trial: file ends as status: "completed" with no error
8 Baseline (flushSync alone, no race) n/a 1 n/a Passes — correct behavior when uncontested

To run locally:

git fetch origin investigate/trace-corruption
git checkout investigate/trace-corruption
cd packages/opencode && bun test test/altimate/tracing-rename-race.test.ts

Expected: baseline passes, all M*-natural tests fail by design (they assert the bug fires).

Mechanism 1 — flushSync vs in-flight snapshot rename

tracing.ts:765-774:

this.snapshotPromise = fs
  .mkdir(this.snapshotDir, { recursive: true })
  .then(() => fs.writeFile(tmpPath, JSON.stringify(trace, null, 2)))
  .then(() => {
    if (this.crashed) return fs.unlink(tmpPath).catch(() => {})   // ← TOCTOU
    return fs.rename(tmpPath, filePath)                            // ← kernel rename runs later
  })

The crashed check is synchronous; fs.rename is async. Once the check has passed and the rename syscall is queued, flushSync running synchronously can't recall it — when the kernel completes the rename, it overwrites flushSync's crashed file with the snapshot's pre-crash content.

Status: documented design hazard. Doesn't fire naturally on common hardware (local SSD, APFS, ext4) because the rename completes in microseconds and the crashed flag closes the JS-level window. Could plausibly fire on slow filesystems (FUSE, network mounts, iCloud-synced trace dirs, encrypted FS).

Mechanism 2 — debounce drops events with no follow-up snapshot

tracing.ts:751-783. When a snapshot is in flight, subsequent snapshot() calls return immediately and are never re-scheduled for after the in-flight write:

private snapshot() {
  if (!this.snapshotDir || !this.sessionId) return
  if (this.snapshotPending) return   // ← dropped, no follow-up registered
  if (this.crashed) return
  this.snapshotPending = true
  // ... async write ...
}

In a real session, an LLM turn fires a burst of tool calls back-to-back. The first triggers a snapshot. While the writeFile (~50-100 ms on a multi-MB JSON) is in flight, the next 7-8 tool calls all see snapshotPending = true and return without scheduling anything. When the in-flight snapshot completes, snapshotPending is reset to false, but no catch-up snapshot is run. The disk file remains at the in-flight snapshot's state until a fresh event arrives in a future turn.

If the process exits between the in-flight snapshot completing and the next event:

This is the dominant production failure path. Fires 20/20 naturally with no synthetic timing.

Mechanism 20 — FileExporter.export() race vs flushSync (no guard)

tracing.ts:174-196. endTrace() calls exporter.export(trace), which does:

await fs.mkdir(this.dir, { recursive: true })
await fs.writeFile(tmpPath, JSON.stringify(trace, null, 2))   // ← long on big JSON
await fs.rename(tmpPath, filePath)

There is no this.crashed check anywhere in this method. The crashed flag added in commit 38463876b was applied to snapshot() only. If flushSync runs during endTrace's export (e.g. SIGTERM during shutdown), it writes crashed content synchronously — and then the export's rename lands and overwrites it.

The window here is the writeFile step (tens to hundreds of ms on a multi-MB trace), not the rename (microseconds). That's why M3 fires naturally on common hardware while M1 does not.

Status: confirmed production hazard. Fires 10/10 naturally.

Fix sketch

Each mechanism needs its own fix; they're partially independent.

Vector Fix direction
M2 In snapshot()'s .finally, check whether spans.length or a version counter advanced during the in-flight write. If yes, schedule one follow-up snapshot. Bounded: at most one extra write per "burst → quiet" cycle.
M3 Add the same crashed check to FileExporter.export() (and any other write path). Or — preferable — refactor so a single canonical writer owns the atomic-write contract; the exporter just hands off the TraceFile object.
M1 The TOCTOU is architecturally unfixable while sharing the canonical filename. Options: (a) write the crashed snapshot to <sessionId>.crashed.json and have readers prefer that file when present; (b) use a write-then-stat verification in flushSync; (c) accept it as a known hazard on local SSD and document the non-POSIX FS risk.

Filing the PR against this issue once M2 + M3 fixes are landed and re-tested.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions