Traces overwritten/corrupted in long-running sessions (multiple concurrency vectors)

## Summary

In long-running altimate-code sessions, the local trace file at `~/.local/share/altimate-code/traces/<sessionId>.json` can end up:

- **Missing the most recent events** (stale content), or
- **Showing a stale terminating status** (e.g. `"completed"` after a crash that should have written `"crashed"`)

Concurrency bugs in the trace pipeline (`packages/opencode/src/altimate/observability/tracing.ts`). At least three distinct mechanisms have been confirmed; two of them fire 100% of the time under realistic conditions with no synthetic timing injection.

This is the umbrella issue for the bug class. M1 (the one originally found via CI flake and partially mitigated in commit `38463876b`) is documented here as one vector; M2 and M3 are the load-bearing production failure paths.

## Reproductions

All tests live on branch [`investigate/trace-corruption`](https://github.com/AltimateAI/altimate-code/tree/investigate/trace-corruption) at `packages/opencode/test/altimate/tracing-rename-race.test.ts`.

| # | Mechanism | Setup | Trials | Hit rate | Notes |
|---|---|---|---|---|---|
| 1 | **M1** — flushSync content clobbered by in-flight snapshot rename | `fs.rename` patched to 200 ms | 1 | 1/1 | Deterministic only |
| 2 | M1 natural (basic) | 2k preloaded spans + N-1 worker_threads busy loops | 5 | 0/5 | `crashed` flag from commit `38463876b` protects |
| 3 | M1 natural (aggressive) | 5k spans × 5 KB payload (~25 MB JSON) + 2× CPU oversubscription + 200 concurrent 1 MB I/O noise writes | 10 | 0/10 | Still doesn't fire — kernel rename on APFS/local SSD is too fast |
| 4 | **M2** — debounce drop + no follow-up snapshot | `fs.rename` patched to 150 ms | 1 | 1/1 | Deterministic |
| 5 | **M2 natural** | 800 preloaded bulky spans, 8-event burst, no fs patch | 20 | **20/20** | Every trial loses 7 of 8 burst events |
| 6 | **M3** — `FileExporter.export()` race vs flushSync (no `crashed` guard at all) | `fs.rename` patched to 200 ms | 1 | 1/1 | Deterministic |
| 7 | **M3 natural** | 5k × 5 KB preload, no patch, ~52 MB JSON | 10 | **10/10** | Every trial: file ends as `status: "completed"` with no error |
| 8 | Baseline (flushSync alone, no race) | n/a | 1 | n/a | Passes — correct behavior when uncontested |

To run locally:

```bash
git fetch origin investigate/trace-corruption
git checkout investigate/trace-corruption
cd packages/opencode && bun test test/altimate/tracing-rename-race.test.ts
```

Expected: baseline passes, all `M*-natural` tests *fail by design* (they assert the bug fires).

## Mechanism 1 — flushSync vs in-flight snapshot rename

`tracing.ts:765-774`:

```ts
this.snapshotPromise = fs
  .mkdir(this.snapshotDir, { recursive: true })
  .then(() => fs.writeFile(tmpPath, JSON.stringify(trace, null, 2)))
  .then(() => {
    if (this.crashed) return fs.unlink(tmpPath).catch(() => {})   // ← TOCTOU
    return fs.rename(tmpPath, filePath)                            // ← kernel rename runs later
  })
```

The `crashed` check is synchronous; `fs.rename` is async. Once the check has passed and the rename syscall is queued, `flushSync` running synchronously can't recall it — when the kernel completes the rename, it overwrites flushSync's crashed file with the snapshot's pre-crash content.

**Status: documented design hazard.** Doesn't fire naturally on common hardware (local SSD, APFS, ext4) because the rename completes in microseconds and the `crashed` flag closes the JS-level window. Could plausibly fire on slow filesystems (FUSE, network mounts, iCloud-synced trace dirs, encrypted FS).

## Mechanism 2 — debounce drops events with no follow-up snapshot

`tracing.ts:751-783`. When a snapshot is in flight, subsequent `snapshot()` calls return immediately and **are never re-scheduled** for after the in-flight write:

```ts
private snapshot() {
  if (!this.snapshotDir || !this.sessionId) return
  if (this.snapshotPending) return   // ← dropped, no follow-up registered
  if (this.crashed) return
  this.snapshotPending = true
  // ... async write ...
}
```

In a real session, an LLM turn fires a burst of tool calls back-to-back. The first triggers a snapshot. While the writeFile (~50-100 ms on a multi-MB JSON) is in flight, the next 7-8 tool calls all see `snapshotPending = true` and return without scheduling anything. When the in-flight snapshot completes, `snapshotPending` is reset to `false`, but **no catch-up snapshot is run**. The disk file remains at the in-flight snapshot's state until a fresh event arrives in a future turn.

If the process exits between the in-flight snapshot completing and the next event:
- `flushSync` may or may not run (see #593).
- If `flushSync` runs, M3 may overwrite it (see below).
- If `flushSync` doesn't run, the file is permanently stale, missing all burst events after the first.

**This is the dominant production failure path.** Fires 20/20 naturally with no synthetic timing.

## Mechanism 20 — `FileExporter.export()` race vs flushSync (no guard)

`tracing.ts:174-196`. `endTrace()` calls `exporter.export(trace)`, which does:

```ts
await fs.mkdir(this.dir, { recursive: true })
await fs.writeFile(tmpPath, JSON.stringify(trace, null, 2))   // ← long on big JSON
await fs.rename(tmpPath, filePath)
```

There is **no `this.crashed` check anywhere in this method**. The `crashed` flag added in commit `38463876b` was applied to `snapshot()` only. If `flushSync` runs during `endTrace`'s export (e.g. SIGTERM during shutdown), it writes crashed content synchronously — and then the export's rename lands and overwrites it.

The window here is the `writeFile` step (tens to hundreds of ms on a multi-MB trace), not the rename (microseconds). That's why M3 fires *naturally* on common hardware while M1 does not.

**Status: confirmed production hazard.** Fires 10/10 naturally.

## Fix sketch

Each mechanism needs its own fix; they're partially independent.

| Vector | Fix direction |
|---|---|
| M2 | In `snapshot()`'s `.finally`, check whether `spans.length` or a version counter advanced during the in-flight write. If yes, schedule one follow-up snapshot. Bounded: at most one extra write per "burst → quiet" cycle. |
| M3 | Add the same `crashed` check to `FileExporter.export()` (and any other write path). Or — preferable — refactor so a single canonical writer owns the atomic-write contract; the exporter just hands off the `TraceFile` object. |
| M1 | The TOCTOU is architecturally unfixable while sharing the canonical filename. Options: (a) write the crashed snapshot to `<sessionId>.crashed.json` and have readers prefer that file when present; (b) use a write-then-stat verification in flushSync; (c) accept it as a known hazard on local SSD and document the non-POSIX FS risk. |

Filing the PR against this issue once M2 + M3 fixes are landed and re-tested.

## Related

- Commit [`38463876b`](https://github.com/AltimateAI/altimate-code/commit/38463876b) — partial M1 fix (the `crashed` flag)
- #593 / PR #594 — orthogonal: whether `flushSync` ever runs at all on TUI crashes
- #644 / PR #646 — different race in this area (was `export()` vs `snapshot()` non-atomic writes; fixed by making `export()` atomic). Note that this PR introduced the `tmp + rename` pattern in `export()` that M3 races against.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Traces overwritten/corrupted in long-running sessions (multiple concurrency vectors) #865

Summary

Reproductions

Mechanism 1 — flushSync vs in-flight snapshot rename

Mechanism 2 — debounce drops events with no follow-up snapshot

Mechanism 20 — `FileExporter.export()` race vs flushSync (no guard)

Fix sketch

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

#	Mechanism	Setup	Trials	Hit rate	Notes
1	M1 — flushSync content clobbered by in-flight snapshot rename	`fs.rename` patched to 200 ms	1	1/1	Deterministic only
2	M1 natural (basic)	2k preloaded spans + N-1 worker_threads busy loops	5	0/5	`crashed` flag from commit `38463876b` protects
3	M1 natural (aggressive)	5k spans × 5 KB payload (~25 MB JSON) + 2× CPU oversubscription + 200 concurrent 1 MB I/O noise writes	10	0/10	Still doesn't fire — kernel rename on APFS/local SSD is too fast
4	M2 — debounce drop + no follow-up snapshot	`fs.rename` patched to 150 ms	1	1/1	Deterministic
5	M2 natural	800 preloaded bulky spans, 8-event burst, no fs patch	20	20/20	Every trial loses 7 of 8 burst events
6	M3 — `FileExporter.export()` race vs flushSync (no `crashed` guard at all)	`fs.rename` patched to 200 ms	1	1/1	Deterministic
7	M3 natural	5k × 5 KB preload, no patch, ~52 MB JSON	10	10/10	Every trial: file ends as `status: "completed"` with no error
8	Baseline (flushSync alone, no race)	n/a	1	n/a	Passes — correct behavior when uncontested

Vector	Fix direction
M2	In `snapshot()`'s `.finally`, check whether `spans.length` or a version counter advanced during the in-flight write. If yes, schedule one follow-up snapshot. Bounded: at most one extra write per "burst → quiet" cycle.
M3	Add the same `crashed` check to `FileExporter.export()` (and any other write path). Or — preferable — refactor so a single canonical writer owns the atomic-write contract; the exporter just hands off the `TraceFile` object.
M1	The TOCTOU is architecturally unfixable while sharing the canonical filename. Options: (a) write the crashed snapshot to `<sessionId>.crashed.json` and have readers prefer that file when present; (b) use a write-then-stat verification in flushSync; (c) accept it as a known hazard on local SSD and document the non-POSIX FS risk.

Traces overwritten/corrupted in long-running sessions (multiple concurrency vectors) #865

Description

Summary

Reproductions

Mechanism 1 — flushSync vs in-flight snapshot rename

Mechanism 2 — debounce drops events with no follow-up snapshot

Mechanism 20 — FileExporter.export() race vs flushSync (no guard)

Fix sketch

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Mechanism 20 — `FileExporter.export()` race vs flushSync (no guard)