Skip to content

Commit 3846387

Browse files
anandgupta42claude
andcommitted
fix(tracing): de-flake CI by guarding async snapshot vs sync flushSync race
CI run 25143783932 (commit ee73863, agent_outcome cherry-pick) failed with `flushSync — crash recovery > flushSync preserves all accumulated data` after passing locally and on prior + subsequent runs. Root cause analysis surfaced a real race in the tracer: The `Recap` (tracing.ts) `snapshot()` method writes asynchronously via `fs.writeFile(tmpPath) → fs.rename(tmpPath, filePath)` for atomicity, storing the in-flight promise in `this.snapshotPromise` but never awaiting it. `flushSync()` writes synchronously to the SAME `filePath` via `fsSync.writeFileSync`. On slow CI runners with disk contention, a snapshot that was in-flight when flushSync ran could complete its `fs.rename` AFTER flushSync's sync write — clobbering the crashed-trace file with stale (non-crashed) content. The test then read the wrong content and failed assertions like `summary.status === "crashed"`. Fix: - Add a `private crashed = false` flag on the Recap class. - `flushSync()` sets `this.crashed = true` BEFORE its sync write so any in-flight async snapshot() can detect the takeover. - `snapshot()` checks `this.crashed` at two points: - Entry: bail before starting a new write. - Just before `fs.rename`: if a flushSync ran during the write, drop the temp file and skip the rename (so the crashed file stays canonical). Why not just await snapshotPromise in flushSync: flushSync is invoked from SIGINT/SIGTERM/beforeExit handlers where async code may not run. The flag-based guard makes the race deterministic without depending on event-loop progression after the signal. Verification: - The flaky test now passes 10/10 times locally. - Broader regression run: 3554/3554 tests pass across test/altimate/ + test/upstream/ + test/branding/. - Single-file run: 27/27 pass in tracing-display-crash.test.ts every time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3aaf064 commit 3846387

1 file changed

Lines changed: 19 additions & 2 deletions

File tree

  • packages/opencode/src/altimate/observability

packages/opencode/src/altimate/observability/tracing.ts

Lines changed: 19 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -328,6 +328,11 @@ export class Trace {
328328
private snapshotDir: string | undefined
329329
private snapshotPending = false
330330
private snapshotPromise: Promise<void> | undefined
331+
// Set true when flushSync runs. Prevents in-flight async snapshot()
332+
// calls from racing with the synchronous crash write — without this,
333+
// a still-pending snapshot's `fs.rename` can overwrite flushSync's
334+
// crashed-trace content. Round-3 audit + CI flake on slow runners.
335+
private crashed = false
331336

332337
private constructor(exporters: TraceExporter[]) {
333338
this.traceId = randomUUIDv7()
@@ -745,17 +750,26 @@ export class Trace {
745750
private snapshot() {
746751
if (!this.snapshotDir || !this.sessionId) return
747752
if (this.snapshotPending) return // Debounce — only one in flight at a time
753+
if (this.crashed) return // flushSync wrote the canonical crashed file; do NOT race it
748754
this.snapshotPending = true
749755

750756
const trace = this.buildTraceFile()
751757
const safeId = (this.sessionId || "unknown").replace(/[/\\.:]/g, "_") || "unknown"
752758
const filePath = path.join(this.snapshotDir, `${safeId}.json`)
753759
const tmpPath = filePath + `.tmp.${Date.now()}.${Math.random().toString(36).slice(2, 8)}`
754760

755-
// Atomic write: write to temp file, then rename (prevents partial reads)
761+
// Atomic write: write to temp file, then rename (prevents partial reads).
762+
// We re-check `this.crashed` immediately before rename so a flushSync that
763+
// ran *during* the write doesn't get clobbered.
756764
this.snapshotPromise = fs.mkdir(this.snapshotDir, { recursive: true })
757765
.then(() => fs.writeFile(tmpPath, JSON.stringify(trace, null, 2)))
758-
.then(() => fs.rename(tmpPath, filePath))
766+
.then(() => {
767+
if (this.crashed) {
768+
// flushSync took over — drop the temp and bail
769+
return fs.unlink(tmpPath).catch(() => {})
770+
}
771+
return fs.rename(tmpPath, filePath)
772+
})
759773
.catch((err) => {
760774
Log.Default.debug(`[tracing] failed to write trace snapshot: ${err}`)
761775
fs.unlink(tmpPath).catch(() => {})
@@ -948,6 +962,9 @@ export class Trace {
948962
flushSync(error?: string) {
949963
try {
950964
if (!this.snapshotDir || !this.sessionId) return
965+
// Set crashed BEFORE writing so any in-flight async snapshot() will
966+
// bail out at its rename step instead of clobbering our crashed file.
967+
this.crashed = true
951968
this.currentGenerationSpanId = undefined
952969
const rootSpan = this.spans.find((s) => s.spanId === this.rootSpanId)
953970
if (rootSpan) {

0 commit comments

Comments
 (0)