Skip to content

fix: write Harper config files atomically#493

Merged
kriszyp merged 2 commits into
mainfrom
fix/clone-config-atomic-write
May 8, 2026
Merged

fix: write Harper config files atomically#493
kriszyp merged 2 commits into
mainfrom
fix/clone-config-atomic-write

Conversation

@Devin-Holland
Copy link
Copy Markdown
Member

Summary

Replaces direct fs.writeFileSync in config/configUtils.js with a temp-file + fs.renameSync helper so concurrent readers never observe a truncated or empty file during a rewrite.

Why

fs.writeFileSync opens the target with O_TRUNC and then writes — between those two calls the file is zero-length. If a reader reads in that window it gets an empty buffer; YAML.parseDocument('') returns a document with no top-level keys, and downstream configValidator then throws rootPath config parameter is undefined and the worker calls process.exit(1).

This was hit in the field on a six-node cluster: one node's clone wedged with the worker logs:

[http/4] [warn]: Cannot apply runtime env config: rootPath not found in config
[http/4] [error]: Error initializing environment manager
[http/4] [error]: rootPath config parameter is undefined

The race window is opened by cloneNode/cloneNode.ts calling cloneConfig() (which rewrites the config file via createConfigFile) after main() has already started spawning HTTP workers — workers running env.initSync() can land their readFileSync exactly during the truncate-to-write interval. Since process.exit from a worker only kills that worker (not the parent), the clone process continued running but in a wedged state with no exit code, so the container never restarted.

What changed

  • New internal helper atomicWriteFile(filePath, content) in config/configUtils.js that writes to <filePath>.<pid>.<timestamp>.tmp and fs.renameSyncs over the target.
  • Replaced six writers: createConfigFile, checkForUpdatedConfig, updateConfigValue, applyRuntimeEnvVarConfig, addConfig, deleteConfigFromFile. The race exists for any concurrent reader, not only clone.
  • Test coverage: 4 new unit tests for the helper itself; updated the existing applyRuntimeEnvVarConfig write test to verify the temp+rename pair.

Areas to look at

  • Helper placement and signature. Single internal function in configUtils.js, not exported. If you'd prefer it lives in a shared utility instead, easy to move.
  • Temp file naming (<filePath>.<pid>.<Date.now()>.tmp). Same directory as target so the rename is atomic on the same filesystem. Pid+timestamp avoids collisions across processes/threads. Worth confirming this is acceptable on the deployment filesystems we care about.
  • The addConfig async → sync conversion. Line 966 was await fs.writeFile(...); now uses the sync helper. addConfig is in an async function but called rarely (config-add ops API), so blocking briefly is fine. Flag if you'd rather an async variant.
  • Out of scope, intentionally. Did NOT touch cloneNode.ts ordering or environmentManager.initSync's process.exit. The atomic write closes the actual race; reordering clone vs main() is a structural change with broader risk surface, and process.exit from a worker is a separate footgun worth its own PR.

Test plan

  • npm run test:unit:config passes locally (51 passing)
  • Manually verify a fresh clone on a multi-node cluster no longer wedges

Replaces direct fs.writeFileSync calls in configUtils with a temp-file +
rename helper so concurrent readers never observe a truncated or empty
file during a rewrite. fs.writeFileSync truncates before writing, which
let worker threads read a zero-byte config and exit during clone — one
node out of a six-node cluster failed with "rootPath config parameter
is undefined" and the clone never completed.

Applies to all six config writers (createConfigFile, checkForUpdatedConfig,
updateConfigValue, applyRuntimeEnvVarConfig, addConfig, deleteConfigFromFile)
since the same race exists for any concurrent reader.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Devin-Holland Devin-Holland requested a review from a team as a code owner May 7, 2026 19:06
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

Reviewed; no blockers found.

Adds coverage that chokidar (used by RootConfigWatcher) emits a change
event when the watched file is replaced via temp-file + rename rather
than direct truncate-write. Confirms the atomic-write change does not
silently break runtime config reload paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Devin-Holland Devin-Holland requested a review from kriszyp May 7, 2026 20:30
Comment thread config/configUtils.js
function atomicWriteFile(filePath, content) {
const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
fs.writeFileSync(tempPath, content);
fs.renameSync(tempPath, filePath);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great trick. I used it like 12 years ago as it was the only way to atomically write a file. Nice.

Comment thread config/configUtils.js
// Write atomically via temp file + rename so readers don't observe a truncated/empty file
function atomicWriteFile(filePath, content) {
const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`;
fs.writeFileSync(tempPath, content);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put the encoding here, then you don't need the String() calls all over the place.

Suggested change
fs.writeFileSync(tempPath, content);
fs.writeFileSync(tempPath, content, 'utf-8');

Copy link
Copy Markdown
Member

@kriszyp kriszyp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should protect against partial files, but if there is still a race condition of writing to a config file at the same time you are hoping workers read the write, that still... a race condition.

@cb1kenobi
Copy link
Copy Markdown
Member

This should protect against partial files, but if there is still a race condition of writing to a config file at the same time you are hoping workers read the write, that still... a race condition.

Are you suggesting ${filePath}.${process.pid}.${Date.now()}.tmp may conflict? I've seen that happen and why I now use randomBytes(8).toString('hex') for temp filenames.

@kriszyp
Copy link
Copy Markdown
Member

kriszyp commented May 8, 2026

This should protect against partial files, but if there is still a race condition of writing to a config file at the same time you are hoping workers read the write, that still... a race condition.

Are you suggesting ${filePath}.${process.pid}.${Date.now()}.tmp may conflict? I've seen that happen and why I now use randomBytes(8).toString('hex') for temp filenames.

No, I meant that if the main thread needs to write something to harper-config.yaml that the child thread needs to read, using an atomic write just prevents a partial file, but if there was a race condition it still might not see whatever the main thread is concurrently working to write that it wants to communicate. I don't know what that is though.

@kriszyp kriszyp merged commit dea2df9 into main May 8, 2026
28 of 29 checks passed
@kriszyp kriszyp deleted the fix/clone-config-atomic-write branch May 8, 2026 13:16
github-actions Bot pushed a commit that referenced this pull request May 8, 2026
fix: write Harper config files atomically
kriszyp pushed a commit that referenced this pull request May 12, 2026
PR #493's atomicWriteFile generated temp paths from
${filePath}.${process.pid}.${Date.now()}.tmp. In Docker, the main Harper
process is pid 1 and worker_threads share process.pid (they're threads,
not separate processes). When multiple workers spawn and each runs
env.initSync -> applyRuntimeEnvVarConfig -> atomicWriteFile in the same
millisecond, they generate the same temp path:

  worker A: writeFileSync(temp) -> exists
  worker B: writeFileSync(temp) -> exists (overwrites)
  worker A: renameSync(temp, target) -> ok, temp now gone
  worker B: renameSync(temp, target) -> ENOENT

Reported as:
  [http/1] [error]: Failed to write config file after applying runtime
    env vars: ENOENT: no such file or directory, rename '...
    .yaml.1.1778606941110.tmp' -> '...yaml'

Fix: include worker thread id and 4 random bytes in the temp path so
concurrent writers in the same process never collide. Added a regression
test that asserts 100 sequential calls produce 100 distinct temp paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit that referenced this pull request May 12, 2026
PR #493's atomicWriteFile generated temp paths from
${filePath}.${process.pid}.${Date.now()}.tmp. In Docker, the main Harper
process is pid 1 and worker_threads share process.pid (they're threads,
not separate processes). When multiple workers spawn and each runs
env.initSync -> applyRuntimeEnvVarConfig -> atomicWriteFile in the same
millisecond, they generate the same temp path:

  worker A: writeFileSync(temp) -> exists
  worker B: writeFileSync(temp) -> exists (overwrites)
  worker A: renameSync(temp, target) -> ok, temp now gone
  worker B: renameSync(temp, target) -> ENOENT

Reported as:
  [http/1] [error]: Failed to write config file after applying runtime
    env vars: ENOENT: no such file or directory, rename '...
    .yaml.1.1778606941110.tmp' -> '...yaml'

Fix: include worker thread id and 4 random bytes in the temp path so
concurrent writers in the same process never collide. Added a regression
test that asserts 100 sequential calls produce 100 distinct temp paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants