fix: write Harper config files atomically#493
Conversation
Replaces direct fs.writeFileSync calls in configUtils with a temp-file + rename helper so concurrent readers never observe a truncated or empty file during a rewrite. fs.writeFileSync truncates before writing, which let worker threads read a zero-byte config and exit during clone — one node out of a six-node cluster failed with "rootPath config parameter is undefined" and the clone never completed. Applies to all six config writers (createConfigFile, checkForUpdatedConfig, updateConfigValue, applyRuntimeEnvVarConfig, addConfig, deleteConfigFromFile) since the same race exists for any concurrent reader. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Reviewed; no blockers found. |
Adds coverage that chokidar (used by RootConfigWatcher) emits a change event when the watched file is replaced via temp-file + rename rather than direct truncate-write. Confirms the atomic-write change does not silently break runtime config reload paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| function atomicWriteFile(filePath, content) { | ||
| const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`; | ||
| fs.writeFileSync(tempPath, content); | ||
| fs.renameSync(tempPath, filePath); |
There was a problem hiding this comment.
This is a great trick. I used it like 12 years ago as it was the only way to atomically write a file. Nice.
| // Write atomically via temp file + rename so readers don't observe a truncated/empty file | ||
| function atomicWriteFile(filePath, content) { | ||
| const tempPath = `${filePath}.${process.pid}.${Date.now()}.tmp`; | ||
| fs.writeFileSync(tempPath, content); |
There was a problem hiding this comment.
If you put the encoding here, then you don't need the String() calls all over the place.
| fs.writeFileSync(tempPath, content); | |
| fs.writeFileSync(tempPath, content, 'utf-8'); |
kriszyp
left a comment
There was a problem hiding this comment.
This should protect against partial files, but if there is still a race condition of writing to a config file at the same time you are hoping workers read the write, that still... a race condition.
Are you suggesting |
No, I meant that if the main thread needs to write something to harper-config.yaml that the child thread needs to read, using an atomic write just prevents a partial file, but if there was a race condition it still might not see whatever the main thread is concurrently working to write that it wants to communicate. I don't know what that is though. |
fix: write Harper config files atomically
PR #493's atomicWriteFile generated temp paths from ${filePath}.${process.pid}.${Date.now()}.tmp. In Docker, the main Harper process is pid 1 and worker_threads share process.pid (they're threads, not separate processes). When multiple workers spawn and each runs env.initSync -> applyRuntimeEnvVarConfig -> atomicWriteFile in the same millisecond, they generate the same temp path: worker A: writeFileSync(temp) -> exists worker B: writeFileSync(temp) -> exists (overwrites) worker A: renameSync(temp, target) -> ok, temp now gone worker B: renameSync(temp, target) -> ENOENT Reported as: [http/1] [error]: Failed to write config file after applying runtime env vars: ENOENT: no such file or directory, rename '... .yaml.1.1778606941110.tmp' -> '...yaml' Fix: include worker thread id and 4 random bytes in the temp path so concurrent writers in the same process never collide. Added a regression test that asserts 100 sequential calls produce 100 distinct temp paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #493's atomicWriteFile generated temp paths from ${filePath}.${process.pid}.${Date.now()}.tmp. In Docker, the main Harper process is pid 1 and worker_threads share process.pid (they're threads, not separate processes). When multiple workers spawn and each runs env.initSync -> applyRuntimeEnvVarConfig -> atomicWriteFile in the same millisecond, they generate the same temp path: worker A: writeFileSync(temp) -> exists worker B: writeFileSync(temp) -> exists (overwrites) worker A: renameSync(temp, target) -> ok, temp now gone worker B: renameSync(temp, target) -> ENOENT Reported as: [http/1] [error]: Failed to write config file after applying runtime env vars: ENOENT: no such file or directory, rename '... .yaml.1.1778606941110.tmp' -> '...yaml' Fix: include worker thread id and 4 random bytes in the temp path so concurrent writers in the same process never collide. Added a regression test that asserts 100 sequential calls produce 100 distinct temp paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces direct
fs.writeFileSyncinconfig/configUtils.jswith a temp-file +fs.renameSynchelper so concurrent readers never observe a truncated or empty file during a rewrite.Why
fs.writeFileSyncopens the target withO_TRUNCand then writes — between those two calls the file is zero-length. If a reader reads in that window it gets an empty buffer;YAML.parseDocument('')returns a document with no top-level keys, and downstreamconfigValidatorthen throwsrootPath config parameter is undefinedand the worker callsprocess.exit(1).This was hit in the field on a six-node cluster: one node's clone wedged with the worker logs:
The race window is opened by
cloneNode/cloneNode.tscallingcloneConfig()(which rewrites the config file viacreateConfigFile) aftermain()has already started spawning HTTP workers — workers runningenv.initSync()can land theirreadFileSyncexactly during the truncate-to-write interval. Sinceprocess.exitfrom a worker only kills that worker (not the parent), the clone process continued running but in a wedged state with no exit code, so the container never restarted.What changed
atomicWriteFile(filePath, content)inconfig/configUtils.jsthat writes to<filePath>.<pid>.<timestamp>.tmpandfs.renameSyncs over the target.createConfigFile,checkForUpdatedConfig,updateConfigValue,applyRuntimeEnvVarConfig,addConfig,deleteConfigFromFile. The race exists for any concurrent reader, not only clone.applyRuntimeEnvVarConfigwrite test to verify the temp+rename pair.Areas to look at
configUtils.js, not exported. If you'd prefer it lives in a shared utility instead, easy to move.<filePath>.<pid>.<Date.now()>.tmp). Same directory as target so the rename is atomic on the same filesystem. Pid+timestamp avoids collisions across processes/threads. Worth confirming this is acceptable on the deployment filesystems we care about.addConfigasync → sync conversion. Line 966 wasawait fs.writeFile(...); now uses the sync helper.addConfigis in an async function but called rarely (config-add ops API), so blocking briefly is fine. Flag if you'd rather an async variant.cloneNode.tsordering orenvironmentManager.initSync'sprocess.exit. The atomic write closes the actual race; reordering clone vsmain()is a structural change with broader risk surface, andprocess.exitfrom a worker is a separate footgun worth its own PR.Test plan
npm run test:unit:configpasses locally (51 passing)