Skip to content

fix(cache): atomic strong+weak alias via hardlink to eliminate concurrent put race (#70)#81

Merged
sgwannabe merged 4 commits intomainfrom
fix/i70-cache-concurrency-race
Apr 25, 2026
Merged

fix(cache): atomic strong+weak alias via hardlink to eliminate concurrent put race (#70)#81
sgwannabe merged 4 commits intomainfrom
fix/i70-cache-concurrency-race

Conversation

@sgwannabe
Copy link
Copy Markdown
Contributor

@sgwannabe sgwannabe commented Apr 25, 2026

Summary

  • scripts/preview-cache.sh cmd_put now publishes the weak alias as a hardlink (ln -f) against the strong-key inode instead of as an independent cp+mv. Strong and weak names flip in lock-step for every concurrent cmd_get observer, closing the I-8 race that occasionally produced strong-HIT / weak-MISS and silently re-triggered the Socratic interview.
  • New cmd_get_with_fallback helper (CLI: get-fallback STRONG WEAK) implements Option C self-heal: legacy v1.6.1 entries written before this fix, or any selectively-invalidated half-pair, repair the missing side on first read by hardlinking against the surviving file.
  • Two fixtures under tests/fixtures/cache-concurrency/ pin the invariant: 5-way concurrent put (all 5 pairs share inode, byte-identical) and self-heal (strong-only, then weak-only seeded → missing side restored, inodes match).

Root cause

The previous cmd_put wrote the strong key and the weak alias as two independent mktemp+mv sequences. Between the two mv calls, a concurrent runner could observe the strong file on disk while the weak alias did not yet exist, satisfying cmd_get(strong) but failing cmd_get(weak). No data corruption, but UX degradation: the one-click replay promise broke whenever two runs raced. Hardlink (link(2) on the same FS) is atomic from readers' perspective and produces a single shared inode, so the alias name and content become visible in one step.

Test plan

  • bash tests/fixtures/cache-concurrency/test-5way.sh (5 concurrent puts, all pairs share inode)
  • bash tests/fixtures/cache-concurrency/test-self-heal.sh (strong-only and weak-only half-pair recovery)
  • bash tests/fixtures/security/verify-security.sh — full Phase 1 suite still green (T-9.4 atomic concurrent-put unaffected)
  • bash -n scripts/preview-cache.sh syntax check
  • CI macOS + Linux runs (T-12 matrix)

Closes #70

Summary by CodeRabbit

릴리스 노트

  • 새 기능

    • 캐시 폴백 검색을 위한 새로운 get-fallback 하위 명령 추가 (강력한 캐시 키 우선 시도, 약한 별칭 폴백 지원)
    • 약한 별칭 자동 복구 기능 구현
    • 동시 캐시 읽기/쓰기 중 경쟁 조건 방지를 위한 발행 순서 개선
  • 테스트

    • 캐시 동시성 및 경쟁 조건 검증을 위한 회귀 테스트 추가
    • 하드링크 불변성 및 바이트 일관성 확인

sgwannabe and others added 2 commits April 25, 2026 16:10
Replace the second `mv` in cmd_put with `ln -f` so the strong key and
weak alias share a single inode — concurrent cmd_get observers can no
longer see strong-HIT / weak-MISS, which was silently re-triggering the
Socratic interview. Add cmd_get_with_fallback (Option C) plus a
`get-fallback` dispatch entry to self-heal half-pair states (legacy
v1.6.1 entries written before this fix, or selectively-invalidated
aliases) by recreating the missing side as a hardlink on first read.

Closes #70

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two fixtures pinning the W1.4 hardlink fix:
- test-5way.sh spawns 5 concurrent cmd_put calls and asserts every
  strong/weak pair shares a single inode and identical content.
- test-self-heal.sh seeds half-pair states (strong-only, then
  weak-only) and asserts get-fallback restores the missing side with
  a matching inode.
Both detect macOS vs Linux at runtime to pick the portable
`stat -f %i` / `stat -c %i` invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d67b2cf3-387f-4664-beb0-d56d57b71d85

📥 Commits

Reviewing files that changed from the base of the PR and between 8b57ed2 and 6c4b92a.

📒 Files selected for processing (5)
  • scripts/preview-cache.sh
  • tests/fixtures/cache-concurrency/README.md
  • tests/fixtures/cache-concurrency/test-5way.sh
  • tests/fixtures/cache-concurrency/test-race-window.sh
  • tests/fixtures/cache-concurrency/test-self-heal.sh

🔍 동작 개요

scripts/preview-cache.sh에 새로운 get-fallback 서브커맨드를 추가하여 강력한 캐시 키를 먼저 시도한 후 약한 별칭으로 폴백하고, 서로 다른 종료 코드를 반환합니다. cmd_put의 약한 별칭 동작을 변경하여 원자적 tmp 스테이징(hardlink 선호, copy 폴백)을 사용하고 강력한 키 앞에 약한 별칭을 먼저 발행하여 동시 실행 중 부분 쓰기로 인한 race 조건을 방지합니다.

📋 변경 사항

내용 / 파일(들) 요약
캐시 서브커맨드 및 동시 실행 처리
scripts/preview-cache.sh
새로운 cmd_get_with_fallback() 함수 추가 (강력한 키 → 약한 별칭 → 자동 self-heal 폴백). cmd_put을 재구성하여 tmp 디렉토리에서 두 파일을 작성한 후 원자적 mv로 별칭을 먼저 발행한 후 강력한 키를 발행하도록 변경. 키 정규화 및 선택적 self-heal 로직 포함. 강력한 hit(종료 0) / 약한 hit(종료 2) / miss(종료 1)에 대한 명확한 종료 코드.
동시 실행 테스트 문서 및 스위트
tests/fixtures/cache-concurrency/README.md, tests/fixtures/cache-concurrency/test-*.sh
별칭 우선 발행 순서, cmd_get_with_fallback 계약(3가지 종료 코드), hardlink 불변식 검증을 설명하는 새로운 문서. 5개 병렬 put 작업의 동시 실행 회귀 테스트(test-5way.sh), 별칭 우선 발행 순서의 정적 소스 수준 검증(test-race-window.sh), 4가지 캐시 상태에서 self-heal 검증 및 inode/바이트 동등성 검사(test-self-heal.sh).

🔄 시나리오 다이어그램

sequenceDiagram
    participant Caller as 호출자
    participant GF as get-fallback
    participant SK as 강력한<br/>키 조회
    participant WK as 약한<br/>별칭 조회
    participant SH as Self-heal<br/>로직
    participant Disk as 디스크

    Caller->>GF: get-fallback(strong_key, weak_key)
    GF->>SK: cmd_get(strong_key) 시도
    SK->>Disk: 강력한 키 파일 읽기
    alt 강력한 키 존재
        Disk-->>SK: 콘텐츠 반환
        SK-->>GF: 성공 (exit 0)
        GF->>SH: 약한 별칭<br/>self-heal?
        SH->>Disk: hardlink<br/>생성
        GF-->>Caller: 콘텐츠 반환<br/>(exit 0)
    else 강력한 키 부재
        Disk-->>SK: 실패
        SK-->>GF: 실패
        GF->>WK: cmd_get(weak_key) 시도
        WK->>Disk: 약한 별칭 파일 읽기
        alt 약한 별칭 존재
            Disk-->>WK: 콘텐츠 반환
            WK-->>GF: 소프트 hit (exit 2)
            GF-->>Caller: 콘텐츠 반환<br/>(exit 2)
        else 약한 별칭도 부재
            Disk-->>WK: 실패
            WK-->>GF: 실패
            GF-->>Caller: 모두 miss<br/>(exit 1)
        end
    end
Loading

⏱️ 예상 코드 리뷰 소요 시간

🎯 3 (Moderate) | ⏱️ ~25 분

🔗 관련 가능성 있는 PR

🐰 축제의 시

🏃‍♂️ 우리가 경쟁할 때, race를 피해야지
⚛️ Atomic moves와 hardlinks로 안전하게
🔄 강한 키 먼저, 약한 별칭은 나중에
🎯 Self-heal로 부상자도 구한다
✨ 한 번의 클릭이 정말 된다!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/i70-cache-concurrency-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition in the preview cache by using hardlinks for weak aliases, ensuring that strong and weak keys are published atomically. It also introduces a get-fallback command to self-heal incomplete cache entries and includes new concurrency tests. While the hardlink implementation in cmd_put is sound, the new cmd_get_with_fallback function has several issues: it lacks input sanitization for the keys (leading to potential path traversal), strips trailing newlines from the JSON output due to command substitution, and uses a non-atomic cp operation for repairs.

Comment thread scripts/preview-cache.sh
Comment on lines +336 to +371
cmd_get_with_fallback() {
# I-8 / issue #70 (W1.4) — Option C self-heal. If a v1.6.1 put landed
# the strong key but the weak alias is missing (e.g. legacy entry
# written before the hardlink fix, or an alias that was independently
# invalidated), restore the alias on the fly so subsequent
# pre-Socratic probes hit immediately. Conversely, if the strong key
# is missing but the weak alias resolves (rare but possible after a
# selective `invalidate STRONG`), repair the strong key from the weak
# entry. Either path returns the cached JSON on stdout, exit 0; full
# miss returns exit 1 (matches cmd_get).
local strong="$1"
local weak="$2"
local out
if out=$(cmd_get "$strong"); then
# Strong hit — opportunistically ensure the weak alias is in place
# so the NEXT pre-Socratic probe on this idea avoids the cmd_get
# round-trip on the strong key entirely.
if [[ -n "$weak" && "$weak" != "$strong" && ! -f "$CACHE_DIR/$weak.json" ]]; then
ln -f "$CACHE_DIR/$strong.json" "$CACHE_DIR/$weak.json" 2>/dev/null || true
fi
printf '%s' "$out"
return 0
fi
if out=$(cmd_get "$weak"); then
# Weak hit but strong miss — repair the strong key from the weak
# file. Hardlink keeps both names pointed at the same inode so
# later puts and TTL checks stay consistent.
if [[ -n "$strong" && "$strong" != "$weak" ]]; then
ln -f "$CACHE_DIR/$weak.json" "$CACHE_DIR/$strong.json" 2>/dev/null || \
cp "$CACHE_DIR/$weak.json" "$CACHE_DIR/$strong.json"
fi
printf '%s' "$out"
return 0
fi
return 1
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The cmd_get_with_fallback implementation has several issues that should be addressed:

  1. Security (Path Traversal): The strong and weak keys are used directly in file paths without sanitization. This violates the repository's rules. Variables used in file paths should be sanitized with a strict allowlist (e.g., tr -dc '[:alnum:]._-') and IDs should use strict regex validation to prevent potential injection or traversal issues.
  2. Correctness (Newline Preservation): Capturing the output of cmd_get using out=$(...) and then printing it with printf '%s' "$out" strips all trailing newlines from the cached JSON. This can break tools that expect a trailing newline. Calling cmd_get directly avoids this and is more efficient as it streams the output.
  3. Atomicity: The fallback to cp (line 365) is non-atomic. If a concurrent process reads the file while cp is in progress, it may receive truncated or corrupted JSON.

I recommend refactoring the function to sanitize inputs using a strict allowlist, call cmd_get directly to preserve output formatting, and leverage cmd_put for the repair logic to ensure atomicity and reuse existing sanitization logic.

References
  1. Sanitize variables used in file paths with a strict allowlist (e.g., tr -dc '[:alnum:]._-') to prevent potential injection issues.
  2. Implement defense-in-depth sanitization for user-provided data, including removing Unicode control characters and applying strict allowlist validation (e.g., regex) for IDs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 6c4b92a — P1/P3 반영, P2는 이미 처리됨.

  • P1 (path traversal): cmd_get_with_fallbackcmd_put과 동일한 [:alnum:]._- 허용목록으로 strong/weak 키를 정규화합니다 (자체 호출자는 16-hex 해시만 발행하지만 defense-in-depth 차원).
  • P3 (atomicity): weak alias self-heal을 private tmp(.heal-${weak}.$$.tmp)에 stage한 뒤 mv -f rename(2)으로 게시하도록 변경. 동시 reader가 절반만 복사된 alias를 관측할 수 없습니다 (cmd_put의 alias-first 게시와 동일한 atomicity 보장).
  • P2 (trailing newline): 지적하신 out=$(cmd_get ...); printf '%s' "$out" 패턴은 현재 코드에 존재하지 않습니다. cmd_get이 직접 cat "$file"로 stream하므로 command-substitution capture 자체가 없고 trailing newline은 보존됩니다 (codex R2 P3에서 이미 처리됨). 후속 리뷰어를 위해 주석으로 invariant를 명시했습니다.

3개 concurrency fixture (race-window / self-heal / 5way) 모두 통과.

sgwannabe and others added 2 commits April 25, 2026 16:59
… fallback (#70)

Codex review followups on PR #81 W1.4:

R2 P1: ln -f against existing alias is NOT atomic (link(2) cannot
overwrite, so -f does unlink+link with a window where the alias name
disappears). Real fix: build inode at primary_tmp, hardlink alias_tmp
to it, publish ALIAS first via rename(2), then publish strong via
rename(2). Both names share the inode; only weak-HIT/strong-MISS is
reachable from a partial write — the strong-HIT/weak-MISS bug is gone.

R2 P2: cmd_get_with_fallback no longer rebuilds strong from weak
(weak_key omits idea_spec_hash; rebuilding strong from weak could let
stale-spec content satisfy a spec-specific lookup).

R2 P3: stream via cmd_get directly (cat) instead of out=$() capture —
trailing newlines are now preserved byte-for-byte.

R3 P2-A: fall back to cp when ln fails (exFAT/SMB/NFS without
hardlink support). Publish status echo now indicates "via link" or
"via copy" so the mode is visible.

R3 P2-B: distinct exit codes for cmd_get_with_fallback —
  0 = strong HIT (authoritative)
  2 = soft hit via weak (caller MUST regen previews)
  1 = both miss
Test fixture asserts all three codes.

R3 P3: replaced runtime polling probe (which couldn't reliably catch
sub-µs rename(2) races) with a STATIC source-level invariant test
that asserts alias publish line < strong publish line in the source.
Mutation-tested: swapping the publish blocks fails the test.

Files:
- scripts/preview-cache.sh — restructure cmd_put to alias-first
  publish; add cp fallback; narrow cmd_get_with_fallback (no
  strong-from-weak repair); exit-code contract documented in usage.
- tests/fixtures/cache-concurrency/test-race-window.sh (new) —
  static invariant guard.
- tests/fixtures/cache-concurrency/test-self-heal.sh — 4 cases
  (strong-only, weak-only soft-hit, full miss, byte-equivalence).
- tests/fixtures/cache-concurrency/README.md — updated to reflect
  exit-code contract and fixture rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmd_get_with_fallback now mirrors cmd_put's hardening posture:

- Sanitise strong/weak keys via the same `[:alnum:]._-` allowlist
  (defence-in-depth path-traversal guard; gemini HIGH P1).
- Stage weak-alias self-heal into a private tmp + mv -f so a
  concurrent reader can never observe a half-copied alias
  (atomicity parity with cmd_put; gemini HIGH P3).
- Newline-preservation concern (P2) was already addressed in the
  initial PR (cmd_get streams via cat — no command-substitution
  capture), so no code change there; comment expanded to make the
  invariant explicit for future reviewers.

All three concurrency fixtures still pass (race-window / self-heal /
5way) and bash -n is clean.

Refs PR #81 review comments (gemini-code-assist)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sgwannabe sgwannabe marked this pull request as ready for review April 25, 2026 08:41
@sgwannabe sgwannabe merged commit 798ae02 into main Apr 25, 2026
6 of 7 checks passed
@sgwannabe sgwannabe deleted the fix/i70-cache-concurrency-race branch April 25, 2026 08:41
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c4b92a615

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/preview-cache.sh
strong=$(printf '%s' "$strong_in" | tr -dc '[:alnum:]._-')
if [[ -z "$strong" ]]; then
echo "preview-cache.sh: refusing get-fallback with empty/unsafe strong key: '$strong_in'" >&2
return 2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use non-soft-hit status for invalid strong keys

get-fallback documents exit 2 as a soft hit via weak alias, but the input-validation branch also returns 2 when strong_key sanitizes to empty. Any caller that branches on 2 to take the soft-hit path (e.g., skip Socratic and regenerate previews) will mis-handle malformed input as a cache hit even though no weak payload was streamed. Return a distinct error code here so 2 remains unambiguous.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🟡 I-8: concurrent cmd_put race — strong-key hit / weak-alias miss → Socratic re-prompt

1 participant