Skip to content

fix(s3stream): fix double-release race in LogCache.tryMerge (#3307)#3315

Merged
superhx merged 1 commit into
mainfrom
cp_3307
Apr 14, 2026
Merged

fix(s3stream): fix double-release race in LogCache.tryMerge (#3307)#3315
superhx merged 1 commit into
mainfrom
cp_3307

Conversation

@superhx
Copy link
Copy Markdown
Collaborator

@superhx superhx commented Apr 14, 2026

Bug

A rare IllegalReferenceCountException: refCnt: 0, decrement: 1 occurs when LogCacheBlock.free() tries to release a StreamRecordBatch whose refCnt is already 0.

Root cause: tryMerge performs the expensive mergeBlock call outside the write lock for performance. During that window, clearStreamRecords can acquire the write lock and call block.free(streamId), which removes a StreamCache from the block and schedules an async release of all its StreamRecordBatch records (refCnt → 0). When tryMerge then re-acquires the write lock, it only checks that left/right are still the same block objects in blocks — it does not detect that the blocks were mutated. So the merged newBlock (which shares the same record references via mergeBlock) replaces left/right in blocks. Later, when newBlock.free() is called, it tries to release records that are already at refCnt 0, causing the crash.

Fix

Add a freeOpsModCount field to LogCacheBlock that is incremented whenever the block is mutated by a free operation (free() or free(long streamId)). In tryMerge, snapshot both blocks' freeOpsModCount inside the same write lock that selects left and right. After mergeBlock completes, re-acquire the write lock and verify the counts are unchanged before committing the swap. If either block was mutated in between, the merge is aborted and newBlock is simply discarded — preventing stale record references from entering blocks and being double-freed.

## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(s3stream): add tests for freeOpsModCount merge guard in LogCache

Add two tests to cover the race condition fix in LogCache.tryMerge:

- testMergeAbortedWhenBlockMutatedDuringMerge: verifies that
  freeOpsModCount is incremented by clearStreamRecords and that
  tryMerge detects the mutation and aborts the block swap.

- testNoDoubleReleaseUnderConcurrentClearAndMerge: stress test that
  runs markFree and clearStreamRecords concurrently across 200
  iterations, asserting no IllegalReferenceCountException occurs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@superhx superhx enabled auto-merge (squash) April 14, 2026 11:25
@superhx superhx merged commit 25eb90f into main Apr 14, 2026
6 checks passed
@superhx superhx deleted the cp_3307 branch April 14, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants