Skip to content

fix(s3stream): fix double-release race in LogCache.tryMerge#3307

Merged
superhx merged 2 commits into
1.7from
fix_leak
Apr 14, 2026
Merged

fix(s3stream): fix double-release race in LogCache.tryMerge#3307
superhx merged 2 commits into
1.7from
fix_leak

Conversation

@superhx
Copy link
Copy Markdown
Collaborator

@superhx superhx commented Apr 13, 2026

Bug

A rare IllegalReferenceCountException: refCnt: 0, decrement: 1 occurs when LogCacheBlock.free() tries to release a StreamRecordBatch whose refCnt is already 0.

tryMerge performs mergeBlock outside the write lock for performance. During that window, clearStreamRecords can acquire the write lock and call block.free(streamId), which removes a StreamCache from the block and schedules an async release of all its StreamRecordBatch records (refCnt → 0). When tryMerge re-acquires the write lock, it only checks that left/right are still the same block objects in blocks — it does not detect that the blocks were mutated. So the merged newBlock (which shares the same record references via mergeBlock) replaces left/right in blocks. Later, when newBlock.free() is called, it tries to release records already at refCnt 0, causing the crash.

Fix

Add a freeOpsModCount field to LogCacheBlock that is incremented whenever the block is mutated by a free operation (free() or free(long streamId)). In tryMerge, snapshot both blocks' freeOpsModCount inside the same write lock that selects left and right. After mergeBlock completes, re-acquire the write lock and verify the counts are unchanged before committing the swap. If either block was mutated in between, the merge is aborted — preventing stale record references from entering blocks and being double-freed.

Test plan

  • Existing LogCache unit tests pass
  • Verify no IllegalReferenceCountException under concurrent clearStreamRecords + markFree workloads

🤖 Generated with Claude Code

## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a rare ref-count double-release race in LogCache.tryMerge() by detecting block mutations that can occur while the merge work runs outside the write lock.

Changes:

  • Added freeOpsModCount to LogCacheBlock, incremented on free() and free(streamId) mutations.
  • Snapshotted freeOpsModCount for merge candidates under the write lock, and validated it again before committing the merged block swap.
  • Aborted merges when either candidate block was mutated during the merge window to prevent stale references from entering blocks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread s3stream/src/main/java/com/automq/stream/s3/cache/LogCache.java
Add two tests to cover the race condition fix in LogCache.tryMerge:

- testMergeAbortedWhenBlockMutatedDuringMerge: verifies that
  freeOpsModCount is incremented by clearStreamRecords and that
  tryMerge detects the mutation and aborts the block swap.

- testNoDoubleReleaseUnderConcurrentClearAndMerge: stress test that
  runs markFree and clearStreamRecords concurrently across 200
  iterations, asserting no IllegalReferenceCountException occurs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@superhx superhx merged commit 2e69641 into 1.7 Apr 14, 2026
6 checks passed
@superhx superhx deleted the fix_leak branch April 14, 2026 11:13
superhx added a commit that referenced this pull request Apr 14, 2026
## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(s3stream): add tests for freeOpsModCount merge guard in LogCache

Add two tests to cover the race condition fix in LogCache.tryMerge:

- testMergeAbortedWhenBlockMutatedDuringMerge: verifies that
  freeOpsModCount is incremented by clearStreamRecords and that
  tryMerge detects the mutation and aborts the block swap.

- testNoDoubleReleaseUnderConcurrentClearAndMerge: stress test that
  runs markFree and clearStreamRecords concurrently across 200
  iterations, asserting no IllegalReferenceCountException occurs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
superhx added a commit that referenced this pull request Apr 14, 2026
## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(s3stream): add tests for freeOpsModCount merge guard in LogCache

Add two tests to cover the race condition fix in LogCache.tryMerge:

- testMergeAbortedWhenBlockMutatedDuringMerge: verifies that
  freeOpsModCount is incremented by clearStreamRecords and that
  tryMerge detects the mutation and aborts the block swap.

- testNoDoubleReleaseUnderConcurrentClearAndMerge: stress test that
  runs markFree and clearStreamRecords concurrently across 200
  iterations, asserting no IllegalReferenceCountException occurs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
superhx added a commit that referenced this pull request Apr 14, 2026
…3316)

## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
superhx added a commit that referenced this pull request Apr 14, 2026
…3315)

## Bug

A rare `IllegalReferenceCountException: refCnt: 0, decrement: 1` occurs
when `LogCacheBlock.free()` tries to release a `StreamRecordBatch` whose
refCnt is already 0.

Root cause: `tryMerge` performs the expensive `mergeBlock` call outside
the write lock for performance. During that window, `clearStreamRecords`
can acquire the write lock and call `block.free(streamId)`, which removes
a `StreamCache` from the block and schedules an async release of all its
`StreamRecordBatch` records (refCnt → 0). When `tryMerge` then re-acquires
the write lock, it only checks that `left`/`right` are still the same
block objects in `blocks` — it does not detect that the blocks were
mutated. So the merged `newBlock` (which shares the same record references
via `mergeBlock`) replaces `left`/`right` in `blocks`. Later, when
`newBlock.free()` is called, it tries to release records that are already
at refCnt 0, causing the crash.

## Fix

Add a `freeOpsModCount` field to `LogCacheBlock` that is incremented
whenever the block is mutated by a free operation (`free()` or
`free(long streamId)`). In `tryMerge`, snapshot both blocks'
`freeOpsModCount` inside the same write lock that selects `left` and
`right`. After `mergeBlock` completes, re-acquire the write lock and
verify the counts are unchanged before committing the swap. If either
block was mutated in between, the merge is aborted and `newBlock` is
simply discarded — preventing stale record references from entering
`blocks` and being double-freed.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants