fix: LSM_VECTOR inactivity rebuild timer never re-armed after skip (#4215)#4272
Conversation
…4215) The small-graph inactivity-rebuild task used REBUILD_SEMAPHORE.tryAcquire() and exited silently when another rebuild held the single permit. Only put()/putBatch() re-armed the timer, so once write traffic stopped any index that had been skipped stayed stuck with pending mutations forever. Re-schedule the timer from the skipping branch so the index polls again after timeoutMs and rebuilds as soon as the semaphore is free.
Up to standards ✅🟢 Issues
|
There was a problem hiding this comment.
Code Review
This pull request ensures that inactivity-based index rebuilds are rescheduled if they are skipped due to another rebuild being in progress. It includes a new regression test, 'skippedInactivityRebuildShouldRetryUntilServed', which verifies that pending mutations are eventually processed when multiple indexes compete for the rebuild semaphore. A review comment highlights a critical thread-safety concern in the 'scheduleInactivityRebuild()' method, pointing out that concurrent access from the timer thread and mutation operations could lead to race conditions and leaked timer threads.
| indexName); | ||
| "Skipping inactivity rebuild for index %s: another rebuild is already in progress, will retry in %d ms", | ||
| indexName, timeoutMs); | ||
| scheduleInactivityRebuild(); |
There was a problem hiding this comment.
The scheduleInactivityRebuild() method is not thread-safe. It lacks synchronization when accessing and modifying the shared inactivityRebuildTask and inactivityTimer fields. Calling it from the timer thread here, while concurrent put() or putBatch() operations might also be calling it, can lead to a race condition where multiple Timer objects are created or multiple tasks are scheduled, potentially leaking timer threads and triggering redundant rebuilds. It is highly recommended to make the scheduleInactivityRebuild() method synchronized and have it check isValid() before scheduling to ensure lifecycle safety.
There was a problem hiding this comment.
Verified: the fields are volatile (visibility OK), but the compound if (inactivityTimer == null) inactivityTimer = new Timer(...) is a TOCTOU race - concurrent writers (the method is called outside the write lock at LSMVectorIndex.java:3557) could each see null and leak a Timer/daemon thread. The race pre-existed this PR but my added timer-thread caller widened the surface.
Fixed in 3c664e5:
scheduleInactivityRebuild()andcancelInactivityRebuildTimer()are nowsynchronized.- Added an
isValid()guard so a late timer thread that wins the lock afterclose()does not attempt to schedule on a cancelled Timer (which would throwIllegalStateException).
Reentrancy is fine - intrinsic locks are reentrant, so the timer task calling scheduleInactivityRebuild() from inside run() works when the same task path also goes through other synchronized(this) methods like startAsyncGraphRebuild.
Code ReviewClean, minimal fix for a real bug. The root cause analysis in the PR description is accurate and the one-line fix is exactly right. Fix correctness (LSMVectorIndex.java)The added Calling Stale comment (minor) - The comment immediately above the fix at line 4862-4863 now says: // If a large rebuild is already running, skip this small one - the next
// inactivity timeout or mutation threshold will pick it up.This is now misleading since the fix explicitly reschedules the timer rather than relying on a future mutation to pick it up. Worth updating to something like: // If a large rebuild is already running, re-arm the timer so this small
// index retries once the semaphore is free.No retry on exception - If Log message improvementThe updated log message (including Test (LSMVectorIndexRebuildTest.java)The regression test is well-structured:
One small note: if either index happens to accumulate no mutations before the assertion (e.g., schema creation races), the initial SummaryThe fix is correct and the test covers the exact failure scenario. The only actionable item before merge is updating the stale comment at line 4862-4863 - everything else is minor observation. |
Address review feedback on #4272: - scheduleInactivityRebuild() and cancelInactivityRebuildTimer() are now synchronized. The previous check-then-create on inactivityTimer was a TOCTOU race: concurrent writers (the method is called outside the write lock) could each see the field null and create two Timer instances, leaking the loser as a daemon thread. - Added isValid() guard so a late timer thread that wins the lock after close() does not attempt to schedule a new task on a cancelled Timer (Timer.schedule() throws IllegalStateException post-cancel). - Updated the now-misleading comment in the small-graph timer task: the skip path re-arms the timer instead of relying on a future mutation to pick it up.
|
Code Review - PR 4272: fix LSM_VECTOR inactivity rebuild timer never re-armed after skip Overview This fix addresses a real and reproducible bug: when a small-graph LSM_VECTOR index's inactivity-rebuild timer fired and Correctness The fix is correct. The retry path reuses the existing One subtle point: when Thread Safety Making both Pre-existing concern (not introduced by this PR): The Potential Log Spam (Actionable) The retry log uses Test Coverage The regression test is well-designed:
One suggestion: a polling approach with a hard deadline (instead of fixed Summary The bug is genuine, the fix is correct and minimal, thread safety is improved, and the regression test properly validates the retry behavior. The one actionable item is the |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4272 +/- ##
============================================
- Coverage 64.69% 64.47% -0.23%
- Complexity 0 558 +558
============================================
Files 1644 1645 +1
Lines 127412 127456 +44
Branches 27299 27316 +17
============================================
- Hits 82425 82172 -253
- Misses 33446 33705 +259
- Partials 11541 11579 +38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Per review on #4272: valid was a plain boolean written outside the synchronized(this) monitor (in the drop finally block after the ReadWriteLock release). isValid() reads it from inside the monitor, so JMM gave no happens-before between the write and the timer-thread read. Combined with cancelInactivityRebuildTimer() nulling inactivityTimer, a stale valid==true read could let the timer task create a fresh Timer that nobody will cancel - a leaked daemon thread. Making valid volatile is the minimal fix and directly closes a race the previous commit exposed by adding a timer-thread caller to scheduleInactivityRebuild().
|
Verified the pre-existing concern and decided to fix it in this PR rather than defer to a follow-up: the race is exposed by the timer-thread caller added in this PR. Without volatile, a timer task could win the monitor after Fixed in ffe9fad (one-word change: marked |
Code ReviewOverviewThis PR fixes a genuine bug (#4215): when the inactivity-rebuild timer fired on a small-graph What the PR does well
Issues and suggestions1. Duplicate // Line 4851 - outer guard
if (mutationsSinceSerialize.get() <= 0)
return;
// ...
// Line 4867 - inner guard, always true here
if (mutationsSinceSerialize.get() > 0) {
if (REBUILD_SEMAPHORE.tryAcquire()) {The inner check on line 4867 is always true when reached (the outer check already returned if 2. No In the retry path, In the normal case (semaphore freed after a handful of retries) this is harmless. But if the semaphore were held for an extended period on an active index, timer queue growth is unbounded. A call to final java.util.TimerTask existing = inactivityRebuildTask;
if (existing != null) {
existing.cancel();
inactivityTimer.purge(); // reclaim cancelled-task memory
}3. Test comment references an issue number // Issue #4215: when the inactivity-rebuild timer fires on a small graph...Per project conventions, issue references belong in the PR description, not in source comments. The comment should describe why the retry matters, not which ticket it fixes: // If tryAcquire() fails, the timer must re-arm so the index is not stuck with pending
// mutations during a quiet period (no mutations = no one else to re-arm it).4. Test sleep window may be tight on a heavily loaded CI runner
Consider bumping to Awaitility.await("both indexes should rebuild")
.atMost(Duration.ofMillis(timeoutMs * 20L))
.pollInterval(Duration.ofMillis(50))
.untilAsserted(() -> {
assertThat(indexA.getStats().get("mutationsSinceRebuild")).isEqualTo(0L);
assertThat(indexB.getStats().get("mutationsSinceRebuild")).isEqualTo(0L);
});SummaryThe fix is correct and the synchronization changes are sound. The four items above are: one pre-existing dead-code smell worth cleaning up, one defensive memory practice ( |
Per review on #4272: the retry message fires every timeoutMs per contending index, which floods INFO logs in high-contention setups (several vector indexes, low timeout). The initial "Inactivity timeout expired ..." log at INFO above the skip already signals the cycle to operators, so the skip+retry detail belongs at FINE.
|
Lowered the skip log to |
Review items on #4272: 1. Remove the dead `if (mutationsSinceSerialize.get() > 0)` guard inside the small-graph branch - the outer check at the top of run() already returned for <= 0 and nothing between the two checks decrements the counter. 2. Call inactivityTimer.purge() after cancelling the previous task in scheduleInactivityRebuild(). The reset-on-write path can otherwise leave many cancelled tasks in the Timer's internal queue until their scheduled time fires; purge() reclaims them immediately. 3. Drop the issue-number reference from the regression-test comment; describe the behavioural invariant instead, per project convention. 4. Replace the Thread.sleep(timeoutMs * 15L) wait in the new test with Awaitility.await(...).atMost(...).untilAsserted(...). This gives a 25x ceiling and short polls, so the test still fails fast when the bug is present but is resilient to GC pauses on loaded CI runners.
Code ReviewOverviewThis PR fixes a real bug: when a small-graph LSM_VECTOR index's inactivity-rebuild timer fired and REBUILD_SEMAPHORE.tryAcquire() failed, the task silently terminated without re-arming itself. Since only put()/putBatch() re-arm the timer during a quiet period, any index that lost the semaphore race stayed stuck with pending mutations indefinitely. The fix is minimal and well-targeted. What's Correct and Well Done
Issues and Suggestions1. Self-cancel in scheduleInactivityRebuild() when called from within run() - works but worth notingWhen the timer task's run() calls scheduleInactivityRebuild(), the method cancels existing (the currently-running task itself) before scheduling a new one. Java's TimerTask.cancel() guarantees this is safe. However, since the task is already executing and is a one-shot schedule, the cancel() call is effectively a no-op. It is not harmful, but a short comment explaining the intent would help future readers. 2. Fixed sleep in the regression test is fragileThread.sleep(timeoutMs * 15L); // 4500 ms total On a heavily loaded CI runner, 15x the timer interval may not be enough for the retry cycle to complete when two rebuilds are contending on the same semaphore. Consider replacing the fixed sleep with a polling loop that breaks as soon as both indexes reach 0 pending mutations, with a generous outer deadline (e.g., timeoutMs * 30L). This makes the test fast on idle hardware, robust on slow hardware, and self-documenting about what it is waiting for. 3. Test comment references the issue number directlyThe comment "Issue #4215: when the inactivity-rebuild timer fires..." references the issue number directly. Per the project's coding conventions, issue references belong in the PR description rather than source code since they rot as the codebase evolves. The comment could be rephrased to describe the invariant being tested instead. 4. No retry backoff (minor, acceptable)The retry re-schedules with the same timeoutMs. If a rebuild holds the semaphore for an unusually long time, this creates a steady stream of re-arms at the base interval. The FINE log level prevents log spam, so this is acceptable for now. A TODO comment would set expectations for a potential future improvement. 5. Redundant inner mutationsSinceSerialize.get() > 0 check (pre-existing, not blocking)The guard at the top of the timer's run() already returns early if there are no pending mutations, making the inner check inside the small-graph path redundant. Pre-existing code, worth a follow-up cleanup. Security / PerformanceNo security concerns. The retry loop is bounded naturally by the semaphore eventually being released or the index being dropped/closed (guarded by isValid()). No new allocations on the hot path. SummaryThe fix is correct, the concurrency reasoning is sound, and the regression test validates the right scenario. The main asks before merging: (1) update the test comment to remove the issue reference per project conventions, and (2) consider replacing the fixed Thread.sleep with a polling loop to reduce CI flakiness risk. Everything else is minor or pre-existing. |
|
All four items addressed in b46e6a8:
All 13 |
Code ReviewOverviewThis PR fixes a real bug where a small-graph What's Good
Issues / Suggestions1. Close-vs-rearm race (minor)
A new Simplest fix: also set 2. Possible tight-retry loop if semaphore is never released (cosmetic concern)If the semaphore is permanently held (e.g., 3. Test timeout multiplier lacks a comment.atMost(Duration.ofMillis(timeoutMs * 25L))25× isn't obvious. A brief comment ("two indexes × up to 10 retries each + 5× slack") would help future readers understand why this bound is safe. 4. Pre-existing: fully-qualified Timer/TimerTask namesThe surrounding code (and this PR) references Test CoverageThe regression test correctly exercises the exact failure mode (two small-graph indexes, single-permit semaphore, quiet period after mutations). It is the right shape for this bug. No functional gaps found. SummaryThe fix is correct and the approach is sound. The close-vs-rearm race (#1) is the only concern worth addressing before merge; the others are minor polish items. |
Review on #4272: - close() now sets `valid = false` before cancelInactivityRebuildTimer(). Previously only drop() set valid=false (after releasing the write lock), so a plain close() left isValid() returning true forever. A timer task that ran between cancelInactivityRebuildTimer() (which nulls inactivityTimer) and any subsequent state change could see isValid()==true and inactivityTimer==null, resurrect a fresh Timer, and leak its daemon thread. - Add a one-line note next to the 25x Awaitility ceiling in the regression test explaining where the multiplier comes from.
|
Four items, addressed e6fedec for the substantive ones: 1. Close-vs-rearm race: Verified - 3. 25x multiplier comment: Added a one-liner explaining the budget (first cycle + retry cycle + two synchronous rebuilds + slack). 2. Tight-retry loop with permanently-held semaphore: Deferring. The scenario presumes 4. Fully-qualified All 155 vector tests pass. |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Coverage variation | ✅ -7.63% coverage variation |
| Diff coverage | ✅ 94.44% diff coverage |
Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (cb372f3) 127412 93995 73.77% Head commit (3767f56) 159134 (+31722) 105249 (+11254) 66.14% (-7.63%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>
Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4272) 18 17 94.44% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Per CLAUDE.md: classes should be imported and referenced by simple name. The fully-qualified java.util.Timer/TimerTask usage in this block was an oversight - there is no name conflict in the import set.
|
Correction on item 4: I claimed the fully-qualified |
|
test |
Code ReviewOverviewThis PR fixes a real bug: when a small-graph The approach is correct, the synchronization analysis is sound, and the regression test demonstrates the failure clearly. Correctness and SynchronizationGood:
Minor concern - livelock under sustained semaphore pressure: If a large-graph async rebuild (which does a blocking
|
Code Review - PR #4272: fix LSM_VECTOR inactivity rebuild timer never re-armed after skipOverviewTargeted, minimal fix for a real liveness bug: when Fix CorrectnessCore one-liner - Calling Self-cancel in timer Shutdown ordering - Setting ObservationsRemoved inner mutation guard (minor, worth a comment): The original code had: if (mutationsSinceSerialize.get() > 0) {
if (REBUILD_SEMAPHORE.tryAcquire()) { ... }
}The new code drops the inner Exception path does not reschedule (intentional asymmetry): The
The log message inside
Thread Safety
Test CoverageThe regression test is well-structured:
One minor nit: the comment above the test method is a good explanation of the bug scenario - consider promoting it to an inline Javadoc block with SummaryThe fix is correct, the test covers the exact failure scenario, and the defensive improvements around
Everything else is minor observation. Good PR. |
Summary
LSM_VECTORindex's inactivity-rebuild timer fired andREBUILD_SEMAPHORE.tryAcquire()failed (another rebuild was holding the single permit), the task logged "Skipping..." and terminated without rescheduling itself. Since onlyput()/putBatch()re-arm the timer, any index that was skipped during a quiet period stayed stuck with pending mutations indefinitely.timeoutMsand rebuilds as soon as the semaphore is free.startAsyncGraphRebuild) was unaffected because it starts a background thread that does a blockingacquire()and always eventually runs.Test plan
LSMVectorIndexRebuildTest#skippedInactivityRebuildShouldRetryUntilServedcreates two small-graph indexes that compete for the single-permit semaphore; without the fix one index stays stuck at 50 pending mutations, with the fix both clear to 0 within ~15x the timeout window.mvn test -pl engine -Dtest="LSMVectorIndex*,VectorUtils*,DeltaScan*,VectorIndex*,GraphRAGTest").