
fix: LSM_VECTOR inactivity rebuild timer never re-armed after skip (#4215) #4272

Merged
robfrank merged 7 commits into main from fix/4215-vector-inactivity-rebuild-timer-rearm
May 21, 2026

Conversation

@robfrank
Collaborator

Summary

  • Fixes #4215 (LSM_VECTOR inactivity rebuild timer is never re-armed when a fire is skipped, leaving indexes stuck with pending mutations indefinitely). When a small-graph LSM_VECTOR index's inactivity-rebuild timer fired and REBUILD_SEMAPHORE.tryAcquire() failed (another rebuild was holding the single permit), the task logged "Skipping..." and terminated without rescheduling itself. Since only put()/putBatch() re-arm the timer, any index that was skipped during a quiet period stayed stuck with pending mutations indefinitely.
  • The fix re-schedules the timer from the skipping branch so the index polls again after timeoutMs and rebuilds as soon as the semaphore is free.
  • The large-graph path (startAsyncGraphRebuild) was unaffected because it starts a background thread that does a blocking acquire() and always eventually runs.
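The failure mode and the fix described above can be reproduced in isolation. The sketch below is not ArcadeDB code: `RearmingRebuilder`, `pending`, `skips`, and `demo()` are hypothetical names, and the "rebuild" is a placeholder that only clears a counter. It assumes nothing beyond what the summary states: a single-permit semaphore, a non-blocking `tryAcquire()`, and a re-arm from the skip branch.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a small-graph index; not ArcadeDB code.
final class RearmingRebuilder {
  // Single permit shared by all instances, standing in for REBUILD_SEMAPHORE.
  static final Semaphore PERMIT = new Semaphore(1);

  final AtomicLong pending = new AtomicLong();     // mutations awaiting a rebuild
  final AtomicInteger skips = new AtomicInteger(); // timer fires that lost the permit
  private final Timer timer = new Timer("rearm-sketch", true);
  private final long timeoutMs;

  RearmingRebuilder(final long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // A write records one mutation and arms the inactivity timer.
  void put() {
    pending.incrementAndGet();
    schedule();
  }

  private void schedule() {
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        if (pending.get() <= 0)
          return;                       // nothing left to rebuild
        if (PERMIT.tryAcquire()) {
          try {
            pending.set(0);             // placeholder for the real rebuild
          } finally {
            PERMIT.release();
          }
        } else {
          skips.incrementAndGet();
          schedule();                   // the fix: re-arm instead of exiting silently
        }
      }
    }, timeoutMs);
  }

  void close() {
    timer.cancel();
  }

  // Holds the permit for a while with no further writes, releases it, and
  // returns the mutations still pending once the timer has had time to retry.
  static long demo() {
    final RearmingRebuilder index = new RearmingRebuilder(20);
    try {
      PERMIT.acquire();                 // another rebuild is in progress
      index.put();                      // last write before the quiet period
      Thread.sleep(100);                // several fires are skipped meanwhile
      PERMIT.release();                 // the other rebuild finishes
      final long deadline = System.nanoTime() + 5_000_000_000L;
      while (index.pending.get() > 0 && System.nanoTime() < deadline)
        Thread.sleep(10);
      return index.pending.get();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      return -1;
    } finally {
      index.close();
    }
  }
}
```

With the `schedule()` call in the skip branch, `demo()` is expected to drain to 0 shortly after the permit is released. Deleting that one call models the original bug: the first skipped fire is also the last, and the counter stays at 1 until the deadline.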

Test plan

  • New regression test LSMVectorIndexRebuildTest#skippedInactivityRebuildShouldRetryUntilServed creates two small-graph indexes that compete for the single-permit semaphore; without the fix one index stays stuck at 50 pending mutations, with the fix both clear to 0 within ~15x the timeout window.
  • Confirmed test fails on the unfixed code and passes after the fix.
  • All 155 vector index tests pass with no regressions (mvn test -pl engine -Dtest="LSMVectorIndex*,VectorUtils*,DeltaScan*,VectorIndex*,GraphRAGTest").

…4215)

The small-graph inactivity-rebuild task used REBUILD_SEMAPHORE.tryAcquire()
and exited silently when another rebuild held the single permit. Only
put()/putBatch() re-armed the timer, so once write traffic stopped any
index that had been skipped stayed stuck with pending mutations forever.

Re-schedule the timer from the skipping branch so the index polls again
after timeoutMs and rebuilds as soon as the semaphore is free.
@codacy-production

codacy-production Bot commented May 20, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request ensures that inactivity-based index rebuilds are rescheduled if they are skipped due to another rebuild being in progress. It includes a new regression test, 'skippedInactivityRebuildShouldRetryUntilServed', which verifies that pending mutations are eventually processed when multiple indexes compete for the rebuild semaphore. A review comment highlights a critical thread-safety concern in the 'scheduleInactivityRebuild()' method, pointing out that concurrent access from the timer thread and mutation operations could lead to race conditions and leaked timer threads.

indexName);
"Skipping inactivity rebuild for index %s: another rebuild is already in progress, will retry in %d ms",
indexName, timeoutMs);
scheduleInactivityRebuild();
Contributor


high

The scheduleInactivityRebuild() method is not thread-safe. It lacks synchronization when accessing and modifying the shared inactivityRebuildTask and inactivityTimer fields. Calling it from the timer thread here, while concurrent put() or putBatch() operations might also be calling it, can lead to a race condition where multiple Timer objects are created or multiple tasks are scheduled, potentially leaking timer threads and triggering redundant rebuilds. It is highly recommended to make the scheduleInactivityRebuild() method synchronized and have it check isValid() before scheduling to ensure lifecycle safety.

Collaborator Author


Verified: the fields are volatile (visibility OK), but the compound if (inactivityTimer == null) inactivityTimer = new Timer(...) is a TOCTOU race - concurrent writers (the method is called outside the write lock at LSMVectorIndex.java:3557) could each see null and leak a Timer/daemon thread. The race pre-existed this PR but my added timer-thread caller widened the surface.

Fixed in 3c664e5:

  • scheduleInactivityRebuild() and cancelInactivityRebuildTimer() are now synchronized.
  • Added an isValid() guard so a late timer thread that wins the lock after close() does not attempt to schedule on a cancelled Timer (which would throw IllegalStateException).

Reentrancy is fine - intrinsic locks are reentrant, so the timer task calling scheduleInactivityRebuild() from inside run() works when the same task path also goes through other synchronized(this) methods like startAsyncGraphRebuild.
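The check-then-create race discussed above, and the effect of making both methods synchronized, can be sketched with a small holder class. All names here (`GuardedTimerHolder`, `timersCreated`, `demo`) are hypothetical and the lifecycle guard is left out. The sketch only illustrates that, once the null check and the `new Timer(...)` happen under one monitor, concurrent callers cannot each create their own Timer.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical holder; field and method names are illustrative only.
final class GuardedTimerHolder {
  private Timer timer;       // guarded by "this"
  private TimerTask task;    // guarded by "this"
  private int timersCreated; // guarded by "this"; exists only for the demo

  // Cancels the previous task, lazily creates the Timer, schedules a new task.
  synchronized void schedule(final Runnable work, final long delayMs) {
    if (task != null)
      task.cancel();
    if (timer == null) {                 // check and create are now one atomic step
      timer = new Timer("guarded-sketch", true);
      timersCreated++;
    }
    task = new TimerTask() {
      @Override
      public void run() {
        work.run();
      }
    };
    timer.schedule(task, delayMs);
  }

  synchronized void cancel() {
    if (task != null) {
      task.cancel();
      task = null;
    }
    if (timer != null) {
      timer.cancel();
      timer = null;
    }
  }

  synchronized int timersCreated() {
    return timersCreated;
  }

  // Races several callers through schedule() and reports how many Timers were created.
  static int demo(final int threads) {
    final GuardedTimerHolder holder = new GuardedTimerHolder();
    final Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> holder.schedule(() -> { }, 60_000L));
      workers[i].start();
    }
    try {
      for (final Thread worker : workers)
        worker.join();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    final int created = holder.timersCreated();
    holder.cancel();
    return created;
  }
}
```

Without the `synchronized` keyword on `schedule()`, two callers could both observe `timer == null` and each start a daemon thread, which is the leak the review comment describes.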

@claude

claude Bot commented May 20, 2026

Code Review

Clean, minimal fix for a real bug. The root cause analysis in the PR description is accurate and the one-line fix is exactly right.

Fix correctness (LSMVectorIndex.java)

The added scheduleInactivityRebuild() call in the tryAcquire() failure branch is the correct approach. A few observations:

Calling scheduleInactivityRebuild() from inside the timer task - When re-scheduling, the method reads inactivityRebuildTask (the currently-running task) and calls cancel() on it. Calling cancel() on an already-running TimerTask is a no-op in the JDK (returns false), so the self-cancel is harmless and the new task is scheduled correctly.

Stale comment (minor) - The comment immediately above the fix at lines 4862-4863 now says:

// If a large rebuild is already running, skip this small one - the next
// inactivity timeout or mutation threshold will pick it up.

This is now misleading since the fix explicitly reschedules the timer rather than relying on a future mutation to pick it up. Worth updating to something like:

// If a large rebuild is already running, re-arm the timer so this small
// index retries once the semaphore is free.

No retry on exception - If buildGraphFromScratch() throws, the catch block at line 4879 logs a warning but does NOT reschedule. The timer then goes silent. This is probably intentional to avoid retry-storms on persistent errors, but it is asymmetric with the skip path (which now does retry). A comment explaining the deliberate choice would help future readers.

Log message improvement

The updated log message (including timeoutMs) is a nice improvement for diagnosability.

Test (LSMVectorIndexRebuildTest.java)

The regression test is well-structured:

  • Correctly uses @Tag("slow") per project conventions.
  • The 15x timeout window (4500 ms) gives the competing timers ample room even on a loaded CI runner.
  • Seeded Random(42) keeps the test reproducible.
  • Descriptive .as(...) messages on the assertions make failures self-explanatory.
  • Casting through TypeIndex -> LSMVectorIndex to inspect getStats() directly is the right pattern for this kind of white-box timing test.

One small note: if either index happens to accumulate no mutations before the assertion (e.g., schema creation races), the initial assertThat(...).isGreaterThan(0L) will catch it early. That guard is good.

Summary

The fix is correct and the test covers the exact failure scenario. The only actionable item before merge is updating the stale comment at line 4862-4863 - everything else is minor observation.

Address review feedback on #4272:

- scheduleInactivityRebuild() and cancelInactivityRebuildTimer() are
  now synchronized. The previous check-then-create on inactivityTimer
  was a TOCTOU race: concurrent writers (the method is called outside
  the write lock) could each see the field null and create two Timer
  instances, leaking the loser as a daemon thread.

- Added isValid() guard so a late timer thread that wins the lock
  after close() does not attempt to schedule a new task on a cancelled
  Timer (Timer.schedule() throws IllegalStateException post-cancel).

- Updated the now-misleading comment in the small-graph timer task:
  the skip path re-arms the timer instead of relying on a future
  mutation to pick it up.
@claude

claude Bot commented May 20, 2026

Code Review - PR 4272: fix LSM_VECTOR inactivity rebuild timer never re-armed after skip

Overview

This fix addresses a real and reproducible bug: when a small-graph LSM_VECTOR index's inactivity-rebuild timer fired and tryAcquire() failed (another rebuild held the single permit), the timer task returned without rescheduling, leaving the index stuck with pending mutations indefinitely. The fix is minimal and targeted - one extra scheduleInactivityRebuild() call in the skip branch, plus two synchronized guards and an isValid() guard.


Correctness

The fix is correct. The retry path reuses the existing inactivityTimer instance (which is still alive since it has not been cancelled), and scheduleInactivityRebuild() creates a fresh TimerTask and schedules it properly. The early-return at line 4839 (if (mutationsSinceSerialize.get() <= 0) return;) also prevents pointless retries if mutations are drained by another path while waiting.

One subtle point: when scheduleInactivityRebuild() is called from inside run(), it first calls existing.cancel() on the currently-running task (itself). Per the java.util.TimerTask contract, cancel() on an already-executing task is a no-op - harmless. It updates inactivityRebuildTask to the new task, which is what you want so cancelInactivityRebuildTimer() tracks the right handle.
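The self-cancel behaviour is easy to check against the JDK directly. The following is a minimal, standalone sketch (the class and method names are made up for illustration): a one-shot `TimerTask` calls its own `cancel()` from inside `run()` and records the return value.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class SelfCancelDemo {
  // Runs a one-shot task that cancels itself from inside run() and
  // returns what cancel() reported (null if the task never ran).
  static Boolean cancelFromInsideRun() {
    final Timer timer = new Timer("self-cancel-sketch", true);
    final AtomicReference<Boolean> result = new AtomicReference<>();
    final CountDownLatch done = new CountDownLatch(1);
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        result.set(cancel());   // the task is already executing at this point
        done.countDown();
      }
    }, 10);
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      timer.cancel();
    }
    return result.get();
  }

  public static void main(final String[] args) {
    System.out.println("cancel() inside run() returned " + cancelFromInsideRun());
  }
}
```

For a task scheduled for one-time execution, `cancel()` reports `false` here because the task has already been taken off the queue and started, so there is no pending execution left to prevent.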


Thread Safety

Making both scheduleInactivityRebuild() and cancelInactivityRebuildTimer() synchronized prevents a race where a cancel (during index close/drop) and a retry reschedule run concurrently. The isValid() guard is the key backstop: if the index is closed while a timer task is running, the retry call sees isValid() == false and bails out cleanly.

Pre-existing concern (not introduced by this PR): The valid field (line 217) is a plain boolean, not volatile. scheduleInactivityRebuild() is synchronized on the object monitor, while valid = false is written after releasing a ReadWriteLock write lock (line 4183). These are different synchronization mechanisms with no formal happens-before between the write and the read in isValid(). In practice the worst case is one extra timer reschedule, but making valid volatile would be cleaner. Worth a follow-up issue, not a blocker for this PR.


Potential Log Spam (Actionable)

The retry log uses Level.INFO. If two indexes compete repeatedly for the semaphore this emits an INFO log every timeoutMs - which could be very short. In high-contention scenarios this floods the logs. Consider lowering the retry message to Level.FINE/Level.DEBUG, or applying a throttle like the 60-second window used for pool saturation warnings elsewhere in the codebase. The initial "inactivity timeout expired" log at INFO is fine; it is the repeated retry notification that can become noisy.


Test Coverage

The regression test is well-designed:

  • Two small-graph indexes on the same database reliably create timer-fire contention on the single-permit semaphore.
  • @Tag("slow") correctly applied per project conventions.
  • The 15x multiplier (4.5 s at 300 ms timeout) gives a wide margin.
  • The failure mode is clearly documented in the comment above the sleep.

One suggestion: a polling approach with a hard deadline (instead of fixed Thread.sleep) would make the test more robust against heavily loaded CI runners and faster when the index clears quickly. Current approach is consistent with other tests in the file and is acceptable given the generous margin.
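A polling wait with a hard deadline, as suggested above, needs no library support. The helper below is a plain-Java sketch; `PollUntil` and its parameters are hypothetical names, not part of the test module.

```java
import java.util.function.BooleanSupplier;

final class PollUntil {
  // Polls the condition every pollMs until it holds or timeoutMs elapses.
  // Returns true as soon as the condition is observed to hold.
  static boolean await(final BooleanSupplier condition, final long timeoutMs, final long pollMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean())
        return true;
      if (System.nanoTime() - deadline >= 0)
        return false;               // hard deadline reached
      try {
        Thread.sleep(pollMs);
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In a test this would wrap the "both indexes report zero pending mutations" check, so a fast machine returns after the first successful poll while a slow one still gets the full budget.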


Summary

The bug is genuine, the fix is correct and minimal, thread safety is improved, and the regression test properly validates the retry behavior. The one actionable item is the Level.INFO log in the retry branch - lower it to FINE/DEBUG to avoid log flooding under sustained semaphore contention. The missing volatile on valid is a pre-existing issue worth a separate follow-up. Good fix overall.

@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.47%. Comparing base (cb372f3) to head (3767f56).
⚠️ Report is 5 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...java/com/arcadedb/index/vector/LSMVectorIndex.java    83.33%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4272      +/-   ##
============================================
- Coverage     64.69%   64.47%   -0.23%     
- Complexity        0      558     +558     
============================================
  Files          1644     1645       +1     
  Lines        127412   127456      +44     
  Branches      27299    27316      +17     
============================================
- Hits          82425    82172     -253     
- Misses        33446    33705     +259     
- Partials      11541    11579      +38     


Per review on #4272: valid was a plain boolean written outside the
synchronized(this) monitor (in the drop finally block after the
ReadWriteLock release). isValid() reads it from inside the monitor,
so JMM gave no happens-before between the write and the timer-thread
read. Combined with cancelInactivityRebuildTimer() nulling
inactivityTimer, a stale valid==true read could let the timer task
create a fresh Timer that nobody will cancel - a leaked daemon thread.

Making valid volatile is the minimal fix and directly closes a race
the previous commit exposed by adding a timer-thread caller to
scheduleInactivityRebuild().
@robfrank
Collaborator Author

Verified the pre-existing concern and decided to fix it in this PR rather than defer to a follow-up: the race is exposed by the timer-thread caller added in this PR. Without volatile, a timer task could win the monitor after cancelInactivityRebuildTimer() nulls inactivityTimer but before the valid = false write at line 4183 becomes visible to it. It would pass isValid(), see inactivityTimer == null, create a fresh Timer, and leak the daemon thread - exactly the failure mode this PR's other changes were designed to prevent.

Fixed in ffe9fad (one-word change: marked valid volatile). All 13 LSMVectorIndexRebuildTest tests still pass.

@claude

claude Bot commented May 21, 2026

Code Review

Overview

This PR fixes a genuine bug (#4215): when the inactivity-rebuild timer fired on a small-graph LSM_VECTOR index and REBUILD_SEMAPHORE.tryAcquire() failed, the timer task terminated without rescheduling itself. Since only mutations re-arm the timer, a quiet index could be stuck with pending mutations indefinitely. The fix is minimal, targeted, and correct.


What the PR does well

  • Root cause addressed directly - rescheduling from the skip branch is exactly the right fix.
  • Defensive isValid() guard - avoids scheduling timers on a closed/dropped index; good catch.
  • volatile on valid - buildState was already volatile; isValid() ANDs them, so valid needed this too for cross-thread visibility. Good find.
  • synchronized on both schedule/cancel methods - prevents a race where a mutation thread and the timer thread race to set inactivityRebuildTask to different values.
  • Regression test with @Tag("slow") - correctly tagged per project conventions and actually verifies the failure mode.

Issues and suggestions

1. Duplicate mutationsSinceSerialize.get() > 0 guard in run() (minor, pre-existing)

// Line 4851 - outer guard
if (mutationsSinceSerialize.get() <= 0)
    return;

// ...

// Line 4867 - inner guard, always true here
if (mutationsSinceSerialize.get() > 0) {
    if (REBUILD_SEMAPHORE.tryAcquire()) {

The inner check on line 4867 is always true when reached (the outer check already returned if <= 0). This is pre-existing, but the retry path makes it slightly more confusing to read. Worth cleaning up in a follow-up, or in this PR since the surrounding block is already being touched.

2. No inactivityTimer.purge() call after retry-cancellation

In the retry path, scheduleInactivityRebuild() calls existing.cancel() on the currently-executing task but never calls inactivityTimer.purge(). Each retry cycle adds one cancelled entry to the Timer's internal queue. java.util.Timer does not auto-purge cancelled tasks.

In the normal case (semaphore freed after a handful of retries) this is harmless. But if the semaphore were held for an extended period on an active index, timer queue growth is unbounded. A call to inactivityTimer.purge() after cancelling would be prudent:

final java.util.TimerTask existing = inactivityRebuildTask;
if (existing != null) {
    existing.cancel();
    inactivityTimer.purge(); // reclaim cancelled-task memory
}

3. Test comment references an issue number

// Issue #4215: when the inactivity-rebuild timer fires on a small graph...

Per project conventions, issue references belong in the PR description, not in source comments. The comment should describe why the retry matters, not which ticket it fixes:

// If tryAcquire() fails, the timer must re-arm so the index is not stuck with pending
// mutations during a quiet period (no mutations = no one else to re-arm it).

4. Test sleep window may be tight on a heavily loaded CI runner

Thread.sleep(timeoutMs * 15L) with timeoutMs = 300 gives 4.5 seconds total for two timer cycles plus two rebuilds. At the outermost edge (two consecutive retries before success plus GC pauses) the margin is thin.

Consider bumping to 20x or 25x, or using Awaitility (already in scope in the test module) to avoid a fixed sleep entirely:

Awaitility.await("both indexes should rebuild")
    .atMost(Duration.ofMillis(timeoutMs * 20L))
    .pollInterval(Duration.ofMillis(50))
    .untilAsserted(() -> {
        assertThat(indexA.getStats().get("mutationsSinceRebuild")).isEqualTo(0L);
        assertThat(indexB.getStats().get("mutationsSinceRebuild")).isEqualTo(0L);
    });

Summary

The fix is correct and the synchronization changes are sound. The four items above are: one pre-existing dead-code smell worth cleaning up, one defensive memory practice (purge()), and two test-quality suggestions. None are blockers - the core change is solid.

Per review on #4272: the retry message fires every timeoutMs per
contending index, which floods INFO logs in high-contention setups
(several vector indexes, low timeout). The initial "Inactivity timeout
expired ..." log at INFO above the skip already signals the cycle to
operators, so the skip+retry detail belongs at FINE.
@robfrank
Collaborator Author

Lowered the skip log to Level.FINE in 7ca0b59. Went with the simpler of the two suggested options - the throttle pattern (AtomicLong lastWarnMs, CAS on a 60s window) is what SparseVectorScoringPool uses for saturation warnings, but here the skip message is informational rather than a saturation signal that operators must see. The "Inactivity timeout expired ..." log above it stays at INFO, so operators still see one cycle marker per timer fire without the per-skip retry detail.
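For comparison, the throttle option that was not chosen can be sketched as follows. This is an illustration of the general "AtomicLong plus CAS on a time window" idea only; it is not the code in SparseVectorScoringPool, and `ThrottledLog`, `tryLog`, and the explicit clock parameter are assumptions made so the behaviour is easy to test.

```java
import java.util.concurrent.atomic.AtomicLong;

final class ThrottledLog {
  private static final long NEVER = Long.MIN_VALUE;

  private final long windowMs;
  private final AtomicLong lastMs = new AtomicLong(NEVER);

  ThrottledLog(final long windowMs) {
    this.windowMs = windowMs;
  }

  // Returns true if the caller may emit the message at time nowMs.
  // At most one caller wins per window; losers of the CAS stay silent.
  boolean tryLog(final long nowMs) {
    final long last = lastMs.get();
    if (last != NEVER && nowMs - last < windowMs)
      return false;                          // still inside the quiet window
    return lastMs.compareAndSet(last, nowMs);
  }
}
```

A caller would guard the log statement with `if (throttle.tryLog(System.currentTimeMillis()))`. The extra field is the state the comment above prefers to avoid for a message that is only informational.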

Review items on #4272:

1. Remove the dead `if (mutationsSinceSerialize.get() > 0)` guard
   inside the small-graph branch - the outer check at the top of run()
   already returned for <= 0 and nothing between the two checks
   decrements the counter.

2. Call inactivityTimer.purge() after cancelling the previous task in
   scheduleInactivityRebuild(). The reset-on-write path can otherwise
   leave many cancelled tasks in the Timer's internal queue until
   their scheduled time fires; purge() reclaims them immediately.

3. Drop the issue-number reference from the regression-test comment;
   describe the behavioural invariant instead, per project convention.

4. Replace the Thread.sleep(timeoutMs * 15L) wait in the new test with
   Awaitility.await(...).atMost(...).untilAsserted(...). This gives a
   25x ceiling and short polls, so the test still fails fast when the
   bug is present but is resilient to GC pauses on loaded CI runners.
@claude

claude Bot commented May 21, 2026

Code Review

Overview

This PR fixes a real bug: when a small-graph LSM_VECTOR index's inactivity-rebuild timer fired and REBUILD_SEMAPHORE.tryAcquire() failed, the task silently terminated without re-arming itself. Since only put()/putBatch() re-arm the timer during a quiet period, any index that lost the semaphore race stayed stuck with pending mutations indefinitely. The fix is minimal and well-targeted.


What's Correct and Well Done

  • Core fix is correct. Calling scheduleInactivityRebuild() inside the else branch of the failed tryAcquire() properly re-arms the timer without introducing a busy loop. The new task replaces the old one in inactivityRebuildTask, so there is no timer or task leak.
  • volatile on valid is the right fix. The field is read from both mutation threads and the timer thread; making it volatile ensures visibility without needing a full lock.
  • synchronized on both scheduleInactivityRebuild() and cancelInactivityRebuildTimer() correctly prevents a race between a timer-thread retry-arm and a main-thread close/drop cancel. The isValid() guard at the top of scheduleInactivityRebuild() closes the residual window where valid could be set to false just after the lock check.
  • Log level downgrade from INFO to FINE is the right call - this event can now fire repeatedly during a retry loop and would spam production logs at INFO.
  • Test is tagged @Tag("slow") per project conventions for tests that take noticeably long.

Issues and Suggestions

1. Self-cancel in scheduleInactivityRebuild() when called from within run() - works but worth noting

When the timer task's run() calls scheduleInactivityRebuild(), the method cancels existing (the currently-running task itself) before scheduling a new one. Java's TimerTask.cancel() guarantees this is safe. However, since the task is already executing and is a one-shot schedule, the cancel() call is effectively a no-op. It is not harmful, but a short comment explaining the intent would help future readers.

2. Fixed sleep in the regression test is fragile

Thread.sleep(timeoutMs * 15L); // 4500 ms total

On a heavily loaded CI runner, 15x the timer interval may not be enough for the retry cycle to complete when two rebuilds are contending on the same semaphore. Consider replacing the fixed sleep with a polling loop that breaks as soon as both indexes reach 0 pending mutations, with a generous outer deadline (e.g., timeoutMs * 30L). This makes the test fast on idle hardware, robust on slow hardware, and self-documenting about what it is waiting for.

3. Test comment references the issue number directly

The comment "Issue #4215: when the inactivity-rebuild timer fires..." references the issue number directly. Per the project's coding conventions, issue references belong in the PR description rather than source code since they rot as the codebase evolves. The comment could be rephrased to describe the invariant being tested instead.

4. No retry backoff (minor, acceptable)

The retry re-schedules with the same timeoutMs. If a rebuild holds the semaphore for an unusually long time, this creates a steady stream of re-arms at the base interval. The FINE log level prevents log spam, so this is acceptable for now. A TODO comment would set expectations for a potential future improvement.

5. Redundant inner mutationsSinceSerialize.get() > 0 check (pre-existing, not blocking)

The guard at the top of the timer's run() already returns early if there are no pending mutations, making the inner check inside the small-graph path redundant. Pre-existing code, worth a follow-up cleanup.
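The backoff mentioned in item 4 is not part of this PR. If it were added later, one possible shape is a capped doubling of the delay per consecutive skip. The helper below is a hypothetical sketch; `RetryBackoff`, `nextDelayMs`, and the cap parameter are invented for illustration.

```java
final class RetryBackoff {
  // Doubles baseMs once per consecutive skip, never exceeding maxMs.
  static long nextDelayMs(final long baseMs, final int consecutiveSkips, final long maxMs) {
    if (baseMs <= 0 || maxMs < baseMs)
      throw new IllegalArgumentException("expected 0 < baseMs <= maxMs");
    long delay = baseMs;
    for (int i = 0; i < consecutiveSkips && delay < maxMs; i++)
      delay = delay > maxMs / 2 ? maxMs : delay * 2;   // cap before doubling can overflow
    return delay;
  }

  public static void main(final String[] args) {
    for (int skips = 0; skips <= 6; skips++)
      System.out.println(skips + " skips -> " + nextDelayMs(300, skips, 10_000) + " ms");
  }
}
```

The skip counter would need to be reset whenever a rebuild succeeds or a new write re-arms the timer, which is additional state on top of the current fixed `timeoutMs` retry.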


Security / Performance

No security concerns. The retry loop is bounded naturally by the semaphore eventually being released or the index being dropped/closed (guarded by isValid()). No new allocations on the hot path.


Summary

The fix is correct, the concurrency reasoning is sound, and the regression test validates the right scenario. The main asks before merging: (1) update the test comment to remove the issue reference per project conventions, and (2) consider replacing the fixed Thread.sleep with a polling loop to reduce CI flakiness risk. Everything else is minor or pre-existing.

@robfrank
Collaborator Author

All four items addressed in b46e6a8:

  1. Dead guard removed - the inner if (mutations > 0) was unreachable when false (outer guard at run() top returned for <= 0, and nothing between them decrements).
  2. purge() added - one nit on the analysis: the retry path cancels the currently-running task (already pulled off the Timer queue), so it doesn't actually accumulate there. The real benefit is the pre-existing reset-on-write path: each new mutation cancels the pending task but leaves it in the queue until its original scheduled time. purge() cleans those up immediately, which matters under sustained high write rates with a long timeoutMs.
  3. Issue ref removed from the test comment - rewrote to describe the invariant.
  4. Awaitility swap - 25x ceiling, 50ms polling. Still fails fast on the bug (the assertion never satisfies without the re-arm), but absorbs GC pauses on loaded runners.
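The reset-on-write accumulation described in item 2 can be observed with `java.util.Timer` alone. In this standalone sketch (class and method names are hypothetical), each simulated write cancels the pending task and schedules a new one with a long delay. `Timer.purge()` then reports how many cancelled tasks were still sitting in the queue.

```java
import java.util.Timer;
import java.util.TimerTask;

final class PurgeDemo {
  private static TimerTask noop() {
    return new TimerTask() {
      @Override
      public void run() {
        // intentionally empty: the task never fires during the demo
      }
    };
  }

  // Simulates `resets` writes that each reset a 60 s inactivity timer and
  // returns the number of cancelled tasks that purge() removed from the queue.
  static int cancelledTasksReclaimed(final int resets) {
    final Timer timer = new Timer("purge-sketch", true);
    try {
      TimerTask current = null;
      for (int i = 0; i < resets; i++) {
        if (current != null)
          current.cancel();          // marks the task cancelled but leaves it queued
        current = noop();
        timer.schedule(current, 60_000L);
      }
      return timer.purge();          // drops every cancelled entry in one pass
    } finally {
      timer.cancel();
    }
  }

  public static void main(final String[] args) {
    System.out.println("reclaimed: " + cancelledTasksReclaimed(5));
  }
}
```

Five resets leave four cancelled entries plus one live task, so `purge()` reclaims four. Without it, those entries would stay queued until their original 60-second execution time came up.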

All 13 LSMVectorIndexRebuildTest tests pass.

@claude

claude Bot commented May 21, 2026

Code Review

Overview

This PR fixes a real bug where a small-graph LSM_VECTOR index could be permanently stuck with pending mutations after the inactivity-rebuild timer fired but lost a tryAcquire() race. The root cause is clear, the fix is minimal and targeted.


What's Good

  • Core fix is correct. Re-arming the timer in the else (skip) branch is exactly right. The old code relied on put()/putBatch() callers to re-arm, but in a quiet period there are no callers - so the stuck state was permanent.
  • valid made volatile - necessary for correct cross-thread visibility when drop() sets it to false from a write-lock thread and the timer thread reads it.
  • synchronized on scheduleInactivityRebuild() and cancelInactivityRebuildTimer() - these two methods now form a proper monitor that serializes timer field mutations.
  • isValid() guard at the top of scheduleInactivityRebuild() prevents pointless reschedules after the index is dropped or invalidated.
  • inactivityTimer.purge() on cancel is a nice bonus - prevents cancelled TimerTask objects from accumulating in the timer queue under high write rates.
  • Log level INFO → FINE for the skip message is the right call; with retries the message could otherwise fill logs.
  • Inner mutationsSinceSerialize.get() > 0 check removed safely - the equivalent check already exists at the top of run() and is re-applied at the start of the re-armed scheduleInactivityRebuild() call.
  • Test properly annotated @Tag("slow"), uses Awaitility (already a parent-pom test dep), and the negative-case description in the comment is clear.

Issues / Suggestions

1. Close-vs-rearm race (minor)

cancelInactivityRebuildTimer() is called from close(), but valid = false is set in drop() (after the write-lock is released). The window is:

Thread A (close): cancelInactivityRebuildTimer() → inactivityTimer = null
Thread B (timer): run() → tryAcquire fails → scheduleInactivityRebuild()
                          isValid() → still true (close ≠ drop)
                          inactivityTimer == null → creates a NEW Timer
                          schedules a new task

A new Timer gets created and a task gets scheduled even though the index was just closed. The exception handler in run() will catch whatever buildGraphFromScratch() throws when it discovers the closed state, so this is not a crash, but it is a resource leak (daemon timer thread + extra rebuild attempt).

Simplest fix: also set valid = false early in close(), before calling cancelInactivityRebuildTimer(), or add an explicit closed boolean flag checked by scheduleInactivityRebuild().

2. Possible tight-retry loop if semaphore is never released (cosmetic concern)

If the semaphore is permanently held (e.g., buildGraphFromScratch() deadlocks or hangs), this index will reschedule every timeoutMs forever. Given that log level is now FINE, this would be silent. Consider adding a retry counter / backoff, or at least logging at WARNING after N consecutive skips.

3. Test timeout multiplier lacks a comment

.atMost(Duration.ofMillis(timeoutMs * 25L))

25× isn't obvious. A brief comment ("two indexes × up to 10 retries each + 5× slack") would help future readers understand why this bound is safe.

4. Pre-existing: fully-qualified Timer/TimerTask names

The surrounding code (and this PR) references java.util.TimerTask and java.util.Timer as fully qualified names rather than as imports. Per CLAUDE.md, classes should always be imported and used by their simple name. This is pre-existing and outside the scope of this PR, but worth a follow-up cleanup.


Test Coverage

The regression test correctly exercises the exact failure mode (two small-graph indexes, single-permit semaphore, quiet period after mutations). It is the right shape for this bug. No functional gaps found.


Summary

The fix is correct and the approach is sound. The close-vs-rearm race (#1) is the only concern worth addressing before merge; the others are minor polish items.

Review on #4272:

- close() now sets `valid = false` before cancelInactivityRebuildTimer().
  Previously only drop() set valid=false (after releasing the write
  lock), so a plain close() left isValid() returning true forever. A
  timer task that ran between cancelInactivityRebuildTimer() (which
  nulls inactivityTimer) and any subsequent state change could see
  isValid()==true and inactivityTimer==null, resurrect a fresh Timer,
  and leak its daemon thread.

- Add a one-line note next to the 25x Awaitility ceiling in the
  regression test explaining where the multiplier comes from.
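The ordering change in close() can be illustrated with a reduced model. Everything in this sketch is hypothetical (`CloseOrderingDemo`, `closeWithoutInvalidating`, `timersAfterLateSchedule`). It only mirrors the shape described above: a validity flag, a lazily created Timer that cancel nulls out, and a late re-arm arriving after close.

```java
import java.util.Timer;
import java.util.TimerTask;

final class CloseOrderingDemo {
  private volatile boolean valid = true;
  private Timer timer;        // guarded by "this"
  private int timersCreated;  // guarded by "this"; exists only for the demo

  synchronized boolean schedule(final long delayMs) {
    if (!valid)
      return false;                         // closed: do not resurrect the Timer
    if (timer == null) {
      timer = new Timer("close-order-sketch", true);
      timersCreated++;
    }
    timer.schedule(new TimerTask() {
      @Override
      public void run() {
        // placeholder for the rebuild attempt
      }
    }, delayMs);
    return true;
  }

  synchronized void cancelTimer() {
    if (timer != null) {
      timer.cancel();
      timer = null;
    }
  }

  // Pre-fix shape: the timer is cancelled but the flag still reads true.
  void closeWithoutInvalidating() {
    cancelTimer();
  }

  // Post-fix shape: invalidate first, then cancel.
  void close() {
    valid = false;
    cancelTimer();
  }

  synchronized int timersCreated() {
    return timersCreated;
  }

  // Returns how many Timers were created in total when a re-arm arrives after close.
  static int timersAfterLateSchedule(final boolean invalidateFirst) {
    final CloseOrderingDemo index = new CloseOrderingDemo();
    index.schedule(60_000L);
    if (invalidateFirst)
      index.close();
    else
      index.closeWithoutInvalidating();
    index.schedule(60_000L);                // the late timer-thread re-arm
    final int created = index.timersCreated();
    index.cancelTimer();                    // clean up any resurrected Timer
    return created;
  }
}
```

With the flag cleared first, the late `schedule()` is refused and only the original Timer ever exists. With the pre-fix ordering, a second Timer is created that nothing in the close path would cancel.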
@robfrank
Collaborator Author

Four items; the substantive ones are addressed in e6fedec:

1. Close-vs-rearm race: Verified - close() did not set valid=false (only drop() did, and only after the write lock released). Set valid = false early in close() before cancelInactivityRebuildTimer(). The drop() assignment is now idempotent but I left it as-is to avoid scope creep.

3. 25x multiplier comment: Added a one-liner explaining the budget (first cycle + retry cycle + two synchronous rebuilds + slack).

2. Tight-retry loop with permanently-held semaphore: Deferring. The scenario presumes buildGraphFromScratch() deadlocks or hangs forever, which would be a separate bug worth surfacing at SEVERE from inside the build itself rather than inferring from external observation. Adding a consecutive-skip counter + throttled WARN is doable (the SparseVectorScoringPool pattern is right there) but it adds state I'd rather not introduce on speculation. Happy to file a follow-up issue if you'd like, or revisit if real-world traces show this matters.

4. Fully-qualified java.util.Timer names: Skipping per your note that it's pre-existing and out of scope. The qualified name is intentional in this block to disambiguate from org.apache.kerby.util.Timer etc. floating around the import set, but agreed a follow-up could re-evaluate.

All 155 vector tests pass.

@codacy-production

codacy-production Bot commented May 21, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Coverage 94.44% diff coverage · -7.63% coverage variation

Metric Results
Coverage variation -7.63% coverage variation
Diff coverage 94.44% diff coverage

Coverage variation details
| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (cb372f3) | 127412 | 93995 | 73.77% |
| Head commit (3767f56) | 159134 (+31722) | 105249 (+11254) | 66.14% (-7.63%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#4272) | 18 | 17 | 94.44% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


Per CLAUDE.md: classes should be imported and referenced by simple
name. The fully-qualified java.util.Timer/TimerTask usage in this
block was an oversight - there is no name conflict in the import set.
@robfrank
Collaborator Author

Correction on item 4: I claimed the fully-qualified java.util.Timer was "intentional to disambiguate". That was wrong - I checked and there is no Timer in the import set at all, so no conflict exists. Just an oversight from the original author. Cleaned up in 3767f56 (added the two imports, dropped four java.util. prefixes). Per CLAUDE.md class-import convention.

@claude

claude Bot commented May 21, 2026

test

@claude

claude Bot commented May 21, 2026

Code Review

Overview

This PR fixes a real bug: when a small-graph LSM_VECTOR index inactivity-rebuild timer fired and REBUILD_SEMAPHORE.tryAcquire() failed, the task logged "Skipping" and exited without rescheduling itself. Since only put()/putBatch() re-arm the timer, an index skipped during a quiet period stayed stuck with pending mutations indefinitely. The fix re-schedules the timer from the skip branch so the index retries at the next interval.
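The skip-then-retry behaviour described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual LSMVectorIndex code: the class `RetryingRebuildSketch`, its fields, and the demo method are hypothetical names, and the "rebuild" is reduced to resetting a counter. Only the shape (a single-permit semaphore, `tryAcquire()`, and a re-arm from the skip branch) is taken from the PR description.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical stand-in for an index with an inactivity-rebuild timer. */
public class RetryingRebuildSketch {
  // Stand-in for the shared single-permit rebuild semaphore.
  static final Semaphore REBUILD_SEMAPHORE = new Semaphore(1);

  final AtomicInteger pendingMutations = new AtomicInteger();
  final AtomicInteger skippedFires = new AtomicInteger();
  final CountDownLatch rebuilt = new CountDownLatch(1);

  private final long timeoutMs;
  private volatile boolean valid = true;
  private Timer timer;
  private TimerTask task;

  RetryingRebuildSketch(final long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Records a mutation and (re)arms the inactivity timer. */
  void put() {
    pendingMutations.incrementAndGet();
    scheduleInactivityRebuild();
  }

  synchronized void scheduleInactivityRebuild() {
    if (!valid || pendingMutations.get() <= 0)
      return; // closed, or nothing left to rebuild: the retry chain ends here
    if (task != null)
      task.cancel();
    if (timer == null)
      timer = new Timer("rebuild-sketch", true);
    task = new TimerTask() {
      @Override
      public void run() {
        if (!REBUILD_SEMAPHORE.tryAcquire()) {
          skippedFires.incrementAndGet();
          scheduleInactivityRebuild(); // the fix: re-arm instead of exiting silently
          return;
        }
        try {
          pendingMutations.set(0); // stand-in for the synchronous rebuild
          rebuilt.countDown();
        } finally {
          REBUILD_SEMAPHORE.release();
        }
      }
    };
    timer.schedule(task, timeoutMs);
  }

  synchronized void close() {
    valid = false; // set first so a late retry cannot create a new Timer
    if (timer != null) {
      timer.cancel();
      timer = null;
    }
  }

  /** Holds the permit across at least one fire, releases it, then waits for the retry to rebuild. */
  static boolean skippedIndexIsEventuallyServed() {
    final RetryingRebuildSketch index = new RetryingRebuildSketch(20L);
    REBUILD_SEMAPHORE.acquireUninterruptibly(); // simulate another index's rebuild in progress
    try {
      index.put();
      final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
      while (index.skippedFires.get() == 0 && System.nanoTime() < deadline)
        LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(5));
    } finally {
      REBUILD_SEMAPHORE.release(); // the competing rebuild finishes
    }
    try {
      final boolean served = index.rebuilt.await(5, TimeUnit.SECONDS);
      return served && index.skippedFires.get() > 0 && index.pendingMutations.get() == 0;
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      index.close();
    }
  }
}
```

Removing the `scheduleInactivityRebuild()` call from the skip branch reproduces the reported symptom in this sketch: the first fire is skipped, nothing re-arms the timer, and the pending count never drops.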

The approach is correct, the synchronization analysis is sound, and the regression test demonstrates the failure clearly.


Correctness and Synchronization

Good:

  • Making valid volatile is correct. isValid() is read from the timer thread while being written from close() / drop(), so without volatile the write was not guaranteed to be visible.
  • Setting valid = false before calling cancelInactivityRebuildTimer() in close() is the right ordering. Without it, a timer thread that lost the monitor race could call scheduleInactivityRebuild() after the timer was nulled and resurrect a fresh Timer on a closing index.
  • Adding synchronized to both scheduleInactivityRebuild() and cancelInactivityRebuildTimer() closes the TOCTOU window where one thread could read inactivityTimer != null just before another nulled it.
  • The isValid() guard at the top of scheduleInactivityRebuild() is the correct final gate: after close() sets valid = false, any retry path in run() that calls back into scheduleInactivityRebuild() will exit early.
  • The log level downgrade from INFO to FINE for the "Skipping" message is the right call - the message can now fire repeatedly and would otherwise flood operators' logs.

Minor concern - livelock under sustained semaphore pressure:

If a large-graph async rebuild (which does a blocking acquire()) is perpetually queued, a small-graph index will retry every timeoutMs indefinitely. In practice this cannot happen because individual rebuilds finish quickly, but there is no safety valve. An exponential back-off cap (e.g., clamp retry interval at min(timeoutMs * 8, 30_000)) or a maximum retry counter that escalates to REBUILD_SEMAPHORE.acquire() (blocking) after N retries would make the retry policy more robust. Not blocking this PR, but worth a follow-up issue.
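One possible shape for the suggested back-off cap, as a sketch only. The helper name `retryDelayMs` and the consecutive-skip counter it takes are hypothetical; the only figures taken from the comment above are the 8x multiplier and the 30_000 ms ceiling. The extra rule that the delay never drops below `timeoutMs` is an assumption added here so that large configured timeouts are not shortened by the cap.

```java
/** Hypothetical helper for a capped exponential retry delay. */
public class RetryBackoffSketch {
  static final long MAX_RETRY_DELAY_MS = 30_000L;
  static final int MAX_DOUBLINGS = 3; // 2^3 = 8x the base interval

  /**
   * Doubles the delay per consecutive skip, clamped at min(timeoutMs * 8, 30_000),
   * and never below timeoutMs itself. Assumes timeoutMs is positive and small
   * enough that timeoutMs * 8 does not overflow a long.
   */
  static long retryDelayMs(final long timeoutMs, final int consecutiveSkips) {
    final long cap = Math.max(timeoutMs, Math.min(timeoutMs * 8, MAX_RETRY_DELAY_MS));
    final int doublings = Math.min(Math.max(consecutiveSkips, 0), MAX_DOUBLINGS);
    return Math.min(timeoutMs << doublings, cap);
  }
}
```

For a 300 ms timeout this yields 300, 600, 1200, 2400 and then stays at 2400 ms; for a 10 s timeout it reaches the 30 s ceiling on the second consecutive skip.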


timer.purge() Inside the Synchronized Block

if (existing != null) {
    existing.cancel();
    if (inactivityTimer != null)
        inactivityTimer.purge();
}

Timer.purge() is O(n) in the timer queue depth. Holding LSMVectorIndex.this for that work is fine at normal write rates (queue depth stays near 1), but under a pathological high-frequency write storm the lock is held longer than necessary. Consider calling purge() outside the synchronized block after saving the Timer reference, or calling it only every N cancels. Not a blocker, but worth keeping in mind if this lock ever shows up in profiling.
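A sketch of the "purge outside the lock" variant, under simplifying assumptions: the timer field is final here (the real index can null it), and `PurgeOutsideLockSketch` with its diagnostic counters is hypothetical. `Timer.purge()` synchronizes on the timer's own internal queue, so it does not need the index monitor for safety.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical variant that cancels under the monitor but purges after releasing it. */
public class PurgeOutsideLockSketch {
  private final Timer inactivityTimer = new Timer("purge-sketch", true);
  private TimerTask inactivityRebuildTask;

  // Diagnostics for the single-threaded demo only; not thread-safe.
  int purgeCalls;
  boolean purgeRanUnderLock;

  void scheduleInactivityRebuild(final long delayMs) {
    final boolean cancelledOld;
    synchronized (this) {
      final TimerTask existing = inactivityRebuildTask;
      cancelledOld = existing != null;
      if (cancelledOld)
        existing.cancel(); // marks the task cancelled; it stays queued until purged or reached
      inactivityRebuildTask = new TimerTask() {
        @Override
        public void run() {
          // a rebuild attempt would go here
        }
      };
      inactivityTimer.schedule(inactivityRebuildTask, delayMs);
    }
    if (cancelledOld) {
      // The O(n) queue sweep runs without holding the index monitor.
      purgeRanUnderLock |= Thread.holdsLock(this);
      purgeCalls++;
      inactivityTimer.purge();
    }
  }

  void close() {
    inactivityTimer.cancel();
  }

  /** Schedules twice so the second call has an old task to cancel and purge. */
  static boolean purgeHappensOutsideMonitor() {
    final PurgeOutsideLockSketch index = new PurgeOutsideLockSketch();
    try {
      index.scheduleInactivityRebuild(60_000L);
      index.scheduleInactivityRebuild(60_000L);
      return index.purgeCalls == 1 && !index.purgeRanUnderLock;
    } finally {
      index.close();
    }
  }
}
```

The trade-off is that a concurrent `close()` may cancel the timer between the synchronized block and the purge; purging a cancelled `Timer` is harmless, so the sketch does not guard against it.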


drop() Sets valid = false Twice

drop() calls close() (which already sets valid = false) and then sets valid = false again in the finally block. The double write is harmless, but the second assignment is now dead code after this PR. Could be cleaned up in a follow-on.


Test

The test is well-structured:

  • highThreshold = 10_000 prevents mutation-count triggering from masking inactivity-path coverage.
  • 50 vectors per index is well below ASYNC_REBUILD_MIN_GRAPH_SIZE = 1000, correctly targeting the synchronous small-graph code path.
  • @Tag("slow") is correctly applied per project conventions.
  • Awaitility is already on the classpath, so no new dependency is introduced.
  • The 25x timeout multiplier comment is clear and well-reasoned.

One nit: the assumption "well under 1 s each at 50 vectors" holds on developer hardware but CI runners can be slower under load. The 7.5 s budget is comfortable, but if this test proves flaky on CI a bump to 40x or 50x would be an easy fix.


Summary

The fix is correct, the synchronization is properly thought through, and the regression test captures the failure mode cleanly. The minor concerns above (livelock bound, purge lock hold time, dead valid = false in drop()) are all follow-up candidates rather than blockers.

@claude

claude Bot commented May 21, 2026

Code Review - PR #4272: fix LSM_VECTOR inactivity rebuild timer never re-armed after skip

Overview

Targeted, minimal fix for a real liveness bug: when tryAcquire() fails in the small-graph inactivity rebuild path, the timer task now re-arms itself via scheduleInactivityRebuild() instead of silently terminating. The accompanying thread-safety improvements (synchronized, volatile valid, ordered shutdown) are welcome defensive additions.


Fix Correctness

Core one-liner - Calling scheduleInactivityRebuild() from the tryAcquire() failure branch is exactly right. The method already guards against mutationsSinceSerialize <= 0 and !isValid(), so the reschedule is self-limiting.

Self-cancel in timer run() - When re-scheduling, scheduleInactivityRebuild() reads inactivityRebuildTask (the still-running task) and calls cancel() on it. Calling cancel() on an already-executing TimerTask is a documented JDK no-op (returns false), so this is harmless.
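The self-cancel behaviour can be checked with a few lines of plain JDK code. This sketch assumes one-shot scheduling via `Timer.schedule(task, delay)`; the class and method names are hypothetical. For a one-shot task, `TimerTask.cancel()` returns false once the task has started running, and the current execution runs to completion.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Demonstrates cancel() being called on a one-shot TimerTask from inside its own run(). */
public class SelfCancelSketch {

  /** Returns the value cancel() reported when invoked by the running task itself. */
  static boolean cancelFromInsideRun() {
    final Timer timer = new Timer("self-cancel-sketch", true);
    final CountDownLatch done = new CountDownLatch(1);
    final AtomicBoolean cancelResult = new AtomicBoolean(true);

    final TimerTask task = new TimerTask() {
      @Override
      public void run() {
        cancelResult.set(cancel()); // the task cancels itself while executing
        done.countDown();           // run() still completes normally afterwards
      }
    };
    timer.schedule(task, 10L); // one-shot execution

    try {
      if (!done.await(5, TimeUnit.SECONDS))
        throw new IllegalStateException("timer task never ran");
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      timer.cancel();
    }
    return cancelResult.get();
  }
}
```

A task scheduled for repeated execution behaves differently (its `cancel()` returns true and stops future runs), so the harmlessness argument above depends on the timer being armed one shot at a time.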

Shutdown ordering - Setting valid = false before cancelInactivityRebuildTimer() is correct and the comment explaining the race it closes is clear.


Observations

Removed inner mutation guard (minor, worth a comment):

The original code had:

if (mutationsSinceSerialize.get() > 0) {
  if (REBUILD_SEMAPHORE.tryAcquire()) { ... }
}

The new code drops the inner mutationsSinceSerialize > 0 check around buildGraphFromScratch(). The outer guard at the top of run() and the scheduleInactivityRebuild() check cover the rescheduling path, but buildGraphFromScratch() can now be called in the rare window where mutations drained to 0 between the outer check and the acquire. If buildGraphFromScratch() is cheap/idempotent when there is nothing to do, this is fine - a brief comment would confirm intent.

Exception path does not reschedule (intentional asymmetry):

The catch (Exception e) block at the end of run() logs a warning and returns without rescheduling. The skip path now retries; the exception path does not. This is probably the right call to avoid retry storms on persistent errors, but the two paths are now asymmetric, so a short comment in the catch block is worthwhile (for example, `// deliberate: do not retry on error, to avoid an endless retry loop`).

timeoutMs in log message is closure-captured (negligible):

The log message inside run() uses the timeoutMs value captured when the enclosing scheduleInactivityRebuild() was called. If the config value changes between schedules the message could be stale, but since it is just diagnostic text this is fine.

timer.purge() call is a good addition - prevents cancelled tasks from accumulating in the queue under high write rates.


Thread Safety

synchronized on both scheduleInactivityRebuild() and cancelInactivityRebuildTimer() correctly serialises the timer thread (calling scheduleInactivityRebuild() on retry) against mutation threads. No deadlock risk: neither method holds the instance monitor while calling back into client code or acquiring any other lock.


Test Coverage

The regression test is well-structured:

  • @Tag("slow") applied correctly per project convention - good.
  • Seeded Random(42) makes the data reproducible.
  • 25x timeout window gives ample slack for CI GC pauses without being unbounded.
  • Early assertion (isGreaterThan(0L)) catches a schema-race before the timing window starts.
  • White-box cast through TypeIndex -> LSMVectorIndex to inspect getStats() directly is the right approach for this kind of liveness test.

One minor nit: the comment above the test method is a good explanation of the bug scenario - consider promoting it to an inline Javadoc block with @see referencing issue #4215 for easier navigation from the index code.


Summary

The fix is correct, the test covers the exact failure scenario, and the defensive improvements around valid and synchronized are worth having. Two actionable items before merge:

  1. Add a comment in the exception catch block explaining why it deliberately does not reschedule (symmetry with the now-retrying skip path).
  2. Optional: add a brief comment on the removed mutationsSinceSerialize > 0 inner guard explaining that the outer guard and buildGraphFromScratch() idempotency cover this case.

Everything else is minor observation. Good PR.

@robfrank robfrank merged commit 6918763 into main May 21, 2026
21 of 24 checks passed
@robfrank robfrank added this to the 26.6.1 milestone May 21, 2026

Development

Successfully merging this pull request may close these issues.

LSM_VECTOR inactivity rebuild timer is never re-armed when a fire is Skipped, leaving indexes stuck with pending mutations indefinitely
