
fix: backup concurrency race on Linux - flush thread ignored isSuspended #3774

Merged

robfrank merged 3 commits into main from fix/fullbackup-it-on-linux on Apr 3, 2026
Conversation

robfrank (Collaborator) commented Apr 3, 2026

Summary

  • PageManagerFlushThread never checked isSuspended() in its run loop, so the background thread kept writing pages to database files via FileChannel.write() while the backup's FileInputStream.transferTo() was reading those same files
  • On Linux's CFS scheduler this race caused FullBackupIT.fullBackupConcurrency to fail with count % 500 != 0 (partial transaction in backup)
  • Added deferred-flush queue per database: when the background thread polls a batch for a suspended database it defers it instead of flushing
  • setSuspended(false) now synchronously flushes deferred batches (preserving commit order), then re-enables normal async flushing
  • Replaced the broken one-shot flushPagesFromQueueToDisk(database, 0L) pre-backup call with waitForCurrentFlushToComplete(database) to properly wait out any in-progress write
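The deferral mechanism described above can be sketched as follows. This is a minimal, hypothetical simulation, not ArcadeDB's actual PageManagerFlushThread: the names (isSuspended, deferredByDatabase, PagesToFlush) mirror the PR description, but the types are simplified (plain String database keys, no real I/O).

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of the deferred-flush decision described in the PR summary.
// Hypothetical simplification: the real code keys on Database objects and
// performs FileChannel writes; here databases are strings and "flushing" is a no-op.
public class DeferredFlushSketch {
  public record PagesToFlush(String database, int pageCount) {}

  private final Map<String, Boolean> suspended = new ConcurrentHashMap<>();
  private final Map<String, Queue<PagesToFlush>> deferredByDatabase = new ConcurrentHashMap<>();

  public void setSuspendedFlag(String db, boolean value) {
    if (value)
      suspended.put(db, Boolean.TRUE);
    else
      suspended.remove(db);
  }

  public boolean isSuspended(String db) {
    return suspended.getOrDefault(db, Boolean.FALSE);
  }

  /** Background-thread path: defer instead of writing when the database is suspended. */
  public boolean flushOrDefer(PagesToFlush batch) {
    if (isSuspended(batch.database())) {
      deferredByDatabase.computeIfAbsent(batch.database(), k -> new ConcurrentLinkedQueue<>()).offer(batch);
      return true; // deferred: no disk I/O while the backup reads the files
    }
    // ...the real thread would FileChannel.write() the pages here...
    return false;
  }

  /** Resume path: synchronously drain deferred batches in FIFO (commit) order. */
  public int drainDeferred(String db) {
    final Queue<PagesToFlush> q = deferredByDatabase.remove(db);
    int flushed = 0;
    if (q != null)
      while (q.poll() != null)
        flushed++; // the real code flushes each batch to disk here
    return flushed;
  }

  public static void main(String[] args) {
    final DeferredFlushSketch s = new DeferredFlushSketch();
    s.setSuspendedFlag("mydb", true);                                // backup starts
    System.out.println(s.flushOrDefer(new PagesToFlush("mydb", 3))); // true: deferred
    s.setSuspendedFlag("mydb", false);                               // backup done
    System.out.println(s.drainDeferred("mydb"));                     // 1 batch drained
  }
}
```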

Test plan

  • FullBackupIT#fullBackupConcurrency passes (was failing on Linux CI)
  • Full FullBackupIT suite (6 tests) passes locally

🤖 Generated with Claude Code

The background PageManagerFlushThread never checked isSuspended(), so it
kept writing pages to database files via FileChannel.write() while the
backup's FileInputStream.transferTo() was reading those same files. On
Linux's CFS scheduler this race caused partial transaction data in
backups (FullBackupIT.fullBackupConcurrency failing with count % 500 != 0).

- Add deferredByDatabase map: when the background thread polls a batch
  for a suspended database it moves it to the deferred queue instead of
  flushing, leaving pageIndex intact
- Add waitForCurrentFlushToComplete(Database) to wait out any flush that
  was already in-progress when setSuspended(true) was called
- setSuspended(false) now: (1) synchronously flushes deferred batches
  while still suspended to preserve commit order, (2) removes the
  suspend flag, (3) re-enqueues any tail batches that arrived during (1)
- Replace the one-shot flushPagesFromQueueToDisk(database, 0L) in
  suspendFlushAndExecute with waitForCurrentFlushToComplete so the
  backup only starts reading after the last in-flight write completes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
codacy-production Bot commented Apr 3, 2026

Up to standards ✅

🟢 Issues: 0 new issues (View in Codacy)

🟢 Metrics: Complexity 17 (View in Codacy)

TIP: This summary will be updated as you push new changes.

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request implements a deferred flushing mechanism for suspended databases in the PageManagerFlushThread. It introduces a deferredByDatabase map to hold pages during suspension and updates the setSuspended method to process these pages upon resumption. Review feedback identifies two issues: a potential NullPointerException in the new waitForCurrentFlushToComplete method, caused by reading the atomic reference twice, and a potential data-loss path in setSuspended, where queue.offer with a timeout could silently fail to re-enqueue deferred batches if the queue is full.

Comment on lines +201 to +204
    public void waitForCurrentFlushToComplete(final Database database) throws InterruptedException {
      while (nextPagesToFlush.get() != null && database.equals(nextPagesToFlush.get().database))
        Thread.sleep(1);
    }
Contributor

high

The waitForCurrentFlushToComplete method is susceptible to a NullPointerException. The nextPagesToFlush.get() call is performed twice in the while condition. If the background flush thread completes and sets the reference to null between these two calls, the second call will return null, causing an NPE when accessing the .database field. It is safer to capture the result of the first call in a local variable.

Suggested change

    public void waitForCurrentFlushToComplete(final Database database) throws InterruptedException {
      while (nextPagesToFlush.get() != null && database.equals(nextPagesToFlush.get().database))
        Thread.sleep(1);
    }

    public void waitForCurrentFlushToComplete(final Database database) throws InterruptedException {
      PagesToFlush current;
      while ((current = nextPagesToFlush.get()) != null && database.equals(current.database))
        Thread.sleep(1);
    }

Collaborator Author
Fixed in 7d881c8 - capture the reference once per iteration: while ((current = nextPagesToFlush.get()) != null && database.equals(current.database))

Comment on lines +236 to +246
    final ConcurrentLinkedQueue<PagesToFlush> newDeferred = deferredByDatabase.remove(database);
    if (newDeferred != null) {
      for (final PagesToFlush batch : newDeferred) {
        try {
          queue.offer(batch, 1, TimeUnit.SECONDS);
        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
Contributor

high

In Phase 3 of setSuspended, using queue.offer(batch, 1, TimeUnit.SECONDS) can lead to silent data loss if the queue is full. Since these batches have already been removed from deferredByDatabase and are tracked in pageIndex, failing to re-enqueue them means they will never be flushed to disk. This could lead to data inconsistency as the system assumes these pages are pending flush. Using queue.put(batch) is more appropriate here to ensure the batches are eventually re-enqueued, even if the calling thread has to block briefly.

Suggested change

    final ConcurrentLinkedQueue<PagesToFlush> newDeferred = deferredByDatabase.remove(database);
    if (newDeferred != null) {
      for (final PagesToFlush batch : newDeferred) {
        try {
          queue.offer(batch, 1, TimeUnit.SECONDS);
        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }

    final ConcurrentLinkedQueue<PagesToFlush> newDeferred = deferredByDatabase.remove(database);
    if (newDeferred != null) {
      for (final PagesToFlush batch : newDeferred) {
        try {
          queue.put(batch);
        } catch (final InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }

Collaborator Author
Fixed in 7d881c8 - Phase 3 now retries queue.offer in a loop (mirroring scheduleFlushOfPages) with a WARNING log on each full-queue retry, so no batch is silently dropped. Also added a comment explaining the tail-append ordering and WAL safety.

claude Bot commented Apr 3, 2026

Code Review

Good fix for the backup concurrency race. The approach of deferring batches during suspension is sound. A few issues worth addressing before merging:

Bug: NPE race in waitForCurrentFlushToComplete (PageManagerFlushThread.java:202)

The flush thread sets nextPagesToFlush to null in its finally block. Between the null-check and the second .get().database call in the loop condition, another thread can null it out, causing NPE. Fix by capturing the reference once:

    PagesToFlush current;
    while ((current = nextPagesToFlush.get()) != null && database.equals(current.database))
      Thread.sleep(1);

Bug: DatabaseMetadataException not caught in deferred flush path (PageManagerFlushThread.java:221-224)

The main flush path catches DatabaseMetadataException (a RuntimeException subclass) for "file deleted" cases. Phase 1 of setSuspended only catches IOException. A deleted file during deferred flushing throws an uncaught DatabaseMetadataException, aborting Phase 1 mid-loop and leaving pages in pageIndex as ghost entries. Should mirror the main flush catch block.

Bug: Silent page loss in Phase 3 (PageManagerFlushThread.java:240)

The return value of queue.offer(batch, 1, TimeUnit.SECONDS) is not checked. This method returns false on timeout rather than throwing. Under heavy write load the queue could be full for 1 second, silently dropping the batch so those pages are never written to disk. At minimum log a SEVERE warning; ideally retry like scheduleFlushOfPages does.

Concern: Phase 3 appends deferred batches to the tail of the queue

Deferred batches (committed before suspension was released) are re-enqueued at the back via offer, behind new commits that arrived after suspended.remove(database) in Phase 2. This reverses commit order for the async flush path. Whether it matters depends on whether the WAL makes out-of-order page flushing safe - worth a deliberate comment.

Minor: Busy-wait with Thread.sleep(1)

Both waitForCurrentFlushToComplete and waitAllPagesOfDatabaseAreFlushed use sleep loops. Consistent with the pre-existing pattern, but not ideal for very short flush operations.

Overall the fix addresses the real root cause. The NPE and the silent page drop are the most urgent items to fix.

…page loss

- waitForCurrentFlushToComplete: capture nextPagesToFlush.get() once per
  iteration to avoid NPE between null-check and .database access
- setSuspended Phase 1: add DatabaseMetadataException catch to mirror
  the main flush path (file-deleted case)
- setSuspended Phase 3: retry queue.offer in a loop instead of
  fire-and-forget to prevent silent page loss when queue is full;
  add comment explaining tail-append ordering and WAL safety guarantee

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude Bot commented Apr 3, 2026

Code Review

Good fix for the backup concurrency race - the deferred-flush design is the right approach. Note: a prior automated review left some inaccurate observations; this review corrects those.

Correction of prior review claims

The previous automated review flagged two issues that are NOT bugs in the actual code:

  • NPE in waitForCurrentFlushToComplete: The code already captures the atomic reference once via the assignment-in-condition pattern (current = nextPagesToFlush.get()) != null && database.equals(current.database) - this is safe.
  • DatabaseMetadataException not caught in Phase 1: The code does catch it, alongside IOException.

Real issue: InterruptedException in Phase 3 silently drops batches

The break in the catch (InterruptedException) block exits the inner while (running) loop only, not the outer for (batch) loop. The current batch being offered is dropped. Then Thread.currentThread().interrupt() leaves the flag set, so the next iteration's queue.offer(..., 1, TimeUnit.SECONDS) immediately re-throws InterruptedException - cascading silently through all remaining deferred batches. Each is lost without any logging. This is a data loss path on thread interruption.

Suggested fix - label the outer loop and break from it:

outerLoop:
for (final PagesToFlush batch : newDeferred) {
    while (running) {
        try {
            if (queue.offer(batch, 1, TimeUnit.SECONDS))
                break;
            LogManager.instance().log(this, Level.WARNING,
                "Page flush queue is full while re-enqueueing deferred batch for database '%s'; retrying", database.getName());
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
            LogManager.instance().log(this, Level.SEVERE,
                "Interrupted while re-enqueueing deferred batches for database '%s'; remaining batches may be lost",
                database.getName());
            break outerLoop;
        }
    }
}

Type safety: unconditional cast to Database

PagesToFlush.database is typed as BasicDatabase, but two new lines cast it unconditionally to Database (for the deferredByDatabase computeIfAbsent and the isSuspended call). Works today since all concrete objects implement Database extends BasicDatabase, but fragile. Using a Java 16+ pattern match is safer and cleaner:

    if (database == null && pagesToFlush.database instanceof final Database db && isSuspended(db)) {
        deferredByDatabase.computeIfAbsent(db, k -> new ConcurrentLinkedQueue<>()).offer(pagesToFlush);
        return;
    }

Minor: waitForCurrentFlushToComplete is a 1ms busy-wait

Functional for backup scenarios (infrequent), but a CountDownLatch or Condition signalled in the finally block of flushPagesFromQueueToDisk would eliminate the polling. Low priority.
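A latch-based alternative could look like the following sketch. This is hypothetical, not ArcadeDB code: a real version would pair the latch with the per-database check that waitForCurrentFlushToComplete performs. The flush thread opens a fresh latch before each batch and counts it down in its finally block, so waiters block without polling.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of replacing the 1 ms sleep-poll with a CountDownLatch.
public class FlushLatchSketch {
  // Count 0 means "no flush in progress": await() returns immediately.
  private volatile CountDownLatch currentFlush = new CountDownLatch(0);

  /** Called by the flush thread just before it starts writing a batch. */
  public void beginFlush() {
    currentFlush = new CountDownLatch(1);
  }

  /** Called from the flush thread's finally block after the batch is written. */
  public void endFlush() {
    currentFlush.countDown();
  }

  /** Blocks, without busy-waiting, until any in-progress flush completes. */
  public void waitForCurrentFlushToComplete() throws InterruptedException {
    currentFlush.await();
  }

  public static void main(String[] args) throws InterruptedException {
    final FlushLatchSketch s = new FlushLatchSketch();
    s.beginFlush();
    final Thread flusher = new Thread(() -> {
      try {
        Thread.sleep(50); // simulate a page write in progress
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      }
      s.endFlush();
    });
    flusher.start();
    s.waitForCurrentFlushToComplete(); // returns once endFlush() fires
    flusher.join();
    System.out.println("flush completed");
  }
}
```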


Overall

The core design is solid: deferring batches during suspension, flushing synchronously in Phase 1, and re-enqueueing Phase 3 stragglers is the right approach. The ordering concern (Phase 3 batches appended behind newer commits) is well-documented and safe via WAL. The InterruptedException cascade in Phase 3 is the one real concern to address before merging.

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 55.81395% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.13%. Comparing base (bd55489) to head (7d881c8).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...va/com/arcadedb/engine/PageManagerFlushThread.java 54.76% 13 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3774      +/-   ##
==========================================
+ Coverage   65.03%   65.13%   +0.10%     
==========================================
  Files        1580     1580              
  Lines      116275   116313      +38     
  Branches    24659    24669      +10     
==========================================
+ Hits        75615    75758     +143     
+ Misses      30350    30241     -109     
- Partials    10310    10314       +4     

☔ View full report in Codecov by Sentry.

claude Bot commented Apr 3, 2026

Code Review

Overall this is a well-motivated fix addressing a real race condition on Linux where the background flush thread was writing pages concurrently with backup reads. The approach (deferred-flush queue per database) is sound. A few issues worth discussing:


Bug: nextPagesToFlush is set even when deferring (cosmetic race window)

In flushPagesFromQueueToDisk, when a batch is deferred the code still sets nextPagesToFlush before the deferred check:

    nextPagesToFlush.set(pagesToFlush);   // set...
    try {
        ...
        if (database == null && isSuspended(...)) {
            deferredByDatabase...offer(pagesToFlush);
            return;  // finally clears nextPagesToFlush — correct
        }

The finally block correctly clears it, but this means waitForCurrentFlushToComplete may spin for up to 1 ms waiting for a deferral that does no I/O. Not a correctness bug but worth noting. Moving the nextPagesToFlush.set(pagesToFlush) call to after the isSuspended check would avoid the spurious window.


Moderate: Phase 3 can enqueue stale batches after newer ones

The Phase 3 comment correctly acknowledges the ordering issue:

"Note: they are appended to the tail of the queue, so if any post-unsuspend commits have already been enqueued they will be flushed first."

This means pages committed earlier (during Phase 1) could be flushed to disk after pages committed later. The WAL guarantees correctness, but this is still unusual behaviour. An alternative would be to keep suspended=true until Phase 1 finishes, then re-add Phase 3 deferred batches to the front of the queue (or flush them synchronously too). If this is intentionally left as-is for simplicity, a comment explaining the WAL invariant that makes it safe is already there, which is good.
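The front-of-queue alternative mentioned above could look like the following sketch. This is purely illustrative: ArcadeDB's actual flush queue is a blocking queue rather than a deque, so this shows only the ordering idea, not a drop-in change.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical illustration of re-enqueueing deferred batches at the HEAD of
// the queue so pre-suspension commits are flushed before post-resume ones.
public class FrontEnqueueSketch {
  static <T> void reEnqueueAtFront(ConcurrentLinkedDeque<T> queue, List<T> deferred) {
    // Walk the deferred list backwards so deferred.get(0) ends up at the very
    // head, preserving its original commit order ahead of newer batches.
    for (int i = deferred.size() - 1; i >= 0; i--)
      queue.addFirst(deferred.get(i));
  }

  public static void main(String[] args) {
    // "new1"/"new2" arrived after the suspend flag was removed.
    final ConcurrentLinkedDeque<String> queue = new ConcurrentLinkedDeque<>(List.of("new1", "new2"));
    reEnqueueAtFront(queue, List.of("old1", "old2")); // committed before resume
    System.out.println(queue); // [old1, old2, new1, new2]
  }
}
```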


Moderate: Data loss risk when interrupted during Phase 3

    } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        break;   // batch is silently dropped from the flush queue
    }

If the thread is interrupted mid-loop, remaining deferred batches are neither flushed nor re-enqueued; they simply disappear from the flush pipeline. The pages remain in pageIndex forever until the database closes. WAL replay on next restart would recover the data, but it would be better to log a warning here similar to the queue-full path, or to fall back to a synchronous flush of remaining batches before breaking.


Minor: Unsafe cast (Database) pagesToFlush.database

PagesToFlush.database is typed as BasicDatabase. The isSuspended map uses Database as key. The cast is necessary but will throw ClassCastException if a BasicDatabase that is not a Database ends up in the queue. Worth either changing PagesToFlush.database to Database, or adding an instanceof guard:

    if (database == null && pagesToFlush.database instanceof Database db && isSuspended(db)) {

Minor: waitForCurrentFlushToComplete is a busy-wait

    while ((current = nextPagesToFlush.get()) != null && database.equals(current.database))
        Thread.sleep(1);

1-ms polling is acceptable for the infrequent backup case, but LockSupport.parkNanos or a CountDownLatch would be cleaner and more precise. Not a blocker.


Good: ORDER BY id in PostgresWJdbcIT

The SELECT * FROM article ORDER BY id change correctly fixes a non-deterministic test that depended on an unspecified scan order. Good catch.


Good: waitForCurrentFlushToComplete replaces broken flushPagesFromQueueToDisk(database, 0L)

The original flushPagesFromQueueToDisk(database, 0L) with a zero timeout would almost never actually wait for an in-progress flush; the new method correctly polls nextPagesToFlush until the in-flight write for the target database completes before handing over to the backup. This is the crux of the fix and the logic is correct.


Summary: The fix is correct and addresses the root cause. The moderate items (Phase 3 ordering, interrupted Phase 3 data loss) are worth addressing before merge; the minor items can be follow-up issues if preferred.

@robfrank robfrank merged commit 6964490 into main Apr 3, 2026
22 of 24 checks passed
tae898 pushed a commit to humemai/arcadedb-embedded-python that referenced this pull request Apr 7, 2026
…ded (ArcadeData#3774)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@robfrank robfrank added this to the 26.4.1 milestone Apr 22, 2026