fix: improve database restore process and enhance error logging by robfrank · Pull Request #4026 · ArcadeData/arcadedb

robfrank · 2026-04-28T20:59:53Z

What does this PR do?

A brief description of the change being made with this pull request.

Motivation

What inspired you to submit this pull request?

Related issues

A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.

Additional Notes

Anything else we should know when reviewing?

Checklist

I have run the build using mvn clean package command
My unit tests cover both failure and success scenarios

…istent user count before node restart

…re restart and waiting for Raft leader The convergence check timed out when tests ran together because cycleLeaderCount was sampled after the isolated node restarted using a blind 10-second sleep. Under load, the cluster may still be in Raft re-election at that point, producing a stale target count the convergence condition can never satisfy. Apply the same pattern used in NetworkPartitionRecoveryIT: wait for the two majority nodes to agree before sampling the count (while still in a stable, pre-restart state), then replace the blind sleep after restart with waitForRaftLeader so the convergence Awaitility block starts from a known-stable cluster. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- RestoreDatabaseScenarioIT: assert node 0 holds leadership after transferLeadershipAndWait before issuing the restore, since the transfer targets any Raft peer and not specifically node 0. Fix misleading log message to say "attempting" rather than guaranteeing node 0. - NetworkPartitionRecoveryIT + SplitBrainIT: assert waitForRaftLeader return value >= 0 before convergence Awaitility checks so tests fail fast with a clear message if no leader is elected within the timeout rather than silently running into the convergence deadline. - Add missing assertThat static import in NetworkPartitionRecoveryIT. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…integration tests

codacy-production · 2026-04-28T21:00:59Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 11 complexity

Metric Results

Complexity 11

View in Codacy

🟢 Coverage ∅ diff coverage · -8.38% coverage variation

Metric Results

Coverage variation ✅ -8.38% coverage variation

Diff coverage ✅ ∅ diff coverage

View coverage diff in Codacy

Coverage variation details

Coverable lines Covered lines Coverage

Common ancestor commit (9a4ef79) 120148 88498 73.66%

Head commit (2e0bffb) 151403 (+31255) 98833 (+10335) 65.28% (-8.38%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

Coverable lines Covered lines Diff coverage

Pull request (#4026) 0 0 ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

gemini-code-assist

Code Review

This pull request enhances the high-availability test suite by introducing a transferLeadershipToNode method for targeted Raft leadership transfers and improving character encoding consistency using StandardCharsets.UTF_8. It also adds documentation regarding variable aliasing in SplitBrainIT. A bug was identified in the getRaftPeerId method where JSONObject.getString is incorrectly used with a default value argument; a suggestion was provided to use optString instead.

gemini-code-assist · 2026-04-28T21:02:37Z

+      try {
+        if (conn.getResponseCode() == 200) {
+          final String body = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
+          return new JSONObject(body).getString("localPeerId", "");


The getString method in the JSONObject class (from com.arcadedb.serializer.json) does not support a default value as a second argument. This will likely result in a compilation error. Use optString instead to safely provide a default value if the key is missing.

Suggested change

return new JSONObject(body).getString("localPeerId", "");

return new JSONObject(body).optString("localPeerId", "");

optString does not exist in com.arcadedb.serializer.json.JSONObject. The two-argument overload getString(String name, String defaultValue) is defined at line 211 of that class and returns the default when the key is absent or null. The build already compiles and passes with getString("localPeerId", "") — introducing optString would be a compile error.

claude · 2026-04-28T21:03:05Z

Code Review

Overview

This PR improves the HA e2e test infrastructure in three ways:

Adds transferLeadershipToNode() - a targeted Raft leadership transfer to a specific node, with fallback to the existing untargeted transferLeadershipAndWait()
Adds getRaftPeerId() - a helper that queries GET /api/v1/cluster to resolve a node's Raft peer ID
Fixes implicit platform-default charset usages to explicit StandardCharsets.UTF_8

Positive Points

Charset fix is correct - new String(errStream.readAllBytes()) using the platform default charset is a latent bug on non-UTF-8 hosts. The fix in RestoreDatabaseScenarioIT is good. The same fix is worth applying to transferLeadershipAndWait() in the base class (lines 732/739), which still uses bare getBytes() without a charset.
Targeted transfer improves determinism - Tests that require a specific node to hold leadership (e.g., because a backup file lives on node 0) would be flaky with the untargeted form. The targeted approach is the right solution.
Graceful fallback - If the peer ID cannot be resolved, falling back to transferLeadershipAndWait is sensible.
Correct use of JSONObject.getString(key, default) - Consistent with the project convention of using the two-arg getter to avoid null checks.
SplitBrainIT comment - The explanation of why the majorityServers alias is needed (lambda capture, servers is reassigned later) is genuinely useful for future readers.

Issues and Suggestions

1. OutputStream not flushed/closed before reading the response

In transferLeadershipToNode and the existing transferLeadershipAndWait, the OutputStream is written to but never explicitly flushed or closed before getResponseCode() is called. For HttpURLConnection, calling getResponseCode() implicitly flushes, but relying on that is fragile. Prefer:

try (final OutputStream os = conn.getOutputStream()) {
  os.write(...);
}
final int status = conn.getResponseCode();

2. getRaftPeerId: connection may not be disconnected on early failure

conn.disconnect() is only called inside the inner try-finally, but conn is created in the outer try. If an exception occurs between openConnection() and the start of the inner try, the connection leaks. Consider moving finally { conn.disconnect(); } to wrap the entire conn lifetime, consistent with the structure in postServerCommand.

3. transferLeadershipToNode silently succeeds even if the wrong node became leader

The method calls waitForRaftLeader at the end but discards the return value. The caller in RestoreDatabaseScenarioIT asserts the result externally, which is fine, but the method is easy to misuse. Consider at least logging a warning when the resulting leader is not targetNode:

final int newLeaderIdx = waitForRaftLeader(servers, timeoutSeconds);
if (newLeaderIdx != servers.indexOf(targetNode))
  logger.warn("Leadership transfer: expected {} but new leader is node {}", targetNode.host(), newLeaderIdx);

4. Remaining bare getBytes() calls in transferLeadershipAndWait (pre-existing, minor)

Lines 732 and 739 in the existing method still use bare getBytes() without a charset. Since this PR already touches charset hygiene throughout, fixing those two lines would be a natural cleanup.

Test Coverage

The new helpers are exercised by RestoreDatabaseScenarioIT.restoreDatabaseReplicatedAcrossCluster, which provides reasonable functional coverage. No standalone unit tests are expected for test-infrastructure helpers.

Summary

The overall direction is correct and the changes improve test reliability and correctness. The main actionable items are: (1) close/flush the OutputStream before reading the response code, and (2) fix the remaining bare getBytes() calls in transferLeadershipAndWait. Everything else is minor or optional.

robfrank · 2026-04-28T21:16:03Z

Item 3 (warning log): Implemented. transferLeadershipToNode now captures the return value of waitForRaftLeader and logs a warning when the resulting leader is not the expected target peer.

Item 4 (bare getBytes() in transferLeadershipAndWait): Fixed. Both calls in the existing method now use StandardCharsets.UTF_8.

Item 1 (OutputStream flush): Not changed. Every HTTP method in this file - postServerCommand, databaseExistsOnServer, countUsersViaHttp, and the pre-existing transferLeadershipAndWait - uses the same bare write + getResponseCode() pattern. HttpURLConnection.getResponseCode() is documented to trigger the request; changing only the new methods would be inconsistent without a broader cleanup.

Item 2 (connection leak in getRaftPeerId): Not changed. Between openConnection() and the inner try, only three setters run (setRequestProperty, setConnectTimeout, setReadTimeout) - none can throw. There is no realistic exception gap. The structure is identical to postServerCommand.

…erLeadershipAndWait Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-04-28T21:33:24Z

Code Review

Overview

This PR improves the HA e2e test infrastructure in three areas:

Adds targeted Raft leadership transfer (transferLeadershipToNode) so tests can ensure a specific node holds leadership before acting on it
Fixes platform-default charset bugs across several call sites (consistently uses StandardCharsets.UTF_8)
Adds an explanatory comment in SplitBrainIT clarifying why a local alias is needed for lambda capture

The changes are test-only (no production code is touched).

Positive Observations

Charset fixes are correct and important. new String(bytes) and .getBytes() without an explicit charset are latent encoding bugs - fixing them with StandardCharsets.UTF_8 is the right call.
Targeted transfer improves test reliability. The old transferLeadershipAndWait transferred to any peer; the restore test requires node 0 specifically. The new method handles that case properly with a graceful fallback.
JSONObject usage is consistent with project conventions: getString("localPeerId", "") uses the two-arg getter that avoids null checks, as recommended in CLAUDE.md.
Resource cleanup via finally { conn.disconnect(); } is correct in both new helper methods.

Issues and Suggestions

1. transferLeadershipToNode - OutputStream not explicitly closed before reading response (minor)

conn.getOutputStream().write(...) is called but the stream is never explicitly closed before getResponseCode(). HttpURLConnection does implicitly flush on getResponseCode(), so this works in practice, but being explicit is safer. The existing transferLeadershipAndWait has the same pattern and would benefit from the same fix.

2. transferLeadershipToNode - error body not logged on failure (minor)

When status != 200, the warning logs the status code but not the response body. Adding error-stream logging would make test failure diagnostics much easier - the same pattern is already used in postServerCommand.

3. transferLeadershipToNode - final check is warn-only (design note)

After issuing the transfer the code only logs a warning if the wrong node becomes leader. The caller in RestoreDatabaseScenarioIT adds an explicit assertThat().isEqualTo(0) immediately after, so the test will fail correctly - but the warn-only behavior inside the helper is slightly misleading to readers. A short comment noting that callers are expected to assert the outcome would clarify intent.

4. getRaftPeerId - InputStream not explicitly closed (minor)

conn.getInputStream().readAllBytes() reads without closing the stream; conn.disconnect() in finally covers it, but wrapping in a try-with-resources is more idiomatic and avoids relying on that implicit behavior.

5. Javadoc verbosity (style note)

Project style (CLAUDE.md) recommends avoiding multi-line comment blocks unless the WHY is non-obvious. The Javadoc on getRaftPeerId and transferLeadershipToNode is on the verbose side relative to the rest of the codebase. For protected base-class methods used by subclasses this level of documentation is defensible, but trimming the @param/@return tags to a single-line summary would be more consistent with project conventions.

Summary

The PR is solid - the targeted leadership transfer addresses a real reliability gap and the charset fixes prevent subtle encoding bugs. Most suggestions above are minor polish items. The most actionable one is logging the error response body on non-200 responses from the transfer endpoint, which would significantly help diagnose future flaky HA tests.

codecov · 2026-04-28T22:10:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.24%. Comparing base (9a4ef79) to head (2e0bffb).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4026      +/-   ##
==========================================
- Coverage   64.65%   64.24%   -0.41%     
==========================================
  Files        1597     1597              
  Lines      120148   120148              
  Branches    25580    25580              
==========================================
- Hits        77685    77193     -492     
- Misses      31681    32282     +601     
+ Partials    10782    10673     -109

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

robfrank and others added 5 commits April 28, 2026 22:57

fix: improve database restore process and enhance error logging

98f90a0

fix: enhance stability in NetworkPartitionRecoveryIT by ensuring cons…

9d40234

…istent user count before node restart

fix: enhance leadership transfer logic and improve error handling in …

7798706

…integration tests

gemini-code-assist Bot reviewed Apr 28, 2026

View reviewed changes

fix: add leader-mismatch warning and StandardCharsets.UTF_8 to transf…

2e0bffb

…erLeadershipAndWait Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

robfrank merged commit 9707afd into main Apr 28, 2026
11 of 12 checks passed

robfrank deleted the fix/e2e-ha-addedum branch April 28, 2026 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: improve database restore process and enhance error logging#4026

fix: improve database restore process and enhance error logging#4026
robfrank merged 6 commits into
mainfrom
fix/e2e-ha-addedum

robfrank commented Apr 28, 2026

Uh oh!

codacy-production Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 28, 2026

Uh oh!

robfrank Apr 28, 2026

Uh oh!

claude Bot commented Apr 28, 2026

Uh oh!

robfrank commented Apr 28, 2026

Uh oh!

Uh oh!

claude Bot commented Apr 28, 2026

Uh oh!

codecov Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	return new JSONObject(body).getString("localPeerId", "");
	return new JSONObject(body).optString("localPeerId", "");

Uh oh!

Conversation

robfrank commented Apr 28, 2026

What does this PR do?

Motivation

Related issues

Additional Notes

Checklist

Uh oh!

codacy-production Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

robfrank Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Apr 28, 2026

Code Review

Overview

Positive Points

Issues and Suggestions

Test Coverage

Summary

Uh oh!

robfrank commented Apr 28, 2026

Uh oh!

Uh oh!

claude Bot commented Apr 28, 2026

Code Review

Overview

Positive Observations

Issues and Suggestions

Summary

Uh oh!

codecov Bot commented Apr 28, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codacy-production Bot commented Apr 28, 2026 •

edited

Loading