Skip to content

fix: improve database restore process and enhance error logging#4026

Merged
robfrank merged 6 commits into
mainfrom
fix/e2e-ha-addedum
Apr 28, 2026
Merged

fix: improve database restore process and enhance error logging#4026
robfrank merged 6 commits into
mainfrom
fix/e2e-ha-addedum

Conversation

@robfrank
Copy link
Copy Markdown
Collaborator

What does this PR do?

A brief description of the change being made with this pull request.

Motivation

What inspired you to submit this pull request?

Related issues

A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.

Additional Notes

Anything else we should know when reviewing?

Checklist

  • I have run the build using mvn clean package command
  • My unit tests cover both failure and success scenarios

robfrank and others added 5 commits April 28, 2026 22:57
…re restart and waiting for Raft leader

The convergence check timed out when tests ran together because cycleLeaderCount
was sampled after the isolated node restarted using a blind 10-second sleep. Under
load, the cluster may still be in Raft re-election at that point, producing a stale
target count the convergence condition can never satisfy.

Apply the same pattern used in NetworkPartitionRecoveryIT: wait for the two majority
nodes to agree before sampling the count (while still in a stable, pre-restart state),
then replace the blind sleep after restart with waitForRaftLeader so the convergence
Awaitility block starts from a known-stable cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- RestoreDatabaseScenarioIT: assert node 0 holds leadership after
  transferLeadershipAndWait before issuing the restore, since the transfer
  targets any Raft peer and not specifically node 0. Fix misleading log
  message to say "attempting" rather than guaranteeing node 0.
- NetworkPartitionRecoveryIT + SplitBrainIT: assert waitForRaftLeader
  return value >= 0 before convergence Awaitility checks so tests fail
  fast with a clear message if no leader is elected within the timeout
  rather than silently running into the convergence deadline.
- Add missing assertThat static import in NetworkPartitionRecoveryIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 28, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 11 complexity

Metric Results
Complexity 11

View in Codacy

🟢 Coverage ∅ diff coverage · -8.38% coverage variation

Metric Results
Coverage variation -8.38% coverage variation
Diff coverage diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (9a4ef79) 120148 88498 73.66%
Head commit (2e0bffb) 151403 (+31255) 98833 (+10335) 65.28% (-8.38%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4026) 0 0 ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the high-availability test suite by introducing a transferLeadershipToNode method for targeted Raft leadership transfers and improving character encoding consistency using StandardCharsets.UTF_8. It also adds documentation regarding variable aliasing in SplitBrainIT. A bug was identified in the getRaftPeerId method where JSONObject.getString is incorrectly used with a default value argument; a suggestion was provided to use optString instead.

try {
if (conn.getResponseCode() == 200) {
final String body = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
return new JSONObject(body).getString("localPeerId", "");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The getString method in the JSONObject class (from com.arcadedb.serializer.json) does not support a default value as a second argument. This will likely result in a compilation error. Use optString instead to safely provide a default value if the key is missing.

Suggested change
return new JSONObject(body).getString("localPeerId", "");
return new JSONObject(body).optString("localPeerId", "");

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optString does not exist in com.arcadedb.serializer.json.JSONObject. The two-argument overload getString(String name, String defaultValue) is defined at line 211 of that class and returns the default when the key is absent or null. The build already compiles and passes with getString("localPeerId", "") — introducing optString would be a compile error.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review

Overview

This PR improves the HA e2e test infrastructure in three ways:

  1. Adds transferLeadershipToNode() - a targeted Raft leadership transfer to a specific node, with fallback to the existing untargeted transferLeadershipAndWait()
  2. Adds getRaftPeerId() - a helper that queries GET /api/v1/cluster to resolve a node's Raft peer ID
  3. Fixes implicit platform-default charset usages to explicit StandardCharsets.UTF_8

Positive Points

  • Charset fix is correct - new String(errStream.readAllBytes()) using the platform default charset is a latent bug on non-UTF-8 hosts. The fix in RestoreDatabaseScenarioIT is good. The same fix is worth applying to transferLeadershipAndWait() in the base class (lines 732/739), which still uses bare getBytes() without a charset.
  • Targeted transfer improves determinism - Tests that require a specific node to hold leadership (e.g., because a backup file lives on node 0) would be flaky with the untargeted form. The targeted approach is the right solution.
  • Graceful fallback - If the peer ID cannot be resolved, falling back to transferLeadershipAndWait is sensible.
  • Correct use of JSONObject.getString(key, default) - Consistent with the project convention of using the two-arg getter to avoid null checks.
  • SplitBrainIT comment - The explanation of why the majorityServers alias is needed (lambda capture, servers is reassigned later) is genuinely useful for future readers.

Issues and Suggestions

1. OutputStream not flushed/closed before reading the response

In transferLeadershipToNode and the existing transferLeadershipAndWait, the OutputStream is written to but never explicitly flushed or closed before getResponseCode() is called. For HttpURLConnection, calling getResponseCode() implicitly flushes, but relying on that is fragile. Prefer:

try (final OutputStream os = conn.getOutputStream()) {
  os.write(...);
}
final int status = conn.getResponseCode();

2. getRaftPeerId: connection may not be disconnected on early failure

conn.disconnect() is only called inside the inner try-finally, but conn is created in the outer try. If an exception occurs between openConnection() and the start of the inner try, the connection leaks. Consider moving finally { conn.disconnect(); } to wrap the entire conn lifetime, consistent with the structure in postServerCommand.

3. transferLeadershipToNode silently succeeds even if the wrong node became leader

The method calls waitForRaftLeader at the end but discards the return value. The caller in RestoreDatabaseScenarioIT asserts the result externally, which is fine, but the method is easy to misuse. Consider at least logging a warning when the resulting leader is not targetNode:

final int newLeaderIdx = waitForRaftLeader(servers, timeoutSeconds);
if (newLeaderIdx != servers.indexOf(targetNode))
  logger.warn("Leadership transfer: expected {} but new leader is node {}", targetNode.host(), newLeaderIdx);

4. Remaining bare getBytes() calls in transferLeadershipAndWait (pre-existing, minor)

Lines 732 and 739 in the existing method still use bare getBytes() without a charset. Since this PR already touches charset hygiene throughout, fixing those two lines would be a natural cleanup.


Test Coverage

The new helpers are exercised by RestoreDatabaseScenarioIT.restoreDatabaseReplicatedAcrossCluster, which provides reasonable functional coverage. No standalone unit tests are expected for test-infrastructure helpers.


Summary

The overall direction is correct and the changes improve test reliability and correctness. The main actionable items are: (1) close/flush the OutputStream before reading the response code, and (2) fix the remaining bare getBytes() calls in transferLeadershipAndWait. Everything else is minor or optional.

@robfrank
Copy link
Copy Markdown
Collaborator Author

Item 3 (warning log): Implemented. transferLeadershipToNode now captures the return value of waitForRaftLeader and logs a warning when the resulting leader is not the expected target peer.

Item 4 (bare getBytes() in transferLeadershipAndWait): Fixed. Both calls in the existing method now use StandardCharsets.UTF_8.

Item 1 (OutputStream flush): Not changed. Every HTTP method in this file - postServerCommand, databaseExistsOnServer, countUsersViaHttp, and the pre-existing transferLeadershipAndWait - uses the same bare write + getResponseCode() pattern. HttpURLConnection.getResponseCode() is documented to trigger the request; changing only the new methods would be inconsistent without a broader cleanup.

Item 2 (connection leak in getRaftPeerId): Not changed. Between openConnection() and the inner try, only three setters run (setRequestProperty, setConnectTimeout, setReadTimeout) - none can throw. There is no realistic exception gap. The structure is identical to postServerCommand.

…erLeadershipAndWait

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@robfrank robfrank merged commit 9707afd into main Apr 28, 2026
11 of 12 checks passed
@robfrank robfrank deleted the fix/e2e-ha-addedum branch April 28, 2026 21:24
@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Code Review

Overview

This PR improves the HA e2e test infrastructure in three areas:

  • Adds targeted Raft leadership transfer (transferLeadershipToNode) so tests can ensure a specific node holds leadership before acting on it
  • Fixes platform-default charset bugs across several call sites (consistently uses StandardCharsets.UTF_8)
  • Adds an explanatory comment in SplitBrainIT clarifying why a local alias is needed for lambda capture

The changes are test-only (no production code is touched).


Positive Observations

  • Charset fixes are correct and important. new String(bytes) and .getBytes() without an explicit charset are latent encoding bugs - fixing them with StandardCharsets.UTF_8 is the right call.
  • Targeted transfer improves test reliability. The old transferLeadershipAndWait transferred to any peer; the restore test requires node 0 specifically. The new method handles that case properly with a graceful fallback.
  • JSONObject usage is consistent with project conventions: getString("localPeerId", "") uses the two-arg getter that avoids null checks, as recommended in CLAUDE.md.
  • Resource cleanup via finally { conn.disconnect(); } is correct in both new helper methods.

Issues and Suggestions

1. transferLeadershipToNode - OutputStream not explicitly closed before reading response (minor)

conn.getOutputStream().write(...) is called but the stream is never explicitly closed before getResponseCode(). HttpURLConnection does implicitly flush on getResponseCode(), so this works in practice, but being explicit is safer. The existing transferLeadershipAndWait has the same pattern and would benefit from the same fix.

2. transferLeadershipToNode - error body not logged on failure (minor)

When status != 200, the warning logs the status code but not the response body. Adding error-stream logging would make test failure diagnostics much easier - the same pattern is already used in postServerCommand.

3. transferLeadershipToNode - final check is warn-only (design note)

After issuing the transfer the code only logs a warning if the wrong node becomes leader. The caller in RestoreDatabaseScenarioIT adds an explicit assertThat().isEqualTo(0) immediately after, so the test will fail correctly - but the warn-only behavior inside the helper is slightly misleading to readers. A short comment noting that callers are expected to assert the outcome would clarify intent.

4. getRaftPeerId - InputStream not explicitly closed (minor)

conn.getInputStream().readAllBytes() reads without closing the stream; conn.disconnect() in finally covers it, but wrapping in a try-with-resources is more idiomatic and avoids relying on that implicit behavior.

5. Javadoc verbosity (style note)

Project style (CLAUDE.md) recommends avoiding multi-line comment blocks unless the WHY is non-obvious. The Javadoc on getRaftPeerId and transferLeadershipToNode is on the verbose side relative to the rest of the codebase. For protected base-class methods used by subclasses this level of documentation is defensible, but trimming the @param/@return tags to a single-line summary would be more consistent with project conventions.


Summary

The PR is solid - the targeted leadership transfer addresses a real reliability gap and the charset fixes prevent subtle encoding bugs. Most suggestions above are minor polish items. The most actionable one is logging the error response body on non-200 responses from the transfer endpoint, which would significantly help diagnose future flaky HA tests.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.24%. Comparing base (9a4ef79) to head (2e0bffb).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4026      +/-   ##
==========================================
- Coverage   64.65%   64.24%   -0.41%     
==========================================
  Files        1597     1597              
  Lines      120148   120148              
  Branches    25580    25580              
==========================================
- Hits        77685    77193     -492     
- Misses      31681    32282     +601     
+ Partials    10782    10673     -109     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant