fix: improve database restore process and enhance error logging #4026
…istent user count before node restart
…re restart and waiting for Raft leader The convergence check timed out when tests ran together because cycleLeaderCount was sampled after the isolated node restarted using a blind 10-second sleep. Under load, the cluster may still be in Raft re-election at that point, producing a stale target count the convergence condition can never satisfy. Apply the same pattern used in NetworkPartitionRecoveryIT: wait for the two majority nodes to agree before sampling the count (while still in a stable, pre-restart state), then replace the blind sleep after restart with waitForRaftLeader so the convergence Awaitility block starts from a known-stable cluster. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
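The "wait for the majority nodes to agree before sampling" step in the commit message above can be sketched as a polling loop with a deadline. This is a minimal, self-contained illustration and not the code from the PR: the class `StableSample`, the method `awaitAgreement`, and the two count suppliers are hypothetical names, and the real tests use Awaitility rather than a hand-written loop.

```java
import java.time.Duration;
import java.util.OptionalLong;
import java.util.function.LongSupplier;

// Hypothetical helper: sample a target count only once two nodes report the same value.
public class StableSample {

  /**
   * Polls both suppliers until they return the same value or the timeout elapses.
   * Returns the agreed value, or an empty OptionalLong on timeout or interruption.
   */
  public static OptionalLong awaitAgreement(final LongSupplier nodeA, final LongSupplier nodeB,
      final Duration timeout, final Duration pollInterval) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      final long a = nodeA.getAsLong();
      final long b = nodeB.getAsLong();
      if (a == b)
        return OptionalLong.of(a); // both nodes agree: safe to use as the convergence target
      if (System.nanoTime() - deadline >= 0)
        return OptionalLong.empty(); // no agreement within the timeout
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
        return OptionalLong.empty();
      }
    }
  }
}
```

Unlike a fixed sleep, the loop returns as soon as the condition holds and reports a timeout explicitly, so the caller never continues with a value sampled from an unsettled cluster.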
- RestoreDatabaseScenarioIT: assert node 0 holds leadership after transferLeadershipAndWait before issuing the restore, since the transfer targets any Raft peer and not specifically node 0. Fix misleading log message to say "attempting" rather than guaranteeing node 0.
- NetworkPartitionRecoveryIT + SplitBrainIT: assert waitForRaftLeader return value >= 0 before convergence Awaitility checks so tests fail fast with a clear message if no leader is elected within the timeout rather than silently running into the convergence deadline.
- Add missing assertThat static import in NetworkPartitionRecoveryIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
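The fail-fast check on the leader index described in the second bullet might look roughly like the sketch below. It assumes, based only on the ">= 0" wording above, that a negative return value means "no leader within the timeout". The class `LeaderCheck`, both method signatures, and the probe supplier are hypothetical, and a plain exception stands in for the AssertJ `assertThat` used in the tests.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: wait for a leader index, then fail fast if none was found.
public class LeaderCheck {

  /** Polls the probe until it returns a non-negative index; returns -1 on timeout. */
  public static int waitForLeader(final IntSupplier leaderProbe, final long timeoutMillis, final long pollMillis) {
    final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      final int idx = leaderProbe.getAsInt();
      if (idx >= 0)
        return idx;
      if (System.nanoTime() - deadline >= 0)
        return -1;
      try {
        Thread.sleep(pollMillis);
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt();
        return -1;
      }
    }
  }

  /** Throws with a descriptive message instead of letting a later check time out. */
  public static int requireLeader(final int leaderIdx, final long timeoutMillis) {
    if (leaderIdx < 0)
      throw new IllegalStateException("No Raft leader elected within " + timeoutMillis + " ms");
    return leaderIdx;
  }
}
```

Checking the return value right after the wait puts the failure at the real cause (no leader) rather than at a later convergence deadline.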
…integration tests
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 11 |
🟢 Coverage: ∅ diff coverage · -8.38% coverage variation

| Metric | Results |
|---|---|
| Coverage variation | ✅ -8.38% coverage variation |
| Diff coverage | ✅ ∅ diff coverage |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (9a4ef79) | 120148 | 88498 | 73.66% |
| Head commit (2e0bffb) | 151403 (+31255) | 98833 (+10335) | 65.28% (-8.38%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#4026) | 0 | 0 | ∅ (not applicable) |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
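As a worked check of the two formulas, the figures in the report can be recomputed from the line counts. The class and method names below are illustrative only; returning NaN for zero coverable lines is an assumption made here to mirror the "∅ (not applicable)" entry, not something the report documents.

```java
// Recomputes the coverage figures quoted in the report from its line counts.
public class CoverageMath {

  /** Coverage as a percentage of coverable lines; NaN when there is nothing to cover. */
  public static double coveragePercent(final long coveredLines, final long coverableLines) {
    if (coverableLines == 0)
      return Double.NaN; // assumption: stands in for "∅ (not applicable)"
    return 100.0 * coveredLines / coverableLines;
  }

  /** Rounds to two decimal places, as the report displays its percentages. */
  public static double round2(final double value) {
    return Math.round(value * 100.0) / 100.0;
  }

  public static void main(final String[] args) {
    final double ancestor = coveragePercent(88498, 120148);  // common ancestor commit
    final double head = coveragePercent(98833, 151403);      // head commit
    System.out.println("ancestor  = " + round2(ancestor));
    System.out.println("head      = " + round2(head));
    System.out.println("variation = " + round2(head - ancestor));
    System.out.println("diff      = " + coveragePercent(0, 0)); // pull request added 0 coverable lines
  }
}
```

The rounded values match the 73.66%, 65.28% and -8.38% shown in the table. The drop comes largely from the 31255 additional coverable lines counted at the head commit, not from lines changed in the pull request itself, whose diff coverage is not applicable.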
Code Review
This pull request enhances the high-availability test suite by introducing a transferLeadershipToNode method for targeted Raft leadership transfers and improving character encoding consistency using StandardCharsets.UTF_8. It also adds documentation regarding variable aliasing in SplitBrainIT. A bug was identified in the getRaftPeerId method where JSONObject.getString is incorrectly used with a default value argument; a suggestion was provided to use optString instead.
    try {
      if (conn.getResponseCode() == 200) {
        final String body = new String(conn.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        return new JSONObject(body).getString("localPeerId", "");
The getString method in the JSONObject class (from com.arcadedb.serializer.json) does not support a default value as a second argument. This will likely result in a compilation error. Use optString instead to safely provide a default value if the key is missing.
Suggested change:
-        return new JSONObject(body).getString("localPeerId", "");
+        return new JSONObject(body).optString("localPeerId", "");
optString does not exist in com.arcadedb.serializer.json.JSONObject. The two-argument overload getString(String name, String defaultValue) is defined at line 211 of that class and returns the default when the key is absent or null. The build already compiles and passes with getString("localPeerId", "") — introducing optString would be a compile error.
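The behaviour described in the reply above (return the default when the key is absent or null) can be illustrated with a small stand-in. This sketch deliberately does not use com.arcadedb.serializer.json.JSONObject: `DefaultingLookup` and its Map-backed `getString` are hypothetical and only mimic the semantics stated in the comment.

```java
import java.util.Map;

// Hypothetical Map-backed stand-in for a getter with a default value.
public class DefaultingLookup {

  /** Returns the value for key as a String, or defaultValue when the key is absent or mapped to null. */
  public static String getString(final Map<String, Object> fields, final String key, final String defaultValue) {
    final Object value = fields.get(key);
    return value == null ? defaultValue : value.toString();
  }
}
```

Note that `Map.getOrDefault` would not be equivalent here: it falls back to the default only when the key has no mapping, so a key explicitly mapped to null would still yield null.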
Code Review
Overview
This PR improves the HA e2e test infrastructure in three ways:
Positive Points
Issues and Suggestions
1. OutputStream not flushed/closed before reading the response
In …
    try (final OutputStream os = conn.getOutputStream()) {
      os.write(...);
    }
    final int status = conn.getResponseCode();
2. …
3. The method calls …
    final int newLeaderIdx = waitForRaftLeader(servers, timeoutSeconds);
    if (newLeaderIdx != servers.indexOf(targetNode))
      logger.warn("Leadership transfer: expected {} but new leader is node {}", targetNode.host(), newLeaderIdx);
4. Remaining bare …
Lines 732 and 739 in the existing method still use bare …
Test Coverage
The new helpers are exercised by …
Summary
The overall direction is correct and the changes improve test reliability and correctness. The main actionable items are: (1) close/flush the …
Item 3 (warning log): Implemented.
Item 4 (bare …
Item 1 (OutputStream flush): Not changed. Every HTTP method in this file - …
Item 2 (connection leak in …
…erLeadershipAndWait Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review
Overview
This PR improves the HA e2e test infrastructure in three areas:
The changes are test-only (no production code is touched).
Positive Observations
Issues and Suggestions
1. transferLeadershipToNode - OutputStream not explicitly closed before reading response (minor)
2. transferLeadershipToNode - error body not logged on failure (minor)
When …
3. transferLeadershipToNode - final check is warn-only (design note)
After issuing the transfer the code only logs a warning if the wrong node becomes leader. The caller in …
4. getRaftPeerId - InputStream not explicitly closed (minor)
5. Javadoc verbosity (style note)
Project style (CLAUDE.md) recommends avoiding multi-line comment blocks unless the WHY is non-obvious. The Javadoc on …
Summary
The PR is solid - the targeted leadership transfer addresses a real reliability gap and the charset fixes prevent subtle encoding bugs. Most suggestions above are minor polish items. The most actionable one is logging the error response body on non-200 responses from the transfer endpoint, which would significantly help diagnose future flaky HA tests.
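Two of the suggestions above, closing the response stream explicitly and logging the error body on a non-200 status, can be sketched together. This is an illustration under assumptions, not the PR's code: `ResponseBodies`, `readUtf8`, and `describeFailure` are hypothetical names, and the message format is invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for reading HTTP response bodies in tests.
public class ResponseBodies {

  /** Reads the whole stream as UTF-8 and closes it; returns "" for a null stream. */
  public static String readUtf8(final InputStream in) {
    if (in == null)
      return "";
    try (InputStream stream = in) { // try-with-resources closes the stream on every path
      return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a log-friendly message from a non-200 status and its (possibly absent) error body. */
  public static String describeFailure(final int status, final InputStream errorStream) {
    final String body = readUtf8(errorStream);
    return "HTTP " + status + ": " + (body.isEmpty() ? "<no body>" : body);
  }
}
```

With `HttpURLConnection`, a caller could pass `conn.getErrorStream()` as the second argument. That method returns null when the server sent no error body, which is why the helper tolerates a null stream.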
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4026 +/- ##
==========================================
- Coverage 64.65% 64.24% -0.41%
==========================================
Files 1597 1597
Lines 120148 120148
Branches 25580 25580
==========================================
- Hits 77685 77193 -492
- Misses 31681 32282 +601
+ Partials 10782 10673 -109
What does this PR do?
A brief description of the change being made with this pull request.
Motivation
What inspired you to submit this pull request?
Related issues
A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.
Additional Notes
Anything else we should know when reviewing?
Checklist
mvn clean package command