fix: e2e-ha stability #4020
Conversation
Code Review
This pull request improves the stability of the high-availability database restoration tests by ensuring node 0 is the leader before performing a restore and enhancing error logging for failed server commands. Additionally, it ensures proper cleanup of Docker networks and Toxiproxy instances during test teardown. Feedback suggests adding an assertion to verify node 0's leadership after a transfer and specifying UTF-8 encoding when reading error streams for better cross-platform consistency.
```java
final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
}
```
The transferLeadershipAndWait method (as implemented in ContainersTestTemplate) initiates a leadership transfer to an arbitrary node because it sends an empty peerId. This does not guarantee that node 0 will become the leader, which is required for the subsequent restore operation. If leadership is transferred to another node (e.g., node 2), the restore command on node 0 will fail. It is recommended to verify that node 0 actually became the leader before proceeding.
Suggested change:

```diff
-final int leaderIdx = waitForRaftLeader(servers, 30);
-if (leaderIdx != 0) {
-    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
-    transferLeadershipAndWait(servers, 30);
-}
+if (waitForRaftLeader(servers, 30) != 0) {
+    logger.info("Node 0 is not the leader; transferring leadership back to node 0");
+    transferLeadershipAndWait(servers, 30);
+    assertThat(waitForRaftLeader(servers, 30)).as("Node 0 must be the leader for restore").isEqualTo(0);
+}
```
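If asserting after a single untargeted transfer proves flaky (the transfer may land on another follower), a rough alternative is to retry until node 0 actually reports leadership. This is only a sketch: it reuses the `waitForRaftLeader` and `transferLeadershipAndWait` helpers quoted above, and the retry count of 5 is an arbitrary assumption.

```java
// Sketch: retry the untargeted transfer a bounded number of times until node 0 wins,
// then fail with a clear message if it never does. Assumes this runs inside the test
// method where servers, waitForRaftLeader and transferLeadershipAndWait are in scope.
int leader = waitForRaftLeader(servers, 30);
for (int attempt = 0; attempt < 5 && leader != 0; attempt++) {
    transferLeadershipAndWait(servers, 30);
    leader = waitForRaftLeader(servers, 30);
}
assertThat(leader).as("Node 0 must be the leader for restore").isEqualTo(0);
```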
```java
logger.error("Server command '{}' returned HTTP {}: {}", command, status,
        new String(errStream.readAllBytes()));
```
Specify a character set (e.g., UTF-8) when converting the error stream bytes to a string to ensure consistent behavior across different platforms and environments.
Suggested change:

```diff
-logger.error("Server command '{}' returned HTTP {}: {}", command, status,
-        new String(errStream.readAllBytes()));
+logger.error("Server command '{}' returned HTTP {}: {}", command, status,
+        new String(errStream.readAllBytes(), java.nio.charset.StandardCharsets.UTF_8));
```
Up to standards ✅

🟢 Issues

| Metric | Results |
|---|---|
| Complexity | 0 |

🟢 Coverage · ∅ diff coverage · -8.52% coverage variation

| Metric | Results |
|---|---|
| Coverage variation | ✅ -8.52% coverage variation |
| Diff coverage | ✅ ∅ diff coverage |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (aa526c0) | 120069 | 88606 | 73.80% |
| Head commit (5679275) | 151403 (+31334) | 98834 (+10228) | 65.28% (-8.52%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: `<coverage of head commit> - <coverage of common ancestor commit>`
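Using the numbers above as a worked example: 65.28% (head) - 73.80% (common ancestor) = -8.52%.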
Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#4020) | 0 | 0 | ∅ (not applicable) |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `<covered lines added or modified> / <coverable lines added or modified> * 100%`
TIP: This summary will be updated as you push new changes.
Code review posted below - see full analysis in thread.
Code Review

Overview

This PR makes two focused improvements to the HA e2e test infrastructure:
Potential Bug - Leadership Transfer Does Not Guarantee Node 0 Becomes Leader

The main concern is in RestoreDatabaseScenarioIT. If the current leader is node 2 and the transfer goes to node 1, the restore is still issued to node 0 (a follower) and the restore will fail. The log message says "transferring leadership back to node 0" but the implementation does not guarantee this. Options to fix:
Minor Issues

Missing charset in error stream reading.

Positive Changes
Summary

The resource-leak fix and error logging improvements are solid. The leadership transfer logic has a correctness gap: it transfers to any Raft peer rather than specifically to node 0, so the assumption that node 0 will be the leader after the call may not hold. Please either target node 0 explicitly in the transfer or verify that node 0 actually holds leadership before issuing the restore.
…istent user count before node restart
Code Review

PR: fix: improve database restore process and enhance error logging

Overview

This PR fixes flaky HA e2e tests across three related areas:
Positive Changes
Issues and Suggestions

Critical
Leadership transfer does not guarantee node 0 becomes the leader. In RestoreDatabaseScenarioIT, the restore path first checks the current leader; the code then unconditionally calls transferLeadershipAndWait(servers, 30), which transfers leadership to an arbitrary peer rather than specifically to node 0.

Suggestion: Re-check leadership after the transfer and fail fast if node 0 still does not hold it (see the sketch after the minor items below). Alternatively, update transferLeadershipAndWait to accept an explicit target peer.

Minor

Silent exception swallowing in new Awaitility block (…).

Misleading log message: the message says "transferring leadership back to node 0" but the transfer does not guarantee node 0 wins. The message should say "attempting to transfer leadership" to reflect the uncertainty.
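A minimal sketch of the re-check suggested above, reusing only the helpers already quoted in this thread (the assertion message is illustrative):

```java
// Attempt the transfer, then fail fast if node 0 still is not the leader.
final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    transferLeadershipAndWait(servers, 30);
    final int newLeader = waitForRaftLeader(servers, 30);
    assertThat(newLeader).as("Node 0 should be the leader before restore").isEqualTo(0);
}
```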
Summary

The core ideas are sound and address real flakiness sources. The main concern is that transferLeadershipAndWait does not guarantee that node 0 ends up as the leader before the restore is issued.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #4020      +/-   ##
==========================================
- Coverage   64.79%   64.25%   -0.55%
==========================================
  Files        1597     1597
  Lines      120069   120148     +79
  Branches    25557    25580     +23
==========================================
- Hits        77804    77199    -605
- Misses      31493    32281    +788
+ Partials    10772    10668    -104
```

☔ View full report in Codecov by Sentry.
how the hell did I auto subscribe on this PR
…re restart and waiting for Raft leader

The convergence check timed out when tests ran together because cycleLeaderCount was sampled after the isolated node restarted using a blind 10-second sleep. Under load, the cluster may still be in Raft re-election at that point, producing a stale target count the convergence condition can never satisfy.

Apply the same pattern used in NetworkPartitionRecoveryIT: wait for the two majority nodes to agree before sampling the count (while still in a stable, pre-restart state), then replace the blind sleep after restart with waitForRaftLeader so the convergence Awaitility block starts from a known-stable cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
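A rough sketch of that pattern: countUsers(...), restartIsolatedNode(), and node0/node1/node2 are hypothetical placeholders for whatever the test actually uses; only waitForRaftLeader and the Awaitility/sleep ordering come from the commit message itself.

```java
// Assumed imports: static org.awaitility.Awaitility.await; java.time.Duration.

// 1. While the cluster is still stable (before the restart), wait until the two
//    majority nodes agree, then sample the target count from that stable state.
await().atMost(Duration.ofSeconds(30))
       .until(() -> countUsers(node1) == countUsers(node2));   // hypothetical helper
final long cycleLeaderCount = countUsers(node1);

// 2. Restart the isolated node, then wait for a Raft leader instead of a blind
//    10-second sleep, so the convergence check starts from a known-stable cluster.
restartIsolatedNode();                                          // hypothetical helper
waitForRaftLeader(servers, 30);

// 3. The convergence Awaitility block can now use the pre-restart count as its target.
await().atMost(Duration.ofSeconds(60))
       .until(() -> countUsers(node0) == cycleLeaderCount);
```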
Code Review

Overview

This PR fixes race conditions in three HA e2e tests and a resource leak in the test template. The core problem being addressed is that user counts were sampled after restarting the isolated node, during a period where Raft re-election makes the cluster state unstable, leading to flaky test assertions. The fixes move the count sampling to before the restart, add proper Awaitility waits for Raft convergence, and ensure node 0 holds leadership before a restore (since the backup file lives on that node's container).

What's Good
Issues

1. Leadership transfer does not guarantee node 0 becomes leader (potential bug)

In RestoreDatabaseScenarioIT:

```java
final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
}
```
Suggestion: After the transfer, verify leadership landed on node 0:

```java
if (leaderIdx != 0) {
    transferLeadershipAndWait(servers, 30);
    final int newLeader = waitForRaftLeader(servers, 30);
    assertThat(newLeader).as("Node 0 should be the leader before restore").isEqualTo(0);
}
```

Or, pass node 0's peer ID explicitly to the transfer API if it supports targeted transfer.

2. Unnecessary variable alias in …
- RestoreDatabaseScenarioIT: assert node 0 holds leadership after transferLeadershipAndWait before issuing the restore, since the transfer targets any Raft peer and not specifically node 0. Fix misleading log message to say "attempting" rather than guaranteeing node 0.
- NetworkPartitionRecoveryIT + SplitBrainIT: assert waitForRaftLeader return value >= 0 before convergence Awaitility checks so tests fail fast with a clear message if no leader is elected within the timeout rather than silently running into the convergence deadline.
- Add missing assertThat static import in NetworkPartitionRecoveryIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
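A minimal illustration of the fail-fast assertion described in the second bullet; the .as(...) message is illustrative and the actual commit may word it differently:

```java
// Fail fast with a clear message if no Raft leader is elected within the timeout,
// instead of letting the later convergence Awaitility block run into its deadline.
assertThat(waitForRaftLeader(servers, 30))
        .as("A Raft leader must be elected before checking convergence")
        .isGreaterThanOrEqualTo(0);
```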
Addressed in commit 5679275:

- Issue 1 - leadership transfer correctness: added an assertion that node 0 holds leadership after transferLeadershipAndWait.
- Issue 2 - …
- Issue 3 - …
Code Review

Overview

This PR improves HA test stability in three areas:
The direction is correct and the motivations are well-documented via comments. One potential flakiness issue stands out, and there are a couple of minor polish items.

Issue: Leadership Transfer Does Not Target Node 0

File: RestoreDatabaseScenarioIT

```java
transferLeadershipAndWait(servers, 30);
assertThat(waitForRaftLeader(servers, 30)).as("Node 0 must be leader before restore").isEqualTo(0);
```
Since the constraint is that the backup file lives on node 0, the fix should either:
This is the most actionable concern in the PR.

Minor: Missing Charset Specification

File: …

```java
connection.getOutputStream().write(
        new JSONObject().put("command", command).toString().getBytes()); // pre-existing
// ...
logger.error("...: {}", new String(errStream.readAllBytes())); // new code
```

Both calls rely on the JVM's default charset. The existing getBytes() call is pre-existing, but the new line should use new String(errStream.readAllBytes(), StandardCharsets.UTF_8).

Nit: Redundant …
Addressing the three items from the last review:

- Leadership transfer targeting: Added …
- Missing charset: …
What does this PR do?
A brief description of the change being made with this pull request.
Motivation
What inspired you to submit this pull request?
Related issues
A list of issues either fixed, containing architectural discussions, or otherwise relevant for this Pull Request.
Additional Notes
Anything else we should know when reviewing?
Checklist
mvn clean package command