
fix: e2e-ha stability #4020

Merged — robfrank merged 4 commits into main from fix/e32-ha-tests on Apr 28, 2026

Conversation

@robfrank
Collaborator

What does this PR do?

A brief description of the change being made with this pull request.

Motivation

What inspired you to submit this pull request?

Related issues

A list of issues either fixed, containing architectural discussions, otherwise relevant
for this Pull Request.

Additional Notes

Anything else we should know when reviewing?

Checklist

  • I have run the build using mvn clean package command
  • My unit tests cover both failure and success scenarios

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request improves the stability of the high-availability database restoration tests by ensuring node 0 is the leader before performing a restore and enhancing error logging for failed server commands. Additionally, it ensures proper cleanup of Docker networks and Toxiproxy instances during test teardown. Feedback suggests adding an assertion to verify node 0's leadership after a transfer and specifying UTF-8 encoding when reading error streams for better cross-platform consistency.

Comment on lines +119 to +123
final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
}
Contributor


high

The transferLeadershipAndWait method (as implemented in ContainersTestTemplate) initiates a leadership transfer to an arbitrary node because it sends an empty peerId. This does not guarantee that node 0 will become the leader, which is required for the subsequent restore operation. If leadership is transferred to another node (e.g., node 2), the restore command on node 0 will fail. It is recommended to verify that node 0 actually became the leader before proceeding.

Suggested change

Before:
final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
}

After:
if (waitForRaftLeader(servers, 30) != 0) {
    logger.info("Node 0 is not the leader; transferring leadership back to node 0");
    transferLeadershipAndWait(servers, 30);
    assertThat(waitForRaftLeader(servers, 30)).as("Node 0 must be the leader for restore").isEqualTo(0);
}

Comment on lines +178 to +179
logger.error("Server command '{}' returned HTTP {}: {}", command, status,
new String(errStream.readAllBytes()));
Contributor


medium

Specify a character set (e.g., UTF-8) when converting the error stream bytes to a string to ensure consistent behavior across different platforms and environments.

Suggested change

Before:
logger.error("Server command '{}' returned HTTP {}: {}", command, status,
    new String(errStream.readAllBytes()));

After:
logger.error("Server command '{}' returned HTTP {}: {}", command, status,
    new String(errStream.readAllBytes(), java.nio.charset.StandardCharsets.UTF_8));

@codacy-production

codacy-production Bot commented Apr 28, 2026

Up to standards ✅

🟢 Issues — 0 new issues (View in Codacy)

🟢 Metrics — complexity: 0 (View in Codacy)

🟢 Coverage — ∅ diff coverage · -8.52% coverage variation

View coverage diff in Codacy

Coverage variation details

  Commit                       Coverable lines    Covered lines     Coverage
  Common ancestor (aa526c0)    120069             88606             73.80%
  Head (5679275)               151403 (+31334)    98834 (+10228)    65.28% (-8.52%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

  Pull request (#4020): 0 coverable lines, 0 covered — diff coverage ∅ (not applicable)

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@claude

claude Bot commented Apr 28, 2026

Code review posted below - see full analysis in thread.

@claude

claude Bot commented Apr 28, 2026

Code Review

Overview

This PR makes two focused improvements to the HA e2e test infrastructure:

  1. RestoreDatabaseScenarioIT: Adds cluster stabilization and leadership handling before restore, plus enhanced error logging in postServerCommand.
  2. ContainersTestTemplate: Fixes a resource leak by properly closing the Docker network in tearDown, and null-guarding the toxiproxy reference.

Potential Bug - Leadership Transfer Does Not Guarantee Node 0 Becomes Leader

The main concern is in RestoreDatabaseScenarioIT.java around the leadership transfer block. Looking at transferLeadershipAndWait in ContainersTestTemplate.java, the transfer sends an empty peerId, meaning Ratis picks any candidate - not specifically node 0.

If the current leader is node 2 and the transfer goes to node 1, the restore is still issued to node 0 (a follower) and the restore will fail. The log message says "transferring leadership back to node 0" but the implementation does not guarantee this.

Options to fix:

  • Pass node 0's peerId explicitly to the transfer endpoint, so leadership is pinned to the node holding the backup file.
  • OR: after the transfer, assert waitForRaftLeader(servers, 30) == 0 and fail-fast if it is not 0, rather than silently proceeding.

Minor Issues

Missing charset in error stream reading (postServerCommand): new String(errStream.readAllBytes()) uses the platform default charset. Should specify StandardCharsets.UTF_8. The same pattern exists in countUsersViaHttp (pre-existing), but new code should set the example.
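For illustration, here is a self-contained sketch of why the explicit charset matters (the `CharsetDemo` class and the sample string are ours, not project code): a UTF-8 byte sequence round-trips only when decoded as UTF-8, while a single-byte platform default would mangle multi-byte characters.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // "héllo" encoded as UTF-8: 'é' becomes the two bytes 0xC3 0xA9, so 6 bytes total
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8 decoding round-trips correctly on every platform
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.length()); // 5 characters

        // Decoding the same bytes with a single-byte charset (what a
        // platform default of e.g. ISO-8859-1 would silently do) mangles the text
        String mangled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.length()); // 6 characters — one per byte
    }
}
```

This is exactly the failure mode `new String(errStream.readAllBytes())` invites on a JVM whose default charset is not UTF-8.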


Positive Changes

  • Docker network cleanup in ContainersTestTemplate.tearDown is a real fix. Without network.close(), Docker networks accumulate across test runs in CI, eventually exhausting the network namespace. The null-guarding and exception handling are correct.
  • toxiproxy = null after toxiproxy.stop() is a good defensive fix to prevent double-stop if tearDown is ever called more than once.
  • Error body logging in postServerCommand is a good debugging improvement - previously a 4xx/5xx response gave no indication of why a server command failed.
  • waitForAllNodesKnowLeader before restore is a sound improvement. Waiting for all followers to know the leader before issuing the restore prevents the silent proxying failures described in the code comment.

Summary

The resource-leak fix and error logging improvements are solid. The leadership transfer logic has a correctness gap: it transfers to any Raft peer rather than specifically to node 0, so the assumption that node 0 will be the leader after the call may not hold. Please either target node 0 explicitly in the peerId or add an assertion after the transfer to verify node 0 is the new leader before proceeding.

@claude

claude Bot commented Apr 28, 2026

Code Review

PR: fix: improve database restore process and enhance error logging
Files changed: 3 (all in e2e test infrastructure)


Overview

This PR fixes flaky HA e2e tests across three related areas:

  1. NetworkPartitionRecoveryIT - Fixes a race condition in multiplePartitionCycles() where the expected user count was captured after restarting the isolated node, risking a stale read during Raft re-election.
  2. RestoreDatabaseScenarioIT - Ensures the Raft cluster is stable and node 0 holds leadership before triggering a restore, since the backup file lives inside node 0's container.
  3. ContainersTestTemplate - Adds proper Docker network cleanup in tearDown() to prevent resource leaks across test runs.

Positive Changes

  • Moving cycleCount capture before the isolated-node restart in multiplePartitionCycles() is the right fix. The old position raced against Raft re-election.
  • Adding an Awaitility wait for majority convergence before capturing cycleCount is more robust than a simple point-in-time read.
  • Moving capturedLeader/capturedOther earlier is correct: they are now in scope for both the new pre-capture wait and the downstream convergence check.
  • Adding waitForRaftLeader after the isolated-node restart ensures the convergence check starts from a stable cluster state.
  • The network.close() plus null pattern in tearDown() prevents Docker network leaks that can exhaust the bridge limit on CI runners.
  • The null-after-stop on toxiproxy is a small but correct fix.
  • Error-stream logging in postServerCommand substantially improves debuggability for restore/drop failures.

Issues and Suggestions

Critical

transferLeadershipAndWait does not guarantee node 0 wins the election

In RestoreDatabaseScenarioIT.java (line 122), when node 0 is not the leader, the code calls transferLeadershipAndWait(servers, 30). This method posts to /api/v1/cluster/leader with peerId set to an empty string, which lets Ratis pick the best candidate. The comment inside that method confirms this: "Transfer to any peer (Ratis picks the best candidate)". After this call, node 0 may still not be the leader.

The code then unconditionally calls postServerCommand(servers.get(0), "restore database ..."), assuming node 0 is the leader. If node 0 is still a follower, the restore will fail - the 200-status assertion will catch it, but only after a confusing failure.

Suggestion: Re-check leadership after the transfer and fail fast if node 0 still does not hold it:

if (leaderIdx != 0) {
    logger.info("Node 0 is not leader ({}); attempting to transfer leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
    final int newLeaderIdx = waitForRaftLeader(servers, 30);
    assertThat(newLeaderIdx).as("Node 0 must be leader before restore").isEqualTo(0);
}

Alternatively, update transferLeadershipAndWait to accept a target peer ID so the transfer is deterministic.


Minor

Silent exception swallowing in new Awaitility block (NetworkPartitionRecoveryIT.java lines 233-236):

If countUsers() fails consistently (e.g., a node crashes), the test silently loops until timeout with no indication of why. Consider adding a logger.warn inside the catch block, matching the style used in the existing convergence check at line 279.
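A minimal sketch of the suggested pattern — polling with logged, rather than swallowed, failures. The `awaitCount` helper and the flaky supplier are hypothetical stand-ins for the Awaitility block and `countUsers()`, not project code:

```java
import java.util.function.IntSupplier;

public class PollWithLogging {
    // Poll a count until it matches, logging each failure so a
    // consistently-crashing node is visible before the timeout fires.
    static boolean awaitCount(IntSupplier countUsers, int expected,
                              int attempts, long sleepMs) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            try {
                if (countUsers.getAsInt() == expected)
                    return true;
            } catch (Exception e) {
                // The review's point: surface the cause instead of looping silently
                System.err.printf("countUsers attempt %d failed: %s%n", i, e.getMessage());
            }
            Thread.sleep(sleepMs);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulated flaky source: fails twice, then returns the expected count
        IntSupplier flaky = () -> {
            if (calls[0]++ < 2) throw new IllegalStateException("node not ready");
            return 42;
        };
        System.out.println(awaitCount(flaky, 42, 10, 1)); // true
    }
}
```

In the actual test the same effect is achieved by a `logger.warn` inside the Awaitility condition's catch block.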

Misleading log message (RestoreDatabaseScenarioIT.java line 122):

The message says "transferring leadership back to node 0" but the transfer does not guarantee node 0 wins. The message should say "attempting to transfer leadership" to reflect the uncertainty.

capturedLeader assumption after restart (NetworkPartitionRecoveryIT.java line 276):
capturedLeader refers to the leader from the start of the current cycle. After waitForRaftLeader(servers, 60), the actual leader may have changed. This is unlikely to cause a test flap since the convergence check verifies all three nodes, but a brief comment noting the assumption would help future readers.


Summary

The core ideas are sound and address real flakiness sources. The main concern is that transferLeadershipAndWait does not guarantee node 0 becomes the leader, so the primary fix in RestoreDatabaseScenarioIT may not reliably work when node 0 is a follower after the drop. That path should be hardened with a post-transfer leadership assertion before merging.

@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.25%. Comparing base (aa526c0) to head (5679275).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4020      +/-   ##
==========================================
- Coverage   64.79%   64.25%   -0.55%     
==========================================
  Files        1597     1597              
  Lines      120069   120148      +79     
  Branches    25557    25580      +23     
==========================================
- Hits        77804    77199     -605     
- Misses      31493    32281     +788     
+ Partials    10772    10668     -104     

☔ View full report in Codecov by Sentry.


@ExtReMLapin
Contributor

how the hell did I auto subscribe on this PR

…re restart and waiting for Raft leader

The convergence check timed out when tests ran together because cycleLeaderCount
was sampled after the isolated node restarted using a blind 10-second sleep. Under
load, the cluster may still be in Raft re-election at that point, producing a stale
target count the convergence condition can never satisfy.

Apply the same pattern used in NetworkPartitionRecoveryIT: wait for the two majority
nodes to agree before sampling the count (while still in a stable, pre-restart state),
then replace the blind sleep after restart with waitForRaftLeader so the convergence
Awaitility block starts from a known-stable cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Apr 28, 2026

Code Review

Overview

This PR fixes race conditions in three HA e2e tests and a resource leak in the test template. The core problem being addressed is that user counts were sampled after restarting the isolated node — during a period where Raft re-election makes the cluster state unstable — leading to flaky test assertions. The fixes move the count sampling to before the restart, add proper Awaitility waits for Raft convergence, and ensure node 0 holds leadership before a restore (since the backup file lives on that node's container).


What's Good

  • Race condition fixes are correct in principle: Capturing cycleCount/cycleLeaderCount before the isolated node restart, then waiting for a confirmed leader before the convergence check, is the right approach.
  • Replacing Thread.sleep(10) with waitForRaftLeader() is a direct improvement — sleeping blindly is a common source of flakiness.
  • Docker network cleanup (ContainersTestTemplate.tearDown) fills a genuine resource leak. Setting both fields to null after stopping/closing prevents a double-cleanup if tearDown is ever called more than once.
  • Error logging in postServerCommand() significantly improves debuggability when the restore or drop commands fail silently.

Issues

1. Leadership transfer does not guarantee node 0 becomes leader (potential bug)

In RestoreDatabaseScenarioIT.java (lines 119-123):

final int leaderIdx = waitForRaftLeader(servers, 30);
if (leaderIdx != 0) {
    logger.info("Node 0 is not the leader (leader is {}); transferring leadership back to node 0", leaderIdx);
    transferLeadershipAndWait(servers, 30);
}

transferLeadershipAndWait() calls the cluster API with "peerId":"", which asks Ratis to pick any suitable peer — not specifically node 0. If leadership moves from node 1 to node 2, the restore command still goes to node 0 (a follower), which will fail. There is no post-transfer assertion that node 0 is now actually the leader.

Suggestion: After the transfer, verify leadership landed on node 0:

if (leaderIdx != 0) {
    transferLeadershipAndWait(servers, 30);
    final int newLeader = waitForRaftLeader(servers, 30);
    assertThat(newLeader).as("Node 0 should be the leader before restore").isEqualTo(0);
}

Or, pass node 0's peer ID explicitly to the transfer API if it supports targeted transfer.


2. Unnecessary variable alias in SplitBrainIT.java

final List<ServerWrapper> majorityServers = servers;

servers is already effectively final in scope and accessible from the lambda directly. This alias adds noise without benefit. Remove it and use servers directly in the lambda.


3. waitForRaftLeader() return value silently ignored

In both NetworkPartitionRecoveryIT.java and SplitBrainIT.java, waitForRaftLeader() is called before the convergence check but its return value is discarded:

waitForRaftLeader(servers, 60);

If no leader is elected within 60 seconds it returns -1 and logs an error, but the test continues. At minimum, consider asserting the return value is not -1 so the test fails fast with a clear message rather than timing out on the convergence Awaitility.await() further down.


Minor / Style

  • Comments in the diff are thorough and explain the race conditions well — that is appreciated.
  • Awaitility timeout values (60s for majority sync, 180s for full convergence) appear reasonable for the scenarios described.
  • catch (final Exception ignored) for error stream reading in postServerCommand() is acceptable — failure to read the body should not mask the original HTTP error being returned.
  • No System.out calls introduced.
  • No new dependencies added.

Summary

The overall direction of this PR is sound. The main concern is the leadership transfer in RestoreDatabaseScenarioIT not guaranteeing that node 0 ends up as leader before the restore is issued — this could cause intermittent failures under the same conditions the PR is trying to fix. The unnecessary majorityServers alias and unguarded waitForRaftLeader() return values are minor but easy to clean up.

- RestoreDatabaseScenarioIT: assert node 0 holds leadership after
  transferLeadershipAndWait before issuing the restore, since the transfer
  targets any Raft peer and not specifically node 0. Fix misleading log
  message to say "attempting" rather than guaranteeing node 0.
- NetworkPartitionRecoveryIT + SplitBrainIT: assert waitForRaftLeader
  return value >= 0 before convergence Awaitility checks so tests fail
  fast with a clear message if no leader is elected within the timeout
  rather than silently running into the convergence deadline.
- Add missing assertThat static import in NetworkPartitionRecoveryIT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@robfrank
Collaborator Author

Addressed in commit 5679275:

Issue 1 - leadership transfer correctness: Added assertThat(waitForRaftLeader(servers, 30)).as("Node 0 must be leader before restore").isEqualTo(0) after transferLeadershipAndWait. Also fixed the log message from "transferring leadership back to node 0" to "attempting leadership transfer" since the API picks any Raft peer.

Issue 2 - majorityServers alias: This alias is necessary and cannot be removed. servers is reassigned later in the same loop body (servers = List.of(...) at the bottom of the cycle), so it is not effectively final. Java requires that a local variable captured by a lambda be final or effectively final, so using servers directly inside the lambda would be a compilation error.

Issue 3 - waitForRaftLeader return value ignored: Added assertThat(waitForRaftLeader(servers, 60)).as("Raft leader must be elected before convergence check").isGreaterThanOrEqualTo(0) in both NetworkPartitionRecoveryIT.multiplePartitionCycles() and SplitBrainIT.clusterReformation() at the point immediately before the convergence Awaitility.await(). Added the missing assertThat static import to NetworkPartitionRecoveryIT.
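The effectively-final constraint behind the alias in Issue 2 can be shown with a small self-contained example (the list contents here are illustrative, not the test's actual server wrappers):

```java
import java.util.List;

public class LambdaCaptureDemo {
    public static void main(String[] args) {
        List<String> servers = List.of("node0", "node1", "node2");

        // The alias is effectively final, so the lambda may capture it
        final List<String> majorityServers = servers;
        Runnable check = () -> System.out.println(majorityServers.size());

        // This reassignment makes `servers` NOT effectively final; capturing
        // `servers` directly in the lambda above would be a compile-time error
        // ("local variables referenced from a lambda expression must be
        // final or effectively final").
        servers = List.of("node0", "node1");

        check.run(); // prints 3 — the alias still points at the original list
    }
}
```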

@claude

claude Bot commented Apr 28, 2026

Code Review

Overview

This PR improves HA test stability in three areas:

  1. Ensures node 0 is the Raft leader before running a restore (since the backup file lives on node 0).
  2. Captures write counts before restarting isolated nodes (avoiding stale reads during re-election).
  3. Fixes a Docker network resource leak in ContainersTestTemplate.tearDown().

The direction is correct and the motivations are well-documented via comments. One potential flakiness issue stands out, and there are a couple of minor polish items.


Issue: Leadership Transfer Does Not Target Node 0

File: RestoreDatabaseScenarioIT.java, lines 122-123

transferLeadershipAndWait(servers, 30);
assertThat(waitForRaftLeader(servers, 30)).as("Node 0 must be leader before restore").isEqualTo(0);

transferLeadershipAndWait sends {"peerId":"","timeoutMs":30000} - an empty peer ID - meaning Raft (Ratis) picks any peer as the new leader. There is no guarantee it will pick node 0. If node 1 or node 2 wins the election the assertion isEqualTo(0) will fail with a confusing message about leadership, not about the backup file.

Since the constraint is that the backup file lives on node 0, the fix should either:

  • Pass node 0's peer ID explicitly when calling the transfer endpoint, or
  • Retry the transfer until node 0 wins (which is fragile), or
  • Copy/mount the backup in a location accessible to all nodes so any leader can restore it.

This is the most actionable concern in the PR.


Minor: Missing Charset Specification

File: RestoreDatabaseScenarioIT.java, lines 173 and 180

connection.getOutputStream().write(
    new JSONObject().put("command", command).toString().getBytes()); // pre-existing
...
logger.error("...: {}", new String(errStream.readAllBytes())); // new code

Both calls rely on the JVM's default charset. The existing getBytes() call is pre-existing debt, but the new new String(errStream.readAllBytes()) was introduced by this PR. For cross-platform consistency, prefer:

new String(errStream.readAllBytes(), StandardCharsets.UTF_8)

Nit: Redundant majorityServers Alias (SplitBrainIT.java, line 382)

final List<ServerWrapper> majorityServers = servers;

This alias is required because servers is reassigned later in the loop body, making it non-effectively-final. The code is correct, but a brief inline comment explaining why the alias exists (Java lambda capture requires effectively-final) would prevent future readers from treating it as dead code and removing it.


Good Changes

  • ContainersTestTemplate.tearDown(): Closing and nulling the Docker network after containers stop is a correct resource-leak fix. The try/catch with a warn log (rather than a failing assertion) is appropriate since teardown should be best-effort.
  • Count capture before restart (NetworkPartitionRecoveryIT, SplitBrainIT): Moving cycleCount/cycleLeaderCount to before the node restart correctly reflects committed writes on the stable majority. The approach is sound.
  • Replacing sleep(10) with waitForRaftLeader: Eliminates a hardcoded sleep in favour of a proper condition check - good.
  • Error logging in postServerCommand: Logging the HTTP status and body on error significantly improves debuggability of CI failures.

@robfrank
Collaborator Author

Addressing the three items from the last review:

Leadership transfer targeting: Added getRaftPeerId(ServerWrapper) to ContainersTestTemplate - queries GET /api/v1/cluster and extracts localPeerId (format: arcadedb-0_2434). Added transferLeadershipToNode(servers, targetNode, timeoutSeconds) which POSTs that peer ID explicitly to the current leader's /api/v1/cluster/leader endpoint, so Ratis targets node 0 deterministically. RestoreDatabaseScenarioIT now calls transferLeadershipToNode(servers, servers.get(0), 30).

Missing charset: new String(errStream.readAllBytes(), StandardCharsets.UTF_8) in postServerCommand.

majorityServers alias: Added comment: "Alias required: servers is reassigned later in this loop body (after restart), making it not effectively final and therefore not capturable in the lambda below."
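The difference between the old and new transfer requests comes down to the JSON body POSTed to the leader. A sketch of the payload construction (the `buildTransferPayload` helper is hypothetical; the endpoint, field names, and peer-ID format are taken from this thread):

```java
public class LeaderTransferPayload {
    // Build the JSON body POSTed to /api/v1/cluster/leader. An empty peerId
    // lets Ratis pick any candidate; a concrete peerId (e.g. "arcadedb-0_2434",
    // the format mentioned above) pins the transfer to that node.
    static String buildTransferPayload(String peerId, long timeoutMs) {
        return String.format("{\"peerId\":\"%s\",\"timeoutMs\":%d}", peerId, timeoutMs);
    }

    public static void main(String[] args) {
        // Old behaviour: empty peerId, non-deterministic target
        System.out.println(buildTransferPayload("", 30000));
        // New behaviour: explicit peerId so node 0 becomes leader deterministically
        System.out.println(buildTransferPayload("arcadedb-0_2434", 30000));
    }
}
```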

@robfrank robfrank changed the title from "fix: improve database restore process and enhance error logging" to "fix: e2e-ha stability" on Apr 28, 2026
@robfrank robfrank merged commit 9a4ef79 into main Apr 28, 2026
24 of 27 checks passed
@robfrank robfrank deleted the fix/e32-ha-tests branch April 28, 2026 20:54