Fix flaky testMultiRS: replace fixed sleep with deterministic domain-ready wait#615

Merged
vharseko merged 2 commits into master from copilot/fix-flaky-github-actions-job on Apr 2, 2026

Contributor

Copilot AI commented Apr 1, 2026

GenerationIdTest#testMultiRS was flaky in CI because it asserted generation IDs across all three replication servers after only a fixed Thread.sleep(2000), before RS-to-RS handshakes and domain creation had necessarily completed — leaving replServer2 at genId=-1 instead of the expected value.

Changes

  • New helper waitForDomainsOnAllReplicationServers() — polls (up to 30s, every 100ms via TestTimer) until getReplicationServerDomain(baseDN) is non-null on replServer1, replServer2, and replServer3, confirming topology convergence before any assertion.
  • Replaced the Thread.sleep(2000) that followed connectServer1ToReplServer in testMultiRS with a call to this helper.

    // Before
    connectServer1ToReplServer(replServer1);
    Thread.sleep(2000); // hope the RS mesh is ready

    // After
    connectServer1ToReplServer(replServer1);
    waitForDomainsOnAllReplicationServers(); // deterministic: polls until all 3 RSes have the domain

No behavioral change for passing cases; only the synchronization point becomes deterministic.
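
For readers without the OpenDJ test sources at hand, the polling idea behind the helper can be sketched in plain Java. This is an illustration only, not the merged code: the class `DomainReadyWait`, the method `waitForAll`, and the simulated checks are hypothetical names, and the real helper uses OpenDJ's TestTimer rather than a hand-written loop. In the actual test, each check would correspond to `getReplicationServerDomain(baseDN)` returning non-null on one replication server.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the polling wait; not the OpenDJ implementation. */
public class DomainReadyWait {

    /**
     * Re-evaluates every named check each intervalMs until all pass.
     * If timeoutMs elapses first, throws an AssertionError that names
     * the checks still failing, so a CI log shows which server lagged.
     */
    public static void waitForAll(Map<String, BooleanSupplier> checks,
                                  long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            List<String> pending = new ArrayList<>();
            for (Map.Entry<String, BooleanSupplier> e : checks.entrySet()) {
                if (!e.getValue().getAsBoolean()) {
                    pending.add(e.getKey());
                }
            }
            if (pending.isEmpty()) {
                return; // every server reported the domain
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError(
                    "Domain not ready after " + timeoutMs + " ms on: " + pending);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting on: " + pending);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Map<String, BooleanSupplier> checks = new LinkedHashMap<>();
        // Simulated servers: replServer2 only becomes ready after ~300 ms.
        checks.put("replServer1", () -> true);
        checks.put("replServer2", () -> System.currentTimeMillis() - start >= 300);
        checks.put("replServer3", () -> true);
        waitForAll(checks, 30_000, 100); // same 30s / 100ms budget as the PR
        System.out.println("all domains ready");
    }
}
```

The wait returns as soon as the slowest server is ready, so a fast run costs little more than the old fixed sleep, while a slow CI runner gets up to the full timeout before the test fails with a message naming the lagging server.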

Original prompt

Create a pull request to fix the flaky failing GitHub Actions job in OpenIdentityPlatform/OpenDJ (PR #613 run). The job failed in unit test org.opends.server.replication.GenerationIdTest#testMultiRS where replServer2 generationId remained -1 instead of the expected value (e.g., 48), indicating the test asserts before replication servers finish handshaking / domain creation.

Use commit ref 6a75f55 and workflow .github/workflows/build.yml (job: build-maven) and logs showing the failure.

Implement a deterministic wait in opendj-server-legacy/src/test/java/org/opends/server/replication/GenerationIdTest.java so that testMultiRS waits for the replication domain to exist on all three replication servers (replServer1/2/3) before asserting generation IDs, instead of relying on fixed Thread.sleep(...).

Suggested approach:

  • Add helper waitForDomainsOnAllReplicationServers() that uses TestTimer to poll until replServerX.getReplicationServerDomain(baseDN) is non-null for X=1..3.
  • In testMultiRS, after connectServer1ToReplServer(replServer1), call the new helper and remove or reduce the fixed sleep.
  • Keep timeouts reasonable (e.g., 30s) and add clear assertion messages for debugging.
  • Ensure no behavior change to non-flaky cases; only improve test synchronization.

Validate that the test suite still passes.

Reference failing job URL (for context): https://github.com/OpenIdentityPlatform/OpenDJ/actions/runs/23842862927/job/69502680510?pr=613

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Why did this job fail?

Assistant: The job failed primarily because of failures in the unit test org.opends.server.replication.GenerationIdTest#testMultiRS. The main issue was an assertion failure during replication testing.

Key Details:

  1. Test Failure Details:

    • In replication server replServer2, the expected generation ID was 48L, but the actual value was -1L.
    • The specific assertion error:
      [in replServer2] expected:<[48]L> but was:<[-1]L>
      
    • This indicates a mismatch in the expected versus actual state of the replication server.
  2. Related Logs and Issues During the Test:

    • Persistent issues with incorrect group IDs among replication servers, causing directory servers to connect to replication servers with the wrong context.
    • Disconnections between directory servers and replication servers were noted.
    • Exception details:
      Exception: No domain matches the provided base DN 'o=test'
      
    • Errors during synchronization such as:
      IOException: no more data (Session.java:441)
      
  3. Summary of Other Failures:

    • Issues stem from the synchronization process and the initialization of replication servers/domains - specifically the "o=test" domain.
  4. Overall Test Result Summary:

    • Tests run: 2
    • Failures: 1
    • Errors: 0
    • The problematic test was:
      GenerationIdTest.testMultiRS
      

Suggested Actions:

  • Fix the Generation ID Issue:
    Ensure consistent generation IDs across all replication servers.

    • Investigate why replServer2 is receiving an incorrect generation ID (-1L) during the test. It may relate to incorrect setup or timing issues.
  • Check Synchronization and Domain Configurations:

    • Debug the replication domain setup for the "o=test" domain.
    • Address the mismatch in the group IDs for replication servers.
  • Examine Connection Stability:

    • Analyze disconnections between the directory servers and replication servers.
  • Test Isolation:
    Rerun the failed test in isolation with logs enabled:

    mvn -Dtest=GenerationIdTest -Dtest.failures=true
    

Let me know if you'd like more guidance on how to address these issues!

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The failure is a real test bug / race in GenerationIdTest.testMultiRS, not a workflow problem.

What’s happening (from the job log)

The failing assertion is:

  • GenerationIdTest.testMultiRS expected all 3 replication servers to have generationId 48, but:
    • replServer2 generationId stayed -1
    • Assert failed at GenerationIdTest#assertGenIdEquals (lines 1009–1016)

Log excerpt shows the symptom earlier too:

  • DS2/DS3 connect with wrong group id:
    Connected to a replication server with wrong group id. We have group id -1 ... has group id 1
    
  • Then the test’s “Expect genId are set in all replServers.” step fails because genId propagation across RSes didn’t complete in time / didn’t happen due to topology state.

Root cause: the test assumes RS-to-RS topology + domain initialization is fully established after a fixed Thread.sleep(...), but i...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Fix flaky failing GitHub Actions job in OpenDJ" to "Fix flaky testMultiRS: replace fixed sleep with deterministic domain-ready wait" Apr 1, 2026
Copilot AI requested a review from vharseko April 1, 2026 19:24
vharseko marked this pull request as ready for review April 2, 2026 08:56
vharseko merged commit 2898206 into master Apr 2, 2026
0 of 15 checks passed
vharseko deleted the copilot/fix-flaky-github-actions-job branch April 2, 2026 09:00