CleanAllRUV test suite fails to restore master4 in the topology #2922

Closed
389-ds-bot opened this issue Sep 13, 2020 · 12 comments
Labels: closed: fixed Migration flag - Issue
Milestone: 1.4.0

Comments

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49863


Issue Description

Sometimes, after a test function finishes successfully, we try to restore master4 in the topology (the restore_master4 function). But then the tests start to fail during the test_replication function (after a successful master4 restoration) -
https://pagure.io/389-ds-base/blob/master/f/dirsrvtests/tests/suites/replication/cleanallruv_test.py#_160

This mostly happens after test_abort or test_abort_restart, but I've also seen the failure after test_clean_restart.

It looks like a timing issue, but it also happens on fast machines.

At the end of the error logs we can see the following lines:

  • on master1:

      [18/Jul/2018:16:48:05.207259621 -0400] - ERR - NSMMReplicationPlugin - ruv_compare_ruv - RUV [database RUV] does not contain element [{replica 4} 5b4fa75c000000040000 5b4fa76a000800040000] which is present in RUV [changelog max RUV]
      [18/Jul/2018:16:48:05.209129423 -0400] - WARN - NSMMReplicationPlugin - replica_check_for_data_reload - For replica dc=example,dc=com there were some differences between the changelog max RUV and the database RUV.  If there are obsolete elements in the database RUV, you should remove them using the CLEANALLRUV task.  If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
      [18/Jul/2018:16:48:05.320759190 -0400] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 39001 for LDAP requests
      [18/Jul/2018:16:48:19.975883756 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=002" (host-172-16-36-24:39002): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
      [18/Jul/2018:16:48:19.977935906 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=003" (host-172-16-36-24:39003): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
      [18/Jul/2018:16:48:20.481433078 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)".
      [18/Jul/2018:16:48:21.114461980 -0400] - ERR - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=temp_004" (host-172-16-36-24:39004) -  Total update flow control gives time (2000 msec) to the consumer before sending more entries [ msgid sent: 1453, rcv: 452]) If total update fails you can try to increase nsds5ReplicaFlowControlPause and/or decrease nsds5ReplicaFlowControlWindow in the replica agreement configuration
      [18/Jul/2018:16:48:22.370839738 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=002" (host-172-16-36-24:39002): Replication bind with SIMPLE auth resumed
      [18/Jul/2018:16:48:22.373100328 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=003" (host-172-16-36-24:39003): Replication bind with SIMPLE auth resumed
      [18/Jul/2018:16:48:25.676770766 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Finished total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)". Sent 1519 entries.
      [18/Jul/2018:16:48:28.646462533 -0400] - ERR - NSMMReplicationPlugin - multimaster_extop_StartNSDS50ReplicationRequest - conn=4 op=3 replica="dc=example,dc=com": Unable to acquire replica: error: duplicate replica ID detected
    
  • on master4:

      [18/Jul/2018:16:48:25.440484901 -0400] - NOTICE - NSMMReplicationPlugin - multimaster_be_state_change - Replica dc=example,dc=com is coming online; enabling replication
      [18/Jul/2018:16:48:25.485255650 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
      [18/Jul/2018:16:48:25.485971910 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
      [18/Jul/2018:16:48:25.486589078 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
      [18/Jul/2018:16:48:25.487219702 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
      [18/Jul/2018:16:48:25.744562488 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa75c000000040000
      [18/Jul/2018:16:48:25.746187981 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000200040000
      [18/Jul/2018:16:48:25.747534085 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000c00040000
      [18/Jul/2018:16:48:25.748864428 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa763000100040000
      [18/Jul/2018:16:48:25.750210622 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa769000100040000
      [18/Jul/2018:16:48:25.751621244 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000000040000
      [18/Jul/2018:16:48:25.753006696 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000800040000
      [18/Jul/2018:16:48:28.655153846 -0400] - ERR - NSMMReplicationPlugin - acquire_replica - agmt="cn=001" (host-172-16-36-24:39001): Unable to aquire replica: the replica has the same Replica ID as this one. Replication is aborting.
      [18/Jul/2018:16:48:28.656000657 -0400] - ERR - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=001" (host-172-16-36-24:39001): Incremental update failed and requires administrator action
    

The replica IDs from dse.ldif:

[root@host-172-16-36-24 dirsrv]# grep -rni replicaid
slapd-master1/dse.ldif.bak:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif.startOK:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif:1330:nsDS5ReplicaId: 1
slapd-master2/dse.ldif.bak:1330:nsDS5ReplicaId: 2
slapd-master2/dse.ldif.startOK:1329:nsDS5ReplicaId: 2
slapd-master2/dse.ldif:1330:nsDS5ReplicaId: 2
slapd-master3/dse.ldif.bak:1330:nsDS5ReplicaId: 3
slapd-master3/dse.ldif.startOK:1329:nsDS5ReplicaId: 3
slapd-master3/dse.ldif:1330:nsDS5ReplicaId: 3
slapd-master4/dse.ldif.bak:1330:nsDS5ReplicaId: 4
slapd-master4/dse.ldif:1330:nsDS5ReplicaId: 4
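
The WARN on master1 above recommends the CLEANALLRUV task for obsolete RUV elements. For reference, a minimal sketch of launching that task through lib389 (the master1 instance and replica id 4 are illustrative, taken from the logs above, and not part of the original report):

    from lib389.tasks import CleanAllRUVTask
    from lib389._constants import DEFAULT_SUFFIX

    # master1 stands for an existing DirSrv instance, e.g. topology_m4.ms["master1"]
    task = CleanAllRUVTask(master1)
    task.create(properties={
        'replica-base-dn': DEFAULT_SUFFIX,
        'replica-id': '4',
        'replica-force-cleaning': 'no',
    })
    task.wait()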

Package Version and Platform

389-ds-base built on master with https://pagure.io/389-ds-base/pull-request/49846

Steps to reproduce

The issue is not 100% reproducible.
It can be reproduced by running suites/replication/cleanallruv_test.py many times.
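
For reference, a minimal loop that reruns the suite until the intermittent failure shows up (an illustrative sketch; the 20-run cap is arbitrary):

    import subprocess
    import sys

    SUITE = "dirsrvtests/tests/suites/replication/cleanallruv_test.py"

    for attempt in range(1, 21):
        print(f"--- run {attempt} ---")
        result = subprocess.run([sys.executable, "-m", "pytest", "-v", SUITE])
        if result.returncode != 0:
            print(f"Suite failed on run {attempt}")
            break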

389-ds-bot added the "closed: fixed Migration flag - Issue" label on Sep 13, 2020
389-ds-bot added this to the 1.4.0 milestone on Sep 13, 2020

Comment from mreynolds (@mreynolds389) at 2018-07-21 18:21:51

Simon, try this patch out. It seems to work for me. The main part of the fix is the change to m4rid().

cleanallruv2.patch


Comment from mreynolds (@mreynolds389) at 2018-07-21 18:21:52

Metadata Update from @mreynolds389:

  • Custom field component adjusted to None
  • Custom field origin adjusted to None
  • Custom field reviewstatus adjusted to None
  • Custom field type adjusted to None
  • Custom field version adjusted to None


Comment from vashirov (@vashirov) at 2018-07-24 14:16:54

@mreynolds389, I've run the cleanallruv test with your patch several times on a machine where it previously failed, and it now passes every time. Thanks!


Comment from spichugi (@droideck) at 2018-07-24 23:27:32

It still fails for me on the internal tool virtual machine.

[root@host-172-16-36-17 ds]# py.test -v dirsrvtests/tests/suites/replication/cleanallruv_test.py
==================== test session starts ======================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.6.6', 'Platform': 'Linux-4.17.3-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight', 'Packages': {'pytest': '3.6.3', 'py': '1.5.4', 'pluggy': '0.6.0'}, 'Plugins': {'metadata': '1.7.0', 'html': '1.19.0'}}
389-ds-base: 1.4.0.13-20180724gitfd8ca9a.fc28
nss: 3.38.0-1.0.fc28
nspr: 4.19.0-1.fc28
openldap: 2.4.46-2.fc28
svrcore: not installed
FIPS: 0

rootdir: /mnt/tests/rhds/tests/upstream/ds, inifile:
plugins: metadata-1.7.0, html-1.19.0
collected 8 items

dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean PASSED [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean ERROR [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_restart ERROR [ 25%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_force ERROR [ 37%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort ERROR [ 50%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_restart ERROR [ 62%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_certify ERROR [ 75%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_stress_clean ERROR [ 87%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_multiple_tasks_with_force ERROR [100%]

============== ERRORS ================

It has the same ERROR as before:

    @pytest.fixture()
    def m4rid(request, topology_m4):
        log.debug("-------------- BEGIN RESET of m4 -----------------")
        repl = ReplicationManager(DEFAULT_SUFFIX)
>       repl.test_replication_topology(topology_m4.ms.values())

>       raise Exception("Replication did not sync in time!")
E       Exception: Replication did not sync in time!


Comment from spichugi (@droideck) at 2018-07-25 15:37:15

Okay, the tests do pass on a faster machine but still fail on a slower one. We can increase the timeouts then.
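
For example, a minimal sketch of the kind of timeout bump in question, assuming test_replication_topology accepts a timeout keyword in the lib389 version in use:

    from lib389.replica import ReplicationManager
    from lib389._constants import DEFAULT_SUFFIX

    def check_restore(topology_m4, timeout=120):
        # Give slower machines a longer window to converge; the timeout
        # keyword is assumed to be supported by test_replication_topology.
        repl = ReplicationManager(DEFAULT_SUFFIX)
        repl.test_replication_topology(topology_m4.ms.values(), timeout=timeout)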


Comment from mreynolds (@mreynolds389) at 2018-07-25 18:52:13

How is it failing? What are the errors being reported?


Comment from spichugi (@droideck) at 2018-07-26 00:41:16

How is it failing? What are the errors being reported?

The same errors I reported in the issue description.
On a fast machine, it passes 100% of the time though. I am trying to tweak some timeouts now...


Comment from spichugi (@droideck) at 2018-07-27 16:16:24

Mark, your diff was applied to #2905
Thanks!


Comment from mreynolds (@mreynolds389) at 2019-01-10 18:03:16

@droideck - if this is fixed, can you close this ticket?


Comment from mreynolds (@mreynolds389) at 2019-01-10 18:03:32

Metadata Update from @mreynolds389:

  • Issue assigned to droideck
  • Issue set to the milestone: 1.4.0


Comment from spichugi (@droideck) at 2019-05-30 15:23:45

The issue is no longer present


Comment from spichugi (@droideck) at 2019-05-30 15:23:50

Metadata Update from @droideck:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)
