CleanAllRUV test suite fails to restore master4 in the topology #2922

Closed
389-ds-bot opened this issue Sep 13, 2020 · 12 comments
Labels: closed: fixed Migration flag - Issue
Milestone: 1.4.0

Comments

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49863


Issue Description

Sometimes, after a test function finishes successfully, we try to restore master4 in the topology (the restore_master4 function). But then the tests start to fail during the test_replication function (after a successful master4 restoration) -
https://pagure.io/389-ds-base/blob/master/f/dirsrvtests/tests/suites/replication/cleanallruv_test.py#_160

This mostly happens after test_abort or test_abort_restart, but I've also seen the failure after test_clean_restart.

It looks like a timing issue, but it also happens on fast machines.

At the end of the error logs we can see the following lines:

  • on master1:

      [18/Jul/2018:16:48:05.207259621 -0400] - ERR - NSMMReplicationPlugin - ruv_compare_ruv - RUV [database RUV] does not contain element [{replica 4} 5b4fa75c000000040000 5b4fa76a000800040000] which is present in RUV [changelog max RUV]
      [18/Jul/2018:16:48:05.209129423 -0400] - WARN - NSMMReplicationPlugin - replica_check_for_data_reload - For replica dc=example,dc=com there were some differences between the changelog max RUV and the database RUV.  If there are obsolete elements in the database RUV, you should remove them using the CLEANALLRUV task.  If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
      [18/Jul/2018:16:48:05.320759190 -0400] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 39001 for LDAP requests
      [18/Jul/2018:16:48:19.975883756 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=002" (host-172-16-36-24:39002): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
      [18/Jul/2018:16:48:19.977935906 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=003" (host-172-16-36-24:39003): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
      [18/Jul/2018:16:48:20.481433078 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)".
      [18/Jul/2018:16:48:21.114461980 -0400] - ERR - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=temp_004" (host-172-16-36-24:39004) -  Total update flow control gives time (2000 msec) to the consumer before sending more entries [ msgid sent: 1453, rcv: 452]) If total update fails you can try to increase nsds5ReplicaFlowControlPause and/or decrease nsds5ReplicaFlowControlWindow in the replica agreement configuration
      [18/Jul/2018:16:48:22.370839738 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=002" (host-172-16-36-24:39002): Replication bind with SIMPLE auth resumed
      [18/Jul/2018:16:48:22.373100328 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=003" (host-172-16-36-24:39003): Replication bind with SIMPLE auth resumed
      [18/Jul/2018:16:48:25.676770766 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Finished total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)". Sent 1519 entries.
      [18/Jul/2018:16:48:28.646462533 -0400] - ERR - NSMMReplicationPlugin - multimaster_extop_StartNSDS50ReplicationRequest - conn=4 op=3 replica="dc=example,dc=com": Unable to acquire replica: error: duplicate replica ID detected
    
  • on master4:

      [18/Jul/2018:16:48:25.440484901 -0400] - NOTICE - NSMMReplicationPlugin - multimaster_be_state_change - Replica dc=example,dc=com is coming online; enabling replication
      [18/Jul/2018:16:48:25.485255650 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
      [18/Jul/2018:16:48:25.485971910 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
      [18/Jul/2018:16:48:25.486589078 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
      [18/Jul/2018:16:48:25.487219702 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
      [18/Jul/2018:16:48:25.744562488 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa75c000000040000
      [18/Jul/2018:16:48:25.746187981 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000200040000
      [18/Jul/2018:16:48:25.747534085 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000c00040000
      [18/Jul/2018:16:48:25.748864428 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa763000100040000
      [18/Jul/2018:16:48:25.750210622 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa769000100040000
      [18/Jul/2018:16:48:25.751621244 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000000040000
      [18/Jul/2018:16:48:25.753006696 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000800040000
      [18/Jul/2018:16:48:28.655153846 -0400] - ERR - NSMMReplicationPlugin - acquire_replica - agmt="cn=001" (host-172-16-36-24:39001): Unable to aquire replica: the replica has the same Replica ID as this one. Replication is aborting.
      [18/Jul/2018:16:48:28.656000657 -0400] - ERR - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=001" (host-172-16-36-24:39001): Incremental update failed and requires administrator action
    

The replica IDs from dse.ldif:

[root@host-172-16-36-24 dirsrv]# grep -rni replicaid
slapd-master1/dse.ldif.bak:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif.startOK:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif:1330:nsDS5ReplicaId: 1
slapd-master2/dse.ldif.bak:1330:nsDS5ReplicaId: 2
slapd-master2/dse.ldif.startOK:1329:nsDS5ReplicaId: 2
slapd-master2/dse.ldif:1330:nsDS5ReplicaId: 2
slapd-master3/dse.ldif.bak:1330:nsDS5ReplicaId: 3
slapd-master3/dse.ldif.startOK:1329:nsDS5ReplicaId: 3
slapd-master3/dse.ldif:1330:nsDS5ReplicaId: 3
slapd-master4/dse.ldif.bak:1330:nsDS5ReplicaId: 4
slapd-master4/dse.ldif:1330:nsDS5ReplicaId: 4
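
The WARN on master1 above recommends the CLEANALLRUV task for obsolete RUV elements. For reference, a minimal sketch of launching that task through lib389 (the master1 instance and replica id 4 are illustrative, taken from the logs above, and not part of the original report):

    from lib389.tasks import CleanAllRUVTask
    from lib389._constants import DEFAULT_SUFFIX

    # master1 stands for an existing DirSrv instance, e.g. topology_m4.ms["master1"]
    task = CleanAllRUVTask(master1)
    task.create(properties={
        'replica-base-dn': DEFAULT_SUFFIX,
        'replica-id': '4',
        'replica-force-cleaning': 'no',
    })
    task.wait()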

Package Version and Platform

389-ds-base built on master with https://pagure.io/389-ds-base/pull-request/49846

Steps to reproduce

The issue is not 100% reproducible.
It can be reproduced by running suites/replication/cleanallruv_test.py many times.
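
For reference, a minimal loop that reruns the suite until the intermittent failure shows up (an illustrative sketch; the 20-run cap is arbitrary):

    import subprocess
    import sys

    SUITE = "dirsrvtests/tests/suites/replication/cleanallruv_test.py"

    for attempt in range(1, 21):
        print(f"--- run {attempt} ---")
        result = subprocess.run([sys.executable, "-m", "pytest", "-v", SUITE])
        if result.returncode != 0:
            print(f"Suite failed on run {attempt}")
            break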

389-ds-bot added the "closed: fixed Migration flag - Issue" label on Sep 13, 2020
389-ds-bot added this to the 1.4.0 milestone on Sep 13, 2020

Comment from mreynolds (@mreynolds389) at 2018-07-21 18:21:51

Simon, try this patch out. It seems to work for me. The main part of the fix is the change to m4rid().

cleanallruv2.patch


Comment from mreynolds (@mreynolds389) at 2018-07-21 18:21:52

Metadata Update from @mreynolds389:

  • Custom field component adjusted to None
  • Custom field origin adjusted to None
  • Custom field reviewstatus adjusted to None
  • Custom field type adjusted to None
  • Custom field version adjusted to None


Comment from vashirov (@vashirov) at 2018-07-24 14:16:54

@mreynolds389, I've run the cleanallruv test with your patch several times on a machine where it previously failed, and it now passes every time. Thanks!


Comment from spichugi (@droideck) at 2018-07-24 23:27:32

It still fails for me on the internal tool virtual machine.

[root@host-172-16-36-17 ds]# py.test -v dirsrvtests/tests/suites/replication/cleanallruv_test.py
==================== test session starts ======================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.6.6', 'Platform': 'Linux-4.17.3-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight', 'Packages': {'pytest': '3.6.3', 'py': '1.5.4', 'pluggy': '0.6.0'}, 'Plugins': {'metadata': '1.7.0', 'html': '1.19.0'}}
389-ds-base: 1.4.0.13-20180724gitfd8ca9a.fc28
nss: 3.38.0-1.0.fc28
nspr: 4.19.0-1.fc28
openldap: 2.4.46-2.fc28
svrcore: not installed
FIPS: 0

rootdir: /mnt/tests/rhds/tests/upstream/ds, inifile:
plugins: metadata-1.7.0, html-1.19.0
collected 8 items

dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean PASSED [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean ERROR [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_restart ERROR [ 25%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_force ERROR [ 37%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort ERROR [ 50%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_restart ERROR [ 62%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_certify ERROR [ 75%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_stress_clean ERROR [ 87%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_multiple_tasks_with_force ERROR [100%]

============== ERRORS ================

It has the same ERROR as before:

    @pytest.fixture()
    def m4rid(request, topology_m4):
        log.debug("-------------- BEGIN RESET of m4 -----------------")
        repl = ReplicationManager(DEFAULT_SUFFIX)
>       repl.test_replication_topology(topology_m4.ms.values())

>       raise Exception("Replication did not sync in time!")
E       Exception: Replication did not sync in time!


Comment from spichugi (@droideck) at 2018-07-25 15:37:15

Okay, the tests do pass on a faster machine but still fail on a slower one. We can increase the timeouts then.
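
For example, a minimal sketch of the kind of timeout bump in question, assuming test_replication_topology accepts a timeout keyword in the lib389 version in use:

    from lib389.replica import ReplicationManager
    from lib389._constants import DEFAULT_SUFFIX

    def check_restore(topology_m4, timeout=120):
        # Give slower machines a longer window to converge; the timeout
        # keyword is assumed to be supported by test_replication_topology.
        repl = ReplicationManager(DEFAULT_SUFFIX)
        repl.test_replication_topology(topology_m4.ms.values(), timeout=timeout)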


Comment from mreynolds (@mreynolds389) at 2018-07-25 18:52:13

How is it failing? What are the errors being reported?


Comment from spichugi (@droideck) at 2018-07-26 00:41:16

How is it failing? What are the errors being reported?

The same errors I reported in the issue description.
On a fast machine, it passes 100% of the time though. I am trying to tweak some timeouts now...


Comment from spichugi (@droideck) at 2018-07-27 16:16:24

Mark, your diff was applied to #2905
Thanks!


Comment from mreynolds (@mreynolds389) at 2019-01-10 18:03:16

@droideck - if this is fixed, can you close this ticket?


Comment from mreynolds (@mreynolds389) at 2019-01-10 18:03:32

Metadata Update from @mreynolds389:

  • Issue assigned to droideck
  • Issue set to the milestone: 1.4.0


Comment from spichugi (@droideck) at 2019-05-30 15:23:45

The issue is no longer present


Comment from spichugi (@droideck) at 2019-05-30 15:23:50

Metadata Update from @droideck:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)
