Replicated server hangs on startup/shutdown #560

Closed · 389-ds-bot opened this issue Sep 12, 2020 · 8 comments
Labels: closed: duplicate (Migration flag - Issue)

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/560


Hi, I have set up a multi-master deployment with 4 masters and 5 consumers all over the world, to which we are moving our old Sun LDAP 5.2 server. In the midst of importing and testing data from the old server, I have noticed that the 389 Directory Server sometimes gets locked up, hangs, and will not shut down.
On startup, strace shows this:

Snip …
access("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb", F_OK) = 0
access("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/DBVERSION", F_OK) = 0
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/DBVERSION", O_RDONLY) = 21
read(21, "bdb/4.7/libreplication-plugin\n", 8192) = 30
close(21) = 0
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 21
brk(0x1277000) = 0x1277000
getdents(21, /* 5 entries */, 32768) = 224
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/8dc56082-328811e2-a630f7d9-0b0203e8_505908f8000000010000.db4", O_RDONLY) = 22
read(22, "\0\0\0\0\1\0\0\0\0\0\0\0b1\5\0\t\0\0\0\0 \0\0\0\t\0\0\0\0\0\0"..., 512) = 512
close(22) = 0
sched_yield() = 0
futex(0x7fe74966e4fc, FUTEX_WAIT, 1, NULL

This has happened several times, on several of the consumers; all show the same signature, blocked on the futex. It stays locked and never comes back. To fix it I have to completely blow everything away and start all over again.
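
Next time one wedges like this I will try to grab full thread backtraces rather than just strace output; a minimal sketch of what I have in mind (the PID lookup and output paths are placeholders, and it assumes gdb and the debuginfo packages are installed):

# dump a stack trace for every thread of the hung slapd
pid=$(pidof ns-slapd)
gdb -p "$pid" -batch -ex "thread apply all bt" > /tmp/slapd-stacks.txt
# lighter-weight alternative, also shipped with gdb on RHEL
pstack "$pid" > /tmp/slapd-pstack.txt

That should show which thread holds the lock the futex waiter is stuck on.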

I am using the latest from EPEL:
389-ds.noarch 1.2.2-1.el6 @epel
389-ds-base.x86_64 1.2.10.12-1.el6 @epel-389-ds-base
389-ds-base-debuginfo.x86_64 1.2.10.12-1.el6 @epel-389-ds-base
389-ds-base-libs.x86_64 1.2.10.12-1.el6 @epel-389-ds-base

Today the main master that I configured locked up with the same problem. I had done a bunch of deletes (which hung), and then noticed it had marked itself read-only before I even attempted to shut it down.
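
(For what it's worth, one way to check whether a backend has actually flipped to read-only is to query its config entry; this assumes the default userRoot backend, and the bind DN is a placeholder:)

# nsslapd-readonly: on means the backend is refusing writes
ldapsearch -x -h localhost -D "cn=Directory Manager" -W \
  -b "cn=userRoot,cn=ldbm database,cn=plugins,cn=config" nsslapd-readonly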

I see other tickets here that seem related. Originally I attributed this phenomenon to not shutting down correctly, but I am going to have to put off rolling this out until I understand what is going on.

There is no error message and nothing in the logs.

It's hard to recreate and seems to happen only after a few weeks of running.

Any help is appreciated.

389-ds-bot added the closed: duplicate (Migration flag - Issue) label on Sep 12, 2020

Comment from rmeggins (@richm) at 2013-01-19 02:15:39

How many replication agreements do you have per server? Are you using SSL/TLS for replication?


Comment from mdunphy at 2013-01-19 02:51:51

Replying to [comment:1 richm]:

> How many replication agreements do you have per server? Are you using SSL/TLS for replication?

In this testing mode all the replication agreements are on one master only; I had planned on setting up the other replication agreements on the other masters after rolling out the production environment. So right now the first master (the one that hung today) has 3 multi-master replication agreements to the other masters and 5 consumer agreements to all the consumers.

We are using plain LDAP authentication, no encryption, and a simple bind password for replication.
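
For reference, this is roughly how I double-check that on each master (the bind DN is a placeholder; the nsDS5 attribute names are the standard ones):

# list host, port, and bind settings for every replication agreement
ldapsearch -x -h localhost -D "cn=Directory Manager" -W -b "cn=config" \
  "(objectClass=nsds5replicationAgreement)" \
  nsDS5ReplicaHost nsDS5ReplicaPort nsDS5ReplicaBindMethod nsDS5ReplicaTransportInfo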


Comment from mdunphy at 2013-01-19 06:01:03

I am currently running this setup on Red Hat Enterprise Linux Workstation release 6.3 (Santiago), on VMs on ESX clusters at all the sites. Except for this problem it has been fine, but it is a showstopper for me now.

I was reading the release notes for version 1.3, and some of the fixed defects might be related. Should I try 1.3?

It also states that 1.3 is not supported on EL6 and won't be. When I run the following against the test repos, I am not seeing much in the way of updates:

yum --enablerepo=epel-testing-389-ds-base --enablerepo=epel-testing update 389-admin.x86_64 389-admin-console.noarch 389-admin-console-doc.noarch 389-adminutil.x86_64 389-console.noarch 389-ds.noarch 389-ds-base.x86_64 389-ds-base-libs.x86_64 389-ds-console.noarch 389-ds-console-doc.noarch 389-dsgw.x86_64
Loaded plugins: product-id, refresh-packagekit, security, subscription-manager
Updating certificate-based repositories.
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package 389-ds-base.x86_64 0:1.2.10.12-1.el6 will be updated
---> Package 389-ds-base.x86_64 0:1.2.10.14-2.el6 will be an update
---> Package 389-ds-base-libs.x86_64 0:1.2.10.12-1.el6 will be updated
---> Package 389-ds-base-libs.x86_64 0:1.2.10.14-2.el6 will be an update

Should I look at running this directory server infrastructure on a different OS, like Fedora, so I can get 1.3?

I really don't care which Linux OS it is, so I would go with any recommendation.


Comment from mdunphy at 2013-01-21 21:00:15

I think I am running into the same issue as ticket 558.

I was doing updates every 5 seconds, round-robining through all 4 masters, after having set up all the replication agreements on all 4 masters to each other and to the consumers. I noticed that one of the masters had a high CPU load on the slapd process, so I stopped the updates and proceeded to stop dirsrv.

It took 20 minutes, but it finally stopped. strace showed this:
[root@cvor-ldap-master ~]# strace -p 5471
Process 5471 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 838473}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)

Once stopped it started cleanly and all is well.

What I had done earlier was kill -9 the process, and that caused all sorts of hate and discontent; I think it corrupted the database, since it would never start again.
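
If that happens again I will try the standard BDB recovery steps before blowing the instance away; a sketch, assuming the default EL6 layout and the db4-utils package (<instance> and the userRoot backend name are placeholders):

# with the instance stopped, replay/clean up the BDB transaction logs
service dirsrv stop <instance>
db_recover -v -h /var/lib/dirsrv/slapd-<instance>/db
# if it still will not start, export and re-import the backend
/usr/lib64/dirsrv/slapd-<instance>/db2ldif -n userRoot -a /tmp/userRoot.ldif
/usr/lib64/dirsrv/slapd-<instance>/ldif2db -n userRoot -i /tmp/userRoot.ldif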

Do you concur?
If so, I guess I need to be patient, but there is still the question of why it gets into this state in the first place.
Before running the stop command the directory server was fine, still updating and querying, albeit using 99% of one CPU.
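
For the 99%-CPU symptom, this is roughly how I plan to pin down the spinning thread next time (5471 is the PID from the strace above):

# per-thread CPU view; note the LWP id of the hot thread
top -H -p 5471
# match that LWP against gdb's thread list and grab the stacks
gdb -p 5471 -batch -ex "info threads" -ex "thread apply all bt" > /tmp/slapd-busy.txt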


Comment from mreynolds (@mreynolds389) at 2013-01-21 23:33:32

Yes, this looks exactly like ticket 558, and it has just been fixed in 1.3.1 via: https://fedorahosted.org/389/ticket/558

I was originally concerned that you said there was a hang before running the shutdown, but apparently I might have read that wrong (according to your last update).

Closing this ticket as a duplicate. If there is an outstanding issue please let me know and we can reopen this.

Thanks,
Mark


Comment from mdunphy at 2013-01-22 04:21:25

Replying to [comment:6 mreynolds389]:

> Yes, this looks exactly like ticket 558, and it has just been fixed in 1.3.1 via: https://fedorahosted.org/389/ticket/558
>
> Closing this ticket as a duplicate. If there is an outstanding issue please let me know and we can reopen this.

Thanks. So how do I get 1.3.1? Do I need to use git and compile it myself?

Also, what is meant by this on the release notes page:

> NOTE: 1.3.0 will not be available for Fedora 17 or earlier, nor for EL6 or earlier. 1.3.0 will only be available for Fedora 18 and later. We are trying to stabilize current, stable releases - upgrades to 1.3.0 will disrupt stability.

I am currently on EL6. Would it be in my best interest to use Fedora?


Comment from mreynolds (@mreynolds389) at 2013-01-22 05:53:01

Replying to [comment:7 mdunphy]:

> Thanks. So how do I get 1.3.1? Do I need to use git and compile it myself?

That's always an option. You can check out the code, switch to the branch for your version, and apply the patch to your version of 389, then build and create the RPMs. You should be able to find everything you need here:

http://directory.fedoraproject.org/wiki/Source
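
Roughly, the flow described there looks like this; treat it as a sketch (the repo URL is the current Pagure location, and the branch and commit names are illustrative, so check the wiki for the exact ones and for the build dependencies):

# clone the source and switch to the branch matching your installed version
git clone https://pagure.io/389-ds-base.git
cd 389-ds-base
git checkout 389-ds-base-1.2.10
# apply the ticket 558 fix, then do a standard autotools build
git cherry-pick <commit id of the ticket 558 fix>
autoreconf -fiv
./configure
make

From there you can make install directly, or package RPMs the way the wiki describes.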

> Also, what is meant by this on the release notes page: "NOTE: 1.3.0 will not be available for Fedora 17 or earlier, nor for EL6 or earlier. 1.3.0 will only be available for Fedora 18 and later."

That means we won't be doing any official builds on earlier platforms (Fedora 16, 17) - that doesn't stop you from doing it yourself. We are still releasing 389 (well, really it's Red Hat Directory Server) on RHEL 6.4.

> I am currently on EL6. Would it be in my best interest to use Fedora?

If you need this right away, I suggest building it yourself. If you want to do this and have any questions, just let me know.

Mark


Comment from mdunphy at 2017-02-11 22:55:18

Metadata Update from @mdunphy:

  • Issue assigned to mreynolds389
  • Issue set to the milestone: N/A
