Replicated server hangs on startup/shutdown #560
Comments
Comment from rmeggins (@richm) at 2013-01-19 02:15:39 How many replication agreements do you have per server? Are you using SSL/TLS for replication?
Comment from mdunphy at 2013-01-19 02:51:51 Replying to [comment:1 richm]:
In this testing mode, all the replication agreements are only on one master; I had planned on setting up the other replication agreements on the other masters after I had rolled out the production environment. So right now the first master (the one that hung today) has 3 multimaster replication agreements to the other masters and 5 consumer agreements to all the consumers. I'm using plain-Jane LDAP authentication, no encryption, and a simple bind password for replication.
Comment from mdunphy at 2013-01-19 06:01:03 I am currently running this setup on Red Hat Enterprise Linux Workstation release 6.3 (Santiago), on VMs on ESX clusters at all the sites. Except for this problem it has been OK, but it is a showstopper for me now. I was reading the release notes on version 1.3 and some of the defects fixed might be related; should I try 1.3? The notes also state that 1.3 is not supported on EL6 and won't be. When I do the following against the test repos:
yum --enablerepo=epel-testing-389-ds-base --enablerepo=epel-testing update
389-admin.x86_64 389-admin-console.noarch 389-admin-console-doc.noarch 389-adminutil.x86_64 389-console.noarch 389-ds.noarch 389-ds-base.x86_64 389-ds-base-libs.x86_64 389-ds-console.noarch 389-ds-console-doc.noarch 389-dsgw.x86_64
Should I look at running this directory server infrastructure on a different O.S. like Fedora? I really don't care what the O.S. is as far as Linux goes, so I would go with any recommendation.
Comment from mdunphy at 2013-01-21 21:00:15 I think I am running into the same issue as ticket 558. I was doing updates every 5 seconds, round-robining through all 4 masters, after having set up all the replication agreements on all 4 masters to each other and to the consumers, and I noticed that one of the masters had a high CPU load on the slapd process. So I stopped the updates and proceeded to stop the dirsrv; it took 20 minutes and it finally stopped (strace showed this). Once stopped it started cleanly and all is well. What I had done earlier was kill -9 the process, and that caused all sorts of hate and discontent and I think corrupted the database, since it would never start again. Do you concur?
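(A note for anyone hitting the same "would never start again" symptom after a kill -9: a forcible kill can leave the Berkeley DB environment unrecovered. The following is only a sketch of a manual recovery attempt, not an officially supported procedure; the instance name "slapd-example" is an assumption, and `db_recover` comes from the Berkeley DB utilities package (db4-utils on EL6). Take a backup first and run it only with dirsrv stopped.)

```shell
# Sketch only: "slapd-example" is a placeholder instance name.
service dirsrv stop

# Run Berkeley DB recovery against the instance's database environment
# (-h = environment home directory, -v = verbose).
db_recover -v -h /var/lib/dirsrv/slapd-example/db

service dirsrv start
```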
Comment from mreynolds (@mreynolds389) at 2013-01-21 23:33:32 Yes, this looks exactly like ticket 558, and this has just been fixed in 1.3.1 via: https://fedorahosted.org/389/ticket/558 I was originally concerned that you said there was a hang before running the shutdown, but apparently I might have read that wrong (according to your last update). Closing this ticket as a duplicate. If there is an outstanding issue, please let me know and we can reopen this. Thanks,
Comment from mdunphy at 2013-01-22 04:21:25 Replying to [comment:6 mreynolds389]:
Thanks. So how do I get 1.3.1? Do I need to use git and compile it myself? Also, what is meant by this on the release notes page: "NOTE: 1.3.0 will not be available for Fedora 17 or earlier, nor for EL6 or earlier. 1.3.0 will only be available for Fedora 18 and later. We are trying to stabilize current, stable releases - upgrades to 1.3.0 will disrupt stability." I am currently on EL6; would it be in my best interest to use Fedora?
Comment from mreynolds (@mreynolds389) at 2013-01-22 05:53:01 Replying to [comment:7 mdunphy]:
That's always an option. You can checkout the code, switch to your version of the code, and apply the patch to your version of 389. Then build and create the rpms. You should be able to find everything you need from here: http://directory.fedoraproject.org/wiki/Source
That means we won't be doing any official builds on earlier platforms (Fedora 16, 17) - that doesn't stop you from doing it. We are still releasing 389 (well, really it's Red Hat Directory Server) on RHEL 6.4. If you need this right away, I suggest building it yourself. If you want to do this and have any questions, just let me know. Mark
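(The build steps Mark outlines might look roughly like the following. This is a hedged sketch only: the repository URL, branch name, patch file name, and make target are all assumptions; the wiki page at http://directory.fedoraproject.org/wiki/Source is the authoritative reference.)

```shell
# Check out the source (URL assumed; see the Source wiki page).
git clone git://git.fedorahosted.org/389/ds.git
cd ds

# Switch to the branch matching your installed version (branch name assumed).
git checkout 389-ds-base-1.2.10

# Apply the fix attached to ticket 558 (patch file name is a placeholder).
patch -p1 < ticket-558.patch

# Build the rpms (target name assumed; follow the wiki's build instructions).
make -f rpm.mk rpms
```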
Comment from mdunphy at 2017-02-11 22:55:18 Metadata Update from @mdunphy:
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/560
Hi, I have set up a multimaster deployment with 4 masters and 5 consumers all over the world to move our old Sun LDAP 5.2 server to.
In the midst of importing and testing from the old server, I have noticed that sometimes the 389 directory server gets locked up, hangs, and will not shut down.
On startup strace shows this
Snip …
access("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb", F_OK) = 0
access("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/DBVERSION", F_OK) = 0
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/DBVERSION", O_RDONLY) = 21
read(21, "bdb/4.7/libreplication-plugin\n", 8192) = 30
close(21) = 0
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 21
brk(0x1277000) = 0x1277000
getdents(21, /* 5 entries */, 32768) = 224
open("/var/lib/dirsrv/slapd-kidc-ldap-consumer/changelogdb/8dc56082-328811e2-a630f7d9-0b0203e8_505908f8000000010000.db4", O_RDONLY) = 22
read(22, "\0\0\0\0\1\0\0\0\0\0\0\0b1\5\0\t\0\0\0\0 \0\0\0\t\0\0\0\0\0\0"..., 512) = 512
close(22) = 0
sched_yield() = 0
futex(0x7fe74966e4fc, FUTEX_WAIT, 1, NULL
This has happened several times and on several of the consumers; all have the same signature, blocked on the futex. It stays locked and doesn't ever come back. To fix it I have to completely blow everything away and start all over again.
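(One way to compare hangs like this by signature is to save an strace capture from each hung ns-slapd and pull out the blocking futex call. The sketch below uses a stand-in log file and path; in practice the capture would come from something like `strace -p <pid> -o /tmp/slapd-strace.log`.)

```shell
# Stand-in for the tail of an strace capture of a hung ns-slapd.
cat > /tmp/slapd-strace.log <<'EOF'
sched_yield()                           = 0
futex(0x7fe74966e4fc, FUTEX_WAIT, 1, NULL
EOF

# Extract the futex address and operation so separate hangs can be compared.
grep -o 'futex(0x[0-9a-f]*, FUTEX_WAIT' /tmp/slapd-strace.log
# prints: futex(0x7fe74966e4fc, FUTEX_WAIT
```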
I am using the latest from EPEL:
389-ds.noarch 1.2.2-1.el6 @epel
389-ds-base.x86_64 1.2.10.12-1.el6 @epel-389-ds-base
389-ds-base-debuginfo.x86_64 1.2.10.12-1.el6 @epel-389-ds-base
389-ds-base-libs.x86_64 1.2.10.12-1.el6 @epel-389-ds-base
Today the main master that I configured locked up and had the same problem. I had done a bunch of deletes (which hung) and then noticed it had marked itself read-only before I had attempted to shut it down.
I see other tickets on here that seem to be related. Originally I was attributing this phenomenon to not shutting down correctly, but I am going to have to put off rolling to this until I can understand what is going on.
There is no error message and nothing in the logs.
It's hard to recreate, and it seems to only happen after a few weeks of running.
Any help is appreciated.