On small VM, autotune breaks the access of the suffixes #2263
Comments
Comment from firstyear (@Firstyear) at 2017-04-03 04:29:29 I think I may have just inadvertently fixed this. My current theory is that on a small machine, because our memory detection actually works now, we are breaking the hard limit in check_and_set_import_cache, which causes ldif2db failures. As a result, setup-ds.pl then won't import the backend, which causes id2entry to not exist (thus creating 49198). I'm currently fixing the issue in #1924, and perhaps you can test the patches there to see if they resolve the issue. |
Comment from firstyear (@Firstyear) at 2017-04-03 04:29:32 Metadata Update from @Firstyear:
|
Comment from firstyear (@Firstyear) at 2017-04-05 06:07:17 Metadata Update from @Firstyear:
|
Comment from tbordaz (@tbordaz) at 2017-04-05 13:52:57 Unfortunately the fix for #1924 does not fix #2263. The same errors appear at startup
and a dump of entryrdn and id2entry shows that the index and db look fine
|
Comment from lkrispen (@elkris) at 2017-04-05 14:20:13 you say that it is always reproducible on a small VM,
|
Comment from firstyear (@Firstyear) at 2017-04-06 01:05:25 Have we got the log from setup-ds.pl too? |
Comment from firstyear (@Firstyear) at 2017-04-06 01:44:50 Actually, what if it's that we are setting the BDB cache size too small in the autotuning setup? |
Comment from tbordaz (@tbordaz) at 2017-04-06 10:11:06 @elkris : The bug was not reproducible on a beaker machine (with a larger VM). I am using a QE dedicated VM to debug. @Firstyear : setup-ds is launched without debug mode. It was successful. setup-ds is a preliminary step of the IPA install, and DS ran successfully for ~10min before the failure was detected |
Comment from tbordaz (@tbordaz) at 2017-04-06 10:11:10 Metadata Update from @tbordaz:
|
Comment from firstyear (@Firstyear) at 2017-04-06 10:13:33 I was walking through the shopping centre today and I wondered if there was a condition by which the machine is under enough memory pressure that we autotune to 0. That would certainly cause some issues. I'll be checking this case, but it's only a guess, not a really investigated solution. |
Comment from firstyear (@Firstyear) at 2017-04-10 05:24:41 Hey mate, I ran a build of DS in a container with a 256M memory limit. I can see it OOMing a highmemory test:
Running setup-ds.pl, I can see the following warnings:
However, it looks like the server starts, with all entries added: The only bit I'm concerned about here is:
This would be during ldif2db in setup-ds.pl. However, the server still starts correctly. I'll make a patch to prevent 0 from being set on these values, but this may not be the cause of the issue. |
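The guard mentioned above (never letting an autotuned value reach 0) can be sketched as a simple clamp. This is only a minimal illustration, not the patch attached to this ticket: the function name `sketch_autotune_cache_bytes`, the percentage parameter, and the 512 KiB floor are assumptions made for the example and are not taken from the 389-ds source.

```c
#include <stdint.h>

/* Hypothetical floor for the example; not a value from the 389-ds source. */
#define SKETCH_MIN_CACHE_BYTES ((uint64_t)512 * 1024)

/*
 * Return `percent` of the currently available memory as a cache size,
 * clamped so the result is never below SKETCH_MIN_CACHE_BYTES.
 * Under heavy memory pressure available_bytes may be tiny or 0; without
 * the clamp the tuned size would then collapse to 0.
 */
uint64_t
sketch_autotune_cache_bytes(uint64_t available_bytes, unsigned percent)
{
    if (percent > 100) {
        percent = 100;
    }
    /* Divide before multiplying so very large inputs cannot overflow. */
    uint64_t tuned = (available_bytes / 100) * percent;
    if (tuned < SKETCH_MIN_CACHE_BYTES) {
        tuned = SKETCH_MIN_CACHE_BYTES;
    }
    return tuned;
}
```

With this shape, a call such as `sketch_autotune_cache_bytes(0, 10)` yields the floor rather than 0, so later code never sees a zero-sized cache.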
Comment from firstyear (@Firstyear) at 2017-04-10 05:49:05
|
Comment from firstyear (@Firstyear) at 2017-04-10 05:49:40 Metadata Update from @Firstyear:
|
Comment from mreynolds (@mreynolds389) at 2017-04-11 01:41:50 Metadata Update from @mreynolds389:
|
Comment from tbordaz (@tbordaz) at 2017-04-11 08:09:19 The patch looks good but note that it does not fully fix the problem. It keeps occurring
|
Comment from tbordaz (@tbordaz) at 2017-04-11 16:59:57 The root cause is possibly identified: if memory pressure at startup prevents the entrycache size from being set, the entrycache remains NULL. I am running long duration tests (the problem used to occur 50%-75% of the time) with the following patch. This is likely NOT the definitive fix |
Comment from firstyear (@Firstyear) at 2017-04-12 01:57:09 So this fixes the problem you have seen, @tbordaz? I would be curious to see the value of delta in this log output. I think the idea of this change is good, but the catch is that this is where the user sets the cachememsize via cn=config. So we check the delta of val - current size to see if we can accommodate the difference, because we already consume current_size in memory used (*mi). So really, delta is valid if it's 0, because you aren't changing the size at all, and for the sanity check to be carried out, delta must be a positive value relative to current size. Second, this code is only called from a change to cn=config, so I'm still not 100% sure it's the fix. Perhaps the real fix is that util_is_cachesize_sane shouldn't be reducing the size of the allocation so much, and we should check its output. I.e. if the request is, say, 1 MB, we always allow it, and if it's greater than that we have enough room to actually do a reduction. We should also check that the result is never 0. |
Comment from firstyear (@Firstyear) at 2017-04-12 02:38:08 Hey @tbordaz . I've had a look at your fix and tweaked it a bit. I squashed all three together to this single patch, so I hope that it helps and fixes the issue for you. |
Comment from tbordaz (@tbordaz) at 2017-04-12 10:46:39 Long duration tests completed successfully with https://fedorapeople.org/groups/389ds/github_attachments/6f730599b09c6e098488cb1601550e7160e2918deafcc482c74323a86ab89b2c-0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch I will check if the error came from entrycache being zeroed or uninitialized |
Comment from tbordaz (@tbordaz) at 2017-04-12 18:30:00 The entrycache.size is not uninitialized; it can be 0 or some specific value (like 512K). So I think the error comes from the failure (LDAP_UNWILLING_TO_PERFORM) when parsing dse.ldif. The failure itself is triggered by memory pressure, but I do not have a clear understanding of how the failure (unwilling) can lead to this weird error (suffix entries not readable). |
Comment from tbordaz (@tbordaz) at 2017-04-13 10:07:23 Tests are inconclusive: the patch was successful for the first 18 runs, but after that DS crashed systematically for the remaining runs. Unfortunately ABRT did not dump the core because of the memory pressure. In conclusion:
|
Comment from firstyear (@Firstyear) at 2017-04-13 10:28:04 How about we take the patch as it currently is, and I try running DS in a container with say 64mb of ram available? That should provide a better env for me to reproduce in? |
Comment from tbordaz (@tbordaz) at 2017-04-13 18:38:41 The failures were triggered by IPC being exhausted by apache. With regular cleanup, the tests (first 25 runs) were successful. |
Comment from firstyear (@Firstyear) at 2017-04-18 01:18:49 commit ae44ed4 |
Comment from firstyear (@Firstyear) at 2017-04-18 01:18:58 Metadata Update from @Firstyear:
|
Comment from mreynolds (@mreynolds389) at 2017-04-18 16:04:56 Fix conflicting types in header file: |
Comment from mreynolds (@mreynolds389) at 2017-04-18 16:57:03 |
Comment from mreynolds (@mreynolds389) at 2017-04-18 16:57:08 Metadata Update from @mreynolds389:
|
Comment from firstyear (@Firstyear) at 2017-04-18 23:56:18 Metadata Update from @Firstyear:
|
Comment from mreynolds (@mreynolds389) at 2017-04-19 13:11:34 Thanks William! |
Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49204
Issue Description
The issue is described in RH BZ 1435122.
Although the suffix files (id2entry, entryrdn...) are valid and contain the expected entries, at startup the data appear to be missing.
An error is logged:
Note that the failure is 100% reproducible on a small VM, but since the root cause has not been found, there is no guarantee that it is also reproducible on a larger VM
Package Version and Platform
DS 1.3.6
IPA 4.5
Steps to reproduce
Actual results
Failure at install
Expected results
Success