
On small VM, autotune breaks the access of the suffixes #2263

Closed
389-ds-bot opened this issue Sep 13, 2020 · 30 comments
Labels
closed: fixed Migration flag - Issue


Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49204


Issue Description

The issue is described in RH BZ 1435122.
Although the suffix files (id2entry, entryrdn...) are valid and contain the expected entries, at startup the data appear to be missing.
An error is logged:

[31/Mar/2017:14:41:44.650378747 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)

Note the failure is 100% reproducible on a small VM, but since the root cause has not been found, there is no guarantee that it can also be reproduced on a larger VM.

Package Version and Platform

DS 1.3.6
IPA 4.5

Steps to reproduce

  1. Install IPA on a small VM (1 GB of memory)

Actual results

Failure at install

Expected results

Success

@389-ds-bot 389-ds-bot added the closed: fixed Migration flag - Issue label Sep 13, 2020

Comment from firstyear (@Firstyear) at 2017-04-03 04:29:29

I think I may have just inadvertently fixed this.

My current theory is that on a small machine, because our memory detection actually works now, we are breaking the hard limit in check_and_set_import_cache, which causes ldif2db failures. As a result, setup-ds.pl then won't import the backend, which causes id2entry to not exist (thus creating issue 49198).

I'm currently fixing the issue in #1924, and perhaps you can test the patches there to see if they resolve the issue.
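The theory above can be modeled roughly in Python. This is only an illustration of the suspected failure mode, not the actual C code of check_and_set_import_cache; the function name and numbers here are hypothetical (the 2371177 KB figure appears in an import log later in this thread):

```python
def import_cache_fits(requested_kb: int, available_kb: int) -> bool:
    """Hard-limit check: an import cache request larger than the
    available memory is rejected, which aborts ldif2db."""
    return requested_kb <= available_kb

# The default autosize request, which historically assumed plenty of RAM:
requested = 2_371_177  # KB

# On a large build machine the request fits...
print(import_cache_fits(requested, 6_000_000))  # True
# ...but once memory detection honestly reports a ~1 GB VM, it no
# longer fits, ldif2db fails, and id2entry is never created.
print(import_cache_fits(requested, 1_000_000))  # False
```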


Comment from firstyear (@Firstyear) at 2017-04-03 04:29:32

Metadata Update from @Firstyear:

  • Custom field reviewstatus adjusted to new
  • Custom field type adjusted to defect


Comment from firstyear (@Firstyear) at 2017-04-05 06:07:17

Metadata Update from @Firstyear:

  • Issue assigned to Firstyear


Comment from tbordaz (@tbordaz) at 2017-04-05 13:52:57

Unfortunately the fix for #1924 does not fix #2263.

The same errors appear at startup:

-rw-------. 1 dirsrv dirsrv  57344 Apr  5 00:51 entryrdn.db
-rw-------. 1 dirsrv dirsrv 172032 Apr  5 00:51 id2entry.db

 [05/Apr/2017:00:51:33.690701632 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
 [05/Apr/2017:00:51:33.754144025 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found

and a dump of entryrdn and id2entry shows that the index and database look fine:

entryrdn
    ID: 1; RDN: "dc=testrelm,dc=test"; NRDN: "dc=testrelm,dc=test"
C1

 id2entry
id 1
    rdn: dc=testrelm,dc=test
    objectClass: top
    objectClass: domain
    objectClass: pilotObject
    ...


Comment from lkrispen (@elkris) at 2017-04-05 14:20:13

You say that it is always reproducible on a small VM:

  • did you test on the QA beaker machine only, or on a different small VM as well?
  • does it happen with earlier releases? Is it a regression?


Comment from firstyear (@Firstyear) at 2017-04-06 01:05:25

Have we got the log from setup-ds.pl too?


Comment from firstyear (@Firstyear) at 2017-04-06 01:44:50

Actually, what if we are setting the BDB cache size too small in the autotuning setup?


Comment from tbordaz (@tbordaz) at 2017-04-06 10:11:06

@elkris : The bug was not reproducible on a beaker machine (with a larger VM). I am using a QE dedicated VM to debug.
It is difficult to confirm it is a regression, but IMHO it is very likely. The failure occurs at DS startup BEFORE IPA runs an IPA upgrade procedure (./freeipa/install/updates/90-post_upgrade_plugins.update).
To verify it is a regression, we would need to run the new DS (with autotune) with the old IPA installer.

@Firstyear : setup-ds is launched without debug mode. It was successful. setup-ds is a preliminary step of the IPA install, and DS ran successfully for ~10 min before the failure was detected.


Comment from tbordaz (@tbordaz) at 2017-04-06 10:11:10

Metadata Update from @tbordaz:

  • Custom field reviewstatus reset (from new)


Comment from firstyear (@Firstyear) at 2017-04-06 10:13:33

I was walking through the shopping centre today and wondered if there is a condition by which the machine is under enough memory pressure that we autotune to 0. That would certainly cause some issues.

I'll be checking this case, but it's only a guess, not a fully investigated solution.


Comment from firstyear (@Firstyear) at 2017-04-10 05:24:41

Hey mate,

I ran a build of DS in a container with a 256 MB memory limit. I can see it OOM-killing a high-memory test:

Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [29887] 1343600009 29887     9210        0      24      509             0 zsh
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3807] 1343600009  3807    30291       29      60     3958             0 gdb
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: [ 3826] 1343600009  3826 4563422503    47969     292    78569             0 lt-test_libsds
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Memory cgroup out of memory: Kill process 3826 (lt-test_libsds) score 967 or sacrifice child
Apr 10 13:17:52 rei.prd.blackhats.net.au kernel: Killed process 3826 (lt-test_libsds) total-vm:18253690012kB, anon-rss:191876kB, file-rss:0kB, shmem-rss:0kB

Running setup-ds.pl, I can see the following warnings:

[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).
[10/Apr/2017:13:20:05.901729073 +1000] - INFO - dblayer_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[10/Apr/2017:13:20:05.917218763 +1000] - INFO - check_and_set_import_cache - pagesize: 4096, available bytes 6070214656, process usage 65798144 
[10/Apr/2017:13:20:05.935436023 +1000] - INFO - check_and_set_import_cache - Import allocates 2371177KB import cache.
[10/Apr/2017:13:20:06.102251864 +1000] - INFO - import_main_offline - import userRoot: Beginning import job...
[10/Apr/2017:13:20:06.117203867 +1000] - INFO - import_main_offline - import userRoot: Index buffering enabled with bucket size 100
[10/Apr/2017:13:20:06.327920654 +1000] - INFO - import_producer - import userRoot: Processing file "/tmp/ldifhfIEEp.ldif"
[10/Apr/2017:13:20:06.361842151 +1000] - INFO - import_producer - import userRoot: Finished scanning file "/tmp/ldifhfIEEp.ldif" (9 entries)
[10/Apr/2017:13:20:06.846415798 +1000] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[10/Apr/2017:13:20:07.065802887 +1000] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[10/Apr/2017:13:20:07.084330172 +1000] - INFO - import_main_offline - import userRoot: Cleaning up producer thread...
[10/Apr/2017:13:20:07.098956207 +1000] - INFO - import_main_offline - import userRoot: Indexing complete.  Post-processing...
[10/Apr/2017:13:20:07.111706878 +1000] - INFO - import_main_offline - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[10/Apr/2017:13:20:07.150434875 +1000] - INFO - import_main_offline - import userRoot: Generating numSubordinates complete.
[10/Apr/2017:13:20:07.170592164 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[10/Apr/2017:13:20:07.185626419 +1000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[10/Apr/2017:13:20:07.211114186 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[10/Apr/2017:13:20:07.229210532 +1000] - INFO - ldbm_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[10/Apr/2017:13:20:07.246735160 +1000] - INFO - import_main_offline - import userRoot: Flushing caches...
[10/Apr/2017:13:20:07.265159211 +1000] - INFO - import_main_offline - import userRoot: Closing files...
[10/Apr/2017:13:20:07.510799223 +1000] - INFO - dblayer_pre_close - All database threads now stopped
[10/Apr/2017:13:20:07.524294478 +1000] - INFO - import_main_offline - import userRoot: Import complete.  Processed 9 entries in 1 seconds. (9.00 entries/sec)

+[10/Apr/2017:13:20:09.122785218 +1000] - NOTICE - ldbm_back_start - found 262144k physical memory
+[10/Apr/2017:13:20:09.133959896 +1000] - NOTICE - ldbm_back_start - found 64197k avaliable
+[10/Apr/2017:13:20:09.150539284 +1000] - NOTICE - ldbm_back_start - cache autosizing: db cache: 10485k
+[10/Apr/2017:13:20:09.169102140 +1000] - NOTICE - ldbm_back_start - cache autosizing: userRoot entry cache (1 total): 65536k
+[10/Apr/2017:13:20:09.188007215 +1000] - WARN - entrycache_set_max_size - Cachesize (47374540) may use more than the available physical memory.
+[10/Apr/2017:13:20:09.202063252 +1000] - NOTICE - ldbm_back_start - total cache size: 92476014 B; 
+[10/Apr/2017:13:20:09.213497617 +1000] - WARN - ldbm_back_start - It is highly likely your memory configuration of all backends will EXCEED your systems memory.
+[10/Apr/2017:13:20:09.228089373 +1000] - WARN - ldbm_back_start - In a future release this WILL prevent server start up. You MUST alter your configuration.
+[10/Apr/2017:13:20:09.246454204 +1000] - WARN - ldbm_back_start - Total entry cache size: 49303756 B; dbcache size: 8589934 B; available memory size: 65738342 B; 
+[10/Apr/2017:13:20:09.264939787 +1000] - WARN - ldbm_back_start - This can be corrected by altering the values of nsslapd-dbcachesize, nsslapd-cachememsize and nsslapd-dncachememsize
+
+[10/Apr/2017:13:20:09.280016381 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (8589934 bytes).
+[10/Apr/2017:13:20:09.297357115 +1000] - INFO - dblayer_start - Resizing db cache size: 2428085862 -> 8589934
+[10/Apr/2017:13:20:10.573234982 +1000] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 389 for LDAP requests

However, it looks like the server starts, with all entries added.

The only bit I'm concerned about here is:

[10/Apr/2017:13:20:05.856337382 +1000] - WARN - dblayer_start - Likely CONFIGURATION ERROR -dbcachesize is configured to use more than the available physical memory, decreased to the largest available size (0 bytes).

This would be during ldif2db in setup-ds.pl. However, the server still starts correctly.

I'll make a patch to prevent 0 from being set on these values, but this may not be the cause of the issue.
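The guard being proposed can be sketched in Python. This is a minimal model, not the actual patch; the floor value and function name are assumptions for illustration:

```python
MIN_CACHE_BYTES = 512 * 1024  # hypothetical lower bound, not the patch's real value

def reduce_to_available(requested: int, available: int) -> int:
    """Shrink an oversized cache request to what fits in available
    memory, but never let the result collapse to 0."""
    size = min(requested, available)
    if size >= MIN_CACHE_BYTES:
        return size
    # Under heavy pressure, fall back to the floor (or the original
    # request if it was already smaller than the floor).
    return min(requested, MIN_CACHE_BYTES)

# Without the floor, available ~ 0 would reproduce the
# "decreased to the largest available size (0 bytes)" warning above.
print(reduce_to_available(2_428_085_862, 0))  # 524288, not 0
```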


Comment from firstyear (@Firstyear) at 2017-04-10 05:49:05

[10/Apr/2017:13:38:44.393226936 +1000] - INFO - check_and_set_import_cache - Import allocates 6211KB import cache.

0001-Ticket-49204-Fix-lower-bounds-on-import-autosize.patch


Comment from firstyear (@Firstyear) at 2017-04-10 05:49:40

Metadata Update from @Firstyear:

  • Custom field reviewstatus adjusted to review


Comment from mreynolds (@mreynolds389) at 2017-04-11 01:41:50

Metadata Update from @mreynolds389:

  • Custom field reviewstatus adjusted to ack (was: review)


Comment from tbordaz (@tbordaz) at 2017-04-11 08:09:19

The patch looks good but note that it does not fully fix the problem. It keeps occurring

[11/Apr/2017:02:57:59.416666948 +051800] - ERR - _entryrdn_insert_key - Suffix "dc=testrelm,dc=test" not found: BDB0073 DB_NOTFOUND: No matching key/data pair found(-30988)
[11/Apr/2017:02:57:59.492313735 +051800] - ERR - index_addordel_entry - database index operation failed BAD 1031, err=-30988 BDB0073 DB_NOTFOUND: No matching key/data pair found


Comment from tbordaz (@tbordaz) at 2017-04-11 16:59:57

The root cause is possibly identified: if at startup memory pressure prevents setting the entry cache size, this entry cache remains NULL.

Running long-duration tests (the problem used to occur 50%-75% of the time) with the following patch.

0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch

This is likely NOT the definitive fix.


Comment from firstyear (@Firstyear) at 2017-04-12 01:57:09

So this fixes the problem you have seen, @tbordaz?

I would be curious to see the value of delta in this log output. I think the idea of this change is good, but the catch is that this is where the user sets cachememsize via cn=config. So we check the delta of val - current size to see if we can accommodate the difference, because we already consume current_size in memory used (*mi). So really, delta is valid if it's 0, because you aren't changing the size at all, and for the sanity check to be carried out, delta must be a positive value relative to the current size.

Second is that this code is only called from a change to cn=config, so I'm still not 100% sure it's the fix.

Perhaps the real fix is that util_is_cachesize_sane shouldn't be reducing the size of the allocation so much, and we should check its output. I.e. if the request is, say, 1 MB, we always allow it, and if it's greater than that, we have enough room to actually do a reduction. We should also check that the result is never 0.
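That suggested behaviour can be sketched as a Python model. The real util_is_cachesize_sane is C code with a different signature; the return shape and the 1 MB floor here are assumptions taken from the comment above, not the actual API:

```python
FLOOR = 1 << 20  # 1 MB: requests at or below this are always allowed

def is_cachesize_sane(request: int, free_mem: int):
    """Return (sane, size): small requests pass untouched; larger
    ones may be reduced toward free memory, but never below the
    floor and never to 0."""
    if request <= FLOOR:
        return True, request
    if request <= free_mem:
        return True, request
    # Oversized: reduce, but clamp the reduction so it cannot hit 0.
    return False, max(free_mem, FLOOR)

print(is_cachesize_sane(512 * 1024, 0))             # (True, 524288)
print(is_cachesize_sane(2_428_085_862, 8_589_934))  # (False, 8589934)
```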


Comment from firstyear (@Firstyear) at 2017-04-12 02:38:08

Hey @tbordaz. I've had a look at your fix and tweaked it a bit. I squashed all three patches together into this single patch, so I hope that it helps and fixes the issue for you.

0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch


Comment from tbordaz (@tbordaz) at 2017-04-12 10:46:39

Long duration tests completed successfully with https://fedorapeople.org/groups/389ds/github_attachments/6f730599b09c6e098488cb1601550e7160e2918deafcc482c74323a86ab89b2c-0001-Ticket-49204-On-small-VM-autotune-breaks-the-access-.patch

I will check if the error came from entrycache being zeroed or uninitialized


Comment from tbordaz (@tbordaz) at 2017-04-12 18:30:00

The entrycache.size is not uninitialized; it can be 0 or some specific value (like 512K). So I think the error comes from the failure (LDAP_UNWILLING_TO_PERFORM) when parsing dse.ldif. The failure itself is triggered by memory pressure, but I do not have a clear understanding of how the failure (unwilling) can lead to this weird error (suffix entries not readable).

Testing https://fedorapeople.org/groups/389ds/github_attachments/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch


Comment from tbordaz (@tbordaz) at 2017-04-13 10:07:23

Testing https://fedorapeople.org/groups/389ds/github_attachments/754f2ef07452640a3b5af46a9ae95ec07537d03e62b9f4937b371b79a9753dcf-0001-Ticket-49204-Fix-lower-bounds-on-import-autosize-On-.patch

Tests are inconclusive: the patch was successful for the first 18 runs, but afterwards DS crashed systematically for the remaining runs. Unfortunately ABRT did not dump the core because of the memory pressure.

In conclusion:

  • The patch looks good to fix the current ticket, because of the 18 successes in a row
  • The patch may introduce another problem, or we may have an environment issue (memory pressure too high). Note that the available memory is 1 GB + 1.5 GB swap.
  • I will rerun tests


Comment from firstyear (@Firstyear) at 2017-04-13 10:28:04

How about we take the patch as it currently is, and I try running DS in a container with, say, 64 MB of RAM available? That should provide a better environment for me to reproduce in.


Comment from tbordaz (@tbordaz) at 2017-04-13 18:38:41

The failures were triggered by IPC resources being exhausted by apache. Doing regular cleanup, the tests (first 25 runs) were successful.
IMHO the fix is ok. ACK


Comment from firstyear (@Firstyear) at 2017-04-18 01:18:49

commit ae44ed4
To ssh://git@pagure.io/389-ds-base.git
8979cc6..ae44ed4 master -> master


Comment from firstyear (@Firstyear) at 2017-04-18 01:18:58

Metadata Update from @Firstyear:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)


Comment from mreynolds (@mreynolds389) at 2017-04-18 16:04:56

Fix conflicting types in header file:

ae44ed4..553d7dd master -> master


Comment from mreynolds (@mreynolds389) at 2017-04-18 16:57:03

0001-Issue-49204-Fix-32bit-arch-build-failures.patch


Comment from mreynolds (@mreynolds389) at 2017-04-18 16:57:08

Metadata Update from @mreynolds389:

  • Custom field reviewstatus adjusted to review (was: ack)


Comment from firstyear (@Firstyear) at 2017-04-18 23:56:18

Metadata Update from @Firstyear:

  • Custom field reviewstatus adjusted to ack (was: review)


Comment from mreynolds (@mreynolds389) at 2017-04-19 13:11:34

Thanks William!

553d7dd..86931be master -> master
