Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

out of memory with many KDB handles #3140

Closed
raphi011 opened this issue Oct 31, 2019 · 25 comments
Assignees

Comments

@raphi011
Copy link
Contributor

@raphi011 raphi011 commented Oct 31, 2019

While running my benchmarks I somehow broke elektra.

When I run a kdb command like kdb ls /

I get the following message (also when calling the elektra library directly)

Sorry, I crashed by the signal SIGSEGV
This should not have happened!

Please report the issue at https://issues.libelektra.org/
fish: 'kdb ls /' terminated by signal SIGABRT (Abort)

I've already run kdb reset and kdb reset-elektra and re-compiled and uninstalled + installed kdb.
Any ideas how I could fix / debug this?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

Thank you for reporting. Can you attach a trace, maybe run valgrind kdb ls /? Some information on how to reproduce it would also be helpful.

@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Oct 31, 2019

==110109== Memcheck, a memory error detector
==110109== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==110109== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==110109== Command: kdb ls /
==110109== 
==110109== Invalid read of size 8
==110109==    at 0x52778E8: updatePointers (mmapstorage.c:948)
==110109==    by 0x52778E8: libelektra_mmapstorage_LTX_elektraPluginget (mmapstorage.c:1148)
==110109==    by 0x5270711: elektraCacheGet (cache.c:321)
==110109==    by 0x48E6849: elektraCacheLoad (kdb.c:825)
==110109==    by 0x48E6849: kdbGet (kdb.c:1038)
==110109==    by 0x174EA0: get (kdb.hpp:185)
==110109==    by 0x174EA0: get (kdb.hpp:168)
==110109==    by 0x174EA0: Cmdline::Cmdline(int, char**, Command*) (cmdline.cpp:310)
==110109==    by 0x1696F6: main (main.cpp:170)
==110109==  Address 0x5285d10 is not stack'd, malloc'd or (recently) free'd
==110109== 

Sorry, I crashed by the signal SIGSEGV
This should not have happened!

Please report the issue at https://issues.libelektra.org/
==110109== 
==110109== Process terminating with default action of signal 6 (SIGABRT): dumping core
==110109==    at 0x4CDEF25: raise (in /usr/lib/libc-2.30.so)
==110109==    by 0x4CC8896: abort (in /usr/lib/libc-2.30.so)
==110109==    by 0x1DF04B: catchSignal(int) (main.cpp:110)
==110109==    by 0x4CDEFAF: ??? (in /usr/lib/libc-2.30.so)
==110109==    by 0x52778E7: updatePointers (mmapstorage.c:946)
==110109==    by 0x52778E7: libelektra_mmapstorage_LTX_elektraPluginget (mmapstorage.c:1148)
==110109==    by 0x5270711: elektraCacheGet (cache.c:321)
==110109==    by 0x48E6849: elektraCacheLoad (kdb.c:825)
==110109==    by 0x48E6849: kdbGet (kdb.c:1038)
==110109==    by 0x174EA0: get (kdb.hpp:185)
==110109==    by 0x174EA0: get (kdb.hpp:168)
==110109==    by 0x174EA0: Cmdline::Cmdline(int, char**, Command*) (cmdline.cpp:310)
==110109==    by 0x1696F6: main (main.cpp:170)
==110109== 
==110109== HEAP SUMMARY:
==110109==     in use at exit: 173,723 bytes in 1,005 blocks
==110109==   total heap usage: 2,915 allocs, 1,910 frees, 437,331 bytes allocated
==110109== 
==110109== LEAK SUMMARY:
==110109==    definitely lost: 0 bytes in 0 blocks
==110109==    indirectly lost: 0 bytes in 0 blocks
==110109==      possibly lost: 0 bytes in 0 blocks
==110109==    still reachable: 173,723 bytes in 1,005 blocks
==110109==         suppressed: 0 bytes in 0 blocks
==110109== Rerun with --leak-check=full to see details of leaked memory
==110109== 
==110109== For lists of detected and suppressed errors, rerun with: -s
==110109== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
fish: 'valgrind kdb ls /' terminated by signal SIGABRT (Abort)

@mpranj mpranj assigned mpranj and unassigned markus2330 Oct 31, 2019
@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Oct 31, 2019

Seems to be an issue with mmap. After running rm -rf ~/.cache/elektra it works again

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

Yes, the crash definitely happened because of cache / mmap. Can you explain what you did to trigger this?

@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Oct 31, 2019

I ran my elektrad benchmarks, after clearing the cache and re-running it didn't break again. If this occurs again I will mention it in this issue.

@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Oct 31, 2019

@mpranj a big thanks that you take care of this issue.

Why are there casts to char* in the segfaulting code?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

Why are there casts to char* in the segfaulting code?

Pretty sure the cast is not the problem. It's used for byte-wise pointer arithmetic. Before storage we subtract the old base address, making all pointers relative to the base address of the mmap region. After mmaping again, we update the pointers to the new base mmap address by adding the new address to all pointers.

@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Oct 31, 2019

Ok. Do you have an idea how this problem was caused?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

Do you have an idea how this problem was caused?

Sorry but currently no. This code is basically executed on each kdbGet now, so it's not like a seldom taken code path.

Corrupting the cache file in exactly the right spot might cause something like that (like a wrong ks->size or similar).

@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Oct 31, 2019

Are the operations on the cache files atomic? Did you test aborting the process while writing the cache?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

There are no testcases for such a scenario, but the cache file is written to a temp file and only rename()d on success.

EDIT: Additionally, the magic footer of the mmap file woud prevent us from reading an unfinished file. The footer should be the last write. I could enforce it via an additional msync(), but it might slow things down.

@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Oct 31, 2019

Additionally, the magic footer of the mmap file woud prevent us from reading an unfinished file.

So most probably it was something else.

I could enforce it via an additional msync()

The sync/msync would be the correct thing to do (also on the folder where you rename, see sync plugin).

but it might slow things down.

As it only impacts writing, you do not need to worry.

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Oct 31, 2019

Just to clarify: msync is done already. This would just mean another msync to enforce order of the writes. Let's do it and see what it does in the benchmarks.

Still: It's unclear that this is the reason for the segfault here.

@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Oct 31, 2019

@raphi011 you do not have the cache file anymore to reproduce this issue? Did you do some upgrade of Elektra or similar?

mpranj added a commit to mpranj/libelektra that referenced this issue Nov 1, 2019
… prevents ElektraInitiative#3140
mpranj added a commit to mpranj/libelektra that referenced this issue Nov 3, 2019
… prevents ElektraInitiative#3140
@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Nov 3, 2019

@raphi011 you do not have the cache file anymore to reproduce this issue? Did you do some upgrade of Elektra or similar?

Unfortunately not ..

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Nov 3, 2019

The commits referencing this issue are a shot in the dark. The assumption was that the OS reordered some writes from mmap and mmapstorage did not correctly detect incompletely written data, which was possible before those commits.

Even having the cache files would probably not help as there is no way to know whether some data was not completely written.

@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Nov 3, 2019

What is the actual lifetime of the keys in mmap storage? While benchmarking the memory of my server is growing and growing even though I'm ksClearing all keysets after it's usage - until it get's killed by the OS because it uses too much memory.

When is memory that is allocated with mmap freed?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Nov 3, 2019

When is memory that is allocated with mmap freed?

The answer is never. This was a design decision discussed with @markus2330. The memory pages are shared and the OS implements COW via mmap. This means depending on the usage it is possible to have a KeySet open which is larger than the available main memory.

The alternative back then was to implement reference counting to unmap the mmaped files which needed some tricky implementation in the core. (Think of another KeySet referencing the keys from a mmaped KeySet, it is tricky to know when it can be really unmapped and basically out of scope for my thesis)

Maybe this is the reason why you had incomplete mmap cache files. 🤔

@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Nov 3, 2019

@markus2330 this is probably not the right place to discuss this .. but how is this supposed to work with elektrad that has long lived handles? After a session expires a new handle + keyset is created and the old one is closed - but the memory is not freed which results in the OS killing the process after a while (which happens pretty fast in the benchmarks).

This probably means I have to disable mmap somehow for elektrad .. is this possible?

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Nov 3, 2019

The OS can free up memory from mmap that has not been changed, so we "only" leak the pages that are changed COW. But you're right, we might need to re-evaluate this decision.

This probably means I have to disable mmap somehow for elektrad .. is this possible?

Easy: kdb cache disable. The cache is stored on a per-system-user basis, so this needs to be also done on a per-user basis.
If you need to completely disable it then compile without the cache plugin
(e.g. -DPLUGINS="ALL;-cache").

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Nov 3, 2019

The alternative back then was to implement reference counting to unmap the mmaped files which needed some tricky implementation in the core. (Think of another KeySet referencing the keys from a mmaped KeySet, it is tricky to know when it can be really unmapped and basically out of scope for my thesis)

@markus2330 One problem was/is that unmapping would have to be implemented in the core, and mmap() is not part of the C standards, but let's assume this is no problem. Then unmapping can be done somewhat easily in the core if ksAppend() really copies the keys and does not simply reference them. Referencing mmaped Keys was afair the biggest problem with this unmapping approach. I'm not sure if we can/may change the semantics of ksAppend() though.

@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Nov 3, 2019

I have disabled the cache plugin completely - as described above via -DPLUGINS="ALL;-cache" and somehow memory is STILL not freed after calling ksDel

@mpranj

This comment has been minimized.

Copy link
Member

@mpranj mpranj commented Nov 3, 2019

If you compiled Elektra cleanly with the cmake option -DPLUGINS="ALL;-cache" (verify by cmake output "Exclude Plugin cache because explicitly excluded"), then I see no way how mmap could affect your results, except if you explicitly enabled mmapstorage as default storage (again, can be excluded by -DPLUGINS="ALL;-cache;-mmapstorage" ).

If you're sure we're leaking somewhere, then you can always run valgrind --leak-check=full /your/kdb/tests and see which source file has leaked memory (make sure you have debug symbols, which is default anyway: CMAKE_BUILD_TYPE: RelWithDebInfo).

@markus2330 markus2330 changed the title kdb crashes on startup out of memory with many KDB handles Nov 4, 2019
@markus2330

This comment has been minimized.

Copy link
Contributor

@markus2330 markus2330 commented Nov 4, 2019

@mpranj wrote:

and basically out of scope for my thesis

I fully agree. For now we only find ways of how to disable the cache when it does not work. I created #3155 and #3156

@raphi011 wrote:

this is probably not the right place to discuss this

The issue tracker is the right place. But please always make sure to either rename the title (if the previous discussion is finished) or open new issues.

which happens pretty fast in the benchmarks

Can you tell us about how many KDB handles (sessions) we are speaking here? Maybe this is only a synthetic (benchmark) problem anyway. In real-world scenarios not more than a few handles are expected to be alive at the same time. (elektrad is there to configure a single system.)

even though I'm ksClearing all keysets after it's usage
I have disabled the cache plugin completely - as described above via -DPLUGINS="ALL;-cache" and somehow memory is STILL not freed after calling ksDel

ksClear is internal and afaik there are no tests if it is safe to reuse the keyset afterwards. So maybe you already lost pointers to allocated memory within ksClear. If this is the case, please open a new issue. (Lets keep this issue to discuss the problems with mmap and too many handles.)

mpranj added a commit to mpranj/libelektra that referenced this issue Nov 5, 2019
… prevents ElektraInitiative#3140
@raphi011

This comment has been minimized.

Copy link
Contributor Author

@raphi011 raphi011 commented Nov 19, 2019

I will close this issue since a fix has been added. If this happens again please feel free to reopen it.

@raphi011 raphi011 closed this Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.