
KeySet close memory leak #3158

Closed
raphi011 opened this issue Nov 4, 2019 · 32 comments

@raphi011 raphi011 commented Nov 4, 2019

While benchmarking the new elektrad I experienced memory leak symptoms.
The memory usage would go up after creating new handles and calling kdbGet(), but it never went down after ksClose() and kdbClose().

My first suspicion was the mmap caching - which probably also causes memory leaks, because the mmapped memory is never freed - but disabling caching by building with -DPLUGINS="ALL;-cache" did not solve my problem.

Steps to Reproduce the Problem

Requires Go >= 1.13.

I've created two tests in the go-elektra repo. Both create a test keyset with 100000 keys.

  1. TestKeySetMemory creates handles and kdbGet()s keysets in a loop; after waiting for 1 second, it immediately closes the keyset and handle again before starting over.
  2. TestKeySetMemoryWithDelayedClose also creates handles and fills keysets with kdbGet() - but delays closing the handles and keysets until all 20 keysets have been loaded. This imitates the behavior of the elektrad webserver.

Both tests wait for 20 seconds after finishing to allow the tester to see the memory consumption of the test via htop or similar tools.

The first test, which immediately closes the handle and keyset, retains the same memory footprint over the length of the test.

The second test, which starts closing the handles and keysets only after every keyset is 'loaded', never frees any memory - even after forcing garbage collection and waiting for 20 seconds.

At the moment I'm clueless why the behavior of these tests differ.

You can run the tests by cloning the go-elektra repo and running these two commands in the ./kdb subdirectory:

PKG_CONFIG_PATH=<PATH TO elektra.pc FILE> go test -run "^(TestKeySetMemory)\$"

PKG_CONFIG_PATH=<PATH TO elektra.pc FILE> go test -run "^(TestKeySetMemoryWithDelayedClose)\$"

Expected Result

Memory gets freed after kdbClose and ksClose

Actual Result

Memory does not get freed

System Information

  • Elektra Version: master
  • Operating System: Arch Linux
  • Newest go-elektra version
@raphi011 raphi011 commented Nov 4, 2019

Relates to #3140

@mpranj mpranj commented Nov 4, 2019

It seems unlikely to me that something like that would slip by our valgrind and ASAN tests. I have been running similar benchmarks in C and I could not observe the same. Even with mmap I can't observe anything similar. I'm not ruling it out though.

Are you sure that the bindings aren't leaking somewhere?

@raphi011 raphi011 commented Nov 4, 2019

Yes, I'm sure. It is possible to inspect the memory usage of a Go application at runtime; these stats do NOT contain allocations by C code, and they are nowhere near the total memory usage of the application/test, which means that the bulk of the memory is allocated in C.

My suspicion is that keys are not freed because the ref counter is != 0 even though it should be 0.

@mpranj mpranj commented Nov 4, 2019

That means that the bindings call the C API in a way that triggers this memleak. Unfortunately I do not know go, so I don’t know why this happens.

Can you maybe extract a small C example which triggers the leak? I do something very similar in benchmarks/kdb.c and there is no leak there, so it needs to do exactly what your benchmark does to trigger the leak.

@markus2330 markus2330 commented Nov 5, 2019

Thank you for reporting the issue!

I fully agree with @mpranj here, we need to reduce the problem to a minimal one; with multiple languages involved it is hard to find anything.

Can you maybe write a C program that reproduces the call sequence as issued from the go bindings? Once we have that, we can minimize the C program for a test case.

Btw. trying to run the command above I failed quite early: #3159

@markus2330 markus2330 assigned raphi011 and unassigned markus2330 Nov 5, 2019
@raphi011 raphi011 commented Nov 5, 2019

I will try to reproduce this in C by tomorrow.

@raphi011 raphi011 commented Nov 6, 2019

I successfully reproduced this issue here. Run the benchmarks/memoryleak.c benchmark and watch the memory go up while getting keysets - and NOT go down when freeing.

Don't forget to add a noticeable number of keys to the DB; I tested it with 100k keys.

@mpranj mpranj self-assigned this Nov 6, 2019
@mpranj mpranj commented Nov 6, 2019

Thank you for creating the example! I'll take a look.

At first glance, there is exactly one direct leak, you need to free the parentKey:
https://github.com/raphi011/libelektra/blob/afcbcf5c8138be8de6ba9b1a9e559bc673ff887f/benchmarks/memoryleak.c#L22

This small leak is probably not the reason for your observations.

As a side note: using the mmap cache, I notice even LESS memory consumption than without the cache. Maybe I need to add this to my thesis 😄. (cache:~200M vs. no cache:~600M)

It seems that we are "leaking" memory via dlopen() / dlclose(). The following can be observed by using kdb but NOT using kdb-static (just an excerpt):

valgrind --leak-check=full --show-leak-kinds=all ./bin/kdb ls user
[...]
==48451== 1,192 bytes in 1 blocks are still reachable in loss record 6 of 6
==48451==    at 0x483AB1A: calloc (vg_replace_malloc.c:762)
==48451==    by 0x400BB11: _dl_new_object (in /usr/lib64/ld-2.30.so)
==48451==    by 0x400642F: _dl_map_object_from_fd (in /usr/lib64/ld-2.30.so)
==48451==    by 0x4009315: _dl_map_object (in /usr/lib64/ld-2.30.so)
==48451==    by 0x400DB24: openaux (in /usr/lib64/ld-2.30.so)
==48451==    by 0x4DFE8C8: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==48451==    by 0x400DF6A: _dl_map_object_deps (in /usr/lib64/ld-2.30.so)
==48451==    by 0x4013AC3: dl_open_worker (in /usr/lib64/ld-2.30.so)
==48451==    by 0x4DFE8C8: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==48451==    by 0x401363D: _dl_open (in /usr/lib64/ld-2.30.so)
==48451==    by 0x496139B: dlopen_doit (in /usr/lib64/libdl-2.30.so)
==48451==    by 0x4DFE8C8: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
[...]

Your example opens many more handles, so it looks like this:

==48735== 72,704 bytes in 1 blocks are still reachable in loss record 8 of 8
==48735==    at 0x483880B: malloc (vg_replace_malloc.c:309)
==48735==    by 0x4F860A9: ??? (in /usr/lib64/libstdc++.so.6.0.27)
==48735==    by 0x400FD59: call_init.part.0 (in /usr/lib64/ld-2.30.so)
==48735==    by 0x400FE60: _dl_init (in /usr/lib64/ld-2.30.so)
==48735==    by 0x4013DBD: dl_open_worker (in /usr/lib64/ld-2.30.so)
==48735==    by 0x4A088C8: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==48735==    by 0x401363D: _dl_open (in /usr/lib64/ld-2.30.so)
==48735==    by 0x48C739B: dlopen_doit (in /usr/lib64/libdl-2.30.so)
==48735==    by 0x4A088C8: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==48735==    by 0x4A08962: _dl_catch_error (in /usr/lib64/libc-2.30.so)
==48735==    by 0x48C7B08: _dlerror_run (in /usr/lib64/libdl-2.30.so)
==48735==    by 0x48C7429: dlopen@@GLIBC_2.2.5 (in /usr/lib64/libdl-2.30.so)

In the example you're opening lots of KDB handles, which causes this effect to add up to significant memory consumption. If you open only one handle, the overhead should be lower. @raphi011 I don't know if you can use kdb-static for your benchmarks. When linking the benchmarks to the elektra-static library, I can no longer observe this:

==54836== HEAP SUMMARY:
==54836==     in use at exit: 0 bytes in 0 blocks
==54836==   total heap usage: 6,456,631 allocs, 6,456,631 frees, 272,753,180 bytes allocated
==54836== 
==54836== All heap blocks were freed -- no leaks are possible
==54836== 
==54836== For lists of detected and suppressed errors, rerun with: -s
==54836== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I'm currently not sure what we can do about the problems with dlopen().

@mpranj mpranj changed the title KeySet close memory leak KDB: memory problems with dynamic library `dlopen() / dlclose()` Nov 6, 2019
@mpranj mpranj changed the title KDB: memory problems with dynamic library `dlopen() / dlclose()` KDB: memory problems with dynamic library dlopen() / dlclose() Nov 6, 2019
@raphi011 raphi011 commented Nov 6, 2019

This is a simplified simulation of the new elektrad server. Since we want to support multiple users configuring the system at the same time, we NEED to support multiple handles (conflict handling etc.).

I still think this is not the root of the problem. The problem is that Keys / KeySets are not freed. If you change the parent key to "system", which has far fewer keys, you will see that memory consumption is much, much lower.

@raphi011 raphi011 commented Nov 6, 2019

@markus2330 what do you think?

@mpranj mpranj commented Nov 6, 2019

EDIT: forget this.

I don't think that we are leaking memory there. Let's reduce it to a smaller example, without elektra:
https://gist.github.com/mpranj/bbdf00af308ed3f5b3f0f35bc832756f

Using the code from the gist above, I can observe the same memory usage you are describing in htop. I don't know, maybe the OS is not pre-emptively cleaning it up, but I think we're basically doing everything we can.

@raphi011 raphi011 commented Nov 6, 2019

So you're telling me it's absolutely normal that elektra's memory consumption never shrinks?

@mpranj mpranj commented Nov 6, 2019

So you're telling me it's absolutely normal that elektra's memory consumption never shrinks?

I'm not saying that. I'm saying I observed the same regardless of elektra. I just noticed that my observation is only true when running with valgrind, so maybe valgrind is not free()ing until the end for other reasons.

@raphi011 raphi011 commented Nov 6, 2019

https://stackoverflow.com/questions/40917024/memory-leak-after-using-calloc-and-free

Seems like it's because you're changing the pointer before freeing it

@mpranj mpranj commented Nov 6, 2019

Seems like it's because you're changing the pointer before freeing it

I am not doing that, and valgrind shows no leak for the code in the gist. But forget the example; it is irrelevant, as it only fails to free immediately when run inside valgrind. Otherwise the memory is free()d pretty much immediately.

@mpranj mpranj changed the title KDB: memory problems with dynamic library dlopen() / dlclose() KeySet close memory leak Nov 6, 2019
@mpranj mpranj commented Nov 6, 2019

What's more relevant: I observed what you're reporting by simply using benchmark_createkeys, which does not use the KDB at all. There the resources are also not free()d immediately, but valgrind shows absolutely 0 leaks. I'm baffled.

@mpranj mpranj removed their assignment Nov 6, 2019
@markus2330 markus2330 commented Nov 6, 2019

@markus2330 what do you think?

With KeySets alone there should definitely be no leak whatsoever. (With KDB we cannot fully control everything, as plugins might load other libraries that leak.)

@raphi011 can you create a PR or point me to the minimal example reproducing the problem?

@raphi011 raphi011 commented Nov 6, 2019

https://github.com/raphi011/libelektra/tree/memoryleak here you go. I've extended my previous example by printing how many keys were NOT freed by ksDel because of key references > 0.

If you run the benchmark without valgrind you can see that the keys are not freed.

@markus2330 markus2330 commented Nov 7, 2019

I cannot compile this repo; I get the error:

CMake Error: Error processing file: /home/markus/Projekte/Elektra/repos/libelektra/scripts/cmake/ElektraManpage.cmake
make[2]: *** [src/bindings/intercept/env/CMakeFiles/man-kdb-elektrify-getenv.dir/build.make:61: ../doc/man/man1/kdb-elektrify-getenv.1] Fehler 1
make[1]: *** [CMakeFiles/Makefile2:17236: src/bindings/intercept/env/CMakeFiles/man-kdb-elektrify-getenv.dir/all] Fehler 2

Can you rebase to the main master please?

And please make a PR, it is much easier to see the changes, then.

@markus2330 markus2330 commented Nov 7, 2019

If you run the benchmark without valgrind you can see that the keys are not freed.

Can you copy the output of a run here?

@markus2330 markus2330 commented Nov 7, 2019

I observed what you're reporting by simply using benchmark_createkeys which does not use the KDB at all. There the resources are also not free()d immediately, but valgrind shows absolutely 0 leaks. I'm baffled.

I think we should first follow this trace, as it is much easier to understand if we only have KeySets.

Maybe we added a wrong suppression to valgrind?

@markus2330 markus2330 commented Nov 7, 2019

As a side note: using the mmap cache, I notice even LESS memory consumption than without the cache. Maybe I need to add this to my thesis 😄. (cache: ~200M vs. no cache: ~600M)

That is certainly great news.

@raphi011 raphi011 commented Nov 7, 2019

Here you go: #3172

@mpranj mpranj commented Nov 7, 2019

Maybe we added a wrong suppression to valgrind?

Valgrind reported using no suppression when I ran the tests, but your mileage may vary.

I think I found the problem during my benchmarks. For me the problem is the eager allocation of the meta KeySet for each key. After a quick test on my branch removing the eager meta KeySet allocation, memory consumption quickly goes down after ksDel(). Maybe this is fixed by #3142, as it changes the code I'm talking about.

@mpranj mpranj referenced this issue Nov 7, 2019
@markus2330 markus2330 commented Nov 8, 2019

Ok, then let us see if #3142 fixes the problem!

@raphi011 did the problem also occur before #3081?

@markus2330 markus2330 commented Nov 8, 2019

#3142 is now merged. @raphi011 can you check if the problem still occurs?

@raphi011 raphi011 commented Nov 8, 2019

Unfortunately yes

@mpranj mpranj commented Nov 8, 2019

For me it seems to at least solve/mitigate some problem. I can now run the benchmark with millions of keys using ~2GB memory, whereas it crashed before because of using >20GB memory for the same benchmark.

@raphi011 raphi011 commented Nov 8, 2019

For me it seems to at least solve/mitigate some problem. I can now run the benchmark with millions of keys using ~2GB memory, whereas it crashed before because of using >20GB memory for the same benchmark.

I haven't compared it to the previous version yet, but I'm pretty sure that it uses less memory than before. Will check today!

Memory is still not freed though.

@manuelm manuelm commented Nov 9, 2019

You need to call malloc_trim(0) after every free (read: ksDel). This forces glibc to immediately return the memory to the OS. This should fix the "weird behavior" you guys are seeing. Oh and have fun reading and digging into glibc :-)

@markus2330 markus2330 commented Nov 9, 2019

I created #3183 for further testing/fixing.

I found one memleak (in kdbGet, returned keys were not freed).

But the increasing number of "ksClose did NOT free 532 keys" is probably just the warnings collected in the parentKey. If you make NUM_RUNS higher, e.g. 100, it stagnates at some point, as the number of warnings is limited to 100. For me it peaks at "ksClose did NOT free 901 keys". Having a parentKey per handle would fix this problem.

@raphi011 raphi011 referenced this issue Nov 10, 2019
@raphi011 raphi011 commented Nov 11, 2019

Closed by #3183.

@raphi011 raphi011 closed this Nov 11, 2019