objects and memory leaks #3547

Closed
389-ds-bot opened this issue Sep 13, 2020 · 8 comments

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/50490

  • Created at 2019-07-11 16:44:39 by lkrispen (@elkris)
  • Closed at 2020-02-26 16:14:41 as fixed
  • Assigned to nobody

While investigating memory leaks I came across one type of leak involving objects, which also raises the question of the safety of the object implementation.

The leak is (only part of it is shown here; there are many more allocations inside replica_new):

 Indirect leak of 272 byte(s) in 1 object(s) allocated from:
     0 0x7f6780d19210 in calloc (/lib64/libasan.so.5+0xef210)
     1 0x7f678050d8d7 in slapi_ch_calloc /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/ch_malloc.c:175
     2 0x7f67713b106e in replica_new_from_entry /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:159
     3 0x7f67713b0e0c in replica_new /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:131
     4 0x7f67713a16be in multimaster_mtnode_construct_replicas /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_mtnode_ext.c:77
     5 0x7f67713a097a in multimaster_start /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_init.c:798
     6 0x7f6780622b22 in plugin_call_func /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:2034
     7 0x7f6780622787 in plugin_call_one /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1983
     8 0x7f67806214c4 in plugin_dependency_startall /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1737
     9 0x7f6780622716 in plugin_startall /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1955
     10 0x4483e1 in main /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/main.c:1153
     11 0x7f677cf8211a in __libc_start_main (/lib64/libc.so.6+0x2311a)

So a replica is created by a call to replica_new() and then added as an object to the mapping tree:

 ext->replica = object_new(r, replica_destroy);

It provides a destructor, replica_destroy(), but this is only called when object_release() drops the refcount from 1 to 0.
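
For context, the intended discipline looks roughly like the following sketch of caller code (hypothetical; it only uses the existing object API, and the leak reported above means one of the releases in a pattern like this is never reached):

 /* Sketch of the intended refcount balance (hypothetical caller code,
  * using the existing Object API from the server). */
 Object *ro = object_new(r, replica_destroy); /* refcnt == 1, owner holds it */

 object_acquire(ro);   /* a consumer takes a reference, refcnt == 2 */
 /* ... use the replica ... */
 object_release(ro);   /* consumer done, refcnt == 1 */

 object_release(ro);   /* final release: refcnt drops to 0, replica_destroy()
                        * runs and the Object itself is freed.  If any acquire
                        * is never paired with a release, this point is never
                        * reached and the replica leaks. */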

It is hard to find where the balance of object_acquire() and object_release() gets broken.
I did some investigation and tentative fixes, adding calls to object_release() in places where I thought they were missing - with mixed success. In some test scenarios the leak disappeared, in others the refcount at the last call of object_release() was only reduced to 2 (one release still missing), and in others I got a crash because the object was already freed before the last call.
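
One way to track such an imbalance down (a hypothetical debugging aid, not something in the tree) is to wrap the calls in macros that log the call site, so the missing release can be read off the log:

 /* Hypothetical tracing wrappers around the existing API; OBJ_ACQUIRE and
  * OBJ_RELEASE are made-up names.  Logging before the release avoids
  * touching the object after it may have been freed. */
 #include <stdio.h>

 #define OBJ_ACQUIRE(o)                                              \
     do {                                                            \
         object_acquire(o);                                          \
         fprintf(stderr, "acquire %p at %s:%d\n", (void *)(o),       \
                 __FILE__, __LINE__);                                \
     } while (0)

 #define OBJ_RELEASE(o)                                              \
     do {                                                            \
         fprintf(stderr, "release %p at %s:%d\n", (void *)(o),       \
                 __FILE__, __LINE__);                                \
         object_release(o);                                          \
     } while (0)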

So there is definitely a mess in our acquire/release balance, but investigating this I also have doubts about how safe our object implementation is.
The only protection is a refcount that prevents freeing an object while it is in use, but there are conditions where it can go wrong.

First, look at object_acquire():

 void
 object_acquire(Object *o)
 {
     PR_ASSERT(NULL != o);
     slapi_atomic_incr_64(&(o->refcnt), __ATOMIC_RELEASE);
 }

There is no guarantee that the object is still valid when this is called. We have, e.g., references in the prp replication protocol struct, which are then used again and again by the replication protocol.
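
To make the hazard concrete, here is a hypothetical fragment (not code from the tree; the struct and function are stand-ins for the protocol private data): the cached pointer may already have seen its final release on another thread by the time it is acquired again:

 /* Hypothetical illustration of acquiring through a cached pointer.
  * Object, Replica and object_acquire/get_data/release come from the
  * server headers; prot_private and use_replica are made up. */
 struct prot_private {
     Object *replica_object;   /* raw pointer cached when the protocol started */
 };

 static void
 use_replica(struct prot_private *prp)
 {
     /* Nothing proves the object is still alive here.  If another thread
      * already did the final object_release(), this acquire increments a
      * counter in freed memory and the object_get_data() below is a
      * use-after-free. */
     object_acquire(prp->replica_object);
     Replica *r = (Replica *)object_get_data(prp->replica_object);
     /* ... use r ... */
     object_release(prp->replica_object);
 }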

And looking at object_release() we see that there could be race conditions not protected by the refcount:

 void
 object_release(Object *o)
 {
     PRInt32 refcnt_after_release;

     PR_ASSERT(NULL != o);
     refcnt_after_release = slapi_atomic_decr_64(&(o->refcnt), __ATOMIC_ACQ_REL);
     if (refcnt_after_release == 0) {
         /* Object can be destroyed */
         if (o->destructor)
             o->destructor(&o->data);
         /* Make it harder to reuse a dangling pointer */
         o->data = NULL;
         o->destructor = NULL;
         slapi_ch_free((void **)&o);
     }
 }

If acquire and release are called in parallel for an object with a refcount of 1, release could decrement it to 0 before acquire has incremented it. The single operations are atomic, but the order in which they run is not defined. So one thread can decrement the count and free the object, while the other thread increments it and is happy, but the object can be freed before use.
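
One common mitigation for this particular window (shown only as a sketch; it is not necessarily what the eventual fix does) is to make acquire refuse to resurrect a count that has already reached zero, e.g. with a compare-and-swap loop:

 #include <stdatomic.h>
 #include <stdint.h>

 /* Sketch: an acquire that fails instead of resurrecting a dying object.
  * Assumes the refcount were an _Atomic int64_t; the real Object keeps a
  * plain counter updated through the slapi_atomic_* helpers. */
 static int
 object_try_acquire(_Atomic int64_t *refcnt)
 {
     int64_t cur = atomic_load(refcnt);
     while (cur > 0) {
         /* Only bump the count while it is still positive.  If another
          * thread drops it to zero first, report failure instead of
          * touching memory that is about to be freed. */
         if (atomic_compare_exchange_weak(refcnt, &cur, cur + 1)) {
             return 1;   /* acquired */
         }
         /* cur was reloaded by the failed CAS; retry */
     }
     return 0;           /* object is already being destroyed */
 }

Note that this only narrows the race on the counter itself; if the Object memory can be freed while another thread still holds a raw pointer to it, no fence on the counter helps, which is why the discussion below moves toward taking long-lived structures out of the object framework entirely.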

@389-ds-bot 389-ds-bot added the closed: fixed Migration flag - Issue label Sep 13, 2020
@389-ds-bot 389-ds-bot added this to the 1.4.1 milestone Sep 13, 2020

Comment from mreynolds (@mreynolds389) at 2019-07-11 19:33:06

Yeah, I've had plenty of issues with this object acquiring and releasing when I worked on CleanAllRUV. It is definitely a mess, but at the same time we don't seem to have any known crashes related to this - at least I'm not aware of any. I guess the crash threat is more around removing a replica while the replica is processing updates?

Adding a lock might work, but I worry about the perf impact.

Comment from mreynolds (@mreynolds389) at 2019-07-11 19:33:07

Metadata Update from @mreynolds389:

  • Custom field origin adjusted to None
  • Custom field reviewstatus adjusted to None

Comment from lkrispen (@elkris) at 2019-07-14 20:43:45

Yes, we should not put any more effort into the object mechanism. I am thinking about an approach in the other direction, at least for the replica objects.
A replica object is created and the refcount is set to 1; then there are (or should be) only balanced calls to acquire/release, and at shutdown we do one final release, which should free the replica. So in short: we create a replica and it lives until shutdown, so we could just forget about the acquire/release calls and use it until shutdown. At shutdown we have stopped the worker threads, closed the changelog and stopped the agreements, so there is nothing left that can access the replica.
The case of removing a replica is slightly different. We have to remove the agreements first (the agmts are child entries, otherwise we could not delete the replica), then we close the changelog and set the reference to the replica object to NULL; this lets any further update operation detect that there is no replica. We probably cannot free it, since some operations might still hold a reference to it, but the reference in the mapping tree is lost and we will leak it anyway.
I will investigate further in this direction; comments welcome.
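
A rough sketch of that direction might look like the following (hypothetical code, not the actual patch; mtnode_ext_t, replica_free() and multimaster_stop_replica() are illustrative names, not the identifiers in the tree):

 /* Sketch of the "replica lives until shutdown" model: the mapping tree
  * extension holds the Replica directly, with no Object wrapper and no
  * per-use acquire/release. */
 typedef struct mtnode_ext_t {
     Replica *replica;          /* direct pointer instead of an Object */
 } mtnode_ext_t;

 /* Normal operation: just use ext->replica, no refcounting. */

 /* Shutdown: worker threads are stopped, the changelog is closed and the
  * agreements are stopped before this runs, so nothing can still reach
  * the replica and a single free is safe. */
 static void
 multimaster_stop_replica(mtnode_ext_t *ext)
 {
     if (ext->replica) {
         replica_free(&ext->replica);   /* frees and NULLs the pointer */
     }
 }

 /* Removing a replica at runtime is the harder case described above:
  * setting ext->replica = NULL lets new operations see "no replica",
  * while in-flight operations may still hold the old pointer, which is
  * why it may have to be leaked rather than freed. */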

Comment from firstyear (@Firstyear) at 2019-07-15 01:12:41

How hard would it be to remove the "object" mechanism from the replica struct, and then re-approach how we do refcounting?

Comment from tbordaz (@tbordaz) at 2019-07-15 17:35:27

Just a comment

I agree with @mreynolds389 and @elkris: the objects code would benefit from a lock, but considering the possible impact on performance, and that we have not seen any crash, we may postpone adding one.

I think objects are not designed for long-lived data. We continuously acquire/release an object that is guaranteed to be here until the end. It is a waste of energy and, like you said @elkris, it is buggy anyway since acquire/release are not balanced.
I like the idea of moving long-lived structures out of the object framework. We can allocate a replica and keep it until shutdown. IMHO, if we stop all workers/tasks it is safe to free the replica at shutdown.

Comment from mreynolds (@mreynolds389) at 2019-08-08 17:27:13

Metadata Update from @mreynolds389:

  • Issue set to the milestone: 1.4.1

Comment from lkrispen (@elkris) at 2020-02-26 16:14:31

fixed: #3579

Comment from lkrispen (@elkris) at 2020-02-26 16:14:42

Metadata Update from @elkris:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)
