objects and memory leaks #3547

Closed
389-ds-bot opened this issue Sep 13, 2020 · 8 comments

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/50490

  • Created at 2019-07-11 16:44:39 by lkrispen (@elkris)
  • Closed at 2020-02-26 16:14:41 as fixed
  • Assigned to nobody

While investigating memory leaks I came across one type of leak involving objects, which also raises the question of the safety of the object implementation.

The leak is (only part of it is shown here; there are many more allocations inside replica_new):

 Indirect leak of 272 byte(s) in 1 object(s) allocated from:
     0 0x7f6780d19210 in calloc (/lib64/libasan.so.5+0xef210)
     1 0x7f678050d8d7 in slapi_ch_calloc /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/ch_malloc.c:175
     2 0x7f67713b106e in replica_new_from_entry /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:159
     3 0x7f67713b0e0c in replica_new /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_replica.c:131
     4 0x7f67713a16be in multimaster_mtnode_construct_replicas /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_mtnode_ext.c:77
     5 0x7f67713a097a in multimaster_start /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/plugins/replication/repl5_init.c:798
     6 0x7f6780622b22 in plugin_call_func /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:2034
     7 0x7f6780622787 in plugin_call_one /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1983
     8 0x7f67806214c4 in plugin_dependency_startall /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1737
     9 0x7f6780622716 in plugin_startall /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin.c:1955
     10 0x4483e1 in main /home/elkris/TEST/caa/ws/389-ds-base/ldap/servers/slapd/main.c:1153
     11 0x7f677cf8211a in __libc_start_main (/lib64/libc.so.6+0x2311a)

So a replica is created by a call to replica_new() and then added as an object to the mapping tree:

 ext->replica = object_new(r, replica_destroy);

It provides a destructor, replica_destroy(), but this is only called when object_release() drops the refcount from 1 to 0.
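
For context, the intended discipline looks roughly like the following sketch of caller code (hypothetical; it only uses the existing object API, and the leak reported above means one of the releases in a pattern like this is never reached):

 /* Sketch of the intended refcount balance (hypothetical caller code,
  * using the existing Object API from the server). */
 Object *ro = object_new(r, replica_destroy); /* refcnt == 1, owner holds it */

 object_acquire(ro);   /* a consumer takes a reference, refcnt == 2 */
 /* ... use the replica ... */
 object_release(ro);   /* consumer done, refcnt == 1 */

 object_release(ro);   /* final release: refcnt drops to 0, replica_destroy()
                        * runs and the Object itself is freed.  If any acquire
                        * is never paired with a release, this point is never
                        * reached and the replica leaks. */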

It is hard to find where the balance of object_acquire() and object_release() gets broken.
I did some investigation and tentative fixes, adding calls to object_release() in places where I thought they were missing - with mixed success. In some test scenarios the leak disappeared, in others the refcount at the last call of object_release() was only reduced to 2 (one release still missing), and in others I got a crash because the object was already freed before the last call.
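
One way to track such an imbalance down (a hypothetical debugging aid, not something in the tree) is to wrap the calls in macros that log the call site, so the missing release can be read off the log:

 /* Hypothetical tracing wrappers around the existing API; OBJ_ACQUIRE and
  * OBJ_RELEASE are made-up names.  Logging before the release avoids
  * touching the object after it may have been freed. */
 #include <stdio.h>

 #define OBJ_ACQUIRE(o)                                              \
     do {                                                            \
         object_acquire(o);                                          \
         fprintf(stderr, "acquire %p at %s:%d\n", (void *)(o),       \
                 __FILE__, __LINE__);                                \
     } while (0)

 #define OBJ_RELEASE(o)                                              \
     do {                                                            \
         fprintf(stderr, "release %p at %s:%d\n", (void *)(o),       \
                 __FILE__, __LINE__);                                \
         object_release(o);                                          \
     } while (0)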

So there is definitely a mess in our acquire/release balance, but investigating this I also have doubts about how safe our object implementation is.
The only protection is a refcount that prevents freeing an object while it is in use, but there are conditions where it can go wrong.

First, look at object_acquire():

 void
 object_acquire(Object *o)
 {
     PR_ASSERT(NULL != o);
     slapi_atomic_incr_64(&(o->refcnt), __ATOMIC_RELEASE);
 }

There is no guarantee that the object is still valid when this is called. We have, e.g., references in the prp replication protocol struct, which are then used again and again by the replication protocol.
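
To make the hazard concrete, here is a hypothetical fragment (not code from the tree; the struct and function are stand-ins for the protocol private data): the cached pointer may already have seen its final release on another thread by the time it is acquired again:

 /* Hypothetical illustration of acquiring through a cached pointer.
  * Object, Replica and object_acquire/get_data/release come from the
  * server headers; prot_private and use_replica are made up. */
 struct prot_private {
     Object *replica_object;   /* raw pointer cached when the protocol started */
 };

 static void
 use_replica(struct prot_private *prp)
 {
     /* Nothing proves the object is still alive here.  If another thread
      * already did the final object_release(), this acquire increments a
      * counter in freed memory and the object_get_data() below is a
      * use-after-free. */
     object_acquire(prp->replica_object);
     Replica *r = (Replica *)object_get_data(prp->replica_object);
     /* ... use r ... */
     object_release(prp->replica_object);
 }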

And looking at object_release() we see that there could be race conditions not protected by the refcount:

 void
 object_release(Object *o)
 {
     PRInt32 refcnt_after_release;

     PR_ASSERT(NULL != o);
     refcnt_after_release = slapi_atomic_decr_64(&(o->refcnt), __ATOMIC_ACQ_REL);
     if (refcnt_after_release == 0) {
         /* Object can be destroyed */
         if (o->destructor)
             o->destructor(&o->data);
         /* Make it harder to reuse a dangling pointer */
         o->data = NULL;
         o->destructor = NULL;
         slapi_ch_free((void **)&o);
     }
 }

If acquire and release are called in parallel for an object with a refcount of 1, release could decrement it to 0 before acquire has incremented it. The single operations are atomic, but the order in which they run is not defined. So one thread can decrement the count and free the object, while the other thread increments it and is happy, but the object can be freed before use.
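
One common mitigation for this particular window (shown only as a sketch; it is not necessarily what the eventual fix does) is to make acquire refuse to resurrect a count that has already reached zero, e.g. with a compare-and-swap loop:

 #include <stdatomic.h>
 #include <stdint.h>

 /* Sketch: an acquire that fails instead of resurrecting a dying object.
  * Assumes the refcount were an _Atomic int64_t; the real Object keeps a
  * plain counter updated through the slapi_atomic_* helpers. */
 static int
 object_try_acquire(_Atomic int64_t *refcnt)
 {
     int64_t cur = atomic_load(refcnt);
     while (cur > 0) {
         /* Only bump the count while it is still positive.  If another
          * thread drops it to zero first, report failure instead of
          * touching memory that is about to be freed. */
         if (atomic_compare_exchange_weak(refcnt, &cur, cur + 1)) {
             return 1;   /* acquired */
         }
         /* cur was reloaded by the failed CAS; retry */
     }
     return 0;           /* object is already being destroyed */
 }

Note that this only narrows the race on the counter itself; if the Object memory can be freed while another thread still holds a raw pointer to it, no fence on the counter helps, which is why the discussion below moves toward taking long-lived structures out of the object framework entirely.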

@389-ds-bot 389-ds-bot added the closed: fixed Migration flag - Issue label Sep 13, 2020
@389-ds-bot 389-ds-bot added this to the 1.4.1 milestone Sep 13, 2020

Comment from mreynolds (@mreynolds389) at 2019-07-11 19:33:06

Yeah, I've had plenty of issues with this object acquiring and releasing when I worked on CleanAllRUV. It is definitely a mess, but at the same time we don't seem to have any known crashes related to this - at least I'm not aware of any. I guess the crash threat is more around removing a replica while the replica is processing updates?

Adding a lock might work, but I worry about the perf impact.

Comment from mreynolds (@mreynolds389) at 2019-07-11 19:33:07

Metadata Update from @mreynolds389:

  • Custom field origin adjusted to None
  • Custom field reviewstatus adjusted to None

Comment from lkrispen (@elkris) at 2019-07-14 20:43:45

Yes, we should not put any more effort into the object mechanism. I am thinking about an approach in the other direction, at least for the replica objects.
A replica object is created and the refcount is set to 1; then there are (or should be) only balanced calls to acquire/release, and at shutdown we do one final release, which should free the replica. So in short: we create a replica and it lives until shutdown, so we could just forget about the acquire/release calls and use it until shutdown. At shutdown we have stopped the worker threads, closed the changelog and stopped the agreements, so there is nothing left that can access the replica.
The case of removing a replica is slightly different. We have to remove the agreements first (the agmts are child entries, otherwise we could not delete the replica), then we close the changelog and set the reference to the replica object to NULL; this lets any further update operation detect that there is no replica. We probably cannot free it, since some operations might still hold a reference to it, but the reference in the mapping tree is lost and we will leak it anyway.
I will investigate further in this direction; comments welcome.
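
A rough sketch of that direction might look like the following (hypothetical code, not the actual patch; mtnode_ext_t, replica_free() and multimaster_stop_replica() are illustrative names, not the identifiers in the tree):

 /* Sketch of the "replica lives until shutdown" model: the mapping tree
  * extension holds the Replica directly, with no Object wrapper and no
  * per-use acquire/release. */
 typedef struct mtnode_ext_t {
     Replica *replica;          /* direct pointer instead of an Object */
 } mtnode_ext_t;

 /* Normal operation: just use ext->replica, no refcounting. */

 /* Shutdown: worker threads are stopped, the changelog is closed and the
  * agreements are stopped before this runs, so nothing can still reach
  * the replica and a single free is safe. */
 static void
 multimaster_stop_replica(mtnode_ext_t *ext)
 {
     if (ext->replica) {
         replica_free(&ext->replica);   /* frees and NULLs the pointer */
     }
 }

 /* Removing a replica at runtime is the harder case described above:
  * setting ext->replica = NULL lets new operations see "no replica",
  * while in-flight operations may still hold the old pointer, which is
  * why it may have to be leaked rather than freed. */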

Comment from firstyear (@Firstyear) at 2019-07-15 01:12:41

How hard would it be to remove the "object" mechanism from the replica struct, and then re-approach how we do refcounting?

Comment from tbordaz (@tbordaz) at 2019-07-15 17:35:27

Just a comment

I agree with @mreynolds389 and @elkris: the objects code would benefit from a lock, but considering the possible impact on performance, and that we have not seen any crash, we may postpone adding one.

I think objects are not designed for long-lived data. We continuously acquire/release an object that is guaranteed to be here until the end. It is a waste of energy and, like you said @elkris, it is buggy anyway since acquire/release are not balanced.
I like the idea of moving long-lived structures out of the object framework. We can allocate a replica and keep it until shutdown. IMHO, if we stop all workers/tasks it is safe to free the replica at shutdown.

Comment from mreynolds (@mreynolds389) at 2019-08-08 17:27:13

Metadata Update from @mreynolds389:

  • Issue set to the milestone: 1.4.1

Comment from lkrispen (@elkris) at 2020-02-26 16:14:31

fixed: #3579

Comment from lkrispen (@elkris) at 2020-02-26 16:14:42

Metadata Update from @elkris:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)
