Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in sync-repl #5600

Closed
mreynolds389 opened this issue Jan 9, 2023 · 10 comments
Closed

Crash in sync-repl #5600

mreynolds389 opened this issue Jan 9, 2023 · 10 comments
Labels
needs triage The issue will be triaged during scrum

Comments

@mreynolds389
Copy link
Contributor

Issue Description

github CI found another random crash in sync repl plugin during the dynamic plugins test:

=================================== FAILURES ===================================
________________________ test_sync_repl_dynamic_plugin _________________________

topology = <lib389.topologies.TopologyMain object at 0x7f36f06dc510>
request = <FixtureRequest for <Function test_sync_repl_dynamic_plugin>>

    def test_sync_repl_dynamic_plugin(topology, request):
        """Test sync_repl with dynamic plugin
    
        :id: d4f84913-c18a-459f-8525-110f610ca9e6
        :setup: install a standalone instance
        :steps:
            1. reset instance to standard (no retroCL, no sync_repl, no dynamic plugin)
            2. Enable dynamic plugin
            3. Enable retroCL/content_sync
            4. Establish a sync_repl req
        :expectedresults:
            1. Should succeeds
            2. Should succeeds
            3. Should succeeds
            4. Should succeeds
        """
    
        # Reset the instance in a default config
        # Disable content sync plugin
        topology.standalone.plugins.disable(name=PLUGIN_REPL_SYNC)
    
        # Disable retro changelog
        topology.standalone.plugins.disable(name=PLUGIN_RETRO_CHANGELOG)
    
        # Disable dynamic plugins
        topology.standalone.modify_s(DN_CONFIG, [(ldap.MOD_REPLACE, 'nsslapd-dynamic-plugins', b'off')])
        topology.standalone.restart()
    
        # Now start the test
        # Enable dynamic plugins
        try:
            topology.standalone.modify_s(DN_CONFIG, [(ldap.MOD_REPLACE, 'nsslapd-dynamic-plugins', b'on')])
        except ldap.LDAPError as e:
            log.error('Failed to enable dynamic plugin! {}'.format(e.args[0]['desc']))
            assert False
    
        # Enable retro changelog
        topology.standalone.plugins.enable(name=PLUGIN_RETRO_CHANGELOG)
    
        # Enbale content sync plugin
        topology.standalone.plugins.enable(name=PLUGIN_REPL_SYNC)
    
        # create a sync repl client and wait 5 seconds to be sure it is running
>       sync_repl = Sync_persist(topology.standalone)

dirsrvtests/tests/suites/syncrepl_plugin/basic_test.py:638: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
dirsrvtests/tests/suites/syncrepl_plugin/basic_test.py:207: in __init__
    self.conn.open()
/usr/lib/python3.11/site-packages/lib389/__init__.py:1017: in open
    raise e
/usr/lib/python3.11/site-packages/lib389/__init__.py:1013: in open
    self.simple_bind_s(ensure_str(self.binddn), self.bindpw, escapehatch='i am sure')
/usr/lib/python3.11/site-packages/lib389/__init__.py:177: in inner
    return f(*args, **kwargs)
/usr/lib64/python3.11/site-packages/ldap/ldapobject.py:249: in simple_bind_s
    resp_type, resp_data, resp_msgid, resp_ctrls = self.result3(msgid,all=1,timeout=self.timeout)
/usr/lib/python3.11/site-packages/lib389/__init__.py:177: in inner
    return f(*args, **kwargs)
/usr/lib64/python3.11/site-packages/ldap/ldapobject.py:543: in result3
    resp_type, resp_data, resp_msgid, decoded_resp_ctrls, retoid, retval = self.result4(
/usr/lib/python3.11/site-packages/lib389/__init__.py:177: in inner
    return f(*args, **kwargs)
/usr/lib64/python3.11/site-packages/ldap/ldapobject.py:553: in result4
    ldap_result = self._ldap_call(self._l.result4,msgid,all,timeout,add_ctrls,add_intermediates,add_extop)
/usr/lib/python3.11/site-packages/lib389/__init__.py:177: in inner
    return f(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <lib389.DirSrv object at 0x7f36f0689610>
func = <built-in method result4 of LDAP object at 0x7f36f05ac510>
args = (1, 1, -1, 0, 0, 0), kwargs = {}, diagnostic_message_success = None

    def _ldap_call(self,func,*args,**kwargs):
      """
      Wrapper method mainly for serializing calls into OpenLDAP libs
      and trace logs
      """
      self._ldap_object_lock.acquire()
      if __debug__:
        if self._trace_level>=1:
          self._trace_file.write('*** {} {} - {}\n{}\n'.format(
            repr(self),
            self._uri,
            '.'.join((self.__class__.__name__,func.__name__)),
            pprint.pformat((args,kwargs))
          ))
          if self._trace_level>=9:
            traceback.print_stack(limit=self._trace_stack_limit,file=self._trace_file)
      diagnostic_message_success = None
      try:
        try:
>         result = func(*args,**kwargs)
E         ldap.SERVER_DOWN: {'result': -1, 'desc': "Can't contact LDAP server", 'errno': 104, 'ctrls': [], 'info': 'Connection reset by peer'}

/usr/lib64/python3.11/site-packages/ldap/ldapobject.py:128: SERVER_DOWN
------------------------------ Running instances -------------------------------
	supplier2
	supplier1
------------------------------ Stopped instances -------------------------------
	standalone1
---------------------------------- Core files ----------------------------------
@mreynolds389 mreynolds389 added the needs triage The issue will be triaged during scrum label Jan 9, 2023
@mreynolds389
Copy link
Contributor Author

ASAN Report

==370885==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60400057d880 at pc 0x7ff9100c13b8 bp 0x7ff86e404d50 sp 0x7ff86e404d48
READ of size 8 at 0x60400057d880 thread T19
    #0 0x7ff9100c13b7 in slapi_get_object_extension (/usr/lib64/dirsrv/libslapd.so.0+0x2c13b7)
    #1 0x7ff90d19fad3 in sync_get_operation_extension ldap/servers/plugins/sync/sync_refresh.c:896
    #2 0x7ff90d19fad3 in sync_srch_refresh_pre_result ldap/servers/plugins/sync/sync_refresh.c:284
    #3 0x7ff91015806f in plugin_call_func ldap/servers/slapd/plugin.c:2001
    #4 0x7ff910158545 in plugin_call_list ldap/servers/slapd/plugin.c:1944
    #5 0x7ff910195b98 in flush_ber ldap/servers/slapd/result.c:1782
    #6 0x7ff91019a4df in send_ldap_result_ext ldap/servers/slapd/result.c:642
    #7 0x7ff91018fb72 in send_ldap_result (/usr/lib64/dirsrv/libslapd.so.0+0x38fb72)
    #8 0x7ff910157b3b in slapi_send_ldap_result (/usr/lib64/dirsrv/libslapd.so.0+0x357b3b)
    #9 0x7ff9100b2eb5 in dse_modify ldap/servers/slapd/dse.c:2126
    #10 0x7ff910121716 in op_shared_modify ldap/servers/slapd/modify.c:1022
    #11 0x7ff91012630b in do_modify (/usr/lib64/dirsrv/libslapd.so.0+0x32630b)
    #12 0x5586497f8320 in connection_dispatch_operation ldap/servers/slapd/connection.c:653
    #13 0x5586497f8320 in connection_threadmain ldap/servers/slapd/connection.c:1805
    #14 0x7ff910ee4412 in _pt_root (/lib64/libnspr4.so+0x2c412)
    #15 0x7ff90fa8cdec in start_thread (/lib64/libc.so.6+0x8cdec)
    #16 0x7ff90fb1236f in clone3 (/lib64/libc.so.6+0x11236f)

0x60400057d880 is located 0 bytes to the right of 48-byte region [0x60400057d850,0x60400057d880)
allocated by thread T19 here:
    #0 0x7ff9108ba097 in calloc (/lib64/libasan.so.8+0xba097)
    #1 0x7ff910075c55 in slapi_ch_calloc (/usr/lib64/dirsrv/libslapd.so.0+0x275c55)
    #2 0x7ff9100c435c in factory_create_extension (/usr/lib64/dirsrv/libslapd.so.0+0x2c435c)
    #3 0x5586497e2ffe in connection_add_operation ldap/servers/slapd/connection.c:2086
    #4 0x5586497e2ffe in connection_make_new_pb ldap/servers/slapd/connection.c:977
    #5 0x5586497f6072 in connection_threadmain ldap/servers/slapd/connection.c:1614
    #6 0x7ff910ee4412 in _pt_root (/lib64/libnspr4.so+0x2c412)
    #7 0x7ff90fa8cdec in start_thread (/lib64/libc.so.6+0x8cdec)
    #8 0x7ff90fb1236f in clone3 (/lib64/libc.so.6+0x11236f)

@progier389
Copy link
Contributor

FYI: another signature of this issue without asan:

Thread 1 (Thread 0x7f2df63fe6c0 (LWP 1378)):
#0 0x00007f2e2feb3fc8 in sync_cookie_free (freecookie=0xf) at ldap/servers/plugins/sync/sync_util.c:823
#1 0x00007f2e2feb402a in sync_delete_SyncOpInfo (info=0x7f2df63fc720) at ldap/servers/plugins/sync/sync_refresh.c:878
#2 sync_operation_extension_dtor (ext=, object=, parent=) at ldap/servers/plugins/sync/sync_refresh.c:887
#3 0x00007f2e343b6c23 in factory_destroy_extension (extension=0x7f2e303af8e0, parent=0x0, object=0x7f2e303af800, type=) at ldap/servers/slapd/factory.c:340
#4 factory_destroy_extension (type=, object=0x7f2e303af800, parent=0x0, extension=0x7f2e303af8e0) at ldap/servers/slapd/factory.c:321
#5 0x00007f2e343e02f1 in operation_done (op=op@entry=0x7f2e02e080d0, conn=0x0) at ldap/servers/slapd/operation.c:209
#6 0x00007f2e343e0411 in operation_free (op=0x7f2e02e080d0, conn=) at ldap/servers/slapd/operation.c:241
#7 0x00007f2e343e53bc in pblock_done (pb=0x7f2e02e080c0) at ldap/servers/slapd/pblock.c:210
#8 0x00007f2e343e5487 in slapi_pblock_destroy (pb=) at ldap/servers/slapd/pblock.c:233
#9 0x00007f2e343f6282 in slapi_search_get_entry_done (pb=0x7f2df63fc8f0) at ldap/servers/slapd/plugin_internal_op.c:937
#10 0x00007f2e343e09df in op_shared_modify (pb=pb@entry=0x7f2e02e08000, pw_change=pw_change@entry=0, old_pw=0x0) at ldap/servers/slapd/modify.c:1077
#11 0x00007f2e343e25a2 in do_modify (pb=pb@entry=0x7f2e02e08000) at ldap/servers/slapd/modify.c:377
#12 0x00005624909e41c8 in connection_dispatch_operation (pb=0x7f2e02e08000, op=, conn=) at ldap/servers/slapd/connection.c:653
#13 connection_threadmain (arg=) at ldap/servers/slapd/connection.c:1805
#14 0x00007f2e33a4b413 in _pt_root () at /lib64/libnspr4.so
#15 0x00007f2e3413d14d in start_thread () at /lib64/libc.so.6
#16 0x00007f2e341bea00 in clone3 () at /lib64/libc.so.6

And *info (aka ext (the extension data)) is 0x7 telling that factory extension data got corrupted so it is likely the same problem

@mreynolds389
Copy link
Contributor Author

mreynolds389 commented Jan 16, 2023

Ok, so the issue is related to dynamic plugins. Basically on the same connection where the plugin is enabled, we already set the factory extension for the operation before turning the plugin on. Then in that same connection the plugin later registers two new extensions and it tries to use them, but the extension allocation was already done and then we read past the original extension allocation.

@Firstyear
Copy link
Contributor

I think I've commented before that dynamic plugins are horrendously unsafe wrt to multi thread setups so we shouldn't allow them :(

@mreynolds389
Copy link
Contributor Author

I think I've commented before that dynamic plugins are horrendously unsafe wrt to multi thread setups so we shouldn't allow them :(

While I agree, the issue is here is how plugins get called (not a threading issue or race condition in this case), as none of code expects plugins to be enabled after the server starts up. In this case the "factory" code expects all "extension types" to be registered during the start up process. When it happens after the server starts then things break. Not an easy fix (but fixable), but then again dynamic plugins are not officially supported :-)

@mreynolds389
Copy link
Contributor Author

Summary of issue

[1] Enable dynamic plugins
[2] Enable Sync Repl plugin
[3] The mod operation that enables the plugin gets assigned a factory "Operation" extension.
[4] Then Sync Repl registers two new "Operation" factory extensions which updates the global Factory config (increments "extension_count")
[5] Later in the same operation we call a Sync Repl preop result function
[6] This "result" function tries to get the operation factory extensions
[7] It uses the global factory store setting "extension_count" to know how far to read into the memory to find extensions.
[8] This causes a heap buffer overflow because this operation's extensions were allocated/added before Sync Repl added more extensions.

So the issue is that the factory extensions design simply does not work with any form of dynamic plugins. There are ugly hacks to get this working safely, but do we want to invest more time into dynamic plugins and all other features that could be impacted by it? Or just leave dynamic plugins unsupported?

@progier389
Copy link
Contributor

IMHO if we really want to support safely the dynamic plugin we should rather find a way to insure that no other operation is running when setting up the plugin (it should prevents all these multi thread race conditions that @Firstyear is speaking about.), But even that will not solve the factory extension issue (we will still have the problem
with the connection object)
BTW if you speak about adding the list size in the extension pointer list in the object, and check it while
getting/removing an extension, I would not call that an ugly hack ! -;)
So I have mixed feeling: Fixing the extension is move on the right direction but it will not be enough to support dynamic plugins

@mreynolds389
Copy link
Contributor Author

@progier389 yeah we should fix the factory extension in this case. I guess Thierry told you about my hack idea, and glad to hear you don't think it's too hacky :-) Well perhaps next sprint we can work on that - there is no rush to fix it at the moment...

@mreynolds389
Copy link
Contributor Author

mreynolds389 added a commit to mreynolds389/389-ds-base that referenced this issue Feb 17, 2023
…mic plugins is enabled

Description:

Our factory extension code was not designed to work with dynamic plugins.  It
assumes all extensions are regiestered at startup.  If extensions are added
after startup (when dynamic plgun is is enabled) then this breaks.  The fix is
to keep track in each "object" how many extensions were allocated, instead of
relying on the global extension count.

relates: 389ds#5600

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit to mreynolds389/389-ds-base that referenced this issue Feb 27, 2023
…mic plugins is enabled

Description:

Our factory extension code was not designed to work with dynamic plugins.  It
assumes all extensions are registered at startup.  If extensions are added
after startup (when dynamic plugins is enabled) then this breaks.  The fix is
to keep track of how many extensions were allocated per object, instead of
relying on the global extension count.

Patch written by Pierre Rogier - thanks!

relates: 389ds#5600

Reviewed by: jachapman (Thanks!)
mreynolds389 added a commit that referenced this issue Feb 27, 2023
…mic plugins is enabled

Description:

Our factory extension code was not designed to work with dynamic plugins.  It
assumes all extensions are registered at startup.  If extensions are added
after startup (when dynamic plugins is enabled) then this breaks.  The fix is
to keep track of how many extensions were allocated per object, instead of
relying on the global extension count.

Patch written by Pierre Rogier - thanks!

relates: #5600

Reviewed by: jachapman (Thanks!)
mreynolds389 added a commit that referenced this issue Feb 27, 2023
…mic plugins is enabled

Description:

Our factory extension code was not designed to work with dynamic plugins.  It
assumes all extensions are registered at startup.  If extensions are added
after startup (when dynamic plugins is enabled) then this breaks.  The fix is
to keep track of how many extensions were allocated per object, instead of
relying on the global extension count.

Patch written by Pierre Rogier - thanks!

relates: #5600

Reviewed by: jachapman (Thanks!)
@mreynolds389
Copy link
Contributor Author

41e7a81..15a8e72 389-ds-base-2.2 -> 389-ds-base-2.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage The issue will be triaged during scrum
Projects
None yet
Development

No branches or pull requests

3 participants