Help with debugging an RPyC-related issue: KeyError in colls.py - how to debug this further? #526

Open
DavidAntliff opened this issue Feb 7, 2023 · 4 comments

DavidAntliff commented Feb 7, 2023

I'm looking for some help debugging this, please.

I'm using RPyC 5.3.0 in an Ubuntu 22.04 Docker container with Python 3.9, and rarely (but too frequently) I see the following error when trying to read a value from a remote memoryview (the variable mv here, of type <netref class 'rpyc.core.netref.type'>). This is running under pytest:

memory_access.py:59: in read
    return mv[offset]
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/netref.py:261: in method
    return syncreq(_self, consts.HANDLE_CALLATTR, name, args, kwargs)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/netref.py:63: in syncreq
    return conn.sync_request(handler, proxy, *args)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/protocol.py:718: in sync_request
    return _async_res.value
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <AsyncResult object (ready) at 0x7f08413a79f0>
    @property
    def value(self):
        """Returns the result of the operation. If the result has not yet
        arrived, accessing this property will wait for it. If the result does
        not arrive before the expiry time elapses, :class:`AsyncResultTimeout`
        is raised. If the returned result is an exception, it will be raised
        here. Otherwise, the result is returned directly.
        """
        self.wait()
        if self._is_exc:
>           raise self._obj
E           _get_exception_class.<locals>.Derived: ('subprocess.CompletedProcess', 187650931238928, 281473569049664)
E           
E           ========= Remote Traceback (1) =========
E           Traceback (most recent call last):
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 355, in _dispatch_request
E               args = self._unbox(args)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 321, in _unbox
E               return tuple(self._unbox(item) for item in value)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 321, in <genexpr>
E               return tuple(self._unbox(item) for item in value)
E             File "/usr/lib/python3.8/site-packages/rpyc/core/protocol.py", line 323, in _unbox
E               return self._local_objects[value]
E             File "/usr/lib/python3.8/site-packages/rpyc/lib/colls.py", line 110, in __getitem__
E               return self._dict[key][0]
E           KeyError: ('subprocess.CompletedProcess', 187650931238928, 281473569049664)
/home/docker/.local/lib/python3.9/site-packages/rpyc/core/async_.py:108: KeyError

My code performs many, many such read/write calls over a netref'd memoryview, and it's always one particular read that fails, if it fails at all. The access offset is correct.
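
For reference, the access pattern is essentially the following (a minimal sketch - the connection setup and the remote_buffer() service method are illustrative stand-ins; only the mv[offset] access mirrors my actual code):

    import rpyc

    # Illustrative connection; the real host/port and service come from our test harness
    conn = rpyc.connect("server-host", 18861)

    # Hypothetical service method that returns a netref to a server-side memoryview
    mv = conn.root.remote_buffer()

    def read(offset):
        # Each index access becomes a synchronous RPyC request/response round trip
        return mv[offset]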

I'm trying to gather clues, so I have a few questions, please:

  1. How should I interpret this trace? The client expects to read an int value from the memoryview (return mv[offset]), but the server raises this exception instead. What does that indicate has gone wrong, and what could be causing it?
  2. Is the subprocess.CompletedProcess key coming from an internal RPyC mechanism, or is it from my own code (or perhaps third-party code I call)? Nothing in my memoryview-related code touches subprocess, but there are calls to subprocess elsewhere in my program (on the server side).

Until recently I was using RPyC 5.1.0, and I saw something similar: the same server-side exception was raised (as a debug-level log entry), but the client was oblivious to it and saw no exception, and usually the correct value was still returned. Occasionally, however, when this happened the client would get a None value back from return mv[offset], which is of course incorrect and caused downstream code to fail. Updating to 5.3.0 seems to make the exception propagate to the client rather than returning None.


EDIT: I've put a try/except/breakpoint around line 322 in _unbox, to catch the error when label == consts.LABEL_LOCAL_REF, and inspected self._local_objects (roughly the instrumentation sketched below). It's very large (589 entries) and has six entries with subprocess.CompletedProcess in the key. All six match the first number in the failing key's tuple, but none match the second number (and that number in the failing key is significantly larger than any of the corresponding numbers in the _local_objects collection).

So it looks like this subprocess.CompletedProcess value is from my own code, but there seems to be a missing entry when RPyC unboxes it. What could lead to this happening? It seems to occur slightly more often if I load the server's CPU up to 100% (to slow it down a bit).
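
For anyone interested, the instrumentation is roughly this (a rough sketch - the Connection._unbox and _local_objects names come from the traceback above; everything else is just debugging scaffolding):

    # Rough sketch: wrap Connection._unbox so a failed local-object lookup
    # drops into the debugger with the live cache available for inspection.
    import rpyc.core.protocol as protocol

    _orig_unbox = protocol.Connection._unbox

    def _unbox_with_breakpoint(self, package):
        try:
            return _orig_unbox(self, package)
        except KeyError:
            # self._local_objects is the cache that raised KeyError in colls.py
            print(len(self._local_objects._dict), "entries in _local_objects")
            breakpoint()
            raise

    protocol.Connection._unbox = _unbox_with_breakpoint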

comrumino self-assigned this Mar 10, 2023
comrumino (Collaborator) commented

This is difficult to triage without a test case to reproduce locally, but here are my first thoughts on the topic:

  1. ('subprocess.CompletedProcess', 187650931238928, 281473569049664) is the id pack for an instance of subprocess.CompletedProcess. See rpyc.lib.__init__.get_id_pack; this is essentially how RPyC uniquely identifies objects (see the sketch after this list).
  2. It's trying to unbox a netref that it has not seen before, which seems problematic to say the least. If it happens more as load increases, it is most likely a race condition.
    Are you able to reproduce this issue on the latest version?
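
For illustration, you can compute an id pack like the one in your traceback directly (a minimal sketch; get_id_pack is the helper mentioned in point 1, and the printed ids are of course machine-specific):

    # Minimal sketch: obtaining the id pack for a CompletedProcess instance.
    import subprocess
    from rpyc.lib import get_id_pack

    proc = subprocess.CompletedProcess(args=["true"], returncode=0)
    print(get_id_pack(proc))
    # -> a 3-tuple like ('subprocess.CompletedProcess', ..., ...),
    #    the same shape as the key in the KeyError above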

comrumino added the Triage label (Investigation by a maintainer has started) Mar 10, 2023
DavidAntliff (Author) commented

Thanks for getting back to me. I will try with 5.3.1 as time allows.

This does seem to be quite a rare thing - it only occurs within our fairly intensive, long-running integration test system, and rarely even there at the level of an individual test, but we see it often because we run so many tests. We've discovered that skipping specific tests seems to ensure it does not occur, so it appears to be related to certain tests and therefore perhaps to certain kinds of workloads. It would be impractical to share this code as it's company-owned, I'm afraid, and I haven't been successful in producing a slimmed-down minimal reproducible example either.

In our basic "does it work" tests where we thrash connections and messages, RPyC seems 100% robust, so there's an unusual situation involved somewhere, I think.

I'll do some more testing when I get time and get back to this thread when I have more info. Thanks again.

comrumino (Collaborator) commented

@DavidAntliff fair enough. You may want to try the new bind_threads setting if you use the threading module or a threaded service. The benchmarks seem to show comparable performance, and it helps with some of RPyC's less intuitive behaviors.
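
A minimal sketch of turning it on at both ends (the bind_threads key name comes from this suggestion; treat its exact placement in the config as an assumption and check the docs for your version):

    import rpyc
    from rpyc.utils.server import ThreadedServer

    class MyService(rpyc.Service):   # illustrative service
        pass

    # Server side: pass the setting via the protocol config
    server = ThreadedServer(MyService, port=18861,
                            protocol_config={"bind_threads": True})
    # server.start()

    # Client side: mirror the setting in the connection config
    conn = rpyc.connect("localhost", 18861, config={"bind_threads": True})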

comrumino (Collaborator) commented

"RPyC seems 100% robust"

Things get sketchier when threading is involved. You may want to look into where __del__ is called on the netref, since that decrements the local object's ref count; I would venture a guess that it is decremented to zero and the object is dropped from the local cache earlier than desired (one way to trace this is sketched below). I do think bind_threads might solve this issue for you. There are also some fixes on master that aren't released yet.
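
One rough way to see where that happens (a sketch only - BaseNetref is the netref base class in rpyc.core.netref, but the id-pack attribute name is an assumption, hence the defensive getattr):

    # Rough sketch: log every netref deletion with a short stack trace, to see
    # which code path drops the proxy (and therefore the server-side refcount).
    import traceback
    import rpyc.core.netref as netref

    _orig_del = netref.BaseNetref.__del__

    def _logged_del(self):
        print("netref __del__:", getattr(self, "____id_pack__", "<unknown id pack>"))
        traceback.print_stack(limit=5)
        _orig_del(self)

    netref.BaseNetref.__del__ = _logged_del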
