Issues with long-lasting RPC calls #37

Closed
makmanalp opened this issue Jun 28, 2012 · 5 comments
@makmanalp

Test case for Reproduction:

import zerorpc

import time
from gevent import monkey; monkey.patch_all()

class Foo(object):

    def wait(self):
        print "Start waitin"
        time.sleep(15)
        print "End waitin"
        return "derp"

s = zerorpc.Server(Foo())
s.bind("tcp://0.0.0.0:3333")
s.run()

and the client:

import zerorpc

s = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)
print s.wait()

The result for me on the server is:

/!\ gevent_zeromq BUG /!\ catching after missing event /!\
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': 'e09d15b1-6599-495b-a54d-031e6e7b4039', 'zmqid': ['\x00\xa9\x15J\x94\x8b|O\xf1\xae\x17\xb3M\x82\x8b\xee\x17'], 'message_id': 'e09d15b4-6599-495b-a54d-031e6e7b4039', 'v': 3} [...]

while on the client I get "derp" fine. Without the monkey patch, predictably, the connection is lost due to lack of heartbeat because sleep blocks.
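
For comparison, a minimal sketch of the same server without the monkey patch, calling gevent.sleep directly so the hub keeps servicing heartbeats (same class and address as above):

import gevent
import zerorpc

class Foo(object):

    def wait(self):
        print "Start waitin"
        gevent.sleep(15)  # cooperative sleep: yields to the hub, so heartbeats still go out
        print "End waitin"
        return "derp"

s = zerorpc.Server(Foo())
s.bind("tcp://0.0.0.0:3333")
s.run()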

This is a simplification of a larger bug I've been having when dealing with a bunch of workers that all call subprocess to spawn an external executable and do something that takes a while. (With the newest gevent from trunk, there is a subprocess monkey patch).

Any ideas? Zerorpc 0.2.1, gevent 1.0.

@makmanalp
Author

I got a chance to pull this closer to what I'm experiencing. Server:

import zerorpc

import time
from gevent import monkey; monkey.patch_all()

class Foo(object):

    def wait(self, x):
        print "Start waitin %s" % x
        time.sleep(15)
        print "End waitin %s" % x
        return "derp %s" % x

s = zerorpc.Server(Foo(), pool_size=2)
s.bind("tcp://0.0.0.0:3333")
s.run()

Client:

import zerorpc

c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)

work = ["a", "b", "c", "d"]
futures = [c.wait(x, async=True) for x in work]
[future.get() for future in futures] 

Result: the same errors as above, plus an additional "LostRemote: Lost remote after 10s heartbeat" on the client side before the server can complete and send the result, even though time is monkey-patched! Also, the server keeps going on to c and d even though the error comes on the client side after a and b:

truffle:trial makmanalp$ ipython main.py
Start waitin b
Start waitin a
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c75-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c76-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c74-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c77-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c75-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c7a-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c74-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c7b-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
End waitin b
Start waitin c
End waitin a
Start waitin d
End waitin c
End waitin d

So it looks like the tasks are at least getting submitted correctly, but the server is oblivious to the client disconnection.

UPDATE: While messing around, I found that removing async fixes the problem; I figured it was because the code now does only one request at a time. That led me to mess around with the pool size, and it turns out that if I leave async in but remove the pool size limit, the LostRemote doesn't happen! Am I missing some basic assumption about how this is supposed to work?

@bombela
Member

bombela commented Jun 28, 2012

Hi,

As you saw in your first example, using time.sleep() without monkey patching puts the whole process to sleep, freezing all asynchronous IO activity. The client then doesn't receive any heartbeat from the server anymore and complains after 10s (the default heartbeat frequency is 5s, and it aborts after 2 missing heartbeats).

If you really want to do something like that, you can disable the heartbeat on both sides by passing heartbeat=None to both the server and the client constructors.
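
A minimal sketch of that, reusing the Foo class from the repro above (server and client are of course separate processes):

import zerorpc

# server process: heartbeat=None means the server never heart-beats
s = zerorpc.Server(Foo(), heartbeat=None)
s.bind("tcp://0.0.0.0:3333")
s.run()

# client process: same flag, so it doesn't expect heartbeats either
c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000, heartbeat=None)
print c.wait()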

But this is probably not what you want: you want to use a monkey-patched or, more specifically, a gevent-compliant version of subprocess. I can't help you much with gevent 1.0 since zerorpc was never tested against it (we are still using the version available on PyPI).

I believe gevent 1.0 ships a gevent-friendly version of subprocess. Otherwise, for gevent < 1.0, you can try gevent_subprocess (pip install gevent_subprocess, https://github.com/bombela/gevent_subprocess).
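
Something along these lines, for example (a sketch only; the Worker class and tool path are made up), so the external command blocks just one greenlet while heartbeats keep flowing:

import zerorpc
from gevent import subprocess  # gevent >= 1.0: cooperative counterpart of the stdlib subprocess

class Worker(object):

    def run_tool(self, arg):
        # blocks only this greenlet, not the whole process,
        # so the server keeps answering heartbeats in the meantime
        return subprocess.check_output(["/usr/local/bin/some-tool", arg])  # hypothetical executable

s = zerorpc.Server(Worker())
s.bind("tcp://0.0.0.0:3333")
s.run()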

In your second comment, you limited pool_size to 2 concurrent requests. When you call wait() 4 times in quick succession, only two calls can be processed right away, and both take 15s to complete. Meanwhile, the two pending calls are still not connected, and the client's heartbeat gives up after 10s.

Intuitively, it would make more sense for the 2 pending requests to wait until the server can accept a new request (at least until the 30s timeout kicks in). But because for the moment there is one heartbeat per request (!!! backward compatibility !!!), the server will not start heart-beating until a request is actually being processed.

This discrepancy will be fixed at some point; we need to fix it here at dotCloud anyway to go further with our load-balancing strategy (and it will be fixed with respect to backward compatibility, so everything will still be able to talk happily to everything else).

For the moment, if it's a really big problem for you, you can disable the heartbeat on both sides (but then streaming can't be used anymore).

Regards,
fx

@makmanalp
Author

Hey,
Thanks for taking the time to reply. The latter was more of a concern for me. The gevent 1.0 subprocess works well.

Yeah, I'd expected the server to have a task queue of sorts and the heartbeat to be per-server instead of per-request. As a workaround, I'll just limit how many tasks I send at a time instead of dumping them all in at once and letting the server handle the limiting.
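
A sketch of that workaround, assuming a client-side gevent pool capped at the server's pool_size so no accepted call sits waiting past the heartbeat window (names reused from the repro above):

import zerorpc
from gevent.pool import Pool

c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)

# never have more in-flight calls than the server's pool_size (2 in the repro above),
# so every request the server accepts starts heart-beating right away
pool = Pool(2)

work = ["a", "b", "c", "d"]
results = pool.map(c.wait, work)  # each greenlet makes one synchronous wait() call
print results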

Meanwhile, I'll leave the report open if that sounds fine to you. If you need a hand fixing the way heartbeats work, I'd be glad to have a go at it at some point.

Cheers,
~mali

@v3ss0n

v3ss0n commented Apr 20, 2013

I am also looking forward to this. But zerorpc development seems to be slowing down: no meaningful update for a month already, and the last real updates were 2 months ago. ZeroRPC is a major product and something that makes dotCloud proud, right?

@bombela
Member

bombela commented Jun 16, 2015

If it makes you feel better, I took over the maintenance of zerorpc recently. Hopefully I can keep it maintained, and eventually move it forward.
