Issues with long-lasting RPC calls #37

Closed
makmanalp opened this issue Jun 28, 2012 · 5 comments
@makmanalp

Test case for Reproduction:

import zerorpc

import time
from gevent import monkey; monkey.patch_all()

class Foo(object):

    def wait(self):
        print "Start waitin"
        time.sleep(15)
        print "End waitin"
        return "derp"

s = zerorpc.Server(Foo())
s.bind("tcp://0.0.0.0:3333")
s.run()

and the client:

import zerorpc

s = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)
print s.wait()

The result for me on the server is:

/!\ gevent_zeromq BUG /!\ catching after missing event /!\
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': 'e09d15b1-6599-495b-a54d-031e6e7b4039', 'zmqid': ['\x00\xa9\x15J\x94\x8b|O\xf1\xae\x17\xb3M\x82\x8b\xee\x17'], 'message_id': 'e09d15b4-6599-495b-a54d-031e6e7b4039', 'v': 3} [...]

while on the client I get "derp" fine. Without the monkey patch, predictably, the connection is lost due to lack of heartbeat because sleep blocks.
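
For comparison, a minimal sketch of the same server without the monkey patch, calling gevent.sleep directly so the hub keeps servicing heartbeats (same class and address as above):

import gevent
import zerorpc

class Foo(object):

    def wait(self):
        print "Start waitin"
        gevent.sleep(15)  # cooperative sleep: yields to the hub, so heartbeats still go out
        print "End waitin"
        return "derp"

s = zerorpc.Server(Foo())
s.bind("tcp://0.0.0.0:3333")
s.run()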

This is a simplification of a larger bug I've been having when dealing with a bunch of workers that all call subprocess to spawn an external executable and do something that takes a while. (With the newest gevent from trunk, there is a subprocess monkey patch).

Any ideas? Zerorpc 0.2.1, gevent 1.0.

@makmanalp
Author

I got a chance to pull this closer to what I'm experiencing. Server:

import zerorpc

import time
from gevent import monkey; monkey.patch_all()

class Foo(object):

    def wait(self, x):
        print "Start waitin %s" % x
        time.sleep(15)
        print "End waitin %s" % x
        return "derp %s" % x

s = zerorpc.Server(Foo(), pool_size=2)
s.bind("tcp://0.0.0.0:3333")
s.run()

Client:

import zerorpc

c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)

work = ["a", "b", "c", "d"]
futures = [c.wait(x, async=True) for x in work]
[future.get() for future in futures] 

Result: the same errors as above, plus an additional "LostRemote: Lost remote after 10s heartbeat" on the client side before the server can complete and send the result, even though time is monkey-patched! Also, the server keeps going on to c and d even though the error comes on the client side after a and b:

truffle:trial makmanalp$ ipython main.py
Start waitin b
Start waitin a
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c75-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c76-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c74-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c77-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c75-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c7a-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
zerorpc.ChannelMultiplexer, unable to route event: _zpc_hb {'response_to': '03698c74-6d6a-4fa0-b741-5383ff08a10f', 'zmqid': ['\x00n>\xf0\x88\xbfE'\xbc\xa1\xec\xc3\xdb\xaf\xd4"'], 'message_id': '03698c7b-6d6a-4fa0-b741-5383ff08a10f', 'v': 3} [...]
End waitin b
Start waitin c
End waitin a
Start waitin d
End waitin c
End waitin d

So it looks like the tasks are at least getting submitted correctly, but the server is oblivious to the client disconnection.

UPDATE: While messing around, I found that removing async fixes the problem; I figured it was because the code now does only one request at a time. That led me to mess around with the pool size, and it turns out that if I leave async in but remove the pool size limit, the LostRemote doesn't happen! Am I missing some basic assumption about how this is supposed to work?

@bombela
Member

bombela commented Jun 28, 2012

Hi,

As you saw in your first example, using time.sleep() without monkey patching puts the whole process to sleep, freezing all asynchronous IO activity. The client then doesn't receive any heartbeat from the server anymore and complains after 10s (the default heartbeat frequency is 5s, and it aborts after 2 missing heartbeats).

If you really want to do something like that, you can disable the heartbeat on both sides by passing heartbeat=None to both the server and the client constructors.
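
A minimal sketch of that, reusing the Foo class from the repro above (server and client are of course separate processes):

import zerorpc

# server process: heartbeat=None means the server never heart-beats
s = zerorpc.Server(Foo(), heartbeat=None)
s.bind("tcp://0.0.0.0:3333")
s.run()

# client process: same flag, so it doesn't expect heartbeats either
c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000, heartbeat=None)
print c.wait()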

But this is probably not what you want: you want to use a monkey-patched or, more specifically, a gevent-compliant version of subprocess. I can't help you much with gevent 1.0 since zerorpc was never tested against it (we are still using the version available on PyPI).

I believe gevent 1.0 ships a gevent-friendly version of subprocess. Otherwise, for gevent < 1.0, you can try gevent_subprocess (pip install gevent_subprocess, https://github.com/bombela/gevent_subprocess).
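
Something along these lines, for example (a sketch only; the Worker class and tool path are made up), so the external command blocks just one greenlet while heartbeats keep flowing:

import zerorpc
from gevent import subprocess  # gevent >= 1.0: cooperative counterpart of the stdlib subprocess

class Worker(object):

    def run_tool(self, arg):
        # blocks only this greenlet, not the whole process,
        # so the server keeps answering heartbeats in the meantime
        return subprocess.check_output(["/usr/local/bin/some-tool", arg])  # hypothetical executable

s = zerorpc.Server(Worker())
s.bind("tcp://0.0.0.0:3333")
s.run()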

In your second comment, you limited pool_size to 2 concurrent requests. When you call wait() 4 times in quick succession, only two calls can be processed right away, and both take 15s to complete. Meanwhile, the two pending calls are still not connected, and the client's heartbeat gives up after 10s.

Intuitively, it would make more sense for the 2 pending requests to wait until the server can accept a new request (at least until the 30s timeout kicks in). But because for the moment there is one heartbeat per request (!!! backward compatibility !!!), the server will not start heart-beating until a request is actually being processed.

This discrepancy will be fixed at some point; we need to fix it here at dotCloud anyway to go further with our load-balancing strategy (and it will be fixed with respect to backward compatibility, so everything will still be able to talk happily to everything else).

For the moment, if it's a really big problem for you, you can disable the heartbeat on both sides (but then streaming can't be used anymore).

Regards,
fx

@makmanalp
Author

Hey,
Thanks for taking the time to reply. The latter was more of a concern for me. The gevent 1.0 subprocess works well.

Yeah, I'd expected the server to have a task queue of sorts and the heartbeat to be per-server instead of per-request. As a workaround, I'll just limit how many tasks I send at a time instead of dumping them all in at once and letting the server handle the limiting.
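
A sketch of that workaround, assuming a client-side gevent pool capped at the server's pool_size so no accepted call sits waiting past the heartbeat window (names reused from the repro above):

import zerorpc
from gevent.pool import Pool

c = zerorpc.Client("tcp://0.0.0.0:3333", timeout=3000)

# never have more in-flight calls than the server's pool_size (2 in the repro above),
# so every request the server accepts starts heart-beating right away
pool = Pool(2)

work = ["a", "b", "c", "d"]
results = pool.map(c.wait, work)  # each greenlet makes one synchronous wait() call
print results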

Meanwhile, I'll leave the report open if that sounds fine to you. If you need a hand fixing the way heartbeats work, I'd be glad to have a go at it at some point.

Cheers,
~mali

@v3ss0n

v3ss0n commented Apr 20, 2013

I am also looking forward to this. But zerorpc development seems to be slowing down: no meaningful update for a month already, and the last real updates were 2 months ago. ZeroRPC is a major product and something that makes dotCloud proud, right?

@bombela
Member

bombela commented Jun 16, 2015

If it makes you feel better, I took over the maintenance of zerorpc recently. Hopefully I can keep it maintained, and eventually move it forward.
