
rosbridge_server hanging randomly under load #425

Closed
ghost opened this issue Aug 8, 2019 · 61 comments

@ghost commented Aug 8, 2019

Expected Behavior

rosbridge_server does not hang.

Actual Behavior

rosbridge_server hangs under moderate load: several connections at fairly high message rates (50 Hz), but relatively little data throughput.

Steps to Reproduce the Problem

  1. Run rosbridge_server v0.11.2 on an RPi 3B+.
  2. Open several connections to it from a web browser.
  3. Wait.

Specifications

  • ROS Version (echo $ROS_DISTRO): kinetic
  • OS Version (grep DISTRIB_CODENAME /etc/lsb-release): Raspbian GNU/Linux 9 (stretch)
  • Rosbridge Version (roscat rosbridge_server package.xml | grep '<version>'): 0.11.2
  • Tornado Version (python -c 'import tornado; print tornado.version'): 4.4.3

We are having trouble with rosbridge_server (0.11.2) freezing under a fair load (a number of connections on a somewhat flaky network, rather than high throughput). The freeze appears to be random and occurs after some period, typically 10-30 seconds.

I investigated, and it looks like a race condition on _write_lock, similar to the one in issue #403, but not solved by the change in #413 (which is present in 0.11.2).

After some testing, and reading through the issue reports, I noticed that somebody had suggested that the behaviour of "yield" inside a "with lock" block wasn't well defined. So I tried:

@coroutine
def prewrite_message(self, message, binary):
    # Use a try block because the log decorator doesn't cooperate with @coroutine.
    try:
        with self._write_lock:
            future = self.write_message(message, binary)

        # When closing, self.write_message() returns None, even though that return
        # value is undocumented. Treat it as a WebSocketClosedError.
        if future is None:
            raise WebSocketClosedError

        yield future

That is, I un-indented the code after the "future =" line, so the lock is released once the write call has been made, but before the yield.

Immediately, the behaviour seems to be fixed, and it has now been running robustly for several minutes without freezing again. I cannot comment at this stage on whether there have been any side effects, or data corruption, but the upstream signals look fine so far.

@mvollrath (Contributor)

Somebody else was having issues with Tornado 4.4.3; you might try 4.5.3. Yielding without the lock disconnects websocket backpressure and can lead to OOM on the server if it can't keep up with outgoing messages. If your application is not very demanding, this is fine.
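
To illustrate the backpressure point, here is a minimal sketch using Tornado's own async lock. It is not the rosbridge handler; slow_flush and the send_* functions are made-up names standing in for write_message() and the two locking strategies. Holding the lock across the yield means the next send cannot start until the previous write has actually flushed to the slow client; releasing the lock before the yield lets senders start writes faster than the socket drains them, which is where memory growth comes from.

from tornado import gen, ioloop, locks

write_lock = locks.Lock()  # Tornado's coroutine-friendly lock

@gen.coroutine
def slow_flush(msg):
    # Stand-in for write_message(): the future resolves only once the
    # message has actually been flushed to a (slow) client.
    yield gen.sleep(0.5)
    print("flushed %d" % msg)

@gen.coroutine
def send_with_backpressure(msg):
    with (yield write_lock.acquire()):
        # The lock is held across the flush, so concurrent senders wait
        # until the socket has drained: this is the backpressure.
        yield slow_flush(msg)

@gen.coroutine
def send_without_backpressure(msg):
    with (yield write_lock.acquire()):
        future = slow_flush(msg)   # start the write while holding the lock...
    yield future                   # ...but only wait for it after releasing it

@gen.coroutine
def main():
    # With backpressure the three sends run in series (~1.5 s); without it
    # they all start immediately and unflushed writes pile up in memory.
    yield [send_with_backpressure(i) for i in range(3)]

if __name__ == "__main__":
    ioloop.IOLoop.current().run_sync(main)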

@ghost (Author) commented Aug 13, 2019

Thanks for your help with this.

This is a production system and raspbian.raspberrypi.org are currently shipping 4.4.3, so I'm not brave enough at this juncture to pursue that test.

For information, here's our latest:

  • We've rationalised the numerous signals moving through rosbridge to a single stream; this seems to eliminate the freezing problem, without the patch, but we haven't soak tested it...

  • ...because, for my peace of mind, we're going with the patched version of websocket_handler.py described above.

I'll report back if we learn anything further - thanks again.

@mvollrath (Contributor)

I hope to solve most Tornado issues by removing Tornado from rosbridge.

@reinzor commented Sep 18, 2019

Experiencing the same issues: some data is not being published anymore and some data keeps its old value. Will try #426.

@nickvaras commented Feb 13, 2020

Same random hanging here.
EDIT
The issue in my case seems to be at the roslibjs level, not the websocket server, and it's not random anymore: it happens when I switch networks. The specific symptom is some callbacks not executing, even though the websocket is connected successfully. Good luck with this issue!

@reinzor commented Feb 13, 2020

What is the "hanging" you experience? We sometimes experience that some old messages are received via the websocket on the client side. And some messages are just not coming in anymore on the client side.

@reinzor commented Feb 14, 2020

The hanging occurred again, but when I visit the websocket page, it seems to work fine:

[screenshot: websocket status page]

Some messages are queued or not coming in anymore, especially the large ones.

@Axel13fr

The hanging occurred again with version 0.11.4. In my case, when the websocket hangs, I can still rosnode ping it and I can still netcat to its TCP port, but nothing else works. For example, an HTTP GET on it, as above, doesn't give anything back.

And when it starts hanging, clients get disconnected, they can't reconnect, and the /connected_clients list doesn't get updated.

@mvollrath (Contributor)

I'll be working on an automated stress test to reproduce these kinds of issues.

@Axel13fr commented Mar 1, 2020

@mvollrath here are some more details of all the testing we did and their results. I haven't started diving into the code yet.

@nickvaras the problem occurs when switching networks, as we reproduced it, but it is not limited to that.

We managed to reproduce the problem as follows:

  • setting a ping timeout and a ping interval so that non-responsive clients get kicked out by the server
  • using a web front end with roslibjs to connect to the websocket server (which subscribes to about 30 topics and 15 services)
  • after a successful connection from the front end, suddenly disconnecting it, for example by switching off Wifi (after some time, due to the ping timeout, the server will disconnect the client). The front end won't detect that it's not connected anymore, due to long websocket timeouts.
  • connecting the Wifi/network back, so the front end suddenly receives a socket "close".
  • when the front end reconnects a few seconds later, the server registers the topics as before and then HANGS.

Note that this problem happens with both 0.11.3 and 0.11.4, so it seems independent of the websocket implementation. We have had other occurrences of this hanging but couldn't reproduce them.

Note as well that we ran the following stress test: opening about 10 front end instances at the same time, so that the rosbridge server load would hit 100% CPU on our embedded PC while registering all the topics. This didn't cause any issues. But when we did the same thing with just one client connecting as described above, it did. So I'm not sure at all that this is a load issue; it looks more like a TCP-socket-still-alive problem...

@mvollrath (Contributor)

Ok, I'll see about automating a WebSocket client abort. The stress test was working but not revealing anything particularly interesting to this hanging problem.

You might try making sure that all topic subscriptions have a non-zero queue_length; when that value is zero (the default), one client can block all the other clients. Having a queue decouples them.
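
For anyone who wants to see what that looks like at the protocol level, here is a minimal client-side sketch. It assumes the third-party websocket-client package, a rosbridge server on localhost:9090, and a placeholder /chatter topic; queue_length and throttle_rate are fields of the rosbridge protocol's subscribe op, and roslibjs exposes the same queue_length option on ROSLIB.Topic.

import json
import websocket  # pip install websocket-client

ws = websocket.create_connection("ws://localhost:9090")
ws.send(json.dumps({
    "op": "subscribe",
    "topic": "/chatter",            # placeholder topic for this sketch
    "type": "std_msgs/String",
    "queue_length": 1,              # non-zero: a slow client drops old messages
                                    # instead of blocking the other clients
    "throttle_rate": 0,
}))
print(ws.recv())  # blocks until rosbridge forwards the first publish on the topic
ws.close()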

@nickvaras

Thanks for the update @Axel13fr, that's encouraging progress. When the issue is reproduced, does it include the failure to display the "splash" screen, for lack of a better word, when visiting the websocket server in the browser? i.e., http://<ip_address>:9090

> AutobahnPython 17.10.1 I am not Web server, but a WebSocket Endpoint.

Thanks again!!

@Axel13fr commented Mar 2, 2020

@nickvaras yes, it does include the failure to display the websocket splash screen when the issue is reproduced.

@Axel13fr commented Mar 4, 2020

@mvollrath I tried a queue length of 1 and of 5 for all my subs, but it didn't help. Reproducing the issue is easier under high load.

The issue doesn't happen under kinetic on 16.04 (partly because its tornado version doesn't kick clients after a ping timeout).

On 18.04 with 0.11.4, when setting very high timeouts so that suddenly disconnected clients are not kicked out, the websocket comes under high load on reconnection (probably because the TCP buffers are full): ROS messages stop coming, BUT the websocket thread is still alive (the autobahn splash screen still responds). Compared to 16.04, the websocket still hanged in some cases...

@nickvaras

> ROS messages stop coming BUT the websocket thread is still alive (the autobahn splash screen still responds).

I've seen this too. Furthermore, I've seen what appears to be a frozen topic for a few minutes, only for it to quickly unravel a bunch of buffered messages after a while; i.e., messages published at 1 Hz stop coming through, and then all of a sudden they all come in quick succession at a much higher frequency.

@Axel13fr commented Mar 4, 2020

The bunch of buffered messages is due to the following: when the rosbridge server doesn't kick the client out on timeout, it assumes the client is still connected and continues to write to the TCP socket. The OS buffer of the TCP socket fills up, so when your front end / client comes back online, it receives the whole buffered content as fast as the connection allows.

But after that, I've seen the websocket hanging in some cases.

@mvollrath (Contributor)

I tried to repro this with a couple of different mobile devices but couldn't quite hit the magic. Then I tried sending SIGSTOP to a Firefox client and got the condition where other clients are hanging.

The problem in rosbridge is that while a QueueMessageHandler thread is blocking in the websocket OutgoingValve, it also blocks adding new messages to the queue and pushing off the old ones, which then blocks the rospy topic thread. This means the queue decoupling I mentioned doesn't actually work.

The solution is to not hold the queue Condition while pushing data to the OutgoingValve. Please try this PR and see if it fixes things for you.
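
For readers following along, here is a rough sketch of that pattern; the class and attribute names are simplified stand-ins, not the actual rosbridge code. The key point is that the Condition is released before the potentially blocking push, so the rospy topic thread can keep enqueueing (and old messages can keep falling off the bounded queue) while a slow client is being written to.

import threading
from collections import deque

class QueueMessageHandler(threading.Thread):
    def __init__(self, push_outgoing, queue_length=10):
        super(QueueMessageHandler, self).__init__()
        self.daemon = True
        self.queue = deque(maxlen=queue_length)   # bounded: old messages fall off
        self.condition = threading.Condition()
        self.push_outgoing = push_outgoing        # may block while the socket is congested
        self.alive = True

    def handle_message(self, msg):
        # Called from the rospy topic thread; must never block for long.
        with self.condition:
            self.queue.append(msg)
            self.condition.notify()

    def finish(self):
        with self.condition:
            self.alive = False
            self.condition.notify()

    def run(self):
        while True:
            with self.condition:
                while self.alive and not self.queue:
                    self.condition.wait()
                if not self.alive:
                    return
                msg = self.queue.popleft()
            # Condition released here: a blocking write no longer stalls
            # handle_message(), and through it, the rospy topic thread.
            self.push_outgoing(msg)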

@Axel13fr commented Mar 5, 2020

Cheers, I'll thoroughly test it next week on all our cases and variants.

@Axel13fr commented Mar 9, 2020

So the PR solves the issue when reconnecting a front end while rosbridge assumed the connection was still alive (no kick-out by timeout).

But if a ping timeout parameter is configured and the sudden disconnection exceeds it, resulting in rosbridge kicking the client, the hanging still occurs on reconnection.

@mvollrath (Contributor)

This could be the issue with unregistration; when I isolated it today, I found that under some conditions a reconnecting client can't re-register the topics it had previously if it was the only client on those topics. The known hack for this is to never unregister anything, see #399.

You can try this branch to see if it fixes the issue as now described. It remains to be seen whether this is something we can merge.

@Axel13fr

We tested the branch that never unregisters anything, but it still hangs on reconnection after kicking the client due to timeout.

In this case, the websocket server itself seems to hang, as HTTP GET requests on its port do not return the usual blue autobahn banner mentioned above by nickvaras.

@nickvaras commented Mar 10, 2020

Yes, I tested the changes from #464 on a never-unregister clone, and it still hangs.

@reinzor commented Mar 13, 2020

Same here, tested using #464. What we are seeing:

  • Autobahn screen is responsive
  • Some topics are not coming in anymore
  • Latched topics hold old value, are not updating

@mvollrath (Contributor) commented Mar 13, 2020

Anybody who is experiencing this issue while using this branch (which includes #464), please comment with this survey of which rosbridge features you are using, and maybe we can find a correlation.

My ROS distro:
* [ ] Publisher (publishing from websocket to ROS)
* [ ] Latched Publisher
* [ ] Subscriber (from ROS to websocket)
* [ ] Latched Subscriber
* [ ] Services
* [ ] Actionlib
* [ ] tf republisher
* [ ] rosapi
* [ ] png compression
* [ ] bson encoding
* [ ] cbor encoding
* [ ] websocket compression
* [ ] websocket ping
* [ ] authentication
* [ ] wifi/other wireless networking
* [ ] Python 3 (when running server)

The application I'm working on does NOT experience the issue:

My ROS distro: melodic

  • Publisher (publishing from websocket to ROS)
  • Latched Publisher
  • Subscriber (from ROS to websocket)
  • Latched Subscriber
  • Services
  • Actionlib
  • tf republisher
  • rosapi
  • png compression
  • bson encoding
  • cbor encoding
  • websocket compression
  • websocket ping
  • authentication
  • wifi/other wireless networking
  • Python 3 (when running server)

@reinzor commented Mar 16, 2020

For our application:

ROS Melodic:

  • Publisher (publishing from websocket to ROS)
  • Latched Publisher
  • Subscriber (from ROS to websocket)
  • Latched Subscriber
  • Services
  • Actionlib
  • tf republisher
  • rosapi
  • png compression
  • cbor encoding
  • websocket compression
  • websocket ping
  • authentication
  • wifi/other wireless networking
  • Python 3 (when running server)

@nickvaras commented Mar 16, 2020

In mine (which does experience the hang):

My ROS distro: Melodic

  • Publisher (publishing from websocket to ROS)
  • Latched Publisher
  • Subscriber (from ROS to websocket)
  • Latched Subscriber
  • Services
  • Actionlib
  • tf republisher
  • rosapi
  • png compression
  • bson encoding
  • cbor encoding
  • websocket compression
  • websocket ping
  • authentication
  • wifi/other wireless networking
  • Python 3 (when running server)

@mvollrath (Contributor)

Thank you, I've updated the survey with "authentication" and "bson encoding"; please update your response if you use either of these features. Also let me know if there are any more features I'm forgetting or don't know about that might be relevant.

@Axel13fr commented Apr 1, 2020

Hi Gents,
Let's make sure rosbridge doesn't get sick :-)

My ROS distro: Melodic

  • Publisher (publishing from websocket to ROS)
  • Latched Publisher
  • Subscriber (from ROS to websocket)
  • Latched Subscriber
  • Services
  • Actionlib
  • tf republisher
  • rosapi
  • png compression
  • bson encoding
  • cbor encoding
  • websocket compression
  • websocket ping
  • authentication
  • wifi/other wireless networking
  • Python 3 (when running server)

@J-Rojas commented May 13, 2020

I'm running into this regularly when I publish a high-frequency /tf topic at 100 Hz across the bridge. Client disconnects/reconnects are not possible, and the entire bridge process appears to be deadlocked and needs to be killed. The only way to mitigate this is to slow down the publishing rate.

@mvollrath I will try out your fix! Great timing!

@mvollrath (Contributor)

I'll roll out 0.11.7 with #496 and we will see what happens. It's quite possible we will find more concurrency glitches in rosbridge_library, since we've closed the door on deadlock and opened the door to race conditions.

@nickvaras commented May 14, 2020

Note: this is for the standard released version, not mvollrath's fork, which I'm very eager to try when I have a chance.

Not sure if this is mentioned above; it might be old news, but I'll post it just so people can find it a bit more easily.
After several minutes of the websocket being in the hung state, the plug finally comes off and all those buffered messages come rushing in at warp speed, while the following messages are logged:

[ WARN ] [/rosbridge_websocket]: StreamClosedError: Tried to write to a closed stream
[ WARN ] [/rosbridge_websocket]: WebSocketClosedError: Tried to write to a closed websocket

@Axel13fr commented May 14, 2020

We have tested #496 under high load; it seems to not hang anymore, but we've noticed the following issues so far:

  • some clients manage to receive all their subscribed topics while, at the same time, some other newly connected clients get only some topics. When inspecting the front end, such a client receives only certain specific topics, and always the same ones (not sometimes topic A or B or C, but always topic A and never B or C)
  • over time, during the same session, a client not receiving all topics may receive more. This sounds like a performance issue.
  • when using this version in combination with the ping timeout parameter, the following error occurs when the websocket kicks a client:

2020-05-14 12:26:44+0000 [RosbridgeWebSocket,0,192.168.7.200] Unhandled Error
Traceback (most recent call last):
  File "/workspace/install/lib/rosbridge_server/rosbridge_websocket", line 320, in <module>
    reactor.run()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1255, in mainLoop
    self.doIteration(t)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/epollreactor.py", line 235, in doPoll
    log.callWithLogger(selectable, _drdw, selectable, fd, event)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite
    self._disconnectSelectable(selectable, why, inRead)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 255, in _disconnectSelectable
    selectable.connectionLost(f)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 289, in connectionLost
    protocol.connectionLost(reason)
  File "/usr/lib/python2.7/dist-packages/autobahn/twisted/websocket.py", line 128, in connectionLost
    self._connectionLost(reason)
  File "/usr/lib/python2.7/dist-packages/autobahn/websocket/protocol.py", line 2467, in _connectionLost
    WebSocketProtocol._connectionLost(self, reason)
  File "/usr/lib/python2.7/dist-packages/autobahn/websocket/protocol.py", line 1098, in _connectionLost
    self._onClose(self.wasClean, self.remoteCloseCode, self.remoteCloseReason)
  File "/usr/lib/python2.7/dist-packages/autobahn/twisted/websocket.py", line 171, in _onClose
    self.onClose(wasClean, code, reason)
  File "/workspace/install/lib/python2.7/dist-packages/rosbridge_server/autobahn_websocket.py", line 248, in onClose
    self.protocol.finish()
  File "/workspace/install/lib/python2.7/dist-packages/rosbridge_library/protocol.py", line 274, in finish
    capability.finish()
  File "/workspace/install/lib/python2.7/dist-packages/rosbridge_library/capabilities/subscribe.py", line 346, in finish
    subscription.unregister()
  File "/workspace/install/lib/python2.7/dist-packages/rosbridge_library/capabilities/subscribe.py", line 88, in unregister
    manager.unsubscribe(self.client_id, self.topic)
  File "/workspace/install/lib/python2.7/dist-packages/rosbridge_library/internal/subscribers.py", line 219, in unsubscribe
    self._subscribers[topic].unsubscribe(client_id)
exceptions.KeyError: u'/registered_simulators'

@mvollrath (Contributor)

Thanks @Axel13fr , now working on thread safety for capabilities.

@mvollrath (Contributor)

Please try #500, which adds thread safety at the capability level.

@Axel13fr commented May 18, 2020

We tested #500; the hanging happened on that version too: clients cannot connect to the socket anymore, and an HTTP GET on the autobahn websocket port doesn't respond.

@ifollowbenoit

We tested with #496 under very high load, and it does not hang anymore, as @Axel13fr stated. Even after a very long time without any connection, it still manages to recover.

@mvollrath (Contributor)

Thanks @Axel13fr, I'll add a handler for the ping-kick exception as a quick fix, since we still need to finish shutting down the connection.

@mvollrath (Contributor)

#502 should fix the badness caused by the ping kick error, but it will still be logged.

@Axel13fr

@mvollrath we tested #502 and we can still cause it to hang.
It seems to hang in the following case: when disconnecting a client (cleanly or not, e.g. by unplugging the cable), if the bridge is sending a message from ROS to the front end at that moment, it will hang (with the HTTP GET on the autobahn websocket port no longer responding).

When we stopped ROS message transmission to the bridge subscribers, so that nothing was sent to the front end, the hanging didn't happen.

@mvollrath (Contributor)

I've updated #502 to finish the protocol in the IncomingQueue thread; this should break the incoming/outgoing loop in onClose and unblock the reactor. There may be exceptions logged during shutdown.
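
As a rough illustration of that idea (a sketch with simplified, assumed names, not the PR's actual code): onClose() on the reactor thread only enqueues a shutdown marker and returns immediately, and the incoming-queue worker is the one that calls protocol.finish(), so the reactor is never blocked on connection teardown.

import threading

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2

_SHUTDOWN = object()  # sentinel pushed by onClose()

class IncomingQueue(threading.Thread):
    def __init__(self, protocol):
        super(IncomingQueue, self).__init__()
        self.daemon = True
        self.queue = queue.Queue()
        self.protocol = protocol

    def push(self, msg):
        self.queue.put(msg)

    def push_shutdown(self):
        # Called from onClose() on the reactor thread; returns immediately.
        self.queue.put(_SHUTDOWN)

    def run(self):
        while True:
            msg = self.queue.get()
            if msg is _SHUTDOWN:
                # Teardown happens here, off the reactor thread.
                self.protocol.finish()
                break
            self.protocol.incoming(msg)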

@nickvaras commented May 20, 2020

(Update: I'm beginning to believe I have a bit of this too: #437)

Original post:

Thank you @mvollrath for keeping the hope alive. Question: does #502 include #464? I tested today and it seems like an improvement regarding the hanging from sudden disconnection (the autobahn splash screen remains accessible, services over the websocket continue working), but (at least some) topics ceased to come through when an additional client was connected, even though they are still being published (rostopic hz ... etc).
(I use the never-unregister patch/hack.)

After a 5-10 minute wait, the clients started receiving the topic again.

I seem to be able to replicate the hanging at will by:
- connecting a client (A) through the building wifi network on device 1
- connecting another client (B) directly to the robot's wifi on device 2
- abruptly turning off the wifi on device 2
- connecting again on device 2, this time through the building wifi

That seems to do it.

Also, just timed how long it takes the topics to come back to life: ~16 minutes.

@mvollrath (Contributor)

> Question: does #502 include #464?

Yes.

> I seem to be able to replicate the hanging at will by:

Interesting, and it doesn't happen when reconnecting on the same wifi? Does rosbridge show the usual log messages for removing then adding the client?

@mvollrath (Contributor)

Also, if you're not already, make sure all your topics have a non-zero queue_length, or else they're all coupled to the rospy.Subscriber thread together.

@nickvaras

@mvollrath you are the man! This latest version, along with an explicit, non-zero queue_length for each topic, has finally done it. I did try a non-zero queue_length on a previous server version and still couldn't get it to stay up. Now I can pull the plug on one client and the other clients keep receiving their messages the way one would expect. I've spent the greater part of the last 20 hrs looking into this and finally things are looking up. You have my profound, heartfelt gratitude! This really helped! 👍

@mvollrath (Contributor)

Releasing #502 in 0.11.8 for kinetic and melodic. Thanks to everybody who has helped track this down and test:

@benmitch
@reinzor
@nickvaras
@Axel13fr
@BrodieBirko
@ifollowbenoit
@basecoplan
@J-Rojas

Let me know how it goes, of course; my application only uses a small set of rosbridge's features, so we depend on your reports.

@Axel13fr commented May 26, 2020

Congratulations @mvollrath for supporting us all along on this. We have tested it along with the queue length in the frontend, and so far there is no way to reproduce the hanging! We will continue production testing in the coming weeks and keep you posted.

Shouldn't there be a default queue length of 1 in roslibjs to make sure this will not happen to anyone again?

@reinzor commented Jun 12, 2020

Hanging still seems to occur with 0.11.8; I missed the non-zero queue_length setting. We will try this now.

@reinzor commented Jun 12, 2020

When setting this queue_length, I discovered that it is not possible for the ActionClient: its subscribers are created without the queue_length argument, so it defaults to zero. If the hanging is indeed fixed by a non-zero queue_length, this should also be addressed.

@github-actions

This issue has been marked as stale because it has been open for 180 days with no activity. Please remove the stale label or add a comment to keep it open.

@stribor14

This issue has the most detailed info about this bug, which is still happening:

  • ROS version : noetic
  • Linux version : Ubuntu 20.04
  • rosbridge_server version : 0.11.16-1focal.20221124.044710

I believe this is also linked to the newer open issues #829 and #765, and maybe #720.

I am posting an update here because the comment quoted below lists the exact steps we use to reproduce this issue:

  • first client on Device 1 through Network 1
  • second client on Device 2 through Network 1
  • we connect Device 2 to Network 2
  • refresh Device 2
  • all devices hang (the rosbridge server doesn't serve topic messages to clients anymore)
  • interestingly, services still work

> (Update: I'm beginning to believe I have a bit of this too: #437)
>
> Original post:
>
> Thank you @mvollrath for keeping the hope alive. Question: does #502 include #464? I tested today and it seems like an improvement regarding the hanging from sudden disconnection (the autobahn splash screen remains accessible, services over the websocket continue working), but (at least some) topics ceased to come through when an additional client was connected, even though they are still being published (rostopic hz ... etc). (I use the never-unregister patch/hack.)
>
> After a 5-10 minute wait, the clients started receiving the topic again.
>
> I seem to be able to replicate the hanging at will by: connecting a client (A) through the building wifi network on device 1; connecting another client (B) directly to the robot's wifi on device 2; abruptly turning off the wifi on device 2; connecting again on device 2, this time through the building wifi.
>
> That seems to do it.
>
> Also, just timed how long it takes the topics to come back to life: ~16 minutes.
