
issues scaling across processes with the Redis store #686

Closed
brettkiefer opened this issue Dec 15, 2011 · 12 comments
Labels
bug Something isn't working

Comments

@brettkiefer

First, I'd like to say thank you to the Socket.io devs for the great work, and for providing a way to scale this across processes with the Redis store! I think this is really promising, even though we've been encountering a few issues.

We've been having some problems scaling up socket.io with the Redis store. The problem appears to be that the pubsub is too chatty. For instance, each time a socket disconnects, every other process subscribes to the dispatch:[id] channel for that socket, waits for the close message, then unsubscribes from that channel. Now consider the case where you have to restart your entire service (say, on a non-backward-compatible release). Call the TOTAL number of sockets connected across all of your worker processes n, and the number of processes p. Each process has to subscribe and unsubscribe once for every socket, and the cost of an unsubscribe in Redis is linear in the number of clients already subscribed to the channel (p here). So that would seem to be O(n*p^2) work for Redis, and in any case O(n) work for every worker process.
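
To make the churn concrete, here is a minimal sketch of that per-socket pattern (illustrative names, not the actual socket.io source), using the node_redis client:

```js
var redis = require('redis');
var sub = redis.createClient();
var waiting = {}; // channel -> handler for the expected close message

sub.on('message', function (channel, message) {
  var handler = waiting[channel];
  if (handler) handler(message);
});

function watchDisconnect(socketId, onClose) {
  var channel = 'dispatch:' + socketId;
  waiting[channel] = function (message) {
    delete waiting[channel];
    sub.unsubscribe(channel); // UNSUBSCRIBE cost grows with the number of subscribed clients
    onClose(message);
  };
  sub.subscribe(channel); // one SUBSCRIBE per socket, in every worker
}
```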

Since we are only targeting websockets, we are working around this by disabling all pubsub channels except handshake, and only doing the initial handshake broadcast (so all processes other than the one that completes the handshake simply discard the data from the 'handshaken' hash as though the handshake was never finished). This appears to mitigate the issues sufficiently that we can run websockets via socket.io in production across our 36 node.js processes on 3 machines, but it still obviously won't scale indefinitely because every process still has to process a handshake message on the 'handshake' channel for each handshake initiated.

I propose that this would scale across processes if it used Redis as a store rather than publishing all data to all workers in these cases. For instance, rather than publishing on 'handshake', write a handshake:[id] key in Redis with a timeout of 30s and then let the process that completes the handshake update that key (or delete it and write a handshake-completed:[id] key).
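
A rough sketch of what that key-based handshake flow could look like (the key names and 30s TTL follow the suggestion above; everything else is illustrative):

```js
var redis = require('redis');
var client = redis.createClient();

// The process that receives the handshake request writes a volatile key instead
// of publishing to every worker; the key expires on its own if nobody completes
// the handshake.
function storeHandshake(id, handshakeData, cb) {
  client.setex('handshake:' + id, 30, JSON.stringify(handshakeData), cb);
}

// Whichever process completes the handshake promotes the key; no other worker
// ever has to hear about it.
function completeHandshake(id, cb) {
  client.get('handshake:' + id, function (err, data) {
    if (err || !data) return cb(err || new Error('unknown or expired handshake'));
    client.del('handshake:' + id);
    client.set('handshake-completed:' + id, data, cb); // expiry/cleanup policy left open here
  });
}
```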

I understand that this gets more complicated for other transports, where buffering responses becomes important, but it seems like this might be better solved by storing the buffer as a capped list of JSON objects in Redis (our AJAX short-polling fallback uses that approach, and it works well), rather than using pubsub to pass them around.
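
For the buffered transports, the capped-list approach could look roughly like this (key names and the cap are assumptions, not anything the store does today):

```js
var redis = require('redis');
var client = redis.createClient();
var MAX_BUFFERED = 100; // assumed cap

// Append a message for a socket and trim the list so it never grows unbounded.
function bufferMessage(socketId, message) {
  var key = 'buffer:' + socketId;
  client.rpush(key, JSON.stringify(message));
  client.ltrim(key, -MAX_BUFFERED, -1); // keep only the newest MAX_BUFFERED entries
}

// Drain everything buffered for a socket, e.g. when a poll request comes in.
// (A real implementation would wrap this in MULTI to avoid losing messages.)
function drainMessages(socketId, cb) {
  var key = 'buffer:' + socketId;
  client.lrange(key, 0, -1, function (err, items) {
    if (err) return cb(err);
    client.del(key);
    cb(null, items.map(function (item) { return JSON.parse(item); }));
  });
}
```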

I apologize for the lengthy and unsolicited suggestion, and we'd love to help if y'all are interested in going this direction. Also, if we're just doing something in a completely wrong or unanticipated way, please let me know. Thanks!

@looksgood

+1

@3rd-Eden
Contributor

We already talked about this on IRC, and I'm definitely +1, as I'm seeing the same behavior in my own application.

It would be just as easy to have one listener for handshakes, disconnects, etc. instead of adding new listeners for each connection, which would probably make the stores much smoother to work with and give better overall performance.
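
For example, a single shared channel with the socket id in the payload would replace the per-socket subscribe/unsubscribe churn; here is a minimal sketch (the channel name and payload shape are made up for illustration):

```js
var redis = require('redis');
var sub = redis.createClient();
var handlers = {}; // socket id -> disconnect handler, for sockets this worker owns

// One subscription per worker, made at startup, instead of one per socket.
sub.subscribe('socket.io#disconnect');

sub.on('message', function (channel, raw) {
  if (channel !== 'socket.io#disconnect') return;
  var payload = JSON.parse(raw); // e.g. { id: '12345', reason: 'close' }
  var handler = handlers[payload.id];
  if (handler) handler(payload); // only the worker that owns the socket reacts
});
```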

Marked as bug + store, but I would love to see a pull request against this ;)

@brettkiefer
Author

Right now the decision is for us to continue to run with our current hack (turning off all pubsub channels except for the initial handshake propagation). We're not a good candidate for fixing the store, since we only run websockets in production and are happy with our short-polling fallback there for downlevel browsers (which are declining in number anyway). Also, it looks like guille's work on websocket.io and engine.io may obviate this kind of work.

@tglines

tglines commented Feb 21, 2012

+1

@masterkain

Any news about this?

@masterkain

@brettkiefer mind sharing the hack? :)

@brettkiefer
Author

@masterkain
I believe this was our final set of patches: https://gist.github.com/2428207
Keep in mind that this only works if you turn off all transports except websockets.

The important part is where we turn off subscription and unsubscription on all channels other than handshake.
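
The gist has the actual patch; purely as an illustration of the idea (the method names and signatures here are assumptions about the store's internals, not a drop-in replacement for the gist), it amounts to something like:

```js
var sio = require('socket.io');
var RedisStore = sio.RedisStore;

// No-op every subscription except 'handshake', so the only cross-process
// pubsub traffic left is the initial handshake fan-out.
var originalSubscribe = RedisStore.prototype.subscribe;
RedisStore.prototype.subscribe = function (name, consumer, fn) {
  if (name !== 'handshake') {
    if (fn) fn(); // pretend it worked, but never touch Redis
    return this;
  }
  return originalSubscribe.call(this, name, consumer, fn);
};

var originalUnsubscribe = RedisStore.prototype.unsubscribe;
RedisStore.prototype.unsubscribe = function (name, fn) {
  if (name !== 'handshake') {
    if (fn) fn();
    return this;
  }
  return originalUnsubscribe.call(this, name, fn);
};
```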

@Fauntleroy

+1

@smoku

smoku commented Sep 3, 2012

+1. Any forks?

@magcks

magcks commented Sep 12, 2012

+1

@smoku

smoku commented Sep 12, 2012

I implemented brettkiefer's hack and handle messages myself (each node.js process subscribes to the messages channel). I use websockets and flashsockets - those two transports seem to cover all major cases. However, I didn't test on IE6 and older browsers because I don't care :P
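
Roughly like this (the channel name and message shape are just illustrative):

```js
var redis = require('redis');
var sub = redis.createClient();
var pub = redis.createClient();

var io = require('socket.io').listen(8080);
io.set('transports', ['websocket', 'flashsocket']); // websockets + flashsockets only

// Every worker holds one subscription and delivers each message itself
// to whichever of its own connected sockets it is addressed to.
sub.subscribe('messages');
sub.on('message', function (channel, raw) {
  var msg = JSON.parse(raw);
  var socket = io.sockets.sockets[msg.to]; // only set if this worker owns the socket
  if (socket) socket.emit(msg.event, msg.data);
});

// Any worker can publish to any socket, wherever it is connected.
function sendTo(socketId, event, data) {
  pub.publish('messages', JSON.stringify({ to: socketId, event: event, data: data }));
}
```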

Sticking only to the websocket transport is also very good for your server - xhr-polling uses a lot of resources when there are a lot of messages.

@ruanb7

ruanb7 commented Apr 9, 2014

I just implemented the hack and so far it seems to be working. Will send it into testing and see if anything else happens.

How will socket.io v1.0 affect this hack? Will the team have fixed the cross-cluster scaling issue by then, so we can stop using the hack?

This issue was closed.