
High rate of "client not handshaken should reconnect" #438

Open
squidfunk opened this Issue August 01, 2011 · 286 comments
Martin Donath

I am running a chat server with node.js / socket.io and am seeing a lot of "client not handshaken" warnings. At peak times there are around 1,000 to 3,000 open TCP connections.

For debugging purposes I plotted a graph of the actions following the server-side "set close timeout" event, because the warnings are always preceded by it. The format is:

Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 2098080741242069807
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - xhr-polling closed due to exceeded duration
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 330973265416677743
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - setting request GET /socket.io/1/xhr-polling
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 10595896332140683620
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 10595896332140683620
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 21320636051749821863
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 21320636051749821863
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 3331715441803393577
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   warn  - client not handshaken client should reconnect

The plot explained:

  • x axis: the time elapsed between the first and last sighting of a client id.
  • y axis: the total number of clients for a given time x terminating with a specific message (client not handshaken, cleared close timeout, etc.)

[Plot image]

I did not change the default timeouts and intervals provided by socket.io, but I find it very strange that there is a peak of handshake errors at around 10 seconds (even surpassing the successful cleared close timeouts!). Has anyone experienced a similar situation?

Best regards,
Martin

Martin Donath squidfunk closed this August 04, 2011
Dennis

Hi,

Were you able to solve it? I am still having this problem with 0.7.8. I am not able to reproduce it on my machine, but I can see in the debug logs that some clients go really crazy (it looks like more than 50 reconnects/second). I have this problem only with jsonp and xhr connections; turning off one of them didn't help, though.

   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393095&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393150&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   ...

I am using custom namespaces btw.

Dennis

A TCP dump of one of those terror connections, if that helps (this goes on continuously).

10:21:13.450665 IP (removed).8433 > (removed).65471: Flags [P.], seq 1973377331:1973377336, ack 537321861, win 54, length 5
@.@....(!....E ...u.Y3 ...P..6L....2::.
10:21:13.480742 IP 77.12.111.190.64720 > 188.40.33.215.8433: Flags [P.], seq 29040:29700, ack 6557, win 4169, length 660
....E...V.@.z...M.o..(!... ..%.)M...P..I^p..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072071&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.481279 IP (removed).8433 > (removed).64720: Flags [P.], seq 6557:6706, ack 29700, win 9911, length 149
...$!).q..E.....@.@....(!.M.o. ...M....%..P.&..7..HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.504212 IP (removed).64725 > (removed).8433: Flags [P.], seq 21428:21915, ack 6249, win 4356, length 487
.M.o..(!... ....\W,..P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)


10:21:13.504593 IP 188.40.33.215.8433 > 77.12.111.190.64725: Flags [P.], seq 6249:6391, ack 21915, win 7846, length 142
...$!).q..E.....@.@....(!.M.o. ...W,.....CP.......HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 7
Connection: Keep-Alive
Access-Control-Allow-Origin: *

7:::1+0
10:21:13.542058 IP (removed).64720 > (removed).8433: Flags [P.], seq 29700:30360, ack 6706, win 4132, length 660
....E...V.@.z...M.o..(!... ..%..M.._P..$Ro..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072149&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)


10:21:13.542557 IP (removed).8433 > (removed).64720: Flags [P.], seq 6706:6855, ack 30360, win 9921, length 149
...$!).q..E.....@.@....(!.M.o. ...M.._.%.QP.&.....HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.567452 IP (removed).64725 >(removed)8433: Flags [P.], seq 21915:22402, ack 6391, win 4320, length 487
.M.o..(!... ....CW,.^P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)
Martin Donath

Hi denisu,

actually, for the last two weeks this high rate of handshake warnings has not been as important as the crashes of my client's chat server due to memory leaks. Those were hopefully fixed with the latest pull of 3rd-eden into master. In the next few days I will be at my client's site again to check whether the chat server has crashed and to investigate those handshake warnings again.

For now they don't seem to be very severe (at least not as severe as the crashes). I will keep you updated here.

Dennis

Hi, thank you for the reply!

You are right, it's not that critical. The server can handle this high rate of connections easily. I hope it doesn't cause any problems on the client side. I still wasn't able to reproduce it, but so far I have seen this problem with clients using Firefox 5.0, Opera 11 and IE 8.0 over the xhr and jsonp transports.

I will let you know if I find out more.

Stephen Pair

I've been having problems with this too...wondering whether there's something I should be doing on the client side that I'm not.

Dennis

Yeah, for about 500 connected clients, there are 2 to 5 clients constantly reconnecting (absolutely no delay). Clients with a super fast internet connection were reconnecting so fast that I had to block their IP in the firewall, since it was affecting the server.

@squidfunk can you reopen this issue?

Martin Donath squidfunk reopened this August 16, 2011
Martin Donath

I tried to investigate the problem but haven't found anything so far. However, it may be related to another problem, which is caused by the fact that socket.io doesn't enable the flashsocket transport by default. This leads to a problem in IE 6 and 7 when the chat server is not on the same domain as the client script, as IE 6 and 7 seem to prohibit cross-origin long polling (no problem with IE 8, though). Therefore it is not possible to connect with IE 6 or 7 at all. Maybe this is related.
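
For reference, a minimal sketch of what explicitly enabling the flashsocket transport could look like on the server (socket.io 0.x io.set API; the port and handler below are placeholder assumptions):

// Sketch only: explicitly enable flashsocket for older IE clients.
// Assumes socket.io 0.x listening on a plain HTTP port.
var io = require('socket.io').listen(8080);

io.set('transports', [
  'websocket',
  'flashsocket',   // not enabled by default
  'htmlfile',
  'xhr-polling',
  'jsonp-polling'
]);

io.sockets.on('connection', function (socket) {
  socket.emit('hello', { ok: true });
});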

I reopened it. Actually I closed it because I thought I was the only one experiencing the problem and that it may be too specific. But it doesn't seem to be.

Rafael Brizola

This problem happened to me when I was using Firefox 3.6.18 and changed the encoding from UTF-8 to ISO-8859-1.

Martin Donath

@rafaelbrizola: can you give a little more information? Did the "client not handshaken" warnings increase, or were you not able to connect at all? My chat server actually runs (has to run) in an ISO-8859-1 environment (sadly enough).

Dennis

My app runs in an ISO-8859-1 environment too, and the node server is not on the same domain as the client script. Maybe this could be a hint about what's wrong :). But I have seen all common browsers (except the ones that support WebSockets) having this problem. Still investigating...

Alexi Kostibas

I've been experiencing the same thing since moving to 0.7.7 (now on 0.7.9). Also with a cross-domain Node/webserver setup. The biggest problem is that I have to set the log level to 0 to avoid filling up the tiny EC2 disk.

Let me know if I can provide anything useful for debugging.

Scott Rushforth

Hi there,

Also experiencing this issue intermittently with a brand new deployment of socket.io 0.8.4.

Seems to occur 20-30 minutes after server start. Until then, everything is normal, everyone is happy.

But after about 20-30 minutes in, 90% of the messages in the log are simply 'client not handshaken should reconnect'.

Once the node instances are restarted (we are running 8 of them), the issue goes away, until it comes back 20-30 minutes later.

For what it's worth, we are load balancing (with IP-based session persistence) across all 8 of the node processes.

Would be happy to provide any more details on our setup. Cheers, and thanks for Socket.IO! Awesome!

Arnout Kazemier
Collaborator

I suspect that changing https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L157 from .end to .close would fix it, but I haven't been able to verify my findings yet. I also see the same rate of requests coming in.

It seems to be a combination of failures on the server side and in the client-side code.

Scott Rushforth

@3rd-Eden - I have had your suggested fix in production for almost 45 minutes now, and the rate of 'client not handshaken...' messages has dropped significantly. These messages still come in occasionally, but they are roughly equal in volume to log messages from real message-passing activity. I will continue to watch this issue for a bit, but you may have solved it.

Thanks a million! Will keep this thread up to date if my situation changes.

Scott Rushforth

I may have spoken a bit too soon. Given enough time, it seems that all the node instances do still fall back into the loop of client-not-handshaken messages. For some reason it seems to take longer before they need to be restarted, though.

Joe Faron

I'm seeing this issue as well with about 700 connected clients.. How do you re-connect?

My log is flooded with:
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect
warn - client not handshaken client should reconnect

tommypowerz

Having the same problem... also about 500-1000 connections...

but it is also crashing with this from time to time.. :

node: src/uv-common.c:92: uv_err_name: Assertion `0' failed.

and I don't know how to catch this...

Joe Faron

I cleaned up a lot of my code around quit()s on disconnects and fixed my client-side socket.io code to not poll longer than 20, and now I'm not getting this.. pretty sure it's the client-side code going wacky: reconnecting and starting dozens of connections per client.

xtassin

Only got this when using XHR polling or JSONP transports. It stopped when forcing to Sockets and FlashSockets only.

Ryan T

I'm seeing this happen with IE clients using XHR polling

nickiv

XHR polling seems the most suitable for me, so I keep digging into it.
I think the problem is in the socket's reconnect method. It calls the handshake method again and again, while the handshake requests (both jsonp and xhr) are never cancelled at all.
Under certain network conditions the responses to the handshake can be delayed, and when they eventually arrive the fierce reconnecting begins.

Now I have a method to reproduce the bug. Suppose we have a socket.io server running on port 8080. Connect a client via xhr-polling from FF. Then add a firewall rule on the server:
iptables -A OUTPUT -p tcp -m tcp --sport 8080 -j DROP
You can see some handshake requests pending in the network section of Firebug. Then drop the rule:
iptables -D OUTPUT -p tcp -m tcp --sport 8080 -j DROP
After that, the reconnecting begins.

In my opinion we should not call handshake unless the previous one fails. And there must be a timer in it to decide failure.
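
A rough sketch of that idea (generic JavaScript, not socket.io code; doHandshake and the timeout value are illustrative):

// Only start a new handshake once the previous attempt has failed or
// timed out, instead of firing new ones again and again.
var HANDSHAKE_TIMEOUT = 10000; // ms before a pending handshake counts as failed
var handshakePending = false;

function tryHandshake(doHandshake, onSession) {
  if (handshakePending) return;          // previous attempt still in flight
  handshakePending = true;

  var timer = setTimeout(function () {   // the timer that decides failure
    handshakePending = false;
  }, HANDSHAKE_TIMEOUT);

  doHandshake(function (err, sessionId) {
    clearTimeout(timer);
    handshakePending = false;
    if (!err) onSession(sessionId);
  });
}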

Dominiek ter Heide

Any news on this? This is pretty serious

Joe Lanman

I have this issue on socket.io 0.8.5; it takes up 100% CPU, and even if I kill node and restart, it starts straight back up with loads of 'client not handshaken' warnings. Any workarounds?

Update:

Tracked it down to a client running on IE9 - killed it and the issue has gone, but surely a single client shouldn't be able to do this?

Ryan T

@joelanman make sure your IE9 client is using socket.io client version 0.8.5.

Joe Lanman

thanks - must have just been a client/server mismatch

Scott Rushforth

Have been having this problem, even with matching client+server on 0.8.5, but this patch in this thread: #534 seems to have definitely helped.

Arnout Kazemier
Collaborator

@cowboyrushforth helped or fixed?

Scott Rushforth

I deployed the patch about 36 hours ago, and it seems that the handshake errors continue to slowly decrease over time. (As connected clients finally refresh and download new client side code)

I will continue to keep an eye on the rate of these things and report back in another day or two.

Scott Rushforth

Ok, it's now been 72+ hours since applying the patch in issue #534, and things seem much more stable. No out-of-control clients, and no high rate of handshake errors. Cheers

Liam Don

In which file should I apply the patch? Is this going to be fixed in 0.8.6?

Arnout Kazemier
Collaborator

@liamdon it's already in the master of socket.io-client, so yes, it will be available in the next release

Dennis

I don't think the root of the problem is solved in 0.8.6. The warnings are gone, but the clients are still reconnecting at a really high rate.

io.sockets.on('connection', function(client) {
    console.log('Client ' + client.id + ' connected');
});

... outputs at roughly the same overall rate as before.
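
A small sketch (hypothetical helper, not part of the code above) for quantifying that rate instead of printing every connection:

// Count 'connection' events per 10-second window to measure the reconnect rate.
var connectionsInWindow = 0;

io.sockets.on('connection', function (client) {
  connectionsInWindow++;
});

setInterval(function () {
  console.log(connectionsInWindow + ' connections in the last 10s');
  connectionsInWindow = 0;
}, 10000);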

Liam Don

Agreed, neither the above patch nor 0.8.6 has solved this - I'm getting a very high rate of warn - client not handshaken client should reconnect when using xhr-polling. @cowboyrushforth is the fix still working out for you?

Mattias Pfeiffer

Also seeing this with 0.8.6.

Scott Rushforth

@liamdon and others - sorry for misleading you. It turns out the issue still comes back even with the patch I thought had solved it completely, albeit I've seen it much less; overall it seems like an infrequent occurrence now. I have not been extremely scientific about this, just recording/perusing/munging logs, etc., but as I look now they are piling up with these handshake requests again.

Scott Rushforth

Also, fwiw, I am running socket.io behind a load-balanced setup. In recent days a theory has formed that it only occurs on some clients after the node.js/socket.io server they are speaking to goes away (EC2 drama, or a node/app crash) and the load balancer assigns them to another node.js/socket.io server.

In every failing scenario in the test environment, with every browser from IE 8/9, FF 3-7, Safari, etc., this works fine: the clients realize the session is invalid after a few seconds (sometimes the polling length, 20 or so). But somehow, in production, with some weird browsers, I think this causes the client to go into a fierce reconnection loop. I haven't been able to reproduce this one reliably, which is the crux of it. Is anyone else having this issue behind load balancers or high-availability setups where the socket.io session may be severed?

Mattias Pfeiffer

I'm seeing this with clients connecting directly to node.js/socket.io - no load balancers.

Dominiek ter Heide

Same here

Liam Don

@cowboyrushforth Yep I'm behind an Amazon Elastic Load Balancer and using the redis store. I haven't been able to reliably reproduce it either, except in production where, after 20 mins, our socket.io servers are effectively DDOS'd by the handshaking clients.

Do you have a current interim workaround? Don't use xhr-polling?

Ryan T

The hardest thing about this is that no one is able to reproduce it, yet we all see it happening. I'm still getting these, not as many, but they do still happen.

@cowboy I originally thought it was load balancing, so I forced all of my traffic to one node for a week; it still continued to happen.

FWIW, the only time I was able to reproduce this was running IE9 with a client/server version mismatch: I had 0.8.4 on the client and 0.8.5 on the server. Make sure your clients request the latest version when you update the server.

That's all I've got, sorry...

thekiur

A high number of 'client not handshaken client should reconnect' warnings can be produced by restarting the server quickly.
It seems that the client will attempt to use the old handshake ID. A refresh on the client side is required to get a working connection. I'm not sure in which browsers that happens, but if I enable logging, I see lots of those warnings if I restart the server quickly enough. They gradually fade away.

hovu96

Same on my EC2 instances when using the 'xhr-polling' transport. I cannot reproduce it myself (with my browsers), but the log file of the production instance contains tons of this warning.

If there is anything I can try to help find out what's going wrong, please let me know!

Dennis

I also noticed that many clients which produced handshaking warnings in 0.8.5 are recognized in 0.8.6 as connected for about half a second or less, which makes my node instance run at almost 100% CPU. 0.8.5 ran at 5% to 10%.

hovu96

I just tried version 0.8.5 but get the same result (as with 0.8.6): about 20 seconds after launching the node server (about 100 connections) I get a lot of the "client not handshaken client should reconnect" warnings...

Christian Rishøj

I was seeing a spew of "client not handshaken client should reconnect" with socket.io-client and Node v0.5.10-pre.
After downgrading to Node v0.4.12 the issue seems to have disappeared.

Dennis

I have this problem with v0.4.8.

Mattias Pfeiffer

I'm seeing this with v0.4.8 and v0.4.12.

I've tried reproducing it, but have been unsuccessful so far. It happens after a restart of the node instance, where some clients reconnect like crazy - this can cause the node instance to fail and when restarted the reconnecting-loop starts again.

I'm seeing this issue more frequently in socket.io v0.8.6 than v0.8.5.

Guillermo Rauch
Owner

The solution for this will be in v0.8.7

Mattias Pfeiffer

@guille You're the man! :-)

Dominiek ter Heide

Does this mean we've actually found the problem and been able to reproduce it?

I'm saying this because there's been a lot of "reappearances" of this problem in the past months :(

brettkiefer

@guille Great! Can you give us any more detail about the problem? I'm testing socket.io in production right now with 0.8.6, so I am interested in what's going on.

Liam Don

Can you guys shed any light on what the problem is? We'd like to patch in a fix without waiting for the next release!

Dennis

The warnings still appear at a high rate with 0.8.7, but the short connections of 0.8.6 are fixed, so everything just seems like it was with 0.8.5 :D.

peepo

the peepo.com server has a similar issue, but only from boxes on the same side of the firewall as the server.
they never connect.

so if anyone has client-side tests, scriptlets or whatever, let me know...

3 macs: 2 intel, 1 ppc

Martin Tajur

I experienced the exact same issue when I launched http://listhings.com on Node.js version 0.6.0, Socket.IO version 0.8.7. Whenever I turned on xhr-polling as a transport option, I got a ton of "broken" connections with the "warn - client not handshaken client should reconnect" messages.

Right now I have disabled xhr-polling in production, but as a result I cannot support all browsers at the moment.

Martin Tajur

Okay, I seem to be getting a high rate of these messages even when the xhr-polling transport is not enabled. Can someone describe the problems and implications surrounding these messages from the Socket.IO side of things, e.g. is there anything I could do to help speed up patching this issue?

Edit: The reason I saw these errors even with xhr-polling disabled was that I had jsonp-polling enabled as well. Now I am only allowing websocket and flashsocket connections, and Socket.IO performs relatively well. Some clients behind heavy HTTP-specific firewalls are now blocked from connecting, though.

peepo

and for whatever reason, websocket and flashsocket connections only, dont work this side of our simple firewall....

Dominiek ter Heide

I just upgraded to 0.8.7 and re-enabled Websockets. Do I have to pray now?

Einar Otto Stangvik

@dominiek, I'd upgrade to master/head instead.

Liam Don

Is there a confirmed fix in HEAD? It seems like we still don't have a good test case for this. I have struggled to recreate the issue except in production with a few thousand clients.

Martin Tajur

I was able to reproduce this easily on localhost with 7-8 tabs open in Google Chrome, with the following steps:

  • First, Socket.IO is configured to only allow xhr-polling (and/or jsonp-polling).
  • The server process is instructed to disconnect all connected clients immediately upon SIGTERM (sketched below).
  • The clients are, in turn, instructed to try to reconnect to the server at a 5-second interval upon disconnect.
  • Once I terminated the server and restarted it, all those 7-8 clients started to reconnect, but somewhere in the middle those "client not handshaken" messages started to appear for one of the clients. Not every time, though.

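A rough sketch of the server side of this setup (socket.io 0.x API; the port is an assumption):

// Reproduction sketch: xhr-polling only, disconnect everyone on SIGTERM.
var io = require('socket.io').listen(8080);
io.set('transports', ['xhr-polling']);

var sockets = [];
io.sockets.on('connection', function (socket) {
  sockets.push(socket);
});

process.on('SIGTERM', function () {
  // Disconnect every connected client immediately, then exit.
  sockets.forEach(function (socket) { socket.disconnect(); });
  process.exit(0);
});

The clients then simply schedule a reconnect attempt 5 seconds after their 'disconnect' event fires.
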
The most worrying thing (and the reason this was a showstopper for me in production) is that the Socket.IO client, upon receiving such an error message from the server, goes into a never-ending loop: it just retries the same request, which again triggers the error message, which makes the client issue the request again, which again triggers the error message... very quickly, CPU usage went to 100%.

So it seems @nickiv said it right - the problem lies in the client code, which doesn't handle the error message it receives properly.

Dominiek ter Heide

I could recreate this EXACTLY as @martintajur said. This is very, very worrisome indeed.

Dominiek ter Heide

(with version 0.8.7!)

Einar Otto Stangvik

@liamdon, @martintajur, @dominiek, I've yet to see this issue on any of my production environments, but if I do manage to reproduce it with the HEAD, I'd be happy to spend time chasing down the cause.

That said, this is open source software. Those of you who are able to reliably reproduce the issue aren't violently far from obligated to attempt a fix.

Up until two months ago I had nothing to do with the project, but upon needing support for the HyBi websocket protocol, I sat down and wrote just that. Since that time I've contributed this and that, and intend to keep doing so. If you're on any level getting anything good from the project, my friendly suggestion is: Pay it forward, don't just consume.

Martin Tajur

@einaros I am absolutely on the same page with you, and I will try to take a stab at this issue in the client side code.

Dominiek ter Heide

@einaros Noted. Totally with you that this is and should be a collaborative thing. I've recently contributed a unit test to the unicode issues.

I can confirm that I could recreate this issue by opening 8 tabs on master/HEAD, 0.8.6 and 0.7.9.

Dominiek ter Heide

I tried turning this into a test case but wasn't successful. It seems that to trigger this issue you need to simulate many separate client (browser) instances against one server. Simply creating many connections with 'force new connection' doesn't work.

Diego Varese

Just adding my grain of sand to this issue, I can only reproduce it when I have more than one socket.io server under HAProxy. If I have only one, with or without HAProxy I have no issues, but as soon as I add 2 node.js instances under HAProxy and open a number of tabs, I get random connections and reconnections and once in a while I get the neverending ultra-fast reconnect bug.

Guillermo Rauch
Owner

If you open many tabs make sure you're not saturating the socket limit per host (use many subdomains).

gdiz

I have AVG antivirus [Free]. Enabling the Surf Shield causes this problem in IE and Safari [Chrome and FF work fine]. Disabling the Surf Shield resolves the issue. This is a MAJOR issue, as a lot of customers will be using AVG and similar firewalls. Anyone have a solution yet? [I've tried ports 80, 8080, 843 and 443 without any luck]

Einar Otto Stangvik

@diegovar, I have 12 running production applications, under one haproxy instance. So there's more to it than that.

gdiz

I can confirm that FF, IE 8/9, Safari (Windows) and Chrome work fine if I go and disable the AVG Surf Shield. So this must be some sort of networking issue.

Also, I am using the entire transport stack:

io.set('transports', [ // enable all transports (optional if you want flashsocket)
  'websocket',
  'flashsocket',
  'htmlfile',
  'jsonp-polling',
  'xhr-polling'
]);

Dominiek ter Heide

I'm sure firewalls like AVG can have an impact on this issue, but it's definitely NOT a fix for this issue. We could recreate this issue easily without any firewalls.

Dominiek ter Heide

After a day of debugging we found out that this issue doesn't show up when you point your server towards Mecca and touch your nose three times!

No really: we've isolated this problem into a unit test. It's a very deep one and we suspect this "reconnect loop" can be triggered in other ways than we're illustrating here: LearnBoost/socket.io-client#339

Here's a short description of the problem:

The infinite loop can happen due to several causes. In this unit test we highlight one of them: the transport takes longer to "get ready" than the server-side handshake garbage collector allows (30 seconds). In the case of XHR this could happen because the browser needs to load slow include files on the page (this in fact often happens when you open 8 tabs in Chrome). Here are the steps of this specific scenario:

  • The client does an io.connect.
  • While connecting, the client does a successful handshake, and the server stores the client id in its handshake buffer.
  • For whatever reason, the client's transport is not ready yet (meaning XHR#open will not start yet). Once the document.load event is triggered by util.defer, io.connect continues and calls XHR#open, which does a get call.
  • If more than 30 seconds have passed, the server will have removed the client id and will tell the client to reconnect. The client receives the error packet, but continues to do an XHR#get on each successful XHR request. The error is escalated to Socket#onError, but because Socket#connected is still false, it never attempts a reconnect and XHR#get keeps looping. The server keeps telling it to reconnect (although what it really means is that the client should re-handshake).

The delay in transport loading is just one potential case in which this death spiral can be triggered.
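
A simplified illustration of the timing window described above (not socket.io source; the names and the non-error response are placeholders):

// A handshaken id is garbage-collected if the transport does not open
// within 30 seconds; after that every poll gets the "reconnect" error advice.
var handshaken = {};            // id -> handshake timestamp
var GC_WINDOW = 30 * 1000;      // the 30-second window mentioned above

function onHandshake(id) {
  handshaken[id] = Date.now();
}

setInterval(function () {
  var now = Date.now();
  Object.keys(handshaken).forEach(function (id) {
    if (now - handshaken[id] > GC_WINDOW) delete handshaken[id];
  });
}, 10 * 1000);

function onPollRequest(id, respond) {
  if (!handshaken[id]) {
    // Id already collected: advise reconnect. If the client never
    // re-handshakes (the bug described above), this repeats forever.
    respond('7:::1+0');
    return;
  }
  respond('ok');
}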

Mattias Pfeiffer

@dominiek Great work - awesome with a test-case!

Diego Varese

Just wanted to report that for some reason if I run socket.io on port 80 I get this infinite reconnect behavior, but if I run it on some other port it works correctly. This infinite reconnect behavior only happens for transports other than websockets; if I leave websocket as the only available transport then the connection is simply never made (the connecting handler is called but nothing else).

Bruno Carvalho

@guille 0.8.7 is out and you said it would be fixed there. Is it fixed in 0.8.7 ?

Thanks,
Bruno

Kristian Faeldt

I appear to have the same behavior as diegovar when running on port 80, and sadly the Japanese carrier Softbank only allows port 80 for HTTP communication on Android.

Mikkel Høgh

I seem to have this problem at very high rates – I had ~133k instances of this error within a 3 hour period with about 20 concurrent users yesterday…

arnesten

I get this in my logs when I use IE9 in IE8 mode with XHR-polling. As long as I am in IE8 mode no messages are received to the browser from the server. If I turn off IE8 mode and use real IE9, messages are received correctly and I don't get the error in my logs anymore.

Sergii Boiko

I've applied patch from @dominiek by myself and it works perfectly!
Thanks, @dominiek.

Dominiek ter Heide

Any news on this, guys? My patch actually only fixed one part of the problem; it's definitely still happening in our multi-node production setup. The logs are full of it.

We are kind of hoping (making a little Socket.IO prayer before hitting "cake deploy") that the reconnect refactor will solve this for us.

Cheers

Dominiek

Scott Rushforth

Fwiw, have tried all patches in this thread in production, and none have solved the issue, including 3rd-eden's bugs/reconnect branch, dominiek's patch, and both combined.

Guillermo Rauch
Owner

@cowboyrushforth have you tried extending the garbage collection timer ?

Scott Rushforth

@guille no, but I am happy to give that a shot, can you point me to some docs/info on how to experiment with that? Thanks!

Guillermo Rauch
Owner

Please keep me posted

Scott Rushforth

@guille, fix deployed, will keep you posted. So far so good. I have set it to 90,000.

Scott Rushforth

No change unfortunately, the handshake errors are piling up as usual.

From my debugging, this appears to happen when the load balancer decides one of the node.js/socket.io servers is unavailable (for example when it restarts) and moves the client to a new node.js/socket.io server. From there, the handshake errors immediately start.

It feels like the client tries to continue its session on a new node.js/socket.io server, but this server never had the handshake details to begin with and says the 'client should reconnect'. For some reason this drives the client haywire and it goes into an infinite reconnection storm until it does a full browser refresh.

Dominiek ter Heide

Yes, its vital to test this on load balanced setups. We are using NginX with ip_hash based load balancing and are also seeing a lot of these errors.

We also noticed that if we have any normal HTTP call not responding on our node app instances, the load balancer will decide to use a different machine. This will trigger this behavior as well.

Now that we have fewer of these loadbalancer switchovers the "storm" is less, but there is still a pretty heavy wind blowing

Sergii Boiko

I use socket.io without a load balancer, and checking the log I observed that "not handshaken" still occurs, but at a very low pace compared to the case without @dominiek's patch.

Guillermo Rauch
Owner

Are you guys doing sticky load balancing?

Scott Rushforth

@guille yes, all clients are always re-routed to the same node server. If that node server is restarted or crashes however, then clients are immediately routed to a different node server.

Guillermo Rauch
Owner

Well the problem is that if the server crashes, it loses the sessions, therefore it will advise clients to reconnect. Sounds like expected behavior. Why is your server crashing?

Scott Rushforth

The crashing is irrelevant; it happens just the same on a server restart, i.e. when we deploy new code for a new feature.

Scott Rushforth

The real problem is that when it advises clients to reconnect, this functionality is broken; it only works in some clients. For example, with WebKit browsers it seems to work as advertised, but with Firefox 3 and IE it doesn't: when those clients are advised to reconnect, they get into an infinite reconnection loop, which doesn't sound like expected behavior.

Dominiek ter Heide

Yes, I forgot to report this: after my fix we also noticed (infinite) reconnect loops still happening, but this was also happening on connection failures before we had a load balancer.

In XHRPolling#get I put in a hard stop if NUM_ERRORS_HAPPENED exceeds 50. In onPacket I increment NUM_ERRORS_HAPPENED when an error is received.

This is severely fucked indeed
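
A sketch of that hard stop (the name NUM_ERRORS_HAPPENED and the limit of 50 follow the comment above; the surrounding transport methods are simplified placeholders):

var NUM_ERRORS_HAPPENED = 0;
var MAX_ERRORS = 50;

function onPacket(packet) {
  if (packet.type === 'error') NUM_ERRORS_HAPPENED++;
}

function get(openRequest) {
  if (NUM_ERRORS_HAPPENED > MAX_ERRORS) {
    return; // stop polling instead of hammering the server forever
  }
  openRequest();
}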

Guillermo Rauch
Owner

I see. So this is exclusively a client-side issue. Got confused. I'll review the code for the client asap.

peepo
Deleted user

We have been having this issue a lot recently. After reading this thread closely, as well as the Stack Overflow threads, we updated both client and server to 0.8.7; it cuts down the errors for the first 10 or so minutes, then it starts up again.

There are people stating that it's IE-specific, which in our case is pretty much impossible because our socket.io client side runs in a Google Chrome extension.

Updating from 0.8.5 to 0.8.7 with 6-8K connections reduced our CPU load from around 30% on a single core to around 10%, with a peak of 30% when emits get fired.

So, good work with the new version, but this issue still persists using only the sockets transport.

Deleted user

I have also been experiencing this issue. My temp solution is to use node v0.5.3 as it seems that the socket.io-client websockets functionality was broken between node 0.5.3 and 0.5.4.

http://groups.google.com/group/socket_io/browse_thread/thread/e12b27f2b16ec8eb

Luke Anderson

I have been looking at this issue for a couple of days now, and I think I have tracked down an issue in the client side code in both 0.8.5 and 0.8.7, that only occurs in older browsers. The issue doesn't appear in the latest version of Firefox or Chrome, but I installed Firefox 3.6 and Firebug shows thousands of reconnects. When the browser connects to the server with an invalid handshake ID, socket.io returns an error packet (starting with a 7), with a flag indicating that socket io should reconnect.

The part of the code I found to be causing the issue is in socket.io.js (client side) on line 1871, where the socket.io client handles errors.

Socket.prototype.onError = function (err) {
  if (err && err.advice) {
    if (err.advice === 'reconnect' && this.connected) {
      this.disconnect();
      this.reconnect();
    }
  }
  this.publish('error', err && err.reason ? err.reason : err);
};

In Chrome this.connected is true, and so the disconnect and reconnect are performed. However, in old Firefox this.connected is false, and so the two statements this.disconnect(); this.reconnect(); are not executed. This means that the browser does not perform a new handshake with the notify server, and repeatedly sends a connect request with the invalid connection identifier.

I changed the above code to the code listed below, and the problem was resolved for me.

Socket.prototype.onError = function (err) {
  if (err && err.advice) {
    if (err.advice === 'reconnect') {
      if (this.connected) {
        this.disconnect();
      }
      this.reconnect();
    }
  }
  this.publish('error', err && err.reason ? err.reason : err);
};

I'm yet to try it on our web pool of 10K connections, but it certainly made a big difference on my test rig.

Guillermo Rauch
Owner

It seems that's likely to be the cause. Please let me know and thanks!

Dennis

Nice, this is the first attempt that looks legit. I have just deployed it to a site with about 2k connections, i will let you know in a day or two how it goes.

Scott Rushforth

I have also deployed this fix. Will report back asap. Thanks for the efforts!

jasonaward

This definitely fixed the issue for me! Great thanks to @spble for the fix!

Dennis

Unfortunately it doesn't seem that spble's fix solves the issue. Clients who definitely refreshed their socket.io.js file are still reconnecting without a delay; some of them even have the latest Chrome.

Scott Rushforth

After 10 hours in production things still appear fixed for me. I can confirm all clients are running the latest code. Still keeping an eye on this and will report back if things change.

Scott Rushforth

And after 20 hours, it is still working pretty well, much better than it ever has, but there are still some clients that are affected, although much, much less.

The remaining clients that are furiously reconnecting all have one thing in common, however (whereas previously they did not): they all fail my application's authorization routine, so the initial handshake fails.

For clients that have normal healthy authentication, things are stable.

Luke Anderson

My previous fix did seem to overcome the issue whereby older browsers would very rapidly attempt to send data with the same session ID due to an incorrect value of this.connected. However, I then suffered the same issue as mentioned by @denisu whereby even the latest version of chrome would try and reconnect thousands of times. I found 3 issues: one due to multiple workers running with cluster, another due to socket.io default timeout settings, and finally my previous fix caused another bug.

I am using node 0.4.12 with the cluster plugin (https://github.com/learnboost/cluster), which runs node in a number of worker processes so as to utilise multi-core machines. I found that when I ran cluster with only 1 worker process, it all worked fine. However if I ran cluster with more than one worker, Chrome would attempt to perform multiple handshakes repeatedly. This was because Chrome would perform a handshake, get an ID, then initiate the websocket connection. It would then receive the packet: 7:::0+1, which indicates 'Error, please reconnect'. The socket.io client would take the advice and attempt a new connection. After I changed the Node.js app to only have 1 worker process, it would not send the error packet. I assume in this case, the websocket connection that followed the handshake would go to a different worker process than the one who issued the handshake, and it would not recognise the ID, hence returning an error. I resolved this by using the RedisStore mechanism for Socket.io which will work between multiple processes/servers. The code for this is below.

// Use a RedisStore so handshakes and sessions are shared across all
// worker processes (and servers).
var sio = io.listen(app);
var redis_server = {host: 'localhost', port: '6379'};
var redis_options = {redisSub: redis_server, redisPub: redis_server, redisClient: redis_server};
var RedisStore = io.RedisStore;
sio.set('store', new RedisStore(redis_options));

So after I found this, I allowed all the web traffic back to my host and still encountered lots of connects/reconnects. It turns out socket.io doesn't work well with its default timeouts if the server is under heavy load, and my server is under very heavy load. Basically, it would connect and receive a handshake ID from socket.io, but then either not receive anything over the websocket connection and time out (causing it to re-handshake), or it would receive an error from the server, causing it to re-handshake. I suspect this error is sent from the server because it has run out of some resource.
To overcome this, I modified the 'reconnection limit' option that appears around line 1501 of the client-side socket.io.js. Having this set to Infinity, as it is by default, disables the exponential back-off that occurs around line 1970, so socket.io just spams reconnections without waiting for a delayed response.
By changing 'reconnection limit': Infinity to 'reconnection limit': 18000, the client script will exponentially increase the 'reconnection delay' (default 500) until it reaches 32000ms. This should give your server enough time to respond. If not, at least it won't spam it and compound the issue.
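
The same back-off settings can presumably also be passed from application code instead of editing socket.io.js; a hedged sketch using the option names quoted above (the URL is a placeholder):

var socket = io.connect('http://example.com', {
  'reconnection delay': 500,    // initial delay, doubled on each attempt
  'reconnection limit': 18000   // cap for the exponential back-off (ms)
});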

Then I found that in some scenarios where timeouts were reached, two active websocket connections would exist simultaneously, both with active heartbeats from the server. I found this to be due to the fix I provided above, and have revised my previous code to remove the check for this.connected completely. This way this.disconnect() does not get erroneously skipped.

Socket.prototype.onError = function (err) {
  if (err && err.advice) {
    if (err.advice === 'reconnect') {
      this.disconnect();
      this.reconnect();
    }
  }
  this.publish('error', err && err.reason ? err.reason : err);
};

I am yet to put this live and see the true results, but it seems to work okay on my testing rig. I will post results if this makes a large impact.
Good luck!

Luke Anderson

I have found a solution that works for me. As @guille suggested earlier, changing 3E4 on line:
https://github.com/LearnBoost/socket.io/blob/master/lib/manager.js#L964
worked for me.

I added a console.log call in the if statement linked above, and noticed that garbage collection was happening A LOT. It seemed like when the server was under high load it would try garbage collecting before it even finished GCing previously. This would then recurse until I was using 100% cpu all the time, and clients would get their handshake deleted before they could even talk to the server.

It seems that this timeout is much too short, so I changed 3E4 to 3E5 and now I get significantly fewer client not handshaken client should reconnect errors.
This solution seems to be working for me, though I don't think it's the best one. I think the garbage collection needs to check that it's not already garbage collecting before it starts again, or something along those lines.
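
A generic illustration of that last suggestion (plain JavaScript, not a socket.io patch): re-arm the garbage collector with setTimeout only after a pass finishes, so passes cannot pile up under load the way a fixed setInterval lets them.

var GC_INTERVAL = 10 * 1000;

function gcPass(handshaken, maxAge) {
  var now = Date.now();
  Object.keys(handshaken).forEach(function (id) {
    if (now - handshaken[id] > maxAge) delete handshaken[id];
  });
  // Schedule the next pass only once this one is done.
  setTimeout(function () { gcPass(handshaken, maxAge); }, GC_INTERVAL);
}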

There are still several errors in the client side socket.io code that could be causing this error, and I'm still trying to identify what those are.

Robert Schultz

I have created a test case that reproduces this bug.
Just a simple single node.js socket.io server and Chrome 16.

Here it is: https://github.com/Sembiance/client_not_handshaken_test_case

Basically:

  • Start the server.
  • A client connects.
  • The server is issued a signal, causing express' app.close() to be called and each socket to be issued a .disconnect().
  • The server waits 10 seconds before process.exit().
  • The client detects it was booted and sets a timer to reload the page after 5 seconds (just like a human might do).
  • After 5 seconds the client reloads the page and tries connecting again.
  • Somehow this is possible in Chrome, despite app.close() having been called (maybe cache?).
  • In FF, after 5 seconds the page reload fails, as you would expect.
  • After 10 seconds the server exits.
  • Start the server again (do this soon after the server exits, don't wait too long).
  • The single Chrome client now causes an infinite 'client not handshaken client should reconnect' loop.

So app.close() being called may not be a use case to be handled by socket.io, but hopefully this relatively simple test case can shed some light on the problem.

Vyacheslav

Many thanks to Sembiance for his test, it helped a lot.
I created the LearnBoost/socket.io-client#374 patch for the client, which seems to solve the problem. There is a bit of a description inside.

Christopher Hunt

Unfortunately I've not been able to observe socket.io exceeding 200 connections at one time. I think the issue I'm observing is similar to the issue being discussed here given that I also see the, "client not handshaken client should reconnect" message.

A similar test script for a non Node/socket.io library has shown that thousands of connections are possible, so I think that my client code is ok.

I've created a test case that can be obtained using:

git clone git://gist.github.com/1698196.git gist-1698196

'hope that this is useful.

Joan Roca

Hi SlNPacifist,
I've tried your patch but it doesn't seem to work at all. Is it really working for you?

Martin Tajur

@joanroca - I have applied @SlNPacifist's patch to my Socket.IO implementation, and after testing I also put it into production a few days ago.

I can confirm I have no longer seen Socket.IO go into infinite reconnect loops that were described in this thread earlier with xhr-polling.

I have enabled only xhr-polling as transport method, and everything is working perfectly.

Joan Roca

Ok, but is it necessary to force xhr-polling? I haven't forced any transport and my chat stopped working for all clients.

Martin Tajur

I'm not sure whether it is necessary or not...

I'm forcing xhr-polling because websocket transport did often not work when a client was behind some strict corporate firewall which only allowed "standards-compliant" HTTP traffic, and for some reason Socket.IO sometimes also failed to switch to xhr-polling by itself.. I did not investigate the cause of this further, though, as I am satisfied with xhr at the moment.

Robert Schultz

@martintajur - By the way, the websocket problem behind corporate firewalls is something LearnBoost is aware of. They are working on a from scratch, new socket.io which addresses the problem you described. See the 'Goals' section here: https://github.com/LearnBoost/engine.io

Joan Roca
Vyacheslav

@joanroca, after applying this patch I noticed a reduced number of such logs on the project server, but they still appeared. I moved to another transport library about a week ago because the socket.io client seems to be very unstable - there are really lots of 'forgotten' callbacks and connection states that are not handled.

peepo
Joan Roca

So, any word from LearnBoost about this? We are about to launch a chat for our Facebook games but this is really blocking it.

Austin Hammer

I have applied SlNPacifist's patch and the error still occurs, albeit less frequently. However, when using "cluster", the problem is amplified even more.

Martin Tajur

I believe the fact that the error is still happening is normal. However, it doesn't cause infinite loops any more; instead - it behaves as it should - it forces the client that received the error to actually do a reconnect.

Josh Smith

"I seem to have this problem at very high rates – I had ~133k instances of this error within a 3 hour period with about 20 concurrent users yesterday…"

I experienced about the same rate. I'm sure Loggly just loves me right now.

Is there any progress on the pull request that's been submitted?

Diego Varese

I managed to mitigate this issue by setting the load balancer to balance by connection IP. Otherwise it was completely unusable; I'd get the reconnect loop every time, even with SlNPacifist's patch.

Dominiek ter Heide

We also set our front server to ip_hash rotation, but trust me, you'll still get A LOT of reconnects

Guillermo Rauch
Owner

I'm working on a way to mitigate this issue from core. Not only will it allow reloads of your code without dropping sessions, but also increase performance / eliminate the need for load balancing without the need for replication. This is not easy to tackle, but thanks to the clean separation we're moving towards between engine.io/socket.io, it's much more doable.

At this point however, the safest bet is to mitigate it at the load balancer level.

Josh Smith

Thanks for the update G.

kseptor

Applying the patch from @SlNPacifist #438 (comment) to 0.8.7 was a huge win!

The logs still spam "warn: client not handshaken client should reconnect", but at least now haproxy isn't getting backed up with thousands of node sessions.

If you just read through 7 months of commentary and find yourself at this comment, your next move should be to try the patch.

Guillermo Rauch
Owner

@kseptor
Thanks for the feedback about the patch. It will make it into 0.9.1 along with another important fix

Binh Le

I'm running 0.9.0 with @SlNPacifist 's patch and no load balancer upfront. After about an hour the log contains almost nothing but "warn: client not handshaken client should reconnect" and my CPU is at 100%. Looking forward to a fix.

David Chouinard

@guille Did this actually make it in 0.9.1?

Chris Shoemaker
shoe commented March 02, 2012

If this patch made it into 0.9.1, then I don't think it worked completely. I'm still seeing very high rates of this warning.

Dennis

The problem still exists in 0.9.1. Until there is a proper fix, you can just comment out line 677 in ./lib/manager.js, which sends the reconnect command to the client after an invalid handshake.

This is quite brutal and not a fix, but at least it's not causing 100% CPU load for those reconnecting clients.

Nate Morse

@denisu Thanks for that line 677 stop-gap measure, saved the server from melt-down

Austin Hammer

Seconding the above posters about this problem still existing in 0.9.1-1

oh and thanks for the stop-gap measure!

David Fooks

How "brutal" is this line 677 fix? Seems to work fine in my quick tests.

David Fooks

Ok, I've tested this 677 "fix" and it does stop this issue. However, the browser causing the "warn: client not handshaken client should reconnect" errors now periodically reconnects (roughly every 10 seconds) and then immediately disconnects. So now I get the message only once every 10 seconds.

Diego Varese

Does this line 677 fix cause reconnects not to work?

David Fooks

@diegovar
No, the normal reconnects still work. After the 677 fix, in the case that the client's browser gets stuck in this "warn: client not handshaken client should reconnect" loop, the client won't try to reconnect again for about 10 seconds. This means the client no longer spams the logs and eats your CPU. That client (the one causing the reconnects) still won't successfully reconnect until they refresh their page, but seeing as they never managed to reconnect previously, this is a massive improvement.

Diego Varese

I wonder if adding a setTimeout there might not be a better temporary fix. This would prevent the reconnect spamming and allow clients to reconnect after a while in case it fixes itself after some time

Kyle Mathews

Has anyone tested 0.9.2?

Kyle Mathews

To answer my own question, 0.9.2 still has the same problem. Commented out (now) line 683 again.

Kristian Faeldt

Since the patch is now merged, and the unit tests for this issue now also seem to pass, does this mean that the cause of the remaining issues is currently unknown?

Kristian Faeldt

My understanding of socket.io is too shallow for me to get to the bottom of this even after multiple attempts, but what about something like the below as a workaround?

socket.js (client)

function Socket (options) {
  [...]
  this.reconnectAdviceCount = 0;
  this.lastReconnectAdvice = Date.now();
  [...]
}

[...]

Socket.prototype.onError = function (err) {
  if (err && err.advice) {

    if (err.advice == 'reconnect') {
      // Reset the counter if the last reconnect advice is more than
      // five minutes old.
      if (Date.now() - this.lastReconnectAdvice > 300000) {
        this.reconnectAdviceCount = 0;
      }

      this.lastReconnectAdvice = Date.now();

      // Give up after 20 reconnect advices in a row and let the
      // application decide what to do.
      if (this.reconnectAdviceCount++ > 20) {
        this.publish('error', 'advicereconnectspam');
        return;
      }
    }

    if (err.advice === 'reconnect' && (this.connected || this.connecting)) {
      this.disconnect();
      if (this.options.reconnect) {
        this.reconnect();
      }
    }
  }

  this.publish('error', err && err.reason ? err.reason : err);
};

The idea here is that you can subscribe to the error event in your application and, upon receiving an 'advicereconnectspam' error, do something drastic like recreate the socket or reload the page.

Perhaps this is a horrible idea, but failing everything else this is the best I could come up with. Thoughts?
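
For completeness, a hedged usage sketch of that idea on the application side (the page reload is just one possible drastic action):

socket.on('error', function (reason) {
  if (reason === 'advicereconnectspam') {
    // The server keeps advising reconnects; start over with a clean slate.
    window.location.reload();
  }
});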

Charles

Ran into this too.

James Bathgate

Still seeing this in 0.9.3.

Andrei

It seems the problem is not the server but the client.

I tried the server and the client in 4 configurations:

  • Windows server, Windows client
  • Windows server, Linux client
  • Linux server, Windows client
  • Linux server, Linux client

Both the Windows and Linux servers worked, but the Windows client (Chrome 17) did not work, while the Linux client (Chrome 17) worked like a charm.

Same thing with firefox.

James Bathgate

@matzipan I've noticed the same thing, when using a Windows client both Firefox and Chrome fail, but IE does work. However on my Mac both Firefox and Chrome work. I've also noticed that Firefox and Chrome both work under Parallels on my Mac with a Windows installation.

Andrei
Guillermo Rauch guille closed this March 31, 2012
Guillermo Rauch
Owner

Fixed in 0.9.4

Kyle Mathews

@guille to satisfy the curiosity of those who've been struggling with this -- could you link to the commit(s) that fixed the problem?

Guillermo Rauch
Owner

Yup it's pretty simple. In Socket.IO an error packet is not really a first-class transport error, so it's conveyed with the status code 200 like any other packet.

Which brings us to the polling cycle. Upon a successful response, a request is immediately reopened. Notice onData followed by get:

https://github.com/LearnBoost/socket.io-client/blob/master/lib/transports/xhr-polling.js#L81

The only thing that can prevent get from triggering again is the internal open check:

https://github.com/LearnBoost/socket.io-client/blob/master/lib/transports/xhr-polling.js#L72

Which is now set to false by onPacket (triggered by onData) before get is executed:

https://github.com/LearnBoost/socket.io-client/blob/master/lib/transport.js#L83

So the fix is that one highlighted line in transport.js.
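
In pseudo-JavaScript, the cycle described above looks roughly like this (simplified for illustration, not the actual socket.io-client source; decodePacket and request are placeholders):

XHRPolling.prototype.onData = function (data) {
  this.onPacket(decodePacket(data)); // an error packet now sets this.open = false
  this.get();                        // reopen the poll request
};

XHRPolling.prototype.get = function () {
  if (!this.open) return;            // without the fix, open stayed true and the
  this.request();                    // error response retriggered get() forever
};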

Andrei
peepo
peepo commented April 01, 2012
James Bathgate

I'm still seeing tons of these warnings in my logs.

arnesten

I'm also still seeing this after upgrading from 0.8.7 to 0.9.4, but it seems to be a little less.

David Fooks

So we had a look at this and after a couple of weeks of testing this fix seems to have worked. You can find our fix here c382a6b

For us, it was an issue with the load balancers closing XHR (and possibly other transports) connections between the transport's handshake and connect steps. The client then fails to connect, since the transport has been incorrectly removed by the disconnect in transport.end:
https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L467

We added a timeout to transport.end which stops the transport from being disconnected instantly. This allows the connect step (on another connection) to jump in and replace the transport in handleClient:
https://github.com/LearnBoost/socket.io/blob/master/lib/manager.js#L649
thus keeping the transport open. If the transport has not been replaced after 5 seconds we assume that the client has actually disconnected and remove the transport as before.

This fix may not be the perfect solution, but it's working for us at the moment and we haven't seen any issues with it yet. To my knowledge there is no good reason why XHR polling should require a keep-alive connection (and certain browsers or load-balancer backends might not support connection keep-alives), so this is a Socket.IO bug.

guille's fix addresses a symptom of this issue: a client that hits the problem above tries to connect again instead of starting over at the handshake, but nothing has changed, since the transport ID has been completely removed (by the disconnect), so it tries again and again... basically a DOS attack on your server. guille's fix makes the client correctly go back and restart at the handshake, creating a new transport ID. However, if the client's connection is closed again on the reconnect, the same issue will still happen (but now with an exponential falloff).

David Chouinard

@davidfooks Thanks for good work on this.

Any idea when this will be released?

David Fooks

@DavidChouinard

I'm not a part of the Socket.IO team. It's been 8 months since this was opened, so we decided to take it on ourselves. You can just apply that patch locally, but I would recommend running your own tests before using it. There is probably a better way to do it than this quick fix, but for now it works fine.

Guillermo Rauch
Owner

@davidfooks
You make a good point. Thanks for bringing this to my attention so clearly.
I applied this patch, but the tests are not passing; I'm going to try to release this solution (or a similar one) ASAP.

Guillermo Rauch
Owner
crickeys

How is this fixed in 0.9.5? I also get a TON of these still in 0.9.6.

mindon

Still seeing this in the v0.9.6 client.

BTW, it seems that increasing the 'heartbeat timeout' value helps.

crickeys

I've been serving the CLIENT library directly, so I'm not sure if that has anything to do with it; it might be possible for some old client to have a previous version of the client library that differs from my server version.

HOWEVER, using the "677 line fix above" really helped in a number of ways:

  1. It completely got rid of the client not handshaken warnings
  2. It stopped those rogue clients from DDOS'ing my server
  3. That freed up CPU and made everything snappier.

I'm not really sure why that suggestion to reconnect is even in there. It just seems like a surefire way to DDOS a server, especially if the server needs to reboot and then everyone reconnects with a bad ID. Why not just let the AUTO CONNECT code do its job?

Martin Tajur

Sadly, I updated to 0.9.6 too and got TONS of these messages as well. After that I decided to spend a few hours and migrate my stuff entirely over to Engine.IO. It's now done and Engine.IO works like a charm, even though it's not officially "production ready" as I understand it. The transition was quite easy since I had built my own reconnection logic, etc.
Happy to share some code if someone's interested.

Anthony Webb

Sadly, still seeing this in 0.9.6

Guillermo Rauch
Owner

@martintajur It'd be nice if you could help us get engine.io into socket.io 1.0 branch

Miguel Espinoza

I developed my app on socket.io 0.8.4 and did not have this problem. Yesterday I published my app and, newbie mistake, upgraded the production server to socket.io version 0.9.6 and ran into this problem. I really wish I understood the inner workings of socket.io and node.js well enough to help with the debugging, but my app is fairly simple. If there's anything I can do to help please let me know; for now I can only report that in 0.9.6 I do get this problem.

Diego Varese

Why is this closed if the issue still persists?

Guillermo Rauch guille reopened this May 02, 2012
Guillermo Rauch
Owner
guille commented May 02, 2012

I'm not seeing this in production. If anyone has reproduction steps, please share and I'll take a look right away

Kyle Mathews

I saw this issue consistently on earlier versions but haven't had any problems since upgrading.

Amir Malik
ammmir commented May 03, 2012

Just upgraded to 0.9.6 and am still seeing these flood the logs. In our case, we have only the xhr-polling transport enabled, with nginx at the front. We see this problem when we have clients connected to the app and do a hot deploy (old and new versions coexisting for a while), and I guess the clients freak out when some server-side state changes unexpectedly...

Guillermo Rauch
Owner
guille commented May 03, 2012

Can you reproduce it without nginx in front? It'd be really useful to know that.

crickeys

I get this without nginx on 0.9.6, but I am using stunnel, so I wonder if that has something to do with it too. However, applying the "677 line fix" above got rid of the problem.

Paul Jensen

I managed to replicate this problem by running an application across both CPU cores on my MacBook Air using Node's cluster API, and then trying to refresh the page a couple of times. I was using Socket.io with MemoryStore (which I understand from the docs only runs on a single process, not 2 processes across separate CPU cores). I don't know how helpful that is, but I thought I'd let you know.

Derek Kent
dak commented May 11, 2012

I can easily reproduce this error consistently using more than 1 worker with Node's Cluster module. If I use exactly 1 worker (or don't use Cluster), the problem completely vanishes.

I posted some of my code on Stack Overflow: http://stackoverflow.com/questions/10544350/socket-io-and-node-js-core-cluster
Also referenced this bug report before finding this issue: #881

I've tried using both MemoryStore and RedisStore. I was using WebSockets exclusively for my testing.

If it would be helpful, I can try to strip my code down to the bare essentials and post it to make it easier to reproduce the error.

David Chouinard

I confirm what @dak1 experiences. I, too, see a very high number of these errors when using Node's native Cluster module.

Derek Kent
dak commented May 11, 2012

I'm still testing this to make sure I'm not missing something, but I am not seeing a single error any more and have an extremely stable connection (about an hour now, ~500 concurrent connections) after installing hiredis and redis (npm install hiredis redis) and using them to create the clients for the RedisStore.

redis@0.7.2
hiredis@0.1.14

Running redis 2.5.9 (should say 2.6.0rc3) on port 6379.

Update: I'm quite confident at this point that this was a fix for this issue (at least as it relates to using Socket.io with Node's native Cluster module).

mindon
mindon commented May 15, 2012

Without cluster, my reproduction method:

  1. start server app

    var io = require('socket.io').listen(8080);
    io.set('log level', 1);
    
    io.sockets.on('connection', function(socket) {
      console.log(+new Date);
    });
    
  2. start benchmark app (keep it running)

    var io = require('socket.io-client');
    
    for(var i=0; i<10000; i++) {
      setTimeout(function() {
        var socket = io.connect('http://localhost:8080',
           {'force new connection': true, 'try multiple transports': false});
        socket.on('connect', function() {});
      }, i);
    };
    
  3. kill & restart server app ...

Jorge Rodriguez
j4rs commented May 22, 2012

Any good news about this issue? I reproduced it on my laptop using version 0.9.6, node cluster support, redis store, and throwing clients at it using the @mindon method (I keep the script running using the 'forever' module).

However, if I don't use cluster support I don't see any errors. But, and this is what is strange, the log only prints up to 247 connections (when throwing, for example, 500 or more clients), and after 10 or 15 minutes I saw a couple of 'warn - client not handshaken' entries in the log.

Pratik Khadloya

I think the issue happens in the first place because of a node server crash. Socket.io does a handshake with any new client that tries to connect to it. So when node dies due to some problem in the application code and is restarted, node/socket.io forgets about any of the handshakes that happened with the existing connected clients, and when the clients try to send requests, socket.io complains that they have not done any handshake before trying to send requests.

Diego Varese

Well, in that scenario, socket.io should reconnect with the client and everything should continue smoothly, right?

Pratik Khadloya

Yes, it seems like it, but reconnecting hundreds of clients would probably strain the server resources.

Jorge Rodriguez
j4rs commented May 22, 2012

I guess this paragraph is a very interesting reading to explain what really happens:

"In a single-threaded environment like node, scheduling tasks is handled with a queue. When we start to see rising roundtrip times, it means we've started adding tasks to the queue more quickly than they can be processed. Once the server hits that point, performance is going to very rapidly degrade until the rate of adding new tasks falls below the jamming threshold. Even then, it'll take a while to clear out the queue and get back to acceptable performance. Perhaps even more troubling, when the queue is being jammed, setTimeout seems to stop working reliably; you can schedule a task 1000ms into the future but if the queue is jammed that task seems to wait more or less until the queue clears."

Took it from http://drewww.github.com/socket.io-benchmarking/ (Analysis section).

David Fooks

@j4rs Yup, you need to be very wary of this when working with JavaScript, especially when working with Socket.IO, as you need to make sure the tasks in the queue can run in order to avoid missing heartbeat timeouts. Also note that you need to be careful to make your code "yield" (using process.nextTick) when doing long operations (big for loops, for example) to allow these heartbeat timeouts to trigger in time.

Jorge Rodriguez
j4rs commented May 24, 2012

@davidfooks So it seems there's some code causing long roundtrip times and affecting the heartbeat timeouts, right?

David Fooks

@j4rs No that's off topic (but something to be aware of). I've already explained what I think this is if you scroll up. Not sure if they have put a version of our fix in 0.9.5 (it doesn't look like it). Since adding our fix we haven't had any problems with this bug. We are however, still running on Socket.IO 0.8.7 and restricting the transports to websockets and XHR polling only.

Our fix: c382a6b not yet tested on 0.9.5

crickeys

@davidfooks do you have any memory leaks running websockets on 0.8.7?

David Fooks

@crickeys No, I'm not seeing any noticeable memory increase over the last month. Watch out, as it will look like you're leaking memory if you only look over short periods of time. V8 only garbage collects once every 2 days when running our server.

crickeys

If I use xhr-polling I stay at around 100 MB of RAM with up to 5000 connected clients. If I introduce websockets then memory gradually climbs to over 2 GB until I run out of memory. I've been trying to work with @einaros and @nicokaiser on this. Looks like they may have fixed the issue in the ws library but NOT over here on socket.io. Is socket.io dead while we wait for engine.io? It seems like none of my issues get responded to. Wondering if it's time to switch to faye or sockjs :(

Jorge Rodriguez
j4rs commented May 25, 2012

@crickeys I had to switch to faye, and it is scaling well with ~5k connections. I still have to run more tests, but this issue hurt me badly :( A few connections and 20 minutes later my CPU was at 100%, with no way to scale socket.io out.

Einar Otto Stangvik
einaros commented May 28, 2012

Socket.io will soon make the switch to use websocket.io + ws for websockets. That will improve this whole situation drastically.

crickeys

Any idea when that's happening?

Einar Otto Stangvik
einaros commented May 28, 2012

@guille, what's the word?

Shashwat Srivastava

@crickeys, I have recently started using node.js and socket.io. I was initially using websockets, but the memory used to increase continuously (200 MB for 100 connected clients). I next switched to the xhr-polling transport, but the memory still increases slowly and reaches around 80 MB for 50-60 connected clients in 12 hours. I am very interested to know how you are supporting 5000 clients with 100 MB RAM. Are you using xhr-polling as the only transport? Does the RAM usage stay constant? Or do you restart the server after a certain interval of time (like daily)? Thanks.

Shashwat Srivastava

@davidfooks, have you configured the V8 engine to do garbage collection every 2 days? Or does it happen on your server by default? Does socket.io internally control this?

crickeys

@darklrd To keep memory stable I'm only using htmlfile and xhr-polling on the server side. I did NOT configure V8 to do garbage collection every 2 days, I just went with the default. I also have to use socket.io 0.9.4 because there is something wrong with disconnects when a browser tab gets closed with xhr-polling in 0.9.5. I raised it here #461 but nothing has come of it.

Anyone else feel like socket.io is dead and waiting for engine.io?

Shashwat Srivastava

@crickeys, thanks a lot for replying and providing this useful information. Have you customized socket.io 0.9.4, or are you just using the default code? I will switch to it immediately and try it out. Do you use RedisStore as well? Also, are you using this version of socket.io on node.js 0.6.18?

Yes, I am also facing the disconnection issue. Sync disconnect is never fired and it waits until the close timeout. Eagerly waiting for the 1.0 release.

crickeys

I used 0.9.4 without modification. The sync issue will disappear when you use that version, because 0.9.5 is what caused it. I am NOT using RedisStore.

Shashwat Srivastava

@crickeys, thanks a lot! I have been desperately trying to find a solution. I will try this ASAP. Thanks again.

Shashwat Srivastava

@crickeys, doesn't xhr-polling work on all browsers? Is htmlfile better?

crickeys

Sorry for the delay. I'm not sure which is better, but I put that in there just in case xhr-polling throws some security issues.

Shashwat Srivastava

@crickeys, thanks a lot! I will try it out.

Superfeedr

We do get that error as well... Interestingly enough, it happens when we're proxying the traffic through HAProxy, but it doesn't when we access the node process directly.

Also, it seems that it does not happen as often (at all?) with a single backend server... but that kind of defeats the purpose of HAProxy!

Diego Varese

Has anyone tried having a DNS round robin with several socket.io servers instead of a load balancer?

Ignacio Tolstoy

I have a single instance of node on a production server.

We had socket.io 0.9.0 and these errors were happening. I updated the client and server to 0.9.9.

Should I update node to the latest version too? We are running 0.6.11.

Anyway, I'm monitoring errors to find out whether it's really fixed by the socket.io update or I must apply other fixes.

PS: we are using websockets as the transport.

Thanks!

Ignacio Tolstoy

I'm still having this bug, and no kind of help :(

Jørn A. Myrland

+1

This is a big issue in our production environment using node v 0.8.8 and socket.io v 0.9.10.

With around 1k connected clients (using the XHR polling transport, mostly IE8 clients) the CPU goes quickly through the roof. The only fix is restarting the server.

We were able to stabilize the server when we enabled flashsockets (instead of XHR polling).

Guillermo Rauch
Owner

@jmyrland are you using a single process? Any proxies behind socket.io?

Jørn A. Myrland

@guille That is a good question.

When using multiple processes with XHR as the transport, the client would sometimes not connect to the server. It became quite unstable. When using a single process, the client always got connected, so we stayed with a single process (though set up with a redis store).

However, now that we are using flash- and websockets, we can cluster up to multiple processes without any connection instability.

No proxies that I'm aware of.

Let me know if I should provide any logs, to help you :)

Ignacio Tolstoy

Does XHR polling really make a difference?

I get the same error as you.

The server goes down with 35k+ unique connections every time.

Now I have a cron job checking for node and restarting it, but that isn't the way...

When the job restarts node I get more warnings than before. Restarting is the worst solution I've come up with.

Jørn A. Myrland

98% of the ~1k clients (IE8) connected to the server were using XHR polling. When moving to the flashsocket transport, the problem was marginally reduced - avoiding server meltdown.

So yes, in my case the XHR polling transport is the issue.

Ignacio Tolstoy

In my case I'm using a websocket connection.
Should the number of errors go down with XHR polling?

Marc Harter

@guille is this something addressed in engine.io?

Guillermo Rauch
Owner

@naxhh restart is precisely what explains the warning: you kill all the sessions, then subsequent GET requests in the polling cycles don't find the id.

Ignacio Tolstoy

@guille yes, but in theory socket.io sends a request telling the client to restart:

"user should reconnect"

But the client side never seems to reconnect that socket.

So the server gets a lot of failing requests and finally restarts itself because it can't handle them.

Is there anything I'm missing?

I have to add to this.
The error appears a lot when I restart the server, of course.

But it also seems to appear sometimes without restarting anything.

Anthony Webb

Still broken in 0.9.10; the only recourse was to apply the "line 677" fix, which has moved in 0.9.10 to line 711. Commenting out that line saved my server... Hopefully someday this bug will be solved; it's amazing it has lingered for so long. This has to be the all-time longest thread in GitHub history :)

David Fooks

@anthonywebb I thought the changes in the client fixed the need for the "line 677" fix. That should stop the DOS attack that the clients are doing (or at least throttle it a lot more).

Did you try my hack? It's not well tested, so I wouldn't fully trust it, but it works for us:
c382a6b.

Anthony Webb

@davidfooks to clarify, if I use your hack I dont need the line 677 fix right? I'll give it a try.

Anthony Webb

@davidfooks your hack (I removed the line 677 hack) seems to be working pretty solidly; at least it looks good in the logs, no more DOS attack. CPU load spikes every 5 seconds or so to 100% but then comes back down. Not sure if this is normal, but I will watch it.

Another note: I see people talking about the number of "unique connections"; is there some easy way to get at that number? I'm curious what mine is.

David Fooks

No, my change fixes the specific issue we were seeing with clients being disconnected even though they require a keep-alive connection (I'm not sure if it was put into the release) which resulted in the DOS. #438 (comment)

The 677 fix just stops the DOS spamming but not the original issue causing the DOS. The DOS bug fix was applied client-side; see #438 (comment).

David Fooks

Ok, after reviewing all this (it's been a long time) I remember that they put a version of this hack and the DOS fix into release 0.9.5. We haven't seen the issue since. Are you sure that your load balancer is redirecting reconnecting clients to the same machine? If not, that can cause this behavior.

Anthony Webb
George Ornbo

Same issue here. Applying the '677 hack' (t-shirts anyone?) to what is now line 711 of lib/manager.js in v0.9.10 reduced CPU usage.

Running Socket.IO in production has been painful - before running on 0.8.14 we were seeing memory leaks even after disabling websockets. When running websockets there were also significant memory leaks. See this ticket.

Maximiliano Guzenski

I have an m1.small on EC2, with just node (0.8.14) and socket.io (0.9.11), and I have this issue. After a while, node.js hits "EMFILE" errors many times and does not stop or restart on its own (only with a manual restart).

I think the '677 hack' helps a lot with the 'EMFILE' error as well.

Michel Hiemstra

I'm running a node server to keep track of a user's playing time. On disconnect the duration gets saved to redis. I'm running node 0.8.11 with socket.io 1.1.62 and I also have this issue, but in some cases it's different:

When running the node app as a single process without RedisStore I can get up to 15k concurrent users; when the CPU can't handle it anymore (load around 100%) I get the warn - client not handshaken client should reconnect messages a lot.

When I run the node app clustered (with the cluster module), I run 4 processes using RedisStore and each can handle around 1.5k concurrent users (~5k total). The loads are distributed and a lot lower than when running in single mode, but I get the handshake warnings a lot sooner. It looks like it's losing its client sessions or something; I also don't see any records in the Redis db. Is this normal?

But anyway, it's no problem to run it in single mode, but I want to be able to scale the application up to 30k concurrent users. When running in single mode I occasionally get a (node) warning: possible EventEmitter memory leak detected.

I did not apply the 677 hack, by the way.

Any fixes yet for the handshake warnings? Maybe my way of benchmarking is not sane, I don't know, but I haven't found a decent benchmark tool for node yet. I'm now just using a ramp-up scenario which I wrote as a node app, running it from an external server to simulate connections that get killed a random time after connecting.

softwareprojects

The amount of time we spent wrestling with these "client not handshaken" errors is absolutely mind-boggling!

This ticket is now a year old and as of socket.io version 0.9.10 the bug is still very much alive and kicking.

The issue reproduces when you have multi-instance / multi-process nodes, with clients getting switched between the different node servers. Sure, under normal circumstances you should use stickiness and have the same client connect to the same node server, but servers go down. And when they do, you -will- end up with a client that was previously talking to node_instance_1 attempting to reconnect to node_instance_2.

We tried everything to make this scenario work. But it just doesn't work.

Applied several different patches to the socket.io code (as recommended by users) including a few patches to node.js, trying to force socket.io to gracefully handle a disconnect-reconnect.

The only way to fix this is by using redisstore, so that all node instances communicate with each other, sharing the session data. redisstore eliminates about 95% of this issue. However, from time to time, even with redisstore, the nasty "client not handshaken" error re-appears. When that happens, the only recourse is a full page refresh.

After spending A LOT of time struggling to make this work, we eventually came up with a clean solution, that works 100% of the time, no ifs or buts.

We ended up ditching socket.io and replacing it with sockjs.
Sure it has no bells & whistles, no emit function and no auto-reconnect.

It just works.

Michel Hiemstra

Hi,

Yes, it's very sad to see so little support from the package maintainers on such an important issue. I have my node application running in single-process mode now and it can support ~10k concurrent users. But my project requires scaling up to ~30k and I want to run multiple servers to guarantee uptime. So I'll look into your tip about sockjs, thanks.

Ignacio Tolstoy

Here I have 40k users now and a lot of handshake errors... thinking of changing sockets too :(

David Fooks

SockJS is good and I've never seen this error in my testing. I've been meaning to switch to it at some point. The best thing about it is that it is lightweight and works just like the WebSocket API (future-proof), unlike socket.io, which is far more powerful than what we are actually using it for.

Marc Harter

We switched our code base from socket.io to sockjs and generally it's been a lot easier to maintain (adding room/context and reconnect support was pretty simple); the module does less, which turned out to be a big maintenance win.

Guillermo Rauch
Owner

This issue doesn't exist in engine.io in my testing, so you basically need to wait until it's integrated into socket.io. Duplicating transport work on both projects at this point is less than ideal.

Maximiliano Guzenski

To say that this issue only happens when you have multiple instances/processes is not true.
I have a single instance (EC2 m1.small) running node with a single process... and I have this issue (the "677 hack" helps a lot).

node 0.8.14 and socket.io 0.9.11. Socket.io is exposed directly on port 3000 to clients (no nginx, no haproxy, nothing).

softwareprojects

@guille, love everything you're doing - really appreciated!

But your comment about "so you basically need to wait", doesn't work for us
We don't want to wait :-)

Having such a critical bug unattended for over a year is not reassuring. Two of our engineers spent a full week wrestling with this, applying various patches and reaching out to all the other users experiencing the same issues when socket.io runs in a multi-instance/multi-process setup.

There are more than 5,000 results in Google for "socket.io client not handshaken"!
We probably went through all of them.

Lots of pain, with no one offering any real solutions (other than integrating redisstore or using different hostnames).

No matter what we tried, socket.io failed to gracefully handle a disconnect-reconnect when the server switches.

For us, sockjs was a life saver

George Ornbo

@guille firstly thank you for Socket.IO!

This issue doesn't exist in engine.io in my testing, so you basically need to wait until it's integrated into socket.io.

The direction of Socket.IO 0.9.x isn't really working at the moment for those of us using it in production. Just look at this thread. It would be great to have an understanding of

  • Whether Socket.IO 0.9.x is still supported / actively used / officially deprecated.
  • When and if to expect Socket.IO 1.x.x.
  • Is there a branch for Socket.IO 1.x.x?
  • Can the community help push the Socket.IO 1.x.x effort forward?
  • Whether to migrate to engine.io / sockjs now?
Ryan Smith

I've also switched to SockJS from socket.io, and have had great success running it on a production environment with tens of thousands of people on one server.

While SockJS may not include all the features that socket.io does, like reconnecting and 'rooms', rolling my own implementation of those along with writing my own redis backend has proven to be much more efficient than what I experienced with socket.io.

@guille I appreciate what you have done for the socket.io community; however, I feel that letting these issues continue to run amok isn't the best way to approach this. I wish you had a more solid documentation area that describes production scaling and clustering setups along with load balancer layouts. Until these issues are addressed I'll be sticking with sockjs.

Pierre-Yves Gérardy

Haters gonna hate... Socket.IO is a gift, not something you are owed.

Race conditions in distributed systems are sometimes horrible to track down, and there are only so many hours in a day.

With Engine.IO around the corner, I don't think it's worth it to track this one down.

Keep on rocking, Guille!

softwareprojects

@pygy this is not about hate.

We are all extremely appreciative of the work put into socket.io by Guille and others.
It's a very impressive package and has helped thousands of apps go real-time, across multiple devices.

Having said that, this critical bug is simply preventing the use of socket.io in a multi-server architecture.

Once your servers go down, clients will disconnect and fail to successfully re-connect again.
To the end-user this translates to an error-message that doesn't go away until a full page refresh.

engine.io sounds promising. If you can wait for it, by all means do so.

This is not a propaganda against socket.io, but merely a description of a problem we ran into and how we eventually solved it.

We posted on this thread to provide value for others.

Hopefully others who run into this problem, can take-away from our experience and save some time.

David Lojudice Sobrinho

@guille How can we help testing the new version with Socket.io + Engine.io?

Guillermo Rauch
Owner
Allan Dumaine

I don't know if this has a direct impact on all the different configurations that everyone is using, but I think it may provide a clue as to what might be causing some of the "client not handshaken client should reconnect" issues.

Try using an ip address on the client side instead of a hostname.

To load test our application (an auction site) we used another node instance with socket.io-client on a remote server to create connections and bid in an auction. I found that we could get no more than 12 connections, but lots of "client not handshaken client should reconnect". The server acts very strangely, as the connection count seems to drift while the connections try to be maintained or reconnect.

After much head-banging I decided to try using the IP address of the server on the client and was immediately able to make 1,000 connections. I found I had to use setInterval to stagger creating the connections by about 30ms, otherwise it overloads the server. At connection 1012 I get a new error: "warn - error raised: Error: accept EMFILE". I added "ulimit -n 200000" and tried for 2000 connections. The EMFILE errors stopped. The Socket.io connection count went to 1105 and then stabilized at 1017. Not sure why. Some sort of limit from one IP address?

I have not had a chance to dig into why an IP address has no issue connecting. It could be that DNS is causing a timing or handshake issue.

Using node 0.8.14 and socket.io 0.9.11 on both sides running on ubuntu 10. Socket.io is listening on port 3300 under a node https server.

Client side:

var client = require("socket.io-client");

var socketAddr = 'https://199.2xx.xxx.xxx';
//var socketAddr = 'https://xxx.mydomain.com';

var pid = '786'; // not used below

var idStart = 1;  // test user id
var idEnd = 2000;

var socks = {};

var sCount = idStart - 1;
var doConnect = setInterval(function () {
    sCount += 1;
    if (sCount >= idEnd) {
        clearInterval(doConnect);
        console.log("Cleared connect loop");
    }
    var uid = sCount;
    console.log("New sock count=" + sCount);
    var socket = makeSocket(uid);

    // keep a reference to the socket, keyed by test user id
    socks[uid] = socket;
}, 30);

function makeSocket(uid) {
    var s = client.connect(socketAddr, {'force new connection': true, 'port': '3300', 'connect timeout': '5000'});
    init(s, uid); // init() is application-specific: emits the app's connect/bid events
    return s;
}

I hope this helps shed some light on this. I would very much like to get this resolved and have a stable socket.io.

hashi101

I have the same issue on node v0.6.x or v0.8.x with socket.io v0.9.10.
I'm going to try the 677 fix, but maybe ADumaine's solution works? Does anyone know?

Can the 677 fix cause any trouble?

Michel Hiemstra

I ported my application from socket.io to SockJS and am running it on multiple servers load-balanced by HAProxy. My benchmark reached ~50k concurrent users with minimal load, without any problems.

softwareprojects

@mdahiemstra, we are about to go down the same path.

We have been running on sockjs for the last few weeks and loving it! The next step is putting it behind HAProxy.

Would you be so kind as to share your haproxy config, or point us in the right direction?
We're running sockjs on SSL because of sockjs/sockjs-client#94

Much appreciated!

Michel Hiemstra

@softwareprojects Hi, hmm, I should look into running sockjs on SSL. Thanks for the heads up.

I used (modified) this haproxy config: https://github.com/sockjs/sockjs-node/blob/master/examples/haproxy.cfg

The file that worked for me on our staging environment is: http://cl.ly/code/2k1o1C0t1Z43

I can't provide the production configuration because it's handled by an outsourced company.

hashi101

Should I wait for Socket.io 1.0 or port my app to SockJS too?

I'm using Node v0.6.2 with Socket.io 0.9.10 right now, and even after the 677 fix (which has moved in 0.9.10 to line 711),
CPU usage still increases.

I commented out this line: //transport.error('client not handshaken', 'reconnect');

Eric Mill

Only you can make that decision. :) SockJS is good stuff, and I just switched to it - but if you were used to some of socket.io's nice features (like a named event API, and Redis interaction), then you will have to put in more work to replace those things. It's not super hard, but it is a time investment.

hashi101

I really like the Redis solution. :) Is Socket.io 1.0 going to have Redis too? And does anyone know about the Socket.io 1.0 release date?

Silva

Going to SockJS....

guilloche

When will this issue be solved?

hashi101
guilloche

When will socket.io 1.0 be released? And will this exact problem be gone?

hashi101
Guillermo Rauch
Owner

You can keep track of progress here:
https://github.com/learnboost/socket.io/tree/1.0
https://github.com/learnboost/socket.io-client/tree/1.0

It's very close. Working on tests, documentation, website and a thorough document about changes and scalability.

guilloche

Very close - when is that? In a week, a month, a year?

Pierre-Yves Gérardy

The usual answer for open source projects is "When it's done".

Keep an eye on the 1.0 repo, and don't put pressure on the author who offers you something for free.

Evan Prodromou

I just switched to sockjs instead.

Jonathan Gros-Dubois topcloud referenced this issue from a commit in topcloud/socket.io December 29, 2012
Jonathan Gros-Dubois Fixed issue #438 which caused high rate of "client not handshaken should reconnect" - With correct indentation :p
fb17aa6
hashi101

topcloudsystems, could this be a cause of the memory leaks in Socket.io 0.9.x?

Danny Gershman

The error happens even more often when using secure connections vs non-secure.

Jonathan Gros-Dubois

hashi101 - I don't know yet. I'm working on a project right now and after doing some stress testing, I've noticed a few memory leaks which I will be debugging shortly... Probably socket.io-related - I might have an answer to that soon.

Silva

One year later and this problem still persists...

hashi101

If I build my application without native Redis support from Socket.io and just do my own synchronization via Redis, should it work correctly? Without any strange things like memory leaks or reconnecting without reason?

lessmind

@hashi101 I had the same problem about one year ago. No Redis, nothing strange, but it still happened again and again.

Jeremie Pelletier

We had the same issue. Our setup uses node-redis, secure connections and process forking. The very first client to connect would trigger the bug every time.

Turns out it was the "cluster" module of node.js; we disabled it and the bug went away.

Walter Zheng

I didn't use the cluster module, but I get the same warning with about 1000 connections.

TT

I am still facing this problem...

TT magickaito referenced this issue in sockjs/sockjs-node March 20, 2013
Closed

unable to connect from iphone client. #114

sbellone

Hello,
I am also able to easily reproduce it with HAProxy, like @diegovar.

Issue reproduction:
In my case, the problem comes from Socket.IO's handshake mechanism: there are two requests, which means that if we load-balance them, one server instance will receive the first part and a second server will receive the persistent transport session, and this will fail with this error message.

Actually, with the recommended HAProxy configuration (http://stackoverflow.com/questions/4360221/haproxy-websocket-disconnection/4737648#4737648), it seems to work at first. But that's just because the first www request is redirected to the first available www_backend server (i.e. server1), and then the socket request is redirected to the first available socket_backend server (i.e. server1, and as it points to the same address, it works). Same with the 2nd client, etc., etc...

But if we restart HAProxy, all clients will try to reconnect at the same time, and the load balancing will mess up the handshake process: we have a huge amount of "client not handshaken client should reconnect".

Solutions:
One solution is to use the source algorithm, which will ensure that both requests from a given client are redirected to the same server. But this will not result in optimal load balancing.
The second solution is to use the cookies mechanism of HAProxy. This works fine with clients coming from a browser. But I did not find a solution to use cookies with the socket.io-client lib.

Questions:
So, as I would like to use the roundrobin algorithm of HAProxy AND the socket.io-client lib, I have two questions:

  • Would it be possible to deactivate the handshake process and have a single direct connection? I don't need to access the header data.
  • Is it possible, in socket.io-client, to access the cookies obtained during the first part of the handshake and to send them back when trying to establish the persistent connection?

Thanks.

sbellone

Ok, I think you can forget my comment.
Using a RedisStore instead of the MemoryStore is the solution in my case. It works super fine! :+1:
Thanks, and keep up the good work!

Matthew Madden mmadden referenced this issue in socketstream/realtime-transport April 20, 2013
Open

rtt-faye module #1

Jonathan Gros-Dubois

Ok I decided to have another look at this issue recently.
The solution I posted a while ago (https://github.com/LearnBoost/socket.io/pull/1120/files) only reduced the occurrence of these errors; it did not stop them altogether.
It seems that the cause of the problem is a race condition related to clustering socket.io across multiple processes.
This issue occurs while using the default redis store and also with socket.io-clusterhub.
If you look around line 800 of lib/manager.js, you can see that socket.io responds to the client's connection request before it publishes the handshake notification to other socket.io workers:

res.end(hs); // responds here

self.onHandshake(id, newData || handshakeData);
self.store.publish('handshake', id, newData || handshakeData); // publishes here

So occasionally, the client will know about the handshake before any other socket.io worker does.
When the client gets the response and tries to continue with the connection, it may be handled by a worker
which is not yet aware of the handshake notification, hence the client is not handshaken (according to that particular worker).
It's not practical to check that every worker has in fact received the handshake notification so I think the best way
to solve this issue is to give the worker a second chance to check the handshake in case it doesn't see it the first time (after some timeout).

Something like this might work around line 710 (it will probably need a 'handshake timeout' config property to use as the timeout):

Replace:

if (transport.open) {
  transport.error('client not handshaken', 'reconnect');
}

transport.discard();

With:

if (transport.open) {
  setTimeout(function() {
    // If still open after the timeout, THEN we will kill that connection
    if (transport.open) {
      transport.error('client not handshaken', 'reconnect');
    }
    transport.discard();
  }, replaceThisWithTheHandshakeTimeoutVariable);
}

Note that I am not making a pull request for this because it refers to the pre-1.0 version.
This issue may have been resolved in the new version.

Steve Edson

@topcloud I'm interested in trying this, what would you recommend the timeout variable to be set to?

Jonathan Gros-Dubois

@SteveEdson I guess you could add a 'handshake timeout' property as a socket.io config option. That may not completely get rid of the error. There might be other places in the code which have similar race conditions. It did reduce the number of failures.

Steve Edson

@topcloud Thanks, I'll give it a try. Any idea what the actual value should be? Would something like 1 second be too high?

Jonathan Gros-Dubois

@SteveEdson 500ms sounds about right for inter-process communication. You should experiment. Also note that I made a mistake in the code I pasted above, it should be if(transport.open && !this.handshaken[data.id]) instead of the second if(transport.open)...

Billydwilliams

@topcloud
@SteveEdson
It's been a couple of months; how have the changes worked for you?

Steve Edson

I think it helped, but I've just realised that when I updated socket.io last month, it would have overwritten the change. I'll have to reapply it and see how it goes. Cheers.

Justin Warkentin

Hopefully this helps someone. I just did a little more investigating to try to figure out what's happening in my case and found some interesting stuff. I have several instances of my Socket.io app running behind a load balancer. Normally this is just fine, but sometimes something goes wrong and it falls back to long polling. When this happens the connections get bounced back and forth between the servers behind the load balancer. If a request hits any server other than the one the client originally authenticated with, it floods the logs with these errors. The simple solution is to use the RedisStore stuff to share session information between running instances. There are also other solutions. Here are some links that may be useful:

http://stackoverflow.com/questions/9617625/client-not-handshaken-client-should-reconnect-socket-io-in-cluster
http://stackoverflow.com/questions/9267292/examples-in-using-redisstore-in-socket-io
https://github.com/LearnBoost/Socket.IO/wiki/Configuring-Socket.IO
https://github.com/fent/socket.io-clusterhub

Klaus L. Hougesen klh referenced this issue in balderdashy/sails November 12, 2013
Open

ditch socket.io #upgrade #1096

ruanosaur

@topcloud @SteveEdson Hi guys, any news on whether this would be fixed in v 1.0? Currently using redisCloud on heroku and experiencing the same error - going to try the timeout if v1.0 doesn't fix it...
