frequent socket hangups #116

Closed
tj opened this Issue Nov 13, 2012 · 51 comments

Contributor

tj commented Nov 13, 2012

<3 node

Error: socket hang up
    at createHangUpError (http.js:1264:15)
    at CleartextStream.socketCloseListener (http.js:1315:23)
    at CleartextStream.EventEmitter.emit [as emit] (events.js:126:20)
    at SecurePair.destroy (tls.js:938:22)
    at process.startup.processNextTick.process._tickCallback [as _tickCallback] (node.js:244:9)
---------------------------------------------
    at registerReqListeners (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:38:7)
    at Client.Client.putStream [as putStream] (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:264:3)
    at Client.putFile (/home/vagrant/projects/thumbs/node_modules/knox/lib/client.js:232:20)
    at Object.oncomplete (fs.js:297:15)

looking into it, seems ridiculous to accuse s3 here, but it wouldn't surprise me either way

Contributor

domenic commented Nov 13, 2012

Yeah we've run into this very sporadically as well. Not really sure where the blame goes either.

Contributor

tj commented Nov 13, 2012

my gut says node, because this would be completely unacceptable availability for such a service (1/5 concurrent requests failing), but if it really is s3's backlog denying connections or similar then.. wtf.. lol. I know some kernels will silently drop denied sockets without any notice, so it could be that

Contributor

domenic commented Dec 25, 2012

I just found a "solution" on the nodejs mailing list that I'd never seen before:

https://groups.google.com/d/msg/nodejs/kYnfJZeqGZ4/uHVOfFneroAJ

If someone gets this reproducibly it's worth trying some of those fixes.

Contributor

tj commented Dec 25, 2012

forgot about this, closing until we're running this portion in prod because it might just be my crappy local canadian connection haha

tj closed this Dec 25, 2012

Contributor

tj commented Feb 8, 2013

still a problem in prod, either knox is busted, or node is busted. I'll try and take a closer look at the packets soon and see wtf is going on

tj reopened this Feb 8, 2013

Contributor

domenic commented Feb 15, 2013

Some of the recent changes in Node 0.8.20 look related; might be worth giving it a shot.

Contributor

tj commented Feb 15, 2013

oh really? which ones?

Contributor

domenic commented Feb 15, 2013

From http://blog.nodejs.org/2013/02/15/node-v0-8-20-stable/

http: Do not let Agent hand out destroyed sockets (isaacs)
http: Raise hangup error on destroyed socket write (isaacs)

Hmm not sure.

Contributor

tj commented Feb 15, 2013

hmm worth a try ill update our node

Contributor

tj commented Feb 15, 2013

no dice

7Ds7 commented Feb 20, 2013

This is not just happening with knox but with anything using socket.io.

I am not a node expert by any means, but I can reproduce this error when a client closes the window and does not notify the socket.

i.e. on Android v4.0.2, closing a tab that is listening to sockets from the tab manager does not send a disconnect or window.onbeforeunload event; a subsequent request from another browser then tries to send to that hung-up socket, crashing the server with a "socket hang up" error

Contributor

domenic commented Feb 20, 2013

@7Ds7 that particular case is pretty much expected behavior, as outlined here. If the user hangs up on your socket, you of course will get a socket hang up error.

I'm pretty sure Knox either returns event emitters that you can listen to the "error" event on, or properly catches all "error" events and transforms them into err parameters to the callback. Willing to be proved wrong though.
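
For illustration, a minimal sketch of listening for the "error" event on the request object knox's put() returns (credentials, bucket, and key below are placeholders), so a hangup surfaces as an event instead of an uncaught exception:

var knox = require('knox');

var client = knox.createClient({
  key: '<api-key>',       // placeholder credentials
  secret: '<secret>',
  bucket: 'my-bucket'     // placeholder bucket
});

var body = JSON.stringify({ hello: 'world' });
var req = client.put('/test/obj.json', {
  'Content-Length': Buffer.byteLength(body),
  'Content-Type': 'application/json'
});

req.on('response', function (res) {
  console.log('saved with status', res.statusCode);
});

req.on('error', function (err) {
  // a "socket hang up" lands here rather than crashing the process
  console.error('upload failed:', err);
});

req.end(body);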

danmilon commented Feb 22, 2013

I'm able to reproduce this in our code. Didn't have time to isolate it in a test case yet.
Apparently S3 is not very fond of keeping idle connections alive.

< <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>AFB17A5C4F0A0B56</RequestId><HostId>zOLEoTDEZg8QAY9ZVzBTPNpFNvFBxXd5J1E62slzLuollhMHpLztnK0Z2aHuXi40</HostId></Error>

Contributor

rauchg commented Feb 23, 2013

Wow interesting find Dan

I've encountered this as well, with the same timeout error reported by @danmilon.

My original use case was piping an HTTP response directly into S3 using putStream. I'd been using that for ~2 months and had never seen this issue. I ran into it for the first time today, after I switched to putFile (I needed to add some local pre-processing, so I now write to a temp file first). Not sure if there's a difference between putFile and putStream or if it's purely coincidence.

I'm using somewhat old versions of knox (0.4.2) and Node (0.8.6). I'll update to the latest and let you all know if I see this again.

Contributor

kof commented Mar 18, 2013

I can reproduce this error very reliably using node v0.8.21.

And I think I have nailed the issue: it happens if the maxSockets of the agent is lower than the number of requests we are doing.

If I set https.globalAgent.maxSockets = 50 and do 50 parallel requests, after some seconds the error will be there.

If I do 40 parallel requests, I am able to download thousands of files from S3.

Possible solutions (see the sketch after this list):

  1. I think first of all it is a documentation issue in node as well as knox. Both of them should mention that the default Agent has maxSockets == 5. Node should mention this not only where the maxSockets option is described but also in the 3-4 other places where users read how to create requests.
  2. Knox could set its agent's maxSockets to something much higher than 5, e.g. 500, because knox will often be used with many connections per host. Knox could also expose and document a maxSockets option which is then forwarded to the Agent.
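
For illustration only, a minimal sketch of the kind of setup described above, assuming knox's documented createClient/getFile API; the credentials, bucket, keys, and maxSockets value are placeholders, not recommendations:

var https = require('https');
var knox = require('knox');

// Default in node 0.8 is 5; raise it before firing many parallel requests.
https.globalAgent.maxSockets = 50;

var client = knox.createClient({
  key: '<api-key>',
  secret: '<secret>',
  bucket: 'my-bucket'
});

// Fire N parallel GETs; with maxSockets below N the extra requests queue
// on the agent, which is where the hangups described above showed up.
for (var i = 0; i < 50; i++) {
  client.getFile('/file-' + i + '.jpg', function (err, res) {
    if (err) return console.error('request failed:', err);
    res.resume(); // drain the response so the socket can be reused
  });
}
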
Contributor

kof commented Mar 18, 2013

Possibly node could throw something more meaningful than socket hang up in this special case?

Contributor

kof commented Mar 18, 2013

or is the queueing logic in node wrong somewhere...?

@kof, could you share the code to reproduce this?

Contributor

kof commented Mar 19, 2013

it's a script with some dependencies on the main project; I need to reduce it to a pure reproducible snippet... but I can post the original script if somebody wants to play with it.

Contributor

kof commented Mar 19, 2013

setting agent=false solves the issue for me too.

Contributor

domenic commented Mar 19, 2013

setting agent=false? Can you explain?

@substack says the problem can be solved by calling https.request with { pool: false }. That might be the way to go for Knox? Or at least make it an option that is on by default?

Contributor

kof commented Mar 19, 2013

I mean agent=false on request options:

https://github.com/LearnBoost/knox/blob/master/lib/client.js#L139

http://nodejs.org/docs/v0.8.21/api/all.html#all_http_request_options_callback

There is no documented option pool=false, but agent=false will do exactly that:

"false: opts out of connection pooling with an Agent, defaults request to Connection: close."

This will fix the issue, but it then becomes possible to open an unlimited number of sockets to the same host, where one can run into OS limits.

I suppose this is the reason why maxSockets option exists.

  1. I don't understand why node's default is that low (5)
  2. It seems like there is an error in node's pool logic

It looks like the right way would be to pass in our own Agent instance with a good default maxSockets value that works for all OSs, but when that limit is reached, node's pool logic will be an issue again.
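
For reference, a minimal sketch of what agent=false looks like on a raw node https request; the host and path are placeholders, and this only illustrates the node-level option quoted above, not knox's internal code:

var https = require('https');

var options = {
  host: 'my-bucket.s3.amazonaws.com', // placeholder bucket
  path: '/some-key',
  method: 'GET',
  agent: false // opt out of the shared Agent pool; request defaults to Connection: close
};

var req = https.request(options, function (res) {
  console.log('status:', res.statusCode);
  res.resume();
});

req.on('error', function (err) {
  console.error('request error:', err);
});

req.end();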

I misspoke earlier. { agent: false } is the correct thing. You should basically never, ever use anything other than { agent: false } in any program ever. The default value in core is completely absurd.

Contributor

andrewrk commented Mar 19, 2013

@substack can you link to more information about this? sounds like you're onto something but obviously the casual reader should look more into it before taking your word at face value

Contributor

tj commented Mar 19, 2013

fwiw we're not using node for this anymore, but even increasing the max sockets to a very high number fixed nothing, so it seems like the comment about it being related to the pooling logic could be right

domenic closed this in 0bc5729 Mar 25, 2013

puckey commented Apr 13, 2013

From Amazon S3 best practices: "Also, don't overuse a connection. Amazon S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset'). Rather than having this happen, use a connection for 80-90 requests before closing and re-opening a new connection."

I just upgraded to 0.7.0 and am running on 0.8.16 and am still seeing socket hang ups. Just using the default behavior of no agent...

Anyone still seeing issues?

We are doing a really large number of s3 put requests (10-15 a second) when seeing this behavior.

My bigger concern is that it is crashing my app. I am handling callback errors but am guessing I am missing something?

Any hints would be appreciated

gabceb commented Apr 19, 2013

I am also seeing this issue with only one upload using node-multiparty. This gist shows the code I am using. The code came from one of the examples on the multiparty package. The upload code starts at line 68.

All I get from the request is <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message></Error>

I am using Node 0.10.4, knox 0.7.0 and multiparty 2.1.5

gabceb referenced this issue in pillarjs/multiparty Apr 19, 2013

Closed

problem with s3 example #2

Contributor

kof commented Apr 19, 2013

@puckey I don't understand this statement: "don't overuse a connection." knox creates a separate connection for every api call.

puckey commented Apr 19, 2013

Okay, in that case ignore it. I thought it might be relevant to this problem.

Contributor

kof commented Apr 19, 2013

I just took a look via lsof; I am definitely not making too many connections, approx. 30 in parallel. Agent is false. Amazon is doing something weird.

Contributor

kof commented Apr 19, 2013

I just had the problem again. I wrote a script which has to process 17k images, and I now have it running to completion. Hangups do still happen, but I just use retry: https://npmjs.org/package/retry

Without retry I had up to 80 hangups for 17k requests.

I am not sure whether knox should use retry in the client itself, but it might be a good idea.
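
A minimal sketch of that approach, assuming the retry package's documented operation API and knox's putFile; credentials, file names, and retry options are illustrative:

var knox = require('knox');
var retry = require('retry');

var client = knox.createClient({ key: '<key>', secret: '<secret>', bucket: 'my-bucket' });

// Retry a single upload with exponential backoff on hangups / resets.
function putFileWithRetry(src, dest, callback) {
  var operation = retry.operation({ retries: 5, factor: 2, minTimeout: 500 });
  operation.attempt(function () {
    client.putFile(src, dest, function (err, res) {
      if (operation.retry(err)) return; // schedules another attempt if retries remain
      callback(err ? operation.mainError() : null, res);
    });
  });
}

putFileWithRetry('./images/0001.jpg', '/images/0001.jpg', function (err, res) {
  if (err) return console.error('upload failed after retries', err);
  console.log('uploaded with status', res.statusCode);
});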

Contributor

domenic commented Apr 19, 2013

@kof @puckey:

I don't understand this statement: "don't overuse a connection." knox creates for every api call a separate connection.

I think Node.js itself maintains a connection pool, and that's part of what the agent: false business disables. So that was a pretty good find, most likely.

@gabceb

That isn't actually the same problem as in this thread, which discusses "socket hang up" errors. See the OP. That is a classic error when you set an incorrect Content-Length; I'd try asking the multiparty maintainers. Maybe you're trying to upload something smaller than 5 MB? (Not sure how multiparty works.)

@addisonj

Sorry to hear you're still having this problem. I wonder if they fixed it in later versions of Node? It'd be worth trying on Node.js 0.10.4 if you can, or at least 0.8.20 per above.

The crash was our fault: we simply forgot to listen for an error event.

I will see if I can get things on 0.8.20, but for now we have something written by one of our guys, https://github.com/jergason/intimidate, which retries with back-off. We hit s3 so often that I would not be surprised to see failed requests every so often, but this has worked well so far.

gabceb commented Apr 19, 2013

Thanks @domenic. I am following up on this issue on the multiparty repo

Contributor

domenic commented Apr 19, 2013

@addisonj that's awesome! Submit a pull request to add it to our "Beyond Knox" section in the readme :).

jergason added a commit to jergason/knox that referenced this issue Apr 19, 2013

jergason: Add section on `intimidate` to Readme.md
Per discussion in issue #116, adding a blurb about intimidate, a wrapper for retriable uploads with exponential backoff, to the Beyond Knox section of the readme.
ba6fbaf

stalbot commented Nov 18, 2013

Anyone have a good way to reproduce this? We are getting this issue occasionally (i.e. once every few days on a lightly used production machine), but I can't reproduce deliberately in any environment. I will try some of the fixes here and see if they end the issue, but it would be great to have a way to make sure.

Contributor

domenic commented Nov 18, 2013

@stalbot if you can create a reliable reproduction I will jump all over fixing this.

Contributor

dweinstein commented May 14, 2014

Disabling the http agent seems like a way for some people to shoot themselves in the foot. If you're not careful and make too many requests (e.g., doing a head or headFile request) in a loop without limiting the number of simultaneous connections, you're likely to end up with something like the following result:

Possibly unhandled Error: connect EMFILE
    at errnoException (net.js:901:11)
    at connect (net.js:764:19)
    at net.js:842:9
    at asyncCallback (dns.js:68:16)
    at Object.onanswer [as oncomplete] (dns.js:121:9)

And wonder WTF....

Therefore I made sure the agent was re-enabled (e.g., knoxClient.agent = require('https').globalAgent;) and instead tweaked the maxSockets field.
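
A minimal sketch of that workaround, taking the comment above at its word that the knox client exposes an agent field; credentials, bucket, and the maxSockets value are placeholders:

var https = require('https');
var knox = require('knox');

var client = knox.createClient({ key: '<key>', secret: '<secret>', bucket: 'my-bucket' });

// Re-enable pooling (per the comment above) instead of opting out with agent=false,
// and raise the cap so parallel requests queue instead of exhausting file descriptors.
client.agent = https.globalAgent;
https.globalAgent.maxSockets = 64; // illustrative value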

Contributor

domenic commented May 14, 2014

I assume that's on a Mac, which has a pretty horrible global limit on file descriptors?

Contributor

dweinstein commented May 14, 2014

Yes that was on a Mac.

@dweinstein just bump your limit for maximum # of sockets

Matt Harrison commented Jul 12, 2014

I can reproduce this by creating 1000 empty text files (touch {1..1000}.txt) and then trying to push them up to S3. 90% of the time there will be a socket hangup. I also get this exact same thing with a go package I'm working on, putting each request in its own goroutine. Inspecting the tcpdump, I can see Amazon is sending an RST packet and closing the connection, which returns an ECONNRESET. The only way I can think of solving this is baking in retry.

Contributor

kof commented Jul 12, 2014

is there a limit for concurrency?

The only limit I've seen documented is the one mentioned above. But that pertains to reusing a connection, which I'm not doing.

Contributor

tj commented Jul 12, 2014

FWIW s3 has a concurrent access limit on like-named prefixes. I can't find the thread, but apparently due to how they store things, if you have say "foo-{1,10000}" and try say 1500 concurrent requests, many will fail, but if you have "{1,100000}-foo" it should be fine. This is screwing us pretty hard right now; looking at replacing s3 altogether and just using Riak for our primary access, storing in s3 as a backup

Soullivaneuh referenced this issue in fzaninotto/uptime Sep 11, 2014

Open

Regular went down checks #248

I found this http://stackoverflow.com/questions/27392923/uploading-to-s3-with-node-knox-socket-hang-up when searching for a solution to this problem. Looks like a lot of people have this problem with our good ol' friend S3.

var req = client.putStream(res, elem._id, headers,function(err,s3res){
    if(err) console.log(err);
    console.log(s3res);
}).end();

This was the last comment on that question; it worked for the guy who posted the solution.

The following solved it for me:

var http = require('http')
http.globalAgent.maxSockets = 2048

Some suggest 1024, but I was still getting some errors. I upped it to 2048 and it works fine.

droppedoncaprica commented Aug 3, 2016 edited

I know this isn't StackOverflow, but for anyone coming into this issue in the future, make sure you call .end() on the knoxClient.get() call.

I.e. this will return a socket hang up...

client.get('s3-key')
.on('response', handleResponse)
.on('error', handleError);

and this will not

client.get('s3-key')
.on('response', handleResponse)
.on('error', handleError)
.end();

A pretty obvious coding error, but one that I usually end up running into when I add Knox to a new project. 😆

pocketjoso referenced this issue in bitinn/node-fetch Sep 24, 2016

Closed

socket hang up when using maxSockets #125

DaGaMs commented Nov 20, 2016

FWIW, I've had this problem when I was passing the request object into a callback function in Loopback. Long story short, the only thing that "fixed" the problem was to do getFile(url, (err, res) => {callback(null, res)}) - in that way the data is streamed correctly.
