SDK v10: IOException: Connection reset by peer #363
Hi, @rocketraman. Thank you for opening this issue. We have noticed this as well. What happened here was a change to the retry policy to handle more specific network-related exceptions rather than IOException, since IOException also covers lots of non-network-related issues. We've since decided that it is better to simply look for IOException and will be changing it back in the next release, which should resolve this issue because those errors will be automatically retried.
@rickle-msft Thanks, that explains it. Any IOException emanating from the networking layer should be networking-related, so I think your decision to revert the retry behavior makes sense.
@rickle-msft Just realized this is the same problem I previously reported (Azure/autorest-clientruntime-for-java#467) -- I just didn't realize it before because it was being worked around automatically by the retry.
Ah. Good catch. So the reason we get these exceptions at all is because we pool connections and keep them alive across multiple operations, and the service will eventually time them out and close them. When we retry, we just establish a new connection and the request works fine. Reverting the retry policy should mitigate this for the user, but we'll continue looking into the root cause.
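The retry-on-IOException behavior described above can be sketched roughly like this. This is a minimal illustration of the idea, not the SDK's actual retry policy; the names `withRetry` and `maxTries` are made up for this sketch:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryOnIOException {
    // Retry a request up to maxTries times, but only when the failure is an
    // IOException (treated here as a transient network problem, e.g. a pooled
    // connection that the service has already timed out and closed).
    public static <T> T withRetry(Callable<T> request, int maxTries) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e; // stale pooled connection: the retry grabs a fresh one
            }
        }
        throw last; // retries exhausted: surface the last network error
    }

    public static void main(String[] args) throws Exception {
        // Simulate a request that fails twice with "Connection reset by peer"
        // before succeeding on a freshly established connection.
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection reset by peer");
            }
            return "200 OK";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that a policy like this only helps when the failure really is transient; it does nothing for the hang scenarios discussed later in this thread, where the connection pool itself is in a bad state.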
Was the retry policy reverted in 10.1.0? I didn't see it in the changelog.
Never mind my last, I do see the change in
Yes we did. Sorry about the gap in the changelog. I'll close this issue now since it has been addressed in the latest release. Please feel free to reopen it if you have further concerns.
@rickle-msft I just encountered this (or a very very similar) issue again with 10.1.0: Here is the stack:
Retrying multiple times did not solve the problem -- the app continued to fail with the same exception until it was restarted, at which point things started working again.
Hey, @rocketraman. I've reopened the dialogue with the team that owns the NettyClient. Hopefully we can get to the bottom of either why the socket is being closed or why we're still trying to write to it. Thanks for pointing this out again, and sorry you continue to experience difficulty here.
@rocketraman Quick question. Are you seeing this less frequently since we returned to retrying on IOException? Or the same?
Far less frequently. It's not the same issue for sure. Previously, this would happen if the CMS were idle for a while, and then a user-driven retry (without an app restart) would fix the problem. Today was the first time I saw this situation, and user-driven retries were not effective in solving the problem. The only solution was restarting the app.
Was this after the app had been running for a long period of time? Or was there some new workload it was trying to process that might hit a corner case? It sounds like you hit this error, then retries maxed out and passed the error back up to your application. You say you had to restart the app, so were all (or any number of) requests that should have been independent hitting this same issue once you hit it once?
No, it had been running for a couple of days, though it's still in dev/test so there wasn't a lot of traffic. It wasn't a corner case -- the same action (uploading a particular blob) worked after a restart.
I believe that all requests were failing once the issue had been hit, yes.
Unfortunately I had to get the service working right away, but if this happens again I'll do some more debugging at the networking level.
That would be really helpful. Thank you. I'm also going to try setting up an application that just does some uploads and downloads, run it for a couple of days, and see if I can gather any more information.
@rickle-msft I just encountered this again, with the same stack I posted in #363 (comment).
The only thread that appears to be related to the SDK is this one:
I've used [...] I also note that the [...]
Lastly, I've compared the stack dump with a working process, and there is no difference -- the SDK is only mentioned in the one thread with the same stack as shown above. I've left the process that has the SDK in the weird state running, so if there is some more debugging you want me to do on the process, let me know.
Thank you so much for following up with this. Just so I understand the timeout situation correctly, you have client -> storage sdk -> storage service. Client is timing out and is trying to cancel some outstanding operation in the sdk, and that is the point when everything seems to hang?
Yes
I suspect that is a bit backward: the SDK seems to already be hung, the client is timing out because it is hung, and the exception is reported when the client cancels its connection. I guess that causes the open channel / rx flowable for the SDK to be cancelled, which seems to prompt the exception previously reported. So the issue is not so much the exception, but why the SDK seems to be hung in the first place. I'm not seeing any odd messages or exceptions in our logs before this "hang", though.
I think we discussed before that this "Connection reset by peer" thing comes from a server-side timeout, which closes the socket; then we try to write into the half-closed socket, retry, and grab another connection from the pool. Usually this fixes it, but perhaps we're running out of connections? So after some large number of retries, there are just no more connections available, and the client sits waiting for another connection that will never come? I'm not terribly familiar with the connection pooling logic, so there could be a bug that is slowly draining connections?
@rickle-msft That seems reasonable. If it is waiting for a connection, might there be a thread blocking on a queue read or something somewhere? I still have the hung instance running, so I can continue to debug with that if we think of something concrete to look at.
@rocketraman. Update for you. The runtime team spent a while doing some investigation, and they found some suspect logic in the connection acquisition here. They are thinking that this logic rarely results in using existing channels, so the connections tend to sit idle for a long time and then close and eventually give this IOException. The runtime team is working on a fix and some testing that we should be able to try out soon.
That's good to hear. We also ran into similar hangs and eventually worked around it by creating our own
@yeroc. Thanks for sharing that workaround. I'm hopeful this fix will be successful. I'll post here when there's a version with the fix that you can try out, if you're interested.
@rickle-msft Awesome, thanks. Looking forward to it. @yeroc Can you share a bit more about your workaround? I'm starting to think it might be easiest just to use the underlying storage REST API directly with my own code and a better-tested async client -- I'm only doing PUT and GET on blobs, so it can't be that hard (famous last words) :-)
@rocketraman Yea, we started thinking the same thing (maybe it would be easier to write our own REST library) but we were on a very tight timeline so looked harder for a workaround. Anyway, more detail on the workaround... For ease-of-use we created a wrapper object which implements
and we use it to create a
the
We're creating a new wrapper for every request, so this is a sledgehammer approach to fixing the issue. We were seeing random failures after only 5 requests, so the safest approach for us was to create a new one for every high-level download/upload request. If you're doing many requests this may not work for you. In our case we're manipulating larger files relatively infrequently, so initializing a new client every time wasn't much of a performance hit.
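The pattern described here -- tearing down and recreating the client (and therefore its connection pool) for every high-level operation -- can be sketched generically as follows. `BlobClient` and `withFreshClient` are hypothetical stand-ins for this illustration, not the storage SDK's actual types:

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class FreshClientPerRequest {
    // Hypothetical stand-in for a client that owns a connection pool.
    interface BlobClient extends AutoCloseable {
        String get(String name);
        @Override
        void close(); // tears down the pool and every pooled socket
    }

    // Run one high-level operation against a brand-new client, then close it.
    // Sledgehammer approach: no connection lives long enough for a server-side
    // idle timeout to half-close it underneath us.
    static <T> T withFreshClient(Supplier<BlobClient> factory,
                                 Function<BlobClient, T> op) {
        try (BlobClient client = factory.get()) {
            return op.apply(client);
        }
    }

    public static void main(String[] args) {
        // Toy client to demonstrate the lifecycle.
        Supplier<BlobClient> factory = () -> new BlobClient() {
            public String get(String name) { return "contents of " + name; }
            public void close() { /* pool torn down here */ }
        };
        System.out.println(withFreshClient(factory, c -> c.get("report.json")));
    }
}
```

The trade-off is exactly as stated above: every operation pays full connection-setup cost (TCP + TLS handshake), which is only acceptable when requests are large and infrequent.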
@yeroc. You said you're seeing random failures after only 5 requests. Are these all the IOExceptions? That behavior is a bit odd to me. On the latest versions, we've been seeing these IOExceptions much less frequently, and those ones are almost always resolved by a retry. I think @rocketraman has only been seeing this issue become unrecoverable after running his application for a while, if I'm not mistaken. I'm curious why you're still seeing it so frequently. What version of the library are you using and what is your workflow? Have you observed retries being unsuccessful (retries are enabled by default)? And has this workaround completely mitigated the issue?
@rickle-msft These weren't 5 requests back-to-back but rather 5-6 requests spread out over a period of time (30 minutes or so). We were never able to create a test that reproduced the issue reliably. These ended in hangs. In most cases we observed no retry being attempted; things just hung, with threads blocked. We're using version 10.1.0. To date the workaround has completely mitigated the issue. As per above, we completely tear down the connection pool after each upload/download, so the window for the connections or client to get into an unrecoverable state is very small.
@anuchandy Sorry for the delay, I was on vacation last week. I've deployed a new version of our service with the correct version of Netty, and without the native epoll. I can immediately note that we are no longer seeing the [...] Will do some testing over the next week or two to see if we run into any issues similar to before, but I'm cautiously optimistic here.
@rocketraman thank you for testing this. Sure, waiting to see the result of the long run.
I wasn't sure I was hitting the same issue, as I have no other Netty dependencies, just the one introduced by azure storage, which is why I was reporting a new issue #438. I am on blob storage version 10.5.0. The dependency graph does show a superseded version though; is this normal?:
(Full dependency graph attached below) My issue is similar: I am uploading blobs to save json-formatted text data into Azure Blob Storage. When I run the application overnight, receiving messages about once every 10 minutes, the application stops being able to upload or download data from Azure storage. I also see many "channels leaked" and "Connection reset by peer" messages from the Azure SDK or libraries it uses. I do not have any other version of netty in my app except for the one pulled by
The errors I have got are like the following:
The application does stall in the sense that no new azure blob reads or writes get through anymore at that point. This happens at around 100 "leaked channels" as reported by the shared channel pool print. I do not see other errors. Here is the stall from this morning:
After this, retries just hang:
@rocketraman I need your help. Could you describe how you disabled the native transport for Linux? (A pom file change? What does it look like?) It will be very helpful for anyone having the same issue [@lagerspetz seems to be hitting a similar issue].
@anuchandy As I understand it, the native transport is enabled only if a runtime dependency on the appropriate artifact is included e.g.:
Since the Storage SDK doesn't include this in its POM, it won't be included for the user either unless they add it themselves, or another dependency they have adds it for them (as was my case above). For @lagerspetz, it's odd that his Netty version was superseded by Gradle -- that shouldn't happen unless the dependency was explicitly specified, or unless Gradle is applying a version conflict resolution algorithm, which shouldn't be the case if netty isn't being pulled in by any other dependency. A Gradle build scan (--scan) might help. I don't believe @lagerspetz's dependencies include any native libs though, based on his deps report, which is an interesting data point if so -- that means the issue is not with the native libs, but rather with newer versions of Netty (or in how the storage SDK uses them, of course).
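For anyone wondering what that runtime dependency looks like: the Netty native epoll transport is typically pulled in with coordinates like the following, shown here as a Maven sketch (the version and scope are illustrative, not taken from this thread). Omitting it, or excluding it from whatever dependency drags it in, keeps Netty on the portable NIO transport:

```xml
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-transport-native-epoll</artifactId>
  <version>4.1.28.Final</version>
  <classifier>linux-x86_64</classifier>
  <scope>runtime</scope>
</dependency>
```

Netty falls back to NIO automatically when this artifact is absent from the classpath, which is why simply not shipping it "disables" the native transport.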
--scan did not reveal anything; it just says that the version of netty was "Selected by rule".
@lagerspetz It looks like you are using Spring Boot -- it may very well be the cause of your "Selected by rule". I believe Spring Boot does all sorts of things to try and set the versions of things automagically. See for example: https://github.com/lkishalmi/gradle-gatling-plugin#spring-boot-and-netty-version.
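If Spring Boot's dependency management is indeed what is pinning Netty, one way to take back control is to override the version property that Spring Boot manages. This is a sketch for a Gradle build using the Spring dependency-management plugin (`netty.version` is the property Spring Boot uses for Netty; the version shown is illustrative):

```groovy
// build.gradle: force Spring Boot's dependency management to use a
// specific Netty version instead of its default
ext['netty.version'] = '4.1.28.Final'
```

With this in place, "Selected by rule" should resolve to the version you set rather than the one Spring Boot's BOM chose.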
Thanks @rocketraman, I found this as well. For reference, here is the dependencies section that produces the same deps as the one in my above comment:
Even with this version, I am still getting the error. There's also an "Unexpected failure attempting to make request." that shows up after the 100 leaked channels:
@anuchandy Like @lagerspetz, I can also confirm I still have the issue, with no native epoll and the correct version of netty. Not sure this means anything, but it also "hangs" for me way before reaching 100 leaked channels -- in fact the hanging behaviour seems to be somewhat random.
thanks @rocketraman & @lagerspetz for validating. @lagerspetz could you share a bit more about the nature of your application? You shared the following code:

@Override
public int store(String name, String jsonData) {
    BlockBlobURL blob = containerURL.createBlockBlobURL(name);
    byte[] array = jsonData.getBytes(AuthUtils.utf8);
    ByteBuffer buf = ByteBuffer.wrap(array);
    Flowable<ByteBuffer> data = Flowable.fromArray(buf);
    long length = array.length;
    Single<BlockBlobUploadResponse> resp = blob.upload(data, length);
    BlockBlobUploadResponse result = resp.blockingGet();
    return result.statusCode();
}
EDIT: I am investigating possible client issues that might cause the stored message to contain content other than expected (JSON Untyped vs JSON typed). The below is still an accurate description of the problem.
I am using Java 8 (openjdk-8-jre-headless)
I am trying with Java 10 now, and using a different client-side format, but I still get this issue.
I'm on Java 11 now, btw. That doesn't help either.
@lagerspetz @rocketraman Have you guys tried bumping Netty to the latest patch version manually? I was experiencing the same issue and doing that seems to have helped. My
It might be a temporary measure until they fix the SDK.
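One common way to bump Netty to a specific patch version across all its modules in a Maven build is to import the Netty BOM in `dependencyManagement`. This is a sketch, not @marcioos's actual configuration (which was elided above); the version shown is illustrative:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Pin every io.netty:* artifact to one consistent patch version
         via Netty's published BOM -->
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-bom</artifactId>
      <version>4.1.30.Final</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Using the BOM avoids the subtle breakage that can come from mixing Netty modules at different patch levels, which a single per-artifact override can cause.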
Thanks @marcioos, I am trying that next. For me it was enough to do:
UPDATE: This does not work for me. The app still hangs.
After pushing v10 to our production environment, the issues found by @rocketraman started manifesting quite consistently. I ended up rolling back to v8.
Thank you guys, for the updates. We are currently spending a considerable amount of time investigating this issue and working towards a resolution. We will let you know here when there is progress.
We are also affected by this - at this moment this API version is useless for production loads. This should be clearly stated in the README and docs.
@mzarkowski - we have validated a fix for this internally. We'll reply to this thread with more information today.
Quick update. We are prepared to release the version which contains this fix as soon as some service features light up (to support other features added in this release). That deployment is in its final stages. Thank you all for your patience and your contributions to this effort. We look forward to publishing these fixes and unblocking everyone here. As always, if you encounter any other problems, we are happy to continue to work with you.
We have released v11.0.0, which depends on v2.1.0 of the runtime, containing several fixes related to connection issues. Please consider upgrading and giving this a try. Thank you all again for your participation in this issue and for your patience in its resolution. I will close this issue now as we believe that we have fixed it, but please feel free to reopen it if you continue to experience problems in this area.
I'm still regularly getting "Connection reset by peer" with v11; there's been no difference for me between 10.x and 11 (other than no longer getting ConcurrentModificationException errors).
I am getting Connection reset by peer regularly also, but I have so far got only 1 leaked connection in over two weeks, and the library doesn't seem to hang. So it would appear that the connection resets are OK in my case: the library retries the upload, and they do not affect functionality.
@lagerspetz I am happy to hear that you are seeing better results now. @Spinfusor Could you please provide the output of mvn dependency:tree so we can first validate that all the dependencies you are using are as they should be to pull in the fix?
I'll add a vote for positive results with v11. I haven't noticed any library hangs any more, and very few leaked connections.
@Spinfusor Thank you for sending that. Do you have some logs that we can look through that capture the failure? And can you describe the behavior of your application and when it hits this issue?
Which service(blob, file, queue, table) does this issue concern?
Blob
Which version of the SDK was used?
10.0.4-rc
What problem was encountered?
Upon upgrade from 10.0.1-preview to 10.0.4-rc, I occasionally get the following exception:
java.io.IOException: Connection reset by peer
with the stack:
This error was encountered during an upload, and it doesn't look like retry worked either.
I don't recall ever seeing anything similar with 10.0.1-Preview and will likely downgrade to that version until this is resolved.
Have you found a mitigation/solution?
No