Connection stuck during network outages, leading to memory leak and application freeze #3353

Closed
ayarov opened this issue Dec 1, 2014 · 19 comments


ayarov commented Dec 1, 2014

I'm working with the .NET client 2.1.2.
We recently upgraded it from the old 1.xxx version.

Now we are trying to simulate an unstable environment where connection issues occur from time to time.
To be more specific, I simulated disabling Wi-Fi and re-enabling it after a while.
This left the communication channel stuck: no message could be sent or received, and not even stopping the connection worked.
Looking deeper, I found that at some point there were hundreds of threads hanging due to a bad implementation of the Heartbeat callback event.
Also, the CheckKeepAlive method takes a lock at its entrance, which means all of these threads are stuck until the first one, the one trying to abort, completes (and in this case I doubt whether the abort is ever going to finish). Meanwhile the heartbeat keeps ticking and generating more and more threads, hanging the application.
image

Member

halter73 commented Dec 1, 2014

Thanks for the bug report. Do you have the stack trace for any thread that might be holding the lock?

You mentioned that "threads are stuck until the first one that is trying to abort will not complete". Does that mean that there is a thread blocked in TransportAbortHandler.Abort?

Are you testing more than one client simultaneously? If so, how many? How long does it take for hundreds of threads to get blocked in the HeartbeatMonitor?

Author

ayarov commented Dec 2, 2014

First, please refer to issue #3233.
After posting this issue I found a similar issue from August on your list, which David was not able to reproduce as he didn't have the scenario, so I guess we have one now.

As you can see in the picture, all of the threads are stuck in System.Threading.Monitor.Enter.
I meant the HttpWebRequestWrapper Abort method, which uses HttpWebRequest.

We are using a WPF application with a single AppDomain. There is always at least one connection, but in my scenario I had only one.

In my scenario I get into this state in about a minute, two at most.

P.S. I didn't do this with SignalR.Client compiled with PDBs, so I don't really have debugging evidence that I'm right, but I opened the latest SignalR solution, traced to the method that is stuck in each thread's stack trace, and didn't find any other path that leads to this problem.

image

Author

ayarov commented Dec 3, 2014

Attached link to a dump file.
The password will be supplied by email.

https://onedrive.live.com/redir?resid=E1BAD7EF461ED05A!721&authkey=!APqe50orMO0P9zI&ithint=file%2czip

Author

ayarov commented Dec 3, 2014

I think I fixed it temporarily for us, but you should probably consider this as a standard fix for the SignalR client:
image
The VerifyLastActive method checks whether the connection's last activity is older than the reconnect window, and that is perfect. But during a network outage, the keep-alive monitor gets stuck on the web request abort:
e.g. ServerSentEventsTransport -> LostConnection (used by the keep-alive timer)

image

        // By Alex Y.
        // LostConnection is executed from a critical section guarded by a "lock" statement.
        // To avoid thread starvation and a memory leak (CheckKeepAlive is invoked by a timer and
        // uses a new thread each time), run the abort on another thread and wait for it no longer
        // than the reconnect window (see the VerifyLastActive method).
        if (_request != null)
        {
            var task = Task.Factory.StartNew(_request.Abort);
            task.Wait(connection.ReconnectWindow);
        }

All this leaves the connection marked with a last-active time stamp that gives us:
(Now - last active) > reconnect window
The connection is then stopped, which is the correct flow once the connection is out of order.
Then we, and anyone else, can implement reconnect or whatever else is needed.

Member

halter73 commented Dec 4, 2014

Thanks a lot for the dump file and the extra information. I found that this bug has been previously reported in yet another issue: #2325

The call to _request.Abort() is calling into HttpResponseMessage.Dispose, which can sometimes hang on some older versions of .NET 4.5 while waiting for the response to drain. For this reason, it is probably a good idea to do this on another thread as you have. We could also try to ensure that only one thread at a time attempts to acquire the _connectionStateLock in HeartBeatMonitor.Beat, as suggested in #2324.
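
The idea of letting only one heartbeat tick at a time contend for the lock could be sketched roughly as below. This is only an illustration under stated assumptions, not the change proposed in #2324 and not SignalR's actual HeartbeatMonitor: the class name GuardedHeartbeat, the SkippedBeats counter, and the injected checkKeepAlive delegate are all hypothetical. The only framework calls used are Monitor.TryEnter, Monitor.Exit, Interlocked.Increment, and Volatile.Read.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a keep-alive monitor; this is NOT SignalR's HeartbeatMonitor.
public sealed class GuardedHeartbeat
{
    private readonly object _stateLock = new object();
    private readonly Action _checkKeepAlive;
    private int _skippedBeats;

    public GuardedHeartbeat(Action checkKeepAlive)
    {
        if (checkKeepAlive == null)
        {
            throw new ArgumentNullException("checkKeepAlive");
        }
        _checkKeepAlive = checkKeepAlive;
    }

    // Number of ticks dropped because an earlier tick still held the lock.
    public int SkippedBeats
    {
        get { return Volatile.Read(ref _skippedBeats); }
    }

    // Meant to be called from a timer callback.
    // Returns false (without blocking) when another thread is still inside the check.
    public bool Beat()
    {
        // TryEnter returns immediately instead of queueing the thread behind the lock,
        // so a hung check cannot make later ticks pile up.
        if (!Monitor.TryEnter(_stateLock))
        {
            Interlocked.Increment(ref _skippedBeats);
            return false;
        }

        try
        {
            _checkKeepAlive();
            return true;
        }
        finally
        {
            Monitor.Exit(_stateLock);
        }
    }
}
```

With a guard like this, a tick that arrives while an earlier one is still stuck in the abort returns right away, so at most one thread is ever blocked. It does not unblock the stuck thread itself, which is why timing out the abort is still worth doing.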

From looking at your dump, it appears that the client was running on a machine with an older version of .NET 4.5. I had a discussion with @Tratcher who suggested that HttpResponseMessage.Dispose should not hang on .NET 4.5.1 and later. I understand that in practice it can often be difficult/impossible to ensure that your application always runs on an up-to-date version of .NET.

Author

ayarov commented Dec 4, 2014

First, thanks a lot for confirming my change is correct.
Regarding the connection state lock for the keep-alive mechanism, I also considered this and you are actually confirming my doubts, thanks for that as well.

Just one more issue to complete this thread:
I also tried to stop the connection while the network cable is unplugged.
This got me stuck once again, but this time on disposing of the HttpRequestMessage.
I looked deeper into the System.Net.Http.HttpRequestMessage Dispose implementation and found that it is actually stuck on the dispose of the internal HttpContent read stream, which is a System.IO.Stream.

Now, the question is whether .NET 4.5.1 behaves differently for this one too?
In the meantime I worked around it with an async operation that waits up to 5 seconds for the stream close/dispose to complete.

image
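
Since the screenshot is not reproduced here, the following is a rough sketch of the kind of bounded wait described above, not the actual change. The helper name BoundedDispose.TryDispose is hypothetical; the only framework calls used are Task.Run and Task.Wait(TimeSpan).

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper; not part of SignalR or System.Net.Http.
public static class BoundedDispose
{
    // Runs the given dispose action on a thread-pool thread and waits at most `timeout`.
    // Returns true if the action finished in time, false if it is still running.
    // If the action throws, Task.Wait rethrows it wrapped in an AggregateException.
    public static bool TryDispose(Action dispose, TimeSpan timeout)
    {
        if (dispose == null)
        {
            throw new ArgumentNullException("dispose");
        }

        Task disposeTask = Task.Run(dispose);
        return disposeTask.Wait(timeout);
    }
}
```

A call might look like `BoundedDispose.TryDispose(stream.Dispose, TimeSpan.FromSeconds(5))`, where `stream` is whatever object hangs on dispose. Note the trade-off: when the wait times out, the caller moves on, but the thread-pool thread running the dispose stays blocked until the underlying call returns, if it ever does.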

Please let me know what you and your team suggest for this one.

Member

halter73 commented Dec 4, 2014

HttpResponseMessage.Dispose should not hang in .NET 4.5.1 and later. The Dispose method you see hanging when you stop the connection is the exact same method you see holding up ServerSentEventsTransport.LostConnection when it is called by the HeartBeatMonitor. In both cases the client is waiting for the response to drain which doesn't happen when the network is out. Upgrading .NET should fix both issues.

The change you made in DefaultHttpClient.Get to time out the call to the responseDisposer in the HttpRequestMessageWrapper's cancel callback should mitigate both issues. This callback is what is hanging when you call _request.Abort in ServerSentEventsTransport.LostConnection. So with this change, you should no longer need to schedule a new task in LostConnection.

Author

ayarov commented Dec 5, 2014

Thanks, I'll compile a custom version for us.
Is there any plan to include this fix in the next SignalR release?
And if yes, is there a planned date for it?

Thanks once again.


ghost commented Apr 27, 2016

Any news on this? I am using SignalR Client v2.2.0 in a Xamarin app and I am getting exactly the same error. Will this be fixed in v2.2.1? If so, is there any news on when that will be? It was thought this would be available after December 2015, but there is still no sign of it. This is a major issue for me, as I cannot re-establish my connection due to the locking problem.

If there is no fix, is there a workaround you could suggest?


raghav-axero commented Apr 28, 2016

Hi,

We are facing a similar issue.

When we start our application, after a while all the users connected to the hub start getting reconnecting issues.

In Chrome console all the users are seeing this continuously:

jquery.signalR.min.js WebSocket connection to 'wss://[our site url]/signalr/reconnect?transport=webSocke…HcDfA%3D%3D&connectionData=%5B%7B%22name%22%3A%22myhub%22%7D%5D&tid=10' failed: Error during WebSocket handshake: Unexpected response code: 404

All the users are stuck in reconnecting mode, i.e. this handler fires continuously for all users:

connection.hub.stateChanged(function (change) {
    if (change.newState === $.connection.connectionState.reconnecting) {
        failPendingMessages();
        ui.showStatus(1, '');
    }
    else if (change.newState === $.connection.connectionState.connected) {
        if (!initial) {
            ui.showStatus(0, $.connection.hub.transport.name);
            ui.setReadOnly(false);
        } else {
            ui.initializeConnectionStatus($.connection.hub.transport.name);
        }

        initial = false;
    }
    else if (change.newState === $.connection.connectionState.disconnected && initial === true) {
        initial = false;
    }
});

We are using that code from the JabbR source code.

The state changes from connected > reconnecting > disconnected > connected.

It seems that after a while, when there are many connections, all the connections to the hub remain stuck, i.e. they are not able to talk to the hub properly, so the connection gets broken and we receive the "WebSocket handshake: Unexpected response code: 404" error I mentioned above.

Note that there is no fixed time for this behavior. Sometimes it starts within a few minutes of application start and sometimes it takes around 20-30 minutes.

This issue definitely seems to be caused by many connections talking to the SignalR core code via the hub at the same time.

Is there any fix for it?

Edit: We are using the latest version of SignalR, i.e. Microsoft.AspNet.SignalR.Core.dll (2.2.0)


ghost commented May 4, 2016

Any word on a fix for this? I can make this happen easily in a Xamarin app. Unfortunately all the PCL profiles only use .NET 4.5, so there is no chance of moving to a later version of the .NET Framework. This is a really big issue for me. Can it be fixed, or can a workaround be provided? This is also on v2.2.0.

@davidfowl
Member

@sisterray Did you file a bug on Xamarin?


LuoyeAn commented May 31, 2017

@sisterray any solution for the issue? I am hitting it in a Xamarin.iOS app.


ghost commented May 31, 2017

This is a comment I placed on #645, which seemed to fix the issue for me:

You could try a couple of things:

Upgrade to 2.2.1.
Ensure that you cleanly manage the connection, e.g. ensure that the previous connection is cleanly disposed of before setting your connection to a new one.
Don't abort or dispose of a connection that failed to connect, e.g. one with no connection ID. I found this seemed to work on iOS but crashed on Android. I suspect it didn't really work on iOS either but eventually caused the locking issue.
Since I have done this, the issue has gone.
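
As a rough sketch of the second and third points, assuming a minimal connection abstraction: IClientConnection, ConnectionHolder, and RecordingConnection below are hypothetical names for illustration only, not SignalR types, and the sketch assumes a single caller (no synchronization).

```csharp
using System;

// Hypothetical minimal view of a client connection; NOT the SignalR type.
public interface IClientConnection
{
    string ConnectionId { get; }   // assumed null or empty until a connection is established
    void Stop();
}

// Holds the current connection and swaps it for a new one.
public sealed class ConnectionHolder
{
    private IClientConnection _current;

    public void Replace(IClientConnection next)
    {
        IClientConnection previous = _current;

        // Stop the previous connection before switching, but only if it actually
        // connected; one without a connection ID is simply dropped.
        if (previous != null && !string.IsNullOrEmpty(previous.ConnectionId))
        {
            previous.Stop();
        }

        _current = next;
    }
}

// Test double that records how often Stop was called.
public sealed class RecordingConnection : IClientConnection
{
    public RecordingConnection(string connectionId)
    {
        ConnectionId = connectionId;
    }

    public string ConnectionId { get; private set; }
    public int StopCalls { get; private set; }

    public void Stop()
    {
        StopCalls++;
    }
}
```

In a real app the interface would be replaced by the actual connection object, and whether a missing connection ID reliably indicates a failed connect is an assumption taken from the comment above.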


nirajkr commented Aug 24, 2018

@sisterray is this issue fixed? I am using 2.2.3 and still facing the same issue.


ghost commented Aug 24, 2018

There was no fix; I am using 2.2.1.

Just follow the points I made in my previous comment. If that doesn't fix it, it is probably another issue.

@aspnet-hello

This issue has been closed as part of issue clean-up as described in https://blogs.msdn.microsoft.com/webdev/2018/09/17/the-future-of-asp-net-signalr/. If you're still encountering this problem, please feel free to re-open and comment to let us know! We're still interested in hearing from you, the backlog just got a little big and we had to do a bulk clean up to get back on top of things. Thanks for your continued feedback!

@analogrelay
Contributor

It looks like there's still some activity on this thread. @nirajkr or @sisterray could one of you open a new bug and describe what you're seeing? There's a lot going on in this thread (including some slightly different but possibly related issues) so it's difficult to track down exactly which issue you're seeing :).


HelloMyDevWorld commented Jan 7, 2019

2.4.0: still the same issue, long execution time of hubConnection.Stop() in Xamarin.Forms ;/
