Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query Editor and Object Explorer connections are dropped in SQL 2019 demo scenarios #2718

Closed
kburtram opened this issue Sep 28, 2018 · 3 comments

Comments

@kburtram
Copy link
Member

  • Azure Data Studio Version: 1.0.0

Repro steps are still needed

@julielerman
Copy link

julielerman commented Sep 29, 2018

I'm finding the same connecting to sql server 2016 v 13.0.1742.0). I'm working on macOS. Connecting via sql login to sql server on my local network. When I abandon the keyboard for a short time...5 minutes or so? but I haven't done real tests .. I have lost my connection to the server. I still have a green dot on the server, but get "invalid argument" message when I try to interact with any of the nodes. Oddly, I can still use the MANAGE function of the context menu on the server. Note that if I disconnect the db server , then I can just click on it and it will reconnect and all is well again.

@kburtram
Copy link
Member Author

@julielerman thanks for confirming. We're seeing dropped connections relatively frequently from macOS, typically with SQL 2019 instances It's good to hear of a repro with in-market server versions as it helps to narrow the scope of the investigation.

@kburtram
Copy link
Member Author

Specifically for demo scenarios using a macOS client we observed an issue where sockets were being disconnected after a couple minutes of inactivity. This creates a lot of demo problems since while the presenter is speaking to a couple slides the connection would drop causing "random" failures in various features.

The underlying issue here is that on Windows a KeepAlive packet is sent every 30 seconds for SqlConnection sockets, whereas in the .Net Core SqlConnection implementation this wasn't happening. It looks like SQL Server, particularly in Azure, is pretty aggressive in reclaiming inactive TCP connections. We are shipping a private SqlClient binary for macOS with the following fix to address this issue kburtram/corefx#1.

Here are some additional RCA investigation notes.


We were able to root cause the underlying macOS dropped connection issue impacting ADS. The basic repro is on macOS do anything in ADS requiring an active database connection, then let ADS idle a few minutes, then try to use that connection. The task after idling fails due to a dropped connection. This causes failures through the product such as “#2914 Invalid argument exception expanding OE database node” and “#2933 Connection lost to Azure SQL DB”.

In summary, the issue is that KeepAlive is not being set for SqlConnection sockets on non-Windows platforms and a fix is available at “kburtram/corefx#1 Enable keep alive on SqlClient TCP sockets” and will be included in the next insider build. Please see additional details at https://github.com/dotnet/corefx/issues/32894.

The fix requires making changes to .Net Core networking PAL, the .Net framework Socket class, and the SqlClient TCP handle wrapper (see details at kburtram/corefx#1). We’ll need to determine how best to consume this patch until it can be rolled-up into an official .Net Core runtime release.

For those interested, the issue is that on macOS the recvmsg function was returning an ETIMEDOUT error code due to a TCP Retransmission Timeout occurring while trying to send a packet. This error code is converted into a SocketException at the below callstack.

{System.Net.Sockets.SocketException (60): Operation timed out
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)}System.Net.Sockets.dll!System.Net.Sockets.NetworkStream.Read(byte[] buffer, int offset, int size) (Unknown Source:0)

System.Data.SqlClient.dll!System.Data.SqlClient.SNI.SNITCPHandle.Receive(out System.Data.SqlClient.SNI.SNIPacket packet, int timeoutInMilliseconds) (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.SNI.TdsParserStateObjectManaged.ReadSyncOverAsync(int timeoutRemaining, out uint error) (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync() (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket() (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer() (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.TdsParserStateObject.TryReadByte(out byte value) (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.TdsParser.TryRun(System.Data.SqlClient.RunBehavior runBehavior, System.Data.SqlClient.SqlCommand cmdHandler, System.Data.SqlClient.SqlDataReader dataStream, System.Data.SqlClient.BulkCopySimpleResultSet bulkCopyHandler, System.Data.SqlClient.TdsParserStateObject stateObj, out bool dataReady) (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.SqlDataReader.TryConsumeMetaData() (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.SqlDataReader.MetaData.get() (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.SqlCommand.FinishExecuteReader(System.Data.SqlClient.SqlDataReader ds, System.Data.SqlClient.RunBehavior runBehavior, string resetOptionsString) (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(System.Data.CommandBehavior cmdBehavior, System.Data.SqlClient.RunBehavior runBehavior, bool returnStream, bool async, int timeout, out System.Threading.Tasks.Task task, bool asyncWrite, System.Data.SqlClient.SqlDataReader ds) (Unknown Source:0) System.Data.SqlClient.dll!System.Data.SqlClient.SqlCommand.ExecuteReader(System.Data.CommandBehavior behavior) (Unknown Source:0)
System.Data.SqlClient.dll!System.Data.SqlClient.SqlCommand.ExecuteReader() (Unknown Source:0)

We captured the network traffic for the repro scenario using WireShark on both macOS and Windows working with the same server at the same time. On macOS you can see that there is a large gap of inactivity then a packet is sent without an ACK. This triggers a series of retransmissions leading to an eventual timeout. On Windows there is a continuous stream of KeepAlive messages followed by the PSH packet which is followed by the expected ACK packet.

Windows Network Traffic (note the continuous KeepAlive packets)
image

macOS Network Traffic (note retransmission packets and large gap in timestamps)
image

@adsbot adsbot bot locked and limited conversation to collaborators Mar 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants