ConnectionMultiplexer does not reliably restart failed connections #269

Closed
ghost opened this Issue Aug 25, 2015 · 12 comments

ghost commented Aug 25, 2015

We're seeing this issue in production. We create a ConnectionMultiplexer when the application starts, but for reasons we haven't identified, the connection drops after an unpredictable period of time.

Unfortunately about half of the time the ConnectionMultiplexer fails to reconnect, and IsConnected remains false until we restart the application.

After seeing this happen I went and manually tested that redis-cli can connect from that box, and everything seems fine. I can't reproduce this on demand, so testing is a bit difficult.

Should application developers dispose of the ConnectionMultiplexer and create a new one if it's unable to reconnect, or should this be treated as a bug in CM?

cs-NET commented Aug 27, 2015

I am troubleshooting a similar problem; it's time to look through the source code and understand how reconnections are handled. I was under the impression that the client driver would keep retrying the connection indefinitely, but I also have to restart to resolve the issue.

Difficult one to debug as it only happens in production about once every 2 weeks.

shogrenfr commented Sep 2, 2015

We discovered what appears to be a similar problem. We are on build 10450.

We had to force the instantiation of a new instance of the ConnectionMultiplexer to resolve the problem.

ghost commented Sep 3, 2015

For me this happens a lot: in one of our data centres it's about 30 times a day across a bunch of boxes.

The same code runs in both DCs, so it's something network related, but since we don't own the network we can't debug that aspect further.

Edit: since I logged the issue, I went and put in some monitoring code that waits 30 seconds after a disconnect, and then creates a new multiplexer.

I noticed that the multiplexer either connects again within 1-2 seconds, or it hits the 30 second timeout and gets thrown out and recreated. So far, over a few hundred of these events, it has never failed to reconnect with the new multiplexer. We do create the new connection before disposing the old one, though.
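The monitoring approach described above might look roughly like the following. This is a minimal sketch, not the commenter's actual code: the class name `RecreatingMultiplexer`, the one-second poll interval, and the choice to poll `IsConnected` on a timer are all assumptions made for illustration. Only the 30 second grace period and the "create the new one before disposing the old one" ordering come from the comment.

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

// Hypothetical helper; not part of StackExchange.Redis.
public sealed class RecreatingMultiplexer : IDisposable
{
    // 30 seconds is taken from the comment above; the poll interval is an assumption.
    private static readonly TimeSpan GracePeriod = TimeSpan.FromSeconds(30);
    private static readonly TimeSpan PollInterval = TimeSpan.FromSeconds(1);

    private readonly string configuration;
    private readonly object swapLock = new object();
    private readonly Timer timer;
    private ConnectionMultiplexer current;
    private DateTime? disconnectedSinceUtc;
    private bool disposed;

    public RecreatingMultiplexer(string configuration)
    {
        this.configuration = configuration;
        current = ConnectionMultiplexer.Connect(configuration);
        timer = new Timer(_ => CheckConnection(), null, PollInterval, PollInterval);
    }

    // Callers should read this on every use instead of caching the instance.
    public ConnectionMultiplexer Current => Volatile.Read(ref current);

    private void CheckConnection()
    {
        // Skip this tick if the previous check is still running.
        if (!Monitor.TryEnter(swapLock)) return;
        try
        {
            if (disposed) return;
            if (current.IsConnected) { disconnectedSinceUtc = null; return; }

            var now = DateTime.UtcNow;
            if (disconnectedSinceUtc == null) { disconnectedSinceUtc = now; return; }
            if (now - disconnectedSinceUtc.Value < GracePeriod) return;

            ConnectionMultiplexer replacement;
            try
            {
                // Create the new multiplexer first...
                replacement = ConnectionMultiplexer.Connect(configuration);
            }
            catch (RedisConnectionException)
            {
                return; // Redis is still unreachable; try again on the next tick.
            }

            // ...then swap it in and dispose the old one.
            var old = Interlocked.Exchange(ref current, replacement);
            disconnectedSinceUtc = null;
            old.Dispose();
        }
        finally
        {
            Monitor.Exit(swapLock);
        }
    }

    public void Dispose()
    {
        timer.Dispose();
        lock (swapLock)
        {
            disposed = true;
            current.Dispose();
        }
    }
}
```

One trade-off in this sketch is that code still holding a reference to the old multiplexer will see it disposed immediately after the swap, so callers would need to fetch `Current` for each operation.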

levmatta commented Sep 4, 2015

I am having the same problem here. Furthermore, I have no idea why the connection is dropping (but the drops do appear in my logs).

mohammedh123 commented Sep 18, 2015

We're having this issue on our production boxes as well, seeing it maybe once a week.

TheCloudlessSky commented Dec 13, 2015

I've had a similar problem. We have an MVC app running on Windows Server 2008 R2 that connects via DNS (not by IP) to a Redis instance running on AWS ElastiCache. Usually (when I manually test it), the ConnectionMultiplexer re-connects just fine:

  1. Block outgoing ports to Redis.
  2. See errors for the running app.
  3. Unblock outgoing ports to Redis.
  4. Errors stop since the CM re-connects.

However, we've found that under unknown circumstances when a connection fails, the ConnectionMultiplexer doesn't attempt to re-resolve the DNS (until it eventually does several minutes later). I've created a hack sample to manually clear the DNS entries and it's able to re-connect the ConnectionMultiplexer! However, it wouldn't explain why any of you would experience the connection failing in the first place (could be related to #83 or #42).

NOTE: Clearing these DNS entries should probably be rate-limited somehow under high load. We're still testing this and seeing if there are better workarounds. I've heard that moving the app to Server 2012 would eliminate any DNS caching problems.

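using System;
using System.Linq;
using System.Net;
using System.Runtime.InteropServices;
using System.Web.Mvc;
using StackExchange.Redis;

// When a Redis connection failure surfaces as an exception, flush the Windows
// DNS resolver cache entry for each Redis host name so the next connection
// attempt re-resolves it.
public class ClearRedisHostNameDnsCacheOnConnectionFailureExceptionFilter : IExceptionFilter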
{
    private readonly ConnectionMultiplexer connectionMultiplexer;

    public ClearRedisHostNameDnsCacheOnConnectionFailureExceptionFilter(ConnectionMultiplexer connectionMultiplexer)
    {
        this.connectionMultiplexer = connectionMultiplexer;
    }

    public void OnException(ExceptionContext filterContext)
    {
        var redisConnectionException = GetRedisConnectionException(filterContext.Exception);
        if (redisConnectionException != null && IsConnectionFailure(redisConnectionException))
        {
            foreach (var dnsEndPoint in connectionMultiplexer.GetEndPoints().OfType<DnsEndPoint>())
            {
                DnsFlushResolverCacheEntry(dnsEndPoint.Host);
            }
        }
    }

    private static RedisConnectionException GetRedisConnectionException(Exception exception)
    {
        if (exception == null) return null;
        var redisConnectionException = exception as RedisConnectionException;
        if (redisConnectionException == null)
        {
            return GetRedisConnectionException(exception.InnerException);
        }
        else
        {
            return redisConnectionException;
        }
    }

    private static bool IsConnectionFailure(RedisConnectionException exception)
    {
        switch (exception.FailureType)
        {
            case ConnectionFailureType.AuthenticationFailure:
            case ConnectionFailureType.Loading:
            case ConnectionFailureType.None:
                return false;
            default:
                return true;
        }
    }

    [DllImport("dnsapi.dll", EntryPoint = "DnsFlushResolverCacheEntry_W", CharSet = CharSet.Unicode, SetLastError = true, ExactSpelling = true)]
    private static extern int DnsFlushResolverCacheEntry(string hostName);
}

We do this in the global Application_Error (to handle errors outside of MVC) and hook it into the exception filters of the MVC pipeline.
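The note above about rate-limiting the DNS flush could be handled with a small gate that lets the flush run at most once per interval. This is only a sketch under stated assumptions: `MinimumIntervalGate` and its `TryEnter` method are hypothetical names that do not appear in the original filter, and the interval is left to the caller.

```csharp
using System;
using System.Threading;

// Hypothetical helper: allows an action to proceed at most once per interval.
public sealed class MinimumIntervalGate
{
    private readonly long intervalTicks;
    private readonly Func<DateTime> utcNow;
    private long lastRunTicks; // 0 means the gate has never been entered

    // The clock is injectable so the gate can be exercised without waiting.
    public MinimumIntervalGate(TimeSpan interval, Func<DateTime> utcNow = null)
    {
        intervalTicks = interval.Ticks;
        this.utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    // Returns true if the caller may proceed, false if it should skip the action.
    public bool TryEnter()
    {
        long now = utcNow().Ticks;
        long last = Interlocked.Read(ref lastRunTicks);
        if (last != 0 && now - last < intervalTicks) return false;

        // If several threads race past the check, only one wins the exchange.
        return Interlocked.CompareExchange(ref lastRunTicks, now, last) == last;
    }
}
```

In the filter, a static gate (for example one created with `TimeSpan.FromSeconds(10)`, an arbitrary value) could wrap the `foreach` loop in `OnException`, so that a burst of failing requests under load triggers a single flush rather than one per exception.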

renaatd commented Jan 29, 2016

We seem to have the same problem using StackExchange.Redis 1.0.488. We had 2 network glitches, causing approx 50 apps using StackExchange.Redis to throw exceptions. After the glitch, one app didn't recover. It keeps throwing exceptions. Sample exception:

Timeout performing GET xxx, inst: 4, mgr: RecordConnectionFailed_FailOutstanding, err: never, queue: 92820, qu: 0, qs: 92820, qc: 0, wr: 0, wq: 0, in: 8192, ar: 0, IOCP: (Busy=26,Free=974,Min=4,Max=1000), WORKER: (Busy=5,Free=32762,Min=4,Max=32767), clientName: xxx

All other apps continue working without problems. In the exceptions, the numbers queue and qs keep increasing. IOCP busy goes up and down. Memory consumption keeps increasing, and the memory dump contains a lot of StackExchange.Redis.Message+CommandKeyMessage objects. Still analyzing the dump to understand what's going on.

renaatd commented Feb 2, 2016

Some extra information, from the app having connection problems. The PhysicalBridge of the interactive connection to the slave is OK (state = 2, ConnectedEstablished), but the interactive connection to the master is in state 1 (ConnectedEstablishing). failConnectCount = 0x15F, socketCount = 0x163. We were able to connect to both redis-servers from the same machine without any problems (latency < 1 ms).

hahaxj commented Mar 22, 2016

Our web servers are also going down because of this error. Every time we restart IIS, the error goes away, but after several days it comes back.

NickCraver added the timeout label Sep 2, 2017

Member

NickCraver commented May 28, 2018

Is anyone still seeing this in 1.2.6?

vincec-msft commented May 28, 2018

Yes. :( We recently implemented the reconnect pattern recommended here and that's worked pretty well.

Contributor

mgravell commented Jun 28, 2018

Closing; please see #871

mgravell closed this Jun 28, 2018
