RedisCommandException SETEX Errors on Azure Redis Cluster #1504

Closed
tonycoelho opened this issue Jun 18, 2020 · 3 comments


@tonycoelho

StackExchange.Redis version 2.1.30

This issue is related to #1172: SETEX errors occur when the client tries to write to a replica/slave while the master node is failing over on Azure Redis with clustering enabled. The steps to reproduce are as follows:

  1. Enable clustering on an Azure Redis instance configured with 2 shards
  2. Start an application/script that writes to cache, reads from cache, and deletes from cache on a recurring loop
  3. Reboot the master node of one of the clustered shards, e.g. Shard0

This results in the following transient exception:

message=The PutItemAsync operation failed. Attempts: 6 Duration: 00:00:00.0018099 Exception: InternalFailure on SETEX RedisCacheCheckKeyd206b88c-1760-4e4b-a719-7aebcf7b41b3.
exception=StackExchange.Redis.RedisConnectionException: InternalFailure on SETEX RedisCacheCheckKeyd206b88c-1760-4e4b-a719-7aebcf7b41b3
 ---> StackExchange.Redis.RedisCommandException: Command cannot be issued to a slave: SETEX RedisCacheCheckKeyd206b88c-1760-4e4b-a719-7aebcf7b41b3
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1303

In this specific case, the application wraps the write operation in a retry policy and retries on any exception. The write ultimately failed after 6 attempts. Once the reboot of the master node on Shard0 completed, all operations succeeded again.
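
If the Duration value in the message above covers all six attempts, the retries fired within roughly 1.8 ms, far too quickly to outlast a node reboot. The following is a minimal sketch, not the application's actual policy, of a retry wrapper that spaces attempts out exponentially. The function name `retry_with_backoff` and all of its parameters and defaults are hypothetical and chosen only for illustration; it is written in Python to stay client-agnostic.

```python
import time


def retry_with_backoff(operation, attempts=6, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    Hypothetical helper: waits base_delay, 2 * base_delay, 4 * base_delay, ...
    (capped at max_delay) between attempts. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

With these illustrative defaults the waits between six attempts add up to 15.5 seconds. Whether any retry window helps depends on the failover finishing, or the client noticing the new topology, before the attempts run out.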

Per #1172, this was supposedly fixed in release 2.1.30, but we are still seeing this issue. Any guidance on how best to handle this error would be appreciated.

@NickCraver
Collaborator

I'm not entirely sure what's being asked here - is the expectation that no errors occur while a failover executes? Or some interim state? Or is the issue that the target node was not a replica during the reboot, and was promoted to primary? I'm not sure what the in-between state is in Azure's setup.

@tonycoelho
Author

Hey @NickCraver, the question is: when the master node is down on a shard in a clustered configuration, why do writes to the slave/replica fail, and why doesn't the client try a master node on a different shard in the cluster for better resiliency? I'm trying to understand what we can do to make the system more resilient in cases like this, i.e. when a node is failing or being patched in Azure. A resiliency pattern like retry with exponential backoff doesn't help, because writes continue to fail until the master node is healthy again.
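
For context on the question above: in Redis Cluster each key maps to exactly one of 16384 hash slots, and each slot is owned by a single shard, so a write for a given key cannot simply be sent to a master on another shard. The following is a minimal sketch of that key-to-slot mapping as described in the Redis Cluster specification (CRC16/XMODEM of the key, or of its `{hash tag}` if present, modulo 16384). It is written in Python for illustration only, the function name `key_hash_slot` is hypothetical, and it is not StackExchange.Redis code.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in a Redis Cluster


def key_hash_slot(key: str) -> int:
    """Return the cluster hash slot for a key (illustrative helper)."""
    data = key.encode("utf-8")
    # Hash-tag rule: if the key contains "{...}" with at least one byte
    # between the first "{" and the next "}", only that part is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is CRC16/XMODEM
    # (polynomial 0x1021), the variant the cluster specification uses.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT
```

Because the slot is a pure function of the key, every client computes the same owner shard for a key such as `RedisCacheCheckKey...`, and while that shard has no reachable master, writes to that key have nowhere else to go.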

@NickCraver
Collaborator

@tonycoelho There was a lot of digging here and I didn't update this issue, but one change we made was to proactively recognize topology changes in Azure when maintenance events happen. This was added in #1876, which should dramatically improve how quickly the client recognizes such changes.

Happy to reopen if this is still an issue, but overall: grab the latest client and it'll have the changes from #1876 to better handle this!
