
redis-cluster attempting to connect to a stale redis server IP and not recovering #133

Closed
rahulKrishnaM opened this issue Jan 16, 2023 · 29 comments


@rahulKrishnaM

Hi @bjosv,

We came across a case where hiredis-cluster seems to be trying to connect to a stale redis-server IP while issuing commands to the redis server. This connection flapping kept happening (approx. 2 hrs) until one more redeploy was done of the application image (which has hiredis-cluster integrated).
From the server-side logs, it looks like this IP was alive for some period of time in the past, but not for at least the previous couple of hours when the issue was seen. There was an image upgrade done on the redis-server nodes, which could have resulted in a change of IPs.

As part of the callbacks registered by the application with hiredis, the prints below were seen during the redis command send calls.

[2023-01-12T07:47:30.263102][disconnectCallback]Error: Server closed the connection

...connectCallback print for 192.168.228.26

[2023-01-12T07:48:26.189608][disconnectCallback]Error: Timeout

...connectCallback print for 192.168.228.26

[2023-01-12T07:48:28.191117][disconnectCallback]Error: Timeout

...connectCallback print for 192.168.228.26

[2023-01-12T07:48:34.186333][disconnectCallback]Error: Connection reset by peer

...connectCallback print for 192.168.228.26

[2023-01-12T09:12:53.184634][disconnectCallback]Disconnected from 192.168.228.26:6379

kubectl get pods output was collected when the issue was seen; the list of IPs at that point is captured below: 3 primary nodes with 2 replicas per primary (9 nodes in total). The IP that hiredis-cluster keeps trying to connect to, 192.168.228.26, does not appear here.

NAME                                READY   STATUS    RESTARTS   AGE    IP                NODE                                    NOMINATED NODE   READINESS GATES
rediscluster-node-for-redis-9cj4g   2/2     Running   0          169m   192.168.229.229   cl-60-node-5-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-9vvzf   2/2     Running   0          165m   192.168.228.4     cl-60-node-9-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-h5lw6   2/2     Running   0          168m   192.168.100.63    cl-60-node-4-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-mx8gs   2/2     Running   0          168m   192.168.149.14    cl-60-node-1-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-qwrqg   2/2     Running   0          168m   192.168.33.99     cl-60-node-6-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-slbj7   2/2     Running   0          166m   192.168.164.92    cl-60-node-3-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-t92lx   2/2     Running   0          169m   192.168.100.3     cl-60-node-4-dcc808-4b2c09f7b029c460    <none>           <none>
rediscluster-node-for-redis-wgbdq   2/2     Running   0          166m   192.168.41.246    cl-60-node-10-dcc808-4b2c09f7b029c460   <none>           <none>
rediscluster-node-for-redis-z5dr7   2/2     Running   0          169m   192.168.124.200   cl-60-node-2-dcc808-4b2c09f7b029c460    <none>           <none>

Do you have some clue on what could be causing this behaviour?

@rahulKrishnaM
Author

Hi, can someone please take a look at this one? We are a bit blocked on this currently.

@bjosv
Collaborator

bjosv commented Jan 17, 2023

That sounds strange.
I guess you have checked the individual nodes by running the command CLUSTER NODES (or CLUSTER SLOTS) on each of them to make sure they are all in sync.
Do you set the max retry config to a specific value (using redisClusterSetOptionMaxRetry()), or is it the default? (In the async API the default means 5 errors before a timer is started to fetch the topology.)
Do you have/see any logs about failed command callbacks, which would then trigger a new topology fetch?

@rahulKrishnaM
Author

rahulKrishnaM commented Jan 17, 2023

Hi @bjosv

Yes, the CLUSTER NODES command was run separately on the redis server when the issue happened, and the output (captured below) matches the kubectl output.

>>> Calling cluster nodes
redis-cluster.local:6379: 51c49c0ba2f1c33456dafc1654e4969c8ab25416 192.168.100.3:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519200113 29 connected
6fde1a7b2f6ed7355066534e8a1656f0f9a0e988 192.168.164.92:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198598 35 connected
8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 192.168.228.4:6379@16379 master - 0 1673519198000 35 connected 10924-16383
7061695cb989db63cf08cecf5dcdcee7293a7c4e 192.168.33.99:6379@16379 myself,master - 0 1673519195000 34 connected 2-5463
3cb872f1153c39f9e1b5ce3f1ddf2136a2558bba 192.168.149.14:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519196088 34 connected
6d68da34389cb97e70e4ee8075108f6394847446 192.168.229.229:6379@16379 master - 0 1673519199100 29 connected 0-1 5464-10923
b138088411b13d8db3b855ca454d3f7881065d39 192.168.100.63:6379@16379 slave 7061695cb989db63cf08cecf5dcdcee7293a7c4e 0 1673519197089 34 connected
5e99ed9c02ae3daecabbef44d05563f502efc6eb 192.168.124.200:6379@16379 slave 6d68da34389cb97e70e4ee8075108f6394847446 0 1673519195000 29 connected
b2c8cdd82c52c27ecda5c44a121e21ebb5d2406a 192.168.41.246:6379@16379 slave 8035f1c9c0a7ade4a303381e492ff2b85dbe5f96 0 1673519198094 35 connected

We don't set redisClusterSetOptionMaxRetry() explicitly, so I guess the default value should apply.

Also, I couldn't find any logs suggesting that a CLUSTER NODES command was run, and I suppose if it had been run, the issue would have self-resolved.

One more point to add: when hiredis calls back into the application on a TCP disconnect, the application doesn't do anything with the redisAsyncContext* pointer that is passed back (I mean, no delete of the object or anything). I hope that is fine.

@rahulKrishnaM
Author

rahulKrishnaM commented Jan 17, 2023

Detailed log set from the beginning:

Log file top entries

Beginning of connection with redis:

{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584874][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:38:41.584888][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:45:45.904981][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:45:45.905107][connectCallback][114]Connected to 192.168.33.99:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:47:30.263102][disconnectCallback][149]Error: Server closed the connection
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185349][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:24.185751][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:26.186998][CommandCb][1480]Redis command timeout
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:26.189608][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190408][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:26.190434][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:28.191117][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200001][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:32.200195][connectCallback][114]Connected to 192.168.228.26:6379
"}
{"severity":"debug","debug-string":"[2023-01-12T07:48:34.185965][CommandCb][1480] Redis command timeout
"}

{"severity":"debug","debug-string":"[2023-01-12T07:48:34.186333][disconnectCallback][149]Error: Connection reset by peer
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672284][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:39:34.672358][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T08:39:36.675527][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596753][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T08:53:21.596809][connectCallback][114]Connected to 192.168.228.26:6379
"}

{"severity":"debug","debug-string":"[2023-01-12T08:53:23.599153][disconnectCallback][149]Error: Timeout
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.937986][connectCallback][98] connectCallback invoked
"}
{"severity":"debug","debug-string":"[2023-01-12T09:11:27.938091][connectCallback][114]Connected to 192.168.228.26:6379
"}

Once a connection is established, application commands are triggered. We don't have prints showing when in this sequence they are triggered. There are application commands timing out in between (CommandCb); the other logs are all connect/disconnect related.

@rahulKrishnaM
Author

Hi @bjosv, do you have any thoughts on this?
One other thing I wanted to understand: while the client is connected to the redis server and there is a slot redistribution, how does hiredis-cluster get to know about it? I guess the TCP connection won't be broken from the server side for a slot redistribution, right?

@rahulKrishnaM
Author

rahulKrishnaM commented Jan 17, 2023

One other thing I saw while going through the code: the max_retry_count check before making cluster command calls is done in the following methods.

__redisClusterGetReplyFromNode()
redis_cluster_command_execute()
redisClusterAsyncRetryCallback()

But in the scenario described in this ticket, I don't think we go through any of these methods. So it looks like max_retry_count never gets compared against, which could explain why we remain stuck in the same state.

@bjosv
Collaborator

bjosv commented Jan 17, 2023

I only see 2 CommandCb logs, which, if they are errors, would increase the internal failure counter.
The async legacy code expects 5 send errors (the default) before it attempts to renew the slot distribution, so that might be the reason for the logs.
One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error.

Normally a query from a client would get a MOVED response, which triggers the renewal of hiredis-cluster's slot distribution. In this case, in the async code, 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.

@bjosv
Collaborator

bjosv commented Jan 17, 2023

This is legacy behavior and a bit strange; it would be nice to have it changed.
A disconnect is not always an indication of a change in the slot distribution, but if the slot distribution hasn't been updated for a while, maybe hiredis-cluster should perform a check/update in this case instead.

@rahulKrishnaM
Author

rahulKrishnaM commented Jan 17, 2023

I only see 2 CommandCb logs, which, if they are errors, would increase the internal failure counter. The async legacy code expects 5 send errors (the default) before it attempts to renew the slot distribution, so that might be the reason for the logs. One remedy could be to use redisClusterSetOptionMaxRetry(1) to only allow one retry after a MOVED/ASK/error.

Normally a query from a client would get a MOVED response, which triggers the renewal of hiredis-cluster's slot distribution. In this case, in the async code, 5 send errors are allowed before the renewal. Each time the client tries to send the command it first attempts to connect to the redis node.

Yeah, there are more than 5; I just pasted logs from the top of the logfile. If you see my initial post, it goes on for >2 hrs. But one thing to note here: it's not actually a response coming from the server that the application is logging. We have configured a command_timeout of 2 sec, and since no response is seen within that period, a timeout happens, which the application logs with the CommandCb tag.

And as I mentioned above, the places where we check max_retry_count are mostly skipped, since we never go through one of those flows.

Edit: There is no response from the IP because that IP (192.168.228.26) is no longer valid. So obviously all the commands will eventually see a timeout.

@rahulKrishnaM
Author

Hi @bjosv, do you agree with the above observation?
If so, is there a way to introduce the max_retry_count check in the flow it's taking now?

@rahulKrishnaM
Author

One other thing I saw while going through the code: the max_retry_count check before making cluster command calls is done in the following methods.

__redisClusterGetReplyFromNode()
redis_cluster_command_execute()
redisClusterAsyncRetryCallback()

But in the scenario described in this ticket, I don't think we go through any of these methods. So it looks like max_retry_count never gets compared against, which could explain why we remain stuck in the same state.

Hi @bjosv, correcting my statement above:

redisClusterAsyncRetryCallback() is the callback we pass to hiredis, so it should get called back for each command that is attempted.
So whenever the hiredis API redisAsyncHandleTimeout() gets called on a command_timeout, hiredis sets the error to REDIS_ERR_TIMEOUT and calls back into hiredis-cluster with a NULL reply. As part of this, max_retry_count should have been incremented. So I tend to agree with your statement.

But the issue here is that if traffic is low and the key/slot never falls on the problem node, hiredis-cluster can just continue in the bad state for a very long time (as seen here: approx. 2 hrs, 3 callback errors) with the default value of 5.

@rahulKrishnaM
Author

One more query in the same context: I was also wondering why a connect() call would succeed in such a scenario.

If the remote IP is not reachable, shouldn't the connect() itself fail? The fact that the application is able to print connectCallback-tagged messages indicates that the call succeeded with REDIS_OK. Do you have any thoughts on this @bjosv?

@bjosv
Collaborator

bjosv commented Jan 18, 2023

But the issue here is that if traffic is low and the key/slot never falls on the problem node, hiredis-cluster can just continue in the bad state for a very long time (as seen here: approx. 2 hrs, 3 callback errors) with the default value of 5.

Yes, I believe it was done this way to handle heavy-traffic scenarios, giving an imperfect behavior for low-traffic scenarios.

I was also wondering why a connect() call would succeed in such a scenario?

Since the async API is used, the connect will only register with the event system, which normally won't fail.
When the event system then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.
Do you get a REDIS_OK in your case?

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Ah, yes, I see now that you got calls to the disconnect callback, which should only be received when there has been a successful connect. Probably from 192.168.228.26.

Do you have authentication enabled on Redis, using password or username/password?
How about TLS?

@rahulKrishnaM
Author

Hi @bjosv

Do you get a REDIS_OK in your case?

Yes, I keep getting REDIS_OK in the connectCallbacks.

When the event system then polls an error from the socket, due to the failing connect, hiredis will/should call the connect callback with the argument status=REDIS_ERR.

Isn't connect() supposed to be a blocking call? I assume you are talking about redisAsyncConnectWithOptions(), which then calls redisContextConnectBindTcp(), where I can see that after the connect() call it checks the status of connect().

Ah, yes, I see now that you got calls to the disconnect callback, which should only be received when there has been a successful connect. Probably from 192.168.228.26.

But the IP is no longer present in the cluster. I was thinking there is some issue here due to which it's wrongly claiming it's connecting.

Do you have authentication enabled on Redis, using password or username/password?
How about TLS?

There is no password authentication or TLS used for now.

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Isn't connect() supposed to be a blocking call? I assume you are talking about redisAsyncConnectWithOptions(), which then calls redisContextConnectBindTcp(), where I can see that after the connect() call it checks the status of connect().

I would guess the status check is skipped for async.
redisAsyncConnectWithOptions() sets REDIS_OPT_NONBLOCK, which results in REDIS_BLOCK not being set on the hiredis context.
It could be that the connect fails with errno=EINPROGRESS, which is not uncommon.

hiredis net.c

static int _redisContextConnectTcp(redisContext *c, const char *addr, int port,
...
    int blocking = (c->flags & REDIS_BLOCK);
...
        if (connect(s,p->ai_addr,p->ai_addrlen) == -1) {
            if (errno == EHOSTUNREACH) {
                redisNetClose(c);
                continue;
            } else if (errno == EINPROGRESS) {
                if (blocking) {
                    goto wait_for_ready;                <--- skipped
                }

I think this can be verified in the hiredis-cluster async tests; I will double-check.

@rahulKrishnaM
Author

rahulKrishnaM commented Jan 18, 2023

Hmm. Yes, I guess that definitely looks like a possible gap.

However, checking further in that flow, I see that once we come out of that else-if block, we also set the REDIS_CONNECTED flag:

c->flags |= REDIS_CONNECTED;

And when I searched for places where hiredis could call the onConnect() callback with success, only one search result came up,
which is inside __redisAsyncHandleConnect():

https://github.com/redis/hiredis/blob/7583ebb1b271e7cb8e44f8211d02bd719d4a4bd1/async.c#L672

redisCheckConnectDone() internally does a connect() on the socket, and only if the return code is 0 should it invoke line 627. Will connect() continue to be in non-blocking mode here?

Also, __redisAsyncHandleConnect() is called only if the flags don't have REDIS_CONNECTED set:

653     if (!(c->flags & REDIS_CONNECTED)) {
654         /* Abort connect was not successful. */
655         if (__redisAsyncHandleConnect(ac) != REDIS_OK)
656             return;

But if the errno == EINPROGRESS block had been hit, the flag would already have been set, and we would never invoke the onConnect() callback.

I'm now wondering: is __redisAsyncHandleConnect() at fault here?

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Just as a note, there is this in redisAsyncInitialize(), which is called just after redisConnectWithOptions() in hiredis's redisAsyncConnectWithOptions():
https://github.com/redis/hiredis/blob/6f5bae8c6900e051da6e677756508707565ce56e/async.c#L124-L127

@rahulKrishnaM
Author

Yeah, I suppose I missed noticing this. :)

@rahulKrishnaM
Author

OK, still trying to understand where the REDIS_OK connectCallback would be called from.

I guess when there is a redisAsyncHandleWrite() or redisAsyncHandleRead() operation, it will invoke __redisAsyncHandleConnect(), which in turn calls connect(), and connect should return 0? That is the only place it could invoke the connect callback with REDIS_OK, right?

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Exactly what I think: __redisAsyncHandleConnect() should return REDIS_OK with completed=1; that is the only way to set the REDIS_CONNECTED flag.
So somehow it managed that.
Could it be that the K8s pod still exists with the redis instance, but it's not included in the redis cluster?
Update: OK, you had a printout of kubectl get pods without it. Yes, strange.

@rahulKrishnaM
Author

Yeah, this flapping went on for at least 2 hrs, so it's very unlikely that could be the case.
Just thinking if there is any chance connect() can return 0 for a non-success case.

@rahulKrishnaM
Author

One more thing I was thinking: since the IP no longer exists, wouldn't the number of connectCallback prints translate to the number of commands attempted towards the node? I can see it's definitely more than 5.

@bjosv
Collaborator

bjosv commented Jan 18, 2023

There should be a connectCallback (connect attempt) for each CommandCb that timed out towards IP x.26.
Do you see more than 5 timed-out CommandCb's?

@rahulKrishnaM
Author

Yes, in the logfile I am looking at, I can see that the application Timeout error occurred 7 times in this time interval.

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Just one other thing that I thought of:
You ran the CLUSTER NODES command manually to check the redis cluster, i.e. the log above with >>> Calling cluster nodes. Did you run this command on each node to check each node's view of the cluster?

This would just be making sure they are all in sync, since hiredis-cluster can get its slot distribution from any master node, which might not be the one you send the command to.
Probably unlikely to be the problem, though. But it depends on how your K8s Redis operator removes a node (whether it sends CLUSTER FORGET to all other nodes).

@bjosv
Collaborator

bjosv commented Jan 18, 2023

Which operator are you using, btw?
KubeDB, AmadeusITGroup/Redis-Operator, ... or your own?

@rahulKrishnaM
Author

Hi @bjosv

The CLUSTER NODES command is run as below, against the redis-cluster Kubernetes service FQDN:

redis-cli --cluster call redis-cluster.local:6379 cluster nodes

It is not run on all nodes separately. OK, I see you think the operator could also be involved in this. We are using the IBM redis-operator for this.

@bjosv
Collaborator

bjosv commented Nov 14, 2023

The root problem, that a timeout was not triggering a slotmap update, is corrected by #144.
The reason hiredis is able to connect to a cluster node that shouldn't be available looks like a Redis operator issue.
Please reopen with more info if that problem can be blamed on hiredis-cluster.

@bjosv bjosv closed this as completed Nov 14, 2023