
Time taken to return to application from hiredis library seems to be too long (>60sec) #142

Closed
rahulKrishnaM opened this issue Feb 16, 2023 · 10 comments · Fixed by #144

Comments

@rahulKrishnaM

Hi,

Encountered a case where processing appears to get stuck inside the hiredis-cluster library and does not return to the application for more than 60 seconds. We have a mechanism in place that checks whether a thread has gone too long without reporting a heartbeat; that check is missed in this case and results in an exception being thrown.
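For context, a minimal sketch of the kind of heartbeat watchdog being described, assuming one worker thread that reports a heartbeat between units of work and a monitor thread that checks it; the names and the 60 s threshold are illustrative, not the actual application code:

```c
#include <time.h>

#define HEARTBEAT_LIMIT_SEC 60          /* illustrative threshold (~60 s observed) */

static volatile time_t last_heartbeat;  /* updated by the worker thread */

/* The worker calls this between units of work. If it gets stuck inside a
 * blocking library call, the timestamp stops advancing. */
static void report_heartbeat(void) { last_heartbeat = time(NULL); }

/* The monitor thread polls this periodically; a stale heartbeat leads to the
 * exception mentioned above. */
static int heartbeat_missed(void) {
    return (time(NULL) - last_heartbeat) > HEARTBEAT_LIMIT_SEC;
}
```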

This is the backtrace from the flow:

<heartbeat missing for a long time, resulting in a signal/crash in the frames above>
#7  0x00007eff1914d90f in poll () from /lib64/libc.so.6
#8  0x00007eff1ab5a2ba in redisContextWaitReady (c=c@entry=0x26bac60, msec=msec@entry=2000) at net.c:250
#9  0x00007eff1ab5a991 in _redisContextConnectTcp (c=c@entry=0x26bac60, addr=<optimized out>, port=<optimized out>, timeout=<optimized out>,  source_addr=<optimized out>) at net.c:504
#10 0x00007eff1ab5aaa5 in redisContextConnectBindTcp (c=c@entry=0x26bac60, addr=<optimized out>, port=<optimized out>, timeout=<optimized out>, source_addr=<optimized out>) at net.c:544
#11 0x00007eff1ab592d7 in redisConnectWithOptions (options=options@entry=0x7ffc201f96e0) at hiredis.c:818
#12 0x00007eff1ab406b4 in ctx_get_by_node (cc=cc@entry=0x2859000, node=node@entry=0x2b085f0) at hircluster.c:1993
#13 0x00007eff1ab409b1 in node_get_which_connected (cc=cc@entry=0x2859000) at hircluster.c:2052
#14 0x00007eff1ab4990a in cluster_config_get (config_name=0x7eff1ab4a7ad "cluster-node-timeout", config_value_len=<synthetic pointer>, cc=0x2859000) at hircluster.c:2072
#15 redisClusterAsyncRetryCallback (ac=0x2821dc0, r=0x0, privdata=0x2b1e270) at hircluster.c:3926
#16 0x00007eff1ab55e09 in __redisRunCallback (reply=0x0, cb=0x7ffc201f9820, ac=0x2821dc0) at async.c:288
#17 __redisAsyncFree (ac=0x2821dc0) at async.c:310
#18 0x00007eff1ab566b9 in __redisAsyncDisconnect (ac=<optimized out>) at async.c:402
#19 0x00007eff1ab56765 in __redisAsyncHandleConnectFailure (ac=0x2821dc0) at async.c:604
#20 __redisAsyncHandleConnect (ac=ac@entry=0x2821dc0) at async.c:617
#21 0x00007eff1ab56e5e in redisAsyncHandleRead (ac=0x2821dc0) at hiredis/async.c:655
#22 0x00007eff1960f243 in AdapterEvents::redisLibmyeventHandler (this=0x26bb8c0, fd=20, events=29, info=...) at hiredis_libmyevent.h:89
<application attempting a socket read in the frames below>

From what I could gather from the backtrace and frames, this is the flow of events:

  1. The application attempts a read as part of an epoll event seen on the socket.
  2. As part of the redisAsyncHandleRead() operation, hiredis tries to connect() to a specific Redis server node via __redisAsyncHandleConnect().
  3. connect() fails with the error "Connection timeout", which triggers a __redisAsyncFree() operation that cleans up all outstanding commands (for which no response has been received) to that particular Redis server node.
  4. There are around 128 outstanding commands in the obuf of this particular Redis server node at this point.
  5. hiredis runs a while loop over all these outstanding commands and invokes their callbacks with an empty reply into hiredis-cluster/the application.
  6. hiredis-cluster, processing these callbacks in redisClusterAsyncRetryCallback(), keeps track of failed responses; once it sees 5 failures it decides to relearn the topology after the timeout ("cluster-node-timeout") configured at the server.
  7. To fetch the "cluster-node-timeout" value from the Redis server, hiredis-cluster frames a new command and tries to connect to one of the previously discovered nodes (decoded from the frames, listed below).
    192.168.65.161, 192.168.121.79, 192.168.235.38, 192.168.221.225, 192.168.2.202, 192.168.165.198, 192.168.121.12, 192.168.165.182
  8. hiredis-cluster follows the above order when attempting connections to the Redis server IPs, and in doing so (redisConnectWithOptions()) it performs a blocking connect() call to each server IP.
  9. At the time the heartbeat timeout occurred, hiredis-cluster was attempting to connect to the 6th IP (192.168.165.198). So the previous 5 attempts had failed to connect, which could account for the long delay since the calls are all blocking (see the rough sketch after this list).
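
To make steps 8-9 concrete, here is a rough sketch (not the actual hircluster.c code) of a blocking walk over the node list in which each node is connected with redisConnectWithOptions(); the node array, port, and helper name are made up for illustration:

```c
#include <hiredis/hiredis.h>

/* Hypothetical list of previously discovered nodes (see step 7). */
static const char *nodes[] = { "192.168.65.161", "192.168.121.79" /* , ... */ };

redisContext *first_connected_node(size_t n_nodes, int port) {
    struct timeval connect_timeout = { 2, 0 };   /* 2 s, the connect timeout mentioned later in the thread */

    for (size_t i = 0; i < n_nodes; i++) {
        redisOptions opt = {0};
        REDIS_OPTIONS_SET_TCP(&opt, nodes[i], port);
        opt.connect_timeout = &connect_timeout;

        /* Blocking call: for an unreachable node this only returns once the
         * connect timeout expires, so k failing nodes cost roughly
         * k * connect_timeout -- all spent inside the event callback. */
        redisContext *c = redisConnectWithOptions(&opt);
        if (c != NULL && c->err == 0)
            return c;
        redisFree(c);   /* redisFree() is NULL-safe */
    }
    return NULL;
}
```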

I assume the analysis above holds. If so, is there a way to improve this, or a suggestion you could offer? Do you see another way of doing this so that processing doesn't get stuck inside hiredis-cluster?

@bjosv

@bjosv
Collaborator

bjosv commented Feb 16, 2023

This relates to #19 and the conditions haven't changed since then.
I guess you have a pretty low connect timeout already?

@rahulKrishnaM
Author

Yes, the connect timeout is set to 2 seconds currently.

@rahulKrishnaM
Author

So a blocking connect() call will effectively see a maximum wait time of the connect_timeout value, right?
And if a poll() follows the connect(), I guess there too we would see at most connect_timeout of wait time. So, per Redis server IP there could be a 2*connect_timeout wait in an error case.
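
As a rough worked estimate under that assumption (2*connect_timeout per unreachable node, with connect_timeout = 2 s): each failing node could cost up to 4 s, so the 8 candidate IPs listed earlier could account for roughly 32 s of blocking for a single config-get attempt, and more if that pattern repeats while the 128 queued callbacks are drained. The exact bound depends on how hiredis actually splits the timeout between connect() and poll().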

@bjosv
Collaborator

bjosv commented Feb 16, 2023

I'm not following you regarding the poll() that follows? The poll() in hiredis is the one that handles the timeout, I thought.

I guess the number of outstanding callbacks adds on time, depending on max-retry-count.
If max-retry-count is higher, then fewer callbacks trigger a config-get, so that might be a remedy (a rough sketch of those knobs follows below).
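
For reference, a minimal sketch of what tuning those knobs could look like on the cluster context; this assumes the option setters redisClusterSetOptionConnectTimeout and redisClusterSetOptionMaxRetry as found in hircluster.h (double-check the exact names and signatures against the version you build with), and all values are illustrative:

```c
#include "hircluster.h"   /* include path may differ depending on install layout */

int setup_cluster_context(void) {
    redisClusterContext *cc = redisClusterContextInit();
    if (cc == NULL)
        return -1;

    /* Seed node(s); address is illustrative. */
    redisClusterSetOptionAddNodes(cc, "192.168.65.161:6379");

    /* Keep the blocking connect window short so one dead node cannot stall
     * the caller for long. */
    struct timeval connect_timeout = { 0, 500000 };   /* 500 ms */
    redisClusterSetOptionConnectTimeout(cc, connect_timeout);

    /* A higher max-retry-count means fewer failed callbacks end up in the
     * config-get path discussed above (assumed setter name). */
    redisClusterSetOptionMaxRetry(cc, 10);

    if (redisClusterConnect2(cc) != REDIS_OK) {
        /* cc->errstr describes the failure */
        redisClusterFree(cc);
        return -1;
    }

    redisClusterFree(cc);
    return 0;
}
```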

@rahulKrishnaM
Author

rahulKrishnaM commented Feb 16, 2023

I'm not following you regarding the poll() that follows? The poll() in hiredis is the one that handles the timeout, I thought.

What I meant was this piece of code, which follows a failed connect() call:
https://github.com/redis/hiredis/blob/master/net.c#L266
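
For readers following along: the linked code is the timeout-bounded wait. A simplified, generic sketch of that non-blocking connect + poll pattern (not the actual hiredis source) looks roughly like this:

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Wait for a pending non-blocking connect() on fd to complete, bounded by
 * timeout_msec (the connect timeout). Returns 0 on success, -1 on error. */
static int wait_connected(int fd, int timeout_msec) {
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    /* poll() blocks for at most timeout_msec, like frames #7/#8 above. */
    int rc = poll(&pfd, 1, timeout_msec);
    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0)
        return -1;

    /* Check whether the pending connect() actually succeeded. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return -1;
    return 0;
}
```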

@rahulKrishnaM
Author

Hope I made this clear @bjosv

@bjosv
Collaborator

bjosv commented Feb 19, 2023

Update:
I'm currently trying out a change that uses async reconnection attempts.
Hopefully this will fix the problem, but I need to test it a bit more.

@rahulKrishnaM
Author

Thanks for the update @bjosv. Making the reconnection attempts async would really help reduce this wait period. I'll wait for the changes and then retry the scenario.

@bjosv
Collaborator

bjosv commented Mar 14, 2023

A change covering this issue has now been delivered; hope it fixes the problems in your setup as well.

@rahulKrishnaM
Author

Nice! Thanks @bjosv, I'll try pulling this in and see how it goes.
