Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with commands not timing out when the server is unresponsive and rate of outgoing traffic is high #1225

Open
rahulKrishnaM opened this issue Sep 10, 2023 · 5 comments
Assignees

Comments

@rahulKrishnaM
Copy link

rahulKrishnaM commented Sep 10, 2023

Hi,

We are using hiredis-cluster library to interact with a redis cluster. Recently, we came across an issue where hiredis library was taking a very long time (more than 4mins) to detect that the redis server was down, when there was an abrupt disconnection of tcp connection (no FIN packet sent from server), and continued to send packets to the server IP that went down. Looking more into the hiredis code we were thinking if there could be an issue with the command timeout refresh logic.

  1. When the socket is writeable, the timeout gets refreshed here.

    hiredis/async.c

    Line 750 in 869f3d0

    _EL_ADD_WRITE(ac);

  2. When application tries to queue data to hiredis, the timeout gets refreshed here.

    hiredis/async.c

    Line 961 in 869f3d0

    _EL_ADD_WRITE(ac);

Let's say the server is nonresponsive. The socket might not become writeable any time soon, so code path 1 will never get hit. However, application continues to push data to hiredis out buffer at the same rate since it is not aware of the server state. As part of this code flow 2 gets invoked, and refresh timeout gets updated. But this continues to for a long time, and the timeout keeps getting extended for a long period of time. I feel this is causing the issue here. Also, just to highlight, if the application is not pumping data at a regular/high rate, we might see the timeout is happening as expected, as code path 2 won't get hit within the command_timeout period, and hiredis won't refresh the timeout.

The question I have is why should the timeout be refreshed as part of point 2 ? When a socket read/write event happens refreshing the timeout makes sense. However, when application is pushing commands to hiredis, isn't it enough to just add a write event without refreshing the timeout. This would inturn result in a non-responsive server to timeout early and give back a "timeout" error response, which would result in client detecting the faulty server early and sending disconnect to it.

Please share your thoughts.

@michael-grunder michael-grunder self-assigned this Sep 11, 2023
@rahulKrishnaM
Copy link
Author

Hi @michael-grunder Do you see there could be an issue here? We are running some resiliency cases, and getting inputs on this would help us mitigate the issue and proceed further.

@michael-grunder
Copy link
Collaborator

I'm going to try and dig into this over the weekend, but yes I can see how extending the timeout how we are could cause the issues you describe. I'll put together a small program that replicates this scenerio to try and figure out what could be done.

@rahulKrishnaM
Copy link
Author

hi @michael-grunder Were you able to proceed with the reproduction/fix for the same ?

@rahulKrishnaM
Copy link
Author

Hi @michael-grunder I was thinking of removing the refresh timeout done as part of point 2. We can register for a write event, but without doing refresh timeout. Do you think that would work?

@rahulKrishnaM
Copy link
Author

hi @michael-grunder This is seen quite often in production when there are network glitches. Is it possible for you to please give some feedback on the queries.

We have been following up for almost 3months on this now with no response, expecting a better way of feedback/response model from such a reputed community.

Reaching out for help to other owners of the repo as well.
cc: @antirez @pietern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants