Use the async api when a connection error triggers a slot update #144

bjosv · 2023-02-22T09:08:54Z

When a command response indicates a communication error, the slot map is updated.
This update now uses the async API, which avoids blocking calls to connect when querying a cluster node.

This fixes the problem with hanging connects that block the event handling system.

Changed behaviors
Previously, communication errors were counted. When the client had received more errors than the configured max_retry_count value, the Redis configuration "cluster-node-timeout" was fetched from a cluster node. This configured value was then used to determine when to perform a slotmap update. When an additional error was received after the time-to-wait, the slotmap update procedure started. This procedure used blocking calls on a new TCP connection.

After this PR, a communication error triggers the slotmap update procedure directly.
Primarily, a connected node is selected that is found close to a randomly picked index among all known nodes.
The random index should give a more even distribution of selected nodes.
If no connected node is found while iterating to this index, the remaining nodes are also checked until a connected node is found. If no connected node is found at all, a node close to the picked index, for which a connection establishment has not been attempted within throttle-time, is selected.
The commands are sent using the async API to avoid blocking sends (or connects). While the slotmap update procedure runs, and until one second after it has finished, other commands that trigger communication errors/timeouts will not start additional slotmap updates, i.e. the slotmap update is throttled.
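The node selection described above could look roughly like the following. This is only a sketch under stated assumptions, not the code in hircluster.c: `node_t`, `select_node()` and `THROTTLE_USEC` are hypothetical names, the nodes are held in a plain array instead of the library's dict, and the picked index is passed in by the caller instead of being drawn from `random()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THROTTLE_USEC 1000000 /* assumed throttle time: 1 second */

/* Hypothetical, simplified stand-in for redisClusterNode. */
typedef struct {
    int connected;             /* non-zero when a connection is established */
    int64_t last_attempt_usec; /* time of the last connection attempt */
} node_t;

/* Returns the index of the node to send CLUSTER SLOTS to, or -1 if no
 * node is eligible. `pick` is the randomly picked index. */
static int select_node(const node_t *nodes, size_t count, size_t pick,
                       int64_t now_usec) {
    int connected = -1; /* connected node found closest to pick so far */
    int fallback = -1;  /* unconnected node that is not throttled */

    assert(count == 0 || pick < count);

    for (size_t i = 0; i < count; i++) {
        if (nodes[i].connected) {
            connected = (int)i;
        } else if (now_usec - nodes[i].last_attempt_usec >= THROTTLE_USEC &&
                   (i <= pick || fallback == -1)) {
            /* Up to pick: keep the latest candidate. After pick: keep
             * only the first one, since it is the closest. */
            fallback = (int)i;
        }
        /* Stop at the picked index if a connected node was seen,
         * otherwise keep checking the remaining nodes. */
        if (i >= pick && connected != -1)
            break;
    }
    return connected != -1 ? connected : fallback;
}
```

A caller would typically compute `pick` as a random value modulo the number of known nodes, and start an async connect if the returned node is not connected.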

Other:
Using the async API when handling MOVED should be implemented as well, but that will be done in a separate PR.

Fixes #142

zuiderkwast

Very good! A few comments and questions.

zuiderkwast · 2023-03-01T12:19:16Z

hircluster.c

+    acc->update_route_time = hi_usec_now() + SLOTMAP_UPDATE_THROTTLE_USEC;
+
+    if (r == NULL) {
+        /* Ignore failures, next random node is hopefully working */
+        return;
+    }


Should we set update_route_time even if r == NULL?

Shouldn't we trigger another slot update with a different node here?

This was a bit simplified as a start.
Hiredis calls the callback function with a NULL reply either when the command has timed out or when there are connection errors.

The scenario I thought of:
If a slot map update doesn't find any connected node, it will take a random node and start an async connect. It will also hand the CLUSTER SLOTS command to hiredis, which sends it once the connection is up.
If the connect fails, this callback is called with a NULL reply due to the outstanding CLUSTER SLOTS command.
If all Redis nodes are gone (like in the issue we have), I was afraid slotmap updates would go into a busy-loop.

One option could be to only trigger a new slot update towards one of the connected nodes, if one exists.
This would speed up updates for timeout cases or temporary node issues.
..or some bookkeeping to avoid busy-looping cases..

I think we want a loop until we have a slot map, but wait one second between each attempt. (That's what we do in ered.)

Waiting a second is the problem here, since hiredis owns the event loop and its timers.
We need to add an interface and functionality to each event adapter to be able to kick off timers in each event system (or have hiredis expose something).

The modularity, and being able to choose your own event system, is a bit annoying :-)
We have users who have implemented their own event system as well, so updating the existing adapters might not be enough for everybody.

The PR is now updated to retry until all nodes have been attempted (by using a lastConnectionAttempt and throttling).
Since no timer handling is available to hiredis-cluster yet, the first slotmap update attempt is triggered by the user via a command, and the retries are triggered by the NULL-reply callback.
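To illustrate why the per-node timestamp prevents the busy-loop discussed above, here is a small simulation in which every connect fails. It is a sketch only: `node_t`, `retry_until_exhausted()` and `THROTTLE_USEC` are hypothetical, and the loop stands in for the chain of NULL-reply callbacks that hiredis would deliver asynchronously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THROTTLE_USEC 1000000 /* assumed throttle time: 1 second */

/* Hypothetical stand-in for a node with a lastConnectionAttempt field. */
typedef struct {
    int64_t last_connection_attempt;
} node_t;

/* Simulates a cluster where every connect fails at time `now`.
 * Returns the number of attempts made before the retries stop. */
static size_t retry_until_exhausted(node_t *nodes, size_t count, int64_t now) {
    size_t attempts = 0;

    for (;;) {
        node_t *next = NULL;

        /* Find a node that has not been attempted within throttle time. */
        for (size_t i = 0; i < count; i++) {
            if (now - nodes[i].last_connection_attempt >= THROTTLE_USEC) {
                next = &nodes[i];
                break;
            }
        }
        if (next == NULL)
            break; /* every node was tried recently: stop retrying */

        /* Stamp the attempt; the failed connect would then result in a
         * NULL-reply callback, which is modelled by looping again. */
        next->last_connection_attempt = now;
        attempts++;
    }
    return attempts;
}
```

With all nodes down, the retries end after one attempt per node, and a later command from the user starts a new round once the throttle time has passed.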


When a command response indicates a communication error the slot map is updated. This is now updated using the async api, which avoids blocking calls to connect when querying the cluster node. Primarily select a connected node by picking one of four first found connected nodes. If there are no connected nodes then pick a node where a connect has not been attempted within throttle time (1 sec). cc->update_route_time is used to throttle the slotmap update and to make sure we only have one ongoing query at a time (SLOTMAP_UPDATE_ONGOING).
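The throttling with a single timestamp field and an "ongoing" marker could be sketched as below. The names `update_route_time`, `SLOTMAP_UPDATE_ONGOING` and `SLOTMAP_UPDATE_THROTTLE_USEC` come from this PR, but the sentinel value, the `cluster_ctx_t` type and both helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTMAP_UPDATE_THROTTLE_USEC 1000000 /* 1 second */
#define SLOTMAP_UPDATE_ONGOING INT64_MAX     /* assumed sentinel value */

/* Hypothetical, reduced stand-in for the cluster context. */
typedef struct {
    int64_t update_route_time; /* earliest time for the next update */
} cluster_ctx_t;

/* Returns 1 and marks the update as ongoing if a slot map update may
 * start at time `now`; returns 0 if it is throttled or already running. */
static int try_start_update(cluster_ctx_t *cc, int64_t now) {
    if (cc->update_route_time == SLOTMAP_UPDATE_ONGOING)
        return 0; /* only one query at a time */
    if (now < cc->update_route_time)
        return 0; /* still within the throttle window */
    cc->update_route_time = SLOTMAP_UPDATE_ONGOING;
    return 1;
}

/* Called when the CLUSTER SLOTS reply (or a failure) arrives. */
static void finish_update(cluster_ctx_t *cc, int64_t now) {
    cc->update_route_time = now + SLOTMAP_UPDATE_THROTTLE_USEC;
}
```

Using one field for both purposes keeps the check on the error path to a couple of comparisons.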


zuiderkwast

Much nicer now, to use an existing connection.

Is there any way to see how many requests are pending on a connection? Just thinking about the case where CLUSTER SLOTS is added to a very long pipeline and is therefore delayed... Maybe not a valid concern?


bjosv · 2023-03-13T12:07:05Z

Ouch, the randomness makes the simulated-redis tests a bit harder to handle..

zuiderkwast · 2023-03-13T12:53:18Z

Ouch, the randomness makes the simulated-redis tests a bit harder to handle..

Right :-) We can inject some fake-randomness just for testing, e.g. using an ifdef or a known fixed random seed.
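The fixed-seed idea might look like this in a test. This is a sketch with assumptions: `pick_index()` is a hypothetical helper, and it uses the standard C `srand()`/`rand()` pair for portability, whereas the commits in this PR refer to `srandom()`/`random()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: pick an index in [0, count). */
static size_t pick_index(size_t count) {
    assert(count > 0);
    return (size_t)rand() % count;
}
```

Seeding with a known value before each test case, e.g. `srand(42)`, makes the sequence of picked indexes repeatable, so a simulated Redis node can expect CLUSTER SLOTS on a predictable connection.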


zuiderkwast

Pretty clean algorithm now. I like it. However, I found a corner case that might need to be addressed.


bjosv force-pushed the async-connection-errors branch 3 times, most recently from 1ddb360 to 96b9389, February 28, 2023 22:12

bjosv marked this pull request as ready for review March 1, 2023 08:23

bjosv requested a review from zuiderkwast March 1, 2023 08:35

zuiderkwast reviewed Mar 1, 2023


bjosv added 8 commits March 7, 2023 00:24

Keep specific error from failing cluster_update_route() calls

3a8e3c6

Refactor code for updating known cluster nodes

9552d86

Breakout of a new function that also will be reused for the async api.

Remove use of Redis config cluster-node-timeout

9a9d850

The attempt to update the slot map will be immediate.

Update slot map immediately for communication problems

5e36f24

Make sure we throttle the slot map update to 1/sec after communications errors. Update testcase timeouts to include slotmap updates.

Update timeout-handling-test

2d9cb0c

Make sure that a connection to the first node exists so that CLUSTER SLOTS is sent to that node instead of a random node.

Fix redisClusterAsyncDisconnect()

2013eec

Let hiredis cleanup after a disconnect like it does after a failure, i.e call unlinkAsyncContextAndNode(). This fixes a thread-sanitizer issue.

Add lastConnectionAttempt to redisClusterNode

ae5501a

Add a timestamp to each redisClusterNode to indicate when the last connection attempt was performed to this node.

bjosv force-pushed the async-connection-errors branch from daa9bde to c83f290, March 7, 2023 00:03

Build on windows

c2642ab

Add missing srandom() and random()

zuiderkwast reviewed Mar 10, 2023


bjosv added 3 commits March 13, 2023 09:57

Fixup: Use lastSlotmapUpdateAttempt

09bf236

Add dictGetRandomKey() from hiredis

908e8fd

Fixup: Perform 3 attempts to randomly pick a connected node

32afdf0

bjosv added 4 commits March 13, 2023 14:10

fixup: remove randomizer seed from library

a5d928f

Revert "Add dictGetRandomKey() from hiredis"

9385482

This reverts commit 908e8fd.

Revert "Fixup: Perform 3 attempts to randomly pick a connected node"

dc5d570

This reverts commit 32afdf0.

Fixup: Select node close to a random index

696ebc3

zuiderkwast reviewed Mar 13, 2023


bjosv added 3 commits March 13, 2023 17:04

Add info about need of srandom() to README

29f6c59

fixup: handle low checkIndex cornercase

5f6a783

Fixup: Select a connected node primarily

76ada61

zuiderkwast approved these changes Mar 14, 2023


fixup: remove unused include

82e8708

bjosv merged commit ad98e88 into Nordix:master Mar 14, 2023

bjosv deleted the async-connection-errors branch March 14, 2023 10:26

bjosv mentioned this pull request Apr 27, 2023

async api #19

Closed

bjosv mentioned this pull request Nov 14, 2023

redis-cluster attempting to connect to a stale redis server IP and not recovering #133

Closed
