Make users' cursors trigger cursor read at the shards when needed - [MOD-5580] #3853
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #3853 +/- ##
==========================================
+ Coverage 82.77% 82.82% +0.04%
==========================================
Files 192 192
Lines 32605 32649 +44
==========================================
+ Hits 26990 27040 +50
+ Misses 5615 5609 -6
☔ View full report in Codecov by Sentry.
@@ -545,10 +546,22 @@ int MRIteratorCallback_ResendCommand(MRIteratorCallbackCtx *ctx, MRCommand *cmd)
                        ctx);
 }
 
+// Use after modifying `pending` (or any other variable of the iterator) to make sure it's visible to other threads
+void MRIteratorCallback_ProcessDone(MRIteratorCallbackCtx *ctx) {
+  __atomic_fetch_sub(&ctx->ic->inProcess, 1, __ATOMIC_RELEASE);
Why does it need to be atomic? Don't we always decrease/set it from a single thread?
Currently, a single thread sets it (with no race), and then a single thread decreases it while another observes the change. The atomicity is used more for memory fencing than for counter modification.
So why not use volatile?
@@ -246,3 +246,47 @@ def testExceedCursorCapacity(env):
 def testExceedCursorCapacityBG():
   env = Env(moduleArgs='WORKER_THREADS 1 MT_MODE MT_MODE_FULL')
   testExceedCursorCapacity(env)
 
+def testCursorOnCoordinator(env):
Maybe we can think of how to test cases where the cursor times out on one shard but not on the other? This could have happened even before this change, but now it is more likely to happen.
I'll add a TODO.
Regarding that, let's wait until we can:
- pass `MAXIDLE` to the shards (currently using the default of 5 minutes)
- tell if a timeout occurred
I guess we will need a RediSearch mock to control the RediSearch reply to the coordinator and verify that the coordinator acts as expected. Maybe we should open a task for this?
👍 Few comments.
If possible, maybe we can improve the PR top comment to give some more details about the new variables (`inProcess`, `depleted`, `forCursor`) and explain what they mean and when they are updated?
@@ -33,8 +34,9 @@ extern SearchClusterConfig clusterConfig;
   .numPartitions = 0, \
   .connPerShard = 0, \
   .type = DetectClusterType(), \
-  .timeoutMS = 500, \
+  .timeoutMS = 0, \
Why change the default timeout? What am I missing?
TL;DR: it was never in use, and the actual default value was 0, so this keeps the default value that was actually in effect.

This default definition was never used; we always set all the configurations to 0 by default. I noticed this when I gave the new configuration a default value of 1, but in tests it was always set to 0.

We need to discuss this configuration. It has a few false default values in the code that are always ignored or overridden and set to 0. The configuration name for the `FT.CONFIG` API is `TIMEOUT`, which is also the name of the default query timeout, making it inaccessible for modification (since the first `TIMEOUT` config that is found is always the query-timeout one). It is also marked as modifiable at runtime, but its value is copied to a global variable once at module init and is never changed again. Lastly, it is used to give a timeout for the client blocking in the `rmr` calls. Before making this value 0, some tests started failing on random timeouts, and I'm not sure we want to use the RM API for handling timeouts like that. Maybe it was an initial idea for how to ensure that queries finish after some time even if some shards did not reply on time.
OK, let's just mention it in the PR top comment with a detailed explanation of why it was broken and why the change keeps the current broken behavior of timeout 0.
A few last comments about the changes.
Successfully created backport PR for
Successfully created backport PR for
Backport failed for. Please cherry-pick the changes locally:

git fetch origin 2.6
git worktree add -d .worktree/backport-3853-to-2.6 origin/2.6
cd .worktree/backport-3853-to-2.6
git checkout -b backport-3853-to-2.6
ancref=$(git merge-base 2ed7b842df3041969d53e066cface69d33a61144 82664b5506d72296f3a23668f660c6424949b129)
git cherry-pick -x $ancref..82664b5506d72296f3a23668f660c6424949b129
…MOD-5580] (#3853)

* make users' cursors trigger cursor read at the shards when needed
* fix waiting indefinitely
* improve manually triggering
* added a threshold for channel size in `MR_ManuallyTriggerNextIfNeeded`
* fix potential read after free
* fix triggering logic
* change misunderstood threshold
* improved readability and made some explanations
* fix a comment
* another comment fix
* added a test
* improved test
* added some comments
* review fixes and improvements
* improved test
* fix cluster default configuration
* improved test
* small fix
* revert effect of using configuration
* explicitly use 0 instead of `timeout_g` (which is always 0)
Describe the changes in the pull request

Previously, any `FT.AGGREGATE` query (`WITHCURSOR` or without) triggered a train of cursor calls on the shards (after the initial `_FT.AGGREGATE WITHCURSOR`, each reply triggers a callback that pushes the next `_FT.CURSOR READ` to the queue) until the cursor on the shard is depleted.

On users' cursor requests (`FT.AGGREGATE WITHCURSOR`), the user may not supply any limit on the number of results. This causes an unlimited number of commands on the shards until depletion, with all of the replies accumulating on the reply channel, waiting for the user's cursor to read them. Since such cursors may not need all the results, may be idle for a long time, and since `read` commands on Redis do not obey any memory limit, this could cause a big memory spike that can persist for a long time and even crash the server.

With this fix, cursor commands "manually" trigger the next read from the shards only when the next command is needed. This way, the number of results accumulated on the coordinator's `RPNet` channel is always bounded.

In the current state, the modified code has a mild race between two threads (main and uv): one reads a counter's value while the other modifies it (`--`). Since I wanted to assume some relation between this counter (`inProcess`) and other variables in the struct, I decided to use atomic operations with acquire-release memory fencing. Since there is almost no contention, and there are other locks in the code flow that slow it down anyway, the atomic operations should have a very small performance effect.

Future Work
Additional note:

While adding the new configuration for the reply threshold, we discovered the default values for cluster configurations. Luckily, most of them were meant to be 0, and we set the entire structure to 0s, so there was no major effect on anything besides a problematic timeout value. We decided that we are OK with keeping the actual 0 value as a default (and it might even be the preferable value anyway).

From this PR forward, we use `DEFAULT_CLUSTER_CONFIG`, and it can be used to define default values in future configurations. More information about this is written in this comment.

Which issues this PR fixes
Main objects this PR modified

- `inProcess` at `MRIteratorCtx`: the number of currently running commands on the shards, in contrast/complement to `pending`, which is the number of shards with more results (not yet depleted)
- `forCursor` at `MRCommand`: a flag indicating that the user requested a cursor for the aggregation command
- `depleted` at `MRCommand`: indicates that this command (for this shard) is depleted, so there is no need to re-run `_FT.CURSOR READ` on this shard

Mark if applicable