MOD-15749 Add support for short form of CLUSTERSET by galcohen-redislabs · Pull Request #92 · RedisGears/LibMR

galcohen-redislabs · 2026-05-19T12:07:27Z

This initial commit splits the command into two cases: long and short (which currently does nothing).
It also refactors the long form to help writing a clean version for the upcoming short form.

Note

High Risk
Changes cluster topology discovery and inter-shard CLUSTERSET propagation; short form depends on new RedisModule cluster slot APIs and misconfiguration could break cross-shard messaging.

Overview
Adds a compact CLUSTERSET form (CLUSTERSET or CLUSTERSET AUTH {pwd}) alongside the existing verbose topology command, and refactors how cluster state is built from incoming arguments.

Short form no longer requires the full HASHFUNC / RANGES / SHARD payload: the module discovers masters, addresses, flags, and slot ranges from the Redis server via GetClusterNodesList, GetClusterNodeInfo, and the newly wired GetClusterNodeSlotRanges API (declared in redismodule.h). Optional AUTH applies one password to all shard connections.

Long form behavior is preserved but split into helpers (InitClusterData, ParseShardEntry, SetClusterDataLongForm). Shared init copies argv for later CLUSTERSETFROMSHARD propagation; only long form still substitutes the per-target MYID token when pushing topology on reconnect/hello retries.

Command entry points accept either form via IsLongFormClusterSet / IsShortFormClusterSet instead of a fixed minimum argc of 10.

^{Reviewed by Cursor Bugbot for commit 4de0d20. Bugbot is set up for automated code reviews on this repo. Configure here.}

This initial commit splits the command into two cases: long and short (which currently does nothing). It also refactors the long form to help writing a clean version for the upcoming short form.

gabsow

Solid refactor overall — splitting the long-form into ParseShardEntry reads well, and the short-form path matches MOD-15749's intent. A few things to address, the first one blocking:

Severity	Issue
🔴 Blocking	`slots[]` routing table missing the first range of every node — see inline comment on line 1147
🟠 Important	Slotless fallback claims all 16384 slots
🟠 Important	AUTH password sent even when internal secret is in use (inconsistent with `MR_RefreshClusterData`)
🟡 Cleanup	`InitClusterData` ignores its parameter; `isMe` double-set; log typo; redundant assertion; `uint16_t` loop bound; argc-based form detection is brittle
🟢 Test gap	Only the happy-path 3-shard / no-AUTH case is covered; no slot-routed assertion (the test would pass even with the first-range bug)

gabsow · 2026-05-24T10:48:25Z

+    return argc >= 10;
+}
+
+static bool IsShortFormClusterSet(int argc) {


Argc-based form detection is a bit brittle: short form is argc == 1 || argc == 3, long form is argc >= 10. If the long form ever evolved to use exactly 3 args (or short form to a config with 10+), this would silently dispatch wrong. A leading-token sniff (AUTH / HASHFUNC / nothing) would be more robust. Not blocking, but worth a comment at least.

They will not change that much. Actually we plan to get rid of both. So argc-based logic is good enough.

gabsow · 2026-05-24T10:48:25Z

-        RedisModule_Log(mr_staticCtx, "notice", "Sending cluster topology to %s (%s:%d) after reconnect", n->id, n->ip, n->port);
-        clusterCtx.CurrCluster->clusterSetCommand[CLUSTER_SET_MY_ID_INDEX] = MR_STRDUP(n->id);
+        bool isLongForm = IsLongFormClusterSet(clusterCtx.CurrCluster->clusterSetCommandSize);
+        RedisModule_Log(mr_staticCtx, "notice", "Sending cluster (%s form) topology to %s (%s:%d) on after reconnect",


Typo: "on after reconnect" → "after reconnect" (extra on).

gabsow · 2026-05-24T10:48:25Z

+    cluster->myId[REDISMODULE_NODE_ID_LEN] = '\0';
+}
+
+static void InitClusterData(Cluster* cluster, RedisModuleString** argv, int argc){


InitClusterData takes a Cluster* cluster parameter but ignores it — every line mutates clusterCtx.CurrCluster directly. Either use the parameter throughout (so the helper is testable in isolation and the global mutation is explicit at the call site) or drop the parameter and rename to make it obvious it operates on the global. SetMyId and GenerateRunId get this right.

gabsow · 2026-05-24T10:48:25Z

+
+        if (!(flags & REDISMODULE_NODE_MASTER)) continue;  // Skip replica nodes
+
+        RedisModule_Assert(MR_GetNode(nodeId) == NULL);


Redundant: MR_CreateNode already asserts !MR_GetNode(id) (see line 831). Either drop this pre-check or drop the one inside MR_CreateNode.

gabsow · 2026-05-24T10:48:25Z

+        RedisModuleSlotRangeArray *slots = RedisModule_GetClusterNodeSlotRanges(mr_staticCtx, nodeId);
+        RedisModule_Assert(slots != NULL);
+        uint16_t minSlot = 0;
+        uint16_t maxSlot = NUMBER_OF_SLOTS - 1;


🟠 Slotless-node fallback is incorrect. When slots->num_ranges == 0 (per redis/redis#14953, this happens for unknown nodes / non-cluster mode and plausibly for slotless masters), we fall back to minSlot=0, maxSlot=NUMBER_OF_SLOTS-1 — i.e. we declare this node owns all 16384 slots.

The long form represents slotless with minSlot=0, maxSlot=-1 so that MR_CreateNode adds no range (see if (minSlot <= maxSlot) at line 835). The short form should match — pass 0, -1 (use long long like the long form) and skip the clusterCtx.minSlot/maxSlot overwrite when slotless.

gabsow · 2026-05-24T10:48:25Z

+            maxSlot = slots->ranges[0].end;
+        }
+
+        Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);


🔴 Blocking bug: clusterCtx.CurrCluster->slots[] routing table is never populated for the first range.

MR_CreateNode only stores the first range in the node's own slotRanges list (line 836) — it does not write into the global slots[] table. The loop below at line 1155 then iterates j = 1 onward, so the first range of every node is missing from slots[].

Concrete case: a normal 3-shard cluster where each shard owns one contiguous range → after short-form CLUSTERSET, slots[] is entirely NULL. The dispatch at line 283 (slots[sendMsg->slotToSend]) will then NULL-deref or silently misroute every slot-targeted message.

Compare with SetClusterDataLongForm (line 1220-1222) and MR_RefreshClusterData (line 948-950) — both unconditionally fill slots[k] for every range.

Fix: either run the slot-fill loop before the j=1.. extra-range loop, or unify both into a single for (j=0; j<num_ranges; j++) block:

Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, 0, -1); aMasterNode->isMe = (flags & REDISMODULE_NODE_MYSELF) != 0; for (size_t j = 0; j < slots->num_ranges; j++) { uint16_t minSlot = slots->ranges[j].start; uint16_t maxSlot = slots->ranges[j].end; mr_listAddNodeTail(aMasterNode->slotRanges, NewSlotRange(minSlot, maxSlot)); for (int k = minSlot; k <= maxSlot; k++) clusterCtx.CurrCluster->slots[k] = aMasterNode; if (aMasterNode->isMe && j == 0) { clusterCtx.minSlot = minSlot; clusterCtx.maxSlot = maxSlot; } }

Note: the test test_short_form_clusterset did NOT catch this because TS.QUERYINDEX fans out via the node dict, not via slots[]. The dict is populated correctly, so the broadcast works. See my comment on the RTS PR for a suggested slot-routed test.

good catch. fixed.
(we don't use the slots[] anyway in ts because it never routes via a hash. the cluster-wide commands always send to all nodes)

gabsow · 2026-05-24T10:48:25Z

+        }
+
+        Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);
+        aMasterNode->isMe = (flags & REDISMODULE_NODE_MYSELF) != 0;


isMe is set twice — MR_CreateNode sets it via strcmp(id, myId) at line 862, then this line overrides via the MYSELF flag. They should always agree, but having two sources of truth is confusing. The MYSELF flag is the more direct signal — consider making MR_CreateNode not set isMe at all when called from this path, or unify.

This is true, and the flag is the more "correct" way. But note that MR_CreateNode is also called from other' flag-less' environment, so we need the strcmp logic as well. I'll add a comment.

gabsow · 2026-05-24T10:48:25Z

+            maxSlot = slots->ranges[0].end;
+        }
+
+        Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);


🟠 The password obtained from AUTH <pwd> is passed unconditionally to MR_CreateNode. Compare with MR_RefreshClusterData (line 940):

n = MR_CreateNode(..., RedisModule_GetInternalSecret ? NULL : clusterCtx.password, NULL, ...);

i.e. when an internal secret is available, password auth is intentionally skipped. If a deployment uses internal secrets, the short form will still attempt password auth — at minimum inconsistent with the existing pattern, possibly wrong. Either mirror the same guard or leave a comment explaining the difference.

Note that this is not the same case. Here we fetch the password from the parsing the CLUSTERSET command (similar to what's being done in the parsing of ADDR in the long form). The CLUSTERSET from the DMC will always have a password (which is the bdb's internal_pass attribute).

gabsow · 2026-05-24T10:48:25Z

+            maxSlot = slots->ranges[j].end;
+            mr_listAddNodeTail(aMasterNode->slotRanges, NewSlotRange(minSlot, maxSlot));
+            for (uint16_t k = minSlot ; k <= maxSlot ; k++)
+                clusterCtx.CurrCluster->slots[k] = aMasterNode;


uint16_t k is safe for canonical maxSlot ≤ 16383, but would spin forever if a buggy API ever returns maxSlot == UINT16_MAX (e.g. casting -1). The long-form loop at line 1220 uses int k. Match that here.

gabsow

LGTM ✅

All 9 inline comments addressed:

🔴 Blocking slots[] first-range bug → fixed in 42ce87e
🟠 Slotless fallback → fixed in 458f2ca (now uses long long 0, -1 sentinel)
🟠 AUTH vs internal-secret → accepting your reasoning (symmetric with long-form ADDR parsing; DMC always provides the password)
🟡 InitClusterData parameter, redundant assert, log typo, uint16_t→int loop, isMe rationale comment → all addressed
🟡 Argc-based form detection → accepting (you plan to retire both forms)

Flagging for downstream visibility: the RTS PR (#2034) still pins to ad6a45d3 and CI is fully red because the blocking slots[] bug surfaces in tests there. Bumping the submodule to this PR's HEAD (or to the merge commit) should unblock it.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 4de0d20. Configure here.}

cursor · 2026-05-26T07:29:36Z

+                clusterCtx.CurrCluster->slots[k] = aMasterNode;
+        }
+
+        RedisModule_ClusterFreeSlotRanges(mr_staticCtx, slots);


Missing thread-safe context lock for Redis API calls

High Severity

SetClusterDataShortForm calls RedisModule_GetClusterNodesList, RedisModule_GetClusterNodeInfo, RedisModule_GetClusterNodeSlotRanges, and their corresponding free functions using mr_staticCtx without acquiring RedisModule_ThreadSafeContextLock. This function runs on the event loop's dedicated pthread, not the Redis main thread. The existing MR_RefreshClusterData — which runs in the same threading context — correctly wraps its RedisModule_GetClusterNodeInfo and RedisModule_Call invocations with lock/unlock pairs. The omission here creates a data race on shared server state.

^{Reviewed by Cursor Bugbot for commit 4de0d20. Configure here.}

Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25).

ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)

…95) ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)

… add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>

…2038) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * Fix event-ci.yml redis-ref: unstable -> 8.8 The TODO in .github/workflows/event-ci.yml prepare-values step ("redis-ref=unstable # todo change per version/tag") was never done when this branch was cut. Result: CI was running against redis unstable instead of the matching release-line redis, so failures on this branch reflected unstable-redis behavior rather than the target release-line behavior. Fix the hardcoded ref to match the branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 test_short_form_clusterset: skip on OSS 8.x Short-form CLUSTERSET needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x. The feature ships against RE-private redis; this branch always skips. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 Adopt Tal's REDISMODULE_MAIN Makefile fix (#2043) Mirror the fix already applied on 8.4 / 8.6 cherry-pick PRs. Background: The Makefile typo DREDISMODULE_MAIN (introduced 2024-08-22 by MOD-8014, 4145c70) meant the build was never setting REDISMODULE_MAIN. RTS .o files emitted no storage for SDK function pointers; the .so linked only because LibMR's static library happened to provide storage for every pointer LibMR's vendored redismodule.h declared, plus a handful of hand-patched forward declarations in src/common.h that grew each time someone added a new SDK API not in LibMR's header. On 8.8 the typo never manifested as a CI failure because: - 8.8 src does call the new APIs (GetUserUsername, EnablePostpone*) after MOD-15262, but src/common.h already had matching forward declarations from de204c2 (Tal, 2026-05-04) added as a workaround. - So the .so linked even with the typo. Applying Tal's #2043 fix here for consistency and correctness: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). Net result: module.o now emits storage for all ~404 SDK pointers from the SDK header alone. The common.h forward declarations on 8.8 become dead code, but are intentionally left in place here -- they will be cleaned up by the master cleanup PR (#2044) and propagated to 8.x branches once that lands. Keeping this commit minimal preserves the cherry-pick PR's scope and avoids merge conflict risk with #2044. Verified locally against redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. nm shows GetUserUsername / EnablePostpone* / DisablePostpone* are now `C` (common) in module.o. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.8' event-ci.yml was already fixed (the canonical PR-triggered workflow), but three other workflows still default to 'unstable': - event-tag.yml (tag-event releases) - event-nightly.yml (nightly schedule) - flow-random-traffic.yml (random-OS traffic test, workflow_call+dispatch) Same pattern as the earlier event-ci.yml fix: the TODO comment ("todo change per version/tag") was never honored when this branch was cut, so manual / nightly / tag triggers on the 8.8 branch would run against redis unstable rather than the matching 8.8 release. Also drops the now-stale TODO comment on event-tag.yml since the value is no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…2041) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Verified locally against built redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Drop stale "todo change per version/tag" comments event-ci.yml and event-tag.yml already had the redis-ref value set to '8.4', so the TODO comments next to those values are obsolete. Removed for hygiene -- same cleanup applied to other 8.x branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…2040) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.6' flow-random-traffic.yaml still defaulted to 'unstable' on two inputs. Manual / workflow_call triggers on the 8.6 branch would run against redis unstable instead of redis 8.6. Also drops the now-stale TODO comments on event-ci.yml and event-tag.yml since their values are no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

galcohen-redislabs added 7 commits May 19, 2026 12:03

MOD-15749 Add support for short form of CLUSTERSET

f5ccd62

This initial commit splits the command into two cases: long and short (which currently does nothing). It also refactors the long form to help writing a clean version for the upcoming short form.

Added the CLUSTERSET short form

fddeda2

No need for MYID in short form

52de49b

Rename stuff to align with redis/redis#14953

802e750

ooops

0e97251

Make AUTH optional in short-form CLUSTERSET

ad6a45d

Handle slotless nodes

24bcf72

gabsow reviewed May 24, 2026

View reviewed changes

gabsow mentioned this pull request May 24, 2026

MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests RedisTimeSeries/RedisTimeSeries#2034

Merged

galcohen-redislabs added 7 commits May 24, 2026 13:06

Don't miss the first slot-range in clusterCtx.CurrCluster->slots[]

42ce87e

Handle slotless nodes properly (bugfix)

458f2ca

typo

97c0b38

Use the 'cluster' argument properly in InitClusterData()

2ae9296

Dropped a redundant assert

bb41df5

Added a comment to clarift potential redundant logic

9515e30

Loop slot ranges using an int index

4de0d20

gabsow approved these changes May 24, 2026

View reviewed changes

gabsow marked this pull request as ready for review May 26, 2026 07:20

cursor Bot reviewed May 26, 2026

View reviewed changes

gabsow merged commit 47a8a68 into master May 26, 2026
8 of 14 checks passed

This was referenced May 26, 2026

MOD-15749 Replace short-form CLUSTERSET assert with soft NULL check #94

Merged

MOD-15749 Fix heap-buffer-overflow on short-form CLUSTERSET node IDs #95

Merged

gabsow mentioned this pull request May 27, 2026

[8.8] Cherry-pick MOD-15749: short-form CLUSTERSET + tests (#2034) RedisTimeSeries/RedisTimeSeries#2038

Merged

2 tasks

gabsow mentioned this pull request May 31, 2026

[8.6] Cherry-pick MOD-15749: short-form CLUSTERSET + tests (#2034) RedisTimeSeries/RedisTimeSeries#2040

Merged

3 tasks


		if (!(flags & REDISMODULE_NODE_MASTER)) continue; // Skip replica nodes

		RedisModule_Assert(MR_GetNode(nodeId) == NULL);

Conversation

galcohen-redislabs commented May 19, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabsow left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gabsow left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

Missing thread-safe context lock for Redis API calls

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

galcohen-redislabs commented May 19, 2026 •

edited by cursor Bot

Loading