Skip to content

MOD-15749 Add support for short form of CLUSTERSET#92

Merged
gabsow merged 14 commits into
masterfrom
gal-15749-support-shorter-clusterset
May 26, 2026
Merged

MOD-15749 Add support for short form of CLUSTERSET#92
gabsow merged 14 commits into
masterfrom
gal-15749-support-shorter-clusterset

Conversation

@galcohen-redislabs
Copy link
Copy Markdown
Collaborator

@galcohen-redislabs galcohen-redislabs commented May 19, 2026

This initial commit splits the command into two cases: long and short (which currently does nothing).
It also refactors the long form to help writing a clean version for the upcoming short form.


Note

High Risk
Changes cluster topology discovery and inter-shard CLUSTERSET propagation; short form depends on new RedisModule cluster slot APIs and misconfiguration could break cross-shard messaging.

Overview
Adds a compact CLUSTERSET form (CLUSTERSET or CLUSTERSET AUTH {pwd}) alongside the existing verbose topology command, and refactors how cluster state is built from incoming arguments.

Short form no longer requires the full HASHFUNC / RANGES / SHARD payload: the module discovers masters, addresses, flags, and slot ranges from the Redis server via GetClusterNodesList, GetClusterNodeInfo, and the newly wired GetClusterNodeSlotRanges API (declared in redismodule.h). Optional AUTH applies one password to all shard connections.

Long form behavior is preserved but split into helpers (InitClusterData, ParseShardEntry, SetClusterDataLongForm). Shared init copies argv for later CLUSTERSETFROMSHARD propagation; only long form still substitutes the per-target MYID token when pushing topology on reconnect/hello retries.

Command entry points accept either form via IsLongFormClusterSet / IsShortFormClusterSet instead of a fixed minimum argc of 10.

Reviewed by Cursor Bugbot for commit 4de0d20. Bugbot is set up for automated code reviews on this repo. Configure here.

This initial commit splits the command into two cases: long and short (which currently does nothing).
It also refactors the long form to help writing a clean version for the upcoming short form.
Copy link
Copy Markdown
Contributor

@gabsow gabsow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid refactor overall — splitting the long-form into ParseShardEntry reads well, and the short-form path matches MOD-15749's intent. A few things to address, the first one blocking:

Severity Issue
🔴 Blocking slots[] routing table missing the first range of every node — see inline comment on line 1147
🟠 Important Slotless fallback claims all 16384 slots
🟠 Important AUTH password sent even when internal secret is in use (inconsistent with MR_RefreshClusterData)
🟡 Cleanup InitClusterData ignores its parameter; isMe double-set; log typo; redundant assertion; uint16_t loop bound; argc-based form detection is brittle
🟢 Test gap Only the happy-path 3-shard / no-AUTH case is covered; no slot-routed assertion (the test would pass even with the first-range bug)

Comment thread src/cluster.c
return argc >= 10;
}

static bool IsShortFormClusterSet(int argc) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argc-based form detection is a bit brittle: short form is argc == 1 || argc == 3, long form is argc >= 10. If the long form ever evolved to use exactly 3 args (or short form to a config with 10+), this would silently dispatch wrong. A leading-token sniff (AUTH / HASHFUNC / nothing) would be more robust. Not blocking, but worth a comment at least.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will not change that much. Actually we plan to get rid of both. So argc-based logic is good enough.

Comment thread src/cluster.c Outdated
RedisModule_Log(mr_staticCtx, "notice", "Sending cluster topology to %s (%s:%d) after reconnect", n->id, n->ip, n->port);
clusterCtx.CurrCluster->clusterSetCommand[CLUSTER_SET_MY_ID_INDEX] = MR_STRDUP(n->id);
bool isLongForm = IsLongFormClusterSet(clusterCtx.CurrCluster->clusterSetCommandSize);
RedisModule_Log(mr_staticCtx, "notice", "Sending cluster (%s form) topology to %s (%s:%d) on after reconnect",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "on after reconnect""after reconnect" (extra on).

Comment thread src/cluster.c
cluster->myId[REDISMODULE_NODE_ID_LEN] = '\0';
}

static void InitClusterData(Cluster* cluster, RedisModuleString** argv, int argc){
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InitClusterData takes a Cluster* cluster parameter but ignores it — every line mutates clusterCtx.CurrCluster directly. Either use the parameter throughout (so the helper is testable in isolation and the global mutation is explicit at the call site) or drop the parameter and rename to make it obvious it operates on the global. SetMyId and GenerateRunId get this right.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oy! fixed.

Comment thread src/cluster.c Outdated

if (!(flags & REDISMODULE_NODE_MASTER)) continue; // Skip replica nodes

RedisModule_Assert(MR_GetNode(nodeId) == NULL);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant: MR_CreateNode already asserts !MR_GetNode(id) (see line 831). Either drop this pre-check or drop the one inside MR_CreateNode.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped

Comment thread src/cluster.c Outdated
RedisModuleSlotRangeArray *slots = RedisModule_GetClusterNodeSlotRanges(mr_staticCtx, nodeId);
RedisModule_Assert(slots != NULL);
uint16_t minSlot = 0;
uint16_t maxSlot = NUMBER_OF_SLOTS - 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Slotless-node fallback is incorrect. When slots->num_ranges == 0 (per redis/redis#14953, this happens for unknown nodes / non-cluster mode and plausibly for slotless masters), we fall back to minSlot=0, maxSlot=NUMBER_OF_SLOTS-1 — i.e. we declare this node owns all 16384 slots.

The long form represents slotless with minSlot=0, maxSlot=-1 so that MR_CreateNode adds no range (see if (minSlot <= maxSlot) at line 835). The short form should match — pass 0, -1 (use long long like the long form) and skip the clusterCtx.minSlot/maxSlot overwrite when slotless.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread src/cluster.c
maxSlot = slots->ranges[0].end;
}

Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Blocking bug: clusterCtx.CurrCluster->slots[] routing table is never populated for the first range.

MR_CreateNode only stores the first range in the node's own slotRanges list (line 836) — it does not write into the global slots[] table. The loop below at line 1155 then iterates j = 1 onward, so the first range of every node is missing from slots[].

Concrete case: a normal 3-shard cluster where each shard owns one contiguous range → after short-form CLUSTERSET, slots[] is entirely NULL. The dispatch at line 283 (slots[sendMsg->slotToSend]) will then NULL-deref or silently misroute every slot-targeted message.

Compare with SetClusterDataLongForm (line 1220-1222) and MR_RefreshClusterData (line 948-950) — both unconditionally fill slots[k] for every range.

Fix: either run the slot-fill loop before the j=1.. extra-range loop, or unify both into a single for (j=0; j<num_ranges; j++) block:

Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, 0, -1);
aMasterNode->isMe = (flags & REDISMODULE_NODE_MYSELF) != 0;
for (size_t j = 0; j < slots->num_ranges; j++) {
    uint16_t minSlot = slots->ranges[j].start;
    uint16_t maxSlot = slots->ranges[j].end;
    mr_listAddNodeTail(aMasterNode->slotRanges, NewSlotRange(minSlot, maxSlot));
    for (int k = minSlot; k <= maxSlot; k++)
        clusterCtx.CurrCluster->slots[k] = aMasterNode;
    if (aMasterNode->isMe && j == 0) {
        clusterCtx.minSlot = minSlot;
        clusterCtx.maxSlot = maxSlot;
    }
}

Note: the test test_short_form_clusterset did NOT catch this because TS.QUERYINDEX fans out via the node dict, not via slots[]. The dict is populated correctly, so the broadcast works. See my comment on the RTS PR for a suggested slot-routed test.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. fixed.
(we don't use the slots[] anyway in ts because it never routes via a hash. the cluster-wide commands always send to all nodes)

Comment thread src/cluster.c
}

Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);
aMasterNode->isMe = (flags & REDISMODULE_NODE_MYSELF) != 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isMe is set twice — MR_CreateNode sets it via strcmp(id, myId) at line 862, then this line overrides via the MYSELF flag. They should always agree, but having two sources of truth is confusing. The MYSELF flag is the more direct signal — consider making MR_CreateNode not set isMe at all when called from this path, or unify.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, and the flag is the more "correct" way. But note that MR_CreateNode is also called from other' flag-less' environment, so we need the strcmp logic as well. I'll add a comment.

Comment thread src/cluster.c
maxSlot = slots->ranges[0].end;
}

Node* aMasterNode = MR_CreateNode(nodeId, ip, port, password, NULL, minSlot, maxSlot);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 The password obtained from AUTH <pwd> is passed unconditionally to MR_CreateNode. Compare with MR_RefreshClusterData (line 940):

n = MR_CreateNode(..., RedisModule_GetInternalSecret ? NULL : clusterCtx.password, NULL, ...);

i.e. when an internal secret is available, password auth is intentionally skipped. If a deployment uses internal secrets, the short form will still attempt password auth — at minimum inconsistent with the existing pattern, possibly wrong. Either mirror the same guard or leave a comment explaining the difference.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is not the same case. Here we fetch the password from the parsing the CLUSTERSET command (similar to what's being done in the parsing of ADDR in the long form). The CLUSTERSET from the DMC will always have a password (which is the bdb's internal_pass attribute).

Comment thread src/cluster.c
maxSlot = slots->ranges[j].end;
mr_listAddNodeTail(aMasterNode->slotRanges, NewSlotRange(minSlot, maxSlot));
for (uint16_t k = minSlot ; k <= maxSlot ; k++)
clusterCtx.CurrCluster->slots[k] = aMasterNode;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint16_t k is safe for canonical maxSlot ≤ 16383, but would spin forever if a buggy API ever returns maxSlot == UINT16_MAX (e.g. casting -1). The long-form loop at line 1220 uses int k. Match that here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matched

Copy link
Copy Markdown
Contributor

@gabsow gabsow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅

All 9 inline comments addressed:

  • 🔴 Blocking slots[] first-range bug → fixed in 42ce87e
  • 🟠 Slotless fallback → fixed in 458f2ca (now uses long long 0, -1 sentinel)
  • 🟠 AUTH vs internal-secret → accepting your reasoning (symmetric with long-form ADDR parsing; DMC always provides the password)
  • 🟡 InitClusterData parameter, redundant assert, log typo, uint16_tint loop, isMe rationale comment → all addressed
  • 🟡 Argc-based form detection → accepting (you plan to retire both forms)

Flagging for downstream visibility: the RTS PR (#2034) still pins to ad6a45d3 and CI is fully red because the blocking slots[] bug surfaces in tests there. Bumping the submodule to this PR's HEAD (or to the merge commit) should unblock it.

@gabsow gabsow marked this pull request as ready for review May 26, 2026 07:20
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4de0d20. Configure here.

Comment thread src/cluster.c
clusterCtx.CurrCluster->slots[k] = aMasterNode;
}

RedisModule_ClusterFreeSlotRanges(mr_staticCtx, slots);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing thread-safe context lock for Redis API calls

High Severity

SetClusterDataShortForm calls RedisModule_GetClusterNodesList, RedisModule_GetClusterNodeInfo, RedisModule_GetClusterNodeSlotRanges, and their corresponding free functions using mr_staticCtx without acquiring RedisModule_ThreadSafeContextLock. This function runs on the event loop's dedicated pthread, not the Redis main thread. The existing MR_RefreshClusterData — which runs in the same threading context — correctly wraps its RedisModule_GetClusterNodeInfo and RedisModule_Call invocations with lock/unlock pairs. The omission here creates a data race on shared server state.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4de0d20. Configure here.

@gabsow gabsow merged commit 47a8a68 into master May 26, 2026
8 of 14 checks passed
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 26, 2026
Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 26, 2026
Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).
gabsow added a commit that referenced this pull request May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in
the sanitizer build: the short-form CLUSTERSET path was passing node IDs
from RedisModule_GetClusterNodesList -- which redis core allocates as
exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight
into MR_CreateNode, which treats the id as a C string (MR_GetNode dict
lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the
40-byte allocation.

This bug was introduced by the short-form path in #92. #94 didn't touch
this code; it's just that the strengthened test_short_form_clusterset on
the TS side surfaced the existing bug under the sanitizer build.

Minimal fix at the call site (SetClusterDataShortForm only):
  - Copy each id from nodeList[i] into a null-terminated stack buffer
    of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body.
  - Pass that buffer through the rest of the iteration -- to
    RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the
    %s in the failure-warning Log is now safe too), to
    RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode.

MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged.
The change is scoped strictly to the one caller that violated the
"id must be null-terminated" contract.

Stack from ASan (read of size 41 at a 40-byte allocation):
  strlen -> zstrdup_usable -> zstrdup -> RM_Strdup
    -> MR_CreateNode (cluster.c:844)
    -> SetClusterDataShortForm (cluster.c:1153)
gabsow added a commit that referenced this pull request May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in
the sanitizer build: the short-form CLUSTERSET path was passing node IDs
from RedisModule_GetClusterNodesList -- which redis core allocates as
exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight
into MR_CreateNode, which treats the id as a C string (MR_GetNode dict
lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the
40-byte allocation.

This bug was introduced by the short-form path in #92. #94 didn't touch
this code; it's just that the strengthened test_short_form_clusterset on
the TS side surfaced the existing bug under the sanitizer build.

Minimal fix at the call site (SetClusterDataShortForm only):
  - Copy each id from nodeList[i] into a null-terminated stack buffer
    of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body.
  - Pass that buffer through the rest of the iteration -- to
    RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the
    %s in the failure-warning Log is now safe too), to
    RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode.

MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged.
The change is scoped strictly to the one caller that violated the
"id must be null-terminated" contract.

Stack from ASan (read of size 41 at a 40-byte allocation):
  strlen -> zstrdup_usable -> zstrdup -> RM_Strdup
    -> MR_CreateNode (cluster.c:844)
    -> SetClusterDataShortForm (cluster.c:1153)
gabsow added a commit that referenced this pull request May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in
the sanitizer build: the short-form CLUSTERSET path was passing node IDs
from RedisModule_GetClusterNodesList -- which redis core allocates as
exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight
into MR_CreateNode, which treats the id as a C string (MR_GetNode dict
lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the
40-byte allocation.

This bug was introduced by the short-form path in #92. #94 didn't touch
this code; it's just that the strengthened test_short_form_clusterset on
the TS side surfaced the existing bug under the sanitizer build.

Minimal fix at the call site (SetClusterDataShortForm only):
  - Copy each id from nodeList[i] into a null-terminated stack buffer
    of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body.
  - Pass that buffer through the rest of the iteration -- to
    RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the
    %s in the failure-warning Log is now safe too), to
    RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode.

MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged.
The change is scoped strictly to the one caller that violated the
"id must be null-terminated" contract.

Stack from ASan (read of size 41 at a 40-byte allocation):
  strlen -> zstrdup_usable -> zstrdup -> RM_Strdup
    -> MR_CreateNode (cluster.c:844)
    -> SetClusterDataShortForm (cluster.c:1153)
gabsow added a commit that referenced this pull request May 27, 2026
…95)

ASan caught a heap-buffer-overflow during test_short_form_clusterset in
the sanitizer build: the short-form CLUSTERSET path was passing node IDs
from RedisModule_GetClusterNodesList -- which redis core allocates as
exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight
into MR_CreateNode, which treats the id as a C string (MR_GetNode dict
lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the
40-byte allocation.

This bug was introduced by the short-form path in #92. #94 didn't touch
this code; it's just that the strengthened test_short_form_clusterset on
the TS side surfaced the existing bug under the sanitizer build.

Minimal fix at the call site (SetClusterDataShortForm only):
  - Copy each id from nodeList[i] into a null-terminated stack buffer
    of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body.
  - Pass that buffer through the rest of the iteration -- to
    RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the
    %s in the failure-warning Log is now safe too), to
    RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode.

MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged.
The change is scoped strictly to the one caller that violated the
"id must be null-terminated" contract.

Stack from ASan (read of size 41 at a 40-byte allocation):
  strlen -> zstrdup_usable -> zstrdup -> RM_Strdup
    -> MR_CreateNode (cluster.c:844)
    -> SetClusterDataShortForm (cluster.c:1153)
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 27, 2026
… add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2038)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* Fix event-ci.yml redis-ref: unstable -> 8.8

The TODO in .github/workflows/event-ci.yml prepare-values step
("redis-ref=unstable # todo change per version/tag") was never
done when this branch was cut. Result: CI was running against redis
unstable instead of the matching release-line redis, so failures
on this branch reflected unstable-redis behavior rather than the
target release-line behavior.

Fix the hardcoded ref to match the branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* MOD-15749 test_short_form_clusterset: skip on OSS 8.x

Short-form CLUSTERSET needs RM_GetClusterNodeSlotRanges
(redis/redis#14953), unavailable in OSS 8.x. The feature ships
against RE-private redis; this branch always skips. Master / RE
branches keep the assertions intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* MOD-15749 Adopt Tal's REDISMODULE_MAIN Makefile fix (#2043)

Mirror the fix already applied on 8.4 / 8.6 cherry-pick PRs. Background:

The Makefile typo DREDISMODULE_MAIN (introduced 2024-08-22 by MOD-8014,
4145c70) meant the build was never setting REDISMODULE_MAIN. RTS .o
files emitted no storage for SDK function pointers; the .so linked only
because LibMR's static library happened to provide storage for every
pointer LibMR's vendored redismodule.h declared, plus a handful of
hand-patched forward declarations in src/common.h that grew each time
someone added a new SDK API not in LibMR's header.

On 8.8 the typo never manifested as a CI failure because:
  - 8.8 src does call the new APIs (GetUserUsername, EnablePostpone*)
    after MOD-15262, but src/common.h already had matching forward
    declarations from de204c2 (Tal, 2026-05-04) added as a workaround.
  - So the .so linked even with the typo.

Applying Tal's #2043 fix here for consistency and correctness:
  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).

Net result: module.o now emits storage for all ~404 SDK pointers from
the SDK header alone. The common.h forward declarations on 8.8 become
dead code, but are intentionally left in place here -- they will be
cleaned up by the master cleanup PR (#2044) and propagated to 8.x
branches once that lands. Keeping this commit minimal preserves the
cherry-pick PR's scope and avoids merge conflict risk with #2044.

Verified locally against redis 8.4.3 OSS: module loads cleanly,
TS.CREATE / TS.ADD work. nm shows GetUserUsername / EnablePostpone* /
DisablePostpone* are now `C` (common) in module.o.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix remaining redis-ref defaults: 'unstable' -> '8.8'

event-ci.yml was already fixed (the canonical PR-triggered workflow),
but three other workflows still default to 'unstable':

  - event-tag.yml          (tag-event releases)
  - event-nightly.yml      (nightly schedule)
  - flow-random-traffic.yml (random-OS traffic test, workflow_call+dispatch)

Same pattern as the earlier event-ci.yml fix: the TODO comment
("todo change per version/tag") was never honored when this branch
was cut, so manual / nightly / tag triggers on the 8.8 branch would
run against redis unstable rather than the matching 8.8 release.

Also drops the now-stale TODO comment on event-tag.yml since the
value is no longer "to-do".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2041)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x

Single squashed change replacing the SDK-pin back-and-forth from
prior commits. Net effect vs the cherry-pick:

  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix
    from Tal's #2043). The macro must be set by the build so it is in
    effect before the force-included common.h pulls in redismodule.h.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).
  - tests/flow/test_asm.py::test_short_form_clusterset: always skip.
    The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953),
    unavailable in OSS 8.x; ships against RE-private. Master / RE
    branches keep the assertions intact.

Verified locally against built redis 8.4.3 OSS: module loads cleanly,
TS.CREATE / TS.ADD work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop stale "todo change per version/tag" comments

event-ci.yml and event-tag.yml already had the redis-ref value set
to '8.4', so the TODO comments next to those values are obsolete.
Removed for hygiene -- same cleanup applied to other 8.x branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump LibMR pin to master (#92, #94, #95 landed on master)

LibMR master (44d5025f) is tree-equivalent to the side-branch
(51e3ad1c) we were tracking -- same three MOD-15749 commits
(#92 add short-form, #94 NULL check, #95 nodeId overflow) on the
same base (2857cfec). The side-branch was created as a precaution
in case LibMR master had unrelated changes incompatible with 8.x;
turns out master's tip is just those three PRs, so no conflict.

Switching to master per Tal's review comment ("align to master").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2040)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x

Single squashed change replacing the SDK-pin back-and-forth from
prior commits. Net effect vs the cherry-pick:

  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix
    from Tal's #2043). The macro must be set by the build so it is in
    effect before the force-included common.h pulls in redismodule.h.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).
  - tests/flow/test_asm.py::test_short_form_clusterset: always skip.
    The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953),
    unavailable in OSS 8.x; ships against RE-private. Master / RE
    branches keep the assertions intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix remaining redis-ref defaults: 'unstable' -> '8.6'

flow-random-traffic.yaml still defaulted to 'unstable' on two inputs.
Manual / workflow_call triggers on the 8.6 branch would run against
redis unstable instead of redis 8.6.

Also drops the now-stale TODO comments on event-ci.yml and
event-tag.yml since their values are no longer "to-do".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump LibMR pin to master (#92, #94, #95 landed on master)

LibMR master (44d5025f) is tree-equivalent to the side-branch
(51e3ad1c) we were tracking -- same three MOD-15749 commits
(#92 add short-form, #94 NULL check, #95 nodeId overflow) on the
same base (2857cfec). The side-branch was created as a precaution
in case LibMR master had unrelated changes incompatible with 8.x;
turns out master's tip is just those three PRs, so no conflict.

Switching to master per Tal's review comment ("align to master").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants