Add RLEC cluster node slot ranges API#88
Conversation
8e0356c to
43ccfa7
Compare
Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.
gabsow
left a comment
There was a problem hiding this comment.
Reviewing — looks good, signature matches the merged redis/redis#14953 byte-for-byte and the RLEC-only placement is consistent with the decision to skip the OSS 8.4/8.6/8.8 backport. One observation, inline.
| REDISMODULE_API int (*RedisModule_IsKeyInRam)(RedisModuleCtx *ctx, RedisModuleString *key) REDISMODULE_ATTR; | ||
| REDISMODULE_API int (*RedisModule_EnablePostponeClients)(void) REDISMODULE_ATTR; | ||
| REDISMODULE_API int (*RedisModule_DisablePostponeClients)(void) REDISMODULE_ATTR; | ||
| REDISMODULE_API RedisModuleSlotRangeArray *(*RedisModule_GetClusterNodeSlotRanges)(RedisModuleCtx *ctx, const char *nodeid) REDISMODULE_ATTR; |
There was a problem hiding this comment.
Worth considering as a follow-up: redis/redis#14953 introduced three APIs together — GetClusterNodeSlotRanges (this one), ClusterGetLocalSlotRanges, and ClusterFreeSlotRanges. The latter two are currently declared in the generic redismodule.h (lines 1358–1359 of master), and REDISMODULE_GET_API(ClusterGetLocalSlotRanges) is wired up at line 1759.
If the "OSS Redis 8.4/8.6/8.8 won't get the backport" decision applies to all three (which seems implied — they're a single redis-core PR), the generic header is currently misleading: modules compiled against this SDK get NULL function pointers at runtime when loaded on OSS Redis without the backport, same crash class we're guarding against in RedisGears/LibMR#94.
Not a blocker for this PR, but if your read of the backport policy matches mine, moving the other two to redismodule-rlec.h here (or in a follow-up) would keep the SDK aligned with actual API availability.
There was a problem hiding this comment.
ClusterGetLocalSlotRanges and ClusterFreeSlotRanges are already present in redismodule.h, both on the master and 8.4 directory variants. These were added officially in OSS 8.4, so no special handling is required for them
Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off.
… add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
…2038) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * Fix event-ci.yml redis-ref: unstable -> 8.8 The TODO in .github/workflows/event-ci.yml prepare-values step ("redis-ref=unstable # todo change per version/tag") was never done when this branch was cut. Result: CI was running against redis unstable instead of the matching release-line redis, so failures on this branch reflected unstable-redis behavior rather than the target release-line behavior. Fix the hardcoded ref to match the branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 test_short_form_clusterset: skip on OSS 8.x Short-form CLUSTERSET needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x. The feature ships against RE-private redis; this branch always skips. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 Adopt Tal's REDISMODULE_MAIN Makefile fix (#2043) Mirror the fix already applied on 8.4 / 8.6 cherry-pick PRs. Background: The Makefile typo DREDISMODULE_MAIN (introduced 2024-08-22 by MOD-8014, 4145c70) meant the build was never setting REDISMODULE_MAIN. RTS .o files emitted no storage for SDK function pointers; the .so linked only because LibMR's static library happened to provide storage for every pointer LibMR's vendored redismodule.h declared, plus a handful of hand-patched forward declarations in src/common.h that grew each time someone added a new SDK API not in LibMR's header. On 8.8 the typo never manifested as a CI failure because: - 8.8 src does call the new APIs (GetUserUsername, EnablePostpone*) after MOD-15262, but src/common.h already had matching forward declarations from de204c2 (Tal, 2026-05-04) added as a workaround. - So the .so linked even with the typo. Applying Tal's #2043 fix here for consistency and correctness: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). Net result: module.o now emits storage for all ~404 SDK pointers from the SDK header alone. The common.h forward declarations on 8.8 become dead code, but are intentionally left in place here -- they will be cleaned up by the master cleanup PR (#2044) and propagated to 8.x branches once that lands. Keeping this commit minimal preserves the cherry-pick PR's scope and avoids merge conflict risk with #2044. Verified locally against redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. nm shows GetUserUsername / EnablePostpone* / DisablePostpone* are now `C` (common) in module.o. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.8' event-ci.yml was already fixed (the canonical PR-triggered workflow), but three other workflows still default to 'unstable': - event-tag.yml (tag-event releases) - event-nightly.yml (nightly schedule) - flow-random-traffic.yml (random-OS traffic test, workflow_call+dispatch) Same pattern as the earlier event-ci.yml fix: the TODO comment ("todo change per version/tag") was never honored when this branch was cut, so manual / nightly / tag triggers on the 8.8 branch would run against redis unstable rather than the matching 8.8 release. Also drops the now-stale TODO comment on event-tag.yml since the value is no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…2041) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Verified locally against built redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Drop stale "todo change per version/tag" comments event-ci.yml and event-tag.yml already had the redis-ref value set to '8.4', so the TODO comments next to those values are obsolete. Removed for hygiene -- same cleanup applied to other 8.x branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…2040) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.6' flow-random-traffic.yaml still defaulted to 'unstable' on two inputs. Manual / workflow_call triggers on the 8.6 branch would run against redis unstable instead of redis 8.6. Also drops the now-stale TODO comments on event-ci.yml and event-tag.yml since their values are no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Add the RLEC-only
RedisModule_GetClusterNodeSlotRangesmodule API declaration and loader entry.Details
RedisModule_GetClusterNodeSlotRangesinredismodule-rlec.h.REDISMODULE_GET_API(GetClusterNodeSlotRanges)toREDISMODULE_RLEC_API_DEFSso RLEC builds load the API duringRedisModule_Init.redismodule.hunchanged because this API is only expected in Enterprise/RLEC for the target RediSearch backports.Verification
Header-only SDK change. Downstream RediSearch backport branches will be rebuilt and tested after bumping this submodule commit.