MOD-15749 Replace short-form CLUSTERSET assert with soft NULL check#94
Merged
Merged
Conversation
The short-form CLUSTERSET handler asserts that RedisModule_GetClusterNodeSlotRanges is non-NULL. If the module is loaded against a redis build that does not export this API (e.g. OSS Redis 8.4 without a private backport of redis/redis#14953), the function pointer stays NULL after RedisModule_Init and the assert crashes the shard the moment any caller sends short-form CLUSTERSET. Replace the assert with a soft NULL check: log a warning naming the missing API and the recovery paths (long-form CLUSTERSET, REFRESHCLUSTER), then return without touching cluster state. This matches the defensiveness pattern in RediSearch's RedisEnterprise_ParseTopology (src/coord/rmr/redise.c:114), which gates on both `argc == 3` and the function pointer. Behavior summary: - Redis with the API + short-form CLUSTERSET : works (unchanged) - Redis with the API + long-form CLUSTERSET : works (unchanged) - Redis without the API + long-form CLUSTERSET : works (unchanged, long form carries the topology in its args) - Redis without the API + short-form CLUSTERSET : before: assert -> process crash after: warning logged, cluster stays unconfigured, shard alive In Enterprise deployments the DMC is supposed to gate dispatch on module version (RED-197643) so the "redis without the API + short form" cell is not expected; in OSS there is no DMC and any operator/test that sends short-form CLUSTERSET against an old redis would hit this path. The change is keyed on module behavior, not deployment mode. Testing: LibMR's CI infrastructure lacks a way to mock-load the module against a redis build without the API (the gap tracked by MOD-15862), so this change is currently exercised only via code review. The RedisTimeSeries flow test test_short_form_clusterset implicitly verifies the happy path against a redis that does export the API.
491c27b to
0de9150
Compare
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
May 26, 2026
Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master.
TalBarYakar
reviewed
May 26, 2026
|
|
||
| // Verify we have the help we need from the server | ||
| RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL); | ||
| // Short-form CLUSTERSET requires RedisModule_GetClusterNodeSlotRanges, which |
Contributor
There was a problem hiding this comment.
remove this comment - it will be irrelevant after it merges in redis
Contributor
Author
There was a problem hiding this comment.
Trimmed in b6a223c — kept the load-bearing 'this pointer may be NULL on OSS without the backport' bit (otherwise a future reader wouldn't know when this branch fires), dropped the timeline / PR-link rot. Let me know if you'd prefer it removed entirely.
Per review on #94: shrink the explanatory comment to the load-bearing 'why' (the API may be NULL on some Redis builds) and drop the timeline / PR-reference bits that age out once the redis-core PR has merged.
TalBarYakar
approved these changes
May 26, 2026
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
May 26, 2026
Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off.
2 tasks
gabsow
added a commit
that referenced
this pull request
May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)
gabsow
added a commit
that referenced
this pull request
May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)
gabsow
added a commit
that referenced
this pull request
May 26, 2026
ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)
gabsow
added a commit
that referenced
this pull request
May 27, 2026
…95) ASan caught a heap-buffer-overflow during test_short_form_clusterset in the sanitizer build: the short-form CLUSTERSET path was passing node IDs from RedisModule_GetClusterNodesList -- which redis core allocates as exactly REDISMODULE_NODE_ID_LEN bytes with no null terminator -- straight into MR_CreateNode, which treats the id as a C string (MR_GetNode dict lookup, MR_STRDUP, strcmp). strlen and strcmp walk one byte past the 40-byte allocation. This bug was introduced by the short-form path in #92. #94 didn't touch this code; it's just that the strengthened test_short_form_clusterset on the TS side surfaced the existing bug under the sanitizer build. Minimal fix at the call site (SetClusterDataShortForm only): - Copy each id from nodeList[i] into a null-terminated stack buffer of size REDISMODULE_NODE_ID_LEN + 1 at the top of the loop body. - Pass that buffer through the rest of the iteration -- to RedisModule_GetClusterNodeInfo (uses fixed length anyway, but the %s in the failure-warning Log is now safe too), to RedisModule_GetClusterNodeSlotRanges, and to MR_CreateNode. MR_CreateNode is unchanged. The long-form CLUSTERSET path is unchanged. The change is scoped strictly to the one caller that violated the "id must be null-terminated" contract. Stack from ASan (read of size 41 at a 40-byte allocation): strlen -> zstrdup_usable -> zstrdup -> RM_Strdup -> MR_CreateNode (cluster.c:844) -> SetClusterDataShortForm (cluster.c:1153)
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
May 27, 2026
… add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
3 tasks
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
Jun 1, 2026
…2038) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * Fix event-ci.yml redis-ref: unstable -> 8.8 The TODO in .github/workflows/event-ci.yml prepare-values step ("redis-ref=unstable # todo change per version/tag") was never done when this branch was cut. Result: CI was running against redis unstable instead of the matching release-line redis, so failures on this branch reflected unstable-redis behavior rather than the target release-line behavior. Fix the hardcoded ref to match the branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 test_short_form_clusterset: skip on OSS 8.x Short-form CLUSTERSET needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x. The feature ships against RE-private redis; this branch always skips. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * MOD-15749 Adopt Tal's REDISMODULE_MAIN Makefile fix (#2043) Mirror the fix already applied on 8.4 / 8.6 cherry-pick PRs. Background: The Makefile typo DREDISMODULE_MAIN (introduced 2024-08-22 by MOD-8014, 4145c70) meant the build was never setting REDISMODULE_MAIN. RTS .o files emitted no storage for SDK function pointers; the .so linked only because LibMR's static library happened to provide storage for every pointer LibMR's vendored redismodule.h declared, plus a handful of hand-patched forward declarations in src/common.h that grew each time someone added a new SDK API not in LibMR's header. On 8.8 the typo never manifested as a CI failure because: - 8.8 src does call the new APIs (GetUserUsername, EnablePostpone*) after MOD-15262, but src/common.h already had matching forward declarations from de204c2 (Tal, 2026-05-04) added as a workaround. - So the .so linked even with the typo. Applying Tal's #2043 fix here for consistency and correctness: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). Net result: module.o now emits storage for all ~404 SDK pointers from the SDK header alone. The common.h forward declarations on 8.8 become dead code, but are intentionally left in place here -- they will be cleaned up by the master cleanup PR (#2044) and propagated to 8.x branches once that lands. Keeping this commit minimal preserves the cherry-pick PR's scope and avoids merge conflict risk with #2044. Verified locally against redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. nm shows GetUserUsername / EnablePostpone* / DisablePostpone* are now `C` (common) in module.o. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.8' event-ci.yml was already fixed (the canonical PR-triggered workflow), but three other workflows still default to 'unstable': - event-tag.yml (tag-event releases) - event-nightly.yml (nightly schedule) - flow-random-traffic.yml (random-OS traffic test, workflow_call+dispatch) Same pattern as the earlier event-ci.yml fix: the TODO comment ("todo change per version/tag") was never honored when this branch was cut, so manual / nightly / tag triggers on the 8.8 branch would run against redis unstable rather than the matching 8.8 release. Also drops the now-stale TODO comment on event-tag.yml since the value is no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
Jun 1, 2026
…2041) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Verified locally against built redis 8.4.3 OSS: module loads cleanly, TS.CREATE / TS.ADD work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Drop stale "todo change per version/tag" comments event-ci.yml and event-tag.yml already had the redis-ref value set to '8.4', so the TODO comments next to those values are obsolete. Removed for hygiene -- same cleanup applied to other 8.x branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow
added a commit
to RedisTimeSeries/RedisTimeSeries
that referenced
this pull request
Jun 1, 2026
…2040) * MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034) * Link to new version of LibMR with optional AUTH * Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER * Added a test for the new short form of CLUSTERSET * MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed) Picks up the merged RedisGears/LibMR#92 — which now includes: - slots[] first-range population fix - slotless-node handling - signed loop index over num_ranges Built against the merged redis core API RM_GetClusterNodeSlotRanges (redis/redis#14953, in redis unstable as of 2026-05-25). * MOD-15749 Strengthen test_short_form_clusterset (review feedback) Three additive changes to the new test in tests/flow/test_asm.py; no existing assertion is removed or weakened: 1. Slot-routed validation. The original assertions only exercise LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via RedisModule_GetClusterNodesList). The whole reason for the companion LibMR change (#92) was a bug in clusterCtx.CurrCluster->slots[] population, which the existing asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count block that goes through MR_RunReshuffleStep -> MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads slots[hslot]. This is the exact table short-form CLUSTERSET populates from RedisModule_GetClusterNodeSlotRanges. 2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion: '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'), so an empty/error reply no longer trivially passes. 3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET'); previously the return value was discarded and a CLUSTERSET error on any shard would only surface via the downstream symptom. Also lifted samples_per_key into a local so the new GROUPBY block can reference it (value unchanged). * MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix) Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on RedisGears/LibMR#94. That commit replaces the RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in SetClusterDataShortForm with a log + early return -- nothing more. Earlier iteration also added error propagation through the CLUSTERSET reply, but the consumer side (DMC) doesn't read the reply anyway and RediSearch's analogous code path (redise.c:114) doesn't propagate either. Dropped the upgrade as YAGNI. The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034 to TS master until LibMR#94 squash-merges and the pin is re-bumped to LibMR master. * MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges) Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's RedisLabsModules/RedisModulesSDK#88, which adds RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only, matching the redis-core decision to backport redis/redis#14953 only to private Enterprise / RLEC). TS source does not directly call RedisModule_GetClusterNodeSlotRanges today -- LibMR is the consumer of that API. But keeping the SDK header aligned with the merged redis core API matters because: - TS src/ includes RedisModulesSDK/redismodule.h directly and depends on the same SlotRange types (libmr_commands.c, libmr_integration.c already use RedisModuleSlotRange / RedisModuleSlotRangeArray / RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges). - Future TS code may need the per-node API for orchestrator-style work. - Avoids header drift between LibMR's vendored redismodule.h and the shared SDK header. This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once SDK#88 lands, follow-up re-pin to SDK master before merging TS to master. * MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed Both upstream PRs are merged: - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET) landed as merge commit 8f84041 on LibMR master. - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges declaration) landed as merge commit 58ae8fc on SDK master. Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 -> 58ae8fc, both moving from PR-branch SHAs to the proper master HEAD. This removes the "do not merge to TS master" caveat that was on the previous two bump commits. TS#2034 is now safe to merge once review + CI sign off. * MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs) Bump deps/LibMR from 8f84041 (master) to the HEAD of RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow). That branch contains the fix for the ASan heap-buffer-overflow that the linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode was treating the 40-byte non-null-terminated node IDs returned by RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN. Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR either confirms the fix (red -> green) or shows whatever ASan finds next. Do not merge TS#2034 to master until LibMR#95 squash-merges and this pin is bumped back to LibMR master. * MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call) Updated LibMR#95 to address Cursor Bugbot's high-severity finding: the original commit normalized the node ID only inside the body of MR_CreateNode, but the assert at the top of the function -- which calls MR_GetNode(id) -> mr_dictFind -- already used the raw non- null-terminated id, hitting the same overflow class. The fixup commit (6741c11) moves the normalization to the very top of MR_CreateNode so MR_GetNode operates on a safe null-terminated copy. Re-pin here to pick that up. * MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization) LibMR#95 was reworked to the minimal shape: - Fix is scoped to SetClusterDataShortForm only. - MR_CreateNode is byte-identical to LibMR master. - 9 lines added in the one function whose caller-side contract violation introduced the bug. Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the minimal fix. * MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy) * MOD-15749 Bump LibMR to #95 (final shape: matches realId style) * MOD-15749 Re-pin LibMR to master after #95 landed LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs) merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR master HEAD. This removes the "do not merge to TS master" caveat. TS#2034 is now pinned to a clean LibMR master commit and is functionally ready to merge once review + CI sign off. * MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback) Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets broadcast returning None or [] silently skip the per-shard assert -- the test would proceed as if CLUSTERSET succeeded on all shards. Add an explicit pre-check that env.broadcast returned a non-None list of exactly env.shardsCount entries before iterating, so framework-level surprises fail at the right place instead of leaking into a confusing TS.QUERYINDEX count mismatch later. Pure strengthening: existing per-shard reply assertion is unchanged. * MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback) Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots index, so a regression that leaves most entries NULL (the exact short-form CLUSTERSET failure mode) still passes by luck." The existing GROUPBY-by-label assertion uses one label value ("test"), so the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry -- which can pass even if most slots[] entries are NULL. Add a stronger slot-routed assertion: 1. Tag each key with a `group=g<i % number_of_groups>` label using a per-key callable in fill_some_data (purely additive -- existing string/int label values still work as-is). 2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group REDUCE count and assert we see exactly number_of_groups distinct group-blocks, each with samples_per_key timestamps and keys_per_group as the per-timestamp count. The previous single-group assertion is unchanged -- still validates the basic aggregation path. * MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET env.broadcast() in rltest is fire-and-forget -- it returns None, not a list of replies. The previous `for ... in enumerate(replies or [])` pattern silently absorbed this (Cursor Bugbot's exact finding: the per-shard reply check was dead code). My earlier strengthening correctly exposed that with `assert replies is not None and len(...)` which fired immediately in CI because broadcast really does return None. Replace with an explicit per-shard send: getConnection(i).execute_command() for each shard, asserting each reply is OK. Same semantic intent as the original broadcast (every shard gets the short-form CLUSTERSET) but now we actually verify each reply. * MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check Tom pointed out: the DMC dispatches CLUSTERSET in production via broadcast (fire-and-forget to every shard), so the test should mirror that wire-level pattern. The previous iteration (per-shard sequential sends) sidestepped the env.broadcast-returns-None problem but lost fidelity to how the orchestrator actually issues this command. Cursor Bugbot's original concern was specifically that the per-shard reply check (`for ... in enumerate(replies or [])`) was dead code, because env.broadcast returns None. The reply check never executed since the test was added -- so removing it isn't a coverage loss, just deletion of code that never ran. Restore env.broadcast and remove the unreachable reply loop. The downstream assertions (TS.QUERYINDEX expecting number_of_keys after CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting number_of_groups blocks) catch any actual failure end-to-end. * MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check DMC's actual production pattern (per LibMR cluster.c:991): send timeseries.CLUSTERSET to ONE shard; that shard self-configures and then propagates the topology to peers by sending them CLUSTERSETFROMSHARD on rg.hello / reconnect. There's no DMC-side broadcast. The previous test used env.broadcast(...) which sent to every shard directly -- not what the DMC does in production, and skipped exercising the FROMSHARD propagation that production relies on. Switch to the DMC pattern: - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK check that env.broadcast couldn't give us). - Poll INFOCLUSTER on every shard until none returns the "no cluster mode" simple-string (LibMR cluster.c:1472, returned while clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second. - On timeout, report which shards never received the propagated CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream QUERYINDEX-count mismatch. This is a stronger end-to-end check: the single-shard send + propagation path is exactly what the DMC relies on, so a regression in either the short-form parser OR the FROMSHARD forwarding now fails the test explicitly instead of via a confusing downstream symptom. * MOD-15749 test_short_form_clusterset: address self-review findings Three nits found on fresh self-review of the propagation block: 1. Scale propagation deadline for valgrind / sanitizer (real fix). The fixed 10s deadline is fine on normal builds but under valgrind every operation is ~30x slower; the reconnect + async-send chain that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds. Scale to 60s when VALGRIND or SANITIZER, matching the existing pattern at line 41 of this file. Avoids a real flake risk in the linux-valgrind / x64-jammy job. 2. Cache per-shard connections outside the polling loop. The 100ms inner loop was opening/closing a fresh connection per shard per iteration (worst case ~300 fresh sockets). Hoist the connections once before the loop -- correctness unchanged, just stops thrashing redis client sockets. 3. Wrong variable in assertion message. `assert withlabels == [], filtered_by` -- the message printed the group name instead of the actually-unexpected withlabels content. Cosmetic but misleading on failure. Fixed to print `withlabels`. * MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style The test had grown several paragraphs of explanatory comments and fragmented every redis call into its own `with env.getConnection(0)` block. The other tests in this file (test_asm_with_data, test_asm_with_data_and_queries_during_migrations, etc.) use a single `conn = env.getConnection(0)` hoist and one-line comments at section transitions -- no docstrings, no paragraph-style explanations. Align style: - Single conn hoist; reuse for every per-shard-0 call. - Replace 4-7-line paragraph comments with one-line section notes. - Drop the trivially-always-true `number_of_keys % number_of_groups` self-check. - Drop the verbose f-string assertion message on the OK reply check (other simple equality asserts in the file are messageless). - Compress fill_some_data's callable-label docstring to one line. No assertion is removed or weakened -- same pre/post-CLUSTERSET QUERYINDEX checks, same propagation poll, both single-group and multi-group GROUPBY assertions retained. The valgrind-aware deadline scaling stays in place (just inlined). Net diff: 112 lines down to 54. * MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll INFOCLUSTER is registered with the `internal` command flag in OSS builds (LibMR cluster.c sets `command_flags = "readonly deny-script internal"` when RedisModule_GetInternalSecret is available). Internal commands are reachable only over the cluster bus -- not from a regular client connection -- so the previous INFOCLUSTER poll always saw "unknown command 'timeseries.INFOCLUSTER'". Switch to polling TS.QUERYINDEX on shard 0: before propagation it returns the local subset (~number_of_keys/shardsCount), after propagation it returns the full set. This uses a publicly-callable command, observes the actual feature we care about (cross-shard fan-out), and replaces the broken probe with one that runs in the same OSS-cluster mode as the rest of the test. Verified locally: test passes in 6 seconds against redis unstable (post-#14953) on macos-arm64. --------- Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com> Co-authored-by: Tom Gabsow <tom.gabsow@redis.com> (cherry picked from commit b0ecf8a) * MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x Single squashed change replacing the SDK-pin back-and-forth from prior commits. Net effect vs the cherry-pick: - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix from Tal's #2043). The macro must be set by the build so it is in effect before the force-included common.h pulls in redismodule.h. - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be too late once common.h is force-included). - tests/flow/test_asm.py::test_short_form_clusterset: always skip. The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953), unavailable in OSS 8.x; ships against RE-private. Master / RE branches keep the assertions intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix remaining redis-ref defaults: 'unstable' -> '8.6' flow-random-traffic.yaml still defaulted to 'unstable' on two inputs. Manual / workflow_call triggers on the 8.6 branch would run against redis unstable instead of redis 8.6. Also drops the now-stale TODO comments on event-ci.yml and event-tag.yml since their values are no longer "to-do". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Bump LibMR pin to master (#92, #94, #95 landed on master) LibMR master (44d5025f) is tree-equivalent to the side-branch (51e3ad1c) we were tracking -- same three MOD-15749 commits (#92 add short-form, #94 NULL check, #95 nodeId overflow) on the same base (2857cfec). The side-branch was created as a precaution in case LibMR master had unrelated changes incompatible with 8.x; turns out master's tip is just those three PRs, so no conflict. Switching to master per Tal's review comment ("align to master"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #92. Replace the hard
RedisModule_AssertatSetClusterDataShortForm's entry with a logged warning + early return, so a module loaded against a redis build that doesn't exportRedisModule_GetClusterNodeSlotRangessurvives a short-formCLUSTERSETinstead of aborting the shard.Why
#92added the short-form path which relies onRedisModule_GetClusterNodeSlotRanges(added to redis/redis in #14953, merged 2026-05-25 tounstable, planned for OSS 8.10 and Enterprise 8.4 backport).REDISMODULE_GET_API(GetClusterNodeSlotRanges)at module init silently leaves the function pointer NULL if the host redis doesn't expose the API —RedisModule_GetApi's return value is unchecked, and the module continues to load. The next short-formCLUSTERSETthen tripsRedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL)and crashes the shard.In Enterprise this is supposed to be prevented by the DMC dispatching by module version (RED-197643) so short-form is only sent to modules loaded against a redis that has the API. But:
CLUSTERSETto a module on OSS Redis 8.4 without the API takes the shard down.RedisEnterprise_ParseTopologygates on bothargc == 3and the function pointer being non-NULL, then falls through to long-form parsing — same defensive pattern (RediSearch#9779).Behavior matrix
Recovery from the new "no API + short form" outcome is the same path OSS already uses: send
REFRESHCLUSTER, or a long-formCLUSTERSETfrom a topology-aware caller.Test plan
LibMR CI doesn't currently have infrastructure to mock-load the module against a redis build without the new API (the broader testability gap tracked by MOD-15862), so this change is exercised via:
redise.c:114.test_short_form_clustersetin RedisTimeSeries#2034, which runs LibMR pinned to a post-MOD-15749 Add support for short form of CLUSTERSET #92 master against redis built fromunstable(which has the API).A LibMR-side unit test for the no-API case would require the redismock-style mock cluster API plumbing RediSearch added in #9779 — out of scope here.
Risk
Very low. One assert -> one NULL-check + log + early return. No behavior change on hosts with the API. No public API change.
Note
Low Risk
Single defensive guard in cluster setup; no change when the Redis module API is present; only affects the unsupported short-form path.
Overview
Short-form
CLUSTERSETno longer aborts the shard when the host Redis build does not exportRedisModule_GetClusterNodeSlotRanges(optional API resolved at module load viaRedisModule_GetApi).In
SetClusterDataShortForm, the entryRedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL)is replaced with a NULL check, a warning log that points operators to long-formCLUSTERSETorREFRESHCLUSTER, and an early return. Cluster topology is left unchanged / unconfigured instead of crashing; behavior on Redis builds that do provide the API is unchanged.Reviewed by Cursor Bugbot for commit b6a223c. Bugbot is set up for automated code reviews on this repo. Configure here.