Skip to content

Add RLEC cluster node slot ranges API#88

Merged
GuyAv46 merged 1 commit into
masterfrom
codex/rlec-cluster-node-slot-ranges
May 26, 2026
Merged

Add RLEC cluster node slot ranges API#88
GuyAv46 merged 1 commit into
masterfrom
codex/rlec-cluster-node-slot-ranges

Conversation

@GuyAv46
Copy link
Copy Markdown
Collaborator

@GuyAv46 GuyAv46 commented May 26, 2026

Summary

Add the RLEC-only RedisModule_GetClusterNodeSlotRanges module API declaration and loader entry.

Details

  • Declares RedisModule_GetClusterNodeSlotRanges in redismodule-rlec.h.
  • Adds REDISMODULE_GET_API(GetClusterNodeSlotRanges) to REDISMODULE_RLEC_API_DEFS so RLEC builds load the API during RedisModule_Init.
  • Leaves generic redismodule.h unchanged because this API is only expected in Enterprise/RLEC for the target RediSearch backports.

Verification

Header-only SDK change. Downstream RediSearch backport branches will be rebuilt and tested after bumping this submodule commit.

@GuyAv46 GuyAv46 force-pushed the codex/rlec-cluster-node-slot-ranges branch from 8e0356c to 43ccfa7 Compare May 26, 2026 10:37
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 26, 2026
Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.
Copy link
Copy Markdown

@gabsow gabsow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing — looks good, signature matches the merged redis/redis#14953 byte-for-byte and the RLEC-only placement is consistent with the decision to skip the OSS 8.4/8.6/8.8 backport. One observation, inline.

Comment thread redismodule-rlec.h
REDISMODULE_API int (*RedisModule_IsKeyInRam)(RedisModuleCtx *ctx, RedisModuleString *key) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_EnablePostponeClients)(void) REDISMODULE_ATTR;
REDISMODULE_API int (*RedisModule_DisablePostponeClients)(void) REDISMODULE_ATTR;
REDISMODULE_API RedisModuleSlotRangeArray *(*RedisModule_GetClusterNodeSlotRanges)(RedisModuleCtx *ctx, const char *nodeid) REDISMODULE_ATTR;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth considering as a follow-up: redis/redis#14953 introduced three APIs together — GetClusterNodeSlotRanges (this one), ClusterGetLocalSlotRanges, and ClusterFreeSlotRanges. The latter two are currently declared in the generic redismodule.h (lines 1358–1359 of master), and REDISMODULE_GET_API(ClusterGetLocalSlotRanges) is wired up at line 1759.

If the "OSS Redis 8.4/8.6/8.8 won't get the backport" decision applies to all three (which seems implied — they're a single redis-core PR), the generic header is currently misleading: modules compiled against this SDK get NULL function pointers at runtime when loaded on OSS Redis without the backport, same crash class we're guarding against in RedisGears/LibMR#94.

Not a blocker for this PR, but if your read of the backport policy matches mine, moving the other two to redismodule-rlec.h here (or in a follow-up) would keep the SDK aligned with actual API availability.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterGetLocalSlotRanges and ClusterFreeSlotRanges are already present in redismodule.h, both on the master and 8.4 directory variants. These were added officially in OSS 8.4, so no special handling is required for them

@GuyAv46 GuyAv46 merged commit 58ae8fc into master May 26, 2026
@GuyAv46 GuyAv46 deleted the codex/rlec-cluster-node-slot-ranges branch May 26, 2026 11:16
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 26, 2026
Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 27, 2026
… add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2038)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* Fix event-ci.yml redis-ref: unstable -> 8.8

The TODO in .github/workflows/event-ci.yml prepare-values step
("redis-ref=unstable # todo change per version/tag") was never
done when this branch was cut. Result: CI was running against redis
unstable instead of the matching release-line redis, so failures
on this branch reflected unstable-redis behavior rather than the
target release-line behavior.

Fix the hardcoded ref to match the branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* MOD-15749 test_short_form_clusterset: skip on OSS 8.x

Short-form CLUSTERSET needs RM_GetClusterNodeSlotRanges
(redis/redis#14953), unavailable in OSS 8.x. The feature ships
against RE-private redis; this branch always skips. Master / RE
branches keep the assertions intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* MOD-15749 Adopt Tal's REDISMODULE_MAIN Makefile fix (#2043)

Mirror the fix already applied on 8.4 / 8.6 cherry-pick PRs. Background:

The Makefile typo DREDISMODULE_MAIN (introduced 2024-08-22 by MOD-8014,
4145c70) meant the build was never setting REDISMODULE_MAIN. RTS .o
files emitted no storage for SDK function pointers; the .so linked only
because LibMR's static library happened to provide storage for every
pointer LibMR's vendored redismodule.h declared, plus a handful of
hand-patched forward declarations in src/common.h that grew each time
someone added a new SDK API not in LibMR's header.

On 8.8 the typo never manifested as a CI failure because:
  - 8.8 src does call the new APIs (GetUserUsername, EnablePostpone*)
    after MOD-15262, but src/common.h already had matching forward
    declarations from de204c2 (Tal, 2026-05-04) added as a workaround.
  - So the .so linked even with the typo.

Applying Tal's #2043 fix here for consistency and correctness:
  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).

Net result: module.o now emits storage for all ~404 SDK pointers from
the SDK header alone. The common.h forward declarations on 8.8 become
dead code, but are intentionally left in place here -- they will be
cleaned up by the master cleanup PR (#2044) and propagated to 8.x
branches once that lands. Keeping this commit minimal preserves the
cherry-pick PR's scope and avoids merge conflict risk with #2044.

Verified locally against redis 8.4.3 OSS: module loads cleanly,
TS.CREATE / TS.ADD work. nm shows GetUserUsername / EnablePostpone* /
DisablePostpone* are now `C` (common) in module.o.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix remaining redis-ref defaults: 'unstable' -> '8.8'

event-ci.yml was already fixed (the canonical PR-triggered workflow),
but three other workflows still default to 'unstable':

  - event-tag.yml          (tag-event releases)
  - event-nightly.yml      (nightly schedule)
  - flow-random-traffic.yml (random-OS traffic test, workflow_call+dispatch)

Same pattern as the earlier event-ci.yml fix: the TODO comment
("todo change per version/tag") was never honored when this branch
was cut, so manual / nightly / tag triggers on the 8.8 branch would
run against redis unstable rather than the matching 8.8 release.

Also drops the now-stale TODO comment on event-tag.yml since the
value is no longer "to-do".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2041)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x

Single squashed change replacing the SDK-pin back-and-forth from
prior commits. Net effect vs the cherry-pick:

  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix
    from Tal's #2043). The macro must be set by the build so it is in
    effect before the force-included common.h pulls in redismodule.h.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).
  - tests/flow/test_asm.py::test_short_form_clusterset: always skip.
    The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953),
    unavailable in OSS 8.x; ships against RE-private. Master / RE
    branches keep the assertions intact.

Verified locally against built redis 8.4.3 OSS: module loads cleanly,
TS.CREATE / TS.ADD work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop stale "todo change per version/tag" comments

event-ci.yml and event-tag.yml already had the redis-ref value set
to '8.4', so the TODO comments next to those values are obsolete.
Removed for hygiene -- same cleanup applied to other 8.x branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump LibMR pin to master (#92, #94, #95 landed on master)

LibMR master (44d5025f) is tree-equivalent to the side-branch
(51e3ad1c) we were tracking -- same three MOD-15749 commits
(#92 add short-form, #94 NULL check, #95 nodeId overflow) on the
same base (2857cfec). The side-branch was created as a precaution
in case LibMR master had unrelated changes incompatible with 8.x;
turns out master's tip is just those three PRs, so no conflict.

Switching to master per Tal's review comment ("align to master").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request Jun 1, 2026
…2040)

* MOD-15749 Link to new version of LibMR with short-form CLUSTERSET and add tests (#2034)

* Link to new version of LibMR with optional AUTH

* Add an option to create an oss-cluster env without sending the initial REFRESHCLUSTER

* Added a test for the new short form of CLUSTERSET

* MOD-15749 Bump deps/LibMR to master (short-form CLUSTERSET landed)

Picks up the merged RedisGears/LibMR#92 — which now includes:
  - slots[] first-range population fix
  - slotless-node handling
  - signed loop index over num_ranges

Built against the merged redis core API RM_GetClusterNodeSlotRanges
(redis/redis#14953, in redis unstable as of 2026-05-25).

* MOD-15749 Strengthen test_short_form_clusterset (review feedback)

Three additive changes to the new test in tests/flow/test_asm.py; no
existing assertion is removed or weakened:

1. Slot-routed validation. The original assertions only exercise
   LibMR's node-list fan-out path (TS.QUERYINDEX dispatches via
   RedisModule_GetClusterNodesList). The whole reason for the
   companion LibMR change (#92) was a bug in
   clusterCtx.CurrCluster->slots[] population, which the existing
   asserts would not catch. Added a TS.MRANGE GROUPBY REDUCE count
   block that goes through MR_RunReshuffleStep ->
   MR_SendRecordToSlot -> MR_ClusterSendMsgBySlot, which reads
   slots[hslot]. This is the exact table short-form CLUSTERSET
   populates from RedisModule_GetClusterNodeSlotRanges.

2. Tighter lower bound on the pre-CLUSTERSET QUERYINDEX assertion:
   '0 < len(queryindex) < number_of_keys' (was '< number_of_keys'),
   so an empty/error reply no longer trivially passes.

3. Per-shard reply check on env.broadcast('timeseries.CLUSTERSET');
   previously the return value was discarded and a CLUSTERSET error
   on any shard would only surface via the downstream symptom.

Also lifted samples_per_key into a local so the new GROUPBY block
can reference it (value unchanged).

* MOD-15749 Pin LibMR to defensive-only branch HEAD (cheap fix)

Re-pin deps/LibMR from 47a8a68 (LibMR master post-#92) to 0de9150 on
RedisGears/LibMR#94. That commit replaces the
RedisModule_Assert(RedisModule_GetClusterNodeSlotRanges != NULL) in
SetClusterDataShortForm with a log + early return -- nothing more.

Earlier iteration also added error propagation through the CLUSTERSET
reply, but the consumer side (DMC) doesn't read the reply anyway and
RediSearch's analogous code path (redise.c:114) doesn't propagate either.
Dropped the upgrade as YAGNI.

The pin is still a PR-branch SHA, not LibMR master. Do not merge TS#2034
to TS master until LibMR#94 squash-merges and the pin is re-bumped to
LibMR master.

* MOD-15749 Pin RedisModulesSDK to PR #88 (RLEC GetClusterNodeSlotRanges)

Bump deps/RedisModulesSDK from 850e0bb (master) to 43ccfa7 on Guy's
RedisLabsModules/RedisModulesSDK#88, which adds
RedisModule_GetClusterNodeSlotRanges to redismodule-rlec.h (RLEC-only,
matching the redis-core decision to backport redis/redis#14953 only to
private Enterprise / RLEC).

TS source does not directly call RedisModule_GetClusterNodeSlotRanges
today -- LibMR is the consumer of that API. But keeping the SDK header
aligned with the merged redis core API matters because:
  - TS src/ includes RedisModulesSDK/redismodule.h directly and depends
    on the same SlotRange types (libmr_commands.c, libmr_integration.c
    already use RedisModuleSlotRange / RedisModuleSlotRangeArray /
    RedisModule_ClusterGetLocalSlotRanges / RedisModule_ClusterFreeSlotRanges).
  - Future TS code may need the per-node API for orchestrator-style work.
  - Avoids header drift between LibMR's vendored redismodule.h and the
    shared SDK header.

This pin is to a PR-branch SHA, same shape as the LibMR#94 pin. Once
SDK#88 lands, follow-up re-pin to SDK master before merging TS to master.

* MOD-15749 Re-pin LibMR and RedisModulesSDK to master after PRs landed

Both upstream PRs are merged:
  - RedisGears/LibMR#94 (defensive NULL check for short-form CLUSTERSET)
    landed as merge commit 8f84041 on LibMR master.
  - RedisLabsModules/RedisModulesSDK#88 (RLEC GetClusterNodeSlotRanges
    declaration) landed as merge commit 58ae8fc on SDK master.

Bump deps/LibMR 0de9150 -> 8f84041 and deps/RedisModulesSDK 43ccfa7 ->
58ae8fc, both moving from PR-branch SHAs to the proper master HEAD.

This removes the "do not merge to TS master" caveat that was on the
previous two bump commits. TS#2034 is now safe to merge once review +
CI sign off.

* MOD-15749 Pin LibMR to fix branch (heap-buffer-overflow on node IDs)

Bump deps/LibMR from 8f84041 (master) to the HEAD of
RedisGears/LibMR#95 (tom.gabsow/fix-node-id-buffer-overflow).

That branch contains the fix for the ASan heap-buffer-overflow that the
linux-sanitizer / x64-jammy job caught on the previous CI run -- MR_CreateNode
was treating the 40-byte non-null-terminated node IDs returned by
RedisModule_GetClusterNodesList as C strings. The fix replaces MR_STRDUP/strcmp
with memcpy + explicit null terminator + memcmp on REDISMODULE_NODE_ID_LEN.

Pinning to the PR-branch SHA here so the next sanitizer CI run on this PR
either confirms the fix (red -> green) or shows whatever ASan finds next.
Do not merge TS#2034 to master until LibMR#95 squash-merges and this
pin is bumped back to LibMR master.

* MOD-15749 Bump LibMR pin to #95 fixup (also covers MR_GetNode call)

Updated LibMR#95 to address Cursor Bugbot's high-severity finding:
the original commit normalized the node ID only inside the body of
MR_CreateNode, but the assert at the top of the function -- which
calls MR_GetNode(id) -> mr_dictFind -- already used the raw non-
null-terminated id, hitting the same overflow class.

The fixup commit (6741c11) moves the normalization to the very top
of MR_CreateNode so MR_GetNode operates on a safe null-terminated
copy. Re-pin here to pick that up.

* MOD-15749 Bump LibMR to #95 minimal fix (call-site normalization)

LibMR#95 was reworked to the minimal shape:
  - Fix is scoped to SetClusterDataShortForm only.
  - MR_CreateNode is byte-identical to LibMR master.
  - 9 lines added in the one function whose caller-side contract
    violation introduced the bug.

Bump deps/LibMR to the new HEAD so the sanitizer CI verifies the
minimal fix.

* MOD-15749 Bump LibMR to #95 minimal fix (null-first, then memcpy)

* MOD-15749 Bump LibMR to #95 (final shape: matches realId style)

* MOD-15749 Re-pin LibMR to master after #95 landed

LibMR#95 (fix heap-buffer-overflow on short-form CLUSTERSET node IDs)
merged. Bump deps/LibMR from the PR-branch SHA (8851627) to LibMR
master HEAD.

This removes the "do not merge to TS master" caveat. TS#2034 is now
pinned to a clean LibMR master commit and is functionally ready to
merge once review + CI sign off.

* MOD-15749 Strengthen broadcast reply guard (Cursor Bugbot feedback)

Cursor Bugbot pointed out that `for ... in enumerate(replies or [])` lets
broadcast returning None or [] silently skip the per-shard assert -- the
test would proceed as if CLUSTERSET succeeded on all shards. Add an
explicit pre-check that env.broadcast returned a non-None list of exactly
env.shardsCount entries before iterating, so framework-level surprises
fail at the right place instead of leaking into a confusing TS.QUERYINDEX
count mismatch later.

Pure strengthening: existing per-shard reply assertion is unchanged.

* MOD-15749 Widen GROUPBY test to multiple distinct groups (Tal feedback)

Tal: "with a single group, MR_SendRecordToSlot only ever reads one slots
index, so a regression that leaves most entries NULL (the exact short-form
CLUSTERSET failure mode) still passes by luck."

The existing GROUPBY-by-label assertion uses one label value ("test"), so
the GROUPBY shuffle hits exactly one slots[hash_slot("test")] entry --
which can pass even if most slots[] entries are NULL.

Add a stronger slot-routed assertion:
  1. Tag each key with a `group=g<i % number_of_groups>` label using a
     per-key callable in fill_some_data (purely additive -- existing
     string/int label values still work as-is).
  2. After CLUSTERSET, run TS.MRANGE FILTER label=test GROUPBY group
     REDUCE count and assert we see exactly number_of_groups distinct
     group-blocks, each with samples_per_key timestamps and
     keys_per_group as the per-timestamp count.

The previous single-group assertion is unchanged -- still validates the
basic aggregation path.

* MOD-15749 Replace env.broadcast with per-shard send for CLUSTERSET

env.broadcast() in rltest is fire-and-forget -- it returns None, not a
list of replies. The previous `for ... in enumerate(replies or [])`
pattern silently absorbed this (Cursor Bugbot's exact finding: the
per-shard reply check was dead code). My earlier strengthening
correctly exposed that with `assert replies is not None and len(...)`
which fired immediately in CI because broadcast really does return None.

Replace with an explicit per-shard send: getConnection(i).execute_command()
for each shard, asserting each reply is OK. Same semantic intent as the
original broadcast (every shard gets the short-form CLUSTERSET) but now
we actually verify each reply.

* MOD-15749 Keep env.broadcast for CLUSTERSET, drop dead reply check

Tom pointed out: the DMC dispatches CLUSTERSET in production via
broadcast (fire-and-forget to every shard), so the test should mirror
that wire-level pattern. The previous iteration (per-shard sequential
sends) sidestepped the env.broadcast-returns-None problem but lost
fidelity to how the orchestrator actually issues this command.

Cursor Bugbot's original concern was specifically that the per-shard
reply check (`for ... in enumerate(replies or [])`) was dead code,
because env.broadcast returns None. The reply check never executed
since the test was added -- so removing it isn't a coverage loss, just
deletion of code that never ran.

Restore env.broadcast and remove the unreachable reply loop. The
downstream assertions (TS.QUERYINDEX expecting number_of_keys after
CLUSTERSET; TS.MRANGE GROUPBY group REDUCE count expecting
number_of_groups blocks) catch any actual failure end-to-end.

* MOD-15749 Mirror DMC's single-shard CLUSTERSET dispatch + propagation check

DMC's actual production pattern (per LibMR cluster.c:991): send
timeseries.CLUSTERSET to ONE shard; that shard self-configures and then
propagates the topology to peers by sending them CLUSTERSETFROMSHARD on
rg.hello / reconnect. There's no DMC-side broadcast.

The previous test used env.broadcast(...) which sent to every shard
directly -- not what the DMC does in production, and skipped exercising
the FROMSHARD propagation that production relies on.

Switch to the DMC pattern:
  - Send timeseries.CLUSTERSET to shard 0 only (with the per-reply OK
    check that env.broadcast couldn't give us).
  - Poll INFOCLUSTER on every shard until none returns the "no cluster
    mode" simple-string (LibMR cluster.c:1472, returned while
    clusterCtx.CurrCluster is NULL). Deadline 10s; usually sub-second.
  - On timeout, report which shards never received the propagated
    CLUSTERSETFROMSHARD -- a sharper signal than the previous downstream
    QUERYINDEX-count mismatch.

This is a stronger end-to-end check: the single-shard send + propagation
path is exactly what the DMC relies on, so a regression in either the
short-form parser OR the FROMSHARD forwarding now fails the test
explicitly instead of via a confusing downstream symptom.

* MOD-15749 test_short_form_clusterset: address self-review findings

Three nits found on fresh self-review of the propagation block:

1. Scale propagation deadline for valgrind / sanitizer (real fix).
   The fixed 10s deadline is fine on normal builds but under valgrind
   every operation is ~30x slower; the reconnect + async-send chain
   that delivers CLUSTERSETFROMSHARD can plausibly take tens of seconds.
   Scale to 60s when VALGRIND or SANITIZER, matching the existing
   pattern at line 41 of this file. Avoids a real flake risk in the
   linux-valgrind / x64-jammy job.

2. Cache per-shard connections outside the polling loop.
   The 100ms inner loop was opening/closing a fresh connection per
   shard per iteration (worst case ~300 fresh sockets). Hoist the
   connections once before the loop -- correctness unchanged, just
   stops thrashing redis client sockets.

3. Wrong variable in assertion message.
   `assert withlabels == [], filtered_by` -- the message printed the
   group name instead of the actually-unexpected withlabels content.
   Cosmetic but misleading on failure. Fixed to print `withlabels`.

* MOD-15749 test_short_form_clusterset: trim to match neighboring tests' style

The test had grown several paragraphs of explanatory comments and
fragmented every redis call into its own `with env.getConnection(0)`
block. The other tests in this file (test_asm_with_data,
test_asm_with_data_and_queries_during_migrations, etc.) use a single
`conn = env.getConnection(0)` hoist and one-line comments at section
transitions -- no docstrings, no paragraph-style explanations.

Align style:
  - Single conn hoist; reuse for every per-shard-0 call.
  - Replace 4-7-line paragraph comments with one-line section notes.
  - Drop the trivially-always-true `number_of_keys % number_of_groups`
    self-check.
  - Drop the verbose f-string assertion message on the OK reply check
    (other simple equality asserts in the file are messageless).
  - Compress fill_some_data's callable-label docstring to one line.

No assertion is removed or weakened -- same pre/post-CLUSTERSET
QUERYINDEX checks, same propagation poll, both single-group and
multi-group GROUPBY assertions retained. The valgrind-aware deadline
scaling stays in place (just inlined). Net diff: 112 lines down to 54.

* MOD-15749 Replace INFOCLUSTER poll with TS.QUERYINDEX poll

INFOCLUSTER is registered with the `internal` command flag in OSS
builds (LibMR cluster.c sets `command_flags = "readonly deny-script
internal"` when RedisModule_GetInternalSecret is available). Internal
commands are reachable only over the cluster bus -- not from a regular
client connection -- so the previous INFOCLUSTER poll always saw
"unknown command 'timeseries.INFOCLUSTER'".

Switch to polling TS.QUERYINDEX on shard 0: before propagation it
returns the local subset (~number_of_keys/shardsCount), after
propagation it returns the full set. This uses a publicly-callable
command, observes the actual feature we care about (cross-shard
fan-out), and replaces the broken probe with one that runs in the
same OSS-cluster mode as the rest of the test.

Verified locally: test passes in 6 seconds against redis unstable
(post-#14953) on macos-arm64.

---------

Co-authored-by: Tom Gabsow <gabsow.tom@gmail.com>
Co-authored-by: Tom Gabsow <tom.gabsow@redis.com>
(cherry picked from commit b0ecf8a)

* MOD-15749 Adopt Tal's REDISMODULE_MAIN fix; skip test on OSS 8.x

Single squashed change replacing the SDK-pin back-and-forth from
prior commits. Net effect vs the cherry-pick:

  - Makefile CC_DEFS: DREDISMODULE_MAIN -> REDISMODULE_MAIN (typo fix
    from Tal's #2043). The macro must be set by the build so it is in
    effect before the force-included common.h pulls in redismodule.h.
  - src/module.c: drop the late `#define REDISMODULE_MAIN` (would be
    too late once common.h is force-included).
  - tests/flow/test_asm.py::test_short_form_clusterset: always skip.
    The feature needs RM_GetClusterNodeSlotRanges (redis/redis#14953),
    unavailable in OSS 8.x; ships against RE-private. Master / RE
    branches keep the assertions intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix remaining redis-ref defaults: 'unstable' -> '8.6'

flow-random-traffic.yaml still defaulted to 'unstable' on two inputs.
Manual / workflow_call triggers on the 8.6 branch would run against
redis unstable instead of redis 8.6.

Also drops the now-stale TODO comments on event-ci.yml and
event-tag.yml since their values are no longer "to-do".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bump LibMR pin to master (#92, #94, #95 landed on master)

LibMR master (44d5025f) is tree-equivalent to the side-branch
(51e3ad1c) we were tracking -- same three MOD-15749 commits
(#92 add short-form, #94 NULL check, #95 nodeId overflow) on the
same base (2857cfec). The side-branch was created as a precaution
in case LibMR master had unrelated changes incompatible with 8.x;
turns out master's tip is just those three PRs, so no conflict.

Switching to master per Tal's review comment ("align to master").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Gal Cohen <38293267+galcohen-redislabs@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants