[MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator by lerman25 · Pull Request #8191 · RediSearch/RediSearch

lerman25 · 2026-01-27T15:51:51Z

PR Description

Refactor Coordinator Search to Use Proper Blocking Client Callbacks

This PR refactors the distributed search (FT.SEARCH and FT.PROFILE) flow in coordinator mode to properly use Redis module blocking operations, following the Redis Modules Blocking Operations best practices.

Motivation

The previous implementation had several issues:

Timeout handling in the reducer - Timeout was checked in the searchResultReducer background thread by comparing elapsed time, which is error-prone and doesn't integrate with Redis's native timeout mechanism
Reply from background thread - The reducer replied directly using a thread-safe context, rather than using the proper unblock callback pattern

Key Changes

1. Proper Callback Architecture

DistSearchUnblockClient (reply callback): Now only handles sending the reply to the client. Gets the reduced results from the context and calls sendSearchResults or profileSearchReply
DistSearchFreePrivData (free_privdata callback): New callback responsible for all cleanup - frees searchReducerCtx, searchProfileReducerCtx, searchRequestCtx, and MRCtx
DistSearchTimeoutFailClient (timeout callback): New callback for TimeoutPolicy_Fail - replies with timeout error when the blocking client times out
DistSearchBlockClientWithTimeout: New helper that configures the blocked client with the appropriate callbacks based on timeout policy
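
The callback selection described above can be sketched in a few lines. This is a self-contained mock, not the module's code: `MockBlockedClient`, `mock_block_client`, the `POLICY_*` values, and the callback names are hypothetical stand-ins, and the choice to register no timeout callback outside the FAIL policy is an assumption made for illustration. In the real module the registration goes through `RedisModule_BlockClient`, which takes the reply, timeout, and free_privdata callbacks plus a timeout in milliseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Redis module callback types; illustration only. */
typedef int (*ReplyCB)(void *privdata);
typedef int (*TimeoutCB)(void);
typedef void (*FreePrivDataCB)(void *privdata);

typedef enum { POLICY_RETURN, POLICY_FAIL } TimeoutPolicy; /* hypothetical names */

typedef struct {
  ReplyCB reply_cb;
  TimeoutCB timeout_cb;
  FreePrivDataCB free_cb;
  long long timeout_ms; /* 0 means "never time out" in this mock */
} MockBlockedClient;

static int reply_cb(void *privdata) { (void)privdata; return 0; }
static int timeout_fail_cb(void) { return 1; } /* would reply with a timeout error */
static void free_privdata_cb(void *privdata) { (void)privdata; }

/* Same argument order as RedisModule_BlockClient, minus the ctx argument. */
static MockBlockedClient mock_block_client(ReplyCB r, TimeoutCB t,
                                           FreePrivDataCB f, long long ms) {
  MockBlockedClient bc = { r, t, f, ms };
  return bc;
}

/* Pick the callbacks according to the timeout policy. */
MockBlockedClient block_client_with_timeout(TimeoutPolicy policy,
                                            long long query_timeout_ms) {
  if (policy == POLICY_FAIL && query_timeout_ms > 0) {
    return mock_block_client(reply_cb, timeout_fail_cb, free_privdata_cb,
                             query_timeout_ms);
  }
  /* Assumption: other policies block without a Redis-side timeout. */
  return mock_block_client(reply_cb, NULL, free_privdata_cb, 0);
}
```

Keeping the policy decision in one helper means the reply and cleanup callbacks are registered identically on every path, and only the timeout callback and timeout value vary.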

2. Reducer Refactoring (searchResultReducer)

Allocates searchReducerCtx on the heap (was stack-allocated) and stores it in req->rctx
For FT.PROFILE, stores profile-specific data in searchProfileReducerCtx
Error paths use RedisModule_AbortBlock and call DistSearchFreePrivData directly
Success path calls RedisModule_UnblockClient(bc, mc) with private data
Removed inline timeout checking (now handled by Redis timeout callback)

3. Memory Leak Fix

Checks the return value of RedisModule_UnblockClient
If it returns REDISMODULE_ERR (client already unblocked by timeout), calls DistSearchFreePrivData directly to avoid leaking resources
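
A minimal sketch of that ownership rule, using a mock in place of the Redis API. It follows the PR's statement that the unblock call returns an error when the client was already unblocked by the timeout; `mock_unblock_client`, `MockPrivData`, and `finish_reduce` are hypothetical names, and the mock runs the cleanup immediately where Redis would invoke the free_privdata callback later.

```c
#include <assert.h>
#include <stdlib.h>

#define MOCK_OK 0
#define MOCK_ERR 1

typedef struct {
  int unblocked; /* set once the client has been unblocked (e.g. by timeout) */
} MockBlockedClient;

typedef struct {
  int *results;
} MockPrivData;

static int free_calls = 0; /* counts cleanups so leaks and double frees show up */

static void mock_free_privdata(MockPrivData *pd) {
  if (!pd) return;
  free(pd->results);
  free(pd);
  free_calls++;
}

/* Fails if the client is already unblocked; on success it takes ownership of pd. */
static int mock_unblock_client(MockBlockedClient *bc, MockPrivData *pd) {
  if (bc->unblocked) return MOCK_ERR;
  bc->unblocked = 1;
  mock_free_privdata(pd); /* stands in for the deferred free_privdata callback */
  return MOCK_OK;
}

MockPrivData *new_privdata(void) {
  MockPrivData *pd = malloc(sizeof *pd);
  assert(pd);
  pd->results = calloc(4, sizeof *pd->results);
  return pd;
}

/* Reducer epilogue: hand pd to the unblock call, or free it if nobody took it. */
int finish_reduce(MockBlockedClient *bc, MockPrivData *pd) {
  int rc = mock_unblock_client(bc, pd);
  if (rc == MOCK_ERR) {
    mock_free_privdata(pd); /* timeout won the race, so we still own pd */
  }
  return rc;
}
```

In both outcomes the private data is released exactly once: by the callback path on success, or by the caller when the unblock call reports failure.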

4. Type Definition Fix

Changed typedef struct { ... } searchReducerCtx; to typedef struct searchReducerCtx { ... } searchReducerCtx;
This ensures struct searchReducerCtx * and searchReducerCtx * are compatible types
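
The effect of adding the struct tag can be shown with a small example. The field and the `requestCtxSketch` type are illustrative and not the real layout; only the name `searchReducerCtx` comes from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration, e.g. in a header that only needs the pointer type. */
struct searchReducerCtx;

typedef struct {
  struct searchReducerCtx *rctx; /* set by the reducer */
} requestCtxSketch; /* hypothetical stand-in for searchRequestCtx */

/* Tagged definition: the tag and the typedef name now refer to the same type.
 * With an untagged `typedef struct { ... } searchReducerCtx;` the forward
 * declaration above would name a different, never-completed type. */
typedef struct searchReducerCtx {
  int numResults; /* illustrative field */
} searchReducerCtx;

int stored_results(const requestCtxSketch *req) {
  /* No cast needed: both spellings are the same type. */
  const searchReducerCtx *r = req->rctx;
  return r ? r->numResults : -1;
}
```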

Flow Diagram

Client Request (FT.SEARCH)
    │
    ▼
DistSearchBlockClientWithTimeout()
    │ - Registers: reply_callback, timeout_callback, free_privdata
    ▼
MR_Fanout() → shards
    │
    ▼
searchResultReducer (background thread)
    ├── Error? ──► AbortBlock() + DistSearchFreePrivData()
    │
    └── Success? ─► UnblockClient(bc, mc)
                        │
                        ├── OK ──► DistSearchUnblockClient (reply)
                        │              └──► DistSearchFreePrivData (cleanup)
                        │
                        └── ERR (timeout race) ──► DistSearchFreePrivData() directly

Testing

Added new test file tests/pytests/test_blocked_client_timeout.py with:

test_fail_timeout_search - Tests timeout handling for FT.SEARCH
test_fail_timeout_profile - Tests timeout handling for FT.PROFILE

The tests simulate timeout by pausing a shard and triggering client unblock via CLIENT UNBLOCK ... TIMEOUT.

This PR requires release notes
This PR does not require release notes

Note

Medium Risk
Refactors coordinator-side distributed search to rely on Redis blocking client reply/timeout/free callbacks and changes timeout parsing/type, which can affect query lifecycle, error propagation, and memory cleanup under race/timeout conditions.

Overview
Coordinator-mode FT.SEARCH/FT.PROFILE is refactored to follow Redis blocked-client best practices: reduction now runs in the background, but all replies/errors/timeouts are sent on the main thread via DistSearchUnblockClient, a new timeout callback, and a new free_privdata cleanup path.

Timeout handling is moved from elapsed-time checks in the reducer to Redis’s native blocked-client timeout mechanism (with early TIMEOUT arg parsing), and error propagation is unified by storing a QueryError inside MRCtx (including a new MR_CreateBailoutCtx path for early failures).

Misc fixes include switching parseTimeout to size_t/AC_GetSize, zeroing MRCommand on free, and adding cluster tests (test_blocked_client_timeout.py) that simulate coordinator/shard stalls and validate FAIL-timeout behavior.

Written by Cursor Bugbot for commit 1d2f821.

codecov · 2026-01-27T16:21:20Z

Codecov Report

❌ Patch coverage is 95.49550% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.55%. Comparing base (f1e6935) to head (1d2f821).
⚠️ Report is 19 commits behind head on master.

Files with missing lines	Patch %	Lines
src/module.c	94.79%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8191      +/-   ##
==========================================
- Coverage   82.57%   82.55%   -0.02%     
==========================================
  Files         368      383      +15     
  Lines       55366    56330     +964     
  Branches    14340    15161     +821     
==========================================
+ Hits        45719    46506     +787     
- Misses       9499     9670     +171     
- Partials      148      154       +6

Flag	Coverage Δ
flow	`83.82% <95.49%> (-0.68%)`	⬇️
unit	`50.48% <0.90%> (+0.82%)`	⬆️

fcostaoliveira · 2026-01-27T17:32:52Z

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

In summary:

Detected a total of 35 stable tests between versions.
Detected a total of 6 highly unstable benchmarks (6 baseline).
Detected a total of 1 improvement above the improvement water line.
Detected a total of 1 regression below the regression water line (8.0%).

You can check a comparison in detail via the grafana link

Performance Improvements - Comparison between master and omerl-search-coord-new-timeout.

Time Period from 30 days ago. (environment used: oss-standalone)

Test Case	Baseline master (median obs. +- std.dev)	Comparison omerl-search-coord-new-timeout (median obs. +- std.dev)	% change (higher-better)	Note
search-numeric-sortby	2339 +- 2.7% (4 datapoints)	3683	57.4%	IMPROVEMENT

Performance Regressions and Issues - Comparison between master and omerl-search-coord-new-timeout.

Time Period from 30 days ago. (environment used: oss-standalone)

Test Case	Baseline master (median obs. +- std.dev)	Comparison omerl-search-coord-new-timeout (median obs. +- std.dev)	% change (higher-better)	Note
search-numeric	3019 +- 25.3% UNSTABLE (4 datapoints)	2208	-26.9%	UNSTABLE (baseline high variance); server: FT.SEARCH p50 increased 34.4% (baseline CV=25.6%); client: client latency stable; only server side confirms regression (client side stable) - insufficient evidence
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable	38 +- 14.5% UNSTABLE (7 datapoints)	32	-17.3%	UNSTABLE (baseline high variance); server: FT.SEARCH p50 increased 12.1% (baseline CV=9.3%); client: client latency stable; only server side confirms regression (client side stable) - insufficient evidence
search-ftsb-10K-enwiki_abstract-hashes-fulltext-search-sortby-limit-0-100	938 +- 3.1% (4 datapoints)	834	-11.1%	REGRESSION
search-numeric-sortby-desc-optimize	29 +- 10.5% UNSTABLE (7 datapoints)	29	-1.0%	UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query	820 +- 19.9% UNSTABLE (7 datapoints)	820	0.1%	UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable	970 +- 13.1% UNSTABLE (7 datapoints)	1002	3.3%	UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression
search-numeric-sortby-desc	2298 +- 31.0% UNSTABLE (7 datapoints)	3615	57.3%	UNSTABLE (baseline high variance); server: FT.SEARCH p50 decreased 37.3% (baseline CV=22.1%); client: Latency decreased 36.4% (baseline CV=21.1%); neither server nor client side confirms regression

Tests with No Significant Changes (35 tests)

Test Case	Baseline master (median obs. +- std.dev)	Comparison omerl-search-coord-new-timeout (median obs. +- std.dev)	% change (higher-better)	Note
ftsb-10K-enwiki_abstract-hashes-fulltext-sortby	72 +- 5.2% (7 datapoints)	69.00	-2.9%	No Change
ftsb-10K-enwiki_abstract-hashes-term-prefix	6157 +- 2.8% (7 datapoints)	6086.00	-1.2%	No Change
ftsb-10K-enwiki_abstract-hashes-term-suffix	2274 +- 2.7% (7 datapoints)	2280.00	0.3%	No Change
ftsb-10K-enwiki_abstract-hashes-term-suffix-withsuffixtrie	16468 +- 1.1% (7 datapoints)	16427.00	-0.2%	No Change
ftsb-10K-enwiki_abstract-hashes-term-wildcard	8837 +- 5.6% (7 datapoints)	8916.00	0.9%	No Change
ftsb-10K-enwiki_pages-hashes-fulltext-mixed_simple-1word-query_write_1_to_read_20.yml	1013 +- 2.2% (7 datapoints)	1027.00	1.3%	No Change
ftsb-10K-enwiki_pages-hashes-load	62825 +- 4.2% (7 datapoints)	57840.00	-7.9%	potential REGRESSION
ftsb-10K-multivalue-numeric-json	993 +- 0.8% (7 datapoints)	977.00	-1.7%	No Change
ftsb-10K-singlevalue-numeric-json	483 +- 0.3% (4 datapoints)	473.00	-2.1%	No Change
ftsb-1K-enwiki_abstract-hashes-term-contains	1934 +- 2.0% (7 datapoints)	1989.00	2.8%	No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query	393 +- 7.9% (7 datapoints)	400.00	1.7%	No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query	3062 +- 5.1% (7 datapoints)	3110.00	1.6%	No Change
ftsb-1M-enwiki_abstract-hashes-load	21305 +- 4.7% (7 datapoints)	21615.00	1.5%	No Change
ftsb-1M-nyc_taxis-ftadd-load	27761 +- 2.8% (7 datapoints)	28702.00	3.4%	potential IMPROVEMENT
ftsb-1M-nyc_taxis-hashes-load	29147 +- 3.6% (7 datapoints)	29606.00	1.6%	No Change
search-aggregate-post-filter-simple.yml	17491 +- 1.4% (7 datapoints)	16934.00	-3.2%	potential REGRESSION
search-filtering-tag-numeric	272 +- 9.2% (7 datapoints)	294.00	8.2%	waterline=9.2%. potential IMPROVEMENT
search-filtering-tag-numeric-filter-pipeline	10939 +- 1.4% (7 datapoints)	10953.00	0.1%	No Change
search-ftsb-10K-enwiki_abstract-hashes-fulltext-aggregate-sortby-limit-0-100	839 +- 2.4% (7 datapoints)	812.00	-3.3%	potential REGRESSION
search-ftsb-10K-enwiki_abstract-hashes-term-withoutsuffix-trie	14291 +- 1.2% (7 datapoints)	14152.00	-1.0%	No Change
search-ftsb-10K-enwiki_abstract-hashes-term-withsuffix-trie	14187 +- 1.7% (4 datapoints)	14148.00	-0.3%	No Change
search-ftsb-1700K-docs-union-iterators-q3	8.0 +- 1.4% (7 datapoints)	8.10	0.5%	No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-non-sortable	169 +- 5.8% (4 datapoints)	174.00	3.3%	potential IMPROVEMENT
search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-one-indexed-field	7507 +- 2.2% (4 datapoints)	7516.00	0.1%	No Change
search-ftsb-370K-docs-union-iterators-q4	8.3 +- 0.9% (7 datapoints)	8.30	-0.1%	No Change
search-ftsb-5200K-docs-union-iterators-q1	0.84 +- 1.4% (7 datapoints)	0.85	1.2%	No Change
search-ftsb-5500K-docs-union-iterators-q2	1.2 +- 0.8% (7 datapoints)	1.20	1.7%	No Change
search-geo	220 +- 2.2% (4 datapoints)	226.00	2.7%	No Change
search-high-cardinality-negation-term-baseline	37 +- 1.2% (7 datapoints)	37.00	-0.7%	No Change
search-high-cardinality-negation-term-comparison_union_all_other_terms	14 +- 2.0% (7 datapoints)	14.00	0.8%	No Change
search-numeric-optimize	7982 +- 1.1% (7 datapoints)	8111.00	1.6%	No Change
search-numeric-sortby-optimize	28 +- 8.1% (7 datapoints)	29.00	4.6%	waterline=8.1%. potential IMPROVEMENT
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-fulltext-filter	613 +- 2.8% (7 datapoints)	593.00	-3.2%	potential REGRESSION
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-numeric-filter	156 +- 9.2% (7 datapoints)	158.00	1.5%	waterline=9.2%. No Change
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-tag-filter	15731 +- 1.4% (7 datapoints)	16154.00	2.7%	No Change

cursor · 2026-01-28T15:14:40Z

src/module.c

-        QueryErrorsGlobalStats_UpdateError(errCode, 1, COORD_ERR_WARN);
-        res = MR_ReplyWithMRReply(reply, curr_rep);
-        goto cleanup;
+        QueryError_SetError(&rCtx->status, errCode, NULL);


Shard error messages lost when replying to client

Medium Severity

When a shard returns an error, rCtx->lastError is set to the original reply containing the specific error message, but QueryError_SetError is called with NULL as the message parameter. Later in DistSearchUnblockClient, only rCtx->status is used to reply via QueryError_ReplyAndClear, which will produce a generic error message for the error code. The original error's specific message text is stored but never used, resulting in loss of diagnostic information. The old code used MR_ReplyWithMRReply(reply, curr_rep) to preserve the exact error message from the shard.

Additional Locations (1)

src/module.c#L3820-L3825
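
The behavior described in this comment can be illustrated with a small stand-in for the error object. `ErrorSketch`, `error_set`, and the message strings are hypothetical; the sketch only assumes, as the comment states, that a NULL message results in a generic per-code text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
  int code;         /* 0 means "no error" */
  char detail[128]; /* text that is eventually sent to the client */
} ErrorSketch;

static const char *generic_message(int code) {
  return code == 1 ? "Generic shard error" : "Unknown error";
}

/* A NULL message falls back to the generic text for the code. */
void error_set(ErrorSketch *err, int code, const char *msg) {
  err->code = code;
  snprintf(err->detail, sizeof err->detail, "%s",
           msg ? msg : generic_message(code));
}
```

Passing the shard's own message instead of NULL keeps the diagnostic text intact, which is what the old direct-reply path achieved.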

cursor · 2026-01-28T15:14:40Z

src/module.c

+}
+
+typedef RedisModuleCmdFunc BlockedClientTimeoutCB ;
+typedef void (*BlockedClientFreePrivDataCB) (RedisModuleCtx *ctx, void *privdata);


Unnecessary typedefs and constant variable add complexity

Low Severity

The typedefs BlockedClientTimeoutCB and BlockedClientFreePrivDataCB are each used only once within DistSearchBlockClientWithTimeout. Additionally, freePrivDataCallback is always assigned to DistSearchFreePrivData and never changes, making it an unnecessary intermediate variable. These abstractions don't add semantic value and could be simplified by using the types directly and inlining DistSearchFreePrivData in the RedisModule_BlockClient call. This aligns with the PR reviewer comment "Avoid passing arguments you don't need."

Additional Locations (1)

src/module.c#L3922-L3924

cursor · 2026-01-28T20:27:27Z

src/module.c

+      profileSearchReply(reply, rCtx, MRCtx_GetNumReplied(mrctx), MRCtx_GetReplies(mrctx), &req->profileClock, rs_wall_clock_now_ns());
+    } else {
+      // Non-profile command
+      sendSearchResults(reply, rCtx);


NULL dereference when fanout has zero expected replies

High Severity

When the fanout command sends to zero shards (numExpected == 0), the code path in uvFanoutRequest directly unblocks the client without calling searchResultReducer and without setting an error. The new DistSearchUnblockClient only checks for explicit errors via QueryError_HasError, then proceeds to access req->rctx. Since the reducer never ran, req->rctx is NULL (from rm_calloc in searchRequestCtx_New). Calling sendSearchResults or profileSearchReply with NULL rCtx causes a NULL pointer dereference crash. The old code handled this with MRCtx_GetNumReplied(mrctx) == 0 check, but this was removed.
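
One possible shape for a guard against this case is sketched below. The types and `choose_reply` are hypothetical, and `REPLY_EMPTY` is only a placeholder for whatever reply the module should send when the reducer never ran; the PR's actual fix may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int numResults; } reducerSketch;   /* stand-in for searchReducerCtx */
typedef struct { reducerSketch *rctx; } requestSketch; /* stand-in for searchRequestCtx */

typedef enum { REPLY_RESULTS, REPLY_EMPTY, REPLY_ERROR } ReplyKind;

/* Decide what the reply callback should send before touching req->rctx. */
ReplyKind choose_reply(int has_error, const requestSketch *req) {
  if (has_error) return REPLY_ERROR;
  /* The reducer never ran (zero expected replies) or there is no request. */
  if (req == NULL || req->rctx == NULL) return REPLY_EMPTY;
  return REPLY_RESULTS;
}
```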

cursor · 2026-01-28T23:04:25Z

src/coord/rmr/rmr.c

+  // cmd can be NULL in case of bailout
+  if(ctx->cmd) {
+    MRCommand_Free(&ctx->cmd);
+  }


Invalid struct truthiness check in MRCtx_Free

High Severity

The check if(ctx->cmd) is invalid because MRCommand cmd is a struct, not a pointer. In C, a struct cannot be evaluated for truthiness directly. The comment states "cmd can be NULL in case of bailout" but structs cannot be NULL. Additionally, MR_CreateCtx uses rm_malloc without initializing the cmd field, so in the bailout path (MR_CreateBailoutCtx), the cmd field contains garbage memory. If this check somehow passes, calling MRCommand_Free on uninitialized data will cause crashes or memory corruption.

Additional Locations (1)

src/coord/rmr/rmr.c#L93-L110
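
A sketch of one way to make such a free function safe, in the spirit of the "zeroing MRCommand on free" fix mentioned in the overview. `CommandSketch` and its fields are hypothetical and do not reflect the real `MRCommand` layout. The idea is to test a pointer member rather than the struct value, and to leave the struct zeroed so a second free is harmless.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char **strs; /* NULL when the command is empty */
  int num;
} CommandSketch; /* hypothetical stand-in for MRCommand */

/* `if (cmd)` does not compile for a struct value; test a member instead. */
int command_is_set(const CommandSketch *cmd) {
  return cmd->strs != NULL;
}

CommandSketch command_new(const char *arg) {
  CommandSketch cmd = { malloc(sizeof(char *)), 1 };
  assert(cmd.strs);
  size_t len = strlen(arg) + 1;
  cmd.strs[0] = malloc(len);
  assert(cmd.strs[0]);
  memcpy(cmd.strs[0], arg, len);
  return cmd;
}

/* Safe on a zero-initialized or already-freed command. */
void command_free(CommandSketch *cmd) {
  for (int i = 0; i < cmd->num; i++) free(cmd->strs[i]);
  free(cmd->strs);              /* free(NULL) is a no-op */
  *cmd = (CommandSketch){0};    /* zero the struct so a repeat call is harmless */
}
```

This only works if the containing context is zero-initialized on the bailout path; a struct left uninitialized after a plain malloc still holds garbage.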

cursor · 2026-01-28T23:04:25Z

src/module.c

+
+    searchRequestCtx *req = MRCtx_GetPrivData(mrctx);
+
+    searchReducerCtx *rCtx = req->rctx;


NULL dereference when bailout called without error status

High Severity

The new DistSearchUnblockClient callback assumes an error is always set when bailOut is called, but rscParseRequest can return NULL without setting an error (when LIMIT values are negative at lines 2171-2173). In this case, QueryError_HasError returns false, the early return is skipped, and MRCtx_GetPrivData(mrctx) returns NULL (since bailout sets privdata to NULL). The subsequent req->rctx dereference causes a NULL pointer crash.

Additional Locations (1)

src/module.c#L2170-L2173

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-29T09:56:55Z

src/coord/rmr/rmr.c

+  Used when we need to send an error to the client, and we don't expect any replies.
+  The status parameter is used to pass the error to the client after we unblock it, must not be NULL or OK.*/
+MRCtx *MR_CreateBailoutCtx(RedisModuleCtx *ctx, RedisModuleBlockedClient *bc, QueryError *status) {
+  RS_ASSERT(status && QueryError_HasError(status));


Assertion failure when rscParseRequest fails without setting error

Medium Severity

The new MR_CreateBailoutCtx function has RS_ASSERT(status && QueryError_HasError(status)), but bailOut can be called from createReq when rscParseRequest returns NULL without setting an error. Existing code paths like invalid LIMIT values (lines 2171-2173) or malformed SORTBY (lines 2184-2186) return NULL without calling QueryError_SetError, causing the assertion to fail when bailOut attempts to create the bailout context.

Additional Locations (1)

src/module.c#L3652-L3653
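
The later commit titled "Assign Query error if not assigned" suggests a fallback along the following lines. This is only a sketch under that assumption: `StatusSketch`, `ensure_error`, `ERR_GENERIC`, and the message text are hypothetical and not taken from the PR's code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  int code;            /* 0 means "no error" */
  const char *message;
} StatusSketch; /* hypothetical stand-in for QueryError */

#define ERR_GENERIC 1 /* hypothetical generic error code */

int status_has_error(const StatusSketch *s) { return s->code != 0; }

/* Give the status a generic error if the parser failed without setting one. */
void ensure_error(StatusSketch *s) {
  if (!status_has_error(s)) {
    s->code = ERR_GENERIC;
    s->message = "Error parsing request";
  }
}

/* The bailout path can then assert its precondition safely. */
int create_bailout(StatusSketch *s) {
  assert(s != NULL);
  ensure_error(s);
  assert(status_has_error(s));
  return s->code;
}
```

An error that was already set is left untouched, so specific parse errors still reach the client unchanged.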

…nator (#8191) * draft * Use abort on error path * use free priv data * test * avoid leak * skip SA * Cursor comments * profile reply with unblock * remove assert * test profile * Cleanup for PR * move postProcess to before unblocking the client * test timeout before fanout * remove double postProcess * review Phase1 * Phase 2 * fast check resp3 * fix resp3 * move clear errror * Fix error pass * Move error handling to MRCTX * check cmd before free * fix * Better assert and comment * Change check * Assign Query error if not assigned * MRcommand free safe


draft

a9e917e

github-actions bot added the size:M label Jan 27, 2026

lerman25 added action:run-benchmark action:run-msmarco-benchmark-fast action:run-msmarco-benchmark labels Jan 27, 2026

Use abort on error path

76fd0d0

lerman25 force-pushed the omerl-search-coord-new-timeout branch from 2b094cd to 76fd0d0 Compare January 27, 2026 15:53

cursor bot reviewed Jan 27, 2026

use free priv data

63c8eb5

cursor bot reviewed Jan 27, 2026

lerman25 added 2 commits January 27, 2026 19:47

test

ab6c4d6

avoid leak

1ba5cf7

cursor bot reviewed Jan 27, 2026

skip SA

b05ba33

cursor bot reviewed Jan 27, 2026

Cursor comments

c8df6e4

cursor bot reviewed Jan 28, 2026

lerman25 added 3 commits January 28, 2026 10:14

profile reply with unblock

40b6065

remove assert

8fe7b7b

test profile

12b72bb

cursor bot reviewed Jan 28, 2026

lerman25 added 2 commits January 28, 2026 11:03

Cleanup for PR

f4ea979

move postProcess to before unblocking the client

204f014

lerman25 changed the title ~~Omerl search coord new timeout~~ [MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator Jan 28, 2026

test timeout before fanout

0e01ce1

lerman25 marked this pull request as draft January 28, 2026 09:48

lerman25 marked this pull request as ready for review January 28, 2026 09:48

cursor bot reviewed Jan 28, 2026

lerman25 dismissed GuyAv46’s stale review via 75fd91d January 28, 2026 14:57

move clear errror

f2d48a6

GuyAv46 previously approved these changes Jan 28, 2026

cursor bot reviewed Jan 28, 2026

lerman25 disabled auto-merge January 28, 2026 15:34

Fix error pass

2b651b3

lerman25 dismissed GuyAv46’s stale review via 2b651b3 January 28, 2026 16:29

Move error handling to MRCTX

c005c10

github-actions bot added the size:L label Jan 28, 2026

cursor bot reviewed Jan 28, 2026

check cmd before free

9691b3c

cursor bot reviewed Jan 28, 2026

lerman25 added 2 commits January 29, 2026 09:44

fix

f3aca0f

Better assert and comment

63540a8

GuyAv46 reviewed Jan 29, 2026

Change check

15edea9

cursor bot reviewed Jan 29, 2026

Assign Query error if not assigned

789e4f3

GuyAv46 reviewed Jan 29, 2026

MRcommand free safe

1d2f821

GuyAv46 approved these changes Jan 29, 2026

lerman25 enabled auto-merge January 29, 2026 12:57

lerman25 added this pull request to the merge queue Jan 29, 2026

Merged via the queue into master with commit 877718a Jan 29, 2026
82 checks passed

lerman25 deleted the omerl-search-coord-new-timeout branch January 29, 2026 15:22

lerman25 mentioned this pull request Feb 22, 2026

[MOD-14073] Coordinator-level FAIL timeout for FT.HYBRID #8420

Merged

4 tasks

lerman25 mentioned this pull request Mar 18, 2026

Coordinator-level FT.AGGREGATE: use blocked-client FAIL timeout mechanism #8752

Open

4 tasks

