[MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator#8191
[MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator#8191
Conversation
2b094cd to
76fd0d0
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8191 +/- ##
==========================================
- Coverage 82.57% 82.55% -0.02%
==========================================
Files 368 383 +15
Lines 55366 56330 +964
Branches 14340 15161 +821
==========================================
+ Hits 45719 46506 +787
- Misses 9499 9670 +171
- Partials 148 154 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. In summary:
You can check a comparison in detail via the grafana link Performance Improvements - Comparison between master and omerl-search-coord-new-timeout.Time Period from 30 days ago. (environment used: oss-standalone)
Performance Regressions and Issues - Comparison between master and omerl-search-coord-new-timeout.Time Period from 30 days ago. (environment used: oss-standalone)
Tests with No Significant Changes (35 tests)Tests with No Significant Changes
|
src/module.c
Outdated
| QueryErrorsGlobalStats_UpdateError(errCode, 1, COORD_ERR_WARN); | ||
| res = MR_ReplyWithMRReply(reply, curr_rep); | ||
| goto cleanup; | ||
| QueryError_SetError(&rCtx->status, errCode, NULL); |
There was a problem hiding this comment.
Shard error messages lost when replying to client
Medium Severity
When a shard returns an error, rCtx->lastError is set to the original reply containing the specific error message, but QueryError_SetError is called with NULL as the message parameter. Later in DistSearchUnblockClient, only rCtx->status is used to reply via QueryError_ReplyAndClear, which will produce a generic error message for the error code. The original error's specific message text is stored but never used, resulting in loss of diagnostic information. The old code used MR_ReplyWithMRReply(reply, curr_rep) to preserve the exact error message from the shard.
Additional Locations (1)
| } | ||
|
|
||
| typedef RedisModuleCmdFunc BlockedClientTimeoutCB ; | ||
| typedef void (*BlockedClientFreePrivDataCB) (RedisModuleCtx *ctx, void *privdata); |
There was a problem hiding this comment.
Unnecessary typedefs and constant variable add complexity
Low Severity
The typedefs BlockedClientTimeoutCB and BlockedClientFreePrivDataCB are each used only once within DistSearchBlockClientWithTimeout. Additionally, freePrivDataCallback is always assigned to DistSearchFreePrivData and never changes, making it an unnecessary intermediate variable. These abstractions don't add semantic value and could be simplified by using the types directly and inlining DistSearchFreePrivData in the RedisModule_BlockClient call. This aligns with the PR reviewer comment "Avoid passing arguments you don't need."
Additional Locations (1)
| profileSearchReply(reply, rCtx, MRCtx_GetNumReplied(mrctx), MRCtx_GetReplies(mrctx), &req->profileClock, rs_wall_clock_now_ns()); | ||
| } else { | ||
| // Non-profile command | ||
| sendSearchResults(reply, rCtx); |
There was a problem hiding this comment.
NULL dereference when fanout has zero expected replies
High Severity
When the fanout command sends to zero shards (numExpected == 0), the code path in uvFanoutRequest directly unblocks the client without calling searchResultReducer and without setting an error. The new DistSearchUnblockClient only checks for explicit errors via QueryError_HasError, then proceeds to access req->rctx. Since the reducer never ran, req->rctx is NULL (from rm_calloc in searchRequestCtx_New). Calling sendSearchResults or profileSearchReply with NULL rCtx causes a NULL pointer dereference crash. The old code handled this with MRCtx_GetNumReplied(mrctx) == 0 check, but this was removed.
src/coord/rmr/rmr.c
Outdated
| // cmd can be NULL in case of bailout | ||
| if(ctx->cmd) { | ||
| MRCommand_Free(&ctx->cmd); | ||
| } |
There was a problem hiding this comment.
Invalid struct truthiness check in MRCtx_Free
High Severity
The check if(ctx->cmd) is invalid because MRCommand cmd is a struct, not a pointer. In C, a struct cannot be evaluated for truthiness directly. The comment states "cmd can be NULL in case of bailout" but structs cannot be NULL. Additionally, MR_CreateCtx uses rm_malloc without initializing the cmd field, so in the bailout path (MR_CreateBailoutCtx), the cmd field contains garbage memory. If this check somehow passes, calling MRCommand_Free on uninitialized data will cause crashes or memory corruption.
Additional Locations (1)
|
|
||
| searchRequestCtx *req = MRCtx_GetPrivData(mrctx); | ||
|
|
||
| searchReducerCtx *rCtx = req->rctx; |
There was a problem hiding this comment.
NULL dereference when bailout called without error status
High Severity
The new DistSearchUnblockClient callback assumes an error is always set when bailOut is called, but rscParseRequest can return NULL without setting an error (when LIMIT values are negative at lines 2171-2173). In this case, QueryError_HasError returns false, the early return is skipped, and MRCtx_GetPrivData(mrctx) returns NULL (since bailout sets privdata to NULL). The subsequent req->rctx dereference causes a NULL pointer crash.
Additional Locations (1)
| Used when we need to send an error to the client, and we don't expect any replies. | ||
| The status parameter is used to pass the error to the client after we unblock it, must not be NULL or OK.*/ | ||
| MRCtx *MR_CreateBailoutCtx(RedisModuleCtx *ctx, RedisModuleBlockedClient *bc, QueryError *status) { | ||
| RS_ASSERT(status && QueryError_HasError(status)); |
There was a problem hiding this comment.
Assertion failure when rscParseRequest fails without setting error
Medium Severity
The new MR_CreateBailoutCtx function has RS_ASSERT(status && QueryError_HasError(status)), but bailOut can be called from createReq when rscParseRequest returns NULL without setting an error. Existing code paths like invalid LIMIT values (line 2171-2173) or malformed SORTBY (line 2184-2186) return NULL without calling QueryError_SetError, causing the assertion to fail when bailOut attempts to create the bailout context.
Additional Locations (1)
…nator (#8191) * draft * Use abort on error path * use free priv data * test * avoid leak * skip SA * Cursor comments * profile reply with unblock * remove assert * test profile * Cleanup for PR * move postProcess to before unblocking the client * test timeout before fanout * remove double postProcess * review Phase1 * Phase 2 * fast check resp3 * fix resp3 * move clear errror * Fix error pass * Move error handling to MRCTX * check cmd before free * fix * Better assert and comment * Change check * Assign Query error if not assigned * MRcommand free safe
* initial commit (cherry picked from commit ab63a17) * fix api * some redesign of how search results are maintained * expose link_static_lib to allow disk to use it * add a way to control if async api will be used * fail debug command if async io is not supported * separate functions * update debug test and command * simplify some of the flow * put async after regular version * use double buffering * dynamically increase index result buffer * use double linked list to manage and track reads * code review fixes * change disk api to simplify passing and cleaning up dmd pointer * code review fixes * simplify error handling api * fix compilation * few fixes * fix cursor code review comments * minor fixes * code review fixes * remove check * pass right allocate callback * moved async state to its own file, added state machine like tests * fix cursor's code review comment * code review fix * address code review comments * fix code review comments * small fix * fix code review comments * minor refactoring to address code review comment * cide review fixes * code review comment * address code review comments * use case insensitive comparison * update header * [MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator (RediSearch#8191) * draft * Use abort on error path * use free priv data * test * avoid leak * skip SA * Cursor comments * profile reply with unblock * remove assert * test profile * Cleanup for PR * move postProcess to before unblocking the client * test timeout before fanout * remove double postProcess * review Phase1 * Phase 2 * fast check resp3 * fix resp3 * move clear errror * Fix error pass * Move error handling to MRCTX * check cmd before free * fix * Better assert and comment * Change check * Assign Query error if not assigned * MRcommand free safe * MOD-13701: Update deepdiff dependency version in pyproject.toml to >=8.6.1 (RediSearch#8212) * Update deepdiff dependency version in pyproject.toml to >=8.6.1 * Update dependency versions for deepdiff and orderly-set in uv.lock and pyproject.toml * Update deepdiff and orderly-set versions in uv.lock and pyproject.toml to latest releases * Pin deepdiff version to 8.6.1 in uv.lock and pyproject.toml for consistency * ci: update redisbench-admin (RediSearch#8213) * update to make sure load is recycled * bump redisbench-admin version * benchmark reqs update * update ver redisbench-admin * bump ver * update redisbench-admin * remove dataset name from load benchmark * Update tests/benchmarks/search-msmarco-6M-documents-load.yml * fix: Don't check license headers in the target directory (RediSearch#8226) * MOD-13602 Add queue time tracking to FT.PROFILE (RediSearch#8210) * Add validation tests for FT.PROFILE queue time bug Add design document and validation tests that confirm the bug exists: - testParsingTimeIncludesWorkersQueueTime_BUG: Confirms workers queue wait time is incorrectly included in 'Parsing time' (bug) - testParsingTimeDoesNotIncludeCoordQueueTime: Confirms coordinator queue time is correctly separate from shard's Parsing time Both tests pass, confirming the bug exists as described in the design doc. * Implement Part 1: Workers queue time in FT.PROFILE - Add profileQueueTime field to AREQ struct in aggregate.h - Capture queue time at start of AREQ_Execute_Callback() in aggregate_exec.c - Queue time = elapsed time since initClock was set before enqueueing - Reset initClock after capturing queue time for accurate parsing time - Print 'Workers queue time' in profile output in profile.c - Add testWorkersQueueTimeInProfile test to verify the fix - Mark testParsingTimeIncludesWorkersQueueTime_BUG as skipped (bug is fixed) This fixes the bug where FT.PROFILE's 'Parsing time' incorrectly included time spent waiting in the workers thread pool queue. * Add coordinator queue time tracking to FT.PROFILE Track time spent waiting in the coordinator thread pool queue and report it as 'Coordinator queue time' in FT.PROFILE output. Changes: - Add coordQueueTime field to ConcurrentSearchHandlerCtx struct - Add coordQueueTime field to searchRequestCtx struct - Calculate queue time in DistSearchCommandHandler and DEBUG_DistSearchCommandHandler - Copy queue time to searchRequestCtx in FlatSearchCommandHandler - Print 'Coordinator queue time' in profileSearchReplyCoordinator * Add test for coordinator queue time in FT.PROFILE Add testCoordinatorQueueTimeInProfile to verify that coordinator queue time is correctly captured in cluster mode when the coordinator thread pool is paused. * Remove design document (intermediate artifact) * Fix tests for new Workers queue time and Coordinator queue time fields * Address Cursor Bugbot review: remove obsolete test and unused env parameters * Add AGENTS.local.md to .gitignore * Fix test expectations for new Workers queue time field in FT.PROFILE output * Remove redundant comments from queue time tracking code * Fix test_RED_86036 index after adding Workers queue time to FT.PROFILE * Fix dead code issue * Revert unrelated changes to AGENTS.md and .gitignore * CR- larger debug pauses, clock only on profile and different output order * CR- dont init clock * MOD-13357: Add disk expiration support in OSS side (RediSearch#8218) Add disk expiration support in OSS side * try and align with master * add assert --------- Co-authored-by: lerman25 <58445352+lerman25@users.noreply.github.com> Co-authored-by: Itzikvaknin <82322982+Itzikvaknin@users.noreply.github.com> Co-authored-by: Joan Fontanals <jfontanalsmartinez@gmail.com> Co-authored-by: Luca Palmieri <20745048+LukeMathWalker@users.noreply.github.com> Co-authored-by: ofiryanai <ofiryanai1@gmail.com>
PR Description
Refactor Coordinator Search to Use Proper Blocking Client Callbacks
This PR refactors the distributed search (
FT.SEARCHandFT.PROFILE) flow in coordinator mode to properly use Redis module blocking operations, following the Redis Modules Blocking Operations best practices.Motivation
The previous implementation had several issues:
searchResultReducerbackground thread by comparing elapsed time, which is error-prone and doesn't integrate with Redis's native timeout mechanismKey Changes
1. Proper Callback Architecture
DistSearchUnblockClient(reply callback): Now only handles sending the reply to the client. Gets the reduced results from the context and callssendSearchResultsorprofileSearchReplyDistSearchFreePrivData(free_privdata callback): New callback responsible for all cleanup - freessearchReducerCtx,searchProfileReducerCtx,searchRequestCtx, andMRCtxDistSearchTimeoutFailClient(timeout callback): New callback forTimeoutPolicy_Fail- replies with timeout error when the blocking client times outDistSearchBlockClientWithTimeout: New helper that configures the blocked client with the appropriate callbacks based on timeout policy2. Reducer Refactoring (
searchResultReducer)searchReducerCtxon the heap (was stack-allocated) and stores it inreq->rctxFT.PROFILE, stores profile-specific data insearchProfileReducerCtxRedisModule_AbortBlockand callDistSearchFreePrivDatadirectlyRedisModule_UnblockClient(bc, mc)with private data3. Memory Leak Fix
RedisModule_UnblockClientREDISMODULE_ERR(client already unblocked by timeout), callsDistSearchFreePrivDatadirectly to avoid leaking resources4. Type Definition Fix
typedef struct { ... } searchReducerCtx;totypedef struct searchReducerCtx { ... } searchReducerCtx;struct searchReducerCtx *andsearchReducerCtx *are compatible typesFlow Diagram
Testing
Added new test file
tests/pytests/test_blocked_client_timeout.pywith:test_fail_timeout_search- Tests timeout handling forFT.SEARCHtest_fail_timeout_profile- Tests timeout handling forFT.PROFILEThe tests simulate timeout by pausing a shard and triggering client unblock via
CLIENT UNBLOCK ... TIMEOUT.Note
Medium Risk
Refactors coordinator-side distributed search to rely on Redis blocking client reply/timeout/free callbacks and changes timeout parsing/type, which can affect query lifecycle, error propagation, and memory cleanup under race/timeout conditions.
Overview
Coordinator-mode
FT.SEARCH/FT.PROFILEis refactored to follow Redis blocked-client best practices: reduction now runs in the background, but all replies/errors/timeouts are sent on the main thread viaDistSearchUnblockClient, a new timeout callback, and a newfree_privdatacleanup path.Timeout handling is moved from elapsed-time checks in the reducer to Redis’s native blocked-client timeout mechanism (with early
TIMEOUTarg parsing), and error propagation is unified by storing aQueryErrorinsideMRCtx(including a newMR_CreateBailoutCtxpath for early failures).Misc fixes include switching
parseTimeouttosize_t/AC_GetSize, zeroingMRCommandon free, and adding cluster tests (test_blocked_client_timeout.py) that simulate coordinator/shard stalls and validate FAIL-timeout behavior.Written by Cursor Bugbot for commit 1d2f821. This will update automatically on new commits. Configure here.