[MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator #8191

Merged
lerman25 merged 27 commits into master from omerl-search-coord-new-timeout on Jan 29, 2026

@lerman25 lerman25 commented Jan 27, 2026

PR Description

Refactor Coordinator Search to Use Proper Blocking Client Callbacks

This PR refactors the distributed search (FT.SEARCH and FT.PROFILE) flow in coordinator mode to properly use Redis module blocking operations, following the Redis Modules Blocking Operations best practices.

Motivation

The previous implementation had several issues:

  1. Timeout handling in the reducer - Timeout was checked in the searchResultReducer background thread by comparing elapsed time, which is error-prone and doesn't integrate with Redis's native timeout mechanism
  2. Reply from background thread - The reducer replied directly using a thread-safe context, rather than using the proper unblock callback pattern

Key Changes

1. Proper Callback Architecture

  • DistSearchUnblockClient (reply callback): Now only handles sending the reply to the client. Gets the reduced results from the context and calls sendSearchResults or profileSearchReply
  • DistSearchFreePrivData (free_privdata callback): New callback responsible for all cleanup - frees searchReducerCtx, searchProfileReducerCtx, searchRequestCtx, and MRCtx
  • DistSearchTimeoutFailClient (timeout callback): New callback for TimeoutPolicy_Fail - replies with timeout error when the blocking client times out
  • DistSearchBlockClientWithTimeout: New helper that configures the blocked client with the appropriate callbacks based on timeout policy
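
The callback selection described above can be sketched roughly as follows. This is a standalone illustration, not the PR's code: the types are stand-ins for the Redis module types, `BlockSpec`, `choose_block_spec` and the `on_*` functions are hypothetical names, and the behaviour for the non-FAIL policy (no timeout callback, no blocking timeout) is an assumption.

```c
#include <stddef.h>

/* Stand-in types: NOT the Redis module API, just enough to compile alone. */
typedef struct Ctx Ctx;
typedef int (*CmdFunc)(Ctx *ctx);
typedef void (*FreePrivDataFunc)(Ctx *ctx, void *privdata);

typedef enum { POLICY_RETURN, POLICY_FAIL } TimeoutPolicy;

/* What would be handed to a BlockClient-style call. */
typedef struct {
  CmdFunc reply_cb;          /* sends the reduced results */
  CmdFunc timeout_cb;        /* NULL means no timeout callback */
  FreePrivDataFunc free_cb;  /* single owner of all cleanup */
  long long timeout_ms;      /* 0 means block without a timeout */
} BlockSpec;

/* Hypothetical callbacks; the bodies are empty placeholders. */
static int on_reply(Ctx *ctx) { (void)ctx; return 0; }
static int on_timeout_fail(Ctx *ctx) { (void)ctx; return 0; }
static void on_free_privdata(Ctx *ctx, void *privdata) { (void)ctx; (void)privdata; }

/* Reply and cleanup callbacks are always registered; the timeout callback
   is added only for the FAIL policy (assumption, see the lead-in). */
static BlockSpec choose_block_spec(TimeoutPolicy policy, long long query_timeout_ms) {
  BlockSpec spec = { on_reply, NULL, on_free_privdata, 0 };
  if (policy == POLICY_FAIL && query_timeout_ms > 0) {
    spec.timeout_cb = on_timeout_fail;
    spec.timeout_ms = query_timeout_ms;
  }
  return spec;
}
```

Keeping cleanup in one free callback means the reply and timeout paths never have to decide who frees the request state.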

2. Reducer Refactoring (searchResultReducer)

  • Allocates searchReducerCtx on the heap (was stack-allocated) and stores it in req->rctx
  • For FT.PROFILE, stores profile-specific data in searchProfileReducerCtx
  • Error paths use RedisModule_AbortBlock and call DistSearchFreePrivData directly
  • Success path calls RedisModule_UnblockClient(bc, mc) with private data
  • Removed inline timeout checking (now handled by Redis timeout callback)

3. Memory Leak Fix

  • Checks the return value of RedisModule_UnblockClient
  • If it returns REDISMODULE_ERR (client already unblocked by timeout), calls DistSearchFreePrivData directly to avoid leaking resources
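
A minimal model of that ownership rule, assuming (as the description above states) that the unblock call reports an error once the timeout has already released the client. `FakeBlockedClient`, `fake_unblock`, `fake_server_finish` and `finish_reduce` are hypothetical stand-ins, not Redis module functions.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_OK 0
#define SKETCH_ERR 1

/* Stand-in for a blocked client; not the Redis module type. */
typedef struct {
  bool unblocked;   /* set once a timeout or a reply has released the client */
  void *privdata;   /* owned by the "server" after a successful unblock */
  int frees;        /* counts how many times private data was released */
} FakeBlockedClient;

/* Models an unblock call that refuses a client that was already released. */
static int fake_unblock(FakeBlockedClient *bc, void *privdata) {
  if (bc->unblocked) return SKETCH_ERR;
  bc->unblocked = true;
  bc->privdata = privdata;
  return SKETCH_OK;
}

static void free_privdata(FakeBlockedClient *bc, void *privdata) {
  free(privdata);
  bc->frees++;
}

/* Models the server running the free callback after the reply callback. */
static void fake_server_finish(FakeBlockedClient *bc) {
  if (bc->privdata) {
    free_privdata(bc, bc->privdata);
    bc->privdata = NULL;
  }
}

/* Reducer success path: hand privdata over, or free it if the handover failed. */
static void finish_reduce(FakeBlockedClient *bc, void *privdata) {
  if (fake_unblock(bc, privdata) != SKETCH_OK) {
    /* The timeout won the race, so nobody else owns privdata: free it here. */
    free_privdata(bc, privdata);
  }
}
```

In both outcomes the private data is released exactly once, which is the property the fix is after.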

4. Type Definition Fix

  • Changed typedef struct { ... } searchReducerCtx; to typedef struct searchReducerCtx { ... } searchReducerCtx;
  • This ensures struct searchReducerCtx * and searchReducerCtx * are compatible types
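
A small standalone illustration of why the tag matters. Only the name `searchReducerCtx` comes from the PR; `requestSketch`, `replies_of` and the `totalReplies` field are made up for the example.

```c
#include <stddef.h>

/* A header can forward-declare the tagged struct and hold a pointer to it. */
struct searchReducerCtx;

typedef struct {
  struct searchReducerCtx *rctx;  /* pointer to an incomplete type is fine */
} requestSketch;

/* Tagged definition: `struct searchReducerCtx` and `searchReducerCtx`
   now name the same type, so the pointer above needs no cast. */
typedef struct searchReducerCtx {
  size_t totalReplies;  /* hypothetical field, for illustration only */
} searchReducerCtx;

static size_t replies_of(const requestSketch *req) {
  searchReducerCtx *r = req->rctx;  /* same type: no cast, no warning */
  return r ? r->totalReplies : 0;
}
```

With the untagged form `typedef struct { ... } searchReducerCtx;`, the forward declaration would name a different, never-completed type, and the assignment inside `replies_of` would be an incompatible-pointer conversion.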

Flow Diagram

Client Request (FT.SEARCH)
    │
    ▼
DistSearchBlockClientWithTimeout()
    │ - Registers: reply_callback, timeout_callback, free_privdata
    ▼
MR_Fanout() → shards
    │
    ▼
searchResultReducer (background thread)
    ├── Error? ──► AbortBlock() + DistSearchFreePrivData()
    │
    └── Success? ─► UnblockClient(bc, mc)
                        │
                        ├── OK ──► DistSearchUnblockClient (reply)
                        │              └──► DistSearchFreePrivData (cleanup)
                        │
                        └── ERR (timeout race) ──► DistSearchFreePrivData() directly

Testing

Added new test file tests/pytests/test_blocked_client_timeout.py with:

  • test_fail_timeout_search - Tests timeout handling for FT.SEARCH
  • test_fail_timeout_profile - Tests timeout handling for FT.PROFILE

The tests simulate timeout by pausing a shard and triggering client unblock via CLIENT UNBLOCK ... TIMEOUT.

  • This PR requires release notes
  • This PR does not require release notes

Note

Medium Risk
Refactors coordinator-side distributed search to rely on Redis blocking client reply/timeout/free callbacks and changes timeout parsing/type, which can affect query lifecycle, error propagation, and memory cleanup under race/timeout conditions.

Overview
Coordinator-mode FT.SEARCH/FT.PROFILE is refactored to follow Redis blocked-client best practices: reduction now runs in the background, but all replies/errors/timeouts are sent on the main thread via DistSearchUnblockClient, a new timeout callback, and a new free_privdata cleanup path.

Timeout handling is moved from elapsed-time checks in the reducer to Redis’s native blocked-client timeout mechanism (with early TIMEOUT arg parsing), and error propagation is unified by storing a QueryError inside MRCtx (including a new MR_CreateBailoutCtx path for early failures).

Misc fixes include switching parseTimeout to size_t/AC_GetSize, zeroing MRCommand on free, and adding cluster tests (test_blocked_client_timeout.py) that simulate coordinator/shard stalls and validate FAIL-timeout behavior.

Written by Cursor Bugbot for commit 1d2f821. This will update automatically on new commits.

@lerman25 lerman25 force-pushed the omerl-search-coord-new-timeout branch from 2b094cd to 76fd0d0 January 27, 2026 15:53

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 95.49550% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.55%. Comparing base (f1e6935) to head (1d2f821).
⚠️ Report is 19 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/module.c | 94.79% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8191      +/-   ##
==========================================
- Coverage   82.57%   82.55%   -0.02%     
==========================================
  Files         368      383      +15     
  Lines       55366    56330     +964     
  Branches    14340    15161     +821     
==========================================
+ Hits        45719    46506     +787     
- Misses       9499     9670     +171     
- Partials      148      154       +6     
| Flag | Coverage Δ | |
|---|---|---|
| flow | 83.82% <95.49%> (-0.68%) | ⬇️ |
| unit | 50.48% <0.90%> (+0.82%) | ⬆️ |

fcostaoliveira commented Jan 27, 2026

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

In summary:

  • Detected a total of 35 stable tests between versions.
  • Detected a total of 6 highly unstable benchmarks (6 baseline).
  • Detected a total of 1 improvement above the improvement water line.
  • Detected a total of 1 regression below the regression water line of 8.0%.

You can check a comparison in detail via the grafana link

Performance Improvements - Comparison between master and omerl-search-coord-new-timeout.

Time Period from 30 days ago. (environment used: oss-standalone)

| Test Case | Baseline master (median obs. +- std.dev) | Comparison omerl-search-coord-new-timeout (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---|---|---|
| search-numeric-sortby | 2339 +- 2.7% (4 datapoints) | 3683 | 57.4% | IMPROVEMENT |

Performance Regressions and Issues - Comparison between master and omerl-search-coord-new-timeout.

Time Period from 30 days ago. (environment used: oss-standalone)

| Test Case | Baseline master (median obs. +- std.dev) | Comparison omerl-search-coord-new-timeout (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---|---|---|
| search-numeric | 3019 +- 25.3% UNSTABLE (4 datapoints) | 2208 | -26.9% | UNSTABLE (baseline high variance); server: FT.SEARCH p50 increased 34.4% (baseline CV=25.6%); client: client latency stable; only server side confirms regression (client side stable) - insufficient evidence |
| ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable | 38 +- 14.5% UNSTABLE (7 datapoints) | 32 | -17.3% | UNSTABLE (baseline high variance); server: FT.SEARCH p50 increased 12.1% (baseline CV=9.3%); client: client latency stable; only server side confirms regression (client side stable) - insufficient evidence |
| search-ftsb-10K-enwiki_abstract-hashes-fulltext-search-sortby-limit-0-100 | 938 +- 3.1% (4 datapoints) | 834 | -11.1% | REGRESSION |
| search-numeric-sortby-desc-optimize | 29 +- 10.5% UNSTABLE (7 datapoints) | 29 | -1.0% | UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression |
| ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query | 820 +- 19.9% UNSTABLE (7 datapoints) | 820 | 0.1% | UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression |
| ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable | 970 +- 13.1% UNSTABLE (7 datapoints) | 1002 | 3.3% | UNSTABLE (baseline high variance); server: p50 latency stable; client: client latency stable; neither server nor client side confirms regression |
| search-numeric-sortby-desc | 2298 +- 31.0% UNSTABLE (7 datapoints) | 3615 | 57.3% | UNSTABLE (baseline high variance); server: FT.SEARCH p50 decreased 37.3% (baseline CV=22.1%); client: Latency decreased 36.4% (baseline CV=21.1%); neither server nor client side confirms regression |
Tests with No Significant Changes (35 tests)

| Test Case | Baseline master (median obs. +- std.dev) | Comparison omerl-search-coord-new-timeout (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---|---|---|
| ftsb-10K-enwiki_abstract-hashes-fulltext-sortby | 72 +- 5.2% (7 datapoints) | 69.00 | -2.9% | No Change |
| ftsb-10K-enwiki_abstract-hashes-term-prefix | 6157 +- 2.8% (7 datapoints) | 6086.00 | -1.2% | No Change |
| ftsb-10K-enwiki_abstract-hashes-term-suffix | 2274 +- 2.7% (7 datapoints) | 2280.00 | 0.3% | No Change |
| ftsb-10K-enwiki_abstract-hashes-term-suffix-withsuffixtrie | 16468 +- 1.1% (7 datapoints) | 16427.00 | -0.2% | No Change |
| ftsb-10K-enwiki_abstract-hashes-term-wildcard | 8837 +- 5.6% (7 datapoints) | 8916.00 | 0.9% | No Change |
| ftsb-10K-enwiki_pages-hashes-fulltext-mixed_simple-1word-query_write_1_to_read_20.yml | 1013 +- 2.2% (7 datapoints) | 1027.00 | 1.3% | No Change |
| ftsb-10K-enwiki_pages-hashes-load | 62825 +- 4.2% (7 datapoints) | 57840.00 | -7.9% | potential REGRESSION |
| ftsb-10K-multivalue-numeric-json | 993 +- 0.8% (7 datapoints) | 977.00 | -1.7% | No Change |
| ftsb-10K-singlevalue-numeric-json | 483 +- 0.3% (4 datapoints) | 473.00 | -2.1% | No Change |
| ftsb-1K-enwiki_abstract-hashes-term-contains | 1934 +- 2.0% (7 datapoints) | 1989.00 | 2.8% | No Change |
| ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query | 393 +- 7.9% (7 datapoints) | 400.00 | 1.7% | No Change |
| ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query | 3062 +- 5.1% (7 datapoints) | 3110.00 | 1.6% | No Change |
| ftsb-1M-enwiki_abstract-hashes-load | 21305 +- 4.7% (7 datapoints) | 21615.00 | 1.5% | No Change |
| ftsb-1M-nyc_taxis-ftadd-load | 27761 +- 2.8% (7 datapoints) | 28702.00 | 3.4% | potential IMPROVEMENT |
| ftsb-1M-nyc_taxis-hashes-load | 29147 +- 3.6% (7 datapoints) | 29606.00 | 1.6% | No Change |
| search-aggregate-post-filter-simple.yml | 17491 +- 1.4% (7 datapoints) | 16934.00 | -3.2% | potential REGRESSION |
| search-filtering-tag-numeric | 272 +- 9.2% (7 datapoints) | 294.00 | 8.2% | waterline=9.2%. potential IMPROVEMENT |
| search-filtering-tag-numeric-filter-pipeline | 10939 +- 1.4% (7 datapoints) | 10953.00 | 0.1% | No Change |
| search-ftsb-10K-enwiki_abstract-hashes-fulltext-aggregate-sortby-limit-0-100 | 839 +- 2.4% (7 datapoints) | 812.00 | -3.3% | potential REGRESSION |
| search-ftsb-10K-enwiki_abstract-hashes-term-withoutsuffix-trie | 14291 +- 1.2% (7 datapoints) | 14152.00 | -1.0% | No Change |
| search-ftsb-10K-enwiki_abstract-hashes-term-withsuffix-trie | 14187 +- 1.7% (4 datapoints) | 14148.00 | -0.3% | No Change |
| search-ftsb-1700K-docs-union-iterators-q3 | 8.0 +- 1.4% (7 datapoints) | 8.10 | 0.5% | No Change |
| search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-non-sortable | 169 +- 5.8% (4 datapoints) | 174.00 | 3.3% | potential IMPROVEMENT |
| search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-one-indexed-field | 7507 +- 2.2% (4 datapoints) | 7516.00 | 0.1% | No Change |
| search-ftsb-370K-docs-union-iterators-q4 | 8.3 +- 0.9% (7 datapoints) | 8.30 | -0.1% | No Change |
| search-ftsb-5200K-docs-union-iterators-q1 | 0.84 +- 1.4% (7 datapoints) | 0.85 | 1.2% | No Change |
| search-ftsb-5500K-docs-union-iterators-q2 | 1.2 +- 0.8% (7 datapoints) | 1.20 | 1.7% | No Change |
| search-geo | 220 +- 2.2% (4 datapoints) | 226.00 | 2.7% | No Change |
| search-high-cardinality-negation-term-baseline | 37 +- 1.2% (7 datapoints) | 37.00 | -0.7% | No Change |
| search-high-cardinality-negation-term-comparison_union_all_other_terms | 14 +- 2.0% (7 datapoints) | 14.00 | 0.8% | No Change |
| search-numeric-optimize | 7982 +- 1.1% (7 datapoints) | 8111.00 | 1.6% | No Change |
| search-numeric-sortby-optimize | 28 +- 8.1% (7 datapoints) | 29.00 | 4.6% | waterline=8.1%. potential IMPROVEMENT |
| vecsim-arxiv-titles-384-angular-filters-m16-ef-128-fulltext-filter | 613 +- 2.8% (7 datapoints) | 593.00 | -3.2% | potential REGRESSION |
| vecsim-arxiv-titles-384-angular-filters-m16-ef-128-numeric-filter | 156 +- 9.2% (7 datapoints) | 158.00 | 1.5% | waterline=9.2%. No Change |
| vecsim-arxiv-titles-384-angular-filters-m16-ef-128-tag-filter | 15731 +- 1.4% (7 datapoints) | 16154.00 | 2.7% | No Change |

@lerman25 lerman25 changed the title Omerl search coord new timeout [MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator Jan 28, 2026
@lerman25 lerman25 marked this pull request as draft January 28, 2026 09:48
@lerman25 lerman25 marked this pull request as ready for review January 28, 2026 09:48
GuyAv46 previously approved these changes Jan 28, 2026
src/module.c Outdated
QueryErrorsGlobalStats_UpdateError(errCode, 1, COORD_ERR_WARN);
res = MR_ReplyWithMRReply(reply, curr_rep);
goto cleanup;
QueryError_SetError(&rCtx->status, errCode, NULL);

Shard error messages lost when replying to client

Medium Severity

When a shard returns an error, rCtx->lastError is set to the original reply containing the specific error message, but QueryError_SetError is called with NULL as the message parameter. Later in DistSearchUnblockClient, only rCtx->status is used to reply via QueryError_ReplyAndClear, which will produce a generic error message for the error code. The original error's specific message text is stored but never used, resulting in loss of diagnostic information. The old code used MR_ReplyWithMRReply(reply, curr_rep) to preserve the exact error message from the shard.
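
One way to keep the shard's text, sketched with stand-in types. `ErrorSketch`, `set_error` and `default_message` are hypothetical and do not mirror the real QueryError API; the message strings are placeholders.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
  int code;          /* 0 means no error */
  char detail[128];  /* message that will be sent to the client */
} ErrorSketch;

/* Hypothetical generic text used only when no specific message is given. */
static const char *default_message(int code) {
  return code == 1 ? "generic timeout" : "generic error";
}

/* Copy the specific message when there is one; fall back to the generic
   text otherwise. Copying (rather than keeping the pointer) means the
   shard reply can be freed before the client is answered. */
static void set_error(ErrorSketch *e, int code, const char *msg) {
  const char *text = (msg && msg[0]) ? msg : default_message(code);
  size_t n = strlen(text);
  if (n >= sizeof e->detail) n = sizeof e->detail - 1;
  memcpy(e->detail, text, n);
  e->detail[n] = '\0';
  e->code = code;
}
```

Passing the shard's own message instead of NULL is the part that preserves the diagnostic detail.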

}

typedef RedisModuleCmdFunc BlockedClientTimeoutCB ;
typedef void (*BlockedClientFreePrivDataCB) (RedisModuleCtx *ctx, void *privdata);

Unnecessary typedefs and constant variable add complexity

Low Severity

The typedefs BlockedClientTimeoutCB and BlockedClientFreePrivDataCB are each used only once within DistSearchBlockClientWithTimeout. Additionally, freePrivDataCallback is always assigned to DistSearchFreePrivData and never changes, making it an unnecessary intermediate variable. These abstractions don't add semantic value and could be simplified by using the types directly and inlining DistSearchFreePrivData in the RedisModule_BlockClient call. This aligns with the PR reviewer comment "Avoid passing arguments you don't need."

@lerman25 lerman25 disabled auto-merge January 28, 2026 15:34
profileSearchReply(reply, rCtx, MRCtx_GetNumReplied(mrctx), MRCtx_GetReplies(mrctx), &req->profileClock, rs_wall_clock_now_ns());
} else {
// Non-profile command
sendSearchResults(reply, rCtx);

NULL dereference when fanout has zero expected replies

High Severity

When the fanout command sends to zero shards (numExpected == 0), the code path in uvFanoutRequest directly unblocks the client without calling searchResultReducer and without setting an error. The new DistSearchUnblockClient only checks for explicit errors via QueryError_HasError, then proceeds to access req->rctx. Since the reducer never ran, req->rctx is NULL (from rm_calloc in searchRequestCtx_New). Calling sendSearchResults or profileSearchReply with NULL rCtx causes a NULL pointer dereference crash. The old code handled this with MRCtx_GetNumReplied(mrctx) == 0 check, but this was removed.
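
A defensive shape for the reply callback, as a sketch only. `pick_reply` and the sketch types are hypothetical, and what the client should actually receive in the no-results case is left open here.

```c
#include <stddef.h>

typedef struct { int resultCount; } ReducerCtxSketch;
typedef struct { ReducerCtxSketch *rctx; } RequestSketch;

typedef enum { REPLY_RESULTS, REPLY_NO_RESULTS, REPLY_ERROR } ReplyKind;

/* Decide what to send before touching req->rctx:
   1. an explicit error always wins;
   2. a missing request or reducer context means the reducer never ran;
   3. only then is it safe to read the reduced results. */
static ReplyKind pick_reply(int hasError, const RequestSketch *req) {
  if (hasError) return REPLY_ERROR;
  if (req == NULL || req->rctx == NULL) return REPLY_NO_RESULTS;
  return REPLY_RESULTS;
}
```

The same check also covers a bailout that leaves the private data NULL without recording an error.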

// cmd can be NULL in case of bailout
if(ctx->cmd) {
MRCommand_Free(&ctx->cmd);
}

Invalid struct truthiness check in MRCtx_Free

High Severity

The check if(ctx->cmd) is invalid because MRCommand cmd is a struct, not a pointer. In C, a struct cannot be evaluated for truthiness directly. The comment states "cmd can be NULL in case of bailout" but structs cannot be NULL. Additionally, MR_CreateCtx uses rm_malloc without initializing the cmd field, so in the bailout path (MR_CreateBailoutCtx), the cmd field contains garbage memory. If this check somehow passes, calling MRCommand_Free on uninitialized data will cause crashes or memory corruption.
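
A sketch of a free routine that is safe on a never-initialized (but zeroed) or already-freed command. `CommandSketch` and its fields are hypothetical and do not reflect the real `MRCommand` layout; the approach assumes the owning context is zero-initialized on the bailout path.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
  char **strs;  /* NULL when the command was never built */
  int num;
} CommandSketch;

static int command_init(CommandSketch *cmd, int num) {
  cmd->strs = calloc((size_t)num, sizeof *cmd->strs);
  cmd->num = cmd->strs ? num : 0;
  return cmd->strs ? 0 : -1;
}

/* `if (cmd)` on the struct value itself does not compile, so the pointer
   member is what gets tested. Zeroing afterwards makes a second call a no-op. */
static void command_free(CommandSketch *cmd) {
  if (cmd->strs) {
    for (int i = 0; i < cmd->num; i++) free(cmd->strs[i]);
    free(cmd->strs);
  }
  memset(cmd, 0, sizeof *cmd);
}
```

This only helps if the struct starts out zeroed; memory from a plain malloc would still hold garbage in `strs`.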

searchRequestCtx *req = MRCtx_GetPrivData(mrctx);

searchReducerCtx *rCtx = req->rctx;

NULL dereference when bailout called without error status

High Severity

The new DistSearchUnblockClient callback assumes an error is always set when bailOut is called, but rscParseRequest can return NULL without setting an error (when LIMIT values are negative at lines 2171-2173). In this case, QueryError_HasError returns false, the early return is skipped, and MRCtx_GetPrivData(mrctx) returns NULL (since bailout sets privdata to NULL). The subsequent req->rctx dereference causes a NULL pointer crash.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Used when we need to send an error to the client, and we don't expect any replies.
The status parameter is used to pass the error to the client after we unblock it, must not be NULL or OK.*/
MRCtx *MR_CreateBailoutCtx(RedisModuleCtx *ctx, RedisModuleBlockedClient *bc, QueryError *status) {
RS_ASSERT(status && QueryError_HasError(status));

Assertion failure when rscParseRequest fails without setting error

Medium Severity

The new MR_CreateBailoutCtx function has RS_ASSERT(status && QueryError_HasError(status)), but bailOut can be called from createReq when rscParseRequest returns NULL without setting an error. Existing code paths like invalid LIMIT values (line 2171-2173) or malformed SORTBY (line 2184-2186) return NULL without calling QueryError_SetError, causing the assertion to fail when bailOut attempts to create the bailout context.
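
One possible guard, sketched with stand-in types: make sure an error is recorded before the bailout context is created, without overwriting one that is already there. `StatusSketch`, `ensure_error` and the fallback text are hypothetical.

```c
#include <stddef.h>

enum { ERR_NONE = 0, ERR_GENERIC = 1 };

typedef struct {
  int code;         /* ERR_NONE when nothing has been recorded */
  const char *msg;  /* static text in this sketch, so no ownership issues */
} StatusSketch;

static int status_has_error(const StatusSketch *s) {
  return s->code != ERR_NONE;
}

/* Guarantee that a bailout always carries an error, even if the parser
   reported failure without recording one. An existing error is kept. */
static void ensure_error(StatusSketch *s) {
  if (!status_has_error(s)) {
    s->code = ERR_GENERIC;
    s->msg = "request could not be parsed";  /* placeholder text */
  }
}
```

Calling such a guard just before the bailout keeps the assertion valid for parser paths that return NULL silently.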

@lerman25 lerman25 enabled auto-merge January 29, 2026 12:57
@lerman25 lerman25 added this pull request to the merge queue Jan 29, 2026
Merged via the queue into master with commit 877718a Jan 29, 2026
82 checks passed
@lerman25 lerman25 deleted the omerl-search-coord-new-timeout branch January 29, 2026 15:22
kei-nan pushed a commit that referenced this pull request Feb 3, 2026
…nator (#8191)

* draft

* Use abort on error path

* use free priv data

* test

* avoid leak

* skip SA

* Cursor comments

* profile reply with unblock

* remove assert

* test profile

* Cleanup for PR

* move postProcess to before unblocking the client

* test timeout before fanout

* remove double postProcess

* review Phase1

* Phase 2

* fast check resp3

* fix resp3

* move clear errror

* Fix error pass

* Move error handling to MRCTX

* check cmd before free

* fix

* Better assert and comment

* Change check

* Assign Query error if not assigned

* MRcommand free safe
pull bot pushed a commit to Mu-L/RediSearch that referenced this pull request Feb 3, 2026
* initial commit

(cherry picked from commit ab63a17)

* fix api

* some redesign of how search results are maintained

* expose link_static_lib to allow disk to use it

* add a way to control if async api will be used

* fail debug command if async io is not supported

* separate functions

* update debug test and command

* simplify some of the flow

* put async after regular version

* use double buffering

* dynamically increase index result buffer

* use double linked list to manage and track reads

* code review fixes

* change disk api to simplify passing and cleaning up dmd pointer

* code review fixes

* simplify error handling api

* fix compilation

* few fixes

* fix cursor code review comments

* minor fixes

* code review fixes

* remove check

* pass right allocate callback

* moved async state to its own file, added state machine like tests

* fix cursor's code review comment

* code review fix

* address code review comments

* fix code review comments

* small fix

* fix code review comments

* minor refactoring to address code review comment

* cide review fixes

* code review comment

* address code review comments

* use case insensitive comparison

* update header

* [MOD-13616] Use new strict FAIL timeout mechanism in FT.SEARCH coordinator (RediSearch#8191)

* draft

* Use abort on error path

* use free priv data

* test

* avoid leak

* skip SA

* Cursor comments

* profile reply with unblock

* remove assert

* test profile

* Cleanup for PR

* move postProcess to before unblocking the client

* test timeout before fanout

* remove double postProcess

* review Phase1

* Phase 2

* fast check resp3

* fix resp3

* move clear errror

* Fix error pass

* Move error handling to MRCTX

* check cmd before free

* fix

* Better assert and comment

* Change check

* Assign Query error if not assigned

* MRcommand free safe

* MOD-13701: Update deepdiff dependency version in pyproject.toml to >=8.6.1 (RediSearch#8212)

* Update deepdiff dependency version in pyproject.toml to >=8.6.1

* Update dependency versions for deepdiff and orderly-set in uv.lock and pyproject.toml

* Update deepdiff and orderly-set versions in uv.lock and pyproject.toml to latest releases

* Pin deepdiff version to 8.6.1 in uv.lock and pyproject.toml for consistency

* ci: update redisbench-admin (RediSearch#8213)

* update to make sure load is recycled

* bump redisbench-admin version

* benchmark reqs update

* update ver redisbench-admin

* bump ver

* update redisbench-admin

* remove dataset name from load benchmark

* Update tests/benchmarks/search-msmarco-6M-documents-load.yml

* fix: Don't check license headers in the target directory (RediSearch#8226)

* MOD-13602 Add queue time tracking to FT.PROFILE  (RediSearch#8210)

* Add validation tests for FT.PROFILE queue time bug

Add design document and validation tests that confirm the bug exists:
- testParsingTimeIncludesWorkersQueueTime_BUG: Confirms workers queue
  wait time is incorrectly included in 'Parsing time' (bug)
- testParsingTimeDoesNotIncludeCoordQueueTime: Confirms coordinator
  queue time is correctly separate from shard's Parsing time

Both tests pass, confirming the bug exists as described in the design doc.

* Implement Part 1: Workers queue time in FT.PROFILE

- Add profileQueueTime field to AREQ struct in aggregate.h
- Capture queue time at start of AREQ_Execute_Callback() in aggregate_exec.c
  - Queue time = elapsed time since initClock was set before enqueueing
  - Reset initClock after capturing queue time for accurate parsing time
- Print 'Workers queue time' in profile output in profile.c
- Add testWorkersQueueTimeInProfile test to verify the fix
- Mark testParsingTimeIncludesWorkersQueueTime_BUG as skipped (bug is fixed)

This fixes the bug where FT.PROFILE's 'Parsing time' incorrectly included
time spent waiting in the workers thread pool queue.

* Add coordinator queue time tracking to FT.PROFILE

Track time spent waiting in the coordinator thread pool queue and
report it as 'Coordinator queue time' in FT.PROFILE output.

Changes:
- Add coordQueueTime field to ConcurrentSearchHandlerCtx struct
- Add coordQueueTime field to searchRequestCtx struct
- Calculate queue time in DistSearchCommandHandler and DEBUG_DistSearchCommandHandler
- Copy queue time to searchRequestCtx in FlatSearchCommandHandler
- Print 'Coordinator queue time' in profileSearchReplyCoordinator

* Add test for coordinator queue time in FT.PROFILE

Add testCoordinatorQueueTimeInProfile to verify that coordinator queue
time is correctly captured in cluster mode when the coordinator thread
pool is paused.

* Remove design document (intermediate artifact)

* Fix tests for new Workers queue time and Coordinator queue time fields

* Address Cursor Bugbot review: remove obsolete test and unused env parameters

* Add AGENTS.local.md to .gitignore

* Fix test expectations for new Workers queue time field in FT.PROFILE output

* Remove redundant comments from queue time tracking code

* Fix test_RED_86036 index after adding Workers queue time to FT.PROFILE

* Fix dead code issue

* Revert unrelated changes to AGENTS.md and .gitignore

* CR- larger debug pauses, clock only on profile and different output order

* CR- dont init clock

* MOD-13357: Add disk expiration support in OSS side (RediSearch#8218)

Add disk expiration support in OSS side

* try and align with master

* add assert

---------

Co-authored-by: lerman25 <58445352+lerman25@users.noreply.github.com>
Co-authored-by: Itzikvaknin <82322982+Itzikvaknin@users.noreply.github.com>
Co-authored-by: Joan Fontanals <jfontanalsmartinez@gmail.com>
Co-authored-by: Luca Palmieri <20745048+LukeMathWalker@users.noreply.github.com>
Co-authored-by: ofiryanai <ofiryanai1@gmail.com>