MOD-5948: Respect timeout policy P1 (single shard) #4038

raz-mon · 2023-11-09T12:46:47Z

Currently, the ON_TIMEOUT configuration parameter isn't being respected well enough, mainly due to the following:
The main problem today, is that the sendChunk function populates the response to the client 'on the fly', i.e., as results from the pipeline are received. This is problematic, since there is no discarding API for a Redis response, such that if we experience a timeout throughout pipeline execution, we can no longer report this to the client (in Resp3 we can report it, but it will have to be accompanied by the results that were already serialized to the response, which is incompatible with our desires of the ON_TIMEOUT FAIL timeout policy).

In order to respect the ON_TIMEOUT FAIL timeout policy, we aggregate the results from the pipeline prior to populating the response to the client, such that we can reply with an error/the results respectively to whether we experienced a timeout throughout the execution of the whole command or not. This is also a good opportunity to check for timeout, since we cover the whole pipeline execution, rather than only the first phase of it, which is what we cover today - thus this is added as well.

Note 1: We are aware that this is the "aggressive" approach to this fix, i.e., one that may yield serious performance regressions. The decision was made mainly due to the 2 next factors:
1. This is a first-phase modification. As discussed with @MeirShpilraien, we should have the response-discarding API for the Redis response for redis 8, such that we will be able to remove all self-made aggregations (introduced by this PR) and by thus all performance regressions. Thus, this is a temporary hit only.
2. The performance regressions are experienced ONLY when the timeout policy is FAIL, i.e., no performance hits will be experienced to customers that don't want the strict timeout behavior.
3. We have the open window to relax the potential hit with numerous optimizations, such as performing the mentioned aggregation only when we have no depleting results-processors in the pipeline, etc.. Today, we choose this implementation for maximal timeout coverage (for maximal user-credibility), and unity among the different commands' behavior and timeout reporting qualities.
Note 2: This PR handles the single-shard build only. A following one will handle the cluster build (coordinator).

Also, this PR refactors the sendChunk function, which is complex and inefficient (and incorrect in several places).

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

codecov · 2023-11-20T11:01:47Z

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (2c1d4b3) 82.99% compared to head (4d5b42d) 83.10%.
Report is 2 commits behind head on master.

Files	Patch %	Lines
src/aggregate/aggregate_exec.c	98.59%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4038      +/-   ##
==========================================
+ Coverage   82.99%   83.10%   +0.10%     
==========================================
  Files         192      192              
  Lines       32757    32794      +37     
==========================================
+ Hits        27188    27253      +65     
+ Misses       5569     5541      -28

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

filipecosta90 · 2023-11-22T10:14:19Z

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

In summary:

Detected a total of 25 stable tests between versions.
Detected a total of 11 highly unstable benchmarks.
Detected a total of 3 improvements above the improvement water line.
Detected a total of 2 regressions bellow the regression water line 5.0.

You can check a comparison in detail via the grafana link

Comparison between master and razmon-respect_timeout_policy_single_shard.

Time Period from 30 days ago. (environment used: oss-standalone)

Test Case	Baseline master (median obs. +- std.dev)	Comparison razmon-respect_timeout_policy_single_shard (median obs. +- std.dev)	% change (higher-better)	Note
ftsb-10K-enwiki_abstract-hashes-fulltext-sortby	68 +- 1.3% (7 datapoints)	70	2.5%	No Change
ftsb-10K-enwiki_abstract-hashes-term-prefix	7183 +- 3.7% (7 datapoints)	7366	2.5%	No Change
ftsb-10K-enwiki_abstract-hashes-term-suffix	2094 +- 5.0% (7 datapoints)	2178	4.0%	potential IMPROVEMENT
ftsb-10K-enwiki_abstract-hashes-term-suffix-withsuffixtrie	64206 +- 6.0% (7 datapoints)	71342	11.1%	waterline=6.0%. IMPROVEMENT
ftsb-10K-enwiki_abstract-hashes-term-wildcard	12232 +- 5.8% (7 datapoints)	12833	4.9%	waterline=5.8%. potential IMPROVEMENT
ftsb-10K-enwiki_pages-hashes-fulltext-mixed_simple-1word-query_write_1_to_read_20.yml	1247 +- 2.7% (7 datapoints)	1248	0.1%	No Change
ftsb-10K-enwiki_pages-hashes-load	53925 +- 4.1% (7 datapoints)	52741	-2.2%	No Change
ftsb-1K-enwiki_abstract-hashes-term-contains	1750 +- 4.1% (7 datapoints)	1827	4.3%	potential IMPROVEMENT
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query	326 +- 10.6% UNSTABLE (7 datapoints)	364	11.7%	UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable	36 +- 16.4% UNSTABLE (7 datapoints)	41	14.0%	UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query@100_ops_sec	100 +- 0.0% (7 datapoints)	100	-0.0%	No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query	2566 +- 7.8% (7 datapoints)	2649	3.2%	waterline=7.8%. potential IMPROVEMENT
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable	953 +- 12.6% UNSTABLE (7 datapoints)	867	-9.1%	UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query@100_ops_sec	100 +- 0.0% (7 datapoints)	100	0.0%	No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query	908 +- 11.9% UNSTABLE (7 datapoints)	1030	13.4%	UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query@100_ops_sec	100 +- 0.0% (7 datapoints)	100	0.0%	No Change
ftsb-1M-enwiki_abstract-hashes-load	23455 +- 6.2% (7 datapoints)	23556	0.4%	waterline=6.2%. No Change
ftsb-1M-nyc_taxis-ftadd-load	21580 +- 2.7% (7 datapoints)	22807	5.7%	IMPROVEMENT
ftsb-1M-nyc_taxis-hashes-load	22957 +- 1.2% (7 datapoints)	22775	-0.8%	No Change
search-aggregate-post-filter-simple.yml	86081 +- 4.2% (7 datapoints)	97608	13.4%	IMPROVEMENT
search-filtering-tag-numeric	181 +- 16.4% UNSTABLE (7 datapoints)	185	2.2%	UNSTABLE (very high variance)
search-filtering-tag-numeric-filter-pipeline	18986 +- 1.9% (7 datapoints)	18220	-4.0%	potential REGRESSION
search-ftsb-10K-enwiki_abstract-hashes-term-withoutsuffix-trie	40428 +- 2.8% (7 datapoints)	39550	-2.2%	No Change
search-ftsb-10K-enwiki_abstract-hashes-term-withsuffix-trie	38228 +- 2.7% (7 datapoints)	37242	-2.6%	No Change
search-ftsb-1700K-docs-union-iterators-q3	6 +- 1.2% (7 datapoints)	6	-0.5%	No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable@50_ops_sec	50 +- 21.0% UNSTABLE (7 datapoints)	50	0.2%	UNSTABLE (very high variance)
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable@100_ops_sec	100 +- 0.0% (7 datapoints)	100	-0.0%	No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-non-sortable	147 +- 9.3% (7 datapoints)	156	6.1%	waterline=9.3%. potential IMPROVEMENT
search-ftsb-370K-docs-union-iterators-q4	7 +- 1.1% (7 datapoints)	7	0.0%
search-ftsb-5200K-docs-union-iterators-q1	1 +- 1.5% (7 datapoints)	1	-1.6%	No Change
search-ftsb-5500K-docs-union-iterators-q2	1 +- 34.3% UNSTABLE (7 datapoints)	1	-1.1%	UNSTABLE (very high variance)
search-geo	186 +- 3.5% (7 datapoints)	185	-0.5%	No Change
search-numeric	3216 +- 34.7% UNSTABLE (7 datapoints)	3465	7.7%	UNSTABLE (very high variance)
search-numeric-optimize	8983 +- 1.8% (7 datapoints)	9107	1.4%	No Change
search-numeric-sortby	3278 +- 19.8% UNSTABLE (7 datapoints)	3247	-0.9%	UNSTABLE (very high variance)
search-numeric-sortby-desc	3323 +- 29.5% UNSTABLE (7 datapoints)	5139	54.7%	UNSTABLE (very high variance)
search-numeric-sortby-desc-optimize	43 +- 26.6% UNSTABLE (7 datapoints)	53	22.1%	UNSTABLE (very high variance)
search-numeric-sortby-optimize	21 +- 4.1% (7 datapoints)	21	-1.2%	No Change
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-fulltext-filter	489 +- 0.8% (7 datapoints)	472	-3.5%	potential REGRESSION
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-numeric-filter	108 +- 1.3% (7 datapoints)	98	-9.5%	REGRESSION
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-tag-filter	58457 +- 3.9% (7 datapoints)	55050	-5.8%	REGRESSION

oshadmi

Some comments

tests/pytests/test_coordinator.py

src/aggregate/aggregate_request.c

src/aggregate/aggregate_exec.c

oshadmi

Reported performance degredations are not due to this PR (since ON_TIMEOUT is not FAIL - benchmark is using the default module configuration)

raz-mon · 2023-11-27T11:41:48Z

/backport

github-actions · 2023-11-27T13:29:26Z

Only merged pull requests can be backported.

github-actions · 2023-11-28T11:11:21Z

Backport failed for 2.8, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-4038-to-2.8 origin/2.8
cd .worktree/backport-4038-to-2.8
git switch --create backport-4038-to-2.8
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18

github-actions · 2023-11-28T11:11:23Z

Backport failed for 2.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-4038-to-2.10 origin/2.10
cd .worktree/backport-4038-to-2.10
git switch --create backport-4038-to-2.10
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18

* wip * wip * fix resp3 response * fix resp2 section * fix total results counter report * fix loop * fix response condition * fix condition * fix leak * add timeout check after aggregation in strict timeout policy * fix coordinator to wait for shard-reply BEFORE polling for timeout * fix test * fix test * wait for reply after polling for timeout (so we don't loose data) * fix leak * fix leak * add test * non-related code touchups * reposition response section * move duplicated code to function * some fixes * address reveiw * fix nelem increment * fix use after free * address review * address comment

This reverts commit ffda7d5.

* wip * wip * fix resp3 response * fix resp2 section * fix total results counter report * fix loop * fix response condition * fix condition * fix leak * add timeout check after aggregation in strict timeout policy * fix coordinator to wait for shard-reply BEFORE polling for timeout * fix test * fix test * wait for reply after polling for timeout (so we don't loose data) * fix leak * fix leak * add test * non-related code touchups * reposition response section * move duplicated code to function * some fixes * address reveiw * fix nelem increment * fix use after free * address review * address comment

raz-mon and others added 17 commits November 7, 2023 22:05

wip

82798f5

wip

705af18

fix resp3 response

c68751d

fix resp2 section

f014626

fix total results counter report

e04cfe1

fix loop

4fb0597

fix response condition

a46ecb7

fix condition

8b086bf

fix leak

2570f2b

add timeout check after aggregation in strict timeout policy

c583de9

fix coordinator to wait for shard-reply BEFORE polling for timeout

a54d221

fix test

722d8c5

Merge branch 'master' into razmon-respect_timeout_policy_single_shard

17fb71e

fix test

7b8b208

wait for reply after polling for timeout (so we don't loose data)

89f8d5b

fix leak

ffe42b8

fix leak

0b2b5ce

add test

44e1077

raz-mon requested review from nafraf and oshadmi November 20, 2023 12:02

raz-mon added 3 commits November 20, 2023 14:04

non-related code touchups

e5a3f66

reposition response section

471cac0

move duplicated code to function

8ab3f5d

raz-mon added the action:run-benchmark label Nov 22, 2023

nafraf previously approved these changes Nov 22, 2023

View reviewed changes

oshadmi reviewed Nov 22, 2023

View reviewed changes

raz-mon added 2 commits November 22, 2023 15:26

some fixes

8267e5b

address reveiw

45e0d87

fix use after free

f51d185

raz-mon requested a review from oshadmi November 23, 2023 08:39

raz-mon added 2 commits November 23, 2023 12:19

address review

7628c9a

address comment

4d5b42d

raz-mon enabled auto-merge November 27, 2023 10:01

oshadmi approved these changes Nov 27, 2023

View reviewed changes

raz-mon added this pull request to the merge queue Nov 27, 2023

raz-mon added backport 2.8 backport 2.10 labels Nov 27, 2023

raz-mon removed the action:run-benchmark label Nov 27, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2023

raz-mon added this pull request to the merge queue Nov 27, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 28, 2023

GuyAv46 added this pull request to the merge queue Nov 28, 2023

GuyAv46 removed this pull request from the merge queue due to a manual request Nov 28, 2023

GuyAv46 added this pull request to the merge queue Nov 28, 2023

Merged via the queue into master with commit 6429012 Nov 28, 2023
32 of 79 checks passed

GuyAv46 deleted the razmon-respect_timeout_policy_single_shard branch November 28, 2023 11:11

ephraimfeldblum added a commit that referenced this pull request Dec 17, 2023

Revert "MOD-5948: Respect timeout policy P1 (single shard) (#4038)"

4d4e455

This reverts commit ffda7d5.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MOD-5948: Respect timeout policy P1 (single shard) #4038

MOD-5948: Respect timeout policy P1 (single shard) #4038

raz-mon commented Nov 9, 2023 •

edited

codecov bot commented Nov 20, 2023 •

edited

filipecosta90 commented Nov 22, 2023 •

edited

oshadmi left a comment

oshadmi left a comment

raz-mon commented Nov 27, 2023

github-actions bot commented Nov 27, 2023

github-actions bot commented Nov 28, 2023

github-actions bot commented Nov 28, 2023

MOD-5948: Respect timeout policy P1 (single shard) #4038

MOD-5948: Respect timeout policy P1 (single shard) #4038

Conversation

raz-mon commented Nov 9, 2023 • edited

codecov bot commented Nov 20, 2023 • edited

Codecov Report

filipecosta90 commented Nov 22, 2023 • edited

Automated performance analysis summary

Comparison between master and razmon-respect_timeout_policy_single_shard.

oshadmi left a comment

Choose a reason for hiding this comment

oshadmi left a comment

Choose a reason for hiding this comment

raz-mon commented Nov 27, 2023

github-actions bot commented Nov 27, 2023

github-actions bot commented Nov 28, 2023

github-actions bot commented Nov 28, 2023

raz-mon commented Nov 9, 2023 •

edited

codecov bot commented Nov 20, 2023 •

edited

filipecosta90 commented Nov 22, 2023 •

edited