Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MOD-5948: Respect timeout policy P1 (single shard) #4038

Merged
merged 28 commits into from Nov 28, 2023

Conversation

raz-mon
Copy link
Collaborator

@raz-mon raz-mon commented Nov 9, 2023

Currently, the ON_TIMEOUT configuration parameter isn't being respected well enough, mainly due to the following:
The main problem today, is that the sendChunk function populates the response to the client 'on the fly', i.e., as results from the pipeline are received. This is problematic, since there is no discarding API for a Redis response, such that if we experience a timeout throughout pipeline execution, we can no longer report this to the client (in Resp3 we can report it, but it will have to be accompanied by the results that were already serialized to the response, which is incompatible with our desires of the ON_TIMEOUT FAIL timeout policy).

In order to respect the ON_TIMEOUT FAIL timeout policy, we aggregate the results from the pipeline prior to populating the response to the client, such that we can reply with an error/the results respectively to whether we experienced a timeout throughout the execution of the whole command or not. This is also a good opportunity to check for timeout, since we cover the whole pipeline execution, rather than only the first phase of it, which is what we cover today - thus this is added as well.

  • Note 1: We are aware that this is the "aggressive" approach to this fix, i.e., one that may yield serious performance regressions. The decision was made mainly due to the 2 next factors:

    1. This is a first-phase modification. As discussed with @MeirShpilraien, we should have the response-discarding API for the Redis response for redis 8, such that we will be able to remove all self-made aggregations (introduced by this PR) and by thus all performance regressions. Thus, this is a temporary hit only.
    2. The performance regressions are experienced ONLY when the timeout policy is FAIL, i.e., no performance hits will be experienced to customers that don't want the strict timeout behavior.
    3. We have the open window to relax the potential hit with numerous optimizations, such as performing the mentioned aggregation only when we have no depleting results-processors in the pipeline, etc.. Today, we choose this implementation for maximal timeout coverage (for maximal user-credibility), and unity among the different commands' behavior and timeout reporting qualities.
  • Note 2: This PR handles the single-shard build only. A following one will handle the cluster build (coordinator).

Also, this PR refactors the sendChunk function, which is complex and inefficient (and incorrect in several places).

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Copy link

codecov bot commented Nov 20, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (2c1d4b3) 82.99% compared to head (4d5b42d) 83.10%.
Report is 2 commits behind head on master.

Files Patch % Lines
src/aggregate/aggregate_exec.c 98.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4038      +/-   ##
==========================================
+ Coverage   82.99%   83.10%   +0.10%     
==========================================
  Files         192      192              
  Lines       32757    32794      +37     
==========================================
+ Hits        27188    27253      +65     
+ Misses       5569     5541      -28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@filipecosta90
Copy link
Collaborator

filipecosta90 commented Nov 22, 2023

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

In summary:

  • Detected a total of 25 stable tests between versions.
  • Detected a total of 11 highly unstable benchmarks.
  • Detected a total of 3 improvements above the improvement water line.
  • Detected a total of 2 regressions bellow the regression water line 5.0.

You can check a comparison in detail via the grafana link

Comparison between master and razmon-respect_timeout_policy_single_shard.

Time Period from 30 days ago. (environment used: oss-standalone)

Test Case Baseline master (median obs. +- std.dev) Comparison razmon-respect_timeout_policy_single_shard (median obs. +- std.dev) % change (higher-better) Note
ftsb-10K-enwiki_abstract-hashes-fulltext-sortby 68 +- 1.3% (7 datapoints) 70 2.5% No Change
ftsb-10K-enwiki_abstract-hashes-term-prefix 7183 +- 3.7% (7 datapoints) 7366 2.5% No Change
ftsb-10K-enwiki_abstract-hashes-term-suffix 2094 +- 5.0% (7 datapoints) 2178 4.0% potential IMPROVEMENT
ftsb-10K-enwiki_abstract-hashes-term-suffix-withsuffixtrie 64206 +- 6.0% (7 datapoints) 71342 11.1% waterline=6.0%. IMPROVEMENT
ftsb-10K-enwiki_abstract-hashes-term-wildcard 12232 +- 5.8% (7 datapoints) 12833 4.9% waterline=5.8%. potential IMPROVEMENT
ftsb-10K-enwiki_pages-hashes-fulltext-mixed_simple-1word-query_write_1_to_read_20.yml 1247 +- 2.7% (7 datapoints) 1248 0.1% No Change
ftsb-10K-enwiki_pages-hashes-load 53925 +- 4.1% (7 datapoints) 52741 -2.2% No Change
ftsb-1K-enwiki_abstract-hashes-term-contains 1750 +- 4.1% (7 datapoints) 1827 4.3% potential IMPROVEMENT
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query 326 +- 10.6% UNSTABLE (7 datapoints) 364 11.7% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable 36 +- 16.4% UNSTABLE (7 datapoints) 41 14.0% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100 -0.0% No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query 2566 +- 7.8% (7 datapoints) 2649 3.2% waterline=7.8%. potential IMPROVEMENT
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable 953 +- 12.6% UNSTABLE (7 datapoints) 867 -9.1% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100 0.0% No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query 908 +- 11.9% UNSTABLE (7 datapoints) 1030 13.4% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100 0.0% No Change
ftsb-1M-enwiki_abstract-hashes-load 23455 +- 6.2% (7 datapoints) 23556 0.4% waterline=6.2%. No Change
ftsb-1M-nyc_taxis-ftadd-load 21580 +- 2.7% (7 datapoints) 22807 5.7% IMPROVEMENT
ftsb-1M-nyc_taxis-hashes-load 22957 +- 1.2% (7 datapoints) 22775 -0.8% No Change
search-aggregate-post-filter-simple.yml 86081 +- 4.2% (7 datapoints) 97608 13.4% IMPROVEMENT
search-filtering-tag-numeric 181 +- 16.4% UNSTABLE (7 datapoints) 185 2.2% UNSTABLE (very high variance)
search-filtering-tag-numeric-filter-pipeline 18986 +- 1.9% (7 datapoints) 18220 -4.0% potential REGRESSION
search-ftsb-10K-enwiki_abstract-hashes-term-withoutsuffix-trie 40428 +- 2.8% (7 datapoints) 39550 -2.2% No Change
search-ftsb-10K-enwiki_abstract-hashes-term-withsuffix-trie 38228 +- 2.7% (7 datapoints) 37242 -2.6% No Change
search-ftsb-1700K-docs-union-iterators-q3 6 +- 1.2% (7 datapoints) 6 -0.5% No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable@50_ops_sec 50 +- 21.0% UNSTABLE (7 datapoints) 50 0.2% UNSTABLE (very high variance)
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable@100_ops_sec 100 +- 0.0% (7 datapoints) 100 -0.0% No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-non-sortable 147 +- 9.3% (7 datapoints) 156 6.1% waterline=9.3%. potential IMPROVEMENT
search-ftsb-370K-docs-union-iterators-q4 7 +- 1.1% (7 datapoints) 7 0.0%
search-ftsb-5200K-docs-union-iterators-q1 1 +- 1.5% (7 datapoints) 1 -1.6% No Change
search-ftsb-5500K-docs-union-iterators-q2 1 +- 34.3% UNSTABLE (7 datapoints) 1 -1.1% UNSTABLE (very high variance)
search-geo 186 +- 3.5% (7 datapoints) 185 -0.5% No Change
search-numeric 3216 +- 34.7% UNSTABLE (7 datapoints) 3465 7.7% UNSTABLE (very high variance)
search-numeric-optimize 8983 +- 1.8% (7 datapoints) 9107 1.4% No Change
search-numeric-sortby 3278 +- 19.8% UNSTABLE (7 datapoints) 3247 -0.9% UNSTABLE (very high variance)
search-numeric-sortby-desc 3323 +- 29.5% UNSTABLE (7 datapoints) 5139 54.7% UNSTABLE (very high variance)
search-numeric-sortby-desc-optimize 43 +- 26.6% UNSTABLE (7 datapoints) 53 22.1% UNSTABLE (very high variance)
search-numeric-sortby-optimize 21 +- 4.1% (7 datapoints) 21 -1.2% No Change
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-fulltext-filter 489 +- 0.8% (7 datapoints) 472 -3.5% potential REGRESSION
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-numeric-filter 108 +- 1.3% (7 datapoints) 98 -9.5% REGRESSION
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-tag-filter 58457 +- 3.9% (7 datapoints) 55050 -5.8% REGRESSION

nafraf
nafraf previously approved these changes Nov 22, 2023
Copy link
Collaborator

@oshadmi oshadmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments

tests/pytests/test_coordinator.py Show resolved Hide resolved
src/aggregate/aggregate_request.c Outdated Show resolved Hide resolved
src/aggregate/aggregate_exec.c Outdated Show resolved Hide resolved
src/aggregate/aggregate_exec.c Outdated Show resolved Hide resolved
src/aggregate/aggregate_exec.c Show resolved Hide resolved
@raz-mon raz-mon requested a review from oshadmi November 23, 2023 08:39
Copy link
Collaborator

@oshadmi oshadmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reported performance degredations are not due to this PR (since ON_TIMEOUT is not FAIL - benchmark is using the default module configuration)

@raz-mon
Copy link
Collaborator Author

raz-mon commented Nov 27, 2023

/backport

Copy link

Only merged pull requests can be backported.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 27, 2023
@raz-mon raz-mon added this pull request to the merge queue Nov 27, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 28, 2023
@GuyAv46 GuyAv46 added this pull request to the merge queue Nov 28, 2023
@GuyAv46 GuyAv46 removed this pull request from the merge queue due to a manual request Nov 28, 2023
@GuyAv46 GuyAv46 added this pull request to the merge queue Nov 28, 2023
Merged via the queue into master with commit 6429012 Nov 28, 2023
32 of 79 checks passed
@GuyAv46 GuyAv46 deleted the razmon-respect_timeout_policy_single_shard branch November 28, 2023 11:11
Copy link

Backport failed for 2.8, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.8
git worktree add -d .worktree/backport-4038-to-2.8 origin/2.8
cd .worktree/backport-4038-to-2.8
git switch --create backport-4038-to-2.8
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18

Copy link

Backport failed for 2.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 2.10
git worktree add -d .worktree/backport-4038-to-2.10 origin/2.10
cd .worktree/backport-4038-to-2.10
git switch --create backport-4038-to-2.10
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18

raz-mon added a commit that referenced this pull request Nov 28, 2023
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
raz-mon added a commit that referenced this pull request Nov 28, 2023
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
github-merge-queue bot pushed a commit that referenced this pull request Nov 30, 2023
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
GuyAv46 pushed a commit that referenced this pull request Dec 1, 2023
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
ephraimfeldblum pushed a commit that referenced this pull request Dec 17, 2023
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
ephraimfeldblum added a commit that referenced this pull request Dec 17, 2023
raz-mon added a commit that referenced this pull request Mar 6, 2024
* wip

* wip

* fix resp3 response

* fix resp2 section

* fix total results counter report

* fix loop

* fix response condition

* fix condition

* fix leak

* add timeout check after aggregation in strict timeout policy

* fix coordinator to wait for shard-reply BEFORE polling for timeout

* fix test

* fix test

* wait for reply after polling for timeout (so we don't loose data)

* fix leak

* fix leak

* add test

* non-related code touchups

* reposition response section

* move duplicated code to function

* some fixes

* address reveiw

* fix nelem increment

* fix use after free

* address review

* address comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants