MOD-5948: Respect timeout policy P1 (single shard) #4038
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #4038 +/- ##
==========================================
+ Coverage 82.99% 83.10% +0.10%
==========================================
Files 192 192
Lines 32757 32794 +37
==========================================
+ Hits 27188 27253 +65
+ Misses 5569 5541 -28
Automated performance analysis summary
This comment was automatically generated given there is performance data available. In summary:
You can check a comparison in detail via the grafana link. Comparison between master and razmon-respect_timeout_policy_single_shard. Time period: from 30 days ago (environment used: oss-standalone).
Some comments
Reported performance degradations are not due to this PR (since `ON_TIMEOUT` is not `FAIL`: the benchmark uses the default module configuration).
/backport
Only merged pull requests can be backported.
Backport failed. Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin 2.8
git worktree add -d .worktree/backport-4038-to-2.8 origin/2.8
cd .worktree/backport-4038-to-2.8
git switch --create backport-4038-to-2.8
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18 |
Backport failed. Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin 2.10
git worktree add -d .worktree/backport-4038-to-2.10 origin/2.10
cd .worktree/backport-4038-to-2.10
git switch --create backport-4038-to-2.10
git cherry-pick -x 82798f51ae576dc853f7f218d991f5052a15f7a8 705af18e49fab9053bf83dd0f311829648a0bb69 c68751dd4db09e26cded9c4fa093a62525784c55 f0146261b2c7be73e5ad0fd6bb588136f722c78a e04cfe1d928700f82b8e86b0d09ddff15ca1fc7c 4fb05970608a7710846d5053e8f889e0ee885c0b a46ecb7397a35739a843cb74ad6b1d0eff69aa19 8b086bf904065bda7557c3173ef9ec3792990a91 2570f2b974927e80098182168e3eda96e63c3410 c583de9a7cb3a2bc55631a97b428b83596681b13 a54d221c16ffff1245b0f060dc9ebb55709fd75f 722d8c59cbfbdccc550b4f4756204a80e80b732f 7b8b2084300f67efa0b83968cbf3f5588ae5717a 89f8d5bf6f2258ea02574b9358cbf15b6fd1e33d ffe42b81144521deafb7c1394983ea5d2eff0b2c 0b2b5ce1d407340b2dde56fde935897d42e711d3 44e107755b4250e4b335666953f83dbe7696f581 e5a3f66a275bb5c340fe65425c300c6b9e72cc82 471cac01c301c1f141a1c1aa0cce7778ecefb1f8 8ab3f5d1c53df01718797a978fe10a094b513ff3 8267e5b4f411bc3475489a3f3f831b25adb27277 45e0d87a15b4cbe71bafef9e85f70e728efd8daa 264dbbbdfca2c6a709c879c4c79367dcb13b4838 f51d18548a8da2be007b0ea88352b54f36d05172 7628c9a5be38fa522f2865d3d4ecc8f51c24c828 4d5b42df7ede382a08bd9a34b491674cf14d4a18 |
* wip
* wip
* fix resp3 response
* fix resp2 section
* fix total results counter report
* fix loop
* fix response condition
* fix condition
* fix leak
* add timeout check after aggregation in strict timeout policy
* fix coordinator to wait for shard-reply BEFORE polling for timeout
* fix test
* fix test
* wait for reply after polling for timeout (so we don't lose data)
* fix leak
* fix leak
* add test
* non-related code touchups
* reposition response section
* move duplicated code to function
* some fixes
* address review
* fix nelem increment
* fix use after free
* address review
* address comment
This reverts commit ffda7d5.
Currently, the `ON_TIMEOUT` configuration parameter isn't respected well enough, mainly due to the following:
The main problem today is that the `sendChunk` function populates the response to the client 'on the fly', i.e., as results are received from the pipeline. This is problematic because Redis has no API for discarding a response: if we hit a timeout during pipeline execution, we can no longer report it to the client. (In RESP3 we can report it, but only alongside the results that were already serialized to the response, which is incompatible with what we want from the `ON_TIMEOUT FAIL` timeout policy.)
In order to respect the `ON_TIMEOUT FAIL` timeout policy, we aggregate the results from the pipeline before populating the response to the client, so that we can reply with either an error or the results, depending on whether a timeout occurred at any point during the execution of the command. This is also a good opportunity to check for timeout, since we now cover the whole pipeline execution rather than only its first phase, which is all we cover today; that check is added as well.
Note 1: We are aware that this is the "aggressive" approach to this fix, i.e., one that may yield serious performance regressions. The decision was made mainly due to the following two factors:
`FAIL`, i.e., customers that don't want the strict timeout behavior will experience no performance hit.
Note 2: This PR handles the single-shard build only. A follow-up PR will handle the cluster build (coordinator).
Also, this PR refactors the `sendChunk` function, which is complex and inefficient (and incorrect in several places).
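To make the difference between the two reply strategies concrete, here is a minimal Python sketch. It is only a model of the control flow described above, not the actual RediSearch code: `FakeReply`, `send_chunk_streaming`, `send_chunk_buffered`, `TimeoutPolicy`, and `ticking_clock` are hypothetical names introduced for illustration, and the "clock" is a simple counter so the behavior is deterministic.

```python
import itertools
from enum import Enum
from typing import Callable, Iterable, List


class TimeoutPolicy(Enum):
    RETURN = "RETURN"  # best effort: reply with whatever was collected
    FAIL = "FAIL"      # strict: reply with an error and no partial results


class FakeReply:
    """Hypothetical stand-in for a client reply: append-only, nothing can be discarded."""

    def __init__(self) -> None:
        self.items: List[object] = []

    def add(self, item: object) -> None:
        self.items.append(item)


def ticking_clock() -> Callable[[], int]:
    """Return a fake clock that advances by one tick on every call (0, 1, 2, ...)."""
    counter = itertools.count()
    return lambda: next(counter)


def send_chunk_streaming(pipeline: Iterable[object], reply: FakeReply,
                         deadline: int, now: Callable[[], int]) -> None:
    """'On the fly' shape: every result is serialized as soon as it arrives."""
    for result in pipeline:
        if now() >= deadline:
            # Earlier results are already in the reply and cannot be removed,
            # so the error can only be appended after them.
            reply.add("ERR timeout")
            return
        reply.add(result)


def send_chunk_buffered(pipeline: Iterable[object], reply: FakeReply,
                        deadline: int, now: Callable[[], int],
                        policy: TimeoutPolicy) -> None:
    """Aggregate-first shape: collect results, then decide what to send."""
    results: List[object] = []
    timed_out = False
    for result in pipeline:
        if now() >= deadline:
            timed_out = True
            break
        results.append(result)
    # Check once more after aggregation so the whole pipeline run is covered.
    timed_out = timed_out or now() >= deadline
    if timed_out and policy is TimeoutPolicy.FAIL:
        reply.add("ERR timeout")  # nothing was serialized yet: a clean error
        return
    for result in results:
        reply.add(result)


if __name__ == "__main__":
    docs = ["doc1", "doc2", "doc3"]
    strict = FakeReply()
    send_chunk_buffered(docs, strict, 2, ticking_clock(), TimeoutPolicy.FAIL)
    print(strict.items)
```

With a deadline of 2 ticks, the streaming variant ends up with partial results followed by the error, while the buffered variant under `FAIL` replies with the error alone. The cost is that every result is held in memory until the pipeline finishes, which is the performance trade-off Note 1 refers to.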