
Improve handling of concurrent identical queries #146

Merged
gontarzpawel merged 8 commits into master from fix/concurrent-identical-queries on Apr 21, 2022
Conversation

@gontarzpawel
Contributor

Description

In this PR I fix #51, with a slight modification to the proposed solution.

Details

Instead of simply refusing the request after the grace_time period, we refuse it once the elapsed time reaches half of the request's max execution time.

The rationale behind this is that either:

  • the identical query failed to register its completion (TransactionRegistry was unavailable), or
  • the configured grace time was too short

Further improvements

For the case of successful transaction completion:

// There are the following options:
// - the identical query failed
// - the operation to save the result in the cache failed
// To fix that behaviour we can add state information in the transaction registry. The operation should be expressed as:
// - created
// - failed
// - completed
// and we should store it with a TTL of at most 2 * max_execution_time

which I propose to handle in another PR, as it's a less likely scenario and requires more work.
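
To make the proposal concrete, here is a minimal sketch of what such a registry could look like. The type, method, and helper names (TransactionState, Create/Fail/Complete, registryTTL) are illustrative assumptions and not code from this PR; it assumes the existing Key type and the time package.

// Illustrative sketch only; names are hypothetical, not the chproxy API.
type TransactionState int

const (
	transactionCreated TransactionState = iota
	transactionFailed
	transactionCompleted
)

// TransactionRegistry records the state of an in-flight query so that
// concurrent identical queries can decide whether to wait, reuse the
// cached result, or fail fast.
type TransactionRegistry interface {
	Create(key *Key) error
	Fail(key *Key) error
	Complete(key *Key) error
	Status(key *Key) (TransactionState, error)
}

// Entries would be kept for at most 2 * max_execution_time so that a
// completed or failed record outlives any concurrent waiter.
func registryTTL(maxExecutionTime time.Duration) time.Duration {
	return 2 * maxExecutionTime
}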

proxy.go Outdated
//
// Because of those 2 options, we will fail the query if elapsed time is bigger than half of the max request timeout.
if transactionResult.ElapsedTime > s.user.maxExecutionTime / 2 {
respondWith(srw, err, http.StatusRequestTimeout)
Contributor

don't you think it may be weird for the end user to receive timeouts after 15 or 30 seconds?

Contributor Author

You're right. Perhaps it should be treated more as an internal server error. What do you think?

Contributor

Yes, and a message like "Won't be able to serve your query within ..."?

Collaborator
@mga-chka mga-chka Feb 23, 2022

Be careful: on new ClickHouse versions (including the one we use at Contentsquare), ClickHouse will time out after a few seconds if the forecasted execution time of the query is above the max_execution_time.

Contributor Author

Very good point. I'd propose that we rely on the idea of a state machine in the transactionRegistry. We would check the state of the query (e.g. Created, Failed, Completed) and make a reliable decision based on that. What do you think?

Collaborator

looks good, as long as the state contains a Timeout state (because the Failed state could be due to other factors)

@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 25, 2022
@gontarzpawel gontarzpawel force-pushed the fix/concurrent-identical-queries branch from a4b58e4 to eca3260 on April 3, 2022 08:27
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Apr 3, 2022
@gontarzpawel gontarzpawel force-pushed the fix/concurrent-identical-queries branch 5 times, most recently from 879fa3a to fa9522a on April 3, 2022 19:59
@gontarzpawel gontarzpawel requested a review from mga-chka April 4, 2022 07:20
}

state, err := c.TransactionRegistry.Status(key)
if errors.Is(err, ErrMissingTransaction) {
Collaborator

[opinionated comment]
I don't see the relevance of the error ErrMissingTransaction because:

  • you're dealing with it the same way as the state transactionCompleted (and when you return the error ErrMissingTransaction you always return a transactionCompleted state)
  • it doesn't look like an error, since it's the nominal use case

If you want to dissociate ErrMissingTransaction from the state transactionCompleted, you could create a state like transactionNotPresent. Doing so would allow you to remove this part, since it would be handled by the logic of the if !state.IsPending() {...} part.

Contributor Author

Agree with you 👍
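
For reference, a minimal sketch of what that refactor could look like; the transactionNotPresent state and the surrounding code are illustrative assumptions, not the code merged in this PR.

// Hypothetical sketch: Status reports transactionNotPresent instead of
// returning ErrMissingTransaction, so absence is handled via IsPending().
state, err := c.TransactionRegistry.Status(key)
if err != nil {
	return err
}
// transactionNotPresent and transactionCompleted are both non-pending, so
// the dedicated ErrMissingTransaction branch is no longer needed.
if !state.IsPending() {
	// proceed: read the result from cache or execute the query
}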

// mark transaction as failed
// todo: discuss if we should mark it as failed upon timeout. The rationale against it would be to hope that
// partial results of the query are cached and therefore a subsequent execution can succeed
if err = userCache.Fail(key); err != nil {
Collaborator

IMHO you should dissociate timed-out queries from other error reasons, and CHProxy should only deal with timeout errors. Otherwise, let's say there is a transient issue on the cluster (like an AWS network outage for 3 sec): all the queries made during this downtime will be unavailable as long as they're in the transaction registry.

Contributor Author

I think it'll be hard to tell the difference between those types of errors because ClickHouse does not seem to handle it well. I've checked the status code it propagates in case of a socket timeout, and it's HTTP 503 Service Unavailable:

https://github.com/ClickHouse/ClickHouse/blob/f0774ca9ee1c6de7049fa10be0c67eeb822797a6/src/Server/HTTPHandler.cpp#L225

Collaborator

I'm not sure. I did a test using the HTTP client to get 2 different behaviours:

  • a timeout
  • a predicted timeout (so the query is cancelled by CH)

I got an HTTP 500 in both cases (which is weird because, according to the code you showed me, I was expecting a 503), but on the first one I got the answer:
Code: 159. DB::Exception: Timeout exceeded: elapsed 1.00059767 seconds, maximum: 1. (TIMEOUT_EXCEEDED) (version 22.2.2.1)
and on the second one I got:
Code: 160. DB::Exception: Estimated query execution time (220.8800604841333 seconds) is too long. Maximum: 30. Estimated rows to process: 18629040: While executing MergeTreeInOrder. (TOO_SLOW) (version 22.2.2.1)

So maybe we could try to fetch the error code in case of a 500 from CH to check whether it's a 159 or a 160. WDYT?
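
A rough sketch of that idea (not part of this PR): the helper name and regexp are illustrative, and it assumes the regexp and strconv packages are imported.

// Extract the "Code: NNN." error code from a ClickHouse error body and
// check whether it is a real timeout (159, TIMEOUT_EXCEEDED) or a
// predicted timeout (160, TOO_SLOW).
var chErrorCode = regexp.MustCompile(`Code:\s*(\d+)\.`)

func isCHTimeout(body string) bool {
	m := chErrorCode.FindStringSubmatch(body)
	if m == nil {
		return false
	}
	code, err := strconv.Atoi(m[1])
	if err != nil {
		return false
	}
	return code == 159 || code == 160
}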

Collaborator

OK, since this point is not critical we can leave it the way it is now.

@gontarzpawel gontarzpawel force-pushed the fix/concurrent-identical-queries branch from fa9522a to e4938c9 on April 8, 2022 08:04

func (c *AsyncCache) AwaitForConcurrentTransaction(key *Key) (TransactionResult, error) {
startTime := time.Now()
var seenState TransactionState
Collaborator

It might be more readable to initialize seenState to transactionCreated, because without knowing TransactionState you can't understand this part.
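
Something like the following, assuming transactionCreated is the name of the initial state constant:

// Make the starting state explicit so a reader does not need to know the
// zero value of TransactionState to follow the loop below.
seenState := transactionCreated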

proxy.go Outdated
}
// We will assess if elapsed time is bigger than half of the max request timeout.
// This precondition halts execution
if transaction.ElapsedTime > s.user.maxExecutionTime/2 {
Collaborator

After 2nd thought, this code seems risky because on a cluster with long queries we could have the following:

  • the concurrent-query handling was not working (because of the equality on pointers you spotted)
  • people update their CHProxy version and start to have "weird" errors

IMHO we should at least have a new conf option to activate this behavior, so that people know what they're doing when they activate it.

Another thing is that maybe maxExecutionTime/2 is too short because:

  • people may use the default CH 30 sec
  • let's say you have an APP (or function module) with slow queries (like 25 sec)
  • the end user refreshes their UI and fires a new query after 5 sec
  • they will get an error even if the first query was able to finish

Maybe the best would be something like s.user.maxExecutionTime - 2 sec if (maxExecutionTime >= 20 sec) or maxExecutionTime - 1 sec if ....

Contributor Author

For the first point: I don't understand how the pointer or long-query issue can be impacted by this predicate. Keep in mind that it's a branch in which the result of the concurrent query was either not found or exceeded grace_time. Now, it depends on the user how grace_time is defined.

We can indeed be conservative in our choice and set the predicate to:
transaction.ElapsedTime > s.user.maxExecutionTime - 1

Contributor Author
@gontarzpawel gontarzpawel Apr 11, 2022

Actually I have another proposal which would remove this condition. What if we removed the grace_time configuration and used max_exec_time in its place in the code, or at least deprecated it as a first step?

Collaborator

Regarding your first comment, my issue is that grace_time is at 5 sec by default, so someone not aware of this new feature could upgrade CHProxy and get weird failures if their workload is designed for slow (20+ sec) queries (it's maybe an edge case, since I have no insight into how people are using CH).

Regarding your second comment, I agree with you that grace_time should be equal to max_exec_time, and it would avoid this extra logic. Since max_exec_time is optional, we should use a default value of 30 sec (same as CH), wdyt?
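
A minimal sketch of that fallback; the helper name is hypothetical, and the 30-second default mirrors ClickHouse's default max_execution_time.

// effectiveGraceTime is an illustrative helper, not the merged code: if
// grace_time is omitted, fall back to the user's max_execution_time, and
// if that is also unset, use ClickHouse's default of 30 seconds.
func effectiveGraceTime(graceTime, maxExecutionTime time.Duration) time.Duration {
	const defaultMaxExecutionTime = 30 * time.Second
	if graceTime > 0 {
		return graceTime
	}
	if maxExecutionTime > 0 {
		return maxExecutionTime
	}
	return defaultMaxExecutionTime
}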

@gontarzpawel gontarzpawel requested a review from mga-chka April 20, 2022 07:08
// Deprecated: GraceTime is the duration before the expired entry is deleted from the cache.
// In future versions it will be replaced by the user's MaxExecutionTime.
// This is already the case today if the value of GraceTime is omitted.
GraceTime Duration `yaml:"grace_time,omitempty"`
Collaborator

don't forget to deprecate grace_time in the https://github.com/ContentSquare/chproxy/blob/master/config/README.md

func (c *AsyncCache) AwaitForConcurrentTransaction(key *Key) bool {
startTime := time.Now()
type TransactionResult struct {
ElapsedTime time.Duration
Collaborator

I'm being a bit picky, but since the duration is not used (anymore) outside the function AwaitForConcurrentTransaction, there is no need to keep this field.
In fact, we could remove the TransactionResult structure, and AwaitForConcurrentTransaction could return just a TransactionState.

Contributor Author

You're absolutely right!
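
A sketch of the simplified signature; the graceTime field, the transactionCreated constant, and the 100 ms poll interval are assumptions for illustration, not the merged code.

// Return the last observed state directly; the elapsed time is no longer
// needed by callers, so the TransactionResult wrapper can be dropped.
func (c *AsyncCache) AwaitForConcurrentTransaction(key *Key) (TransactionState, error) {
	startTime := time.Now()
	seenState := transactionCreated
	for time.Since(startTime) < c.graceTime {
		state, err := c.TransactionRegistry.Status(key)
		if err != nil {
			return seenState, err
		}
		seenState = state
		if !seenState.IsPending() {
			break
		}
		time.Sleep(100 * time.Millisecond)
	}
	return seenState, nil
}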

// in order to let the cleaner sweep the transaction
time.Sleep(100 * time.Millisecond)
if !transactionResult.State.IsPending() {
time.Sleep(150 * time.Millisecond)
Collaborator
@mga-chka mga-chka Apr 21, 2022

Good catch!
[It's not related to this PR, but it's a bit of a shame that the 100 msec frequency of the cleaner is hardcoded, because it makes the 150 msec choice in this test a bit magical for a newcomer. Moreover, if it were in a variable it would allow us to modify it for the tests and therefore speed up the tests.]

Contributor Author

I'll add a todo to refactor at some point.
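
A possible shape for that todo; the variable, method, and helper names below are hypothetical, not the actual chproxy cleaner code.

// cleanerInterval lives in a variable so tests can shorten it and the
// 150 ms sleep in the test above stops being a magic number.
var cleanerInterval = 100 * time.Millisecond

func (c *AsyncCache) runCleaner(stop <-chan struct{}) {
	ticker := time.NewTicker(cleanerInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.cleanExpiredTransactions() // hypothetical sweep helper
		case <-stop:
			return
		}
	}
}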

Collaborator
@mga-chka mga-chka left a comment

Overall it looks good.
I left a few comments to look at if you have time, but they're not blocking issues.
