
Retry **every** database error #698

Merged
merged 15 commits into main from retry-every-database-error on Apr 11, 2024
Conversation

lippserd
Member

@lippserd lippserd commented Mar 18, 2024

This PR introduces several changes to our retry functionality:

  • All functions except retention already retry queries, but there is no logging or timeout. Retrying will now stop after 5 minutes, and failed queries will be logged.
  • So far we have kept a list of error codes that should be retried. This doesn't include all the errors that can be retried, and it doesn't even take into account errors that can occur in a database cluster. Instead of going through all possible error codes and checking [1] whether they should be included in the list of retryable errors, every database error is now simply retried. Of course, this also means retrying queries that can't be retried at all, but since we now give up after 5 minutes, that's okay.
    [1] Based on a short error description from the vendor, it is difficult to tell for every error whether it can actually be retried without the context of when and how exactly such errors are triggered. There are also database clusters that send their own errors using vendor error codes.
  • RetryableFunc is no longer canceled if it exceeds Timeout. Before, if Timeout > 0, we ensured that retrying stops after Timeout expires by passing a deadline context to RetryableFunc, which aborts the function once Timeout lapses - assuming that context.Done() is actually taken into account, which applies to all of our usages. I'm pretty sure we didn't think about functions that run longer than Timeout and therefore could be canceled prematurely. Since we are now retrying every database error with a timeout of 5 minutes, this could happen with queries that wait for locks when a generous lock wait timeout is configured in the database server. Now, RetryableFunc is granted full execution time and will not be canceled if Timeout is exceeded. This means that WithBackoff may not stop exactly after Timeout expires, or may not retry at all if the first execution of RetryableFunc already takes longer than Timeout. (A minimal sketch of these semantics follows the issue references below.)
  • DELETE statements from retention are now also retried.
  • HA now stops retrying after 5 minutes.

fixes #618
fixes #651
fixes #677
fixes #730
closes #679
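
To make the semantics of the third bullet point concrete, here is a minimal sketch of the intended behavior, under assumed names and signatures - it is not the actual retry.WithBackoff from pkg/retry: retry until the timeout has elapsed, never cancel an attempt that is already running, and still allow one final attempt if the timeout expires while sleeping between attempts.

package retryexample

import (
	"context"
	"time"
)

// retryUntil is an illustrative sketch only, not the actual retry.WithBackoff:
// it retries attempt() until timeout has elapsed, passes ctx through unchanged
// so a running attempt is never cut short by the retry timeout, and performs
// one final attempt even if the timeout lapses during the backoff sleep.
func retryUntil(ctx context.Context, timeout time.Duration,
	backoff func(attempt uint64) time.Duration, attempt func(context.Context) error) error {
	start := time.Now()
	for n := uint64(0); ; n++ {
		err := attempt(ctx) // full execution time, no extra deadline
		if err == nil || ctx.Err() != nil {
			return err
		}
		// Give up only between attempts, once the timeout has elapsed.
		if time.Since(start) >= timeout {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff(n)):
			// If the timeout lapsed during this sleep, the next loop
			// iteration still runs one final attempt before giving up.
		}
	}
}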

@lippserd lippserd added this to the 1.1.2 milestone Mar 18, 2024
@cla-bot cla-bot bot added the cla/signed label Mar 18, 2024
@julianbrost
Contributor

It took me surprisingly long to answer a simple question: What will happen with simple configuration errors, like a wrong username, for example? Answer: it will log a warning like this immediately and exit with the same error at a later point:

WARN    database        Can't connect to database. Retrying     {"error": "Error 1698 (28000): Access denied for user 'icingadb-test'@'localhost'"}

So if you did something wrong while setting it up, this PR shouldn't delay you from seeing the problem.

Why did it take me so long to test this? I wanted to test it within the Docker image, and the behavior totally didn't match my expectations (it logged that it would retry but exited immediately afterwards). What happened? The Docker image has an entrypoint which itself connects to the database, and this uses a fixed version of the Icinga DB code. So, long story short, that has to be updated as well, no matter in which way we change the error handling.

@julianbrost julianbrost mentioned this pull request Mar 18, 2024
@julianbrost
Contributor

I've created the release issue and added a corresponding point to the checklist there so that it won't be forgotten: #707

@julianbrost
Contributor

closes #679

I've added this to the PR description as that PR becomes obsolete if we do it this way.

fixes #618

Not sure about that one. Yes, the issue mentions a "Lock wait timeout exceeded" error, but only after retention.count was lowered in the config. Before, the error was "The total number of locks exceeds the lock table size". I'm not sure if this would be fixed by retrying. I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

@yhabteab
Member

I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

I've just created an issue for this: #717.

@lippserd
Member Author

fixes #618

Not sure about that one. Yes, the issue mentions a "Lock wait timeout exceeded" error, but only after retention.count was lowered in the config. Before, the error was "The total number of locks exceeds the lock table size". I'm not sure if this would be fixed by retrying. I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

Please see #717 (comment).

@lippserd
Member Author

Added commit ef34b7d to account for #686 (comment).

PR is now ready for review.

Member

@yhabteab yhabteab left a comment


The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

Comment on lines 43 to 45
defer time.AfterFunc(settings.Timeout, func() {
	cancelTimeWindow()
}).Stop()
Member


Suggested change
defer time.AfterFunc(settings.Timeout, func() {
	cancelTimeWindow()
}).Stop()
defer time.AfterFunc(settings.Timeout, cancelTimeWindow).Stop()

Member Author


Yep, that makes perfect sense.

Member Author


Done.

@lippserd
Member Author

The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

I will have a look.

pkg/icingadb/cleanup.go (outdated review thread, resolved)
pkg/retry/retry.go (outdated review thread, resolved)
@lippserd lippserd force-pushed the retry-every-database-error branch 2 times, most recently from 8e9fc70 to ff0e537 on April 3, 2024 08:21
@lippserd
Member Author

lippserd commented Apr 3, 2024

The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

I will have a look.

I also added a timeout to HA.

pkg/icingadb/cleanup.go (outdated review thread, resolved)
pkg/icingadb/ha.go (outdated review thread, resolved)
pkg/icingadb/cleanup.go (outdated review thread, resolved)
@lippserd
Member Author

lippserd commented Apr 3, 2024

There is one more thing I have to fix in HA:

2024-04-03T11:19:11.364+0200 DEBUG high-availability Can't update or insert instance. Retrying {"error": "can't start transaction: can't connect to database: context deadline exceeded: dial tcp [::1]:3306: connect: connection refused"}

We're using a deadline ctx for realize().

@julianbrost
Contributor

We're using a deadline ctx for realize().

However, keep in mind that this deadline should probably keep existing, as it's based on the time when the heartbeat expires:

icingadb/pkg/icingadb/ha.go

Lines 202 to 206 in 2826af4

if h.responsible {
	realizeCtx, cancelRealizeCtx = context.WithDeadline(h.ctx, m.ExpiryTime())
} else {
	realizeCtx, cancelRealizeCtx = context.WithCancel(h.ctx)
}

Retrying the HA transaction with an outdated heartbeat would make no sense. And if it retries for longer than that, it may delay the forwarding of the heartbeat loss to the rest of the code.

And as I'm looking at that `if` right now, I'm not even sure if it makes sense that way. Shouldn't it use WithDeadline in any case, regardless of h.responsible? Wouldn't there otherwise be a chance that it attempts a takeover with an already timed-out heartbeat if that happens to take more than a minute?
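
For illustration only, the variant being asked about would collapse the branch into a single call - variable names are taken from the quoted snippet above, and this is a sketch, not a proposed patch:

// Always bound realize() by the heartbeat's expiry time, regardless of
// whether this instance is currently responsible, so a takeover cannot be
// attempted with an already expired heartbeat.
realizeCtx, cancelRealizeCtx = context.WithDeadline(h.ctx, m.ExpiryTime())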

Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
Before, with `Timeout > 0`, we had to check errors for `nil` because we
were using a deadline context that might have been cancelled before the
retry function completed with an error, but since we now pass `ctx` as-is,
this is no longer necessary.
Ensure that updating/inserting of the instance row is also completed by
the current heartbeat's expiry time in takeover scenarios. Otherwise,
there is a risk that the takeover will be performed with an already
expired heartbeat if the attempt takes longer than the expiry time of
the heartbeat.
@@ -55,8 +55,8 @@ func (c RetryConnector) Connect(ctx context.Context) (driver.Conn, error) {
shouldRetry,
backoff.NewExponentialWithJitter(time.Millisecond*128, time.Minute*1),
Contributor


With this retry.WithBackoff() nested inside the ones in other places that execute SQL queries, and with these backoff settings, some strange effects can be observed: If you stop your database for something like 4m30s, you may see some attempts recovering and others - quite some time after that - failing fatally because they exceeded the 5 minutes. I believe this happens due to the maximal backoff of a minute given here, i.e. there may be a minute without an attempt, and after that, both the inner and the outer 5-minute timeout are exceeded.

Also, if we're adding retry.WithBackoff(), do we even still need it here? Just for fun, I removed all the retry from this function (so that it just does Connect() and initConn() with no loops or anything around it) and it seemed to work more reliably, only the logging became very annoying with queries and stacktraces within a single line.
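
To make the spacing issue described above tangible, here is a sketch of a capped exponential backoff with full jitter - illustrative only, not the project's backoff.NewExponentialWithJitter implementation. With min = 128ms and max = 1m (the values from the diff above), later waits are drawn from a window of up to a minute, so a single sleep can easily carry both the inner and the outer retry past their 5-minute timeout.

package backoffexample

import (
	"math/rand"
	"time"
)

// expJitter returns a backoff function: the wait window doubles per attempt,
// starting at min and capped at max, and the actual sleep is a random value
// within that window (full jitter).
func expJitter(min, max time.Duration) func(attempt uint64) time.Duration {
	return func(attempt uint64) time.Duration {
		window := min << attempt
		if window <= 0 || window > max { // guard against overflow and cap at max
			window = max
		}
		return time.Duration(rand.Int63n(int64(window)))
	}
}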

Member Author


The following changes address your considerations:

HA now uses a Timer instead of a Ticker, which is properly reset even if it has expired (and has not yet been drained): 81085c0.

If the timeout expires during the sleep phase between attempts, one final retry attempt will be made: 3a3baaf.

only the logging became very annoying with queries and stacktraces within a single line

I have something in my stash which fixes this. zap adds the errorVerbose field if the error implements fmt.Formatter, which is true for all our wrapped errors. A simple solution would be to replace zap.Error() calls with a custom implementation that returns a "silent" error, e.g.

package logging

import (
	"github.com/pkg/errors"
	"go.uber.org/zap"
)

// stackTracer is satisfied by errors from github.com/pkg/errors that carry a
// stack trace.
type stackTracer interface {
	StackTrace() errors.StackTrace
}

// errNoStackTrace wraps an error but deliberately does not implement
// fmt.Formatter, so zap does not add the verbose errorVerbose field.
type errNoStackTrace struct {
	e error
}

func (e errNoStackTrace) Error() string {
	return e.e.Error()
}

// Error is a drop-in replacement for zap.Error() that suppresses stack traces
// of wrapped errors in log output.
func Error(e error) zap.Field {
	if _, ok := e.(stackTracer); ok {
		return zap.Error(errNoStackTrace{e})
	}

	return zap.Error(e)
}
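
Call sites would then only swap the field constructor; for example, with a *zap.Logger (illustrative):

logger.Warn("Can't connect to database. Retrying", logging.Error(err))

Because errNoStackTrace intentionally does not implement fmt.Formatter, zap logs just the error message and omits the errorVerbose field with the embedded stack trace.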

Contributor


only the logging became very annoying with queries and stacktraces within a single line

I have something in my stash which fixes this. zap adds the errorVerbose field if the error implements fmt.Formatter, which is true for all our wrapped errors. A simple solution would be to replace zap.Error() calls with a custom implementation that returns a "silent" error, e.g.

However, the PR just keeps the nested retrying for now, thereby avoiding this issue.

lippserd and others added 7 commits April 10, 2024 15:24
Since we are now retrying every database error,
we also need to set a retry timeout.
All of our error callbacks are used to log the error and indicate that
we are retrying. Previously, in the case of context errors or
non-retryable errors, we would have called these too, which would have
resulted in misleading log messages.
The retryable function may exit prematurely due to context errors that
shouldn't be retried. Before, we checked the returned error for context
errors, i.e. used `errors.Is()` to compare it to `Canceled` and
`DeadlineExceeded`, which also yields `true` for errors that implement
`Is()` accordingly. For example, this applies to some non-exported Go
`net` errors. Now we explicitly check the context error instead.
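
To illustrate the check described in that commit message, here is a compressed sketch of the two approaches - it is not the project's code:

package ctxcheck

import (
	"context"
	"errors"
)

// stopOld matches the returned error against the context sentinels. This can
// misfire, because unrelated errors that implement Is() accordingly - e.g.
// some non-exported net errors - also compare equal.
func stopOld(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// stopNew consults the context itself, which is unambiguous: it only reports
// true once the context has actually been canceled or its deadline exceeded.
func stopNew(ctx context.Context) bool {
	return ctx.Err() != nil
}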
Logging of the `attempt` is a meaningless metric as it is not constantly
logged but only when the retryable error changes, and it has no context
as there is no such thing as max attempts.
This change simplifies the use of `attempt` as a number for reading in
log messages and `if`s. Also before, with `attempt` starting with `0`,
the second attempt would have been taken immediately, as our backoff
implementation returns `0` in this case.

Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
`WithBackoff()` will now make one final retry if the timeout expires
during the sleep phase between attempts, which can be a long period
depending on the attempts made and the maximum sleep time.
@lippserd
Member Author

With the latest changes, a reconnect close to the timeout boundary looks like this:

2024-04-10T15:50:00.965+0200 INFO runtime-updates Upserted 4 ServiceState items
2024-04-10T15:50:05.551+0200 INFO history-sync Synced 4 state history items
2024-04-10T15:50:19.558+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:20.965+0200 INFO runtime-updates Upserted 1 ServiceState items
2024-04-10T15:50:23.222+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:23.222+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:25.551+0200 INFO history-sync Synced 1 state history items
2024-04-10T15:50:51.710+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:53.249+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:51:06.084+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:51:19.552+0200 WARN main Handing over {"reason": "instance update/insert deadline exceeded heartbeat expiry time"}
2024-04-10T15:51:19.554+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
...
2024-04-10T15:55:13.558+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:55:19.233+0200 INFO database Reconnected to database {"after": "5.67756108s", "attempts": 6}
2024-04-10T15:55:19.251+0200 INFO main Taking over {"reason": "no other instance is active"}
2024-04-10T15:55:19.252+0200 INFO main Starting config sync
2024-04-10T15:55:19.252+0200 INFO main Starting overdue sync
2024-04-10T15:55:19.253+0200 INFO main Starting initial state sync
2024-04-10T15:55:19.358+0200 INFO config-sync Updating 430 items of type service state
2024-04-10T15:55:19.393+0200 INFO config-sync Updating 103 items of type host state
2024-04-10T15:55:19.436+0200 INFO main Starting config runtime updates sync
2024-04-10T15:55:19.436+0200 INFO config-sync Finished config sync in 184.650819ms
2024-04-10T15:55:19.441+0200 INFO main Starting history retention
2024-04-10T15:55:19.441+0200 INFO main Starting state runtime updates sync
2024-04-10T15:55:19.442+0200 INFO config-sync Finished initial state sync in 189.713135ms
2024-04-10T15:55:19.449+0200 INFO retention Removed 32 old sla_state history items
2024-04-10T15:55:19.457+0200 INFO retention Removed 5 old notification history items
2024-04-10T15:55:19.463+0200 INFO retention Removed 91 old state history items
2024-04-10T15:55:20.772+0200 INFO database Reconnected to database {"after": "3m15.384650151s", "attempts": 12}
2024-04-10T15:55:23.230+0200 INFO database Reconnected to database {"after": "5m0.007129863s", "attempts": 15}
2024-04-10T15:55:25.555+0200 INFO history-sync Synced 10 state history items
2024-04-10T15:55:28.363+0200 INFO database Reconnected to database {"after": "4m36.652250664s", "attempts": 15}
2024-04-10T15:55:39.442+0200 INFO runtime-updates Upserted 3 ServiceState items
2024-04-10T15:55:39.639+0200 INFO database Reconnected to database {"after": "4m33.553376516s", "attempts": 14}
2024-04-10T15:55:40.573+0200 INFO database Reconnected to database {"after": "3m18.755781702s", "attempts": 13}
2024-04-10T15:55:45.555+0200 INFO history-sync Synced 7 state history items
2024-04-10T15:55:48.592+0200 INFO database Reconnected to database {"after": "3m56.811296875s", "attempts": 13}
2024-04-10T15:55:53.257+0200 INFO database Reconnected to database {"after": "5m0.005792734s", "attempts": 14}

Contributor

@julianbrost julianbrost left a comment


No more complaints from my side. Compared to the others, I haven't spent much time looking into the details of this PR so far, so I think it would be good to get another approval on this as well.

Member

@yhabteab yhabteab left a comment


2024-04-11T09:43:21.347+0200    INFO    database        Reconnected to database {"after": "4m50.284240542s", "attempts": 14}
2024-04-11T09:43:25.079+0200    INFO    database        Reconnected to database {"after": "4m44.25903525s", "attempts": 14}
2024-04-11T09:43:28.895+0200    INFO    database        Reconnected to database {"after": "2m45.294415s", "attempts": 12}
2024-04-11T09:43:29.917+0200    INFO    database        Reconnected to database {"after": "4m28.553121625s", "attempts": 15}
2024-04-11T09:43:31.347+0200    INFO    database        Reconnected to database {"after": "4m54.634997417s", "attempts": 15}
2024-04-11T09:43:33.639+0200    INFO    database        Reconnected to database {"after": "5m0.007767875s", "attempts": 15}
2024-04-11T09:43:34.152+0200    INFO    database        Reconnected to database {"after": "5m0.007966958s", "attempts": 15}
2024-04-11T09:43:34.820+0200    INFO    database        Reconnected to database {"after": "3m53.386925458s", "attempts": 13}
2024-04-11T09:43:34.922+0200    INFO    database        Reconnected to database {"after": "5m0.006823084s", "attempts": 15}
2024-04-11T09:43:38.775+0200    INFO    database        Reconnected to database {"after": "5m0.00952125s", "attempts": 15}
2024-04-11T09:43:39.100+0200    INFO    history-sync    Synced 622 state history items
2024-04-11T09:43:39.100+0200    INFO    history-sync    Synced 495 notification history items
2024-04-11T09:43:39.272+0200    INFO    runtime-updates Upserted 17 ServiceState items
2024-04-11T09:43:39.846+0200    INFO    database        Reconnected to database {"after": "4m53.632643916s", "attempts": 15}
2024-04-11T09:43:41.586+0200    INFO    database        Reconnected to database {"after": "3m1.067891625s", "attempts": 12}
2024-04-11T09:43:46.617+0200    INFO    database        Reconnected to database {"after": "4m20.077357709s", "attempts": 13}
2024-04-11T09:43:51.625+0200    INFO    database        Reconnected to database {"after": "4m5.054033166s", "attempts": 13}
2024-04-11T09:43:55.058+0200    INFO    database        Reconnected to database {"after": "4m18.504347416s", "attempts": 14}

@julianbrost julianbrost merged commit 6ff3ca5 into main Apr 11, 2024
31 checks passed
@julianbrost julianbrost deleted the retry-every-database-error branch April 11, 2024 08:01
oxzi added a commit that referenced this pull request Apr 11, 2024
During testing, I just encountered a race condition where my Galera
cluster was not yet ready, causing the initial schema check to fail.

```
2024-04-11T08:13:40.401Z        INFO    icingadb        Starting Icinga DB daemon (1.1.1)
2024-04-11T08:13:40.401Z        INFO    icingadb        Connecting to database at 'mysql:3306'
2024-04-11T08:13:40.404Z        FATAL   icingadb        Error 1047 (08S01): WSREP has not yet prepared node for application use
can't check database schema version
github.com/icinga/icingadb/pkg/icingadb.(*DB).CheckSchema
        /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:115
main.run
        /go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:74
main.main
        /go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:37
runtime.main
        /usr/local/go/src/runtime/proc.go:271
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1695
exit status 1
```

This change now also retries the initial cluster check.

References #698.