
Retry **every** database error #698

Merged
merged 15 commits into main from retry-every-database-error on Apr 11, 2024
Conversation

lippserd
Member

@lippserd lippserd commented Mar 18, 2024

This PR introduces several changes to our retry functionality:

  • All functions except retention already retry queries, but there is no logging or timeout. Retrying will now stop after 5 minutes, and failed queries will be logged.
  • So far we have kept a list of error codes that should be retried. This doesn't include all the errors that can be retried, and it doesn't even take into account errors that can occur in a database cluster. Instead of going through all possible error codes and checking [1] whether they should be included in the list of retryable errors, every database error is now simply retried. Of course, this also means retrying queries that can't be retried at all, but since we now give up after 5 minutes, that's okay.
    [1] Based on a short error description from the vendor, it is difficult to tell for every error whether it can actually be retried without the context of when and how exactly such errors are triggered. There are also database clusters that send their own errors using vendor error codes.
  • RetryableFunc is no longer canceled if it exceeds Timeout. Before, if Timeout > 0, we ensured that retrying stops after Timeout expires by passing a deadline context to RetryableFunc, which aborts the function once Timeout lapses - assuming that context.Done() is actually taken into account, which applies to all of our usages. I'm pretty sure we didn't think about functions that run longer than Timeout and therefore could be canceled prematurely. Since we are now retrying every database error with a timeout of 5 minutes, this could happen with queries that wait for locks when a generous lock wait timeout is configured in the database server. Now, RetryableFunc is granted full execution time and will not be canceled if Timeout is exceeded. This means that WithBackoff may not stop exactly after Timeout expires, or may not retry at all if the first execution of RetryableFunc already takes longer than Timeout. (A minimal sketch of these semantics follows the issue references below.)
  • DELETE statements from retention are now also retried.
  • HA now stops retrying after 5 minutes.

fixes #618
fixes #651
fixes #677
fixes #730
closes #679
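
To make the semantics of the third bullet point concrete, here is a minimal sketch of the intended behavior, under assumed names and signatures - it is not the actual retry.WithBackoff from pkg/retry: retry until the timeout has elapsed, never cancel an attempt that is already running, and still allow one final attempt if the timeout expires while sleeping between attempts.

package retryexample

import (
	"context"
	"time"
)

// retryUntil is an illustrative sketch only, not the actual retry.WithBackoff:
// it retries attempt() until timeout has elapsed, passes ctx through unchanged
// so a running attempt is never cut short by the retry timeout, and performs
// one final attempt even if the timeout lapses during the backoff sleep.
func retryUntil(ctx context.Context, timeout time.Duration,
	backoff func(attempt uint64) time.Duration, attempt func(context.Context) error) error {
	start := time.Now()
	for n := uint64(0); ; n++ {
		err := attempt(ctx) // full execution time, no extra deadline
		if err == nil || ctx.Err() != nil {
			return err
		}
		// Give up only between attempts, once the timeout has elapsed.
		if time.Since(start) >= timeout {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff(n)):
			// If the timeout lapsed during this sleep, the next loop
			// iteration still runs one final attempt before giving up.
		}
	}
}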

@lippserd lippserd added this to the 1.1.2 milestone Mar 18, 2024
@cla-bot cla-bot bot added the cla/signed label Mar 18, 2024
@julianbrost
Contributor

It took me surprisingly long to answer a simple question: What will happen with simple configuration errors, like a wrong username, for example? Answer: it will log a warning like this immediately and exit with the same error at a later point:

WARN    database        Can't connect to database. Retrying     {"error": "Error 1698 (28000): Access denied for user 'icingadb-test'@'localhost'"}

So if you did something wrong while setting it up, this PR shouldn't delay you from seeing the problem.

Why did it take me so long to test this? I wanted to test it within the Docker image, and the behavior totally didn't match my expectations (it logged that it would retry but exited immediately afterwards). What happened? The Docker image has an entrypoint which itself connects to the database, and this uses a fixed version of the Icinga DB code. So, long story short, that has to be updated as well, no matter in which way we change the error handling.

@julianbrost julianbrost mentioned this pull request Mar 18, 2024
@julianbrost
Contributor

I've created the release issue and added a corresponding point to the checklist there so that it won't be forgotten: #707

@julianbrost
Contributor

closes #679

I've added this to the PR description as that PR becomes obsolete if we do it this way.

fixes #618

Not sure about that one. Yes, the issue mentions a "Lock wait timeout exceeded" error, but only after retention.count was lowered in the config. Before, the error was "The total number of locks exceeds the lock table size". I'm not sure if this would be fixed by retrying. I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

@yhabteab
Member

I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

I've just created an issue for this: #717.

@lippserd
Member Author

fixes #618

Not sure about that one. Yes, the issue mentions a "Lock wait timeout exceeded" error, but only after retention.count was lowered in the config. Before, the error was "The total number of locks exceeds the lock table size". I'm not sure if this would be fixed by retrying. I don't know what the limit is there, but if we get close to it with the default retention.count = 5000, maybe we should consider lowering the default.

Please see #717 (comment).

@lippserd
Member Author

Added commit ef34b7d to account for #686 (comment).

PR is now ready for review.

Member

@yhabteab yhabteab left a comment


The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

Comment on lines 43 to 45
defer time.AfterFunc(settings.Timeout, func() {
	cancelTimeWindow()
}).Stop()
Member


Suggested change
defer time.AfterFunc(settings.Timeout, func() {
	cancelTimeWindow()
}).Stop()
defer time.AfterFunc(settings.Timeout, cancelTimeWindow).Stop()

Member Author


Yep, that makes perfect sense.

Member Author


Done.

@lippserd
Member Author

The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

I will have a look.

pkg/icingadb/cleanup.go (outdated review thread, resolved)
pkg/retry/retry.go (outdated review thread, resolved)
@lippserd lippserd force-pushed the retry-every-database-error branch 2 times, most recently from 8e9fc70 to ff0e537 on April 3, 2024 08:21
@lippserd
Member Author

lippserd commented Apr 3, 2024

The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't used). Is this deliberate, or was it missed when adjusting?

I will have a look.

I also added a timeout to HA.

pkg/icingadb/cleanup.go (outdated review thread, resolved)
pkg/icingadb/ha.go (outdated review thread, resolved)
pkg/icingadb/cleanup.go (outdated review thread, resolved)
@lippserd
Member Author

lippserd commented Apr 3, 2024

There is one more thing I have to fix in HA:

2024-04-03T11:19:11.364+0200 DEBUG high-availability Can't update or insert instance. Retrying {"error": "can't start transaction: can't connect to database: context deadline exceeded: dial tcp [::1]:3306: connect: connection refused"}

We're using a deadline ctx for realize().

@julianbrost
Contributor

We're using a deadline ctx for realize().

However, keep in mind that this deadline should probably keep existing, as it's based on the time when the heartbeat expires:

icingadb/pkg/icingadb/ha.go

Lines 202 to 206 in 2826af4

if h.responsible {
	realizeCtx, cancelRealizeCtx = context.WithDeadline(h.ctx, m.ExpiryTime())
} else {
	realizeCtx, cancelRealizeCtx = context.WithCancel(h.ctx)
}

Retrying the HA transaction with an outdated heartbeat would make no sense. And if it retries for longer than that, it may delay the forwarding of the heartbeat loss to the rest of the code.

And as I'm looking at that `if` right now, I'm not even sure if it makes sense that way. Shouldn't it use WithDeadline in any case, regardless of h.responsible? Wouldn't there otherwise be a chance that it attempts a takeover with an already timed-out heartbeat if that happens to take more than a minute?
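
For illustration only, the variant being asked about would collapse the branch into a single call - variable names are taken from the quoted snippet above, and this is a sketch, not a proposed patch:

// Always bound realize() by the heartbeat's expiry time, regardless of
// whether this instance is currently responsible, so a takeover cannot be
// attempted with an already expired heartbeat.
realizeCtx, cancelRealizeCtx = context.WithDeadline(h.ctx, m.ExpiryTime())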

Before, the error may have been handled in the `<-ctx.Done()` case in
the `select` for the backoff sleep, since `DeadlineExceeded` errors
return `true` for `Temporary()` and `Timeout()`.
Before, with `Timeout > 0`, we had to check errors for `nil` because we
were using a deadline context that might have been cancelled before the
retry function completed with an error, but since we now pass `ctx` as-is,
this is no longer necessary.
Ensure that updating/inserting of the instance row is also completed by
the current heartbeat's expiry time in takeover scenarios. Otherwise,
there is a risk that the takeover will be performed with an already
expired heartbeat if the attempt takes longer than the expiry time of
the heartbeat.
@@ -55,8 +55,8 @@ func (c RetryConnector) Connect(ctx context.Context) (driver.Conn, error) {
shouldRetry,
backoff.NewExponentialWithJitter(time.Millisecond*128, time.Minute*1),
Contributor


With this retry.WithBackoff() nested inside the ones in other places that execute SQL queries, and with these backoff settings, some strange effects can be observed: If you stop your database for something like 4m30s, you may see some attempts recovering and others - quite some time after that - failing fatally because they exceeded the 5 minutes. I believe this happens due to the maximal backoff of a minute given here, i.e. there may be a minute without an attempt, and after that, both the inner and the outer 5-minute timeout are exceeded.

Also, if we're adding retry.WithBackoff(), do we even still need it here? Just for fun, I removed all the retry from this function (so that it just does Connect() and initConn() with no loops or anything around it) and it seemed to work more reliably, only the logging became very annoying with queries and stacktraces within a single line.
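
To make the spacing issue described above tangible, here is a sketch of a capped exponential backoff with full jitter - illustrative only, not the project's backoff.NewExponentialWithJitter implementation. With min = 128ms and max = 1m (the values from the diff above), later waits are drawn from a window of up to a minute, so a single sleep can easily carry both the inner and the outer retry past their 5-minute timeout.

package backoffexample

import (
	"math/rand"
	"time"
)

// expJitter returns a backoff function: the wait window doubles per attempt,
// starting at min and capped at max, and the actual sleep is a random value
// within that window (full jitter).
func expJitter(min, max time.Duration) func(attempt uint64) time.Duration {
	return func(attempt uint64) time.Duration {
		window := min << attempt
		if window <= 0 || window > max { // guard against overflow and cap at max
			window = max
		}
		return time.Duration(rand.Int63n(int64(window)))
	}
}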

Member Author


The following changes address your considerations:

HA now uses a Timer instead of a Ticker, which is properly reset even if it has expired (and has not yet been drained): 81085c0.

If the timeout expires during the sleep phase between attempts, one final retry attempt will be made: 3a3baaf.

only the logging became very annoying with queries and stacktraces within a single line

I have something in my stash which fixes this. zap adds the errorVerbose field if the error implements fmt.Formatter, which is true for all our wrapped errors. A simple solution would be to replace zap.Error() calls with a custom implementation that returns a "silent" error, e.g.

package logging

import (
	"github.com/pkg/errors"
	"go.uber.org/zap"
)

// stackTracer is satisfied by errors from github.com/pkg/errors that carry a
// stack trace.
type stackTracer interface {
	StackTrace() errors.StackTrace
}

// errNoStackTrace wraps an error but deliberately does not implement
// fmt.Formatter, so zap does not add the verbose errorVerbose field.
type errNoStackTrace struct {
	e error
}

func (e errNoStackTrace) Error() string {
	return e.e.Error()
}

// Error is a drop-in replacement for zap.Error() that suppresses stack traces
// of wrapped errors in log output.
func Error(e error) zap.Field {
	if _, ok := e.(stackTracer); ok {
		return zap.Error(errNoStackTrace{e})
	}

	return zap.Error(e)
}
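
Call sites would then only swap the field constructor; for example, with a *zap.Logger (illustrative):

logger.Warn("Can't connect to database. Retrying", logging.Error(err))

Because errNoStackTrace intentionally does not implement fmt.Formatter, zap logs just the error message and omits the errorVerbose field with the embedded stack trace.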

Contributor


only the logging became very annoying with queries and stacktraces within a single line

I have something in my stash which fixes this. zap adds the errorVerbose field if the error implements fmt.Formatter, which is true for all our wrapped errors. A simple solution would be to replace zap.Error() calls with a custom implementation that returns a "silent" error, e.g.

However, the PR just keeps the nested retrying for now, thereby avoiding this issue.

lippserd and others added 7 commits April 10, 2024 15:24
Since we are now retrying every database error,
we also need to set a retry timeout.
All of our error callbacks are used to log the error and indicate that
we are retrying. Previously, in the case of context errors or
non-retryable errors, we would have called these too, which would have
resulted in misleading log messages.
The retryable function may exit prematurely due to context errors that
shouldn't be retried. Before, we checked the returned error for context
errors, i.e. used `errors.Is()` to compare it to `Canceled` and
`DeadlineExceeded`, which also yields `true` for errors that implement
`Is()` accordingly. For example, this applies to some non-exported Go
`net` errors. Now we explicitly check the context error instead.
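
To illustrate the check described in that commit message, here is a compressed sketch of the two approaches - it is not the project's code:

package ctxcheck

import (
	"context"
	"errors"
)

// stopOld matches the returned error against the context sentinels. This can
// misfire, because unrelated errors that implement Is() accordingly - e.g.
// some non-exported net errors - also compare equal.
func stopOld(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// stopNew consults the context itself, which is unambiguous: it only reports
// true once the context has actually been canceled or its deadline exceeded.
func stopNew(ctx context.Context) bool {
	return ctx.Err() != nil
}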
Logging of the `attempt` is a meaningless metric as it is not constantly
logged but only when the retryable error changes, and it has no context
as there is no such thing as max attempts.
This change simplifies the use of `attempt` as a number for reading in
log messages and `if`s. Also before, with `attempt` starting with `0`,
the second attempt would have been taken immediately, as our backoff
implementation returns `0` in this case.

Co-Authored-By: Alvar Penning <alvar.penning@icinga.com>
`WithBackoff()` will now make one final retry if the timeout expires
during the sleep phase between attempts, which can be a long period
depending on the attempts made and the maximum sleep time.
@lippserd
Member Author

With the latest changes, a reconnect close to the timeout boundary looks like this:

2024-04-10T15:50:00.965+0200 INFO runtime-updates Upserted 4 ServiceState items
2024-04-10T15:50:05.551+0200 INFO history-sync Synced 4 state history items
2024-04-10T15:50:19.558+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:20.965+0200 INFO runtime-updates Upserted 1 ServiceState items
2024-04-10T15:50:23.222+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:23.222+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:25.551+0200 INFO history-sync Synced 1 state history items
2024-04-10T15:50:51.710+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:50:53.249+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:51:06.084+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:51:19.552+0200 WARN main Handing over {"reason": "instance update/insert deadline exceeded heartbeat expiry time"}
2024-04-10T15:51:19.554+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
...
2024-04-10T15:55:13.558+0200 WARN database Can't connect to database. Retrying {"error": "dial tcp [::1]:3306: connect: connection refused"}
2024-04-10T15:55:19.233+0200 INFO database Reconnected to database {"after": "5.67756108s", "attempts": 6}
2024-04-10T15:55:19.251+0200 INFO main Taking over {"reason": "no other instance is active"}
2024-04-10T15:55:19.252+0200 INFO main Starting config sync
2024-04-10T15:55:19.252+0200 INFO main Starting overdue sync
2024-04-10T15:55:19.253+0200 INFO main Starting initial state sync
2024-04-10T15:55:19.358+0200 INFO config-sync Updating 430 items of type service state
2024-04-10T15:55:19.393+0200 INFO config-sync Updating 103 items of type host state
2024-04-10T15:55:19.436+0200 INFO main Starting config runtime updates sync
2024-04-10T15:55:19.436+0200 INFO config-sync Finished config sync in 184.650819ms
2024-04-10T15:55:19.441+0200 INFO main Starting history retention
2024-04-10T15:55:19.441+0200 INFO main Starting state runtime updates sync
2024-04-10T15:55:19.442+0200 INFO config-sync Finished initial state sync in 189.713135ms
2024-04-10T15:55:19.449+0200 INFO retention Removed 32 old sla_state history items
2024-04-10T15:55:19.457+0200 INFO retention Removed 5 old notification history items
2024-04-10T15:55:19.463+0200 INFO retention Removed 91 old state history items
2024-04-10T15:55:20.772+0200 INFO database Reconnected to database {"after": "3m15.384650151s", "attempts": 12}
2024-04-10T15:55:23.230+0200 INFO database Reconnected to database {"after": "5m0.007129863s", "attempts": 15}
2024-04-10T15:55:25.555+0200 INFO history-sync Synced 10 state history items
2024-04-10T15:55:28.363+0200 INFO database Reconnected to database {"after": "4m36.652250664s", "attempts": 15}
2024-04-10T15:55:39.442+0200 INFO runtime-updates Upserted 3 ServiceState items
2024-04-10T15:55:39.639+0200 INFO database Reconnected to database {"after": "4m33.553376516s", "attempts": 14}
2024-04-10T15:55:40.573+0200 INFO database Reconnected to database {"after": "3m18.755781702s", "attempts": 13}
2024-04-10T15:55:45.555+0200 INFO history-sync Synced 7 state history items
2024-04-10T15:55:48.592+0200 INFO database Reconnected to database {"after": "3m56.811296875s", "attempts": 13}
2024-04-10T15:55:53.257+0200 INFO database Reconnected to database {"after": "5m0.005792734s", "attempts": 14}

Contributor

@julianbrost julianbrost left a comment


No more complaints from my side. Compared to the others, I haven't spent much time looking into the details of this PR so far, so I think it would be good to get another approval on this as well.

Member

@yhabteab yhabteab left a comment


2024-04-11T09:43:21.347+0200    INFO    database        Reconnected to database {"after": "4m50.284240542s", "attempts": 14}
2024-04-11T09:43:25.079+0200    INFO    database        Reconnected to database {"after": "4m44.25903525s", "attempts": 14}
2024-04-11T09:43:28.895+0200    INFO    database        Reconnected to database {"after": "2m45.294415s", "attempts": 12}
2024-04-11T09:43:29.917+0200    INFO    database        Reconnected to database {"after": "4m28.553121625s", "attempts": 15}
2024-04-11T09:43:31.347+0200    INFO    database        Reconnected to database {"after": "4m54.634997417s", "attempts": 15}
2024-04-11T09:43:33.639+0200    INFO    database        Reconnected to database {"after": "5m0.007767875s", "attempts": 15}
2024-04-11T09:43:34.152+0200    INFO    database        Reconnected to database {"after": "5m0.007966958s", "attempts": 15}
2024-04-11T09:43:34.820+0200    INFO    database        Reconnected to database {"after": "3m53.386925458s", "attempts": 13}
2024-04-11T09:43:34.922+0200    INFO    database        Reconnected to database {"after": "5m0.006823084s", "attempts": 15}
2024-04-11T09:43:38.775+0200    INFO    database        Reconnected to database {"after": "5m0.00952125s", "attempts": 15}
2024-04-11T09:43:39.100+0200    INFO    history-sync    Synced 622 state history items
2024-04-11T09:43:39.100+0200    INFO    history-sync    Synced 495 notification history items
2024-04-11T09:43:39.272+0200    INFO    runtime-updates Upserted 17 ServiceState items
2024-04-11T09:43:39.846+0200    INFO    database        Reconnected to database {"after": "4m53.632643916s", "attempts": 15}
2024-04-11T09:43:41.586+0200    INFO    database        Reconnected to database {"after": "3m1.067891625s", "attempts": 12}
2024-04-11T09:43:46.617+0200    INFO    database        Reconnected to database {"after": "4m20.077357709s", "attempts": 13}
2024-04-11T09:43:51.625+0200    INFO    database        Reconnected to database {"after": "4m5.054033166s", "attempts": 13}
2024-04-11T09:43:55.058+0200    INFO    database        Reconnected to database {"after": "4m18.504347416s", "attempts": 14}

@julianbrost julianbrost merged commit 6ff3ca5 into main Apr 11, 2024
31 checks passed
@julianbrost julianbrost deleted the retry-every-database-error branch April 11, 2024 08:01
oxzi added a commit that referenced this pull request Apr 11, 2024
During testing, I just encountered a race condition where my Galera
cluster was not yet ready, causing the initial schema check to fail.

```
2024-04-11T08:13:40.401Z        INFO    icingadb        Starting Icinga DB daemon (1.1.1)
2024-04-11T08:13:40.401Z        INFO    icingadb        Connecting to database at 'mysql:3306'
2024-04-11T08:13:40.404Z        FATAL   icingadb        Error 1047 (08S01): WSREP has not yet prepared node for application use
can't check database schema version
github.com/icinga/icingadb/pkg/icingadb.(*DB).CheckSchema
        /go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:115
main.run
        /go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:74
main.main
        /go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:37
runtime.main
        /usr/local/go/src/runtime/proc.go:271
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1695
exit status 1
```

This change now also retries the initial cluster check.

References #698.