
Refine wait_timeout override to be at cut-over only #1406

Open
wants to merge 6 commits into base: master

Conversation

timvaillancourt
Collaborator

@timvaillancourt timvaillancourt commented Apr 11, 2024

Related issue: #1407

Description

This PR refines #1401 by overriding the session wait_timeout only where it is needed: at cut-over, where an idle connection could lead to a potentially long table lock if the gh-ost process (or the host running it) "freezes"/"stalls" at the cut-over stage.

The change (at cut-over only):

  1. Before cut-over:
    • The server wait_timeout is fetched (via an existing SELECT that already fetches time_zone)
    • The applier session wait_timeout is set to --cut-over-idle-timeout-seconds (defaults to 2 x the cut-over lock timeout)
      • This ensures the lock is released if gh-ost stalls here with a still-active connection
  2. The cut-over proceeds as normal
  3. After cut-over, the session wait_timeout is restored to its original, pre-cut-over value (see the sketch below)
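
A rough sketch of that set/restore flow, assuming a helper shaped roughly like the one below (the function name and parameters are illustrative, not the PR's actual code):

package applier

import (
    "database/sql"
    "fmt"
)

// withCutOverIdleTimeout is a sketch of the flow above: shorten the applier
// session wait_timeout for the duration of the cut-over, then restore the
// server's original value afterwards.
func withCutOverIdleTimeout(tx *sql.Tx, serverWaitTimeout, idleTimeoutSeconds int64, cutOver func() error) error {
    // Shorten the idle timeout so a stalled gh-ost process cannot hold the table lock indefinitely.
    if _, err := tx.Exec(fmt.Sprintf("set /* gh-ost */ session wait_timeout:=%d", idleTimeoutSeconds)); err != nil {
        return err
    }
    // Restore the pre-cut-over wait_timeout whether the cut-over succeeds or fails.
    defer tx.Exec(fmt.Sprintf("set /* gh-ost */ session wait_timeout:=%d", serverWaitTimeout))

    return cutOver()
}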

The --mysql-wait-timeout flag added in #1401 is replaced with the new --cut-over-idle-timeout-seconds flag. No release has been cut since #1401, so this isn't necessarily a breaking change

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

Signed-off-by: Tim Vaillancourt <tvaillancourt@slack-corp.com>
@timvaillancourt timvaillancourt added this to the v1.1.7 milestone Apr 11, 2024
@timvaillancourt timvaillancourt changed the title from "Refine wait_timeout flag to be cut-over only" to "Refine wait_timeout override to be at cut-over only" Apr 11, 2024
Signed-off-by: Tim Vaillancourt <tvaillancourt@slack-corp.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tvaillancourt@slack-corp.com>
Signed-off-by: Tim Vaillancourt <tvaillancourt@slack-corp.com>
Contributor

@shlomi-noach shlomi-noach left a comment

Nice work, and in particular great analysis of this stalemate situation! This looks generally correct, but see inline comments:

  • It breaks gh-ost for existing users
  • It adds variables where I think we can do without
  • It adds code complexity which we can delegate to MySQL.

@@ -135,6 +135,7 @@ type MigrationContext struct {
CriticalLoadHibernateSeconds int64
PostponeCutOverFlagFile string
CutOverLockTimeoutSeconds int64
CutOverIdleTimeoutSeconds int64
Contributor

Since this value must be greater than CutOverLockTimeoutSeconds (or else the cut-over may time out unjustifiably), I suggest removing it altogether and setting something like CutOverLockTimeoutSeconds * 2.

Collaborator Author

@shlomi-noach a CutOverLockTimeoutSeconds * 2 approach is what I took initially, and I think it will probably work in practice, but I may have got a bit paranoid about one detail that I hope you can validate.

The CutOverLockTimeoutSeconds (== lock_wait_timeout) tells MySQL how long the lock can be held once it is taken (3 seconds by default), but nothing controls how long the "whole operation" takes to issue. So if we use CutOverLockTimeoutSeconds * 2 as the wait_timeout (let's call this 6), we are assuming the RENAME will always take < 6 seconds: 3 seconds max locking, 3 seconds max in "processing"/network time.

So what if, for whatever crazy reason (we are in this kind of territory now with this PR 😄), the server (or network) is slow to process the RENAME operation in the 2nd connection? The connection holding the table locks (with wait_timeout=6) may close at 6 seconds, and in some extremely rare edge cases that may be too early. Can you help validate this idea?

This paranoia led me to make the wait_timeout configurable (which still doesn't solve every case) and to make sure the "waiting" connection keeps its connection "warm", in case the RENAME is slow.
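
Roughly the kind of keep-alive I mean, as a sketch (the one-second interval and helper name are illustrative, not the PR's exact code):

package applier

import (
    "database/sql"
    "time"
)

// keepLockSessionWarm sketches the keep-alive idea: while waiting for the
// RENAME in the other session, periodically issue a trivial statement on the
// lock-holding connection so it never sits idle long enough to hit wait_timeout.
func keepLockSessionWarm(tx *sql.Tx, okToUnlockTable <-chan bool) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-okToUnlockTable:
            return
        case <-ticker.C:
            // Any lightweight query resets this session's idle timer.
            if _, err := tx.Exec("select /* gh-ost */ 1"); err != nil {
                return
            }
        }
    }
}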

Contributor

@timvaillancourt I see your use case, and yes, I agree it's possible! I'd argue that we still don't need nor want to introduce a new variable. Pick your paranoia level based on CutOverLockTimeoutSeconds and choose some formula, e.g. CutOverLockTimeoutSeconds * 2 + 5*time.Second. This is a safer approach than introducing yet another variable that isn't validated at all.
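
As a sketch of that formula (expressed in seconds, since that's what the session wait_timeout takes; this shows the idea, not the final code):

// cutOverIdleTimeoutSeconds sketches deriving the cut-over session wait_timeout
// (in seconds) from the existing lock timeout plus a fixed safety margin for a
// slow RENAME, instead of introducing a new flag.
func cutOverIdleTimeoutSeconds(cutOverLockTimeoutSeconds int64) int64 {
    return cutOverLockTimeoutSeconds*2 + 5
}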

@@ -51,7 +51,6 @@ func main() {
flag.StringVar(&migrationContext.AssumeMasterHostname, "assume-master-host", "", "(optional) explicitly tell gh-ost the identity of the master. Format: some.host.com[:port] This is useful in master-master setups where you wish to pick an explicit master, or in a tungsten-replicator where gh-ost is unable to determine the master")
flag.IntVar(&migrationContext.InspectorConnectionConfig.Key.Port, "port", 3306, "MySQL port (preferably a replica, not the master)")
flag.Float64Var(&migrationContext.InspectorConnectionConfig.Timeout, "mysql-timeout", 0.0, "Connect, read and write timeout for MySQL")
flag.Float64Var(&migrationContext.InspectorConnectionConfig.WaitTimeout, "mysql-wait-timeout", 0.0, "wait_timeout for MySQL sessions")
Contributor

Removing a flag makes this PR backwards incompatible, and will break gh-ost for any user supplying the flag.

Collaborator Author

@shlomi-noach this flag was introduced just a few days prior in #1401 to solve the same root cause and no release has been cut yet with it

I later decided to isolate the wait_timeout change to cut-over time only (in this PR) to address the problem more closely and to avoid a larger change

That said, I think we could keep this flag around for more generic reasons

Contributor

Ah, cool, I did not realize the flag was only added recently. Remove it, then!

@@ -934,6 +937,42 @@ func (this *Applier) CreateAtomicCutOverSentryTable() error {
return nil
}

// WaitForAtomicCutOverRename waits for the cut-over RENAME operation while periodically
// pinging the applier connection to avoid the connection becoming idle.
func (this *Applier) WaitForAtomicCutOverRename(tx *gosql.Tx, okToUnlockTable <-chan bool) {
Contributor

I'm not sure this function is required, and it adds more complexity. If we set @@session.wait_timeout to be somewhat bigger than lock-wait-timeout, then we don't need to do a connection keep-alive. I see the idea behind this, but I think it's simpler to just set an appropriate wait_timeout and let MySQL handle it.

Collaborator Author

@shlomi-noach see earlier comment. This is borne of paranoia that the RENAME in the other session could take longer than lock_wait_timeout * 2, which I believe is hypothetically possible (e.g. the server is slow, network, process scheduling, hand-wavy things, etc.)

To be fair, these edge cases are extremely "edge"

Contributor

I'd say I still think this function adds complexity, and it's still better to use @@wait_timeout based on some (same) formula such as the one presented above.

this.migrationContext.Log.Infof("Setting cut-over idle timeout as %d seconds", this.migrationContext.CutOverIdleTimeoutSeconds)
query := fmt.Sprintf(`set /* gh-ost */ session wait_timeout:=%d`, this.migrationContext.CutOverIdleTimeoutSeconds)
_, err = tx.Exec(query)
return func() {
Contributor

The generated function doesn't use anything special coming from InitAtomicCutOverIdleTimeout, i.e. no local variables etc. As such, it would be simpler for InitAtomicCutOverIdleTimeout to only return an error, and for the calling code (AtomicCutOverMagicLock(), below) to just defer sqlutils.ExecNoPrepare(this.db, <restore wait time>).
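
For illustration, the simplified function might look roughly like this (a hedged sketch based on the snippet above, assuming the existing Applier type and imports in applier.go, not necessarily the final code):

// InitAtomicCutOverIdleTimeout (sketch): only set the session wait_timeout and
// return an error; restoring the original value is left to a defer at the call site.
func (this *Applier) InitAtomicCutOverIdleTimeout(tx *gosql.Tx) error {
    this.migrationContext.Log.Infof("Setting cut-over idle timeout as %d seconds", this.migrationContext.CutOverIdleTimeoutSeconds)
    query := fmt.Sprintf(`set /* gh-ost */ session wait_timeout:=%d`, this.migrationContext.CutOverIdleTimeoutSeconds)
    _, err := tx.Exec(query)
    return err
}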

@@ -974,6 +1013,15 @@ func (this *Applier) AtomicCutOverMagicLock(sessionIdChan chan int64, tableLocke
return err
}

if this.migrationContext.CutOverIdleTimeoutSeconds >= 1 {
restoreIdleTimeoutFunc, err := this.InitAtomicCutOverIdleTimeout(tx)
Contributor

IMO should be replaced with the following.

Suggested change, replacing:

restoreIdleTimeoutFunc, err := this.InitAtomicCutOverIdleTimeout(tx)

with:

defer func() {
	if _, err := sqlutils.ExecNoPrepare(this.db, query); err != nil {
		this.migrationContext.Log.Errorf("Failed to restore applier wait_timeout to %d seconds: %v",
			this.migrationContext.ApplierWaitTimeout, err,
		)
	}
}()
err := this.InitAtomicCutOverIdleTimeout(tx)

Collaborator Author

@shlomi-noach makes sense 👍. I added a few questions to your responses; I'll make this change once we've reached agreement. Thanks for checking 🙇

@timvaillancourt
Collaborator Author

timvaillancourt commented Apr 22, 2024

@shlomi-noach I'm curious about your feedback on one consequence of this change

Following this PR, it is possible for the session holding the lock tables to time out (and unlock tables) before the "magic table" is dropped here

If I understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed. That all sounds better than before, but the "magic table" will be left behind, and I don't fully understand the impact of that.

-initially-drop-old-table
    	Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists

Would gh-ost just fix this scenario for users with -initially-drop-old-table? Any other race-condition risks you can see the lock-release causing 🤔? 🙇

@github github deleted a comment Apr 29, 2024
@shlomi-noach
Contributor

If I understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed.

No, actually. The RENAME will not succeed, because the magic table is still in place. The RENAME statement attempts to rename original-table into magic-table. But since magic-table is there, the RENAME will fail.

The next cut-over attempt will first, before placing any locks, attempt to DropAtomicCutOverSentryTableIfExists() before re-creating it.

This should be safe.

Successfully merging this pull request may close these issues.

cut-over locks not released when gh-ost pauses mid-cut-over