Refine `wait_timeout` override to be at cut-over only #1406
base: master
Conversation
`wait_timeout` override to be at cut-over only

Signed-off-by: Tim Vaillancourt <tvaillancourt@slack-corp.com>
Nice work, and in particular great analysis of this stalemate situation! This looks generally correct, but see inline comments:

- It breaks `gh-ost` for existing users
- It adds variables where I think we can do without
- It adds code complexity which we can delegate to MySQL.
```diff
@@ -135,6 +135,7 @@ type MigrationContext struct {
 	CriticalLoadHibernateSeconds int64
 	PostponeCutOverFlagFile      string
 	CutOverLockTimeoutSeconds    int64
+	CutOverIdleTimeoutSeconds    int64
```
Since this value must be greater than `CutOverLockTimeoutSeconds` (or else cut-over may time out unjustifiably), I suggest removing it altogether and setting something like `CutOverLockTimeoutSeconds * 2`.
@shlomi-noach a `CutOverLockTimeoutSeconds * 2` approach is what I took initially, and I think it will probably work in practice, but I may have gotten a bit paranoid about one detail that I hope you can validate.

`CutOverLockTimeoutSeconds` (== `lock_wait_timeout`) tells MySQL how long the lock can be held once it is taken (3 seconds by default), but nothing is controlling how long the "whole operation" takes to issue. So if we use `CutOverLockTimeoutSeconds * 2` as the `wait_timeout`, let's call this `6`, we are assuming the `RENAME` will always take < `6` seconds: `3` seconds max locking, `3` seconds max in "processing"/network time.

So what if, for whatever crazy reason (we are in this kind of territory now with this PR 😄), the server (or network) is slow to process the `RENAME` operation in the 2nd connection? The connection holding the table locks (with `wait_timeout=6`) may close at `6` seconds, and in some extremely rare edge cases this may be too early. Can you help validate this idea?

This paranoia led me to make the `wait_timeout` configurable (which still doesn't solve every case) and to make sure the "waiting" connection keeps its connection "warm", in case the `RENAME` is slow.
@timvaillancourt I see your use case, and yes, I agree it's possible! I'd argue that we still don't need nor want to introduce a new variable. Pick your paranoia level based on `CutOverLockTimeoutSeconds` and choose some formula, e.g. use `CutOverLockTimeoutSeconds * 2 + 5*time.Second`. This is a safer approach than introducing yet another variable that isn't validated at all.
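The suggested formula is simple enough to state as a one-liner; a sketch of what it might look like, with the function name being illustrative rather than gh-ost's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// cutOverWaitTimeout derives the session wait_timeout for the lock-holding
// connection from the configured cut-over lock timeout, per the reviewer's
// suggested formula: twice the lock timeout plus a fixed 5-second margin
// for "processing"/network time.
func cutOverWaitTimeout(cutOverLockTimeoutSeconds int64) time.Duration {
	return time.Duration(cutOverLockTimeoutSeconds)*2*time.Second + 5*time.Second
}

func main() {
	// gh-ost's default cut-over lock timeout is 3 seconds.
	fmt.Println(int64(cutOverWaitTimeout(3).Seconds())) // 11
}
```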
```diff
@@ -51,7 +51,6 @@ func main() {
 	flag.StringVar(&migrationContext.AssumeMasterHostname, "assume-master-host", "", "(optional) explicitly tell gh-ost the identity of the master. Format: some.host.com[:port] This is useful in master-master setups where you wish to pick an explicit master, or in a tungsten-replicator where gh-ost is unable to determine the master")
 	flag.IntVar(&migrationContext.InspectorConnectionConfig.Key.Port, "port", 3306, "MySQL port (preferably a replica, not the master)")
 	flag.Float64Var(&migrationContext.InspectorConnectionConfig.Timeout, "mysql-timeout", 0.0, "Connect, read and write timeout for MySQL")
-	flag.Float64Var(&migrationContext.InspectorConnectionConfig.WaitTimeout, "mysql-wait-timeout", 0.0, "wait_timeout for MySQL sessions")
```
Removing a flag makes this PR backwards incompatible, and will break `gh-ost` for any user supplying the flag.
@shlomi-noach this flag was introduced just a few days prior in #1401 to solve the same root cause, and no release has been cut yet with it.

I later decided to isolate the `wait_timeout` change to cut-over time only (in this PR) to address the problem more closely and to avoid a larger change.

That said, I think we could keep this flag around for more generic reasons.
Ah, cool, I did not realize the flag was only added recently. Remove it, then!
```diff
@@ -934,6 +937,42 @@ func (this *Applier) CreateAtomicCutOverSentryTable() error {
 	return nil
 }

+// WaitForAtomicCutOverRename waits for the cut-over RENAME operation while periodically
+// pinging the applier connection to avoid the connection becoming idle.
+func (this *Applier) WaitForAtomicCutOverRename(tx *gosql.Tx, okToUnlockTable <-chan bool) {
```
I'm not sure this function is required, and it adds more complexity. If we set `@@session.wait_timeout` to be somewhat bigger than lock-wait-timeout, then we don't need to do a connection keep-alive. I see the idea behind this, but I think it's simpler to just set an appropriate `wait_timeout` and let MySQL handle it.
@shlomi-noach see earlier comment. This is born of paranoia that the `RENAME` in the other session could take longer than `lock_wait_timeout * 2`, which I believe is hypothetically possible (e.g. the server is slow, network, process scheduling, hand-wavy things, etc.).

To be fair, these edge cases are extremely "edge".
I'd say I still think this function adds complexity, and it's still better to use `@@wait_timeout` based on some (same) formula such as the one presented above.
```go
	this.migrationContext.Log.Infof("Setting cut-over idle timeout as %d seconds", this.migrationContext.CutOverIdleTimeoutSeconds)
	query := fmt.Sprintf(`set /* gh-ost */ session wait_timeout:=%d`, this.migrationContext.CutOverIdleTimeoutSeconds)
	_, err = tx.Exec(query)
	return func() {
```
The generated function doesn't use anything special coming from `InitAtomicCutOverIdleTimeout`, i.e. no local variables etc. As such, it would be simpler for `InitAtomicCutOverIdleTimeout` to only return `error`, and for the calling code (`AtomicCutOverMagicLock()`, below) to just `defer sqlutils.ExecNoPrepare(this.db, <restore wait time>)`.
```diff
@@ -974,6 +1013,15 @@ func (this *Applier) AtomicCutOverMagicLock(sessionIdChan chan int64, tableLocke
 	return err
 }

+	if this.migrationContext.CutOverIdleTimeoutSeconds >= 1 {
+		restoreIdleTimeoutFunc, err := this.InitAtomicCutOverIdleTimeout(tx)
```
IMO should be replaced with:

```diff
-	restoreIdleTimeoutFunc, err := this.InitAtomicCutOverIdleTimeout(tx)
+	defer func() {
+		if _, err := sqlutils.ExecNoPrepare(this.db, query); err != nil {
+			this.migrationContext.Log.Errorf("Failed to restore applier wait_timeout to %d seconds: %v",
+				this.migrationContext.ApplierWaitTimeout, err,
+			)
+		}
+	}()
+	err := this.InitAtomicCutOverIdleTimeout(tx)
```
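The defer-based restore the reviewer suggests can be sketched generically. Here the `exec` parameter stands in for `sqlutils.ExecNoPrepare(this.db, ...)`, and the function and timeout values are illustrative assumptions, not gh-ost's actual code:

```go
package main

import "fmt"

// atomicCutOverMagicLock sketches the suggested shape: override the session
// wait_timeout up front, and defer the restore in the caller so it runs on
// every exit path, instead of threading a restore closure back from a helper.
func atomicCutOverMagicLock(exec func(query string, args ...interface{}) error, idleTimeoutSeconds, originalTimeoutSeconds int64) error {
	// Override wait_timeout for the duration of cut-over.
	if err := exec("set session wait_timeout = ?", idleTimeoutSeconds); err != nil {
		return err
	}
	// Restore it no matter how this function returns.
	defer func() {
		if err := exec("set session wait_timeout = ?", originalTimeoutSeconds); err != nil {
			fmt.Printf("Failed to restore applier wait_timeout to %d seconds: %v\n", originalTimeoutSeconds, err)
		}
	}()
	// ... table locking and RENAME coordination would happen here ...
	return nil
}
```

The advantage over returning a restore func is that there is exactly one exit path to reason about: the `defer` fires whether the lock succeeds, fails, or panics.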
@shlomi-noach makes sense 👍. I added a few questions to your responses; I'll make this change once we've reached agreement. Thanks for checking 🙇
@shlomi-noach I'm curious about your feedback on one consequence of this change. Following this PR, it is possible for the session holding the table locks to be closed by MySQL when it hits the new `wait_timeout`. If I understand this right, the lock-session hitting this timeout […]

> `-initially-drop-old-table`
> Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists

Would […]?

No, actually. The […] The next cut-over attempt will first, before placing any locks, attempt to […]

This should be safe.
Related issue: #1407

Description

This PR refines #1401 by overriding the session `wait_timeout` only where it is needed: at cut-over time, where an idle connection could lead to a potentially long table lock if the `gh-ost` process (or the host running it) "freezes"/"stalls" at the cut-over stage.

The change (at cut-over only):

- The current session `wait_timeout` is fetched (via an existing `select` that fetched `time_zone`)
- The session `wait_timeout` is set to be == `--cut-over-idle-timeout-seconds` (defaults to 2 x lock-wait timeout), in case `gh-ost` stalls with a still-active connection here
- After cut-over, the session `wait_timeout` is restored to what it was set to pre-cut-over

The `--mysql-wait-timeout` flag added in #1401 is replaced with the new `--cut-over-idle-timeout-seconds` flag. No release has been cut since #1401, so this isn't necessarily a breaking change.

- [x] `script/cibuild` returns with no formatting errors, build errors or unit test errors.
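The three steps in the description map, roughly, to plain MySQL statements issued on the applier connection. The literal values below are illustrative (a 6-second override and MySQL's 28800-second default), not the exact queries in this PR:

```sql
-- 1. Fetch the current wait_timeout alongside time_zone (one existing round-trip)
select @@session.time_zone, @@session.wait_timeout;

-- 2. Override it for the cut-over window (value from --cut-over-idle-timeout-seconds)
set /* gh-ost */ session wait_timeout = 6;

-- ... atomic cut-over (table locks + RENAME coordination) happens here ...

-- 3. Restore the pre-cut-over value
set /* gh-ost */ session wait_timeout = 28800;
```

If `gh-ost` stalls between steps 2 and 3, MySQL itself closes the idle connection after the override elapses, releasing any table locks it held.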