max-lag-millis parameter description is misleading when running on master #1387

Open
Tracked by #1400
dontstopbelieveing opened this issue Mar 9, 2024 · 3 comments

Comments

@dontstopbelieveing
Contributor

This comes out of an issue we kept running into that caused great pain and outages.

We would run gh-ost with:

gh-ost --allow-on-master -assume-rbr \
-critical-load Threads_running=700 \
-critical-load-hibernate-seconds 60 \
-max-load Threads_running=100 \
-nice-ratio 0.2 \
-chunk-size 5000 \
-ask-pass \
-database=sbtest \
-table=sbtest1 \
-user sbtest \
-host=XXX \
-postpone-cut-over-flag-file /tmp/gh-ost-postpone-cut-over-flag-sbtest1.log \
-skip-foreign-key-checks -throttle-additional-flag-file /tmp/gh-ost-throttle_flag-sbtest1.log \
--alter "add column c1 int(11) NOT NULL DEFAULT '1';" \
-verbose -initially-drop-old-table -execute 2>&1 | tee /tmp/gh-ost-sbtest1.log

Then, once row copy was complete, we would sit in the migrating stage for a long time applying binlogs. At this point the heartbeat lag would be 10-30 seconds. We thought that if we increased max-lag-millis from 1500 to 10000 this would give us less throttling and speed up binlog reading and applying (silly us!).

Heartbeat lag would drop below 10 seconds, we would remove the cut-over flag file, and then run into
"ERROR Timeout while waiting for events up to lock"

which made sense, since 10 seconds > the cut-over lock timeout of 6 seconds.

Our ask is that the documentation be updated to point out this important effect of a seemingly innocent parameter, as is evident here: https://github.com/github/gh-ost/blob/master/go/logic/migrator.go#L504

if heartbeatLag > maxLagMillisecondsThrottle || heartbeatLag > cutOverLockTimeout {
	this.migrationContext.Log.Debugf("current HeartbeatLag (%.2fs) is too high, it needs to be less than both --max-lag-millis (%.2fs) and --cut-over-lock-timeout-seconds (%.2fs) to continue", heartbeatLag.Seconds(), maxLagMillisecondsThrottle.Seconds(), cutOverLockTimeout.Seconds())
	return true, nil
}
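For illustration only (this is not gh-ost code), a pre-flight sanity check along these lines would have caught our mistake. The flag names mirror gh-ost's; the helper itself is hypothetical, and the defaults are only what we believe gh-ost ships with:

package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// Hypothetical pre-flight check, not part of gh-ost: verify that the value
// passed to --max-lag-millis can ever satisfy the cut-over condition, which
// also requires heartbeat lag to stay below --cut-over-lock-timeout-seconds.
func main() {
	// Defaults here are what we believe gh-ost ships with; double-check yours.
	maxLagMillis := flag.Int64("max-lag-millis", 1500, "lag at which to throttle")
	cutOverLockTimeoutSeconds := flag.Int64("cut-over-lock-timeout-seconds", 3, "max lock wait during cut-over")
	flag.Parse()

	maxLag := time.Duration(*maxLagMillis) * time.Millisecond
	cutOverTimeout := time.Duration(*cutOverLockTimeoutSeconds) * time.Second

	if maxLag > cutOverTimeout {
		// Heartbeat lag is allowed to grow up to max-lag-millis before gh-ost
		// throttles, but cut-over additionally requires lag below the cut-over
		// lock timeout, so a lag between the two thresholds stalls cut-over.
		fmt.Fprintf(os.Stderr,
			"warning: --max-lag-millis (%v) exceeds --cut-over-lock-timeout-seconds (%v); cut-over may never proceed\n",
			maxLag, cutOverTimeout)
		os.Exit(1)
	}
	fmt.Println("lag/cut-over thresholds look consistent")
}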

I also have 2 questions:

  1. After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release locks?
  2. We obviously didn't wait for the rollback to complete, but shouldn't the metadata locks be released prior to the rollback, right after this error occurs? In our case both the new and old tables remained locked until we killed the process (see the diagnostic sketch below).

For context, we are on AWS Aurora, and the high heartbeat lag is a side effect of aurora_binlog_replication_max_yield_seconds being set to a non-zero value.
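One way to see which connection still holds metadata locks on the original table and gh-ost's side tables is to query performance_schema. A minimal sketch, assuming performance_schema is enabled with the wait/lock/metadata/sql/mdl instrument, a placeholder DSN, and the go-sql-driver/mysql package (an assumption, not something gh-ost provides):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // assumed driver, not part of gh-ost
)

func main() {
	// Placeholder DSN; adjust user/password/host for your environment.
	db, err := sql.Open("mysql", "sbtest:password@tcp(XXX:3306)/sbtest")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// List metadata locks held or pending on the original table and the
	// gh-ost side tables, together with the owning connection id.
	rows, err := db.Query(`
		SELECT ml.OBJECT_SCHEMA, ml.OBJECT_NAME, ml.LOCK_TYPE, ml.LOCK_STATUS, t.PROCESSLIST_ID
		FROM performance_schema.metadata_locks ml
		JOIN performance_schema.threads t ON t.THREAD_ID = ml.OWNER_THREAD_ID
		WHERE ml.OBJECT_SCHEMA = 'sbtest'
		  AND ml.OBJECT_NAME IN ('sbtest1', '_sbtest1_gho', '_sbtest1_del')`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var schema, name, lockType, lockStatus string
		var connectionID sql.NullInt64
		if err := rows.Scan(&schema, &name, &lockType, &lockStatus, &connectionID); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s.%s  %s  %s  connection=%d\n", schema, name, lockType, lockStatus, connectionID.Int64)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}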

@dontstopbelieveing
Contributor Author

Created this PR to update the documentation: #1388

@timvaillancourt
Collaborator

After this error, gh-ost went back to the migrating stage, supposedly to do rollbacks, which caused an availability issue and we had to kill the process. What is it rolling back? Since no cut-over was done, should it not simply abandon things and release locks?

@dontstopbelieveing could you elaborate on this point?

To my knowledge, gh-ost doesn't have any concept of "rolling something back", so I wonder if what you're seeing is a MySQL/InnoDB behaviour.

@dontstopbelieveing
Contributor Author

dontstopbelieveing commented Mar 13, 2024

I'll add what we see in our test here. We land at the postpone cut-over stage:

Copy: 100000000/100000000 100.0%; Applied: 751753; Backlog: 0/1000; Time: 1h27m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232697:134650910; Lag: 0.09s, HeartbeatLag: 9.49s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
[2024/03/09 11:31:04] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232698, 4)
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:134667787 to mysql-bin-changelog.232698
2024-03-09 11:31:04 INFO rotate to next log from mysql-bin-changelog.232698:0 to mysql-bin-changelog.232698
Copy: 100000000/100000000 100.0%; Applied: 752224; Backlog: 0/1000; Time: 1h28m0s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232698:134396827; Lag: 0.09s, HeartbeatLag: 18.59s, State: postponing cut-over; ETA: due
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
[2024/03/09 11:31:24] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232699, 4)
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:134397338 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699
2024-03-09 11:31:24 INFO rotate to next log from mysql-bin-changelog.232699:0 to mysql-bin-changelog.232699

And then, once we delete the cut-over flag file:

2024-03-09 11:31:27 INFO Grabbing voluntary lock: gh-ost.2374.lock
2024-03-09 11:31:27 INFO Setting LOCK timeout as 6 seconds
2024-03-09 11:31:27 INFO Looking for magic cut-over table
2024-03-09 11:31:27 INFO Creating magic cut-over table `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Magic cut-over table created
2024-03-09 11:31:27 INFO Locking `sbtest`.`sbtest1`, `sbtest`.`_sbtest1_del`
2024-03-09 11:31:27 INFO Tables locked
2024-03-09 11:31:27 INFO Session locking original & magic tables is 2374
2024-03-09 11:31:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1709983887171918805
2024-03-09 11:31:27 INFO Waiting for events up to lock
2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 ERROR 2024-03-09 11:31:30 ERROR Timeout while waiting for events up to lock
2024-03-09 11:31:30 INFO Looking for magic cut-over table
2024-03-09 11:31:30 INFO Will now proceed to drop magic table and unlock tables
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping magic cut-over table
2024-03-09 11:31:30 INFO Dropping table `sbtest`.`_sbtest1_del`

So far so good. At this point I would expect the metadata locks to be released, but they don't get released, and the log has these entries:

Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m10s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 7.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m15s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 12.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m20s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 17.19s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752633; Backlog: 0/1000; Time: 1h28m25s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232699:134374965; Lag: 0.09s, HeartbeatLag: 22.19s, State: migrating; ETA: due
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:134391842 to mysql-bin-changelog.232700
2024-03-09 11:31:46 INFO rotate to next log from mysql-bin-changelog.232700:0 to mysql-bin-changelog.232700
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
[2024/03/09 11:31:46] [info] binlogsyncer.go:723 rotate to (mysql-bin-changelog.232700, 4)
2024-03-09 11:31:47 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-03-09 11:31:47 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m30s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 5.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m35s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 10.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m40s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 15.79s, State: migrating; ETA: due
Copy: 100000000/100000000 100.0%; Applied: 752689; Backlog: 0/1000; Time: 1h28m45s(total), 1h8m1s(copy); streamer: mysql-bin-changelog.232700:135022773; Lag: 0.09s, HeartbeatLag: 20.79s, State: migrating; ETA: due

This continued until we manually killed the gh-ost process.

I am baffled by two issues:

  1. After the error, the metadata locks should have been released and gh-ost should arguably have exited, but this did not happen.
  2. I am confused about what the migrating stage afterwards is. Is it a retry? Even then, the retry should release the metadata locks and reacquire them, IMO.

The reason I said "rolling something back" is that the effect I see on MySQL is similar to a long-running transaction doing a rollback. This might not actually be a rollback but gh-ost running something else.
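To sanity-check whether cut-over ever had a window, one can scan the status lines for HeartbeatLag and compare against the cut-over lock timeout. A minimal sketch, assuming the "HeartbeatLag: X.XXs" format shown in the logs above and the 6-second timeout reported in our run:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
)

// Scan a gh-ost log (path given as the first argument) and report whether
// HeartbeatLag ever dropped below the cut-over threshold. The 6-second value
// mirrors the lock timeout from our logs; adjust it to your settings.
func main() {
	const cutOverThresholdSeconds = 6.0

	if len(os.Args) < 2 {
		log.Fatal("usage: lagscan <gh-ost log file>")
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	re := regexp.MustCompile(`HeartbeatLag: ([0-9.]+)s`)
	everBelow := false
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			lag, _ := strconv.ParseFloat(m[1], 64)
			if lag < cutOverThresholdSeconds {
				everBelow = true
				fmt.Printf("lag dipped to %.2fs (below %.0fs): cut-over had a window here\n", lag, cutOverThresholdSeconds)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if !everBelow {
		fmt.Println("HeartbeatLag never dropped below the cut-over threshold; every cut-over attempt would throttle")
	}
}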

@ramshad-sha mentioned this issue Mar 25, 2024