InlineVerifier Binlog Reverifier #122
Conversation
Force-pushed 23ea5df to 5c20186
Looks correct to me. A few comments and questions on some specifics.
func (c *InlineVerifierConfig) Validate() error {
	var err error
	if c.MaxExpectedDowntime != "" {
Should this throw an error if the maxExpectedDowntime is not set? Or at least set a default value like VerifyBinlogEventsInterval below?
I'll make this more clear: if this option is not specified, the InlineVerifier will not try to estimate the max expected downtime before cutover and hence won't abort the cutover if the expected downtime is very long.
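To make the discussed behavior concrete, here is a minimal sketch of a Validate that keeps MaxExpectedDowntime optional while giving VerifyBinlogEventsInterval a default. The struct fields are taken from the diff above; the internal duration fields and the 1s default are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// InlineVerifierConfig mirrors the fields visible in the diff above; the
// unexported duration fields and the 1s default are illustrative assumptions.
type InlineVerifierConfig struct {
	MaxExpectedDowntime        string // optional; empty means "never abort cutover"
	VerifyBinlogEventsInterval string

	maxExpectedDowntimeDuration        time.Duration
	verifyBinlogEventsIntervalDuration time.Duration
}

func (c *InlineVerifierConfig) Validate() error {
	// MaxExpectedDowntime stays optional: when unset, the verifier does not
	// estimate the downtime and never aborts the cutover for taking too long.
	if c.MaxExpectedDowntime != "" {
		d, err := time.ParseDuration(c.MaxExpectedDowntime)
		if err != nil {
			return err
		}
		c.maxExpectedDowntimeDuration = d
	}

	// VerifyBinlogEventsInterval gets a default rather than an error.
	if c.VerifyBinlogEventsInterval == "" {
		c.VerifyBinlogEventsInterval = "1s"
	}
	d, err := time.ParseDuration(c.VerifyBinlogEventsInterval)
	if err != nil {
		return err
	}
	c.verifyBinlogEventsIntervalDuration = d
	return nil
}

func main() {
	c := &InlineVerifierConfig{}
	if err := c.Validate(); err != nil {
		panic(err)
	}
	fmt.Println(c.VerifyBinlogEventsInterval)
}
```

With no values set, Validate succeeds and only the interval is defaulted.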
inline_verifier.go
Outdated
v.logger.Info("starting periodic reverifier")

for !v.periodicVerifyStopped {
	time.Sleep(v.VerifyBinlogEventsInterval)
nit: should this use a time.Ticker instead?
	"iteration": i,
}).Infof("completed re-verification iteration %d", i)

if after <= 1000 || after >= before {
Where is the 1000 coming from?
This is how the IterativeVerifier functioned previously: if there are fewer than 1000 rows in the reverify store, it stops the reverification before cutover and allows the cutover to proceed.
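A toy sketch of that pre-cutover drain loop: keep reverifying until the backlog is below 1000 events or a pass stops making progress, up to a bounded number of iterations. The function name and the verifyPass callback are illustrative; only the 1000 threshold and the iteration cap come from the discussion above.

```go
package main

import "fmt"

// drainBeforeCutover repeatedly runs verifyPass over the reverify backlog.
// verifyPass takes the backlog size before the pass and returns the size
// after it. The loop stops when the backlog is small enough to finish
// during cutover (<= 1000) or when a pass made no progress.
func drainBeforeCutover(initial, maxIterations int, verifyPass func(before int) (after int)) int {
	before := initial
	for i := 0; i < maxIterations; i++ {
		after := verifyPass(before)
		if after <= 1000 || after >= before {
			return after
		}
		before = after
	}
	return before
}

func main() {
	// Toy model: each pass halves the backlog.
	remaining := drainBeforeCutover(100000, 30, func(before int) int { return before / 2 })
	fmt.Println(remaining)
}
```

In this toy model the loop exits once the backlog drops to 781, below the 1000-row threshold.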
	}
}

if v.MaxExpectedDowntime != 0 && timeToVerify > v.MaxExpectedDowntime {
We are intentionally overwriting the timeToVerify value, right? To get the time the last iteration over the re-verify queue takes? Does it make sense to just take the last value?
This code is copied verbatim from the IterativeVerifier. The estimate is not perfect but I don't want to change from what "works" for now.
Ok. Is it worth tracking this in an issue? Are we measuring the accuracy at all?
Can create an issue. That said, I don't know how/if this is measured while running real moves. I'd imagine the precision of the measurement is pretty low, as the speed at which we can copy binlogs depends heavily on external factors (such as database load), and these factors tend to be highly transient in nature.
we don't need this to be accurate, not sure i understand the concern
}, nil
}

var messageBuf bytes.Buffer
maybe move the error message composition below into a separate function?
It is only used once here, as it's a special message tailored for VerifyDuringCutover. I can leave a comment saying that this block builds the error message?
works for me 👍
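For illustration, here is a hedged sketch of what extracting the bytes.Buffer message composition into a helper could look like. The message format, the function name, and the mismatch map shape are all assumptions, not Ghostferry's actual VerifyDuringCutover output.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// formatMismatchMessage builds the cutover failure message from a map of
// table name -> mismatched primary keys. The format here is purely
// illustrative; the real message is tailored for VerifyDuringCutover.
func formatMismatchMessage(mismatches map[string][]uint64) string {
	var messageBuf bytes.Buffer
	messageBuf.WriteString("cutover verification failed for: ")

	// Sort the table names so the message is deterministic.
	tables := make([]string, 0, len(mismatches))
	for table := range mismatches {
		tables = append(tables, table)
	}
	sort.Strings(tables)

	for i, table := range tables {
		if i > 0 {
			messageBuf.WriteString("; ")
		}
		fmt.Fprintf(&messageBuf, "%s [pks: %v]", table, mismatches[table])
	}
	return messageBuf.String()
}

func main() {
	fmt.Println(formatMismatchMessage(map[string][]uint64{"db.users": {1, 42}}))
}
```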
// Note that the FingerprintQuery returns the columns: pk, fingerprint,
// compressedData1, compressedData2, ...
// If there are no compressed data, only 2 columns are returned and this
// loop will be skipped.
thanks for the comment 👍
Force-pushed b046de4 to 616416e
I've made some minor changes to fix the comments raised above.
Force-pushed 616416e to 9b9dd44
I'll merge this tomorrow if there are no more comments on this. We can always address anything else that pops up in a follow-up PR.
I was planning on reviewing tomorrow.
OK. I'll hold off until you give it a look. Thanks for spending the time!
config.go
Outdated
}

if c.FinalReverifyMaxIterations == 0 {
	c.FinalReverifyMaxIterations = 30
Up to you, but this is a great example of too many configuration options causing more confusion for the user 😛 I personally think MaxExpectedDowntime is sufficient.
I agree. I'll hard-code 30 into the code base for now. I had made this configurable because MaxExpectedDowntime is not likely to be accurate. We should open a low-priority issue to improve MaxExpectedDowntime, as @fjordan suggested.
config.go
Outdated
MaxExpectedDowntime string

// The interval at which the periodic binlog reverification occurs, in the
// format of time.ParseDuration.
"1s by default"
-inlineVerifier := f.NewInlineVerifier()
-f.Verifier = inlineVerifier
+f.inlineVerifier = f.NewInlineVerifier()
+f.Verifier = f.inlineVerifier
ah gross, are you cleaning this up after? (in which case, I'd like a TODO here)
Yeah I'll make the InlineVerifier an integrated option so this should become better.
@@ -461,6 +472,10 @@ func (f *Ferry) Start() error {
 f.BinlogStreamer.AddEventListener(f.BinlogWriter.BufferBinlogEvents)
 f.DataIterator.AddBatchListener(f.BatchWriter.WriteRowBatch)

+if f.inlineVerifier != nil {
Another idea to avoid introducing the f.inlineVerifier field could be to just use a type assertion for now, until the Verifier interface is extended/removed:

if f.Config.VerifierType == VerifierTypeInline {
	// do things with f.Verifier.(*InlineVerifier).binlogEventListener
}

Not a ton better than what you have, but at least it doesn't call the same object from two different references.
I'm going to leave this for now. Once we ship this series of PRs, I'll review the Verifier interface. My plan is to always have the InlineVerifier with the Ferry and possibly convert the Verifier into a "CutoverVerifier" kind of deal.
ghostferry = new_ghostferry(MINIMAL_GHOSTFERRY, config: { verifier_type: "Inline" })

ghostferry.on_status(Ghostferry::Status::ROW_COPY_COMPLETED) do
This is after data iteration has finished right?
Yes. This is immediately after the data iteration finishes, and the cutover won't occur until after the code in this block finishes.
inline_verifier.go
Outdated
	"remainingRowCount": v.reverifyStore.currentRowCount,
}).Debug("reverified")
case <-v.doneCh:
	break loop
return instead so we don't have that loop label? The log line can move up here, correct?
v.periodicVerifyTicker.Stop()
close(v.doneCh)
v.logger.Info("request shutdown of periodic reverifier")
}
As mentioned, we can ✂️ this method if we use a context for cancellation
inline_verifier.go
Outdated
	db = v.TargetDB
	stmtCache = v.targetStmtCache
} else {
	return nil, nil, errors.New("programming error: whichDB must be either source or target")
Oh my. How about this instead:

func (v *InlineVerifier) getFingerprintDataFromSourceDb(...) {
	v.getFingerprintDataFromDb(v.SourceDB, v.sourceStmtCache, ...)
}

func (v *InlineVerifier) getFingerprintDataFromTargetDb(...) {
	v.getFingerprintDataFromDb(v.TargetDB, v.targetStmtCache, ...)
}

func (v *InlineVerifier) getFingerprintDataFromDb(db *sql.DB, stmtCache *StmtCache, ...) {
	...
}
}

if tx != nil {
	fingerprintStmt = tx.Stmt(fingerprintStmt)
Wasn't fingerprintStmt prepared on the connection as opposed to the tx? Is this safe? I assume it gets re-prepared on the transaction?
It returns a transaction-specific statement, but I don't think it actually gets re-prepared (i.e. there shouldn't be another set of round trips every transaction). Skimming through the golang code quickly seems to suggest it is an internal flag of some sort?
Force-pushed 0041a1e to f484312
Periodically attempts to verify BinlogEvents copied. The ReverifyStore is finally flushed during the cutover phase.
Force-pushed f484312 to cc57797
Sorry for the long delay. This PR should be ready to be reviewed again. All the changes are in the last commit, which I'll squash when I merge this PR.
LGTM but feels like we are adding more and more items to the "do later" list.
logger *logrus.Entry

type BinlogVerifySerializedStore struct {
	Store    map[string]map[string]map[uint64]struct{}
	RowCount uint64
Carrying around this state seems more complex to me than just getting the row count on deserialization. "Several nested loops" is not really that complex... you could even define map[string]map[string]map[uint64]struct{} as a named type (like it was before) and define a GetRowCount() method on it 🙂

You don't even have to write the code right now, so... 🤷‍♂
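Here is a minimal sketch of that alternative: a named type for the nested map with a row-count method computed on demand. The type and method names follow the suggestion above; the db -> table -> pk nesting matches the serialized store in the diff.

```go
package main

import "fmt"

// BinlogVerifyStore is a named type for the nested map
// (database -> table -> primary key set) from the serialized store above.
type BinlogVerifyStore map[string]map[string]map[uint64]struct{}

// GetRowCount walks the nested maps and counts stored primary keys,
// so no RowCount state needs to be carried around and serialized.
func (s BinlogVerifyStore) GetRowCount() uint64 {
	var count uint64
	for _, tables := range s {
		for _, pks := range tables {
			count += uint64(len(pks))
		}
	}
	return count
}

func main() {
	store := BinlogVerifyStore{
		"db1": {
			"users":  {1: {}, 2: {}},
			"orders": {7: {}},
		},
	}
	fmt.Println(store.GetRowCount())
}
```

Computing the count on deserialization this way trades a cheap traversal for not having to keep a counter consistent with every mutation.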
// include the rows added in the interrupted run if the present run is a
// resuming one. This is only used for emitting metrics.
totalRowCount   uint64
currentRowCount uint64 // The number of rows in store currently.
That makes sense but I would like to make the purpose of it clearer. This is not much more than a monotonically increasing integer (doesn't have to be serialized or exposed to the user at all), right?
Alternatively, could we step back here and consider how else we could throttle the logs? Maybe using a timestamp?
I'll fix some of the "do later" items in #124 later today if you hold off reviewing that.
Force-pushed cc57797 to df83ae3
The InlineVerifier doesn't verify the binlogs copied. This PR adds a reverifier component to the InlineVerifier. It periodically attempts to verify the BinlogEvents copied prior to cutover (within PeriodicallyVerifyBinlogEvents). If a row doesn't match, it is added back to the queue to be verified again in the future. This process is stopped after the DataIterator finishes copying all rows. After it is stopped, VerifyBeforeCutover starts. Within this phase, the verifier attempts to iterate over the verification queue until it is small enough (< 1000 binlog events by default), up to a maximum of 30 iterations (by default). After this is done, VerifyDuringCutover should be called by the application. During this phase, the verifier returns any mismatches it finds.

The code shares a fair amount of similarity with the IterativeVerifier, except it runs while the DataIterator is copying, whereas the original IterativeVerifier code runs after the DataIterator is done. Note that this verifier is not resumable yet. I'll make it resumable in a follow-up PR (it should be relatively easy, as I just need to dump the ReverifyStore with the state dump).
Downstream consumer changes required: iterative_verifier_store_rows is renamed to inline_verifier_current_rows.