
InlineVerifier Binlog Reverifier #122

Merged: 3 commits from inline-verifier-reverifier into Shopify:master, Aug 13, 2019

Conversation

@shuhaowu shuhaowu (Contributor) commented Jul 9, 2019

The InlineVerifier doesn't verify the binlog events it copies. This PR adds a reverifier component to the InlineVerifier. It periodically attempts to verify the BinlogEvents copied prior to cutover (within PeriodicallyVerifyBinlogEvents). If a row doesn't match, it is added back to the queue to be verified again in the future. This process is stopped after the DataIterator finishes copying all rows. After it is stopped, VerifyBeforeCutover starts. Within this phase, the verifier attempts to iterate over the verification queue until it is small enough (fewer than 1000 binlog events by default), for at most 30 iterations (also by default). After this is done, VerifyDuringCutover should be called by the application; during this phase, the verifier returns the mismatches if any are found.

The code shares a fair amount of similarity with the IterativeVerifier, except it runs while the DataIterator is copying, whereas the original IterativeVerifier code runs after the DataIterator is done. Note that this verifier is not resumable yet. I'll make it resumable in a follow-up PR (it should be relatively easy, as I just need to dump the ReverifyStore with the state dump).
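
As a rough sketch of the pre-cutover drain loop described above (the helper names such as verifyStoreOnce and RowCount, and the exact structure, are illustrative assumptions rather than the actual implementation):

// Rough sketch of the VerifyBeforeCutover drain described above.
// verifyStoreOnce and RowCount are illustrative names, not the real API.
func (v *InlineVerifier) verifyBeforeCutover() error {
	before := v.reverifyStore.RowCount()

	for i := 0; i < 30; i++ { // at most 30 iterations by default
		// Re-verify every queued binlog event; mismatches are re-enqueued.
		if err := v.verifyStoreOnce(); err != nil {
			return err
		}

		after := v.reverifyStore.RowCount()
		// Stop once the queue is small enough, or when it stops shrinking.
		if after <= 1000 || after >= before {
			break
		}
		before = after
	}

	return nil
}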

Downstream consumer changes required

  • Metric tag changed from iterative_verifier_store_rows to inline_verifier_current_rows

@shuhaowu shuhaowu changed the title from [WIP] Binlog Reverifier to InlineVerifier Binlog Reverifier on Jul 15, 2019
@shuhaowu shuhaowu force-pushed the inline-verifier-reverifier branch 2 times, most recently from 23ea5df to 5c20186 on July 15, 2019 14:42
@shuhaowu shuhaowu requested review from fjordan and hkdsun July 15, 2019 14:43
@fjordan fjordan (Contributor) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me. A few comments and questions on some specifics.


func (c *InlineVerifierConfig) Validate() error {
var err error
if c.MaxExpectedDowntime != "" {
Contributor:

Should this throw an error if the maxExpectedDowntime is not set? Or at least set a default value like VerifyBinlogEventsInterval below?

Contributor (Author):

I'll make this clearer: if this option is not specified, the InlineVerifier will not try to estimate the maximum expected downtime before cutover, and hence won't abort the cutover if the expected downtime is very long.

v.logger.Info("starting periodic reverifier")

for !v.periodicVerifyStopped {
time.Sleep(v.VerifyBinlogEventsInterval)
Contributor:

nit: should this use a time.Ticker instead?
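
Something along these lines, as a rough sketch of the time.Ticker version (field names taken from the snippet above, the rest assumed):

// Rough sketch of the time.Ticker variant; stop condition kept as-is.
ticker := time.NewTicker(v.VerifyBinlogEventsInterval)
defer ticker.Stop()

for !v.periodicVerifyStopped {
	<-ticker.C
	// re-verify queued binlog events here
}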

"iteration": i,
}).Infof("completed re-verification iteration %d", i)

if after <= 1000 || after >= before {
Contributor:

from where is the 1000 coming?

Contributor (Author):

This is how the IterativeVerifier functioned previously. If there are fewer than 1000 rows in the ReverifyStore, it will stop the reverification before cutover and allow cutover to proceed.

}
}

if v.MaxExpectedDowntime != 0 && timeToVerify > v.MaxExpectedDowntime {
Contributor:

We are intentionally overwriting the timeToVerify value, right? To get the last amount of time it takes to iterate the re-verify queue? Does it make sense to just take the last value?

Contributor (Author):

This code is copied verbatim from the IterativeVerifier. The estimate is not perfect but I don't want to change from what "works" for now.

Contributor:

Ok. Is it worth tracking this in an issue? Are we measuring the accuracy at all?

Contributor (Author):

I can create an issue. That said, I don't know how/if this is measured while running real moves. I'd imagine the precision of the measurement is pretty low, as the speed at which we can copy binlogs depends heavily on external factors (such as database load), and these factors tend to be highly transient in nature.

Contributor:

we don't need this to be accurate, not sure i understand the concern

}, nil
}

var messageBuf bytes.Buffer
Contributor:

maybe move the error message composition below into a separate function?

Contributor (Author):

It is only used once here as it's a special message tailored for VerifyDuringCutover. I can leave a comment saying that this block is for building the error message?

Contributor:

works for me 👍

// Note that the FingerprintQuery returns the columns: pk, fingerprint,
// compressedData1, compressedData2, ...
// If there are no compressed data, only 2 columns are returned and this
// loop will be skipped.
Contributor:

thanks for the comment 👍

@shuhaowu shuhaowu force-pushed the inline-verifier-reverifier branch 4 times, most recently from b046de4 to 616416e on July 22, 2019 15:32
@shuhaowu (Contributor, Author)

I've made some minor changes to address the comments raised above.

@shuhaowu (Contributor, Author) commented Jul 24, 2019

I'll merge this tomorrow if there are no more comments on this.

We can always address anything else that pops up in a follow up PR.

@hkdsun (Contributor) commented Jul 24, 2019

I was planning on reviewing tomorrow

@shuhaowu (Contributor, Author)

OK. I'll hold off until you give it a look. Thanks for spending the time!

config.go Outdated
}

if c.FinalReverifyMaxIterations == 0 {
c.FinalReverifyMaxIterations = 30
Contributor:

up to you, but this is a great example of too many configuration options causing more confusion for the user 😛

i personally think MaxExpectedDowntime is sufficient

Contributor (Author):

I agree. I'll hard-code 30 into the code base for now. (MaxExpectedDowntime is not likely to be accurate, which is why I made this configurable as well.) We should open a low-priority issue to improve MaxExpectedDowntime, as @fjordan suggested.

config.go Outdated
MaxExpectedDowntime string

// The interval at which the periodic binlog reverification occurs, in the
// format of time.ParseDuration.
Contributor:

"1s by default"

inlineVerifier := f.NewInlineVerifier()
f.Verifier = inlineVerifier
f.inlineVerifier = f.NewInlineVerifier()
f.Verifier = f.inlineVerifier
Contributor:

ah gross, are you cleaning this up after? (in which case, I'd like a TODO here)

Contributor (Author):

Yeah I'll make the InlineVerifier an integrated option so this should become better.

@@ -461,6 +472,10 @@ func (f *Ferry) Start() error {
f.BinlogStreamer.AddEventListener(f.BinlogWriter.BufferBinlogEvents)
f.DataIterator.AddBatchListener(f.BatchWriter.WriteRowBatch)

if f.inlineVerifier != nil {
Contributor:

another idea to avoid introducing the f.inlineVerifier field could be to just use a type assertion for now until the Verifier interface is extended/removed:

if f.Config.VerifierType == VerifierTypeInline {
    // do things with f.Verifier.(*InlineVerifier).binlogEventListener
}

Not a ton better than what you have, but at least it doesn't hold the same object under two different references.

Contributor (Author):

I'm going to leave this for now. Once we ship this series of PRs, I'll review the Verifier interface. My plan is to always have the InlineVerifier with Ferry and possibly convert the Verifier into a "CutoverVerifier" kind of deal.


ghostferry = new_ghostferry(MINIMAL_GHOSTFERRY, config: { verifier_type: "Inline" })

ghostferry.on_status(Ghostferry::Status::ROW_COPY_COMPLETED) do
Contributor:

This is after data iteration has finished right?

Contributor (Author):

Yes. This is immediately after the data iteration finishes and the cutover won't occur until after the code in this block finishes.

"remainingRowCount": v.reverifyStore.currentRowCount,
}).Debug("reverified")
case <-v.doneCh:
break loop
Contributor:

return instead so we don't have that loop label? the log line can move up here, correct?

v.periodicVerifyTicker.Stop()
close(v.doneCh)
v.logger.Info("request shutdown of periodic reverifier")
}
Contributor:

As mentioned, we can ✂️ this method if we use a context for cancellation
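
A minimal sketch of what the context-based variant could look like (names assumed, not the actual implementation); with this, the explicit shutdown method and doneCh go away:

// Sketch of context-based cancellation for the periodic reverifier.
func (v *InlineVerifier) periodicallyVerifyBinlogEvents(ctx context.Context) {
	ticker := time.NewTicker(v.VerifyBinlogEventsInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// re-verify queued binlog events and log progress here
		case <-ctx.Done():
			v.logger.Info("periodic reverifier shutting down")
			return
		}
	}
}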

db = v.TargetDB
stmtCache = v.targetStmtCache
} else {
return nil, nil, errors.New("programming error: whichDB must be either source or target")
Contributor:

Oh my.

How about this instead:

func (v *InlineVerifier) getFingerprintDataFromSourceDb(...) {
    v.getFingerprintDataFromDb(v.SourceDB, v.sourceStmtCache, ...)
}

func (v *InlineVerifier) getFingerprintDataFromTargetDb(...) {
    v.getFingerprintDataFromDb(v.TargetDB, v.targetStmtCache, ...)
}

func (v *InlineVerifier) getFingerprintDataFromDb(db *sql.DB, stmtCache *StmtCache, ...) {
    ...
}

}

if tx != nil {
fingerprintStmt = tx.Stmt(fingerprintStmt)
Contributor:

wasn't fingerprintStmt prepared on the connection as opposed to the tx? is this safe? i assume it gets re-prepared on the transaction?

Contributor (Author):

It returns a transaction-specific statement, but I don't think it actually gets re-prepared (i.e. there shouldn't be another set of round trips every transaction). Skimming through the Go standard library code quickly seems to suggest it's an internal flag of some sort?
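
For reference, a minimal sketch of the database/sql Tx.Stmt call being discussed here; the query, parameter, and function name are placeholders:

// Sketch of database/sql Tx.Stmt: a statement prepared on the DB is bound to
// the transaction's connection before being executed within the transaction.
func queryInTx(db *sql.DB, stmt *sql.Stmt) (int64, error) {
	tx, err := db.Begin()
	if err != nil {
		return 0, err
	}
	defer tx.Rollback()

	// tx.Stmt returns a transaction-specific statement; it is closed when the
	// transaction is committed or rolled back.
	var id int64
	if err := tx.Stmt(stmt).QueryRow(42).Scan(&id); err != nil { // placeholder argument
		return 0, err
	}

	return id, tx.Commit()
}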

@shuhaowu shuhaowu force-pushed the inline-verifier-reverifier branch 3 times, most recently from 0041a1e to f484312 on August 8, 2019 21:15
Periodically attempts to verify BinlogEvents copied. The ReverifyStore
is finally flushed during the cutover phase.
@shuhaowu (Contributor, Author) commented Aug 8, 2019

Sorry for the long delay. This PR should be ready to be reviewed again. All the changes are in the last commit that I'll squash when I merge this PR.

@hkdsun hkdsun (Contributor) left a comment

LGTM but feels like we are adding more and more items to the "do later" list.

logger *logrus.Entry
type BinlogVerifySerializedStore struct {
Store map[string]map[string]map[uint64]struct{}
RowCount uint64
Contributor:

Carrying around this state seems more complex to me than just getting the row count on deserialization. "Several nested loops" is not really that complex... you could even define map[string]map[string]map[uint64]struct{} as a named type (like it was before) and define a GetRowCount() method on it 🙂

You don't even have to write the code right now, so... 🤷‍♂
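
A sketch of the named-type suggestion (type and method names illustrative, following the nested schema -> table -> pk layout shown above):

// Sketch of the suggestion: name the nested map and derive the row count
// on demand instead of carrying it around as separate state.
type BinlogVerifyStore map[string]map[string]map[uint64]struct{}

// GetRowCount walks schema -> table -> pk and counts the queued entries.
func (s BinlogVerifyStore) GetRowCount() uint64 {
	var count uint64
	for _, tables := range s {
		for _, pks := range tables {
			count += uint64(len(pks))
		}
	}
	return count
}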

// include the rows added in the interrupted run if the present run is a
// resuming one. This is only used for emitting metrics.
totalRowCount uint64
currentRowCount uint64 // The number of rows in store currently.
Contributor:

That makes sense but I would like to make the purpose of it clearer. This is not much more than a monotonically increasing integer (doesn't have to be serialized or exposed to the user at all), right?

Alternatively, could we step back here and consider how else we could throttle the logs? Maybe using a timestamp?
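
A sketch of the timestamp-based throttling idea; the lastLoggedAt field and the 30-second interval are hypothetical, for illustration only:

// Sketch of timestamp-based log throttling; lastLoggedAt is a hypothetical field.
if time.Since(v.lastLoggedAt) >= 30*time.Second {
	v.logger.WithField("remainingRowCount", v.reverifyStore.currentRowCount).Debug("reverified")
	v.lastLoggedAt = time.Now()
}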

@shuhaowu (Contributor, Author)

the "do later" list.

I'll fix some of the "do later" items in #124 later today if you hold off reviewing that.

@shuhaowu shuhaowu merged commit 9da3337 into Shopify:master Aug 13, 2019
@shuhaowu shuhaowu deleted the inline-verifier-reverifier branch August 13, 2019 21:25