Add verification of snappy-compressed data #44

Merged
merged 1 commit into Shopify:master from compression-verifier on Aug 9, 2018

Conversation

Contributor

@fjordan fjordan commented Jul 9, 2018

This is an initial commit adding verification of Snappy-compressed data. This approach adds a CompressionVerifier that allows Snappy (and, later, other compression algorithms) to be used to decompress data before it is fingerprinted and verified. A component is added to the configuration that asks the user to identify the compression algorithm used, if any, for the corresponding tables and columns.

Decompression happens during the call to the IterativeVerifier's GetHashes method here. Is there a better approach?

Questions

Q: Why do we need this?
A: Certain compression algorithms may produce different compressed output for the same data depending on the library version, the hardware, or simply because they are not deterministic (Snappy being one of them). Because of this, we cannot blindly rely on the MD5 hash of the compressed data in the tables to verify equality. We must first decompress the data, and then we can fingerprint it for equality.
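As a self-contained illustration (not ghostferry code), the snippet below hashes both the compressed and the decompressed bytes. Within a single process and snappy version the two encodings happen to come out identical, but equality of the compressed bytes is not guaranteed across versions or implementations, whereas the decompressed bytes (and therefore their hashes) always are:

package main

import (
	"bytes"
	"crypto/md5"
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	payload := []byte("the same row data, written by two different snappy encoders")

	// Pretend these came from the source and target databases. Here they are
	// produced by the same encoder and so are identical; a different snappy
	// version or implementation may legitimately emit different bytes.
	sourceCompressed := snappy.Encode(nil, payload)
	targetCompressed := snappy.Encode(nil, payload)

	// Comparing MD5s of the compressed bytes is only safe if the encodings match.
	fmt.Println("compressed hashes equal:", md5.Sum(sourceCompressed) == md5.Sum(targetCompressed))

	// Decompress first, then fingerprint: this comparison holds regardless of
	// which encoder produced the compressed bytes.
	sourceData, _ := snappy.Decode(nil, sourceCompressed)
	targetData, _ := snappy.Decode(nil, targetCompressed)
	fmt.Println("decompressed data equal:", bytes.Equal(sourceData, targetData))
	fmt.Println("decompressed hashes equal:", md5.Sum(sourceData) == md5.Sum(targetData))
}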

Q: Does the CompressionVerifier warrant its own test suite? Or should the functionality be included in the existing IterativeVerifier's test suite?
A: Nope. Not right now.

To be completed

  • Validation of approach
  • Tests with fixture data
  • Metrics
  • Logging

@Shopify/pods

@fjordan fjordan changed the title from "wip - initial commit of compression verification" to "[wip] Add verification of snappy-compressed data" on Jul 9, 2018
Contributor

@fw42 fw42 left a comment

Lots of nitpicks but the general approach seems good to me.

One thing I think we should consider is that it's actually pretty rare for these "collisions" (same payload but different compressed data) to happen, but with your changes here, we would decompress every single row of a table like that (while we only really have to decompress the ones for which the fingerprinting approach gives us mismatches). I think defaulting to MD5 and only decompressing on mismatches might be a big performance win compared to the approach in this PR (always decompress).
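A rough sketch of that fast-path idea, with hashSource and compareHashes here as stand-ins I've made up rather than the actual ghostferry helpers:

// hashSource abstracts "give me a fingerprint per pk"; the server-side MD5
// query and the decompress-then-hash path can both satisfy it.
type hashSource func(pks []uint64) (map[uint64][]byte, error)

func verifyWithFastPath(pks []uint64, srcMD5, tgtMD5, srcDecompressed, tgtDecompressed hashSource) ([]uint64, error) {
	srcHashes, err := srcMD5(pks)
	if err != nil {
		return nil, err
	}
	tgtHashes, err := tgtMD5(pks)
	if err != nil {
		return nil, err
	}

	mismatches := compareHashes(srcHashes, tgtHashes)
	if len(mismatches) == 0 {
		return nil, nil // fast path: fingerprints agree, nothing to decompress
	}

	// Slow path: only the mismatched pks are re-read, decompressed and re-hashed.
	srcHashes, err = srcDecompressed(mismatches)
	if err != nil {
		return nil, err
	}
	tgtHashes, err = tgtDecompressed(mismatches)
	if err != nil {
		return nil, err
	}
	return compareHashes(srcHashes, tgtHashes), nil
}

// compareHashes returns the pks whose fingerprints differ (a simplified
// stand-in for the existing helper of the same name).
func compareHashes(source, target map[uint64][]byte) []uint64 {
	mismatches := []uint64{}
	for pk, hash := range source {
		if string(hash) != string(target[pk]) {
			mismatches = append(mismatches, pk)
		}
	}
	return mismatches
}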

@pushrax, if you have half an hour, I'd love your eyes on this too so that Forrest can get some more feedback.

CompressionSnappy = "SNAPPY"
)

// UnsupportedCompressionError is used to identify a
Contributor

identify a ... what?

Contributor Author

Fixed!

}

// CompressionVerifier provides the same functionality as the iterative verifier but also
// provides support for manually verifying the payload of compressed columns that may
Contributor

I think calling it "manual verification" was a bad choice of words on our part. I think we should consider a different terminology (or just omitting it and call it "verifying").

Contributor Author

Good feedback. Changed to just "verifying" as that makes more sense. 👍


// CompressionVerifier provides the same functionality as the iterative verifier but also
// provides support for manually verifying the payload of compressed columns that may
// have different hashes for the same data
Contributor

Super nitpick: I think this is a tiny bit confusing. The "root problem" is that there are multiple compressed versions of the same data. The fact that there are also (as a result of this) multiple hashes for the same (decompressed) data is a side-effect of that, not the actual problem (and it's confusing since hash functions themselves can't have this property, it's the combination with compression functions that causes this problem).

Contributor Author

Nice call out. Updated the comment to be a bit more clear and not confuse the two. What do you think?

type CompressionVerifier struct {
logger *logrus.Entry

// supportedAlgorithms provide O(1) lookup to check if a configured algorithm is supported
Contributor

s/provide/provides/?

Contributor Author

oops 🤕

// supportedAlgorithms provide O(1) lookup to check if a configured algorithm is supported
supportedAlgorithms map[string]struct{}

// tableCmpressions is represented as table[column][compression_algorithm]
Contributor

s/Cmp/Comp/

Contributor

Nitpick: Took me a minute to understand what you mean by table[column][compression_algorithm]. Is that a common notation? How about something like tableName -> columnName -> compressionAlgorithmName?

Contributor Author

Yeah the name tableColumnCompressions is a bit more clear and helps the reader understand the data structure. Updated 😄

}

// Decode will apply the configured decompression algorithm to the configured columns data
func (c *CompressionVerifier) Decode(table, column, algorithm string, compressed []byte) ([]byte, error) {
Contributor

Same here, for clarity's sake, can we call it Decompress instead of Decode?

Contributor Author

Changed, but yeah it was definitely the Snappy library that threw me off and caused me to start naming things using decode instead of just decompress.

if algorithm, ok := tableCompression[column.Name]; ok {
decodedColData, err := c.Decode(table, column.Name, algorithm, rowData[idx].([]byte))
if err != nil {
return nil, err
Contributor

How do you feel about ignoring errors here? Imagine we have corrupted data in a compressed column, i.e. data that can't be decompressed (for example because it was truncated). With your code here, we will never ever be able to copy that data. But (at this layer) we actually don't care about the data being decompressible, we only care about it being equal to the data in the other database. So as long as both the source and the target are the same (even if both are corrupted, as long as they are corrupted in the exact same way), we should not treat that as an error (if you agree, can you add a test for that?). I'd suggest treating decompression errors as "this column wasn't meant to be decompressed anyway", storing the raw compressed data in decodedRowData (same as the else branch below), and then giving it a second chance by comparing using the regular md5 approach.

Not sure if this actually ever happens in practice, but (if I remember correctly) our current Ruby implementation of this algorithm doesn't have that problem since it only decompresses if the fingerprints are not equal already.
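A minimal sketch of that fallback, using the CompressionVerifier pieces shown in this diff and the Decompress name the method ends up with; the helper name and the logging call are my own, not the PR's code:

// decompressOrRaw tries to decompress a column value; on failure it falls back
// to the raw compressed bytes so the regular equality check still runs. If the
// source and target are corrupted in exactly the same way, verification passes.
func decompressOrRaw(c *CompressionVerifier, table, column, algorithm string, raw []byte) []byte {
	decompressed, err := c.Decompress(table, column, algorithm, raw)
	if err != nil {
		c.logger.WithError(err).Warn("decompression failed, comparing raw column bytes instead")
		return raw
	}
	return decompressed
}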

Contributor Author

@fjordan fjordan Jul 27, 2018

Great call out. If there is a decompression error then surely it will cause problems in other places so it should definitely be logged (and possibly make some noise?).

We can continue execution and yes just confirm equality, however, if we're only going down this path when the fingerprints of the SQL don't match (as you point out in your other comment), and now the data cannot be decompressed, then we won't be able to confirm the equality of the data, right? 🤔

Contributor

@hkdsun, how do you feel about this comment? i.e. ALWAYS decompressing makes it impossible to move data that is corrupted (while that was previously possible).

Contributor Author

@fw42 just confirmed with @hkdsun that moves are aborted if the data cannot be decompressed and the compressed hashes do not match. Planning to implement this here. What are your thoughts?

Contributor

Yes, that's the current behaviour in the Ruby implementation.

case CompressionSnappy:
return snappy.Decode(decoded, compressed)
default:
return nil, UnsupportedCompressionError{
Contributor

Can this ever happen? Don't you check for this in the initializer already?

Contributor Author

It could happen if the method is called outside of the workflow of the CompressionVerifier and in an ad hoc sense. Should we not export this method?


}

// HashRow will fingerprint the non-primary columns of the row to be verify data equality
Contributor

s/to be/to/


// NewCompressionVerifier first checks the map for supported compression algorithms before
// initializing and returning the initialized instance.
func NewCompressionVerifier(tableCompressions map[string]map[string]string) (*CompressionVerifier, error) {
Contributor

This method isn't used anywhere (only in the tests). That seems a bit odd. Am I missing something?

Contributor Author

It may or may not be used depending on the user's use case. If it were to be used, this function is provided as a convenience and safeguard to ensure the configured compression algorithms are supported.

@sirupsen

@fw42 since Forrest and Hormoz get back at the same time, we can probably wait for Hormoz :blobheart: to get back? 👂

Contributor

fw42 commented Jul 11, 2018

I'd like Forrest to have some feedback ready to act on as soon as he gets back. Not super urgent but if anyone has some time to take a look, I think that would help make sure that Forrest isn't still blocked on this when he returns from vacation.

@fjordan fjordan force-pushed the compression-verifier branch 4 times, most recently from e5a917a to 2889981 on July 25, 2018 18:54
Contributor

hkdsun commented Jul 26, 2018

I think defaulting to MD5 and only decompressing on mismatches might be a big performance win compared to the approach in this PR (always decompress).

In my opinion, there won't be a huge performance win overall since this configuration is a very special case that I imagine will be used infrequently by the average user.

What you suggested, in my opinion, will make the verification logic more complex to follow, and since we can avoid it without a huge performance hit, I prefer the current approach.

Contributor

fw42 commented Jul 26, 2018

there won't be a huge performance win overall since this configuration is a very special case that I imagine will be used infrequently by the average user.

Not sure what you mean by "average user", but Shopify's use-case will run into this for every single shop, potentially tens of thousands of times, depending on how many rows the table has.

Contributor

hkdsun commented Jul 26, 2018

My justification was that the majority of tables will not go through this codepath, and therefore the overall ferry performance will not be affected by these changes. So even if this is used for every ferry run, for every shop, all the other tables (which are much larger) dwarf the time spent in this codepath.

Contributor

@hkdsun hkdsun left a comment

I didn't go into a lot of detail in my review since Flo has already raised a lot of good points and we discussed a larger refactor in Slack before our vacations based on the following feedback:

Having the data structures (or in general having the logic) flow directly down in a clean line is a principle Justin advocates for and would simplify this PR a bunch. This principle usually makes the solution much easier to follow and modify down the line.

What reminded me of this principle was the cyclic data flow from the user's configuration to CompressionVerifier, to IterativeVerifier, back to the user's config (if v.CompressionVerifier.columnCompression[table] { ... }), and finally back to the CompressionVerifier's GetHashes method.

In pseudo-code this is how it'd go:

  1. user passes the configuration to the iterative verifier:

type IterativeVerifier struct {
  CompressionConfig map[string]TableCompressionConfig
  ...
}

  2. iterative verifier uses the config to decide how to get the hashes for a table:

if i.CompressionConfig[table] != nil {
  return GetCompressedTableHash(i.db, i.CompressionConfig[table], table)
}

return GetMd5Hashes(table)

  3. and GetCompressedTableHash does the right thing, given the table's config:

GetCompressedTableHash(db, compressionConfig, table) {
  // core logic that decompresses columns when necessary
}

supportedAlgorithms map[string]struct{}

// tableColumnCompressions is represented as table -> column -> compression_algorithm
tableColumnCompressions map[string]map[string]string
Contributor

I think using a simple type annotation here would greatly improve readability. Something like:

type ColumnCompressionConfig map[string]string // map of columns to decompressors. e.g. { "body": "SNAPPY" }

you'd then have:

tableColumnCompressions map[string]ColumnCompressionConfig

which is a map of table -> ColumnCompressionConfig and is readily comprehensible

Contributor Author

Added two types, one being exported (TableColumnCompressionConfig). Do you think this is more clear?
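For reference, roughly what those two type aliases might look like (a sketch based on the names used in this thread, not necessarily the exact committed declarations):

// ColumnCompressionConfig maps a column name to the compression algorithm used
// for that column, e.g. {"body": "SNAPPY"}.
type ColumnCompressionConfig map[string]string

// TableColumnCompressionConfig maps a table name to that table's
// ColumnCompressionConfig, i.e. table -> column -> algorithm.
type TableColumnCompressionConfig map[string]ColumnCompressionConfig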

@fjordan fjordan force-pushed the compression-verifier branch 10 times, most recently from 5be68f0 to 0ee4ec5 on August 1, 2018 21:43
@fjordan fjordan changed the title from "[wip] Add verification of snappy-compressed data" to "Add verification of snappy-compressed data" on Aug 2, 2018
@fjordan fjordan force-pushed the compression-verifier branch 2 times, most recently from f510b79 to 5b96948 on August 2, 2018 15:22
Contributor Author

fjordan commented Aug 2, 2018

This PR now only uses the CompressionVerifier on a mismatch, and only after a check that compression is configured for the table. Do we have any fixture data with a mismatched hash that we can share and use in the test?

Contributor

@fw42 fw42 left a comment

Unless I missed it, I think the most important test of this PR is still missing, i.e. a test with two rows that have the same decompressed value but different compressed values. Can you add that test?

Besides my comment about the client-side MD5 being a waste of resources, I'd say this looks pretty good and basically ready to go. I imagine having the client-side MD5 stuff makes it easier to integrate your new code with the existing iterative verifier, so I'm ok with keeping it. It's unnecessary and a bit of a waste of CPU resources but probably negligible? So unless you have an idea how to cleanly remove it, I'd say I'm ok with keeping it.

// supportedAlgorithms provide O(1) lookup to check if a configured algorithm is supported
supportedAlgorithms map[string]struct{}

// tableColumnCompressions is represented as table -> column -> compression-type
Contributor

nitpick: this comment is unnecessary since you already explain it above

//
// The GetCompressedHashes method checks if the existing table contains compressed data
// and will apply the decompression algorithm to the applicable columns if necessary.
// After the columns are decompressed, the hashes of the data are used to verify equality
Contributor

If we have already loaded the data from the database and already decompressed it, then why still do the hashing? The whole point of hashing is to have to load less data into the client (the hashes are computed server-side). I feel like the extra hashing here is wasted effort. Why not just compare the decompressed data (without hashing it first)?

Contributor

The client side hashing here is to be compatible with the IterativeVerifier (it expects a single byte slice per primary key to be returned by this method)

decompressedRowData := make(map[uint64][]byte)
for idx, column := range columns {
if column.Name == pkColumn {
continue
Contributor

Is this because the primary key can never be a compressed column? (nitpick, but is there a check for that anywhere? should we add one to verifyConfiguredCompression maybe?)

Contributor Author

I misunderstood how the IterativeVerifier was generating the row_fingerprint (also see comment below: #44 (comment)). Will update 👍

// Check if column is configured as compressed and decompress if necessary
if algorithm, ok := tableCompression[column.Name]; ok {
decompressedColData, err := c.Decompress(table, column.Name, algorithm, rowData[idx].([]byte))
if err != nil {
Contributor

Does this return here mean that we fail hard if data can't be decompressed even if the fingerprints already match? Or do you catch that case elsewhere so it should never get to this point here?

Contributor

If the fingerprints already match, we never get here due to the changes in iterative_verifier.go

}

// Hash the data of the row to be added to the result set
decompressedRowHash, err := c.HashRow(decompressedRowData)
Contributor

@fw42 fw42 Aug 3, 2018

As mentioned above, I think this is unnecessary and we should remove the HashRow method and instead just compare the unhashed data. If we already have the data in the client, there's no point in hashing it client-side (that was a server-side optimization).

Contributor Author

@fjordan fjordan Aug 6, 2018

🤔 Would it really be faster? It takes the m * n iterations down to just m iterations (one value to compare versus a value for each column), plus the cost of calculating the hash. Unless you're referring to just comparing the compressed column and allowing the IterativeVerifier to hash/fingerprint the other columns, which adds some complexity. What do you think?

EDIT: Also, as @hkdsun pointed out above, this is what the IterativeVerifier expects. We could change it, but at the cost of complexity.

Contributor

I'm fine with this 👍

Contributor

wfm

}

quotedCol := normalizeAndQuoteColumn(column)
columnStrs[idx] = fmt.Sprintf("COALESCE(%s, 'NULL')", quotedCol)
Contributor

I'm not sure I understand why this line is here. Can you explain? This looks like an artefact from the server-side MD5 fingerprinting algorithm (which we aren't using here, are we?).

Contributor Author

This line just writes NULL instead of having an empty value for the column. Turns out we can remove it and just use NULLs. Why is this needed for the IterativeVerifier?

Contributor

Basically because you can't MD5(NULL) in MySQL, if I remember correctly.

Contributor Author

@fjordan fjordan Aug 7, 2018

Hmm that doesn't appear to be the case.

mysql> select md5(null)\G
*************************** 1. row ***************************
md5(null): NULL
1 row in set (0.00 sec)

mysql>

but this does throw an error:

mysql> select md5()\G
ERROR 1582 (42000): Incorrect parameter count in the call to native function 'md5'

Maybe that's what is happening.

Contributor

Hm then it was probably the CONCAT that needed this

continue
}

quotedCol := normalizeAndQuoteColumn(column)
Contributor

Hm where is this defined? I only see this method in iterative_verifier.go. Why is it in scope here??

Contributor Author

Looks like we can remove it :)

config.go Outdated
// the target database. See DatabaseRewrite.
//
// Optional: defaults to empty map/no rewrites
TableRewrites map[string]string

// Map of the table and column identifying the compression type
// (if any) of the column. This is used during verification to ensure
Contributor

extra space here before "to"

Contributor Author

👀

@@ -573,7 +574,27 @@ func (v *IterativeVerifier) compareFingerprints(pks []uint64, table *schema.Tabl
return nil, targetErr
}

return compareHashes(sourceHashes, targetHashes), nil
mismatches := compareHashes(sourceHashes, targetHashes)
if len(mismatches) > 0 && v.CompressionVerifier != nil {
Contributor

I feel like this block of code here could be its own function to make things more readable.

mismatches := compareHashes(sourceHashes, targetHashes)

if mismatchInCompressedColumns(mismatches) {
  return compareCompressedColumns(...)
} else {
  return mismatches, nil
}

wdyt?

Contributor Author

That does look simpler, however, we would need to introduce some compression-specific logic to the IterativeVerifier to prevent duplication of code or flowing back and forth between the CompressionVerifier and IterativeVerifier. That would clean up this block of execution inside the conditional, but would blur the line of responsibility between these two components. We could do something like:

if  v.compressionMismatch(mismatches) {
    sourceHashes, targetHashes, err = GetCompressedHashes(...)
    if err != nil {
        return nil, err
    }
    return compareHashes(sourceHashes, targetHashes), nil
}

func (v *IterativeVerifier) compressionMismatch(mismatches) {
    if len(mismatches) > 0 && v.CompressionVerifier != nil { 
        return true
    }
    return false
} 

and rename the current GetCompressedHashes to GetCompressedTableHashes and wrap the former to provide a clean interface.

Thoughts?

Contributor

@hkdsun hkdsun left a comment

Great work! This is definitely not an easy task to get introduced to Ghostferry with but kudos to you for jumping right in 🙂

Besides my review comments, I didn't see any integration of the CompressionVerifier with the sharding package or the IterativeVerifier. Was that intentional? What's the plan there?

glide.yaml Outdated
- package: github.com/go-sql-driver/mysql
version: ^1.3.0
- package: github.com/Shopify/go-dogstatsd
- package: github.com/golang/snappy
Contributor

This must have been a strange conflict to resolve 😛 We moved away from glide to dep in #45 so we'll have to put this dependency in Gopkg.toml instead

Contributor Author

Oops 🤕 Will add!

config.go Outdated
// (if any) of the column. This is used during verification to ensure
// the data was successfully copied as it must be manually verified.
//
// Note that the VerifierType must be set to the IterativeVerifier
Contributor

VerifierType is a copydb concept. I'd just say IterativeVerifier must be used.

Contributor Author

Updated 👍

// initializing and returning the initialized instance.
func NewCompressionVerifier(tableColumnCompressions TableColumnCompressionConfig) (*CompressionVerifier, error) {
supportedAlgorithms := make(map[string]struct{})
supportedAlgorithms[CompressionSnappy] = struct{}{}
Contributor

Any reason to not go with a global here? I know it can't be a constant but it could be a var:

var supportedAlgorithms = map[string]struct{}{
	CompressionSnappy: {},
}

Seems strange to create it for every instance.. and what if somebody doesn't use this constructor?

Contributor Author

@fjordan fjordan Aug 7, 2018

Do we create multiple IterativeVerifiers or do we reuse a single one? If just a single instance then this will only be instantiated once.

@fw42 mentioned this same thing, and I considered it, but didn't want to make a package-level global var. We obviously can if we intentionally want to, but it's generally frowned upon.

type CompressionVerifier struct {
logger *logrus.Entry

// supportedAlgorithms provide O(1) lookup to check if a configured algorithm is supported
Contributor

nit: I think the code speaks for itself here

Contributor Author

🔪 removed

table string
column string
algorithm string
}
Contributor

Why do we need a new type of error? We're not handling it any differently in any part of the code anyway

Contributor Author

We aren't handling it now, but once we integrate into copydb or sharding we may. It allows the type of error to be checked against if we do want any logical conditions based on the error. If there will never be a need we should just remove it, but I was thinking we may want it.
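To illustrate the point, a dedicated error type lets callers branch on it; the Error method and the type assertion below are illustrative assumptions, not part of this diff:

func (e UnsupportedCompressionError) Error() string {
	return fmt.Sprintf("column %s.%s is configured with unsupported compression algorithm %q",
		e.table, e.column, e.algorithm)
}

// A caller integrating with copydb or sharding could then treat a
// misconfiguration differently from, say, a transient database error:
//
//   if _, ok := err.(UnsupportedCompressionError); ok {
//       // abort early: the run is misconfigured and will never succeed
//   }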

Contributor

Sure, I'm fine with keeping it since it's not a huge amount of complexity.

Though, as a general principle, I like to avoid complexity if it's not being used immediately (and I don't see us handling this error any differently than any other verifier/initialization error when integrating into sharding package) – while keeping the future maintenance/changes in mind of course.

supportedAlgorithms map[string]struct{}

// tableColumnCompressions is represented as table -> column -> compression-type
tableColumnCompressions TableColumnCompressionConfig
Contributor

@hkdsun hkdsun Aug 3, 2018

I don't understand why tableColumnCompressions is a field on CompressionVerifier. It's never accessed from within the methods here. It's in the constructor but that's just to populate the field – not consume it

The only other time I see it being accessed is from within the IterativeVerifier. Is that a sign that it should not be here? Perhaps it should just be a config on the IterativeVerifier or Ferry directly?

Contributor Author

It's currently part of the CompressionVerifier because it's the configuration for it. Given the CompressionVerifier is also a component of the IterativeVerifier, it's essentially part of the IterativeVerifier already, but contained within the CompressionVerifier, since the CompressionVerifier holds the logic and info for all things compressed. It's also not accessed within the CompressionVerifier, but referenced at runtime and then passed to the GetCompressedHashes method, as that's how I understood your earlier comment about the direct logic flow: #44 (review).

Does that make sense? If you feel we should still move it let me know so we can discuss because I want this to make sense 👍

for idx, column := range columns {
if column.Name == pkColumn {
continue
}
Contributor

Why are we excluding the primary key column from hashing/comparison?

I was under the impression that what we're trying to do is SELECT pk_col, * and then massaging the result to a map[pk]rowData where rowData is all the row columns after decompression and which must contain the pk column as well

Contributor Author

I mistakenly thought the IterativeVerifier wasn't including the primary key in the fingerprint because the pk is selected in addition to the row_fingerprint. Looking at it again, I see that it is selected separately and also included in the fingerprint.

I'll make adjustments to include the PK in the compression verifier's row hash 👍

}
// Check if column is configured as compressed and decompress if necessary
if algorithm, ok := tableCompression[column.Name]; ok {
decompressedColData, err := c.Decompress(table, column.Name, algorithm, rowData[idx].([]byte))
Contributor

I think rowData[idx] doesn't always map one-to-one to the ordering of the columns slice. Do you agree?


Imagine the following example table:

| col_1 | col_2 | pk_col | col_3 |

In this scenario, with the rowSelector that we've defined below, you'd have:

rowData = [pk_col, col_1, col_2, col_3]

whereas

columns = [col_1, col_2, pk_col, col_3]

// Check if column is configured as compressed and decompress if necessary
if algorithm, ok := tableCompression[column.Name]; ok {
decompressedColData, err := c.Decompress(table, column.Name, algorithm, rowData[idx].([]byte))
if err != nil {
Contributor

If the fingerprints already match, we never get here due to the changes in iterative_verifier.go

c.logger.Info("decompressing table data before verification")

// Extract the raw rows using SQL to be decompressed
rows, err := c.getRows(db, schema, table, pkColumn, columns, pks)
Contributor

I think a lot of confusion would be cleared up if the getRows function did the hard work of massaging SELECT pk_col, * to map[pk]rowData where rowData's indexes map one-to-one with the columns slice.
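A sketch of the shape being suggested; the function name, signature, and the assumption that the query selects the pk first followed by the columns in the same order as the columns slice are all mine (uses database/sql and strconv):

func getAlignedRows(db *sql.DB, query string, numColumns int) (map[uint64][][]byte, error) {
	rows, err := db.Query(query)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	result := make(map[uint64][][]byte)
	for rows.Next() {
		// the first scanned value is the pk; the rest follow the columns slice order
		scanned := make([][]byte, numColumns+1)
		dest := make([]interface{}, numColumns+1)
		for i := range scanned {
			dest[i] = &scanned[i]
		}
		if err := rows.Scan(dest...); err != nil {
			return nil, err
		}

		pk, err := strconv.ParseUint(string(scanned[0]), 10, 64)
		if err != nil {
			return nil, err
		}
		result[pk] = scanned[1:] // rowData[idx] now lines up with columns[idx]
	}
	return result, rows.Err()
}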

@fjordan fjordan changed the title from "Add verification of snappy-compressed data" to "[WIP] Add verification of snappy-compressed data" on Aug 6, 2018
Contributor Author

fjordan commented Aug 6, 2018

@fw42 thanks for the comments and review! 😄 I've just pushed up a test that I believe satisfies your concerns. I was working on it last week (as we were discussing getting this fixture data for the tests over Slack) but ran into some encoding issues with the snappy compressed data that I didn't resolve until earlier this morning. I'm working to address the rest of your questions/comments/concerns.

EDIT: @fw42 the client-side MD5 only happens now on mismatch.

@hkdsun thanks 😄 ! We can certainly integrate this into the sharding package and/or copydb. It is already integrated into the IterativeVerifier if the user has configured the compression mappings. Also working to get your comments and feedback addressed.

Thanks so much, guys. Looking forward to getting this merged in and used once it's all ready to ship 🚢

@fjordan fjordan force-pushed the compression-verifier branch 3 times, most recently from 933f5c8 to a709c8e on August 8, 2018 17:01
Contributor

@hkdsun hkdsun left a comment

Almost there! I'll follow up with a review of the test files

Gopkg.toml Outdated
@@ -33,3 +33,6 @@
[prune]
go-tests = true
unused-packages = true
[[constraint]]
Contributor

nit: can we move these up to where the other [[constraint]] blocks are?

// Decompress applicable columns and hash the resulting column values for comparison
resultSet := make(map[uint64][]byte)
for rows.Next() {
rowData, err := ScanByteRow(rows, len(columns)+1)
Contributor

Really like this idea 👍

With some effort, we could simplify the rest of the codebase too (definitely not in this PR but this was something we were discussing back in the day we were writing the uint parsing code and discovered the driver's weird behaviour):

ghostferry/dml_events.go

Lines 17 to 20 in b0eb328

// The mysql driver never actually gives you a uint64 from Scan, instead you
// get an int64 for values that fit in int64 or a byte slice decimal string
// with the uint64 value in it.
func (r RowData) GetUint64(colIdx int) (res uint64, err error) {

Contributor

Why the +1? Is that because columns doesn't include the primary key column?

}

// Hash the data of the row to be added to the result set
decompressedRowHash, err := c.HashRow(decompressedRowData)
Contributor

I'm fine with this 👍

@@ -573,6 +614,29 @@ func (v *IterativeVerifier) compareFingerprints(pks []uint64, table *schema.Tabl
return nil, targetErr
}

mismatches := compareHashes(sourceHashes, targetHashes)
if len(mismatches) < 0 {
Contributor

should be <= right? otherwise this is basically dead code and we'd be decompressing all compressed tables

Contributor Author

👀 good catch

@@ -329,6 +297,79 @@ func (v *IterativeVerifier) Result() (VerificationResultAndStatus, error) {
return v.verificationResultAndStatus, v.verificationErr
}

func (v *IterativeVerifier) GetHashes(db *sql.DB, schema, table, pkColumn string, columns []schema.TableColumn, pks []uint64) (map[uint64][]byte, error) {
Contributor

assuming these previously defined methods were only moved around and not modified

Contributor Author

Yes, I can undo and just let another PR re-order them.

Contributor

It's fine, just double checking

}

hash.Write(rowFingerprint)
return []byte(hex.EncodeToString(hash.Sum(nil))), nil
Contributor

Note to other reviewers: this Sum() method is not the same as md5.Sum()

We're actually working with a hash.Hash implementation. See the example: https://golang.org/pkg/crypto/md5/#example_New

}

func (t *IterativeVerifierTestSuite) TestVerifyCompressedMismatchOncePass() {
t.InsertCompressedRowInDb(43, testhelpers.TestCompressedData3, t.Ferry.SourceDB)
Contributor

@hkdsun hkdsun Aug 8, 2018

Let's assert our test's assumption right before this with a helpful comment (after all it's the whole point of the PR):

        // Two fixtures that have different compressed values but have equal decompressed values
	t.Require().NotEqual(testhelpers.TestCompressedData3, testhelpers.TestCompressedData4)

t.Require().Nil(err)
t.Require().False(result.DataCorrect)
t.Require().Equal(fmt.Sprintf("verification failed on table: %s.%s for pks: %s", "gftest", testhelpers.TestCompressedTable1Name, "42"), result.Message)
}
Contributor

Can we possibly have a symmetrical test for the Data3 and Data4 case?

Contributor Author

@fjordan fjordan Aug 8, 2018

Yep! Will check for positive case here after a compressed mismatch

return nil, err
}

pk, err := strconv.ParseUint(string(rowData[0]), 10, 64)
Contributor

why do we have to parse the pk out of rowData? Don't we already have all the pks in pks?

for idx, column := range columns {
if algorithm, ok := tableCompression[column.Name]; ok {
// rowData contains the result of "SELECT pkColumn, * FROM ...", so idx+1 to get each column
decompressedColData, err := c.Decompress(table, column.Name, algorithm, rowData[idx+1])
Contributor

same question here. why the +1? Is that because rowData contains the primary key column but columns doesn't?

Contributor

nvm I just saw the comment 🤦‍♂️

// to create a fingerprint. decompressedRowData contains a map of all
// the non-compressed columns and associated decompressed values by the
// index of the column
decompressedRowData := [][]byte{}
Contributor

Seems like you already know exactly how long this array will be. Isn't it more idiomatic to make it here rather than to append below? Seems more memory-efficient (fewer allocations). But probably no big deal and I honestly don't know which one is preferable. Just curious.

Contributor Author

We don't actually know the number of rows until iterating through all of them. That would be more efficient though if we did.

resultSet[pk] = decompressedRowHash
}

metrics.Gauge("compression_verifier_decompress_rows", float64(len(resultSet)), []MetricTag{}, 1.0)
Contributor

Maybe add a tag for the table name?

rowFingerprint = append(rowFingerprint, colData...)
}

hash.Write(rowFingerprint)
Contributor

hash.Write can return an error. We probably want to check that here?

// 1. Snappy (https://google.github.io/snappy/) as "SNAPPY"
//
// Optional: defaults to empty map/no compression
TableColumnCompression map[string]map[string]string
Contributor

Is this actually used anywhere? I don't see it. Or is that a follow-up PR?

Contributor Author

yep @hkdsun will be following up with a PR to integrate into sharding and use this

@@ -573,6 +614,29 @@ func (v *IterativeVerifier) compareFingerprints(pks []uint64, table *schema.Tabl
return nil, targetErr
}

mismatches := compareHashes(sourceHashes, targetHashes)
if len(mismatches) <= 0 {
Contributor

Just curious, can this actually ever be negative? Why not == 0?

return mismatches, nil
}

if v.CompressionVerifier != nil && v.CompressionVerifier.IsCompressedTable(table.Name) {
Contributor

Nitpick: Why not if len(mismatches) > 0 && ... here and then you can get rid of line 618?


this.di.Initialize()
this.di.AddBatchListener(func(ev *ghostferry.RowBatch) error {
this.receivedRows = append(this.receivedRows, ev.Values()...)
this.receivedRows[ev.TableSchema().Name] = append(this.receivedRows[ev.TableSchema().Name], ev.Values()...)
return nil
})
}

func (this *DataIteratorTestSuite) TestNoEventsForEmptyTable() {
_, err := this.Ferry.SourceDB.Query(fmt.Sprintf("DELETE FROM `%s`.`%s`", testhelpers.TestSchemaName, testhelpers.TestTable1Name))
Contributor

This line needs its own this.Require().Nil(err), otherwise it's useless

t.Require().Equal("", result.Message)
}

func (t *IterativeVerifierTestSuite) TestVerifyCompressedMismatchOncePass() {
Contributor

nitpick but Mismatch seems a bit ambiguous, how about something like SameDecompressedDataButDifferentHash to be more explicit?

@@ -222,6 +297,25 @@ func (t *IterativeVerifierTestSuite) InsertRowInDb(id int, data string, db *sql.
t.Require().Nil(err)
}

func (t *IterativeVerifierTestSuite) InsertCompressedRowInDb(id int, data string, db *sql.DB) {
t.SetColumnType(testhelpers.TestSchemaName, testhelpers.TestCompressedTable1Name, testhelpers.TestCompressedColumn1Name, "MEDIUMBLOB", db)
Contributor

This seems weird. Why do we change the schema during the test (rather than to just always have it be a MEDIUMBLOB and set it during test setup or whenever we create the db)?

Contributor Author

We change to MEDIUMBLOB because of the snappy-compressed data. TEXT expects utf8 data and throws an error when we try to insert the compressed data. I didn't want to change any other tests but the ones added in this PR. If we know later it won't negatively impact the other tests then I'm sure we can change it and just use MEDIUMBLOB everywhere.

}

decompressed := [][]byte{}
for _, path := range filePaths {
Contributor

Nitpick, but this whole loop thing seems a bit unnecessary to me.. I'd create some kind of LoadFixtureFromFile helper function and then just do

func init() {
  TestCompressedData3 = LoadFixtureFromFile("urls1.snappy")
  TestCompressedData4 = LoadFixtureFromFile("urls2.snappy")
}
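A sketch of what that helper could look like in the testhelpers package; the testdata directory and the panic-on-error choice are assumptions, not the committed code:

package testhelpers

import (
	"io/ioutil"
	"path/filepath"
)

var (
	TestCompressedData3 string
	TestCompressedData4 string
)

// LoadFixtureFromFile reads a snappy-compressed fixture from disk and returns
// it as a string, which is how the test suite passes fixture data around.
func LoadFixtureFromFile(name string) string {
	data, err := ioutil.ReadFile(filepath.Join("testdata", name))
	if err != nil {
		panic(err) // the fixtures are required for these tests to run at all
	}
	return string(data)
}

func init() {
	TestCompressedData3 = LoadFixtureFromFile("urls1.snappy")
	TestCompressedData4 = LoadFixtureFromFile("urls2.snappy")
}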

Contributor Author

Will update. I like this more 👍

Contributor

@fw42 fw42 left a comment

A few small comments, but overall this looks good to me.

👍 🚢 🇮🇹

Contributor

@hkdsun hkdsun left a comment

💯

@fjordan fjordan changed the title from "[WIP] Add verification of snappy-compressed data" to "Add verification of snappy-compressed data" on Aug 9, 2018
@fjordan fjordan merged commit d87db6b into Shopify:master Aug 9, 2018
@fjordan fjordan deleted the compression-verifier branch August 9, 2018 15:44