Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved deduplication/ordering metadata in Debezium #5668

Closed
umanwizard opened this issue Feb 8, 2021 · 2 comments
Closed

Improved deduplication/ordering metadata in Debezium #5668

umanwizard opened this issue Feb 8, 2021 · 2 comments
Assignees
Labels
C-feature Category: new feature or request

Comments

@umanwizard
Copy link
Contributor

Debezium can produce duplicate or out-of-order records, which Materialize is not able to cope with. Thus, Materialize needs some metadata which we can use to determine the true, correct ordering of messages.

For MySQL topics, we use pos and row in the Debezium source stanza, which express the position of a record in the MySQL binlog (example: https://debezium.io/documentation/reference/connectors/mysql.html#mysql-update-events). This seems to have been working well for us so far. For Postgres topics, the same source metadata table has an lsn (log sequence number) field, which we have been using instead, but it turns out this doesn't work, since the actual correct order of messages does not necessarily result in increasing LSNs. Thus, Materialize is currently broken when ingesting Postgres data via Debezium.

The reason for this is because Debezium reads transactions in order, but the LSNs of those transactions might be interleaved; for example:

Records ordered by LSN

Record Transaction LSN
Begin transaction a a 1
Begin transaction b b 2
Insert "foo" a 3
Insert "hello, " b 4
Update "quux" to "bar" a 5
Insert "world!" b 6
Delete "baz" a 7
Commit transaction a a 8
Commit transaction b b 9

Records seen by Debezium (ordered by transaction)

Record Transaction LSN
Begin transaction a a 1
Insert "foo" a 3
Update "quux" to "bar" a 5
Delete "baz" a 7
Commit transaction a a 8
Begin transaction b b 2
Insert "hello, " b 4
Insert "world!" b 6
Commit transaction b b 9

Thus if we only have the LSN, we have no idea what order Debezium actually saw the records and intended to emit them in.

We want to make sure this is fixed in a future version of Debezium. Gunnar from the Debezium team has suggested adding an opaque "ordering" field to the source metadata, which consumers like Materialize can read in order to discover the correct ordering of messages. For example, the ordering field for Postgres sources could be [lsnOfLastTransactionCommit, lsn]. For the example given above, the insert/update/delete records would then have the following metadata:

Record sequencing metadata
Insert "foo" [null, 3]
Update "quux" to "bar" [null, 5]
Delete "baz" [null, 7]
Insert "hello, " [8, 4]
Insert "world!" [8, 6]

And these are properly ordered when comparing field-by-field. The Debezium team is open to accepting a change to produce this metadata, but we would need to do the work ourselves.

Upstream Debezium task, for reference: https://issues.redhat.com/browse/DBZ-2911

Cc @elindsey , @JLDLaughlin , @benesch

@umanwizard umanwizard added the C-feature Category: new feature or request label Feb 8, 2021
@umanwizard
Copy link
Contributor Author

Related: #5262

Let's use the present issue to track the upstream work in Debezium. Once we have that done, we can fix Materialize to read the new metadata info and warn if it's not present (due to an older version of dbz). Then we can close out #5262.

@JLDLaughlin
Copy link
Contributor

There is a PR out for this, currently waiting to see if Philip has test cases to add!

Storage (Old) automation moved this from In Progress to Landed Jul 20, 2021
@nmeagan11 nmeagan11 removed this from Landed in Storage (Old) Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: new feature or request
Projects
None yet
Development

No branches or pull requests

2 participants