You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Debezium can produce duplicate or out-of-order records, which Materialize is not able to cope with. Thus, Materialize needs some metadata which we can use to determine the true, correct ordering of messages.
For MySQL topics, we use pos and row in the Debezium source stanza, which express the position of a record in the MySQL binlog (example: https://debezium.io/documentation/reference/connectors/mysql.html#mysql-update-events). This seems to have been working well for us so far. For Postgres topics, the same source metadata table has an lsn (log sequence number) field, which we have been using instead, but it turns out this doesn't work, since the actual correct order of messages does not necessarily result in increasing LSNs. Thus, Materialize is currently broken when ingesting Postgres data via Debezium.
The reason for this is because Debezium reads transactions in order, but the LSNs of those transactions might be interleaved; for example:
Records ordered by LSN
Record
Transaction
LSN
Begin transaction a
a
1
Begin transaction b
b
2
Insert "foo"
a
3
Insert "hello, "
b
4
Update "quux" to "bar"
a
5
Insert "world!"
b
6
Delete "baz"
a
7
Commit transaction a
a
8
Commit transaction b
b
9
Records seen by Debezium (ordered by transaction)
Record
Transaction
LSN
Begin transaction a
a
1
Insert "foo"
a
3
Update "quux" to "bar"
a
5
Delete "baz"
a
7
Commit transaction a
a
8
Begin transaction b
b
2
Insert "hello, "
b
4
Insert "world!"
b
6
Commit transaction b
b
9
Thus if we only have the LSN, we have no idea what order Debezium actually saw the records and intended to emit them in.
We want to make sure this is fixed in a future version of Debezium. Gunnar from the Debezium team has suggested adding an opaque "ordering" field to the source metadata, which consumers like Materialize can read in order to discover the correct ordering of messages. For example, the ordering field for Postgres sources could be [lsnOfLastTransactionCommit, lsn]. For the example given above, the insert/update/delete records would then have the following metadata:
Record
sequencing metadata
Insert "foo"
[null, 3]
Update "quux" to "bar"
[null, 5]
Delete "baz"
[null, 7]
Insert "hello, "
[8, 4]
Insert "world!"
[8, 6]
And these are properly ordered when comparing field-by-field. The Debezium team is open to accepting a change to produce this metadata, but we would need to do the work ourselves.
Let's use the present issue to track the upstream work in Debezium. Once we have that done, we can fix Materialize to read the new metadata info and warn if it's not present (due to an older version of dbz). Then we can close out #5262.
Debezium can produce duplicate or out-of-order records, which Materialize is not able to cope with. Thus, Materialize needs some metadata which we can use to determine the true, correct ordering of messages.
For MySQL topics, we use
pos
androw
in the Debeziumsource
stanza, which express the position of a record in the MySQL binlog (example: https://debezium.io/documentation/reference/connectors/mysql.html#mysql-update-events). This seems to have been working well for us so far. For Postgres topics, the samesource
metadata table has anlsn
(log sequence number) field, which we have been using instead, but it turns out this doesn't work, since the actual correct order of messages does not necessarily result in increasing LSNs. Thus, Materialize is currently broken when ingesting Postgres data via Debezium.The reason for this is because Debezium reads transactions in order, but the LSNs of those transactions might be interleaved; for example:
Records ordered by LSN
Records seen by Debezium (ordered by transaction)
Thus if we only have the LSN, we have no idea what order Debezium actually saw the records and intended to emit them in.
We want to make sure this is fixed in a future version of Debezium. Gunnar from the Debezium team has suggested adding an opaque "ordering" field to the source metadata, which consumers like Materialize can read in order to discover the correct ordering of messages. For example, the ordering field for Postgres sources could be
[lsnOfLastTransactionCommit, lsn]
. For the example given above, the insert/update/delete records would then have the following metadata:[null, 3]
[null, 5]
[null, 7]
[8, 4]
[8, 6]
And these are properly ordered when comparing field-by-field. The Debezium team is open to accepting a change to produce this metadata, but we would need to do the work ourselves.
Upstream Debezium task, for reference: https://issues.redhat.com/browse/DBZ-2911
Cc @elindsey , @JLDLaughlin , @benesch
The text was updated successfully, but these errors were encountered: