Postgres / Debezium messages presented "out of order" #5262

cirego · 2021-01-11T19:18:01Z

Our logic for deduplicating messages from Debezium assumes that Postgres records are presented in
order of increasing LSN. However, Postgres' logical replication mechanism appears to be present
records in commit order, not WAL order.

For now, this implies that we need to run with deduplication = 'full' for sources where the
messages are postgres records from Debezium.

The text was updated successfully, but these errors were encountered:

benesch · 2021-01-11T21:31:44Z

@umanwizard and I spent some time today digging into the Debezium/PostgreSQL source code on this front. Recording those learnings here for posterity.

Debezium uses the PgJDBC logical streaming replication protocol to read the change feed. You can learn more about the protocol here: https://www.postgresql.org/docs/10/protocol-logical-replication.html. The gist is that WAL records are emitted via a CopyData stream—much like the way our TAIL protocol works! The format of these WAL records is determined by the "logical streaming output plugin" that folks have chosen to use. That plugin interface is described in depth here: https://www.postgresql.org/docs/10/logicaldecoding-output-plugin.html

There are three plugins that are currently available for use with Debezium: wal2json, decoderbuf, and pgoutput. I think all of our stuff uses pgoutput, since it ships by default with PostgreSQL.

The important part of this interface is described thusly:

Concurrent transactions are decoded in commit order, and only changes belonging to a specific transaction are decoded between the begin and commit callbacks. Transactions that were rolled back explicitly or implicitly never get decoded. Successful savepoints are folded into the transaction containing them in the order they were executed within that transaction.

That means that transaction commit messages will be presented in LSN order, but on a per-message level, we'll often see out-of-order LSNs. Per Brennan, a WAL like this

LSN | Operation
-----------------
0      |  Begin TX 1
8      |  Begin TX 2
16    |  Insert "foo" (part of TX 2)
24    |  Insert "bar" (part of TX 1)
32    | Commit TX 1
40    | Commit TX 2

will produce Debezium records like so:

{ lsn: 24, op: "insert foo" }
{ lsn: 16, op: "insert bar" }

The information we need to put things back into the right order is available in the transaction topic, but it's a bit of a pain to wire that up presently.

If we wanted to instead tackle this from the Debezium side, we could include the txn_final_lsn in the metadata for each message. This information is presented in the BEGIN message (see https://www.postgresql.org/docs/10/protocol-logicalrep-message-formats.html) and would be straightforward to include alongside each message that Debezium outputs.

cirego · 2021-01-11T21:36:40Z

Is txId the field that we're looking for?

[chris-work:~/materialize/benchmark_data/chbench/1] ~/materialize/materialize/misc/kafka-util/scripts/print_archive.py -p debezium.tpcch.neworder -c 1 -o 102002
{'key': {'no_d_id': 3, 'no_o_id': 2146, 'no_w_id': 10},
 'timestamp': 1607060564850,
 'value': {'after': None,
           'before': {'no_d_id': 3, 'no_o_id': 2146, 'no_w_id': 10},
           'op': 'd',
           'source': {'connector': 'postgresql',
                      'db': 'postgres',
                      'lsn': 2494289944,
                      'name': 'debezium',
                      'schema': 'tpcch',
                      'snapshot': 'false',
                      'table': 'neworder',
                      'ts_ms': 1607060339396,
                      'txId': 142199,
                      'version': '1.2.5.Final',
                      'xmin': None},
           'transaction': {'data_collection_order': 1,
                           'id': '142199',
                           'total_order': 1},
           'ts_ms': 1607060564559}}

benesch · 2021-01-11T21:40:54Z

In a perfect world, yes. Unfortunately xids in PostgreSQL are 32 bits and wrap around occasionally: https://info.crunchydata.com/blog/managing-transaction-id-wraparound-in-postgresql

It's not theoretical, either, sadly. Nearly every production PostgreSQL database has some horror story whose punchline is "xid wraparound". See for example: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres

cirego · 2021-01-11T21:44:06Z

Could we detect wraparound? If this is for the purpose of deduplication, would it be sufficient to assume that a transaction can not be long enough for the value to wrap around more than once?

umanwizard · 2021-01-11T21:58:11Z

I don't understand why txId would work, even if it weren't for the wraparound issue. We don't expect to see those in order, do we? Since Debezium sees transactions in commit order, not in the order they were created.

Or is txId not allocated until a transaction is committed? If that is the case, we could try using it and detecting wraparound, as Chris suggests.

benesch · 2021-01-11T22:00:42Z

Ah, yeah, you're right: xids are assigned at the moment you write any data. So they're out for two reasons.

benesch · 2021-01-11T22:25:10Z

I filed this upstream: https://issues.redhat.com/browse/DBZ-2911

umanwizard · 2021-05-19T18:24:30Z

It's possible to fix this for people on newer Debezium versions now that @JLDLaughlin 's changes to the source coordinates have landed.

cuongdo · 2021-06-10T13:19:32Z

@umanwizard what is the latest on this issue?

umanwizard · 2021-06-10T14:36:54Z

It depends on #6553 , which will involve fixing an edge-case in how Debezium emits sequence numbers. After we do that (requires upstream work in Debezium) and the release goes out, then we can start reading those sequence numbers in Materialize which will finally fix this issue.

Alternatively, we could fix this now, and it should make postgres/debezium sources correct in more cases than they are now, but it won't fully solve the problem until the issue I linked above is fixed.

Anyway, the fix on the Mz end is relatively easy (<1 day of work); me or @quodlibetor can probably pick it up when we get a chance.

umanwizard · 2021-06-10T14:37:20Z

Added to the Sources and Sinks project board so we don't lose track of this

cuongdo · 2021-06-16T11:09:50Z

This is a P1 issue and lacks an owner. Does this make sense for @JLDLaughlin to own, since she did the earlier upstream PR?

umanwizard · 2021-06-16T13:55:53Z

@quodlibetor , are you interested in picking this up? It should be a relatively simple change -- we need to start reading the sequence field in Debezium output when it exists, rather than the LSN field as we're reading now.

umanwizard · 2021-06-16T14:15:16Z

I spoke to Brandon; he will tentatively pick this up. He already has a lot on his plate and is out next week, so we can reassign if necessary if it doesn't get done.

umanwizard · 2021-06-30T15:55:38Z

Essentially a duplicate of #5668, which @JLDLaughlin is working on

cirego assigned cirego and umanwizard Jan 11, 2021

cirego mentioned this issue Jan 29, 2021

Error seen in chbench: Non-postive accumulation in ReduceHierarchical #4440

Closed

umanwizard mentioned this issue Feb 8, 2021

Improved deduplication/ordering metadata in Debezium #5668

Closed

quodlibetor mentioned this issue Apr 14, 2021

Require ENVELOPE DEBEZIUM UPSERT for postgres sources on old versions of debezium #6464

Closed

cirego removed their assignment May 14, 2021

umanwizard removed their assignment May 19, 2021

umanwizard added this to To Do in Top priority via automation May 19, 2021

benesch added Top Priority P0 critical issue, fix immediately, associated with bugs that demand immediate attention and removed Top Priority labels Jun 2, 2021

umanwizard added this to To Do in Storage (Old) via automation Jun 10, 2021

philip-stoev added the T-correctness Theme: relates to consistency and correctness of results. label Jun 16, 2021

umanwizard assigned quodlibetor Jun 16, 2021

umanwizard closed this as completed Jun 30, 2021

Storage (Old) automation moved this from To Do to Landed Jun 30, 2021

nmeagan11 removed this from Landed in Storage (Old) Aug 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postgres / Debezium messages presented "out of order" #5262

Postgres / Debezium messages presented "out of order" #5262

cirego commented Jan 11, 2021

benesch commented Jan 11, 2021

cirego commented Jan 11, 2021

benesch commented Jan 11, 2021 •

edited

cirego commented Jan 11, 2021

umanwizard commented Jan 11, 2021

benesch commented Jan 11, 2021 •

edited

benesch commented Jan 11, 2021

umanwizard commented May 19, 2021

cuongdo commented Jun 10, 2021

umanwizard commented Jun 10, 2021

umanwizard commented Jun 10, 2021

cuongdo commented Jun 16, 2021

umanwizard commented Jun 16, 2021

umanwizard commented Jun 16, 2021

umanwizard commented Jun 30, 2021

Postgres / Debezium messages presented "out of order" #5262

Postgres / Debezium messages presented "out of order" #5262

Comments

cirego commented Jan 11, 2021

benesch commented Jan 11, 2021

cirego commented Jan 11, 2021

benesch commented Jan 11, 2021 • edited

cirego commented Jan 11, 2021

umanwizard commented Jan 11, 2021

benesch commented Jan 11, 2021 • edited

benesch commented Jan 11, 2021

umanwizard commented May 19, 2021

cuongdo commented Jun 10, 2021

umanwizard commented Jun 10, 2021

umanwizard commented Jun 10, 2021

cuongdo commented Jun 16, 2021

umanwizard commented Jun 16, 2021

umanwizard commented Jun 16, 2021

umanwizard commented Jun 30, 2021

benesch commented Jan 11, 2021 •

edited

benesch commented Jan 11, 2021 •

edited