Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres / Debezium messages presented "out of order" #5262

Closed
cirego opened this issue Jan 11, 2021 · 15 comments
Closed

Postgres / Debezium messages presented "out of order" #5262

cirego opened this issue Jan 11, 2021 · 15 comments
Assignees
Labels
P0 critical issue, fix immediately, associated with bugs that demand immediate attention T-correctness Theme: relates to consistency and correctness of results.

Comments

@cirego
Copy link
Contributor

cirego commented Jan 11, 2021

Our logic for deduplicating messages from Debezium assumes that Postgres records are presented in
order of increasing LSN. However, Postgres' logical replication mechanism appears to be present
records in commit order, not WAL order.

For now, this implies that we need to run with deduplication = 'full' for sources where the
messages are postgres records from Debezium.

@benesch
Copy link
Member

benesch commented Jan 11, 2021

@umanwizard and I spent some time today digging into the Debezium/PostgreSQL source code on this front. Recording those learnings here for posterity.

Debezium uses the PgJDBC logical streaming replication protocol to read the change feed. You can learn more about the protocol here: https://www.postgresql.org/docs/10/protocol-logical-replication.html. The gist is that WAL records are emitted via a CopyData stream—much like the way our TAIL protocol works! The format of these WAL records is determined by the "logical streaming output plugin" that folks have chosen to use. That plugin interface is described in depth here: https://www.postgresql.org/docs/10/logicaldecoding-output-plugin.html

There are three plugins that are currently available for use with Debezium: wal2json, decoderbuf, and pgoutput. I think all of our stuff uses pgoutput, since it ships by default with PostgreSQL.

The important part of this interface is described thusly:

Concurrent transactions are decoded in commit order, and only changes belonging to a specific transaction are decoded between the begin and commit callbacks. Transactions that were rolled back explicitly or implicitly never get decoded. Successful savepoints are folded into the transaction containing them in the order they were executed within that transaction.

That means that transaction commit messages will be presented in LSN order, but on a per-message level, we'll often see out-of-order LSNs. Per Brennan, a WAL like this

LSN | Operation
-----------------
0      |  Begin TX 1
8      |  Begin TX 2
16    |  Insert "foo" (part of TX 2)
24    |  Insert "bar" (part of TX 1)
32    | Commit TX 1
40    | Commit TX 2

will produce Debezium records like so:

{ lsn: 24, op: "insert foo" }
{ lsn: 16, op: "insert bar" }

The information we need to put things back into the right order is available in the transaction topic, but it's a bit of a pain to wire that up presently.

If we wanted to instead tackle this from the Debezium side, we could include the txn_final_lsn in the metadata for each message. This information is presented in the BEGIN message (see https://www.postgresql.org/docs/10/protocol-logicalrep-message-formats.html) and would be straightforward to include alongside each message that Debezium outputs.

@cirego
Copy link
Contributor Author

cirego commented Jan 11, 2021

Is txId the field that we're looking for?

[chris-work:~/materialize/benchmark_data/chbench/1] ~/materialize/materialize/misc/kafka-util/scripts/print_archive.py -p debezium.tpcch.neworder -c 1 -o 102002
{'key': {'no_d_id': 3, 'no_o_id': 2146, 'no_w_id': 10},
 'timestamp': 1607060564850,
 'value': {'after': None,
           'before': {'no_d_id': 3, 'no_o_id': 2146, 'no_w_id': 10},
           'op': 'd',
           'source': {'connector': 'postgresql',
                      'db': 'postgres',
                      'lsn': 2494289944,
                      'name': 'debezium',
                      'schema': 'tpcch',
                      'snapshot': 'false',
                      'table': 'neworder',
                      'ts_ms': 1607060339396,
                      'txId': 142199,
                      'version': '1.2.5.Final',
                      'xmin': None},
           'transaction': {'data_collection_order': 1,
                           'id': '142199',
                           'total_order': 1},
           'ts_ms': 1607060564559}}

@benesch
Copy link
Member

benesch commented Jan 11, 2021

In a perfect world, yes. Unfortunately xids in PostgreSQL are 32 bits and wrap around occasionally: https://info.crunchydata.com/blog/managing-transaction-id-wraparound-in-postgresql

It's not theoretical, either, sadly. Nearly every production PostgreSQL database has some horror story whose punchline is "xid wraparound". See for example: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres

@cirego
Copy link
Contributor Author

cirego commented Jan 11, 2021

Could we detect wraparound? If this is for the purpose of deduplication, would it be sufficient to assume that a transaction can not be long enough for the value to wrap around more than once?

@umanwizard
Copy link
Contributor

I don't understand why txId would work, even if it weren't for the wraparound issue. We don't expect to see those in order, do we? Since Debezium sees transactions in commit order, not in the order they were created.

Or is txId not allocated until a transaction is committed? If that is the case, we could try using it and detecting wraparound, as Chris suggests.

@benesch
Copy link
Member

benesch commented Jan 11, 2021

Ah, yeah, you're right: xids are assigned at the moment you write any data. So they're out for two reasons.

@benesch
Copy link
Member

benesch commented Jan 11, 2021

I filed this upstream: https://issues.redhat.com/browse/DBZ-2911

@umanwizard
Copy link
Contributor

It's possible to fix this for people on newer Debezium versions now that @JLDLaughlin 's changes to the source coordinates have landed.

@benesch benesch added Top Priority P0 critical issue, fix immediately, associated with bugs that demand immediate attention and removed Top Priority labels Jun 2, 2021
@cuongdo
Copy link
Contributor

cuongdo commented Jun 10, 2021

@umanwizard what is the latest on this issue?

@umanwizard
Copy link
Contributor

It depends on #6553 , which will involve fixing an edge-case in how Debezium emits sequence numbers. After we do that (requires upstream work in Debezium) and the release goes out, then we can start reading those sequence numbers in Materialize which will finally fix this issue.

Alternatively, we could fix this now, and it should make postgres/debezium sources correct in more cases than they are now, but it won't fully solve the problem until the issue I linked above is fixed.

Anyway, the fix on the Mz end is relatively easy (<1 day of work); me or @quodlibetor can probably pick it up when we get a chance.

@umanwizard umanwizard added this to To Do in Storage (Old) via automation Jun 10, 2021
@umanwizard
Copy link
Contributor

Added to the Sources and Sinks project board so we don't lose track of this

@cuongdo
Copy link
Contributor

cuongdo commented Jun 16, 2021

This is a P1 issue and lacks an owner. Does this make sense for @JLDLaughlin to own, since she did the earlier upstream PR?

@philip-stoev philip-stoev added the T-correctness Theme: relates to consistency and correctness of results. label Jun 16, 2021
@umanwizard
Copy link
Contributor

@quodlibetor , are you interested in picking this up? It should be a relatively simple change -- we need to start reading the sequence field in Debezium output when it exists, rather than the LSN field as we're reading now.

@umanwizard
Copy link
Contributor

I spoke to Brandon; he will tentatively pick this up. He already has a lot on his plate and is out next week, so we can reassign if necessary if it doesn't get done.

@umanwizard
Copy link
Contributor

Essentially a duplicate of #5668, which @JLDLaughlin is working on

Storage (Old) automation moved this from To Do to Landed Jun 30, 2021
@nmeagan11 nmeagan11 removed this from Landed in Storage (Old) Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P0 critical issue, fix immediately, associated with bugs that demand immediate attention T-correctness Theme: relates to consistency and correctness of results.
Projects
No open projects
Development

No branches or pull requests

6 participants