
Debezium: updating the primary key causes the record to be deleted on the mz side #6553

Closed
philip-stoev opened this issue Apr 23, 2021 · 11 comments

@philip-stoev
Contributor

philip-stoev commented Apr 23, 2021

What version of Materialize are you using?

v0.7.3-dev (7658bbf64)

What was the issue?

If you issue the following statements:

INSERT INTO t1 VALUES (1);
UPDATE t1 SET f1 = 2 WHERE f1 = 1;

then the following Debezium kafka messages will be produced:

{"before":null,"after":{"postgres.public.update_pk_values.Value":{"f1":1}},"source":{"version":"1.5.0.Final","connector":"postgresql","name":"postgres","ts_ms":1619171446308,"snapshot":{"string":"last"},"db":"postgres","sequence":{"string":"[null,\"23547008\"]"},"schema":"public","table":"update_pk_values","txId":{"long":579},"lsn":{"long":23547008},"xmin":null},"op":"r","ts_ms":{"long":1619171446308},"transaction":null}
{"before":{"postgres.public.update_pk_values.Value":{"f1":1}},"after":null,"source":{"version":"1.5.0.Final","connector":"postgresql","name":"postgres","ts_ms":1619171449076,"snapshot":{"string":"false"},"db":"postgres","sequence":{"string":"[null,\"23547008\"]"},"schema":"public","table":"update_pk_values","txId":{"long":580},"lsn":{"long":23547048},"xmin":null},"op":"d","ts_ms":{"long":1619171449439},"transaction":null}
null
{"before":null,"after":{"postgres.public.update_pk_values.Value":{"f1":2}},"source":{"version":"1.5.0.Final","connector":"postgresql","name":"postgres","ts_ms":1619171449076,"snapshot":{"string":"false"},"db":"postgres","sequence":{"string":"[null,\"23547008\"]"},"schema":"public","table":"update_pk_values","txId":{"long":580},"lsn":{"long":23547048},"xmin":null},"op":"c","ts_ms":{"long":1619171449440},"transaction":null}

Note that there are four messages, and that the second-to-last one is a null message, in case this is important. The null message is not present if the table has no primary key or if the primary key is not being changed.

The outcome is that the row is now gone on the Mz side. It seems that the final insertion message was never processed.
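To illustrate the expected behavior, here is a minimal sketch (not Materialize's actual code) of how these messages would net out under ENVELOPE DEBEZIUM if every one of them were applied: the before field contributes a retraction, the after field an insertion, and the bare null message is a Kafka tombstone that carries no data. The Row type is a hypothetical stand-in for the single column f1.

type Row = i32;

fn to_diffs(before: Option<Row>, after: Option<Row>) -> Vec<(Row, i64)> {
    let mut diffs = Vec::new();
    if let Some(b) = before {
        diffs.push((b, -1)); // retract the old row (ops 'u' and 'd')
    }
    if let Some(a) = after {
        diffs.push((a, 1)); // insert the new row (ops 'r', 'c', and 'u')
    }
    diffs
}

Applied to the three non-null messages above, this yields (1, +1), (1, -1), (2, +1), which should leave exactly one row with f1 = 2 rather than zero rows.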

Is the issue reproducible? If so, please provide reproduction instructions.

This test is part of a series of tests intended to run in an mzcompose workflow that sets up Debezium.

$ postgres-execute connection=postgres://postgres:postgres@postgres
DROP TABLE IF EXISTS ten;
CREATE TABLE ten (f1 INTEGER);
INSERT INTO ten VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
CREATE SEQUENCE update_pk_values_sequence;
CREATE TABLE update_pk_values (f1 INTEGER, PRIMARY KEY (f1));
ALTER TABLE update_pk_values REPLICA IDENTITY FULL;
BEGIN;
INSERT INTO update_pk_values VALUES (1);
COMMIT;

> CREATE MATERIALIZED SOURCE update_pk_values
  FROM KAFKA BROKER '${testdrive.kafka-addr}' TOPIC 'postgres.public.update_pk_values'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY '${testdrive.schema-registry-url}'
  ENVELOPE DEBEZIUM;

> SELECT COUNT(*), COUNT(DISTINCT f1), MIN(f1), MAX(f1) FROM update_pk_values;
1 1 1 1

$ postgres-execute connection=postgres://postgres:postgres@postgres
UPDATE update_pk_values SET f1 = f1 + 1;

> SELECT COUNT(*), COUNT(DISTINCT f1), MIN(f1), MAX(f1) FROM update_pk_values;
1 1 2 2

Please attach any applicable log files.

@philip-stoev philip-stoev added C-bug Category: something is broken A-source T-correctness Theme: relates to consistency and correctness of results. labels Apr 23, 2021
@awang awang added this to To Do in Top priority via automation May 11, 2021
@umanwizard umanwizard self-assigned this May 19, 2021
@umanwizard
Contributor

umanwizard commented May 19, 2021

@philip-stoev In your initial example,

INSERT INTO t1 VALUES (1);
UPDATE t1 SET f1 = 2 WHERE f1 = 1;

were the insert and update in the same transaction or in separate transactions?

@philip-stoev
Contributor Author

@umanwizard They are separate statements in separate transactions, in separate connections even.

@umanwizard
Contributor

At least one of the problems here is that Debezium is outputting two records to Kafka for the same LSN, so we think they are duplicates and skip one.

Apparently, this happens when the primary key changes; rather than an update, it emits a delete followed by an insert.

@JLDLaughlin - are you interested in taking this on? It seems like it might need an upstream fix to output a third coordinate in the sequence array corresponding to "the Nth record emitted for this LSN".

A second possible issue (or maybe something I just don't understand) is that the first coordinate of all the sequence numbers is null here? What does that mean? I couldn't reproduce it locally.

(To be clear, this can't be why we're getting wrong answers in Materialize, since we're not using the sequence field there yet, but we need to start doing so eventually if we want to interpret Postgres/Debezium sources correctly)
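A minimal sketch of the skipping behavior described above, assuming deduplication is keyed on the highest LSN seen so far; the type and field names here are hypothetical illustrations, not Materialize's actual implementation:

struct DebeziumRecord {
    lsn: u64,
}

struct Dedup {
    max_seen_lsn: u64,
}

impl Dedup {
    /// Returns true if the record should be applied, false if it is
    /// treated as an already-seen duplicate and skipped.
    fn should_use(&mut self, rec: &DebeziumRecord) -> bool {
        if rec.lsn > self.max_seen_lsn {
            self.max_seen_lsn = rec.lsn;
            true
        } else {
            // A primary-key UPDATE emits a delete and an insert with the
            // same LSN. The insert arrives second, fails this check, and
            // is dropped, which is why the row disappears in Materialize.
            false
        }
    }
}

Under this scheme, keying on a pair such as (lsn, nth_record_for_lsn), i.e. the third coordinate proposed above, would distinguish the delete from the insert that follows it.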

@umanwizard umanwizard removed their assignment May 24, 2021
@umanwizard
Contributor

This is clearly top priority, but will take a while to do, and I'm not yet sure who will do it. So I've unassigned myself for now.

@umanwizard umanwizard added this to To Do in Storage (Old) via automation Jun 2, 2021
@benesch benesch added P0 critical issue, fix immediately, associated with bugs that demand immediate attention and removed Top Priority labels Jun 7, 2021
@uce
Contributor

uce commented Jun 9, 2021

@umanwizard @JLDLaughlin I'd be happy to look into this if you haven't started yet.

@cuongdo
Contributor

cuongdo commented Jun 9, 2021

This has no assignee, so I'm assigning it to @uce.

@JLDLaughlin
Contributor

JLDLaughlin commented Jun 29, 2021

I was just trying to reproduce this, but the following test succeeds for me:

$ postgres-execute connection=postgres://postgres:postgres@postgres
CREATE TABLE update_pk_values (f1 INTEGER, PRIMARY KEY (f1));
ALTER TABLE update_pk_values REPLICA IDENTITY FULL;
BEGIN;
INSERT INTO update_pk_values VALUES (1);
COMMIT;

$ schema-registry-wait-schema schema=postgres.public.update_pk_values-value

> CREATE MATERIALIZED SOURCE update_pk_values
  FROM KAFKA BROKER '${testdrive.kafka-addr}' TOPIC 'postgres.public.update_pk_values'
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY '${testdrive.schema-registry-url}'
  ENVELOPE DEBEZIUM;

> SELECT * FROM update_pk_values;
1

$ postgres-execute connection=postgres://postgres:postgres@postgres
UPDATE update_pk_values SET f1 = 2 WHERE f1 = 1;

# Correctly SELECTs 2 here
> SELECT * FROM update_pk_values;
2

@philip-stoev is there something I'm missing in this test case?

@philip-stoev
Contributor Author

@JLDLaughlin Apologies for sending you on a wild goose chase -- apparently the issue is no longer reproducible, even after trying various combinations in the test.

I am taking the ticket for myself to clean up the test and push it. Thank you.

@JLDLaughlin
Contributor

@philip-stoev no worries at all, please feel free to tag me if you find it again!

philip-stoev added a commit to philip-stoev/materialize that referenced this issue Jul 7, 2021
Since MaterializeInc#6553 is no longer reproducible, enable the respective .td
test. Make sure the table contains records inserted both during
the initial snapshot and after it.
@philip-stoev
Contributor Author

A regression test has been pushed.

Storage (Old) automation moved this from To Do to Landed Jul 7, 2021
@JLDLaughlin
Contributor

Thanks, @philip-stoev !

aljoscha pushed a commit to aljoscha/materialize that referenced this issue Jul 21, 2021
Since MaterializeInc#6553 is no longer reproducible, enable the respective .td
test. Make sure the table contains records inserted both during
the initial snapshot and after it.
@nmeagan11 nmeagan11 removed this from Landed in Storage (Old) Aug 12, 2022