storage: validate append batches are well formed #12246
Conversation
Oh, good catch! @frankmcsherry is the original author of this code and may have thoughts on where the responsibility for this deduplication should live.
Adding the test is good; I think the fix is probably not the right fix (or at least, I am not ready to accept it :D).
Force-pushed from a5d7ee8 to 441b7fe
```rust
// TODO(petrosagg): replace with `drain_filter` once it stabilizes
let mut cursor = 0;
while let Some(update) = updates.get(cursor) {
    if update.timestamp < advance_to {
```
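The cursor loop in the excerpt stands in for the unstable `drain_filter`. Here is a minimal, stable-Rust sketch of the same idea, with a hypothetical `Update` type (not the coordinator's real one): it moves every update with `timestamp < advance_to` out of the pending vector, preserving order.

```rust
// Hypothetical stand-in for the coordinator's timestamped update type.
#[derive(Debug, Clone, PartialEq)]
struct Update {
    timestamp: u64,
    payload: &'static str,
}

// Stable-Rust equivalent of `drain_filter`: move every update with
// `timestamp < advance_to` out of `updates`, preserving order.
fn split_ready(updates: &mut Vec<Update>, advance_to: u64) -> Vec<Update> {
    let mut ready = Vec::new();
    let mut cursor = 0;
    while let Some(update) = updates.get(cursor) {
        if update.timestamp < advance_to {
            ready.push(updates.remove(cursor));
        } else {
            cursor += 1;
        }
    }
    ready
}

fn main() {
    let mut pending = vec![
        Update { timestamp: 1, payload: "a" },
        Update { timestamp: 5, payload: "b" },
        Update { timestamp: 3, payload: "c" },
    ];
    let ready = split_ready(&mut pending, 4);
    // Updates at times 1 and 3 are ready; the one at time 5 stays pending.
    assert_eq!(ready.len(), 2);
    assert!(pending.iter().all(|u| u.timestamp >= 4));
    println!("ready: {:?}, pending: {:?}", ready, pending);
}
```

`Vec::remove` keeps the sketch order-preserving at the cost of O(n²) in the worst case; `swap_remove` would be O(n) but reorders the remaining updates.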
My understanding is that these updates are coming from the `send_builtin_table_updates_at_offset` function, where we use some future timestamp to correctly retract data without the possibility of forgetting it. This PR removes the second half of that ("correctly retract"), because now the data are split and backed by a volatile in-memory data structure. We (coord) don't have plans to make that WAL-backed for now (it's hard and we don't need it for any upcoming milestone).
These updates come from anything that writes to tables, including what you said but also user `INSERT` statements, which must conform to the semantics of the `Append` operation. Is there another way to cut the batches that makes coord happy?
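For reference, the invariant at stake can be sketched with plain integer timestamps (not the real types): every update in an append batch must sit at a time strictly less than the batch's new upper.

```rust
// Sketch of the append-batch invariant discussed in this thread:
// every update timestamp must be strictly below the batch's upper.
fn batch_is_well_formed(update_times: &[u64], upper: u64) -> bool {
    update_times.iter().all(|&ts| ts < upper)
}

fn main() {
    // Updates strictly below the upper: well formed.
    assert!(batch_is_well_formed(&[1, 2, 3], 4));
    // An update exactly at the upper makes the batch malformed — this is
    // the case the validation added by this PR rejects.
    assert!(!batch_is_well_formed(&[1, 2, 4], 4));
    println!("invariant checks passed");
}
```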
I think that between the various needs here, `send_builtin_table_updates_at_offset` is on the losing end and should be changed to not do its magical timestamp-retraction thing. If we commit this PR as is, I think any coord restart will always create some permanently wrong metrics.
Wait, how do INSERTs generate timestamps that would trigger this?
Does it make sense to tag in @jkosh44 here? There are various other "tables have the wrong data on restart" issues that he's looked at, where one conclusion was that it might be most ergonomic to support e.g. `truncate(table_id)`.

But we should definitely see if we agree on the API and whether the calls are each meant to be durable (even if they are not currently so).

Edit: ignore me; I thought this was storage controller code, rather than adapter code.
My understanding is that the `TimestampOracle` does correct things such that `advance_local_inputs` would always append all INSERT data, even with this PR. Is that not the case?
No, that is not the case, and it's related to my own misconception of what the `TimestampOracle` does. We currently call `advance_local_inputs` whenever the oracle returns `Some(ts)` from its `should_advance_to` call. However, this will return the same timestamp that is currently used for writing, if we are in writing mode.

This means that we are only able to append to tables the INSERT data that has been written at times strictly earlier than the current time of the oracle. If the oracle is in reading mode then you're correct that all INSERT data is appended to the tables. If the oracle is in writing mode, however, we must exclude the timestamp that is currently being written, and therefore there might be some pending INSERT data that has to wait until the next time we decide to read.
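That read-mode/write-mode distinction can be modeled with a toy sketch (the names are illustrative, not the actual `TimestampOracle` API): in read mode, every timestamp written so far is appendable, while in write mode the current write timestamp must stay out of the batch.

```rust
// Toy model of the distinction described above; names are illustrative,
// not the real TimestampOracle API.
enum OracleMode {
    Reading,
    Writing,
}

// The upper we can append up to. In write mode the current timestamp may
// still receive new writes, so it must be excluded from the batch.
fn appendable_upper(mode: &OracleMode, current_ts: u64) -> u64 {
    match mode {
        OracleMode::Reading => current_ts + 1,
        OracleMode::Writing => current_ts,
    }
}

fn main() {
    // In read mode, an update at the current timestamp (10) is appendable
    // (10 < 11); in write mode it must wait (10 is not < 10).
    assert!(10 < appendable_upper(&OracleMode::Reading, 10));
    assert!(!(10 < appendable_upper(&OracleMode::Writing, 10)));
}
```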
I looked into `send_builtin_table_updates_at_offset` and we were only calling it with an offset of zero, so I removed it in this PR and did the work directly in `send_builtin_table_updates`.
This makes sense now. I didn't have a full understanding of how your Append change some weeks ago interacted with the TimestampOracle. This analysis sounds correct. I'm ok with this PR now.
Force-pushed from 223ba4d to 49ecbb6
@jkosh44 This PR illustrates a second consistency problem with tables: ack'd INSERTs aren't made durable until a SELECT happens (or until the 1-second loop triggers). I believe we need to change table INSERTs to:
1. Disallow multiple tables in the same write transaction.
2. Have an INSERT (in `end_transaction`) do the call to `append` before it is ack'd to the user. This will involve forcibly advancing the oracle timestamp and thus disallow batching of writes into the same timestamp.

Both of those are needed because we don't intend to implement a WAL right now. (This decision could easily change, but no one has argued for it yet.) If 1 or 2 ends up being too slow (or too frequent, which ends up advancing the table write time ahead of the system clock), then we should look into a WAL.
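A rough sketch of proposal 2, with all names hypothetical: `end_transaction` would forcibly advance the oracle's write timestamp and complete the (durable) append before acknowledging the client, so an ack always implies the data is below the table's upper.

```rust
// Hypothetical oracle holding only a write timestamp.
struct Oracle {
    write_ts: u64,
}

impl Oracle {
    // Forcibly advance the write timestamp; the just-written data now sits
    // strictly below the new upper, at the cost of per-write batching.
    fn advance(&mut self) -> u64 {
        self.write_ts += 1;
        self.write_ts
    }
}

// Stand-in for a durable append; records each batch with its upper and
// enforces the well-formedness invariant from this PR.
fn append(log: &mut Vec<(Vec<u64>, u64)>, times: Vec<u64>, upper: u64) {
    assert!(times.iter().all(|&t| t < upper), "malformed batch");
    log.push((times, upper));
}

// Sketch of the proposed end_transaction: append durably, then ack.
fn end_transaction(oracle: &mut Oracle, log: &mut Vec<(Vec<u64>, u64)>) -> &'static str {
    let write_ts = oracle.write_ts;
    let upper = oracle.advance();
    append(log, vec![write_ts], upper); // the durable append happens first...
    "ack" // ...and only then does the user see the acknowledgement
}

fn main() {
    let mut oracle = Oracle { write_ts: 3 };
    let mut log = Vec::new();
    assert_eq!(end_transaction(&mut oracle, &mut log), "ack");
    assert_eq!(oracle.write_ts, 4);
    assert_eq!(log, vec![(vec![3], 4)]);
}
```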
```diff
@@ -4401,33 +4401,15 @@ impl Coordinator {
         Ok(result)
     }

     async fn send_builtin_table_updates_at_offset(&mut self, updates: Vec<TimestampedUpdate>) {
```
The thing that used this has apparently gone away, great! Looks like you can also remove the `TimestampedUpdate` struct now too.
Great! I removed the struct and hit auto-merge
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Force-pushed from 49ecbb6 to e6604d8
Thanks, I'm going to try and write up an issue that details this problem today.
Motivation
While working on piping everything through persist I got panics, because sometimes we call `storage_controller.append()` with updates whose timestamp is equal to the upper, but the updates in a batch must be at times strictly less than the batch's upper. I added an assertion to see whether we generate these malformed batches on `main`, and indeed we do (build results of just the first commit here).

Testing
Release notes
This PR includes the following user-facing behavior changes: