
storage: validate append batches are well formed #12246

Merged
3 commits merged into MaterializeInc:main from validate-append-batches on May 5, 2022

Conversation

@petrosagg (Contributor) commented May 4, 2022

Motivation

While working on piping everything through persist I got panics because we sometimes call storage_controller.append() with updates whose timestamps are equal to the upper, but the updates in a batch must be at times strictly less than the batch's upper. I added an assertion to see if we generate these malformed batches on main, and indeed we do (build results of just the first commit here).
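The invariant being enforced can be sketched as follows. This is a hypothetical illustration; the `Update` struct and `validate_batch` function are stand-ins, not Materialize's actual API (real updates also carry a row and a diff):

```rust
// Hypothetical sketch of the well-formedness check this PR adds:
// every update in an append batch must be at a time strictly less
// than the batch's upper.
struct Update {
    timestamp: u64,
}

/// Panics if the batch is malformed, i.e. if any update's timestamp
/// is not strictly less than the batch's upper.
fn validate_batch(updates: &[Update], upper: u64) {
    for update in updates {
        assert!(
            update.timestamp < upper,
            "malformed batch: update at time {} is not strictly less than upper {}",
            update.timestamp,
            upper,
        );
    }
}

fn main() {
    let updates = vec![Update { timestamp: 3 }];
    validate_batch(&updates, 4); // ok: 3 < 4
    // validate_batch(&updates, 3); // would panic: 3 == upper
    println!("batch is well formed");
}
```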

Testing

  • This PR has adequate test coverage / QA involvement has been duly considered.

Release notes

This PR includes the following user-facing behavior changes:

@petrosagg petrosagg requested a review from maddyblue May 4, 2022 13:28
@petrosagg petrosagg marked this pull request as ready for review May 4, 2022 13:28
@benesch benesch requested a review from frankmcsherry May 4, 2022 13:34
@benesch (Member) commented May 4, 2022

Oh, good catch! @frankmcsherry is the original author of this code and may have thoughts on where the responsibility for this deduplication should live.

@frankmcsherry (Contributor) left a comment

Adding the test is good; I think the fix is probably not the right fix (or at least, I am not ready to accept it :D).

src/dataflow-types/src/client/controller/storage.rs (review comments, outdated, resolved)
src/coord/src/coord.rs (review comments, outdated, resolved)
@petrosagg petrosagg force-pushed the validate-append-batches branch 2 times, most recently from a5d7ee8 to 441b7fe (May 4, 2022 17:02)
```rust
// TODO(petrosagg): replace with `drain_filter` once it stabilizes
let mut cursor = 0;
while let Some(update) = updates.get(cursor) {
    if update.timestamp < advance_to {
```
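The excerpt above is the cursor-based workaround the `drain_filter` TODO refers to. A self-contained sketch of the same pattern (assumed shape — the PR's actual types and continuation differ): keep updates at times strictly before `advance_to` for the current batch and set the rest aside.

```rust
/// Splits `updates` in place: entries with time < advance_to stay and are
/// valid for the current append batch; the rest are returned for later.
/// Each update here is a hypothetical (timestamp, diff) pair.
fn split_updates(updates: &mut Vec<(u64, i64)>, advance_to: u64) -> Vec<(u64, i64)> {
    let mut deferred = Vec::new();
    let mut cursor = 0;
    while let Some(&(timestamp, _)) = updates.get(cursor) {
        if timestamp < advance_to {
            // Valid for this batch: leave it in place and move on.
            cursor += 1;
        } else {
            // Not yet appendable: move it to the deferred list.
            // swap_remove is O(1) but does not preserve order, which is
            // fine for an unordered batch of updates.
            deferred.push(updates.swap_remove(cursor));
        }
    }
    deferred
}

fn main() {
    let mut updates = vec![(1u64, 1i64), (5, 1), (2, 1)];
    let deferred = split_updates(&mut updates, 3);
    println!("kept {:?}, deferred {:?}", updates, deferred);
}
```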
@maddyblue (Contributor) commented May 4, 2022

My understanding is that these updates come from the send_builtin_table_updates_at_offset function, where we use some future timestamp to correctly retract data without the possibility of forgetting it. This PR removes the second half of that ("correctly retract"), because the data are now split and backed by a volatile in-memory data structure. We (coord) don't have plans to make that WAL-backed for now (it's hard and we don't need it for any upcoming milestone).

@petrosagg (Contributor, author) replied:

These updates come from anything that writes to tables, including what you said but also user INSERT statements, which must conform to the semantics of the Append operation. Is there another way to cut the batches that makes coord happy?

A contributor replied:

I think that between the various needs here, send_builtin_table_updates_at_offset is on the losing end and should be changed to not do its magical timestamp retraction thing. If we commit this PR as is, I think any coord restart will always create some permanently wrong metrics.

A contributor replied:

Wait, how do INSERTs generate timestamps that would trigger this?

A contributor replied:

Does it make sense to tag in @jkosh44 here? There are various other "tables have the wrong data on restart" issues that he's looked at, where one conclusion was that it might be most ergonomic to support e.g. truncate(table_id).

But we should def see if we agree on the API and whether the calls are each meant to be durable (even if they are not currently so).

edit: Ignore me; I thought this was storage controller code, rather than adapter code.

@maddyblue (Contributor) commented May 4, 2022

My understanding is that the TimestampOracle does correct things such that advance_local_inputs would always append all INSERT data, even with this PR. Is that not the case?

@petrosagg (Contributor, author) replied:

No, that is not the case, and it's related to my own misconception of what the TimestampOracle does. We currently call advance_local_inputs whenever the oracle returns Some(ts) from its should_advance_to call. However, if we are in write mode, this returns the same timestamp that is currently being used for writing.

This means that we can only append to tables the INSERT data written at times strictly earlier than the current time of the oracle. If the oracle is in read mode then you're correct that all INSERT data is appended to the tables. If the oracle is in write mode, however, we must exclude the timestamp currently being written, so there might be some pending INSERT data that has to wait until the next time we decide to read.

I looked into send_builtin_table_updates_at_offset and we were only ever calling it with an offset of zero, so I removed it in this PR and did the work directly in send_builtin_table_updates.
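The read/write-mode distinction described above can be sketched with a toy model. All names here are hypothetical, not the real TimestampOracle API: in write mode the current timestamp is still open for writes, so inputs can only be advanced up to (not past) it, leaving data at exactly that time pending; in read mode the timestamp is closed and data at it can be appended too.

```rust
#[derive(Clone, Copy)]
enum Mode {
    Reading,
    Writing,
}

/// Upper frontier up to which pending table data can be appended:
/// updates at times strictly less than this value are appendable.
fn advance_inputs_to(mode: Mode, oracle_ts: u64) -> u64 {
    match mode {
        Mode::Writing => oracle_ts,     // exclude the still-open write timestamp
        Mode::Reading => oracle_ts + 1, // the timestamp is closed; include it
    }
}

fn main() {
    let pending = [7u64, 8]; // times of pending INSERT data; oracle at 8
    for mode in [Mode::Writing, Mode::Reading] {
        let upper = advance_inputs_to(mode, 8);
        let appendable: Vec<_> = pending.iter().filter(|&&t| t < upper).collect();
        println!("upper {upper}: appendable {appendable:?}");
    }
}
```

In write mode only the update at time 7 is appendable; the update at time 8 must wait until the oracle moves on.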

A contributor replied:

This makes sense now. I didn't have a full understanding of how your Append change some weeks ago interacted with the TimestampOracle. This analysis sounds correct. I'm ok with this PR now.

@maddyblue (Contributor) left a comment

@jkosh44 This PR illustrates a second consistency problem with tables: ack'd INSERTs aren't made durable until a SELECT (or the 1-second loop) triggers. I believe we need to change table INSERTs to:

  1. Disallow multiple tables in the same write transaction.
  2. Have an INSERT (in end_transaction) do the call to append before it is ack'd to the user. This will involve forcibly advancing the oracle timestamp and thus disallow batching of writes into the same timestamp.

Both of those are needed because we don't intend to implement a WAL right now. (This decision could easily change, but no one has argued for it yet.) If 1 or 2 ends up being too slow for users (or too frequent, which would advance the table write time ahead of the system clock), then we should look into a WAL.
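Point 2 above amounts to the following flow. This is a rough, non-async sketch with entirely hypothetical names (the real path goes through the coordinator and storage controller):

```rust
struct Oracle {
    write_ts: u64,
}

/// Stand-in for the storage controller's append; assumed to have made the
/// rows durable before it returns Ok.
fn append(_table: u64, _rows: &[&str], _ts: u64) -> Result<(), String> {
    Ok(())
}

/// Append before acking, then forcibly advance the oracle's write timestamp
/// so later writes cannot batch into the same time.
fn insert_then_ack(oracle: &mut Oracle, table: u64, rows: &[&str]) -> Result<u64, String> {
    let ts = oracle.write_ts;
    append(table, rows, ts)?; // durably append before acking the user
    oracle.write_ts = ts + 1; // close ts: no further batching at this time
    Ok(ts)                    // only now is the INSERT acknowledged
}

fn main() {
    let mut oracle = Oracle { write_ts: 10 };
    let ts = insert_then_ack(&mut oracle, 1, &["row"]).unwrap();
    println!("acked at {ts}, next write at {}", oracle.write_ts);
}
```

The cost noted above is visible here: every acknowledged INSERT consumes a distinct timestamp, so frequent writes push the write frontier forward instead of sharing a time.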

```rust
// src/coord/src/coord.rs  @@ -4401,33 +4401,15 @@ impl Coordinator {
    Ok(result)
}

async fn send_builtin_table_updates_at_offset(&mut self, updates: Vec<TimestampedUpdate>) {
```
A contributor commented:

The thing that used this has apparently gone away, great! Looks like you can also remove the TimestampedUpdate struct now too.

@petrosagg (Contributor, author) replied:

Great! I removed the struct and hit auto-merge.

Signed-off-by: Petros Angelatos <petrosagg@gmail.com> (all 3 commits)
@jkosh44 (Contributor) commented May 5, 2022

> @jkosh44 This PR illustrates a second consistency problem with tables: ack'd INSERTs aren't made durable until a SELECT (or the 1-second loop) triggers. I believe we need to change table INSERTs to:
>
> 1. Disallow multiple tables in the same write transaction.
> 2. Have an INSERT (in end_transaction) do the call to append before it is ack'd to the user. This will involve forcibly advancing the oracle timestamp and thus disallow batching of writes into the same timestamp.
>
> Both of those are needed because we don't intend to implement a WAL right now. (This decision could easily change, but no one has argued for it yet.) If 1 or 2 ends up being too slow for users (or too frequent, which would advance the table write time ahead of the system clock), then we should look into a WAL.

Thanks, I'm going to try and write up an issue that details this problem today.

@petrosagg petrosagg merged commit 95171c5 into MaterializeInc:main May 5, 2022
@petrosagg petrosagg deleted the validate-append-batches branch May 5, 2022 17:12