storage: validate append batches are well formed #12246
Conversation
Oh, good catch! @frankmcsherry is the original author of this code and may have thoughts on where the responsibility for this deduplication should live.
Adding the test is good; I think the fix is probably not the right fix (or at least, I am not ready to accept it :D).
Force-pushed from a5d7ee8 to 441b7fe
```rust
// TODO(petrosagg): replace with `drain_filter` once it stabilizes
let mut cursor = 0;
while let Some(update) = updates.get(cursor) {
    if update.timestamp < advance_to {
```
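The cursor loop in the excerpt stands in for the unstable `drain_filter`. Here is a minimal, stable-Rust sketch of the same idea, with a hypothetical `Update` type (not the coordinator's real one): it moves every update with `timestamp < advance_to` out of the pending vector, preserving order.

```rust
// Hypothetical stand-in for the coordinator's timestamped update type.
#[derive(Debug, Clone, PartialEq)]
struct Update {
    timestamp: u64,
    payload: &'static str,
}

// Stable-Rust equivalent of `drain_filter`: move every update with
// `timestamp < advance_to` out of `updates`, preserving order.
fn split_ready(updates: &mut Vec<Update>, advance_to: u64) -> Vec<Update> {
    let mut ready = Vec::new();
    let mut cursor = 0;
    while let Some(update) = updates.get(cursor) {
        if update.timestamp < advance_to {
            ready.push(updates.remove(cursor));
        } else {
            cursor += 1;
        }
    }
    ready
}

fn main() {
    let mut pending = vec![
        Update { timestamp: 1, payload: "a" },
        Update { timestamp: 5, payload: "b" },
        Update { timestamp: 3, payload: "c" },
    ];
    let ready = split_ready(&mut pending, 4);
    // Updates at times 1 and 3 are ready; the one at time 5 stays pending.
    assert_eq!(ready.len(), 2);
    assert!(pending.iter().all(|u| u.timestamp >= 4));
    println!("ready: {:?}, pending: {:?}", ready, pending);
}
```

`Vec::remove` keeps the sketch order-preserving at the cost of O(n²) in the worst case; `swap_remove` would be O(n) but reorders the remaining updates.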
My understanding is that these updates are coming from the `send_builtin_table_updates_at_offset` function, where we use some future timestamp to correctly retract data without the possibility of forgetting it. This PR removes the second half of that ("correctly retract"), because now the data are split and backed by a volatile in-memory data structure. We (coord) don't have plans to make that WAL-backed for now (it's hard and we don't need it for any upcoming milestone).
These updates come from anything that writes to tables, including what you said but also user `INSERT` statements, which must conform to the semantics of the `Append` operation. Is there another way to cut the batches that makes coord happy?
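For reference, the invariant at stake can be sketched with plain integer timestamps (not the real types): every update in an append batch must sit at a time strictly less than the batch's new upper.

```rust
// Sketch of the append-batch invariant discussed in this thread:
// every update timestamp must be strictly below the batch's upper.
fn batch_is_well_formed(update_times: &[u64], upper: u64) -> bool {
    update_times.iter().all(|&ts| ts < upper)
}

fn main() {
    // Updates strictly below the upper: well formed.
    assert!(batch_is_well_formed(&[1, 2, 3], 4));
    // An update exactly at the upper makes the batch malformed — this is
    // the case the validation added by this PR rejects.
    assert!(!batch_is_well_formed(&[1, 2, 4], 4));
    println!("invariant checks passed");
}
```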
I think that between the various needs here, `send_builtin_table_updates_at_offset` is on the losing end and should be changed to not do its magical timestamp-retraction thing. If we commit this PR as is, I think any coord restart will always create some permanently wrong metrics.
Wait, how do INSERTs generate timestamps that would trigger this?
Does it make sense to tag in @jkosh44 here? There are various other "tables have the wrong data on restart" issues that he's looked at, where one conclusion was that it might be most ergonomic to support e.g. `truncate(table_id)`.

But we should definitely see if we agree on the API and whether the calls are each meant to be durable (even if they are not currently so).

Edit: ignore me; I thought this was storage controller code, rather than adapter code.
My understanding is that the `TimestampOracle` does correct things such that `advance_local_inputs` would always append all INSERT data, even with this PR. Is that not the case?
No, that is not the case, and it's related to my own misconception of what the `TimestampOracle` does. We currently call `advance_local_inputs` whenever the oracle returns `Some(ts)` from its `should_advance_to` call. However, this will return the same timestamp that is currently used for writing, if we are in writing mode.

This means that we are only able to append to tables the INSERT data that has been written at times strictly earlier than the current time of the oracle. If the oracle is in reading mode then you're correct that all INSERT data is appended to the tables. If the oracle is in writing mode, however, we must exclude the timestamp that is currently being written, and therefore there might be some pending INSERT data that has to wait until the next time we decide to read.
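That read-mode/write-mode distinction can be modeled with a toy sketch (the names are illustrative, not the actual `TimestampOracle` API): in read mode, every timestamp written so far is appendable, while in write mode the current write timestamp must stay out of the batch.

```rust
// Toy model of the distinction described above; names are illustrative,
// not the real TimestampOracle API.
enum OracleMode {
    Reading,
    Writing,
}

// The upper we can append up to. In write mode the current timestamp may
// still receive new writes, so it must be excluded from the batch.
fn appendable_upper(mode: &OracleMode, current_ts: u64) -> u64 {
    match mode {
        OracleMode::Reading => current_ts + 1,
        OracleMode::Writing => current_ts,
    }
}

fn main() {
    // In read mode, an update at the current timestamp (10) is appendable
    // (10 < 11); in write mode it must wait (10 is not < 10).
    assert!(10 < appendable_upper(&OracleMode::Reading, 10));
    assert!(!(10 < appendable_upper(&OracleMode::Writing, 10)));
}
```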
I looked into `send_builtin_table_updates_at_offset` and we were only calling it with an offset of zero, so I removed it in this PR and did the work directly in `send_builtin_table_updates`.
This makes sense now. I didn't have a full understanding of how your Append change some weeks ago interacted with the TimestampOracle. This analysis sounds correct. I'm ok with this PR now.
Force-pushed from 223ba4d to 49ecbb6
@jkosh44 This PR illustrates a second consistency problem with tables: ack'd INSERTs aren't made durable until a SELECT happens (or until the 1-second loop triggers). I believe we need to change table INSERTs to:
1. Disallow multiple tables in the same write transaction.
2. Have an INSERT (in `end_transaction`) do the call to `append` before it is ack'd to the user. This will involve forcibly advancing the oracle timestamp and thus disallow batching of writes into the same timestamp.

Both of those are needed because we don't intend to implement a WAL right now. (This decision could easily change, but no one has argued for it yet.) If 1 or 2 ends up being too slow (or too frequent, which ends up advancing the table write time ahead of the system clock), then we should look into a WAL.
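A rough sketch of proposal 2, with all names hypothetical: `end_transaction` would forcibly advance the oracle's write timestamp and complete the (durable) append before acknowledging the client, so an ack always implies the data is below the table's upper.

```rust
// Hypothetical oracle holding only a write timestamp.
struct Oracle {
    write_ts: u64,
}

impl Oracle {
    // Forcibly advance the write timestamp; the just-written data now sits
    // strictly below the new upper, at the cost of per-write batching.
    fn advance(&mut self) -> u64 {
        self.write_ts += 1;
        self.write_ts
    }
}

// Stand-in for a durable append; records each batch with its upper and
// enforces the well-formedness invariant from this PR.
fn append(log: &mut Vec<(Vec<u64>, u64)>, times: Vec<u64>, upper: u64) {
    assert!(times.iter().all(|&t| t < upper), "malformed batch");
    log.push((times, upper));
}

// Sketch of the proposed end_transaction: append durably, then ack.
fn end_transaction(oracle: &mut Oracle, log: &mut Vec<(Vec<u64>, u64)>) -> &'static str {
    let write_ts = oracle.write_ts;
    let upper = oracle.advance();
    append(log, vec![write_ts], upper); // the durable append happens first...
    "ack" // ...and only then does the user see the acknowledgement
}

fn main() {
    let mut oracle = Oracle { write_ts: 3 };
    let mut log = Vec::new();
    assert_eq!(end_transaction(&mut oracle, &mut log), "ack");
    assert_eq!(oracle.write_ts, 4);
    assert_eq!(log, vec![(vec![3], 4)]);
}
```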
```diff
@@ -4401,33 +4401,15 @@ impl Coordinator {
         Ok(result)
     }

     async fn send_builtin_table_updates_at_offset(&mut self, updates: Vec<TimestampedUpdate>) {
```
The thing that used this has apparently gone away, great! Looks like you can also remove the `TimestampedUpdate` struct now too.
Great! I removed the struct and hit auto-merge
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Force-pushed from 49ecbb6 to e6604d8
Thanks, I'm going to try and write up an issue that details this problem today.
Motivation
While working on piping everything through persist I got panics, because sometimes we call `storage_controller.append()` with updates whose timestamp is equal to the upper, but the updates in a batch must be at times strictly less than the batch's upper. I added an assertion to see whether we generate these malformed batches on `main`, and indeed we do (build results of just the first commit here).

Testing
Release notes
This PR includes the following user-facing behavior changes: