coord: Make all writes durable #12330
Conversation
Previously, consecutive writes in the coordinator were all done at the same timestamp and saved in an in-memory WAL. These writes would only be persisted to STORAGE after 1 second or after a read. If the ADAPTER crashed after the write, but before 1 second or a read, then the writes would be lost forever. This commit updates the ADAPTER code so that every write increases the global timestamp and calls append with the write. In the future, if we were to add a durable ADAPTER-side WAL, then we wouldn't need to increase the timestamp or call append for every write. Writes would only need to be made durable in the WAL, and could be batched together in a single append call. Fixes MaterializeInc#12287
@mjibson As far as I can tell, this is all that's needed to ensure that all writes are durable; would you agree, or am I missing something? One big open question for me is, now that every write immediately calls …
I don't think this PR achieves its goal. The goal is to make all writes durable, which means that once mz has ack'd a write to a user, that write will be present on restart (assuming the storage layer takes care of that, which I don't think it does yet, but that's not coord's job to care about). Even with this PR, take the following sequence:
- user INSERTs data
- mz ack's the INSERT to the user, and dumps the change into `coord.volatile_updates` (materialize/src/coord/src/coord.rs, line 2888 in 8dd44d0: `self.volatile_updates.entry(id).or_default().extend(updates);`)
- mz/coord restarts before advance_local_inputs is run, which is the thing that ships data to storage
- user reconnects and queries the table, observing the updates are gone
I think the solution to this problem is to change sequence_end_transaction_inner to do the call to append itself (with just a single table's worth of append data, erroring if there's more than one table), and then cause advance_local_inputs to be called later on to close that timestamp for all other tables.
src/coord/src/coord.rs (Outdated)
@@ -5492,8 +5492,12 @@ mod timeline {
    /// `self.read_ts()`.
    pub fn write_ts(&mut self) -> T {
        match &self.state {
            TimestampOracleState::Writing(ts) => ts.clone(),
I think we want the TimestampOracle to not change. It's pretty carefully considered, and changing anything in it requires reconsidering its guarantees and edge cases. I recommend implementing this PR with the existing oracle API, which I think would be to always call read_ts() immediately after calling write_ts().
@mjibson Thanks for the feedback, I think all the async/await and channels had me confused about the order. I thought that …
@mjibson I'm still thinking through some things, so you can ignore this for now; I'll re-request a review when it's ready.
@mjibson This is ready for review again. @petrosagg This pretty much undoes your work in #12246.
@mjibson Also something I noticed while working on this: pretty much all DDL requires two commands that are sent separately to storage/compute, one to modify a built-in table and one to actually do the DDL. It's not exactly related to this PR, but it's something I plan on looking at, because it seems like it could be a source of consistency issues in the face of one command succeeding but the other failing.
Looks great!
src/coord/src/coord.rs (Outdated)
// This should have been caught earlier, immediately when the second table
// was added to the txn.
assert!(appends.len() <= 1);
if appends.len() == 1 {
This and the assert on the line above could be turned into a match? Not sure if this is better or not, but it avoids a second call to len, which is... probably about zero time anyway.
match appends.len() {
    1 => {..}
    0 => {}
    _ => unreachable!()
}
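A runnable sketch of the suggestion above, folding the assert and the length check into one match (`single_append` is a hypothetical helper, not the real coord function):

```rust
// Returns the single pending append, if any; more than one table in
// the transaction should have been rejected earlier, so that arm is
// unreachable by construction.
fn single_append(appends: &[&str]) -> Option<String> {
    match appends.len() {
        1 => Some(appends[0].to_string()),
        0 => None,
        _ => unreachable!("second table should have been rejected earlier"),
    }
}

fn main() {
    assert_eq!(single_append(&[]), None);
    assert_eq!(single_append(&["t1"]), Some("t1".to_string()));
    println!("ok");
}
```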
I think this is cleaner, I updated it.
src/coord/src/coord.rs (Outdated)
 * same timestamp as long as they're written to the WAL first.
 */
let advance_to = ts.step_forward();
self.global_timeline.fast_forward(advance_to);
The code here will, say, generate some write ts = 5. It will then call fast_forward with 6. If a read then comes in for that table, it will use 6, and 5 will never be used for a read: we're closing one timestamp more than we need to. Using self.global_timeline.read_ts() instead of fast_forward will have the side effect of allowing 5 to be used in a read query. Also, ts 5 will be closed (due to should_advance_to being set to ts.step_forward() in read_ts), so ts 5 is immediately readable, whereas ts 6 would not be (it'd require closing before reading).
I think we still have to use ts.step_forward() in the return here, because append wants the new upper, which must be greater than any timestamp in the write batch. It's very annoying to me because that exact timestamp is correctly stashed away in the advance_to field of the TimestampOracle, but observing it also removes it, so it's not safe to observe in our current API, and it seems we do need to manually call step_forward still.
Updated to switch to your approach.
@@ -257,6 +257,9 @@ fn test_tail_basic() -> Result<(), Box<dyn Error>> {
    "BEGIN;
     DECLARE c CURSOR FOR TAIL t;",
)?;
// Locks the timestamp of the TAIL to before any of the following INSERTs, which is required
// for the mz_timestamp column to be accurate.
let _ = client_reads.query_one("FETCH 0 c", &[]);
This is fine as is, but you can probably also do DECLARE c CURSOR FOR TAIL t AS OF 0, which should work because compaction is disabled by default in tests.
I actually tried that first, but got an invalid AS OF error.
Hrm, that is surprising. Maybe needs to be AS OF 1000? Doesn't matter though.
Motivation
This PR fixes a recognized bug.
Testing
Release notes
This PR includes the following user-facing behavior changes: