
coord: Make all writes durable #12330

Merged
merged 12 commits into from May 10, 2022

Conversation

jkosh44
Contributor

@jkosh44 jkosh44 commented May 9, 2022

Previously, consecutive writes in the coordinator were all done at the
same timestamp and saved in an in-memory WAL. These writes would only be
persisted to STORAGE after 1 second or after a read. If the ADAPTER
crashed after the write, but before the 1 second elapsed or a read
occurred, then the writes would be lost forever.

This commit updates the ADAPTER code so that every write increases the
global timestamp and calls append with the write.

In the future if we were to add a durable ADAPTER side WAL, then we
wouldn't need to increase the timestamp or call append for every write.
Writes would only need to be made durable in the WAL, and could be
batched together in a single append call.

Fixes #12287

Motivation

This PR fixes a recognized bug.

Testing

  • This PR has adequate test coverage / QA involvement has been duly considered.

Release notes

This PR includes the following user-facing behavior changes:

  • There are no user-facing behavior changes.

@jkosh44 jkosh44 marked this pull request as draft May 9, 2022 13:51
@jkosh44 jkosh44 changed the title coord: Make all write durable [WIP] coord: Make all write durable May 9, 2022
@jkosh44 jkosh44 requested a review from maddyblue May 9, 2022 14:52
@jkosh44
Contributor Author

jkosh44 commented May 9, 2022

@mjibson As far as I can tell, this is all that's needed to ensure that all writes are durable. Would you agree, or am I missing something?

One big open question for me: now that every write immediately calls append and we don't have a durable WAL, do we benefit from volatile_updates and the background thread that advances every table every 1 second? With a durable WAL I can see that these would allow us to batch many updates together in a single append call, but without it I'm not sure what it achieves. Perhaps, even if it doesn't achieve anything now, it's useful to keep around for when we do add a durable WAL.

@jkosh44 jkosh44 marked this pull request as ready for review May 9, 2022 17:01
Contributor

@maddyblue maddyblue left a comment

I don't think this PR achieves its goal. The goal is to make all writes durable, which means that once mz has ack'd a write to a user, that write will be present on restart (assuming the storage layer takes care of that, which I don't think it does yet, but that's not coord's job to care about). Even with this PR, consider the following sequence:

  1. user INSERTs data
  2. mz ack's the INSERT to the user, and dumps the change into coord.volatile_updates (self.volatile_updates.entry(id).or_default().extend(updates);)
  3. mz/coord restarts before advance_local_inputs is run, which is the thing that ships data to storage
  4. user reconnects and queries the table, observing the updates are gone

I think the solution to this problem is to change sequence_end_transaction_inner to do the call to append itself (with just a single table's worth of append data, erroring if there's more than 1 table), and then cause advance_local_inputs to be called later on to close that timestamp for all other tables.

@@ -5492,8 +5492,12 @@ mod timeline {
/// `self.read_ts()`.
pub fn write_ts(&mut self) -> T {
match &self.state {
TimestampOracleState::Writing(ts) => ts.clone(),
Contributor

I think we want the TimestampOracle to not change. It's pretty carefully considered, and changing anything in it requires reconsidering its guarantees and edge cases. I recommend implementing this PR with the existing oracle API, which I think would be to always call read_ts() immediately after calling write_ts()

src/materialized/tests/sql.rs (resolved)
@jkosh44
Contributor Author

jkosh44 commented May 9, 2022

@mjibson Thanks for the feedback. I think all the async/await and channels had me confused about the order; I thought that should_advance_to and advance_local_inputs were being called before the ACK back to the user. I'll make the suggested updates.

@jkosh44 jkosh44 changed the title [WIP] coord: Make all write durable coord: Make all write durable May 9, 2022
@jkosh44
Contributor Author

jkosh44 commented May 9, 2022

@mjibson I'm still thinking through some things so you can ignore this for now, I'll re-request a review when it's ready.

@jkosh44 jkosh44 marked this pull request as draft May 9, 2022 20:47
@jkosh44 jkosh44 changed the title coord: Make all write durable coord: Make all writes durable May 9, 2022
@jkosh44 jkosh44 marked this pull request as ready for review May 10, 2022 15:06
@jkosh44
Contributor Author

jkosh44 commented May 10, 2022

@mjibson This is ready for review again.

@petrosagg This pretty much undoes your work in #12246

@jkosh44
Contributor Author

jkosh44 commented May 10, 2022

@mjibson Also, something I noticed while working on this: pretty much all DDL requires two commands that are sent separately to storage/compute, one to modify a built-in table and one to actually do the DDL.
For example, creating a source requires both an append command and a create_source command to be sent to STORAGE.

It's not exactly related to this PR but it's something that I plan on looking at because it seems like it could be a source of consistency issues in the face of one command succeeding but the other failing.

Contributor

@petrosagg petrosagg left a comment

Looks great!

// This should have been caught earlier, immediately when the second table
// was added to the txn.
assert!(appends.len() <= 1);
if appends.len() == 1 {
Contributor

This and the assert on the line above can be turned into a match? Not sure if this is better or not, but it avoids a second call to len, which is... probably about zero time anyway.

match appends.len() {
    1 => { /* ... */ }
    0 => {}
    _ => unreachable!(),
}

Contributor Author

I think this is cleaner, I updated it.

* same timestamp as long as they're written to the WAL first.
*/
let advance_to = ts.step_forward();
self.global_timeline.fast_forward(advance_to);
Contributor

@maddyblue maddyblue May 10, 2022

The code here will, say, generate some write ts = 5. It will then call fast_forward with 6. If a read then comes in for that table, it will use 6, and 5 will never be used for a read: we're closing one timestamp more than we need to. Using self.global_timeline.read_ts() instead of fast_forward will have the side effect of allowing 5 to be used in a read query. Also, ts 5 will be closed (due to should_advance_to being set to ts.step_forward() in read_ts), so ts 5 is immediately readable, whereas ts 6 would not be (it'd require closing before reading).

I think we still have to use ts.step_forward() in the return here because append wants the new upper, which must be greater than every timestamp in the write batch. It's very annoying to me because that exact timestamp is correctly stashed away in the advance_to field of the TimestampOracle, but observing it also removes it, so it's not safe to observe with our current API; it seems we still need to call step_forward manually.

Contributor Author

Updated to switch to your approach.

@@ -257,6 +257,9 @@ fn test_tail_basic() -> Result<(), Box<dyn Error>> {
"BEGIN;
DECLARE c CURSOR FOR TAIL t;",
)?;
// Locks the timestamp of the TAIL to before any of the following INSERTs, which is required
// for mz_timestamp column to be accurate
let _ = client_reads.query_one("FETCH 0 c", &[]);
Contributor

This is fine as is, but can probably also do DECLARE c CURSOR FOR TAIL t AS OF 0 which should work because compaction is disabled by default in tests.

Contributor Author

I actually tried that first, but got an invalid AS OF error

Contributor

Hrm, that is surprising. Maybe it needs to be AS OF 1000? Doesn't matter, though.

src/coord/src/coord.rs (outdated, resolved)
@jkosh44 jkosh44 enabled auto-merge (squash) May 10, 2022 18:06
@jkosh44 jkosh44 merged commit 7c54a6e into MaterializeInc:main May 10, 2022
@jkosh44 jkosh44 deleted the durable_writes branch May 10, 2022 19:13

Successfully merging this pull request may close these issues.

ACKed writes are not durable