coord: Make all writes durable #12330
Conversation
Previously, consecutive writes in the coordinator were all done at the same timestamp and saved in an in-memory WAL. These writes would only be persisted to STORAGE after 1 second or after a read. If the ADAPTER crashed after the write, but before 1 second or a read, then the writes would be lost forever. This commit updates the ADAPTER code so that every write increases the global timestamp and calls append with the write. In the future, if we were to add a durable ADAPTER-side WAL, then we wouldn't need to increase the timestamp or call append for every write. Writes would only need to be made durable in the WAL, and could be batched together in a single append call. Fixes MaterializeInc#12287
@mjibson As far as I can tell, this is all that's needed to ensure that all writes are durable; would you agree, or am I missing something? One big open question for me is, now that every write immediately calls …
I don't think this PR achieves its goal. The goal is to make all writes durable, which means that once mz has ack'd a write to a user, that write will be present on restart (assuming the storage layer takes care of that, which I don't think it does yet, but that's not coord's job to care about). Even with this PR, take the following sequence:
- user INSERTs data
- mz ack's the INSERT to the user, and dumps the change into `coord.volatile_updates` (materialize/src/coord/src/coord.rs, line 2888 in 8dd44d0: `self.volatile_updates.entry(id).or_default().extend(updates);`)
- mz/coord restarts before advance_local_inputs is run, which is the thing that ships data to storage
- user reconnects and queries the table, observing the updates are gone
I think the solution to this problem is to change sequence_end_transaction_inner to do the call to append itself (with just a single table's worth of append data, erroring if there's more than one table), and then cause advance_local_inputs to be called later on to close that timestamp for all other tables.
src/coord/src/coord.rs (Outdated)
@@ -5492,8 +5492,12 @@ mod timeline {
    /// `self.read_ts()`.
    pub fn write_ts(&mut self) -> T {
        match &self.state {
            TimestampOracleState::Writing(ts) => ts.clone(),
I think we want the TimestampOracle to not change. It's pretty carefully considered, and changing anything in it requires reconsidering its guarantees and edge cases. I recommend implementing this PR with the existing oracle API, which I think would be to always call read_ts() immediately after calling write_ts().
@mjibson Thanks for the feedback, I think all the async/await and channels had me confused about the order. I thought that …
@mjibson I'm still thinking through some things, so you can ignore this for now; I'll re-request a review when it's ready.
@mjibson This is ready for review again. @petrosagg This pretty much undoes your work in #12246.
@mjibson Also something I noticed while working on this: pretty much all DDL requires two commands that are sent separately to storage/compute, one to modify a built-in table and one to actually do the DDL. It's not exactly related to this PR, but it's something I plan on looking at, because it seems like it could be a source of consistency issues in the face of one command succeeding but the other failing.
Looks great!
src/coord/src/coord.rs (Outdated)
// This should have been caught earlier, immediately when the second table
// was added to the txn.
assert!(appends.len() <= 1);
if appends.len() == 1 {
This and the assert on the line above could be turned into a match? Not sure if this is better or not, but it avoids a second call to len, which is... probably about zero time anyway.
match appends.len() {
    1 => {..}
    0 => {}
    _ => unreachable!()
}
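A runnable sketch of the suggestion above, folding the assert and the length check into one match (`single_append` is a hypothetical helper, not the real coord function):

```rust
// Returns the single pending append, if any; more than one table in
// the transaction should have been rejected earlier, so that arm is
// unreachable by construction.
fn single_append(appends: &[&str]) -> Option<String> {
    match appends.len() {
        1 => Some(appends[0].to_string()),
        0 => None,
        _ => unreachable!("second table should have been rejected earlier"),
    }
}

fn main() {
    assert_eq!(single_append(&[]), None);
    assert_eq!(single_append(&["t1"]), Some("t1".to_string()));
    println!("ok");
}
```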
I think this is cleaner, I updated it.
src/coord/src/coord.rs (Outdated)
 * same timestamp as long as they're written to the WAL first.
 */
let advance_to = ts.step_forward();
self.global_timeline.fast_forward(advance_to);
The code here will, say, generate some write ts = 5. It will then call fast_forward with 6. If a read then comes in for that table, it will use 6, and 5 will never be used for a read: we're closing one timestamp more than we need to. Using self.global_timeline.read_ts() instead of fast_forward will have the side effect of allowing 5 to be used in a read query. Also, ts 5 will be closed (due to should_advance_to being set to ts.step_forward() in read_ts), so ts 5 is immediately readable, whereas ts 6 would not be (it'd require closing before reading).
I think we still have to use ts.step_forward() in the return here, because append wants the new upper, which must be greater than any timestamp in the write batch. It's very annoying to me because that exact timestamp is correctly stashed away in the advance_to field of the TimestampOracle, but observing it also removes it, so it's not safe to observe in our current API, and it seems we do need to manually call step_forward still.
Updated to switch to your approach.
@@ -257,6 +257,9 @@ fn test_tail_basic() -> Result<(), Box<dyn Error>> {
    "BEGIN;
     DECLARE c CURSOR FOR TAIL t;",
)?;
// Locks the timestamp of the TAIL to before any of the following INSERTs, which is required
// for the mz_timestamp column to be accurate.
let _ = client_reads.query_one("FETCH 0 c", &[]);
This is fine as is, but you can probably also do DECLARE c CURSOR FOR TAIL t AS OF 0, which should work because compaction is disabled by default in tests.
I actually tried that first, but got an invalid AS OF error.
Hrm, that is surprising. Maybe needs to be AS OF 1000? Doesn't matter though.
Motivation
This PR fixes a recognized bug.
Testing
Release notes
This PR includes the following user-facing behavior changes: