coord,storage: reduce runtime of Coordinator::advance_local_inputs() #12813

Merged

Conversation

@aljoscha (Contributor) commented Jun 1, 2022

Updated (and less controversial) version of #12777

Motivation

That method, which is invoked at least once every timestamp_interval, blocks the main coordinator task and should therefore run as quickly as possible.

This PR adds two mitigations that work towards reducing the runtime of advance_local_inputs(): we parallelize the compare_and_append() calls in StorageController::append(), and we parallelize the downgrade_since() calls in StorageController::update_read_capabilities().
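To illustrate the shape of both changes (with stand-in handle and error types, not the actual controller code), the pattern is the same: set off every persist call first, then await them all concurrently through a FuturesUnordered.

```rust
use futures::stream::{FuturesUnordered, TryStreamExt};

#[derive(Debug)]
struct StorageError; // stand-in for the real controller error type

// Stand-in for a per-shard persist write handle.
struct WriteHandle {
    id: u64,
}

impl WriteHandle {
    // Stand-in for `compare_and_append()`.
    async fn compare_and_append(&mut self, new_upper: u64) -> Result<u64, StorageError> {
        // Pretend this round-trips to persist.
        Ok(new_upper.max(self.id))
    }
}

async fn append_all(
    handles: &mut [WriteHandle],
    new_upper: u64,
) -> Result<Vec<u64>, StorageError> {
    let futs = FuturesUnordered::new();
    for handle in handles.iter_mut() {
        // Each call is set off here; nothing is awaited yet.
        futs.push(handle.compare_and_append(new_upper));
    }
    // Await all calls concurrently instead of one after the other.
    futs.try_collect::<Vec<_>>().await
}
```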

Tips for reviewer

The first commit adds duration logging, which lets us see where the time goes. The rest of the individual commits have comments in the code that explain why we do things the way we do them.

I used

 bin/mzcompose --find limits run default --scenario Tables

to understand the baseline performance and to gauge the impact of the two main changes. I did reduce COUNT to 100 in tests/limits.mzcompose.py, though, in order to not have to wait too long.

Impact of mitigations (runtime numbers on my linux machine):

  1. Without any mitigations, at around 100 tables the runtime of advance_local_inputs() is about 300ms.
  2. With my two mitigations that parallelize persist calls, runtime goes down to about 300ms. (which is not surprising, because the Postgres persist implementation is serializing calls, even for different shards)
  3. With the two mitigations and Dan's postgres threadpool change, runtime goes down to about 50ms.

The threadpool commit is from #12482 and should not be merged along with these changes. It's only in here to get a feel for its impact. We should definitely merge that PR as well, though.

Testing

  • This PR has adequate test coverage / QA involvement has been duly considered.

Some(updates) => updates,
None => continue,
};

for update in &updates {
Contributor

This check should be moved up, before the batches are merged. Otherwise we can turn a pair of batches, where the first one is invalid and the second one is valid, into one big valid batch: this happens if the updates of the first batch are beyond the first batch's upper but not beyond the second batch's upper.

let (existing_updates, _current_upper, new_upper) = updates_by_id
.entry(id)
.or_insert_with(|| (Vec::new(), current_upper, T::minimum()));
existing_updates.append(&mut updates);
Contributor

Instead of building one big compound array that contains all the data, we can produce an iterator that goes through all the pieces.

diff --git a/src/dataflow-types/src/client/controller/storage.rs b/src/dataflow-types/src/client/controller/storage.rs
index 0b5d92809..9063f6065 100644
--- a/src/dataflow-types/src/client/controller/storage.rs
+++ b/src/dataflow-types/src/client/controller/storage.rs
@@ -410,15 +410,26 @@ where
     ) -> Result<(), StorageError> {
         let mut updates_by_id = HashMap::new();

-        for (id, mut updates, batch_upper) in commands {
-            let current_upper = self.collection(id)?.write_frontier.frontier().to_owned();
-            let (existing_updates, _current_upper, new_upper) = updates_by_id
+        for (id, updates, batch_upper) in commands {
+            for update in &updates {
+                if !update.timestamp.less_than(&batch_upper) {
+                    return Err(StorageError::UpdateBeyondUpper(id));
+                }
+            }
+
+            let (total_updates, new_upper) = updates_by_id
                 .entry(id)
-                .or_insert_with(|| (Vec::new(), current_upper, T::minimum()));
-            existing_updates.append(&mut updates);
+                .or_insert_with(|| (Vec::new(), T::minimum()));
+            total_updates.push(updates);
             new_upper.join_assign(&batch_upper);
         }

+        let mut appends_by_id = HashMap::new();
+        for (id, (updates, upper)) in updates_by_id {
+            let current_upper = self.collection(id)?.write_frontier.frontier().to_owned();
+            appends_by_id.insert(id, (updates.into_iter().flatten(), current_upper, upper));
+        }
+
         let futs = FuturesUnordered::new();

         // We cannot iterate through the updates and then set off a persist call
@@ -429,17 +440,11 @@ where
         // through all available write handles and see if there are any updates
         // for it. If yes, we send them all in one go.
         for (id, persist_handle) in self.state.persist_handles.iter_mut() {
-            let (updates, upper, new_upper) = match updates_by_id.remove(id) {
+            let (updates, upper, new_upper) = match appends_by_id.remove(id) {
                 Some(updates) => updates,
                 None => continue,
             };

-            for update in &updates {
-                if !update.timestamp.less_than(&new_upper) {
-                    return Err(StorageError::UpdateBeyondUpper(*id));
-                }
-            }
-
             let new_upper = Antichain::from_elem(new_upper);

             let updates = updates

Comment on lines 467 to 471
let change_batches = futs
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>, _>>()?;
@petrosagg (Contributor) commented Jun 1, 2022

You probably want to use .try_collect() here: https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect

Suggested change
let change_batches = futs
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>, _>>()?;
let change_batches = futs.try_collect::<Vec<_>>().await?;

@aljoscha (Contributor, Author) commented Jun 1, 2022

thanks for the suggestions, @petrosagg! 😊

@petrosagg (Contributor) left a comment

The storage changes look good! I'm deferring to Dan for the persist ones.

@danhhz (Contributor) commented Jun 1, 2022

I do think that, given downgrade_since is idempotent, it should be possible for storage to fire off each downgrade_since call in a task and not worry about it again, but it would take some persist work (some cloning and maybe a mutex). It's also not clear whether the technique would generalize to empty compare_and_append frontier updates. If Petros is happy with this complexity in storage, then I'm also fine with it and potentially circling back later.
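A rough sketch of that fire-and-forget idea, with a hypothetical cloneable handle (not what this PR implements; the real persist handle would need the cloning/mutex work mentioned above, and the approach relies on downgrade_since being idempotent):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Hypothetical cloneable since-handle standing in for the real persist handle.
#[derive(Clone)]
struct SinceHandle {
    since: Arc<Mutex<u64>>,
}

impl SinceHandle {
    // Idempotent: downgrading to a frontier we already passed is a no-op.
    async fn downgrade_since(&self, new_since: u64) {
        let mut since = self.since.lock().await;
        if new_since > *since {
            *since = new_since;
        }
    }
}

// Fire off the downgrade in a task and never block the coordinator on it.
fn downgrade_in_background(handle: &SinceHandle, new_since: u64) {
    let handle = handle.clone();
    tokio::spawn(async move {
        handle.downgrade_since(new_since).await;
    });
}
```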

Does this mean I should press on #12482?

  1. Without any mitigations, at around 100 tables the runtime of advance_local_inputs() is about 300ms.
  2. With my two mitigations that parallelize persist calls, runtime goes down to about 300ms. (which is not surprising, because the Postgres persist implementation is serializing calls, even for different shards)

Are these numbers correct? (300ms down to 300ms)

@danhhz (Contributor) commented Jun 1, 2022

Does this mean I should press on #12482?

This happened to come up in my 1:1 with @elindsey, and his instinct is to hold off on merging it until we have a better sense of whether it's necessary for the M1 demo. Where are we at on advance_local_inputs timing without that PR?

@aljoscha (Contributor, Author) commented Jun 1, 2022

Are these numbers correct? (300ms down to 300ms)

Yes! 😅 Without your (@danhhz's) PR, my changes do remove the serial nature of persist calls in the storage controller, but then postgres consensus serializes things because the pg client is behind a mutex.
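A minimal sketch (stand-in types, not the actual persist/postgres consensus code) of why a single mutex-guarded client serializes calls even when the controller awaits them concurrently:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Stand-in for the single postgres connection used by consensus.
struct PgClient;

impl PgClient {
    async fn execute(&mut self, _stmt: &str) {
        // Pretend this round-trips to postgres.
    }
}

async fn consensus_op(client: Arc<Mutex<PgClient>>, stmt: &str) {
    // Every caller has to take this lock, so even calls that the storage
    // controller awaits concurrently end up executing one at a time here.
    let mut client = client.lock().await;
    client.execute(stmt).await;
}
```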

@aljoscha (Contributor, Author) commented Jun 1, 2022

This happened to come up in my 1:1 with @elindsey, and his instinct is to hold off on merging it until we have a better sense of whether it's necessary for the M1 demo. Where are we at on advance_local_inputs timing without that PR?

I don't think it's an issue for the M1 demo because I don't think we want to create many objects. The numbers with 100 tables are in the PR description, but I also got some smaller-scale numbers: runtime of advance_local_inputs() and peek latency with just one user table (and the default system tables).

#### On top of f4357ae690ac306bc6cb9d19e3e1550771f7f431

A "fresh" peek is a peek that comes in right after we downgraded the read/write
timestamp according to the timestamp interval.

baseline:
 - runtime of `advance_local_inputs()`: ~120ms
 - latency of a "fresh" peek: ~160ms

pg-threadpool:
 - runtime of `advance_local_inputs()`: ~120ms
 - latency of a "fresh" peek: ~160ms

pg-threadpool + downgrade-worker:
 - runtime of `advance_local_inputs()`: ~70ms
 - latency of a "fresh" peek: ~100ms

pg-threadpool + concurrent-append:
 - runtime of `advance_local_inputs()`: ~85ms
 - latency of a "fresh" peek: ~90ms

pg-threadpool + downgrade-worker + concurrent-append:
 - runtime of `advance_local_inputs()`: ~25ms
 - latency of a "fresh" peek: ~36ms

pg-threadpool + concurrent-downgrade + concurrent-append:
 - runtime of `advance_local_inputs()`: ~50ms
 - latency of a "fresh" peek: ~50ms

The last set of numbers is this PR. The one with downgrade-worker is with downgrading moved out into a separate task. It's not surprising that the downgrade-worker change roughly halves the runtime of advance_local_inputs(), because append() does #num-tables compare_and_append() calls and update_read_capabilities() does #num-tables downgrade_since() calls.

@aljoscha force-pushed the storage-parallelize-persist-ops branch from b36b393 to d95100b on June 2, 2022 09:09
Before, we would set off each `compare_and_append()` call and
individually await each future. Now we collect the futures of all calls
in a `FuturesUnordered` and await them concurrently.
@aljoscha force-pushed the storage-parallelize-persist-ops branch from d95100b to 930310b on June 2, 2022 11:26
@aljoscha merged commit d7cc59d into MaterializeInc:main on Jun 2, 2022
@aljoscha deleted the storage-parallelize-persist-ops branch on June 2, 2022 16:07
@aljoscha (Contributor, Author) commented Jun 2, 2022

TFTR!

}

let change_batches = futs.try_collect::<Vec<_>>().await?;
Contributor

Can anything bad happen if one of these futures is cancelled in the middle of an await? For example, one could be in the middle of compare_and_append, but another finishes with an error, and the first one is dropped.

Contributor

Glad you're thinking about this! We don't do anything to test it yet, but persist intends to be cancel-safe 🤷

Contributor (Author)

I think it should be fine; at least I don't think it makes things worse. Before, a compare_and_append could already be cancelled right in the middle of something if the future returned from append() was cancelled.
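To make the scenario concrete, here is a small self-contained sketch with stand-in operations (not the controller code): with try_collect, the first Err resolves the combined future, and anything still pending inside the FuturesUnordered is dropped, i.e. cancelled mid-await.

```rust
use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, TryStreamExt};
use futures::FutureExt;
use std::time::Duration;

// Stand-in for a compare_and_append that is mid-way through its work.
async fn slow_append() -> Result<(), String> {
    tokio::time::sleep(Duration::from_secs(10)).await;
    Ok(())
}

// Stand-in for a compare_and_append that fails, e.g. on an upper mismatch.
async fn failing_append() -> Result<(), String> {
    Err("upper mismatch".to_string())
}

#[tokio::main]
async fn main() {
    let futs: FuturesUnordered<BoxFuture<'static, Result<(), String>>> =
        FuturesUnordered::new();
    futs.push(slow_append().boxed());
    futs.push(failing_append().boxed());

    // `try_collect` resolves with the first `Err`; at that point the whole
    // `FuturesUnordered` is dropped and the still-pending `slow_append` is
    // cancelled in the middle of its await.
    let res: Result<Vec<()>, String> = futs.try_collect().await;
    assert!(res.is_err());
}
```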
