coord,storage: reduce runtime of Coordinator::advance_local_inputs() #12813

Merged

Conversation

@aljoscha (Contributor) commented Jun 1, 2022

Updated (and less controversial) version of #12777

Motivation

That method, which is invoked at least once every timestamp_interval, blocks the main coordinator task and should therefore run as quickly as possible.

This PR adds two mitigations that work towards reducing the runtime of advance_local_inputs(): we parallelize the compare_and_append() calls in StorageController::append(), and we parallelize the downgrade_since() calls in StorageController::update_read_capabilities().
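To illustrate the shape of both changes (with stand-in handle and error types, not the actual controller code), the pattern is the same: set off every persist call first, then await them all concurrently through a FuturesUnordered.

```rust
use futures::stream::{FuturesUnordered, TryStreamExt};

#[derive(Debug)]
struct StorageError; // stand-in for the real controller error type

// Stand-in for a per-shard persist write handle.
struct WriteHandle {
    id: u64,
}

impl WriteHandle {
    // Stand-in for `compare_and_append()`.
    async fn compare_and_append(&mut self, new_upper: u64) -> Result<u64, StorageError> {
        // Pretend this round-trips to persist.
        Ok(new_upper.max(self.id))
    }
}

async fn append_all(
    handles: &mut [WriteHandle],
    new_upper: u64,
) -> Result<Vec<u64>, StorageError> {
    let futs = FuturesUnordered::new();
    for handle in handles.iter_mut() {
        // Each call is set off here; nothing is awaited yet.
        futs.push(handle.compare_and_append(new_upper));
    }
    // Await all calls concurrently instead of one after the other.
    futs.try_collect::<Vec<_>>().await
}
```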

Tips for reviewer

The first commit adds duration logging, which lets us see where the time goes. The rest of the individual commits have comments in the code that explain why we do things the way we do them.

I used

 bin/mzcompose --find limits run default --scenario Tables

to understand the baseline performance and to gauge the impact of the two main changes. I did reduce COUNT to 100 in tests/limits.mzcompose.py, though, in order to not have to wait too long.

Impact of mitigations (runtime numbers on my linux machine):

  1. Without any mitigations, at around 100 tables the runtime of advance_local_inputs() is about 300ms.
  2. With my two mitigations that parallelize persist calls, runtime goes down to about 300ms. (which is not surprising, because the Postgres persist implementation is serializing calls, even for different shards)
  3. With the two mitigations and Dan's postgres threadpool change, runtime goes down to about 50ms.

The threadpool commit is from #12482 and should not be merged along with these changes. It's only in here to get a feel for its impact. We should definitely merge that PR as well, though.

Testing

  • This PR has adequate test coverage / QA involvement has been duly considered.

Some(updates) => updates,
None => continue,
};

for update in &updates {
Contributor

This check should be moved up, before the batches are merged. Otherwise we can turn a pair of batches, where the first one is invalid and the second one is valid, into one big valid batch: this happens if the updates of the first batch are beyond the first batch's upper but not beyond the second batch's upper.

let (existing_updates, _current_upper, new_upper) = updates_by_id
.entry(id)
.or_insert_with(|| (Vec::new(), current_upper, T::minimum()));
existing_updates.append(&mut updates);
Contributor

Instead of building one big compound array that contains all the data, we can produce an iterator that goes through all the pieces.

diff --git a/src/dataflow-types/src/client/controller/storage.rs b/src/dataflow-types/src/client/controller/storage.rs
index 0b5d92809..9063f6065 100644
--- a/src/dataflow-types/src/client/controller/storage.rs
+++ b/src/dataflow-types/src/client/controller/storage.rs
@@ -410,15 +410,26 @@ where
     ) -> Result<(), StorageError> {
         let mut updates_by_id = HashMap::new();

-        for (id, mut updates, batch_upper) in commands {
-            let current_upper = self.collection(id)?.write_frontier.frontier().to_owned();
-            let (existing_updates, _current_upper, new_upper) = updates_by_id
+        for (id, updates, batch_upper) in commands {
+            for update in &updates {
+                if !update.timestamp.less_than(&batch_upper) {
+                    return Err(StorageError::UpdateBeyondUpper(id));
+                }
+            }
+
+            let (total_updates, new_upper) = updates_by_id
                 .entry(id)
-                .or_insert_with(|| (Vec::new(), current_upper, T::minimum()));
-            existing_updates.append(&mut updates);
+                .or_insert_with(|| (Vec::new(), T::minimum()));
+            total_updates.push(updates);
             new_upper.join_assign(&batch_upper);
         }

+        let mut appends_by_id = HashMap::new();
+        for (id, (updates, upper)) in updates_by_id {
+            let current_upper = self.collection(id)?.write_frontier.frontier().to_owned();
+            appends_by_id.insert(id, (updates.into_iter().flatten(), current_upper, upper));
+        }
+
         let futs = FuturesUnordered::new();

         // We cannot iterate through the updates and then set off a persist call
@@ -429,17 +440,11 @@ where
         // through all available write handles and see if there are any updates
         // for it. If yes, we send them all in one go.
         for (id, persist_handle) in self.state.persist_handles.iter_mut() {
-            let (updates, upper, new_upper) = match updates_by_id.remove(id) {
+            let (updates, upper, new_upper) = match appends_by_id.remove(id) {
                 Some(updates) => updates,
                 None => continue,
             };

-            for update in &updates {
-                if !update.timestamp.less_than(&new_upper) {
-                    return Err(StorageError::UpdateBeyondUpper(*id));
-                }
-            }
-
             let new_upper = Antichain::from_elem(new_upper);

             let updates = updates

Comment on lines 467 to 471
let change_batches = futs
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>, _>>()?;
@petrosagg (Contributor) commented Jun 1, 2022

You probably want to use .try_collect() here: https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect

Suggested change
let change_batches = futs
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>, _>>()?;
let change_batches = futs.try_collect::<Vec<_>>().await?;

@aljoscha (Contributor, Author) commented Jun 1, 2022

thanks for the suggestions, @petrosagg! 😊

@petrosagg (Contributor) left a comment

The storage changes look good! I'm deferring to Dan for the persist ones.

@danhhz (Contributor) commented Jun 1, 2022

I do think that, given downgrade_since is idempotent, it should be possible for storage to fire off each downgrade_since call in a task and not worry about it again, but it would take some persist work (some cloning and maybe a mutex). It's also not clear whether the technique would generalize to empty compare_and_append frontier updates. If Petros is happy with this complexity in storage, then I'm also fine with it and potentially circling back later.
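A rough sketch of that fire-and-forget idea, with a hypothetical cloneable handle (not what this PR implements; the real persist handle would need the cloning/mutex work mentioned above, and the approach relies on downgrade_since being idempotent):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Hypothetical cloneable since-handle standing in for the real persist handle.
#[derive(Clone)]
struct SinceHandle {
    since: Arc<Mutex<u64>>,
}

impl SinceHandle {
    // Idempotent: downgrading to a frontier we already passed is a no-op.
    async fn downgrade_since(&self, new_since: u64) {
        let mut since = self.since.lock().await;
        if new_since > *since {
            *since = new_since;
        }
    }
}

// Fire off the downgrade in a task and never block the coordinator on it.
fn downgrade_in_background(handle: &SinceHandle, new_since: u64) {
    let handle = handle.clone();
    tokio::spawn(async move {
        handle.downgrade_since(new_since).await;
    });
}
```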

Does this mean I should press on #12482?

  1. Without any mitigations, at around 100 tables the runtime of advance_local_inputs() is about 300ms.
  2. With my two mitigations that parallelize persist calls, runtime goes down to about 300ms. (which is not surprising, because the Postgres persist implementation is serializing calls, even for different shards)

Are these numbers correct? (300ms down to 300ms)

@danhhz (Contributor) commented Jun 1, 2022

Does this mean I should press on #12482?

This happened to come up in my 1:1 with @elindsey, and his instinct is to hold off on merging it until we have a better sense of whether it's necessary for the M1 demo. Where are we at on advance_local_inputs timing without that PR?

@aljoscha (Contributor, Author) commented Jun 1, 2022

Are these numbers correct? (300ms down to 300ms)

Yes! 😅 Without your (@danhhz's) PR, my changes do remove the serial nature of persist calls in the storage controller, but then postgres consensus serializes things because the pg client is behind a mutex.
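A minimal sketch (stand-in types, not the actual persist/postgres consensus code) of why a single mutex-guarded client serializes calls even when the controller awaits them concurrently:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Stand-in for the single postgres connection used by consensus.
struct PgClient;

impl PgClient {
    async fn execute(&mut self, _stmt: &str) {
        // Pretend this round-trips to postgres.
    }
}

async fn consensus_op(client: Arc<Mutex<PgClient>>, stmt: &str) {
    // Every caller has to take this lock, so even calls that the storage
    // controller awaits concurrently end up executing one at a time here.
    let mut client = client.lock().await;
    client.execute(stmt).await;
}
```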

@aljoscha (Contributor, Author) commented Jun 1, 2022

This happened to come up in my 1:1 with @elindsey, and his instinct is to hold off on merging it until we have a better sense of whether it's necessary for the M1 demo. Where are we at on advance_local_inputs timing without that PR?

I don't think it's an issue for the M1 demo because I don't think we want to create many objects. The numbers with 100 tables are in the PR description, but I also got some smaller-scale numbers: runtime of advance_local_inputs() and peek latency with just one user table (and the default system tables).

#### On top of f4357ae690ac306bc6cb9d19e3e1550771f7f431

A "fresh" peek is a peek that comes in right after we downgraded the read/write
timestamp according to the timestamp interval.

baseline:
 - runtime of `advance_local_inputs()`: ~120ms
 - latency of a "fresh" peek: ~160ms

pg-threadpool:
 - runtime of `advance_local_inputs()`: ~120ms
 - latency of a "fresh" peek: ~160ms

pg-threadpool + downgrade-worker:
 - runtime of `advance_local_inputs()`: ~70ms
 - latency of a "fresh" peek: ~100ms

pg-threadpool + concurrent-append:
 - runtime of `advance_local_inputs()`: ~85ms
 - latency of a "fresh" peek: ~90ms

pg-threadpool + downgrade-worker + concurrent-append:
 - runtime of `advance_local_inputs()`: ~25ms
 - latency of a "fresh" peek: ~36ms

pg-threadpool + concurrent-downgrade + concurrent-append:
 - runtime of `advance_local_inputs()`: ~50ms
 - latency of a "fresh" peek: ~50ms

The last set of numbers is this PR. The one with downgrade-worker is with downgrading moved out into a separate task. It's not surprising that the downgrade-worker change roughly halves the runtime of advance_local_inputs(), because append() does #num-tables compare_and_append() calls and update_read_capabilities() does #num-tables downgrade_since() calls.

@aljoscha force-pushed the storage-parallelize-persist-ops branch from b36b393 to d95100b on June 2, 2022 09:09
Before, we would set off each `compare_and_append()` call and
individually await each future. Now we collect the futures of all calls
in a `FuturesUnordered` and await them concurrently.
@aljoscha force-pushed the storage-parallelize-persist-ops branch from d95100b to 930310b on June 2, 2022 11:26
@aljoscha merged commit d7cc59d into MaterializeInc:main on Jun 2, 2022
@aljoscha deleted the storage-parallelize-persist-ops branch on June 2, 2022 16:07
@aljoscha (Contributor, Author) commented Jun 2, 2022

TFTR!

}

let change_batches = futs.try_collect::<Vec<_>>().await?;
Contributor

Can anything bad happen if one of these futures is cancelled in the middle of an await? For example, one could be in the middle of compare_and_append, but another finishes with an error, and the first one is dropped.

Contributor

Glad you're thinking about this! We don't do anything to test it yet, but persist intends to be cancel-safe 🤷

Contributor (Author)

I think it should be fine; at least I don't think it makes things worse. Before, a compare_and_append could already be cancelled right in the middle of something if the future returned from append() was cancelled.
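To make the scenario concrete, here is a small self-contained sketch with stand-in operations (not the controller code): with try_collect, the first Err resolves the combined future, and anything still pending inside the FuturesUnordered is dropped, i.e. cancelled mid-await.

```rust
use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, TryStreamExt};
use futures::FutureExt;
use std::time::Duration;

// Stand-in for a compare_and_append that is mid-way through its work.
async fn slow_append() -> Result<(), String> {
    tokio::time::sleep(Duration::from_secs(10)).await;
    Ok(())
}

// Stand-in for a compare_and_append that fails, e.g. on an upper mismatch.
async fn failing_append() -> Result<(), String> {
    Err("upper mismatch".to_string())
}

#[tokio::main]
async fn main() {
    let futs: FuturesUnordered<BoxFuture<'static, Result<(), String>>> =
        FuturesUnordered::new();
    futs.push(slow_append().boxed());
    futs.push(failing_append().boxed());

    // `try_collect` resolves with the first `Err`; at that point the whole
    // `FuturesUnordered` is dropped and the still-pending `slow_append` is
    // cancelled in the middle of its await.
    let res: Result<Vec<()>, String> = futs.try_collect().await;
    assert!(res.is_err());
}
```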
