dataflow/logging: Activate on data #8539
Conversation
Force-pushed from b39c13b to cb53747
Adding some evidence of the performance impact. I'm testing dbbench with a constant throughput of 400. dbbench config:
Master: Memory ~2.5GiB
PR applied: Memory ~1GiB
Force-pushed from cb53747 to 36fd6a1
I figured the activation ack shouldn't come from the logging demux operator but directly from the replay operator. It should prevent some liveness issues that we'd otherwise risk (though only for up to the introspection interval).
Force-pushed from 36fd6a1 to a5aa3a1
Setting the threshold to 64 batches seems to have a beneficial effect:
This would increase memory consumption to
I think this mostly looks good, with the caveat that the built-in delaying mechanism is probably non-obvious enough to want to fix up before pushing anywhere else. I can see the goal there, but I wonder if there is a cleaner way to do it (e.g. have the logger receive buffers, and notify at a threshold number of events rather than number of calls).
```rust
/// Acknowledge the activation, which enables new activations to be scheduled.
pub fn ack(&mut self) {
    self.inner.borrow_mut().ack()
}
```
Most timely activators do not have this. Is there an explainer for why this is important, and how to use it correctly?
Maybe relatedly, looking at the use: is this important to cause the delaying, or could one just count up, activate every 32 (or whatever) calls to `activate`, and reset the counter on each activation? It seems like this allows a pre-emptive resetting, which could be helpful in preventing some spurious activations, but probably no more than the native activations (i.e. once per second).
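The counting alternative suggested here can be sketched as follows. This is a minimal illustration, not code from this PR; the name `CountingActivator` and the `forwarded` counter (a stand-in for handing an activation to timely's scheduler) are hypothetical:

```rust
/// Hypothetical sketch of the counter-based alternative: forward only every
/// `threshold`-th call to `activate`, resetting the counter whenever an
/// activation actually fires.
struct CountingActivator {
    count: usize,
    threshold: usize,
    /// Stand-in for activations handed to the real scheduler.
    forwarded: usize,
}

impl CountingActivator {
    fn new(threshold: usize) -> Self {
        Self { count: 0, threshold, forwarded: 0 }
    }

    fn activate(&mut self) {
        self.count += 1;
        if self.count >= self.threshold {
            // Reset on each forwarded activation.
            self.count = 0;
            self.forwarded += 1;
        }
    }
}

fn main() {
    let mut a = CountingActivator::new(32);
    for _ in 0..64 {
        a.activate();
    }
    // 64 calls at a threshold of 32 yield exactly 2 forwarded activations.
    assert_eq!(a.forwarded, 2);
}
```

Note that in this variant the counter resets only when an activation fires, so there is no way to pre-emptively reset it, which is exactly the difference being discussed.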
It is important that once the operator has been activated, we do not activate it again until it was scheduled. Without this mechanism, we tend to queue activations, which potentially cause more work to consolidate but do not trigger additional or earlier activations. This seems to be an anti-pattern, but is safer in that it doesn't try to put more load on the system once it is under load.
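The ack-gating described here can be sketched roughly as below. This is an illustration under assumptions, not the PR's exact API: the names `RcActivator`, `active`, and `scheduled` are hypothetical, and the `scheduled` counter stands in for a real timely activation. The point is that repeated `activate` calls between acks collapse into a single scheduled activation:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Sketch of an activator that suppresses repeated activations until the
/// operator acknowledges the previous one.
#[derive(Default)]
struct RcActivator {
    inner: Rc<RefCell<ActivatorInner>>,
}

#[derive(Default)]
struct ActivatorInner {
    /// True once an activation has been issued and not yet acknowledged.
    active: bool,
    /// Count of activations actually forwarded (stand-in for the scheduler).
    scheduled: usize,
}

impl RcActivator {
    /// Request an activation; only the first request between acks schedules.
    fn activate(&self) {
        let mut inner = self.inner.borrow_mut();
        if !inner.active {
            inner.active = true;
            inner.scheduled += 1;
        }
    }

    /// Acknowledge the activation, enabling new activations to be scheduled.
    fn ack(&self) {
        self.inner.borrow_mut().active = false;
    }

    fn scheduled(&self) -> usize {
        self.inner.borrow().scheduled
    }
}

fn main() {
    let a = RcActivator::default();
    a.activate();
    a.activate(); // suppressed: previous activation not yet acked
    assert_eq!(a.scheduled(), 1);
    a.ack(); // the operator ran; allow scheduling again
    a.activate();
    assert_eq!(a.scheduled(), 2);
}
```

Without the `active` flag, the two back-to-back `activate` calls would queue two activations even though the operator would consolidate all pending work on its first run.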
src/dataflow/src/activator.rs
Outdated
```rust
}

impl ActivatorInner {
    const THRESHOLD: usize = 32;
```
This seems to be an important part of the explanation, I think. Like, "you shouldn't expect an activation unless you activate this many times". I can see why that is useful in our code, but it seems a bit opaque and mysterious. If it's an important concept, maybe promote it to a struct member and make it part of the `new()` call?
I changed the PR to expose the parameter and make it part of the `new()` call. Also added more documentation to describe its purpose.
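The resulting shape might look like the sketch below, with the threshold as a struct member supplied through `new()` rather than a hidden constant. Field and method names here are illustrative, not necessarily the PR's exact API:

```rust
/// Sketch: the activation threshold is configurable per activator.
struct ActivatorInner {
    /// Number of `activate` calls required before an activation is scheduled.
    /// Chosen as a trade-off between memory buildup and latency.
    activate_threshold: usize,
    /// Calls observed since the last scheduled activation.
    count: usize,
}

impl ActivatorInner {
    /// `threshold`: how many `activate` calls accumulate before scheduling.
    fn new(threshold: usize) -> Self {
        Self { activate_threshold: threshold, count: 0 }
    }
}

fn main() {
    // The caller now chooses the trade-off explicitly, e.g. 64 batches.
    let inner = ActivatorInner::new(64);
    assert_eq!(inner.activate_threshold, 64);
    assert_eq!(inner.count, 0);
}
```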
Force-pushed from a5aa3a1 to b0af0e6
The underlying log infrastructure tries to ensure that log buffers are limited to 8KiB in size, but they can be smaller. I'd like to keep the intuition that the number of published buffers corresponds to a size in memory rather than a number of elements, which the current approach fulfills.
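That intuition reduces to simple arithmetic: because each published buffer is capped in size, a buffer-count threshold directly bounds the worst-case unprocessed memory. A small sketch (the helper name is hypothetical):

```rust
/// Each published log buffer is capped at 8 KiB (it may be smaller).
const MAX_BUFFER_BYTES: usize = 8 * 1024;

/// Worst-case unprocessed bytes before a data-driven activation fires,
/// given a threshold expressed in published buffers.
fn worst_case_bytes(threshold_buffers: usize) -> usize {
    threshold_buffers * MAX_BUFFER_BYTES
}

fn main() {
    // Threshold of 32 buffers: at most 256 KiB outstanding.
    assert_eq!(worst_case_bytes(32), 256 * 1024);
    // Threshold of 64 buffers: at most 512 KiB outstanding.
    assert_eq!(worst_case_bytes(64), 512 * 1024);
}
```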
Currently, the logging dataflows activate once every introspection
interval to process data. This can cause a significant buildup of data
before it is processed.
In addition to activating the replay operator periodically, also
activate it once a sufficient amount of data has been accumulated.
Ideally, the data processing should be activated as soon as some data is
available, but this could cause a situation where activating causes
additional data to be produced.
For this reason, the dataflow activates after 32 batches have been
published by the logging infrastructure. This seems to be a reasonable
trade-off between memory (currently at most 8KiB per batch) and latency.
The time-based activation is still required in case there is no data for
a specific dataflow. In this case, we still need to advance the clock,
which the time-based activation takes care of.
Signed-off-by: Moritz Hoffmann mh@materialize.com