New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compute: add token-controlled fuse in feedback edges #18718
Conversation
6ebd923
to
e102225
Compare
ff7b46e
to
71ab6c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
I think this deserves a test. What do you think about salvaging the one from #18442? It is meant to test active dataflow cancellation, but it really only test cancellation of divergent dataflows.
@aalexandrov @ggevay FYI. This will fix #16800, but you might want to leave that ticket open for tracking the hard limit. |
Oh, this is really great, thanks @petrosagg! I'm very happy about this! I tried it in a little bit of manual testing by hitting Ctrl+C on WMR selects and by dropping WMR materialized views, and it seems to work. But I'll leave the code review to people who know Timely better. |
This PR implements an alternative solution to recursive dataflow cancellation where tokeneds passthrough operators are rendered across all feedback edges of the dataflow. The tokens are held by the exported sinks and indexes in the same way source and imported index tokens are. When the token gets dropped iteration will immediatley stop, akin to having just executed a `break` instruction, and even divergent dataflows quickly shut down. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
6637d52
to
5d055eb
Compare
Thanks! We should probably change the name of this file and the comments inside it, since it's not active dataflow cancellation that we are testing, but shutdown of divergent WMR dataflows. I can push another commit with that later if you don't get to it first! Edit: I added a fixup commit that makes these changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this approach is valuable! Thanks for the PR.
I added some comments around naming and describing concepts, which I think are important to resolve before merging. The concept as a whole seems good, though.
This PR can introduce a consistency when logging errors to Sentry. At the moment, we only ever process whole batches of data, which might not be the case with this PR. For reductions, this might mean that they'll see negative accumulations and log it to Sentry. Ideally, we prevent these logs, but I'm fine if it's not part of the same PR (but please make sure it'll be part of the same release.)
src/timely-util/src/operator.rs
Outdated
|
||
/// Wraps the stream with a passthrough operator that will be shut down when the provided | ||
/// token cannot be upgraded anymore. | ||
fn fused(&self, token: Weak<()>) -> Stream<G, D1>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not in favor of the name fused
. I think the function should represent a verb, not a noun, and fused mostly means "some things mixed together," which is not the meaning we want to have here. I liked the old with_token
(?) better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation can be more specific to what the implementation actually provides. At the moment, it will gate all data once the token
is dropped, but still passes through progress traffic. For some dataflows, this might be surprising if they assume that dropping the token also advances the frontier to the empty frontier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having "fuse" in the name because it evokes the notion of a thing that stops stuff from flowing through once it is triggered. (It would be nicer to have a water analogy rather than one from electricity, but I can't think of one.) "token" doesn't evoke the same associations for me, it's pretty generic. But it does have a specific meaning in the context of MZ dataflows, so I think it would be fine too.
A couple more entries to the bike shedding:
with_fuse
with_circuit_breaker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The noun form of fuse is distinct from the verb form here, so I prefer with_fuse
over fused
. Circuit breakers normally can be reset, unlike fuses, so I'd prefer to avoid that here.
This is something I have on my plate because we also need it for making persist_source exit early, and generally for everything we do for making dataflow shutdown faster. We can just leave this PR open until we have that? Should be ready somewhen next week. |
d71d3b8
to
f3e9320
Compare
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
f3e9320
to
c614544
Compare
This is from the "feature flags" Nightly job:
The query to cause the crash is:
After you fix it, can you make sure this query is put in an SLT file -- this should have been detected in the normal CI rather than in a random test in Nightly. |
I'm sorry, this was a stupid bug in my code that broke recursive rendering so all recursive queries will panic. That's why testdrive 0 failed too. I'm pushing a fix now |
c614544
to
2867f7d
Compare
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
2867f7d
to
511de41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled again and ran a stress test with 50 concurrent divergent WMR queries. All WMR dataflows both inside cursors and inside INSERT ... SELECT are being cancelled correctly. At the end of the test computed CPU usage drops to zero, so it is all good.
This PR now needs to be adapted to the changes made in #18772 (specifically 2a9097f). I can take care of that if you don't mind @petrosagg. There is some uncertainty about aggressively advancing the fuse operator's output to the empty frontier: #18760 (comment). I think if we cannot come to an agreement here quickly, we should merge this PR with the initial version of the fuse operator that doesn't have special handling of its output frontier, to unblock divergent WMR cancellation. |
The previous implementation of `with_token` progress handling was flawed and prevented downstream operators from seeing any progress. In addition, there is some uncertainty about the safety of aggressively advancing to the empty frontier on token drop within dataflow loops. So this commit reverts back to the previous behavior of only dropping data updates on token drop. We can add the frontier advancement back in as a follow up, if we agree that it is safe.
As the sqllogictests found, the last commit that made the |
I think this PR is ready for the final review. All comments have been addressed AFAICT. @vmarcos and @tokenrove can you take a look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me; I added a few smaller comments regarding code organization / comments, but nothing that would stand in the way of merging.
TFTRs! And thanks @petrosagg for the implementation! |
This PR implements an alternative solution to recursive dataflow cancellation where tokened passthrough operators are rendered across all feedback edges of the dataflow.
The tokens are held by the exported sinks and indexes in the same way source and imported index tokens are.
When the token gets dropped iteration will immediatley stop, akin to having just executed a
break
instruction, and even divergent dataflows quickly shut down.Motivation
Tips for reviewer
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way) and therefore is tagged with aT-proto
label.