compute: tokenize error logging #18772

teskje · 2023-04-14T12:42:34Z

This PR modifies the error logging behavior of compute operators to only log messages to Sentry when the dataflow is not in the process of shutting down. This avoids producing noise caused by dataflow operators emitting partial update sets when they are cancelled. It is a precondition for enabling the faster dataflow shutdown methods described in the "Faster Dataflow Shutdown" design.

The shutdown checking is performed through a shutdown token that is wrapped inside a new ErrorLogger type that gets passed around to the various operator renderers. An ErrorLogger encapsulates both shutdown checking and our warn!/error! logging format.

Motivation

This PR adds a known-desirable feature.

Enables #18760.

Tips for reviewer

Look at the commits and their descriptions separately!

The first commit moves the MaybeValidatingRow trait into a new mz_compute::render::errors module, which is also where the new ErrorLogger type lives.
The second commit adds the ErrorLogger type.
The third commit attaches a shutdown token to the rendering Context and updates the top-level rendering code to make sure the token gets added to the exports depending on it.
The fourth commits adjusts error logging in dataflow operators to use the ErrorLogger, rather than using the warn!/error! macros directly.
The fifth commit makes most of the free-standing reduce and top-k rendering functions methods of the Context type. Rather than requiring debug_name and error_logger arguments, they can now get these fields directly from the context, reducing the number of function arguments that need to be passed.

Unfortunately, the diff for the last commit is quite hairy, since it changes not only the indentation level, but also causes rustfmt to re-wrap many of the higher-indented lines. Hiding whitespace changes helps a bit, but still produces a diff much larger than it should be. Since the last commit is a pure code organization change and doesn't affect functionality, I would be fine with dropping it if you'd rather not review it.

One thought about the ErrorLogger type: It would be quite easy to also wrap the debug_name, in addition to the token, inside it. Then we wouldn't have to pass that around in rendering anymore. The drawback is that we'd lose our current flexibility in formatting the "details" part. We'd probably need to put the dataflow name as a fixed prefix of the message, so constructs like "found invalid accumulations in dataflow {debug_name}" would no longer be possible. This is now implemented in the sixth commit!

If anyone has an idea how to actually test this change, I'd be happy to hear it! I couldn't come up with a way to provoke invalid retractions during dataflow shutdown (they should be impossible today), much less doing so in a reliable way.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/A

vmarcos

The first four commits look good to me; I only have a minor remark about some warning messages that appear to have been shortened in some places.

I am uncertain about the last commit. The previous code organization was already in place prior to my start in Materialize and I do not think I have a clear understanding of why these functions were not part of Context to start with or otherwise why they were not collected under some other abstraction. Perhaps we can ask for someone with historical background to manifest before moving on with this change? Also, I did not give enough thinking to the question of whether there may also be a better code organization that we could think of rather than having all of the functions collected under Context.

src/compute/src/render/reduce.rs

tokenrove · 2023-04-17T12:11:45Z

src/compute/src/render/reduce.rs

@@ -508,8 +541,12 @@ where
                        continue;
                    }
                    let message = "Non-positive multiplicity in DistinctBy Retractions";
-                    warn!(?row, ?count, debug_name, "[customer-data] {message}");


One downside to this approach is that we lose structured field logging. How ugly would it be to supply some macros that take an ErrorLogger?

I've not been able to come up with anything ergonomic so far. We could probably introduce something like:

log_dataflow_error!( self.error_logger(), "Non-positive multiplicity in DistinctBy Retractions", ?row, ?count, debug_name, "[customer-data] Non-positive multiplicity in DistinctBy Retractions", );

Doesn't seem too bad, actually. Especially because we could make it possible to reference the message in the details format string, and make the debug_name be implicitly added to all logs.

log_dataflow_error!( self.error_logger(), "Non-positive multiplicity in DistinctBy Retractions", ?row, ?count, "[customer-data] {message}", );

I'm just not sure whether my macro foo is up to this task :)

In any case, I'd rather make this the scope of another PR, since I foresee some bikeshedding.

Nice! I'm also happy to do this if you don't want to deal with it.

src/compute/src/render/errors.rs

tokenrove · 2023-04-17T12:18:41Z

src/compute/src/render/reduce.rs

-                self.error_logger(),
-            )
-            .leave_region()
+            self.render_reduce_plan(reduce_plan, ok, err, key_arity)


I'm a fan of this change (throughout).

src/compute/src/render/context.rs

tokenrove

👍 modulo some of the details we discussed, and as mentioned, we can do a followup to this for the most bikesheddable bits.

This commit introduces a new `ErrorLogger` type that will be used for logging errors inside dataflows. This type wraps a dataflow shutdown token and encapsules both shutdown-checked logging and our `warn!`/`error!` approach to logging dataflow errors.

This commit extends the rendering `Context` by a `shutdown_token` field that can be used by rendered operators to check whether the dataflow is in the progess of shutting down. The shutdown token is replaced for each entry in `objects_to_build`, to support finer-grained dataflow shutdown in a future where dataflows can have more than one export. This commit also adjusts variable naming to make it clear that the various `Context::export_*` methods receive a list of all dependency IDs, which include `objects_to_build` IDs, rather than just IDs of dataflow imports. As a consequence, there is no need to adjust the export methods to ensure that they also pick up the new object tokens.

This commit modifies the error logging behavior of compute operators to only log messages when the dataflow is not in the process of shutting down. This avoids producing noise caused by dataflow operators emitting partial update sets when they are cancelled. The shutdown checking is performed through a shutdown token and wrapped inside the new `ErrorLogger` type that gets passed around to the various operator renderers.

This commit converts all freestanding reduce and top-k rendering functions that want to do error logging into methods of the `Context` type. The benefit is that we can rid of two parameters, `debug_name` and `error_logger`, in exchange for a new `&self` parameter, and changed indentation. Note that the top-k functions where previously definied inside the scope of `Context::render_topk` (for no apparent reason). Making them methods of `Context` actually reduces indentation.

Everywhere we log something in compute rendering, we include the debug name of the dataflow in the logged details. It makes sense to build the dataflow name logging directly into the `ErrorLogger`, so we don't have to repeat ourselves at the callsites.

teskje · 2023-04-17T16:03:24Z

I am uncertain about the last commit. The previous code organization was already in place prior to my start in Materialize and I do not think I have a clear understanding of why these functions were not part of Context to start with or otherwise why they were not collected under some other abstraction. Perhaps we can ask for someone with historical background to manifest before moving on with this change?

I asked in Slack and there doesn't seem to be a specific reason.

This should be ready for another look! I've addressed the comments and added another commit that pulls the debug_name printing into the ErrorLogger as well, making the callsites a bit more DRY in the process.

vmarcos

Thanks for asking in Slack; the changes LGTM! I think it is quite tricky to test this at the moment, but maybe we can think about an approach when we allow persist_source to stop emitting early as well.

teskje · 2023-04-17T18:28:24Z

TFTRs despite the huge diff!

teskje force-pushed the tokenize-error-logging branch from 702204a to eccdfa0 Compare April 14, 2023 14:02

compute: move MaybeValidatingRow into errors module

16aa56c

teskje force-pushed the tokenize-error-logging branch 2 times, most recently from 55f4749 to 30b98ee Compare April 17, 2023 08:37

teskje marked this pull request as ready for review April 17, 2023 09:02

teskje requested review from a team, tokenrove and vmarcos April 17, 2023 09:02

teskje changed the title ~~Tokenize error logging~~ compute: tokenize error logging Apr 17, 2023

vmarcos reviewed Apr 17, 2023

View reviewed changes

src/compute/src/render/reduce.rs Outdated Show resolved Hide resolved

tokenrove reviewed Apr 17, 2023

View reviewed changes

src/compute/src/render/context.rs Outdated Show resolved Hide resolved

tokenrove approved these changes Apr 17, 2023

View reviewed changes

teskje force-pushed the tokenize-error-logging branch 2 times, most recently from 4158a99 to 4911650 Compare April 17, 2023 15:41

teskje added 4 commits April 17, 2023 17:43

teskje force-pushed the tokenize-error-logging branch from 4911650 to bcc7f8d Compare April 17, 2023 15:43

teskje requested review from vmarcos and tokenrove April 17, 2023 16:03

vmarcos approved these changes Apr 17, 2023

View reviewed changes

tokenrove approved these changes Apr 17, 2023

View reviewed changes

teskje merged commit d6448ae into MaterializeInc:main Apr 17, 2023

teskje mentioned this pull request Apr 18, 2023

compute: add token-controlled fuse in feedback edges #18718

Merged

5 tasks

teskje deleted the tokenize-error-logging branch April 19, 2023 10:46

teskje mentioned this pull request Apr 19, 2023

Refactor or document guidance on warn!/error! pattern in Compute code #18214

Open

materialize-bot mentioned this pull request Apr 21, 2023

release: v0.52.0 required reviews #18907

Closed

12 tasks

vmarcos mentioned this pull request Jul 10, 2023

Improve synchronization when dropping shutdown tokens #20442

Open

def- mentioned this pull request Jun 5, 2024

error: executing query failed: db error: ERROR: Evaluation error: internal error: Invalid data in source, saw net-zero records for key ("C") with non-zero accumulation in accumulable aggregate #27453

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compute: tokenize error logging #18772

compute: tokenize error logging #18772

teskje commented Apr 14, 2023 •

edited

Loading

vmarcos left a comment

tokenrove Apr 17, 2023

teskje Apr 17, 2023

tokenrove Apr 17, 2023

tokenrove Apr 17, 2023

tokenrove left a comment

teskje commented Apr 17, 2023

vmarcos left a comment

teskje commented Apr 17, 2023

compute: tokenize error logging #18772

compute: tokenize error logging #18772

Conversation

teskje commented Apr 14, 2023 • edited Loading

Motivation

Tips for reviewer

Checklist

vmarcos left a comment

Choose a reason for hiding this comment

tokenrove Apr 17, 2023

Choose a reason for hiding this comment

teskje Apr 17, 2023

Choose a reason for hiding this comment

tokenrove Apr 17, 2023

Choose a reason for hiding this comment

tokenrove Apr 17, 2023

Choose a reason for hiding this comment

tokenrove left a comment

Choose a reason for hiding this comment

teskje commented Apr 17, 2023

vmarcos left a comment

Choose a reason for hiding this comment

teskje commented Apr 17, 2023

teskje commented Apr 14, 2023 •

edited

Loading