Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute: tokenize error logging #18772

Merged
merged 6 commits into from
Apr 17, 2023

Conversation

teskje
Copy link
Contributor

@teskje teskje commented Apr 14, 2023

This PR modifies the error logging behavior of compute operators to only log messages to Sentry when the dataflow is not in the process of shutting down. This avoids producing noise caused by dataflow operators emitting partial update sets when they are cancelled. It is a precondition for enabling the faster dataflow shutdown methods described in the "Faster Dataflow Shutdown" design.

The shutdown checking is performed through a shutdown token that is wrapped inside a new ErrorLogger type that gets passed around to the various operator renderers. An ErrorLogger encapsulates both shutdown checking and our warn!/error! logging format.

Motivation

  • This PR adds a known-desirable feature.

Enables #18760.

Tips for reviewer

Look at the commits and their descriptions separately!

  • The first commit moves the MaybeValidatingRow trait into a new mz_compute::render::errors module, which is also where the new ErrorLogger type lives.
  • The second commit adds the ErrorLogger type.
  • The third commit attaches a shutdown token to the rendering Context and updates the top-level rendering code to make sure the token gets added to the exports depending on it.
  • The fourth commits adjusts error logging in dataflow operators to use the ErrorLogger, rather than using the warn!/error! macros directly.
  • The fifth commit makes most of the free-standing reduce and top-k rendering functions methods of the Context type. Rather than requiring debug_name and error_logger arguments, they can now get these fields directly from the context, reducing the number of function arguments that need to be passed.

Unfortunately, the diff for the last commit is quite hairy, since it changes not only the indentation level, but also causes rustfmt to re-wrap many of the higher-indented lines. Hiding whitespace changes helps a bit, but still produces a diff much larger than it should be. Since the last commit is a pure code organization change and doesn't affect functionality, I would be fine with dropping it if you'd rather not review it.

One thought about the ErrorLogger type: It would be quite easy to also wrap the debug_name, in addition to the token, inside it. Then we wouldn't have to pass that around in rendering anymore. The drawback is that we'd lose our current flexibility in formatting the "details" part. We'd probably need to put the dataflow name as a fixed prefix of the message, so constructs like "found invalid accumulations in dataflow {debug_name}" would no longer be possible. This is now implemented in the sixth commit!

If anyone has an idea how to actually test this change, I'd be happy to hear it! I couldn't come up with a way to provoke invalid retractions during dataflow shutdown (they should be impossible today), much less doing so in a reliable way.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/A

@teskje teskje force-pushed the tokenize-error-logging branch 2 times, most recently from 55f4749 to 30b98ee Compare April 17, 2023 08:37
@teskje teskje marked this pull request as ready for review April 17, 2023 09:02
@teskje teskje requested review from a team, tokenrove and vmarcos April 17, 2023 09:02
@teskje teskje changed the title Tokenize error logging compute: tokenize error logging Apr 17, 2023
Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first four commits look good to me; I only have a minor remark about some warning messages that appear to have been shortened in some places.

I am uncertain about the last commit. The previous code organization was already in place prior to my start in Materialize and I do not think I have a clear understanding of why these functions were not part of Context to start with or otherwise why they were not collected under some other abstraction. Perhaps we can ask for someone with historical background to manifest before moving on with this change? Also, I did not give enough thinking to the question of whether there may also be a better code organization that we could think of rather than having all of the functions collected under Context.

src/compute/src/render/reduce.rs Outdated Show resolved Hide resolved
@@ -508,8 +541,12 @@ where
continue;
}
let message = "Non-positive multiplicity in DistinctBy Retractions";
warn!(?row, ?count, debug_name, "[customer-data] {message}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One downside to this approach is that we lose structured field logging. How ugly would it be to supply some macros that take an ErrorLogger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not been able to come up with anything ergonomic so far. We could probably introduce something like:

log_dataflow_error!(
    self.error_logger(),
    "Non-positive multiplicity in DistinctBy Retractions",
    ?row, ?count, debug_name,
    "[customer-data] Non-positive multiplicity in DistinctBy Retractions",
);

Doesn't seem too bad, actually. Especially because we could make it possible to reference the message in the details format string, and make the debug_name be implicitly added to all logs.

log_dataflow_error!(
    self.error_logger(),
    "Non-positive multiplicity in DistinctBy Retractions",
    ?row, ?count,
    "[customer-data] {message}",
);

I'm just not sure whether my macro foo is up to this task :)

In any case, I'd rather make this the scope of another PR, since I foresee some bikeshedding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'm also happy to do this if you don't want to deal with it.

src/compute/src/render/errors.rs Show resolved Hide resolved
self.error_logger(),
)
.leave_region()
self.render_reduce_plan(reduce_plan, ok, err, key_arity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of this change (throughout).

Copy link
Contributor

@tokenrove tokenrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 modulo some of the details we discussed, and as mentioned, we can do a followup to this for the most bikesheddable bits.

@teskje teskje force-pushed the tokenize-error-logging branch 2 times, most recently from 4158a99 to 4911650 Compare April 17, 2023 15:41
This commit introduces a new `ErrorLogger` type that will be used for
logging errors inside dataflows. This type wraps a dataflow shutdown
token and encapsules both shutdown-checked logging and our
`warn!`/`error!` approach to logging dataflow errors.
This commit extends the rendering `Context` by a `shutdown_token` field
that can be used by rendered operators to check whether the dataflow is
in the progess of shutting down. The shutdown token is replaced for each
entry in `objects_to_build`, to support finer-grained dataflow shutdown
in a future where dataflows can have more than one export.

This commit also adjusts variable naming to make it clear that the
various `Context::export_*` methods receive a list of all dependency
IDs, which include `objects_to_build` IDs, rather than just IDs of
dataflow imports. As a consequence, there is no need to adjust the
export methods to ensure that they also pick up the new object tokens.
This commit modifies the error logging behavior of compute operators to
only log messages when the dataflow is not in the process of shutting
down. This avoids producing noise caused by dataflow operators emitting
partial update sets when they are cancelled.

The shutdown checking is performed through a shutdown token and wrapped
inside the new `ErrorLogger` type that gets passed around to the various
operator renderers.
This commit converts all freestanding reduce and top-k rendering
functions that want to do error logging into methods of the `Context`
type. The benefit is that we can rid of two parameters, `debug_name` and
`error_logger`, in exchange for a new `&self` parameter, and changed
indentation.

Note that the top-k functions where previously definied inside the scope
of `Context::render_topk` (for no apparent reason). Making them methods
of `Context` actually reduces indentation.
Everywhere we log something in compute rendering, we include the debug
name of the dataflow in the logged details. It makes sense to build the
dataflow name logging directly into the `ErrorLogger`, so we don't have
to repeat ourselves at the callsites.
@teskje
Copy link
Contributor Author

teskje commented Apr 17, 2023

I am uncertain about the last commit. The previous code organization was already in place prior to my start in Materialize and I do not think I have a clear understanding of why these functions were not part of Context to start with or otherwise why they were not collected under some other abstraction. Perhaps we can ask for someone with historical background to manifest before moving on with this change?

I asked in Slack and there doesn't seem to be a specific reason.

This should be ready for another look! I've addressed the comments and added another commit that pulls the debug_name printing into the ErrorLogger as well, making the callsites a bit more DRY in the process.

Copy link
Contributor

@vmarcos vmarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for asking in Slack; the changes LGTM! I think it is quite tricky to test this at the moment, but maybe we can think about an approach when we allow persist_source to stop emitting early as well.

@teskje teskje merged commit d6448ae into MaterializeInc:main Apr 17, 2023
@teskje
Copy link
Contributor Author

teskje commented Apr 17, 2023

TFTRs despite the huge diff!

@teskje teskje deleted the tokenize-error-logging branch April 19, 2023 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants