Surface dataflow errors (indexes and sinks) in a system table/view #7804

aljoscha · 2021-08-11T15:14:00Z

We need to surface the errors that we get in the "errors" part of a CollectionBundle, which we usually get from Context::lookup_id() during rendering. Here's an example call site, where we ignore the errors:

materialize/src/dataflow/src/render/sinks.rs

Line 60 in 480aa30

let (collection, _err_collection) = self

I mention indexes and sinks in the title because these are the two "items" that we currently render dataflow operator graphs for, and which we can identify and join with information in the catalog.

Concretely, we should have new system views (or tables, or logs) mz_index_errors and mz_sink_errors that surface the error (probably as a string) along with the GlobalId (also as a string) of the index or sink in which the error occurred. We could also just have one view, mz_dataflow_errors. We'll see how ergonomic either option is when we get there, but this shouldn't be the hard part.

Example of a view that could be built with this (from Nikhil):

CREATE VIEW mz_catalog.mz_index_status AS
SELECT
  i.*,
  EXISTS (SELECT 1 FROM mz_index_errors e WHERE e.id = i.id) AS presently_has_error
FROM mz_indexes i

Some things to consider:

Is "timely logging" the right solution for this?
Do we want to retain and serve all the errors? What's a good cleanup policy?

aljoscha · 2021-08-24T10:30:01Z

@frankmcsherry & @benesch We discussed this a while ago in Slack and you both had opinions. The one remaining question I have is what the timeline of the error views should be. We could have one error view per timeline that we support, or we could say the error views are all in the system timeline (same as the other system views/logs/tables) and add the timestamp of the error (which is in the timeline of the index/sink) as a data column.

I think the latter is easier for users, because they don't need to fiddle with different tables and it makes the error views easily joinable with the other system views. They would roughly reflect "errors as of now", independent of where we are in the timeline of the dataflow. Which I find useful because it allows answering the question "can I query this thing now".

What do you think?
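To make the second option concrete, here is a minimal Rust sketch using only the standard library. The `DataflowErrorRow` type, its field names, and the `presently_has_error` function are hypothetical illustrations, not existing Materialize code. The sketch treats the error's own timestamp as a plain data column and answers "does this item have an error right now" by matching on the id alone:

```rust
use std::collections::HashSet;

/// Hypothetical row of a single error view that lives in the system timeline.
#[derive(Debug, Clone, PartialEq)]
pub struct DataflowErrorRow {
    /// GlobalId of the index or sink, rendered as a string (assumed format).
    pub id: String,
    /// The error, rendered as a string.
    pub error: String,
    /// Timestamp of the error in the timeline of the index/sink.
    /// It is carried as data only and never used to order or filter rows here.
    pub error_ts: u64,
}

/// For each catalog item id, report whether any error row currently exists for it.
/// Items from different timelines are handled uniformly because only `id` is compared.
pub fn presently_has_error(
    item_ids: &[&str],
    errors: &[DataflowErrorRow],
) -> Vec<(String, bool)> {
    let erroring: HashSet<&str> = errors.iter().map(|e| e.id.as_str()).collect();
    item_ids
        .iter()
        .map(|id| (id.to_string(), erroring.contains(id)))
        .collect()
}
```

Because the join key is just the id, this shape would stay joinable with the other system views no matter which timeline the erroring dataflow runs in.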

aljoscha · 2021-08-25T11:16:17Z

Also, this one might be affected by whatever comes out of #8008.

frankmcsherry · 2021-08-25T14:58:51Z

If they are in the errors part of a CollectionBundle then they need to be in the timeline of the collection.

frankmcsherry · 2021-08-25T15:10:40Z

Can you say more about why we need to surface them? I can see "want" to surface them, but what is the actual requirement? If they are errors that happen in the course of evaluation, we capture them in an arrangement already. Is the goal to have some broad view over a timeline of all errors in things folks have written? Is it instead to get a system-wide view of the errors that are happening in the system?

If it is important for the errors to be aligned with the times at which they are produced (e.g. a DivByZero error which corresponds to specific input data, and may eventually be retracted at a specific time) then they should be in the timeline of the collection. If it is either not important, or hard to put them in the timeline (e.g. BufferBuildingUpInSinkIDKWHATTODO) then logging is probably a better answer.

aljoscha · 2021-08-25T16:20:40Z

It's intended as a way to get a system-wide overview of the status of indexes and sinks (and views, in the end). Right now, the only way of figuring out whether a view/index is wedged is a) try and peek/query that one specific view/index, or b) hope that there is something in the logs. Especially in a cloud setting, I don't think trawling through logs is very ergonomic.

For context:

this is the PR for the design proposal: docs: add design doc for consistent error handling and surfacing connector failures #7584
and there was this quick internal discussion: https://materializeinc.slack.com/archives/C01CFKM1QRF/p1629140382231800

(I'm using view and index somewhat interchangeably above because when querying a view you transitively are querying the index, and get its errors. But yes, the dataflow layer only knows indexes and sinks.)

aljoscha · 2021-09-01T14:21:36Z

Logging the results of an internal discussion on this:

connector errors (for example sink or source) shouldn't be added to the already existing "err" collections/arrangements, mostly because they are not deterministic/reproducible. A division by zero error, by contrast, is deterministic and can be removed by retracting the offending input.
There are different classes of errors. System errors (which happen "randomly", mostly when interacting with other systems) and the deterministic errors that originate from SQL, let's call them dataflow errors.
For SQL/dataflow errors, we can create a timely logging view that reports "has error" if there is any error, but we don't need to report back the individual errors through logging. (That's what this GitHub issue is about.)

For addressing system errors, we should add an mz_connector_errors view (technically a timely logging source) where we throw in all the system errors. Maybe?
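As a rough illustration of the "has error" idea for deterministic dataflow errors, the following Rust sketch (standard library only) models an error collection as a list of updates with signed diffs, so that retracting the offending input shows up as a matching negative update. The `ErrorUpdate` type and the `has_error` function are assumptions made for this example and do not correspond to actual Materialize or differential dataflow APIs:

```rust
use std::collections::BTreeMap;

/// Hypothetical update to a dataflow error collection.
/// A diff of +1 introduces an error, a diff of -1 retracts it.
#[derive(Debug, Clone)]
pub struct ErrorUpdate {
    /// GlobalId of the index or sink, rendered as a string (assumed format).
    pub id: String,
    /// The error, rendered as a string.
    pub error: String,
    /// Signed change in multiplicity.
    pub diff: i64,
}

/// Report, per item id, whether at least one error is still present after
/// all updates are accumulated. Individual errors are not reported back.
pub fn has_error(updates: &[ErrorUpdate]) -> BTreeMap<String, bool> {
    // Accumulate the net multiplicity of each (id, error) pair.
    let mut counts: BTreeMap<(String, String), i64> = BTreeMap::new();
    for u in updates {
        *counts.entry((u.id.clone(), u.error.clone())).or_insert(0) += u.diff;
    }

    // An item has an error if any of its errors has positive net multiplicity.
    let mut status: BTreeMap<String, bool> = BTreeMap::new();
    for ((id, _error), count) in counts {
        let entry = status.entry(id).or_insert(false);
        if count > 0 {
            *entry = true;
        }
    }
    status
}
```

Non-deterministic connector errors would not fit this model well, since nothing retracts them; that is one reading of why the discussion keeps them out of the "err" collections.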

aljoscha · 2023-02-20T18:40:11Z

Closing this one for now:

There is no appetite for having something like it.
The architecture changed quite a bit during the "platformification" of Materialize. It would now be significantly harder to keep a "global" (whatever that means) record of the errors in all views/indexes across different clusters.
If/when we feel like we need something like this again we should start fresh, from clear requirements.

aljoscha mentioned this issue Aug 11, 2021

[Epic] Surface connector errors (and consistent policy for errors/retries to external systems) #7115

Closed

33 tasks

aljoscha added A-sink A-monitoring Area: monitoring and metrics labels Aug 11, 2021

aljoscha added this to To Do in Storage (Old) Aug 11, 2021

aljoscha added this to the 1.0 milestone Aug 11, 2021

aljoscha changed the title ~~Log connector errors in system tables using timely logging~~ Surface connector errors in a system table/view Aug 11, 2021

aljoscha changed the title ~~Surface connector errors in a system table/view~~ Surface dataflow errors (indexes and sinks) in a system table/view Aug 24, 2021

This was referenced Aug 24, 2021

Add convenience views for checking the error status of indexes and sinks #7803

Closed

Surface errors from coordinator-side source logic in error views #7805

Closed

nmeagan11 moved this from To Do to Icebox in Storage (Old) Oct 18, 2021

heeringa removed this from the 1.0 milestone Mar 29, 2022

nmeagan11 removed this from Icebox in Storage (Old) Aug 11, 2022

nmeagan11 added the A-storage Area: storage label Aug 22, 2022

aljoscha closed this as not planned Feb 20, 2023
