Surface dataflow errors (indexes and sinks) in a system table/view #7804

Closed
Tracked by #7115
aljoscha opened this issue Aug 11, 2021 · 7 comments
Labels
A-monitoring Area: monitoring and metrics A-storage Area: storage

Comments

@aljoscha
Contributor

aljoscha commented Aug 11, 2021

We need to surface the errors that we get in the "errors" part of a CollectionBundle, which we usually get from Context::lookup_id() during rendering. Here's an example call site, where we ignore the errors:

let (collection, _err_collection) = self

I mention indexes and sinks in the title because these are the two "items" that we currently render dataflow operator graphs for, and which we can identify and join with information in the catalog.

Concretely, we should have new system views (or tables, or logs) mz_index_errors and mz_sink_errors that surface the error (probably as a string) along with the GlobalId (also as a string) of the index or sink in which the error occurred. We could also just have one view, mz_dataflow_errors. We'll see how ergonomic either option is when we get there, but this shouldn't be the hard part.

Example of a view that could be built with this (from Nikhil):

CREATE VIEW mz_catalog.mz_index_status AS
SELECT
  i.*,
  EXISTS (SELECT 1 FROM mz_index_errors e WHERE e.id = i.id) AS presently_has_error
FROM mz_indexes i
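
To make the semantics of the view above concrete, here is a minimal in-memory sketch in Rust. It is not Materialize code: `IndexErrorRow`, `IndexStatusRow`, and `index_status` are hypothetical names, and the only assumption carried over from the proposal is that `mz_index_errors` exposes the index's GlobalId and the error, both as strings.

```rust
use std::collections::HashSet;

/// Hypothetical row of the proposed `mz_index_errors` view.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexErrorRow {
    /// GlobalId of the index, rendered as a string.
    pub id: String,
    /// The error, rendered as a string.
    pub error: String,
}

/// Hypothetical row of `mz_index_status`, reduced to the id and the flag.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexStatusRow {
    pub id: String,
    pub presently_has_error: bool,
}

/// Mirrors `EXISTS (SELECT 1 FROM mz_index_errors e WHERE e.id = i.id)`:
/// an index is flagged if at least one error row carries its id.
pub fn index_status(index_ids: &[String], errors: &[IndexErrorRow]) -> Vec<IndexStatusRow> {
    // Collect the ids that have any error so each lookup is O(1).
    let erroring: HashSet<&str> = errors.iter().map(|e| e.id.as_str()).collect();
    index_ids
        .iter()
        .map(|id| IndexStatusRow {
            id: id.clone(),
            presently_has_error: erroring.contains(id.as_str()),
        })
        .collect()
}
```

The flag only says whether any error exists for the index. The error strings themselves stay in the errors view, which is what keeps the status view cheap to join against `mz_indexes`.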

Some things to consider:

  • is "timely logging" the right solution for this?
  • Do we want to retain and serve all the errors? What's a good cleanup policy?
@aljoscha aljoscha added A-sink A-monitoring Area: monitoring and metrics labels Aug 11, 2021
@aljoscha aljoscha added this to To Do in Storage (Old) Aug 11, 2021
@aljoscha aljoscha added this to the 1.0 milestone Aug 11, 2021
@aljoscha aljoscha changed the title Log connector errors in system tables using timely logging Surface connector errors in a system table/view Aug 11, 2021
@aljoscha aljoscha changed the title Surface connector errors in a system table/view Surface dataflow errors (indexes and sinks) in a system table/view Aug 24, 2021
@aljoscha
Contributor Author

@frankmcsherry & @benesch We discussed this a while ago in Slack and you both had opinions. The one remaining question I have is what the timeline of the error views should be. We could have one error view per timeline that we support, or we could say the error views are all in the system timeline (same as the other system views/logs/tables) and add the timestamp of the error (which is in the timeline of the index/sink) as a data column.

I think the latter is easier for users, because they don't need to fiddle with different tables and it makes the error views easily joinable with the other system views. They would roughly reflect "errors as of now", independent of where we are in the timeline of the dataflow. Which I find useful because it allows answering the question "can I query this thing now".

What do you think?

@aljoscha
Contributor Author

Also, this one might be affected by whatever comes out of #8008.

@frankmcsherry
Contributor

If they are in the errors part of a CollectionBundle then they need to be in the timeline of the collection.

@frankmcsherry
Contributor

Can you say more about why we need to surface them? I can see "want" to surface them, but what is the actual requirement? If they are errors that happen in the course of evaluation, we capture them in an arrangement already. Is the goal to have some broad view over a timeline of all errors in things folks have written? Is it instead to get a system-wide view of the errors that are happening in the system?

If it is important for the errors to be aligned with the times at which they are produced (e.g. a DivByZero error which corresponds to specific input data, and may eventually be retracted at a specific time) then they should be in the timeline of the collection. If it is either not important, or hard to put them in the timeline (e.g. BufferBuildingUpInSinkIDKWHATTODO) then logging is probably a better answer.
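
The point about errors being aligned with times, and possibly retracted later, can be illustrated with a small Rust sketch. It models errors as `(error, time, diff)` updates in the style of a differential collection. This is an illustration under assumptions, not the actual arrangement code: `ErrorUpdate` and `errors_as_of` are hypothetical names, and timestamps are plain `u64` values.

```rust
use std::collections::BTreeMap;

/// A hypothetical error update: (error text, timestamp, diff).
/// A diff of +1 introduces the error, a diff of -1 retracts it.
pub type ErrorUpdate = (String, u64, i64);

/// Returns the errors present at `as_of` by accumulating every update
/// with a timestamp at or before `as_of` and keeping positive counts.
pub fn errors_as_of(updates: &[ErrorUpdate], as_of: u64) -> Vec<String> {
    let mut counts: BTreeMap<&str, i64> = BTreeMap::new();
    for (error, time, diff) in updates {
        if *time <= as_of {
            *counts.entry(error.as_str()).or_insert(0) += *diff;
        }
    }
    counts
        .into_iter()
        .filter(|(_, count)| *count > 0)
        .map(|(error, _)| error.to_string())
        .collect()
}
```

With a division-by-zero error introduced at time 5 and retracted at time 9, the error is visible for any `as_of` in 5 through 8 and absent otherwise. That behavior only makes sense if the errors live in the timeline of the collection that produced them.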

@aljoscha
Contributor Author

It's intended as a way to get a system-wide overview of the status of indexes and sinks (and views, in the end). Right now, the only way of figuring out whether a view/index is wedged is a) try and peek/query that one specific view/index, or b) hope that there is something in the logs. Especially in a cloud setting, I don't think trawling through logs is very ergonomic.

For context:

(I'm using view and index somewhat interchangeably above because when querying a view you transitively are querying the index, and get its errors. But yes, the dataflow layer only knows indexes and sinks.)

@aljoscha
Contributor Author

aljoscha commented Sep 1, 2021

Logging the results of an internal discussion on this:

  • Connector errors (for example sink or source errors) shouldn't be added to the already existing "err" collections/arrangements, mostly because they are not deterministic/reproducible. A division by zero error, by contrast, can be removed by retracting the offending input.
  • There are different classes of errors. System errors (which happen "randomly", mostly when interacting with other systems) and the deterministic errors that originate from SQL, let's call them dataflow errors.
  • For SQL/dataflow errors, we can create a timely logging view that reports "has error" if there is any error, but we don't need to report back the individual errors through logging. (That's what this GitHub issue is about.)

For addressing system errors, we should add an mz_connector_errors view (technically a timely logging source) where we throw in all the system errors. Maybe?
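
The split described above can be sketched in Rust as follows. Everything here is hypothetical and only models the outcome of the discussion: `ErrorClass`, `ReportedError`, `has_dataflow_error`, and `connector_errors` are illustrative names, not existing Materialize types or functions.

```rust
/// The two classes of errors distinguished in the discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorClass {
    /// Non-deterministic errors, mostly from interacting with other systems.
    System,
    /// Deterministic errors that originate from SQL evaluation.
    Dataflow,
}

/// A hypothetical error record tagged with the item it belongs to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReportedError {
    /// GlobalId of the index, sink, or source, rendered as a string.
    pub id: String,
    pub class: ErrorClass,
    pub message: String,
}

/// For dataflow errors only a boolean "has error" is reported per item;
/// the individual error messages are not passed through logging.
pub fn has_dataflow_error(errors: &[ReportedError], id: &str) -> bool {
    errors
        .iter()
        .any(|e| e.class == ErrorClass::Dataflow && e.id == id)
}

/// System errors are the ones that would be collected, in full, in
/// something like the suggested `mz_connector_errors` view.
pub fn connector_errors(errors: &[ReportedError]) -> Vec<&ReportedError> {
    errors
        .iter()
        .filter(|e| e.class == ErrorClass::System)
        .collect()
}
```

Keeping the two classes apart means the deterministic errors can stay in the existing "err" collections, while the non-reproducible ones take a separate path that does not have to follow the dataflow's timeline.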

@nmeagan11 nmeagan11 moved this from To Do to Icebox in Storage (Old) Oct 18, 2021
@heeringa heeringa removed this from the 1.0 milestone Mar 29, 2022
@nmeagan11 nmeagan11 removed this from Icebox in Storage (Old) Aug 11, 2022
@nmeagan11 nmeagan11 added the A-storage Area: storage label Aug 22, 2022
@aljoscha
Contributor Author

Closing this one for now:

  • There is no appetite for having something like it.
  • The architecture changed quite a bit during the "platformification" of Materialize. It would now be significantly harder to keep a "global" (whatever that means) record of the errors in all views/indexes across different clusters.
  • If/when we feel like we need something like this again we should start fresh, from clear requirements.

@aljoscha aljoscha closed this as not planned Feb 20, 2023

4 participants