
docs: add design doc for consistent error handling and surfacing connector failures #7584

Merged: 1 commit merged into MaterializeInc:main on Aug 2, 2021

Conversation

@aljoscha (Contributor) commented on Jul 28, 2021:

rendered

Currently, Materialize reacts to errors that happen at different stages of a connector's lifecycle in different ways. For example, an error during purification/creation of a connector is reported back to the user immediately, while errors that happen at runtime are only logged. We should agree on what the behavior should be and fix it where needed.

Additionally, I propose to formally add the concept of a lifecycle for connectors, which will allow the system and users to determine the status of a connector. Right now, this would only be possible by looking through log files for error messages and manually associating them with connectors.
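A minimal sketch of what such a lifecycle state could look like; the `ConnectorStatus` name and its variants are illustrative only and not taken from the design doc:

```rust
/// Hypothetical lifecycle state for a connector (source or sink); the
/// variant names are illustrative, not the states proposed in the doc.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConnectorStatus {
    /// Created, but not yet ingesting or producing data.
    Starting,
    /// Running and making progress.
    Running,
    /// Hitting transient errors (e.g. an unreachable broker) but still
    /// retrying; the last error is kept so it can be surfaced to the user.
    Degraded { last_error: String },
    /// Hit a fatal error and will not make further progress.
    Failed { error: String },
}
```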

This addresses #7115

@aljoscha (author) commented:

cc @petrosagg: This should be mostly orthogonal to your work on the Source Trait. But there will have to be a mechanism to get the timeout configs to the relevant pieces.

@aljoscha changed the title from "docs: add design doc for consistent error handling and connector lifecycle" to "docs: add design doc for consistent error handling and surfacing connector failures" on Jul 29, 2021
> again they will start consuming again. At least that's the case in our setup,
> with split consumer keys. And the consumer will log errors still.
>
> We can be fine with that, or also try and cover this with a timeout and try to
Member commented:

This has always irked me about librdkafka. I think we'd be well served by finding some way to surface a transient "degraded" state while librdkafka is spewing errors.

@aljoscha (author) replied:

👍 Once we have the machinery in place to track connector state this should be doable. If we can coax it out of librdkafka...
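One way this could be coaxed out of librdkafka is to hook the client-level error callback and flip a shared flag that the status machinery reads; a rough sketch using the Rust `rdkafka` crate, where everything around the flag (how it feeds a status table, when it gets cleared) is left out and hypothetical:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

use rdkafka::ClientContext;
use rdkafka::error::KafkaError;

/// Client context that remembers whether librdkafka has recently reported
/// errors, so the connector can be surfaced as "degraded" rather than failed.
struct DegradedTrackingContext {
    degraded: Arc<AtomicBool>,
}

impl ClientContext for DegradedTrackingContext {
    fn error(&self, error: KafkaError, reason: &str) {
        // librdkafka routes most transient problems (broker down, DNS
        // failures, ...) through this callback while it keeps retrying
        // internally; record that we saw one instead of only logging it.
        eprintln!("kafka error: {error}: {reason}");
        self.degraded.store(true, Ordering::Relaxed);
    }
}
```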


> The concrete steps to achieve this, in the order we should address them:
>
> - Introduce global settings for `timeout`, `num retries`, and potentially
Member commented:

Are these user-settable globals? Or just a default bundle of Retry options that get plumbed around but are hardcoded into the binary? I think I like the second thing better! Retry settings to me feel like something that the user typically doesn't have much useful insight into, at least not at the global level. If you do have an opinion on retry configuration it's usually in the context of one specific retry loop that is misconfigured.

@aljoscha (author) replied:

I initially intended it to be user-settable globals, but then thought along the same lines as you and added some text under Alternatives that takes that back. I'll remove this from here.

By now, I'm not even sure we need global Retry options except maybe for restarting whole connectors/views.
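A sketch of what a "default bundle of retry options hardcoded into the binary" could look like; the struct and the default values are hypothetical and not Materialize's existing retry type:

```rust
use std::time::Duration;

/// Hypothetical bundle of retry settings that gets plumbed to the pieces
/// talking to external systems, instead of being user-settable globals.
#[derive(Debug, Clone, Copy)]
pub struct RetryOptions {
    pub initial_backoff: Duration,
    pub max_backoff: Duration,
    pub max_tries: usize,
}

/// Defaults compiled into the binary; an individual retry loop can still
/// override them where the generic values are known to be a bad fit.
pub const DEFAULT_RETRY: RetryOptions = RetryOptions {
    initial_backoff: Duration::from_millis(100),
    max_backoff: Duration::from_secs(30),
    max_tries: 10,
};
```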

@aljoscha (author) commented:

I tweaked the title to make the intent clearer, moved the section about user-settable global retry settings to Alternatives, and added a task about changing `mz_views` to add a status column, along with a new `mz_view_errors`.
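To give a rough idea of the information involved, a hypothetical sketch of the rows such a status column and error relation would carry; the names and fields are illustrative, not the schema proposed in the doc:

```rust
use std::time::SystemTime;

/// Hypothetical shape of the data behind a `status` column on `mz_views`.
pub struct ViewStatusRow {
    pub view_id: String,
    pub status: String, // e.g. "starting", "running", "degraded", "failed"
}

/// Hypothetical row for a companion `mz_view_errors` relation that keeps
/// the most recent errors observed for each view or connector.
pub struct ViewErrorRow {
    pub view_id: String,
    pub occurred_at: SystemTime,
    pub error: String,
}
```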

> - (coord) reading Kafka consistency topic
> - (coord) publishing schemas
> - (coord) creating topics
> - (coord) creating output file, checking input file
Contributor commented:

Listing the keys in an S3 bucket is also a one-time task that can fail.

@aljoscha (author) replied:

It is, but this is only started after the source was successfully created, I believe. I've looked at this one:

`async fn scan_bucket_task(`

I'm adding these examples to the doc as well, so thanks!

> - Continuously:
>   - (coord) metadata loops, for example fetching new Kafka partitions, listening
>     on BYO topics, listening on SQS for S3 updates
>   - (dataflow) actual data ingest and writing
Contributor commented:

also consuming the postgres replication stream

@aljoscha (author) replied:

Adding that as an example 👌

>   - (coord) purification
> - When creating a source/sink or when materialized is restarted:
>   - (coord) initial connector setup, for example:
>     - (coord) reading Kafka consistency topic
Contributor commented:

reading the initial snapshot from postgres

@aljoscha (author) replied:

I think this also starts only after the source was created and added to the catalog, in the dataflow layer.

Or did you observe that a missing Postgres will prevent Materialize from starting?

> - Failures that occur during a restart with an already filled catalog must not
>   bring down Materialize. Instead, errors that occur must be reported to the
>   user for each individual connector.

Contributor commented:

Failures in the source should also cause selecting from the source to start erroring out, rather than continuing to return stale or no data. This should be in addition to surfacing that same error in an mz_ table.

@aljoscha (author) replied:

Do you mean fatal errors, like a malformed message that we can't parse, or transient errors, like a Kafka broker not being reachable? For the former we should already prevent querying, but you're right that we don't do anything about the latter right now.
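A rough sketch of the distinction being discussed, assuming a hypothetical error classification on the ingest path; fatal errors should poison reads from the source, while transient ones should only mark it as degraded:

```rust
/// Hypothetical classification of errors hit on the ingest path.
pub enum IngestError {
    /// E.g. a malformed message we can never decode: the source cannot make
    /// progress past it, so selecting from it should return an error.
    Fatal(String),
    /// E.g. an unreachable Kafka broker: ingestion keeps retrying, but the
    /// source should be surfaced as degraded instead of silently serving
    /// stale (or no) data.
    Transient(String),
}

impl IngestError {
    /// Whether reads from the source should start failing because of this error.
    pub fn poisons_reads(&self) -> bool {
        matches!(self, IngestError::Fatal(_))
    }
}
```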

@philip-stoev (Contributor) left a comment:

Thank you for taking this up. I believe a consistent implementation will save our customers and our own customer-facing people a LOT of grief.

@benesch (Member) left a comment:

I gave this a very thorough review and, uh... I straight up don't have any comments. Not one! This is fantastic work, @aljoscha, and something we'd been meaning to get to for so long. Really excited that you're going to make it happen.

@aljoscha (author) commented on Aug 2, 2021:

Thanks! 🎉

Also, you all did have helpful comments earlier, so thanks for those as well!

@aljoscha merged commit 7dcb97f into MaterializeInc:main on Aug 2, 2021
@aljoscha deleted the doc-error-handling branch on August 2, 2021 at 20:04