
docs: add design doc for consistent error handling and surfacing connector failures #7584

Merged: 1 commit merged into MaterializeInc:main on Aug 2, 2021

Conversation

@aljoscha (Contributor) commented on Jul 28, 2021:

rendered

Currently, Materialize reacts to errors that happen at different stages of a connector's lifecycle in different ways. For example, an error during purification/creation of a connector is reported back to the user immediately, while errors that happen at runtime are only logged. We should agree on what the behavior should be and fix it where needed.

Additionally, I propose to formally add the concept of a lifecycle for connectors, which will allow the system and users to determine the status of a connector. Right now, this would only be possible by looking through log files for error messages and manually associating them with connectors.
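A minimal sketch of what such a lifecycle state could look like; the `ConnectorStatus` name and its variants are illustrative only and not taken from the design doc:

```rust
/// Hypothetical lifecycle state for a connector (source or sink); the
/// variant names are illustrative, not the states proposed in the doc.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConnectorStatus {
    /// Created, but not yet ingesting or producing data.
    Starting,
    /// Running and making progress.
    Running,
    /// Hitting transient errors (e.g. an unreachable broker) but still
    /// retrying; the last error is kept so it can be surfaced to the user.
    Degraded { last_error: String },
    /// Hit a fatal error and will not make further progress.
    Failed { error: String },
}
```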

This addresses #7115

@aljoscha (author) commented:

cc @petrosagg: This should be mostly orthogonal to your work on the Source Trait. But there will have to be a mechanism to get the timeout configs to the relevant pieces.

@aljoscha changed the title from "docs: add design doc for consistent error handling and connector lifecycle" to "docs: add design doc for consistent error handling and surfacing connector failures" on Jul 29, 2021
> again they will start consuming again. At least that's the case in our setup,
> with split consumer keys. And the consumer will log errors still.
>
> We can be fine with that, or also try and cover this with a timeout and try to
Member commented:

This has always irked me about librdkafka. I think we'd be well served by finding some way to surface a transient "degraded" state while librdkafka is spewing errors.

@aljoscha (author) replied:

👍 Once we have the machinery in place to track connector state this should be doable. If we can coax it out of librdkafka...
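One way this could be coaxed out of librdkafka is to hook the client-level error callback and flip a shared flag that the status machinery reads; a rough sketch using the Rust `rdkafka` crate, where everything around the flag (how it feeds a status table, when it gets cleared) is left out and hypothetical:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

use rdkafka::ClientContext;
use rdkafka::error::KafkaError;

/// Client context that remembers whether librdkafka has recently reported
/// errors, so the connector can be surfaced as "degraded" rather than failed.
struct DegradedTrackingContext {
    degraded: Arc<AtomicBool>,
}

impl ClientContext for DegradedTrackingContext {
    fn error(&self, error: KafkaError, reason: &str) {
        // librdkafka routes most transient problems (broker down, DNS
        // failures, ...) through this callback while it keeps retrying
        // internally; record that we saw one instead of only logging it.
        eprintln!("kafka error: {error}: {reason}");
        self.degraded.store(true, Ordering::Relaxed);
    }
}
```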


> The concrete steps to achieve this, in the order we should address them:
>
> - Introduce global settings for `timeout`, `num retries`, and potentially
Member commented:

Are these user-settable globals? Or just a default bundle of Retry options that get plumbed around but are hardcoded into the binary? I think I like the second thing better! Retry settings to me feel like something that the user typically doesn't have much useful insight into, at least not at the global level. If you do have an opinion on retry configuration it's usually in the context of one specific retry loop that is misconfigured.

@aljoscha (author) replied:

I initially intended it to be user-settable globals, but then thought along the same lines as you and added some text under Alternatives that takes that back. I'll remove this from here.

By now, I'm not even sure we need global Retry options except maybe for restarting whole connectors/views.
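A sketch of what a "default bundle of retry options hardcoded into the binary" could look like; the struct and the default values are hypothetical and not Materialize's existing retry type:

```rust
use std::time::Duration;

/// Hypothetical bundle of retry settings that gets plumbed to the pieces
/// talking to external systems, instead of being user-settable globals.
#[derive(Debug, Clone, Copy)]
pub struct RetryOptions {
    pub initial_backoff: Duration,
    pub max_backoff: Duration,
    pub max_tries: usize,
}

/// Defaults compiled into the binary; an individual retry loop can still
/// override them where the generic values are known to be a bad fit.
pub const DEFAULT_RETRY: RetryOptions = RetryOptions {
    initial_backoff: Duration::from_millis(100),
    max_backoff: Duration::from_secs(30),
    max_tries: 10,
};
```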

@aljoscha (author) commented:

I tweaked the title to make the intent clearer, moved the section about user-settable global retry settings to Alternatives, and added a task about changing `mz_views` to add a status column, along with a new `mz_view_errors`.
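To give a rough idea of the information involved, a hypothetical sketch of the rows such a status column and error relation would carry; the names and fields are illustrative, not the schema proposed in the doc:

```rust
use std::time::SystemTime;

/// Hypothetical shape of the data behind a `status` column on `mz_views`.
pub struct ViewStatusRow {
    pub view_id: String,
    pub status: String, // e.g. "starting", "running", "degraded", "failed"
}

/// Hypothetical row for a companion `mz_view_errors` relation that keeps
/// the most recent errors observed for each view or connector.
pub struct ViewErrorRow {
    pub view_id: String,
    pub occurred_at: SystemTime,
    pub error: String,
}
```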

> - (coord) reading Kafka consistency topic
> - (coord) publishing schemas
> - (coord) creating topics
> - (coord) creating output file, checking input file
Contributor commented:

Listing the keys in an S3 bucket is also a one-time task that can fail.

@aljoscha (author) replied:

It is, but this is only started after the source was successfully created, I believe. I've looked at this one:

`async fn scan_bucket_task(`

I'm adding these examples to the doc as well, so thanks!

> - Continuously:
>   - (coord) metadata loops, for example fetching new Kafka partitions, listening
>     on BYO topics, listening on SQS for S3 updates
>   - (dataflow) actual data ingest and writing
Contributor commented:

also consuming the postgres replication stream

@aljoscha (author) replied:

Adding that as an example 👌

>   - (coord) purification
> - When creating a source/sink or when materialized is restarted:
>   - (coord) initial connector setup, for example:
>     - (coord) reading Kafka consistency topic
Contributor commented:

reading the initial snapshot from postgres

@aljoscha (author) replied:

I think this also starts only after the source was created and added to the catalog, in the dataflow layer.

Or did you observe that a missing Postgres will prevent Materialize from starting?

> - Failures that occur during a restart with an already filled catalog must not
>   bring down Materialize. Instead, errors that occur must be reported to the
>   user for each individual connector.

Contributor commented:

Failures in the source should also cause selecting from the source to start erroring out, rather than continuing to return stale or no data. This should be in addition to surfacing that same error in an mz_ table.

@aljoscha (author) replied:

Do you mean fatal errors, like a malformed message that we can't parse, or transient errors, like a Kafka broker not being reachable? For the former we should already prevent querying, but you're right that we don't do anything about the latter right now.
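A rough sketch of the distinction being discussed, assuming a hypothetical error classification on the ingest path; fatal errors should poison reads from the source, while transient ones should only mark it as degraded:

```rust
/// Hypothetical classification of errors hit on the ingest path.
pub enum IngestError {
    /// E.g. a malformed message we can never decode: the source cannot make
    /// progress past it, so selecting from it should return an error.
    Fatal(String),
    /// E.g. an unreachable Kafka broker: ingestion keeps retrying, but the
    /// source should be surfaced as degraded instead of silently serving
    /// stale (or no) data.
    Transient(String),
}

impl IngestError {
    /// Whether reads from the source should start failing because of this error.
    pub fn poisons_reads(&self) -> bool {
        matches!(self, IngestError::Fatal(_))
    }
}
```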

@philip-stoev (Contributor) left a comment:

Thank you for taking this up. I believe a consistent implementation will save our customers and our own customer-facing people a LOT of grief.

@benesch (Member) left a comment:

I gave this a very thorough review and, uh... I straight up don't have any comments. Not one! This is fantastic work, @aljoscha, and something we'd been meaning to get to for so long. Really excited that you're going to make it happen.

@aljoscha (author) commented on Aug 2, 2021:

Thanks! 🎉

Also, you all did have helpful comments earlier, so thanks for those as well!

@aljoscha merged commit 7dcb97f into MaterializeInc:main on Aug 2, 2021
@aljoscha deleted the doc-error-handling branch on August 2, 2021 at 20:04