# Consistent Error Handling and Surfacing Connector Failures

*I will use "connector" to refer to either a source or a sink below, and
`mz_connector` to mean either `mz_source` or `mz_sink` or both.*

## Summary

Currently, Materialize reacts to errors happening at different stages of a
connector's lifecycle in different ways. For example, an error happening during
purification/creation of a connector is reported back to the user immediately,
while errors that happen during runtime are only logged. We should agree on
what the behavior should be and fix it where needed.

Additionally, I propose to formally add the concept of a *lifecycle* for
connectors, which will allow the system and users to determine the status of a
connector. Right now, this would only be possible by looking through log files
for error messages and manually associating them with connectors.
## Goals

- unify error handling policies across connectors
- make it explicit how Materialize should react to errors at different stages
  of execution
- introduce a connector lifecycle and surface it to users through system tables

## Non-Goals

- introduce more structured error messages, see
  https://github.com/MaterializeInc/materialize/discussions/5340
## Description

The actual proposal is [way down](#proposed-changes), but it helps to first
understand the current state.

### Current State

#### Where/when do we interact with external systems?
- When creating a source/sink:
  - (coord) purification
- When creating a source/sink or when materialized is restarted:
  - (coord) initial connector setup, for example:
    - (coord) reading Kafka consistency topic
    - (coord) publishing schemas
    - (coord) creating topics
    - (coord) creating output file, checking input file

      > **Reviewer:** Listing the keys in an S3 bucket is also a one-time task
      > that can fail.
      >
      > **Author:** It is, but this is only started after the source was
      > successfully created, I believe. I've looked at
      > `src/dataflow/src/source/s3.rs` (line 219 at 480aa30). I'm adding
      > these examples to the doc as well, so thanks!
  - (coord) validate AWS credentials
- Continuously/During dataflow execution:
  - (coord) metadata loops, for example fetching new Kafka partitions,
    listening on BYO topics
  - (dataflow) listening on SQS for S3 updates, scanning S3 bucket (though it
    only happens once, it's still in the dataflow, happening after successful
    source creation)
  - (dataflow) actual data ingest and writing, for example, from Kafka, S3,
    Postgres...
#### Failures During Purification

In general, failures during purification lead to the statement being aborted;
nothing gets added to the catalog. This seems fair, because we cannot store
un-purified statements in the catalog and purification happens infrequently,
only when a statement is issued.

Most purification logic doesn't specify timeouts; only S3 and Kinesis have a
1-second timeout for the validation call. I believe none of the purification
checks do retries.
#### Failures During Source/Sink Creation

The behavior is very similar (the same?) to failures during purification:
nothing gets added to the catalog when there are failures.

Some calls have custom timeouts, some don't. I believe none of the calls have
retries.
#### Failures During "Normal Operations" | ||||
|
||||
Failures during execution don't bring down materialized. If failures happen, | ||||
they are only surfaced in the log. Materialized doesn't "know" about the state | ||||
of a connector. | ||||
|
||||
There are no global config options for timeouts/retries/back-off interval. | ||||
|
||||
#### Failures During Restart (restarting a materialized with things in the catalog)

Failures during sink (re-)creation kill materialized when restarting with a
filled catalog. Failures during source creation are fine.

The reason is that (re-)creating a source runs no source-specific code in the
coordinator. Everything happens in dataflow, where we have retries. This is not
true for sinks.
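Schematically, the two paths differ roughly as follows. This is not actual
Materialize code; `connect_to_external_system` is a hypothetical stand-in for
any external call (e.g. reading a consistency topic):

```rust
use std::{thread::sleep, time::Duration};

// Hypothetical stand-in for any call to an external system.
fn connect_to_external_system() -> Result<(), String> {
    Err("broker unreachable".to_string())
}

// Sink path: runs in the coordinator while the catalog is being restored; the
// error propagates upward and aborts the whole restart.
fn recreate_sink_on_restart() -> Result<(), String> {
    connect_to_external_system()?;
    Ok(())
}

// Source path: the source-specific work runs in the dataflow layer, which
// already retries, so a failure here never aborts startup.
fn recreate_source_on_restart() {
    while connect_to_external_system().is_err() {
        sleep(Duration::from_secs(1));
    }
}
```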
### Proposed Changes

The reaction to failures at various stages should be uniform across different
connectors and informed by these rules:

- Failures during connector creation (which includes purification) immediately
  report an error back to the user. No catalog entry is created.
- Failures during runtime need to be surfaced to the user in system tables.
- Failures that occur during a restart with an already filled catalog must not
  bring down Materialize. Instead, errors that occur must be reported to the
  user for each individual connector.
> **Reviewer:** Failures in the source should also cause selecting from the
> source to start erroring out, rather than continuing to return stale or no
> data. This should be in addition to surfacing that same error in a […]
>
> **Author:** Do you mean fatal errors, like a malformed message that we can't
> parse, or transient errors, like a Kafka broker not being reachable? For the
> former, we should already prevent querying, but you're right that we don't
> do anything about the latter right now.
The concrete steps to achieve this, in the order we should address them:

- "Teach" Materialize about the lifecycle/state of connectors. Connectors can
  be either `RUNNING`, `FAILED`, or `CANCELED`. We might get a bit more fancy
  by adding things like `STARTING`/`RESTARTING`. Issues that happen during
  execution must change the state, instead of just logging error messages
  (see the sketch after this list).
- Add "status" columns to `mz_connector` that indicate the status of a
  connector. These would surface the lifecycle state mentioned above.
- Add a new view `mz_connector_errors` that surfaces errors that are currently
  only reported in the log.
- Add a "status" column to `mz_views` and surface errors in a new
  `mz_view_errors` system table. We need these because errors in depended-upon
  sources propagate to views.
- Maybe log state transitions in the above new views, or add additional views
  for that.
- Failures in a connector should never bring down Materialize. Right now this
  can happen when restarting. For those cases we should instead set the
  lifecycle status accordingly and continue. It's somewhat tricky because this
  happens when restoring the catalog and looks to the sink like it's just
  being created. Normally, when creating a sink, failures should cancel the
  statement and not add the sink to the catalog, but we cannot do this here.
- Report a `DEGRADED` state in system tables when a connector is in a
  temporarily degraded state, for example when `rdkafka` can't reach a broker
  but just sits and waits for it to come back online.
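To make the lifecycle idea concrete, here is a minimal Rust sketch of the
states and the status events that could back the proposed `mz_connector`
status columns and the `mz_connector_errors` view. All names here
(`ConnectorStatus`, `StatusEvent`, `report_status`) are hypothetical, not
existing Materialize types:

```rust
use std::time::SystemTime;

/// Hypothetical lifecycle states, mirroring the ones proposed above.
#[derive(Debug, Clone, PartialEq)]
enum ConnectorStatus {
    Starting,
    Running,
    Degraded { reason: String },
    Failed { error: String },
    Canceled,
}

/// A single status change. The latest event per connector would back the
/// "status" column; the history of `Failed`/`Degraded` events would back
/// `mz_connector_errors`.
#[derive(Debug, Clone)]
struct StatusEvent {
    connector_id: String,
    at: SystemTime,
    status: ConnectorStatus,
}

/// Instead of only writing to the log, a connector task would report a status
/// change that the coordinator records and surfaces in the system tables.
fn report_status(events: &mut Vec<StatusEvent>, connector_id: &str, status: ConnectorStatus) {
    events.push(StatusEvent {
        connector_id: connector_id.to_string(),
        at: SystemTime::now(),
        status,
    });
}
```

A connector that hits, for example, an unreachable broker would then report
`Degraded` (or `Failed`, for fatal errors) instead of only logging the error.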
#### Dealing with transitive failures

What should be the state of a sink when one of the sources that it depends on
fails? I think it should also be put in a failed state, with
`mz_connector_errors` indicating that it failed because one of its inputs
failed.
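As a rough sketch of the propagation (not actual Materialize code; the plain
maps stand in for the catalog's real dependency information):

```rust
use std::collections::HashMap;

/// When a source fails, also mark every sink that reads from it as failed,
/// recording which input caused it. `sink_inputs` maps a sink id to the ids
/// of the sources it depends on; `statuses` holds per-connector status text.
fn propagate_source_failure(
    failed_source: &str,
    sink_inputs: &HashMap<String, Vec<String>>,
    statuses: &mut HashMap<String, String>,
) {
    statuses.insert(failed_source.to_string(), "FAILED".to_string());
    for (sink, inputs) in sink_inputs {
        if inputs.iter().any(|input| input == failed_source) {
            statuses.insert(
                sink.clone(),
                format!("FAILED: input source {} failed", failed_source),
            );
        }
    }
}
```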
## Musings/Future Work

- Add something like `RESTART SOURCE` or `RESTART SINK` that can get a
  source/sink out of a wedged state. This is potentially complicated because
  restarting sources would also require restarting dependent sinks.
- We can think about letting the system restart a fatally failed connector
  instead of requiring user intervention, as we do now. For this, we would
  need global `Retry` settings, either defaults or user-settable.
- When we add *one-shot sinks* later, we can add a state `FINISHED` that would
  indicate finished sinks.
- Keep entries for dropped/failed/whatever connectors in a
  `mz_connector_history` view, or keep dropped connectors in `mz_connector`
  for a while.
## Alternatives

We could introduce global settings for `timeout`, `num retries`, and
potentially `back-off interval`. Not all connector SDKs support specifying all
of these, but those that do would use the global configuration instead of
relying on arbitrary hard-coded values.
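For illustration, a minimal sketch of what such global settings could drive;
`RetryConfig` and `with_retries` are made-up names, and a global `timeout`
would additionally have to be handed to each SDK call, which not every SDK
supports:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical global settings; a timeout would have to be passed to the
/// individual SDK calls and is therefore not modelled here.
struct RetryConfig {
    max_retries: u32,
    initial_backoff: Duration,
}

/// Run `op` until it succeeds or `max_retries` retries are used up, doubling
/// the back-off between attempts.
fn with_retries<T, E>(cfg: &RetryConfig, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut backoff = cfg.initial_backoff;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= cfg.max_retries => return Err(e),
            Err(_) => {
                sleep(backoff);
                backoff *= 2;
                attempt += 1;
            }
        }
    }
}
```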
I don't think this is too valuable, though, because these settings are usually
too connector-specific.
## Open Questions

There were some, which led to the addition of the `DEGRADED` state above.
## References

- Product thoughts from Nicolle: https://docs.google.com/document/d/10RsEnpJJBN-lQKyYl08KRQ15Ckn-_gCWa5MAnZcc-ts/edit#heading=h.yvax2aldz1n9
- Meta-issue by Eli: https://github.com/MaterializeInc/materialize/issues/7115

Critical:
- https://github.com/MaterializeInc/materialize/issues/7396: sink error during startup kills materialized
- https://github.com/MaterializeInc/materialize/issues/7009: failure in S3 source is printed to log but source doesn't error out
- https://github.com/MaterializeInc/materialize/issues/6586: transient broker failure leads to error in log, but source doesn't report as errored and doesn't continue to produce data

Unknown:
- https://github.com/MaterializeInc/materialize/issues/7035: fetching in source doesn't continue after transient failure
- https://github.com/MaterializeInc/materialize/issues/6916: metadata fetch errors in log

Related:
- https://github.com/MaterializeInc/materialize/issues/3122: "no complete timestamps" message is not useful
- related: https://github.com/MaterializeInc/materialize/issues/2917

Future work, a.k.a. non-goals:
- https://github.com/MaterializeInc/materialize/discussions/5340: more structured user errors (error, details, hints)

Merged:
- https://github.com/MaterializeInc/materialize/pull/6952: graceful retries for S3
> **Reviewer:** reading the initial snapshot from Postgres
>
> **Author:** I think this also starts only after the source was created and
> added to the catalog, in the dataflow layer. Or did you observe that a
> missing Postgres will prevent Materialize from starting?