docs: add design doc for consistent error handling and surfacing connector failures #7584

Merged (1 commit) on Aug 2, 2021.
# Consistent Error Handling and Surfacing Connector Failures

*I will use connector to refer to either source or sink below. I will use
`mz_connector` to mean either `mz_source` or `mz_sink` or both.*

## Summary

Currently, Materialize reacts to errors at different stages of a connector's
lifecycle in different ways. For example, an error during purification/creation
of a connector is reported back to the user immediately, while errors that
happen during runtime are only logged. We should agree on what the behavior
should be and fix it where needed.

Additionally, I propose to formally add the concept of a *lifecycle* for
connectors, which will allow the system and users to determine the status of a
connector. Right now, this would only be possible by looking through log files
for error messages and manually associating them with connectors.

## Goals

- unify error handling policies across connectors
- make it explicit how Materialize should react to errors at different stages
of execution
- introduce connector lifecycle and surface it to users through system tables

## Non-Goals

- introduce more structured error messages, see
https://github.com/MaterializeInc/materialize/discussions/5340

## Description

The actual proposal is [way down](#proposed-changes) but it helps to first
understand the current state.

### Current State

#### Where/when do we interact with external systems?

- When creating a source/sink:
- (coord) purification
- When creating a source/sink or when materialized is restarted:
- (coord) initial connector setup, for example:
- (coord) reading Kafka consistency topic
  > **Reviewer:** reading the initial snapshot from postgres
  >
  > **Author:** I think this also starts only after the source was created
  > and added to the catalog, in the dataflow layer. Or did you observe that
  > a missing Postgres will prevent Materialize from starting?
- (coord) publishing schemas
- (coord) creating topics
- (coord) creating output file, checking input file
  > **Reviewer:** Listing the keys in an S3 bucket is also a one-time task
  > that can fail.
  >
  > **Author:** It is, but this is only started after the source was
  > successfully created, I believe. I've looked at this one:
  > `async fn scan_bucket_task(`
  > I'm adding these examples to the doc as well, so thanks!

- (coord) validate AWS credentials
- Continuously/During dataflow execution:
- (coord) metadata loops, for example fetching new Kafka partitions,
listening on BYO topics
- (dataflow) listening on SQS for S3 updates, scanning S3 bucket (though it
only happens once, it's still in the dataflow, happening after successful
source creation)
- (dataflow) actual data ingest and writing, for example, from Kafka, S3,
Postgres...

#### Failures during purification

In general, failures during purification lead to the statement being aborted;
nothing gets added to the catalog. This seems fair, because we cannot store
un-purified statements in the catalog and purification happens infrequently,
only when a statement is issued.

Most purification logic doesn't specify timeouts; only S3 and Kinesis have a
1-second timeout for the validation call. I believe none of the purification
checks retry on failure.
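
For illustration, here is a rough Rust sketch of what such a bounded
validation call looks like conceptually. The function and error type are made
up for this doc; this is not actual Materialize code:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Conceptual sketch (not actual Materialize code): run a purification
/// validation call with a 1-second timeout and no retries, which is roughly
/// what the S3/Kinesis checks do today.
async fn validate_with_timeout(
    check: impl std::future::Future<Output = Result<(), String>>,
) -> Result<(), String> {
    match timeout(Duration::from_secs(1), check).await {
        Ok(result) => result,
        Err(_elapsed) => Err("validation call timed out after 1s".to_string()),
    }
}
```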

#### Failures During Source/Sink Creation

The behavior is very similar (the same?) to failures during purification:
nothing gets added to the catalog when there are failures.

Some calls have custom timeouts, some don't. I believe none of the calls have
retries.

#### Failures During "Normal Operations"

Failures during execution don't bring down materialized. If failures happen,
they are only surfaced in the log. Materialized doesn't "know" about the state
of a connector.

There are no global config options for timeouts/retries/back-off interval.

#### Failures During Restart (restarting a materialized with things in the catalog)

Failures during sink (re-)creation kill materialized when restarting with a
filled catalog. Failures during source creation are fine.

The reason is that (re-)creating a source runs no source-specific code in the
coordinator. Everything happens in dataflow, where we have retries. This is not
true for sinks.

### Proposed Changes

The reaction to failures at various stages should be uniform across different
connectors and informed by these rules (a sketch of this routing follows the
list):

- Failures during connector creation (which includes purification) immediately
report an error back to the user. No catalog entry is created.
- Failures during runtime need to be surfaced to the user in system tables.
- Failures that occur during a restart with an already filled catalog must not
bring down Materialize. Instead, errors that occur must be reported to the
user for each individual connector.
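
To make the intended policy concrete, here is a minimal sketch of how an error
could be routed depending on the stage it occurs in. The stage and reaction
names are illustrative only, not existing Materialize types:

```rust
/// Illustrative only: the lifecycle stage a connector error occurs in.
enum Stage {
    /// Purification or initial creation of the connector.
    Creation,
    /// Normal dataflow execution.
    Runtime,
    /// Re-creating connectors from an already filled catalog on restart.
    CatalogRestore,
}

/// Illustrative only: how Materialize should react to an error.
enum Reaction {
    /// Abort the statement and report the error to the user; no catalog entry.
    ReportToUser,
    /// Record the error in a system table (the proposed `mz_connector_errors`).
    SurfaceInSystemTable,
    /// Mark the connector as failed but keep Materialize running.
    MarkFailedAndContinue,
}

fn react_to_error(stage: Stage) -> Reaction {
    match stage {
        Stage::Creation => Reaction::ReportToUser,
        Stage::Runtime => Reaction::SurfaceInSystemTable,
        Stage::CatalogRestore => Reaction::MarkFailedAndContinue,
    }
}
```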

> **Reviewer:** Failures in the source should also cause selecting from the
> source to start erroring out, rather than continuing to return stale or no
> data. This should be in addition to surfacing that same error in a `mz_`
> table.
>
> **Author:** Do you mean fatal errors, like a malformed message that we
> can't parse, or transient errors, like a Kafka broker not being reachable?
> For the former, we should already prevent querying, but you're right that
> we don't do anything about the latter right now.

The concrete steps to achieve this, in the order in which we should address
them:

- "Teach" Materialize about the lifecycle/state of connectors. Connectors can
be either `RUNNING`, `FAILED`, or `CANCELED`. We might get a bit more fancy
by adding states like `STARTING`/`RESTARTING` (a rough sketch of these states
follows this list). Issues that happen during execution must change the state
instead of just logging error messages.
- Add "status" columns to `mz_connector` that indicate the status of a
connector. These would surface the lifecycle state mentioned above.
- Add a new view `mz_connector_errors` that surfaces errors that are currently
only reported in the log.
- Add a "status" column to `mz_views` and surface errors in a new
`mz_view_errors` system table. We need these because errors in depended-upon
sources propagate to views.
- Maybe log state transitions in the above new views or add additional views
for that.
- Failures in a connector should never bring down Materialize. Right now this
can happen when restarting. For those cases we should instead set the
lifecycle status accordingly and continue. It's somewhat tricky because this
happens while restoring the catalog and looks to the sink like it's just
being created. Normally, when creating a sink, failures should cancel the
statement and not add the sink to the catalog, but we cannot do that here.
- Report a `DEGRADED` state in system tables when a connector is in a
temporarily degraded state, for example when `rdkafka` can't reach a broker
but just sits and waits for it to come back online.
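
The following is a rough sketch of how the proposed lifecycle could be modeled
internally. These are not existing Materialize types, just one possible shape:

```rust
use std::time::SystemTime;

/// Rough sketch of the proposed connector lifecycle states; names and shape
/// are illustrative, not existing Materialize types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConnectorStatus {
    Starting,
    Running,
    /// Temporarily unable to make progress (e.g. a Kafka broker is
    /// unreachable) but expected to recover without user intervention.
    Degraded { reason: String },
    /// Fatal failure that requires user intervention.
    Failed { error: String },
    Canceled,
}

/// A status change as it might be recorded and surfaced through the proposed
/// "status" columns and `mz_connector_errors` view.
#[derive(Debug, Clone)]
pub struct StatusUpdate {
    pub connector_id: String,
    pub status: ConnectorStatus,
    pub at: SystemTime,
}
```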

#### Dealing with transitive failures

What should the state of a sink be when one of the sources that it depends on
fails? I think it should also be put in a failed state, with
`mz_connector_errors` indicating that it failed because one of its inputs
failed.
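
A minimal sketch of this propagation, assuming we track which connectors and
views depend on each connector (all names here are hypothetical):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: when a connector fails, mark everything that
/// transitively depends on it as failed too, recording the root cause.
/// `dependents` maps a connector/view id to the ids that consume its output.
fn propagate_failure(
    failed_id: &str,
    dependents: &BTreeMap<String, Vec<String>>,
    statuses: &mut BTreeMap<String, String>,
) {
    let mut to_visit = vec![failed_id.to_string()];
    let mut seen = BTreeSet::new();
    while let Some(id) = to_visit.pop() {
        if !seen.insert(id.clone()) {
            continue;
        }
        for dep in dependents.get(&id).into_iter().flatten() {
            statuses.insert(
                dep.clone(),
                format!("FAILED: upstream connector {failed_id} failed"),
            );
            to_visit.push(dep.clone());
        }
    }
}
```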

## Musings/Future Work

- Add something like `RESTART SOURCE` or `RESTART SINK` that can get a
source/sink out of a wedged state. This is potentially complicated because
restarting sources would also require restarting dependent sinks.
- We can think about letting the system restart a fatally failed connector
instead of requiring user intervention, as we do now. For this, we would
need global `Retry` settings. Either defaults or user-settable.
- When we add *one-shot sinks* later, we can add a state `FINISHED` that would
indicate finished sinks.
- Keep entries for dropped/failed connectors in a `mz_connector_history` view,
or keep dropped connectors in `mz_connector` for a while.

## Alternatives

We could introduce global settings for `timeout`, `num retries`, and
potentially `back-off interval`. Not all connector SDKs support specifying all
of these, but those that do would use the global configuration instead of
relying on arbitrary hard-coded values.
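
For illustration, such a global configuration might look roughly like the
following; the struct and field names are assumptions, not an existing
Materialize configuration:

```rust
use std::time::Duration;

/// Assumed names, not an existing Materialize configuration: global knobs
/// that connector SDKs supporting them would read instead of relying on
/// hard-coded values.
#[derive(Debug, Clone)]
pub struct ConnectorRetryConfig {
    /// Timeout for a single external call (e.g. a metadata fetch).
    pub timeout: Duration,
    /// How many times to retry a failed call before giving up.
    pub max_retries: u32,
    /// How long to wait between attempts.
    pub backoff_interval: Duration,
}

impl Default for ConnectorRetryConfig {
    fn default() -> Self {
        ConnectorRetryConfig {
            timeout: Duration::from_secs(1),
            max_retries: 3,
            backoff_interval: Duration::from_secs(5),
        }
    }
}
```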

I don't think this is too valuable, though, because these settings are usually
too connector-specific.

## Open Questions

There were some, which led to the addition of the `DEGRADED` state above.

## References

- Product thoughts from Nicolle: https://docs.google.com/document/d/10RsEnpJJBN-lQKyYl08KRQ15Ckn-_gCWa5MAnZcc-ts/edit#heading=h.yvax2aldz1n9
- Meta-issue by Eli: https://github.com/MaterializeInc/materialize/issues/7115

Critical:
- https://github.com/MaterializeInc/materialize/issues/7396: sink error during startup kills materialized
- https://github.com/MaterializeInc/materialize/issues/7009: failure in S3 source is printed to log but source doesn't error out
- https://github.com/MaterializeInc/materialize/issues/6586: transient broker failure leads to error in log, but source doesn't report as errored and doesn't continue to produce data

Unknown:
- https://github.com/MaterializeInc/materialize/issues/7035: fetching in source doesn't continue after transient failure
- https://github.com/MaterializeInc/materialize/issues/6916: metadata fetch errors in log

Related:
- https://github.com/MaterializeInc/materialize/issues/3122: "no complete timestamps" message is not useful
- related: https://github.com/MaterializeInc/materialize/issues/2917

Future work, a.k.a. non-goals:
- https://github.com/MaterializeInc/materialize/discussions/5340: more structured user errors (error, details, hints)

Merged:
- https://github.com/MaterializeInc/materialize/pull/6952: graceful retries for S3