[Epic] storage: improve precision of source/sink status reporting #20036

Open
1 of 5 tasks
benesch opened this issue Jun 20, 2023 · 7 comments

benesch commented Jun 20, 2023

Notion epic

Link: https://www.notion.so/storage-improve-precision-of-source-sink-status-reporting-2346c110a137476b88bed95c562060e8
Product brief:
Status: Not started
Prioritization: 10 - October
Estimated delivery date:

Updated overview

Source statuses have been a point of confusion for many users lately. Here's a rough summary of the issues:

  • Subsource statuses are confusing. Sometimes subsource statuses don't line up with the parent source. E.g., a subsource can get stuck in "starting" if it has produced no data since the last restart, even though it is in fact healthy.
  • Not all transient errors are reported as stalled. E.g., Kafka network errors never bubble up to the source status.
  • Some definite errors are reported as stalled, but not all definite errors are reported as such.
  • The stalled status is confusing—not all errors that we might want to report in mz_source_statuses cause stalls, and CPU overload can cause a stall that never gets reported as a stalled status. (A query sketch for inspecting these statuses follows this list.)
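
For context, the statuses above are what users see when querying mz_internal.mz_source_statuses. A minimal sketch of that query (the column names are as of the time of writing and may differ across Materialize versions):

```sql
-- Inspect the current reported status of every source and subsource.
-- The view name and columns (name, type, status, error) reflect
-- mz_internal.mz_source_statuses at the time of writing and may differ
-- across Materialize versions.
SELECT name, type, status, error
FROM mz_internal.mz_source_statuses
ORDER BY name;
```

A healthy-but-idle PostgreSQL subsource shows up as "starting" in this output, which is the first point of confusion above.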

Here's a quick proposal for how to improve the situation:

  • Rename stalled to retrying or somesuch (see discussion in storage: rename stalled source/sink status to errored #26749).
  • Teach Kafka sources to emit transient network errors to the source status as a retrying status.
  • Stop emitting definite but retractable errors to the source status collection. These errors should be written only to the data shard. The presence of definite but retractable errors can be detected in the console via (SELECT FROM {source} LIMIT 1) for now (see the sketch after this list), and eventually the ergonomics will be improved via [Epic] Expose error stream via SQL ("bad data"/"programmable errors") #22430. This should eliminate the defect where PostgreSQL/MySQL subsources can get stuck in "starting" if they emit no data after a restart.
  • Handle final errors (e.g., replication slot dropped) by moving the source to "ceased". I think it's okay and correct to not emit the error to the data shard and leave it readable—semantically, we want to represent that the shard is "frozen" in time. I think we want to avoid advancing the write frontier to the end of time to allow for a future feature that allows repairing the shard by taking a new snapshot against the upstream database and producing the diff.
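
For the probe mentioned in the third bullet, a minimal sketch (my_source is a placeholder source name):

```sql
-- Cheap probe for definite-but-retractable errors: if the source's data
-- shard contains such an error, reading the source should surface it
-- rather than returning rows. 'my_source' is a placeholder name.
SELECT 1 FROM my_source LIMIT 1;
```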

The next step here is to flesh the above proposal out into a mini-design doc. It may be worth downscoping the proposal to something that can be implemented more quickly, but we should find a cutpoint that makes the behavior of mz_source_statuses easy to explain to end users.

Historical issue

We have four reports of the mz_{source|sink}_status_history tables presenting an imprecise accounting of a source/sink's status that's led to user confusion:

Fixing these issues has been challenging because of the current architecture. Each source/sink is responsible for reporting its own status from clusterd. That means that if clusterd goes down (because all replicas of a cluster have been dropped, or Kubernetes is having issues), no error is reported in the table.
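
In the meantime, the only history users can consult is whatever the source itself managed to report, e.g. (the source ID 'u123' is a placeholder, and the columns reflect mz_internal.mz_source_status_history at the time of writing):

```sql
-- Review the status transitions reported for a single source.
-- 'u123' is a placeholder; real IDs can be looked up in mz_sources.
-- If clusterd is down, nothing is written here at all, which is the
-- gap described above.
SELECT occurred_at, status, error
FROM mz_internal.mz_source_status_history
WHERE source_id = 'u123'
ORDER BY occurred_at DESC;
```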

We have a design doc that will address all of these issues by changing the architecture to instead have environmentd write status updates to the mz_{source|sink}_status_history tables. This will allow environmentd to update the status to "unknown" if it has trouble communicating with a cluster:

Other issues that would improve the situation here:

cc @dseisun-materialize — just tracking/bundling the work that @rkrishn7 is already doing in #19792.

@benesch benesch added the C-feature Category: new feature or request label Jun 20, 2023
@aljoscha aljoscha added the A-storage Area: storage label Jun 20, 2023
ParkMyCar added a commit that referenced this issue Aug 29, 2023
This PR updates the `mz_source_statuses` BuiltinView to have a default
status of "running" for webhook sources. Webhook sources don't currently
report any status, since the mechanism to do so lives in clusterd,
whereas webhook sources run in environmentd. There is ongoing work to
change this; see #20036.

For now, to improve this UX papercut, we change the default status of
webhook sources to "running" instead of "created", since currently there
is no distinction between the two for webhooks.

### Motivation

* This PR fixes a previously unreported bug.

In the `mz_source_statuses` BuiltinView, which powers the web console, we
report all webhook sources as "Created", whereas the healthy state for
all sources is "Running".
[Slack](https://materializeinc.slack.com/archives/CU7ELJ6E9/p1692724790997529)

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- Changes the user-facing status of webhooks from "created" to "running"
to unify the UX for source status reporting.
mgree pushed a commit to mgree/materialize that referenced this issue Nov 20, 2023
…rializeInc#21461)

guswynn commented Nov 27, 2023

@bkirwi @benesch
#23222 originally partially implemented #16580, but I removed it because it ultimately revealed information about our k8s setup, which to me seemed effectively uninterpretable by a user. Do we actually want this? Any pod error that isn't reported by mz_cluster_replica_statuses is a bug on our end, right?

bkirwi commented Nov 27, 2023

It seems broadly useful to me for "the cluster the source/sink is running on is totally busted" to show up in the source/sink status at some point, even if it's mentioned in replica statuses, if only to send people hunting less. (It is ~the sort of source-wide health issue that that table can effectively represent.)

But I don't feel strongly about it at all!

benesch commented Nov 27, 2023

Post #23222, how does the issue present? Say I have a source whose cluster is in CrashLoopBackoff—the source keeps panicking right after startup, and so the backoff period gets longer and longer. Will I see "starting", "running", "stalled", (backoff occurs), "starting", "running", "stalled", (backoff occurs), "starting", "running", "stalled"? Or do I see "starting", "running", (backoff occurs), "starting", "running", (backoff occurs)?

Agree with not wanting to leak too much k8s state into the source_status_history relation. But agree with Ben that we want to reflect something in the source status history table—and I think that "something" is somehow ensuring that the status is "stalled" when the source replica's pod is in CrashLoopBackoff.

guswynn commented Nov 28, 2023

@benesch I am not sure; I don't think we test that, because in general we don't have a thing like mz_panic to force a storage cluster to go down (though we could do so with that new unification feature flag)

Definitely worth looking into telling the user that the cluster crashed, but we should be thoughtful about how to present it; unfortunately the RehydratingStorageClient makes this a bit annoying to manage...

benesch commented Nov 28, 2023

> I don't think we test that, because in general we don't have a thing like mz_panic to force a storage cluster to go down (though we could do so with that new unification feature flag)

Should be able to do it with mzcompose pretty easily! Either use docker exec to kill the clusterd process every few seconds, or use REMOTE replicas and c.down.

> but we should be thoughtful about how to present it; unfortunately the RehydratingStorageClient makes this a bit annoying to manage...

Can't you just have the RehydratingStorageClient conjure up a "SourceStatus(Stalled)" message every time it observes the client disconnect? The stalled error message can be something generic like "internal error: lost connection to source cluster".

guswynn commented Nov 28, 2023

> Can't you just have the RehydratingStorageClient conjure up a "SourceStatus(Stalled)" message every time it observes the client disconnect?

Right now sources start paused; with this change they would also start with this stall, unless the rehydrating client knows more about whether it's waiting on the first connection, I think?

benesch commented Nov 28, 2023

Right, exactly! You want the rehydrating client to produce the stall message on a disconnection event, and not in the initial disconnected state.

@benesch benesch changed the title [Epic] Improve precision of source/sink status reporting storage: improve precision of source/sink status reporting Feb 3, 2024
@materialize-bot materialize-bot changed the title storage: improve precision of source/sink status reporting [Epic] storage: improve precision of source/sink status reporting Aug 27, 2024
@morsapaes morsapaes assigned rjobanp and unassigned petrosagg Sep 4, 2024