[Epic] storage: improve precision of source/sink status reporting #20036

Open
1 of 5 tasks
benesch opened this issue Jun 20, 2023 · 7 comments

benesch commented Jun 20, 2023

Notion epic

Link: https://www.notion.so/storage-improve-precision-of-source-sink-status-reporting-2346c110a137476b88bed95c562060e8
Product brief:
Status: Not started
Prioritization: 10 - October
Estimated delivery date:

Updated overview

Source statuses have been a point of confusion for many users lately. Here's a rough summary of the issues:

  • Subsource statuses are confusing. Sometimes subsource statuses don't line up with the parent source. E.g., a subsource can get stuck in "starting" if it has produced no data since the last restart, even though it is in fact healthy.
  • Not all transient errors are reported as stalled. E.g., Kafka network errors never bubble up to the source status.
  • Some definite errors are reported as stalled, but not all definite errors are reported as such.
  • The stalled status is confusing—not all errors that we might want to report in mz_source_statuses cause stalls, and CPU overload can cause a stall that never gets reported as a stalled status. (A query sketch for inspecting these statuses follows this list.)
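
For context, the statuses above are what users see when querying mz_internal.mz_source_statuses. A minimal sketch of that query (the column names are as of the time of writing and may differ across Materialize versions):

```sql
-- Inspect the current reported status of every source and subsource.
-- The view name and columns (name, type, status, error) reflect
-- mz_internal.mz_source_statuses at the time of writing and may differ
-- across Materialize versions.
SELECT name, type, status, error
FROM mz_internal.mz_source_statuses
ORDER BY name;
```

A healthy-but-idle PostgreSQL subsource shows up as "starting" in this output, which is the first point of confusion above.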

Here's a quick proposal for how to improve the situation:

  • Rename stalled to retrying or somesuch (see discussion in storage: rename stalled source/sink status to errored #26749).
  • Teach Kafka sources to emit transient network errors to the source status as a retrying status.
  • Stop emitting definite but retractable errors to the source status collection. These errors should be written only to the data shard. The presence of definite but retractable errors can be detected in the console via (SELECT FROM {source} LIMIT 1) for now (see the sketch after this list), and eventually the ergonomics will be improved via [Epic] Expose error stream via SQL ("bad data"/"programmable errors") #22430. This should eliminate the defect where PostgreSQL/MySQL subsources can get stuck in "starting" if they emit no data after a restart.
  • Handle final errors (e.g., replication slot dropped) by moving the source to "ceased". I think it's okay and correct to not emit the error to the data shard and leave it readable—semantically, we want to represent that the shard is "frozen" in time. I think we want to avoid advancing the write frontier to the end of time to allow for a future feature that allows repairing the shard by taking a new snapshot against the upstream database and producing the diff.
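
For the probe mentioned in the third bullet, a minimal sketch (my_source is a placeholder source name):

```sql
-- Cheap probe for definite-but-retractable errors: if the source's data
-- shard contains such an error, reading the source should surface it
-- rather than returning rows. 'my_source' is a placeholder name.
SELECT 1 FROM my_source LIMIT 1;
```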

The next step here is to flesh the above proposal out into a mini-design doc. It may be worth downscoping the proposal to something that can be implemented more quickly, but we should find a cutpoint that makes the behavior of mz_source_statuses easy to explain to end users.

Historical issue

We have four reports of the mz_{source|sink}_status_history tables presenting an imprecise accounting of a source/sink's status that's led to user confusion:

Fixing these issues has been challenging because of the current architecture. Each source/sink is responsible for reporting its own status from clusterd. That means that if clusterd goes down (because all replicas of a cluster have been dropped, or Kubernetes is having issues), no error is reported in the table.
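
In the meantime, the only history users can consult is whatever the source itself managed to report, e.g. (the source ID 'u123' is a placeholder, and the columns reflect mz_internal.mz_source_status_history at the time of writing):

```sql
-- Review the status transitions reported for a single source.
-- 'u123' is a placeholder; real IDs can be looked up in mz_sources.
-- If clusterd is down, nothing is written here at all, which is the
-- gap described above.
SELECT occurred_at, status, error
FROM mz_internal.mz_source_status_history
WHERE source_id = 'u123'
ORDER BY occurred_at DESC;
```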

We have a design doc that will address all of these issues by changing the architecture to instead have environmentd write status updates to the mz_{source|sink}_status_history tables. This will allow environmentd to update the status to "unknown" if it has trouble communicating with a cluster:

Other issues that would improve the situation here:

cc @dseisun-materialize — just tracking/bundling the work that @rkrishn7 is already doing in #19792.

@benesch benesch added the C-feature Category: new feature or request label Jun 20, 2023
@aljoscha aljoscha added the A-storage Area: storage label Jun 20, 2023
ParkMyCar added a commit that referenced this issue Aug 29, 2023
This PR updates the `mz_source_statuses` BuiltinView to have a default
status of "running" for webhook sources. Webhook sources don't currently
report any status, since the mechanism to do so lives in clusterd,
whereas webhook sources run in environmentd. There is ongoing work to
change this; see #20036.

For now, to improve this UX papercut, we change the default status of
webhook sources to "running" instead of "created", since currently there
is no distinction between the two for webhooks.

### Motivation

* This PR fixes a previously unreported bug.

In the `mz_source_statuses` BuiltinView, which powers the web console, we
report all webhook sources as "Created", whereas the healthy state for
all sources is "Running".
[Slack](https://materializeinc.slack.com/archives/CU7ELJ6E9/p1692724790997529)

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- Changes the user-facing status of webhooks from "created" to "running"
to unify the UX for source status reporting.
mgree pushed a commit to mgree/materialize that referenced this issue Nov 20, 2023
…rializeInc#21461)

guswynn commented Nov 27, 2023

@bkirwi @benesch
#23222 originally partially implemented #16580, but I removed it because it ultimately revealed information about our k8s setup, which to me seemed effectively uninterpretable by a user. Do we actually want this? Any pod error that isn't reported by mz_cluster_replica_statuses is a bug on our end, right?

bkirwi commented Nov 27, 2023

It seems broadly useful to me for "the cluster the source/sink is running on is totally busted" to show up in the source/sink status at some point, even if it's mentioned in replica statuses, if only to send people hunting less. (It is ~the sort of source-wide health issue that that table can effectively represent.)

But I don't feel strongly about it at all!

benesch commented Nov 27, 2023

Post #23222, how does the issue present? Say I have a source whose cluster is in CrashLoopBackoff—the source keeps panicking right after startup, and so the backoff period gets longer and longer. Will I see "starting", "running", "stalled", (backoff occurs), "starting", "running", "stalled", (backoff occurs), "starting", "running", "stalled"? Or do I see "starting", "running", (backoff occurs), "starting", "running", (backoff occurs)?

Agree with not wanting to leak too much k8s state into the source_status_history relation. But agree with Ben that we want to reflect something in the source status history table—and I think that "something" is somehow ensuring that the status is "stalled" when the source replica's pod is in CrashLoopBackoff.

guswynn commented Nov 28, 2023

@benesch I am not sure; I don't think we test that, because in general we don't have a thing like mz_panic to force a storage cluster to go down (though we could do so with that new unification feature flag)

Definitely worth looking into telling the user that the cluster crashed, but we should be thoughtful about how to present it; unfortunately the RehydratingStorageClient makes this a bit annoying to manage...

benesch commented Nov 28, 2023

> I don't think we test that, because in general we don't have a thing like mz_panic to force a storage cluster to go down (though we could do so with that new unification feature flag)

Should be able to do it with mzcompose pretty easily! Either use docker exec to kill the clusterd process every few seconds, or use REMOTE replicas and c.down.

> but we should be thoughtful about how to present it; unfortunately the RehydratingStorageClient makes this a bit annoying to manage...

Can't you just have the RehydratingStorageClient conjure up a "SourceStatus(Stalled)" message every time it observes the client disconnect? The stalled error message can be something generic like "internal error: lost connection to source cluster".

guswynn commented Nov 28, 2023

> Can't you just have the RehydratingStorageClient conjure up a "SourceStatus(Stalled)" message every time it observes the client disconnect?

Right now sources start paused; with this change they would also start with this stall, unless the rehydrating client knows more about whether it's waiting on the first connection, I think?

benesch commented Nov 28, 2023

Right, exactly! You want the rehydrating client to produce the stall message on a disconnection event, and not in the initial disconnected state.

@benesch benesch changed the title [Epic] Improve precision of source/sink status reporting storage: improve precision of source/sink status reporting Feb 3, 2024
@materialize-bot materialize-bot changed the title storage: improve precision of source/sink status reporting [Epic] storage: improve precision of source/sink status reporting Aug 27, 2024
@morsapaes morsapaes assigned rjobanp and unassigned petrosagg Sep 4, 2024