storage: rename `stalled` source/sink status to `errored` #26749

benesch · 2024-04-21T04:42:31Z

The stalled status for a source or a sink has led to user confusion. Users often expect to see a stalled status when a source is technically running but not keeping up with ingestion (e.g., because the pod has a broken network connection but hasn't realized it, or because the source is using 100% CPU). But in fact that's not what stalled means! Stalled means that the source encountered a non-fatal error that will be retried after a 30s backoff period.
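
To make the current meaning concrete, here is a minimal sketch of the retry behavior described above: a non-fatal error is reported as `stalled`, then retried after a 30s backoff. It is not Materialize's implementation. `run_with_retries`, `TransientError`, the injectable `sleep`, and the `starting`/`running` status strings are all hypothetical names chosen for illustration, and the loop is bounded only so that the example terminates.

```python
import time
from typing import Callable, List, Tuple

# Hypothetical names throughout; this is an illustration, not Materialize code.
RETRY_BACKOFF_SECS = 30.0  # the backoff period mentioned in the issue text


class TransientError(Exception):
    """A non-fatal error that is worth retrying."""


def run_with_retries(
    step: Callable[[], None],
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Run `step`, recording (status, detail) events like a status history."""
    history: List[Tuple[str, str]] = []
    for attempt in range(1, max_attempts + 1):
        history.append(("starting", f"attempt {attempt}"))
        try:
            history.append(("running", ""))
            step()
            return history
        except TransientError as err:
            # The status is reported only once the error has surfaced,
            # not while the underlying problem is still going unnoticed.
            history.append(("stalled", str(err)))
            sleep(RETRY_BACKOFF_SECS)
    return history
```

Note that nothing in this loop reports `stalled` while `step` is merely slow or hung, which is the mismatch with user expectations that this issue is about.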

I propose that we find a name for stalled that better reflects that it indicates a transient error that will be retried. That will hopefully clarify that the statuses in mz_{source|sink}_status_history reflect the lifecycle of the source (starting, running, failed and retrying, failed forever, successfully completed forever), and that whether or not a running source is actually keeping up with ingestion or stalled requires looking elsewhere (e.g., at frontiers and statistics).
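
The separation proposed above can be sketched as follows: the lifecycle status answers "what phase is the source in", and a separate measurement answers "is a running source keeping up". Everything here is an assumption made for illustration: the `LifecycleStatus` labels simply mirror the lifecycle listed above, and `frontier_lag_secs` and the 10-second threshold are hypothetical stand-ins for whatever frontiers and statistics actually provide.

```python
from dataclasses import dataclass
from enum import Enum


class LifecycleStatus(Enum):
    # Hypothetical labels mirroring the lifecycle listed in the issue text.
    STARTING = "starting"
    RUNNING = "running"
    ERRORED = "errored"      # failed and retrying (today's `stalled`)
    FAILED = "failed"        # failed forever
    COMPLETED = "completed"  # successfully completed forever


@dataclass
class SourceSnapshot:
    status: LifecycleStatus
    # Hypothetical metric: how far the source's frontier trails wall-clock time.
    frontier_lag_secs: float


def describe(snapshot: SourceSnapshot, max_lag_secs: float = 10.0) -> str:
    """Combine lifecycle status with a separate keeping-up check."""
    if snapshot.status is not LifecycleStatus.RUNNING:
        # Lag is only meaningful to report for a source that is running.
        return snapshot.status.value
    if snapshot.frontier_lag_secs > max_lag_secs:
        return "running, but falling behind"
    return "running and keeping up"
```

Under this framing, a source pinned at 100% CPU would still be `running`; the fact that it is behind comes from the lag measurement, not from the status.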

Context

bkirwi · 2024-04-22

These are interesting examples because they feel pretty different to me! The former seems like the sort of thing that should be reported as a status issue, and if we weren't ~eventually noticing it and reporting stalled I'd feel like that were a bug. Whereas the latter doesn't feel like a status issue -- since the source is able to make progress, just more slowly than we'd want.

It's interesting that these seem to fall in the same bucket for users, though. Maybe "able to make progress" is not the right conceptual line to draw, regardless of what we call it.

Also: at a quick check, we don't seem to define these states publicly anywhere? Adding a tooltip to the console or a section in the docs might help users get the right idea even if the status name is not perfectly self-explanatory...

benesch · 2024-04-22T17:42:14Z

> The former seems like the sort of thing that should be reported as a status issue, and if we weren't ~eventually noticing it and reporting stalled I'd feel like that were a bug.

Ah, so this is at the heart of what I think is confusing about the stalled name. Colloquially, I'd say that the source has "stalled" during that period where the network connection has broken but we haven't yet noticed, and I'd say it's "restarting" once we've noticed the failed network connection and timed out. But this doesn't match how the mz_source_statuses table reports the situation. A stalled network connection doesn't show up in the status initially, and then once it times out, the connection timeout error shows up as stalled in the table.

Totally agree that it would be a bug if we don't eventually notice that the network connection has wedged and restart the source.

> Also: at a quick check, we don't seem to define these states publicly anywhere? Adding a tooltip to the console or a section in the docs might help users get the right idea even if the status name is not perfectly self-explanatory...

I think that's right. Some in-product tooltips would definitely go a long way. Although I think stalled has proven to be confusing enough that it's worth renaming, vs just explaining further with a tooltip.

bkirwi · 2024-04-23T15:43:37Z

> Colloquially, I'd say that the source has "stalled" during that period where the network connection has broken but we haven't yet noticed, and I'd say it's "restarting" once we've noticed the failed network connection and timed out.

For sure... though this is tricky to map to eg. the Kafka source, which does not generally need to restart on network issues... the retry loop is handled mostly within the Kafka client.

I'm starting to feel like part of the difficulty here is that the statuses mix a couple layers of abstraction: "is my timely dataflow restarting right now" is a lower-level / operational concern, where "this dataflow is permanently failed / needs to be recreated" seems higher-level / black-box. The "stalled" status may make more or less sense depending on which mindset you have...

benesch · 2024-04-28T21:43:16Z

> I'm starting to feel like part of the difficulty here is that the statuses mix a couple layers of abstraction: "is my timely dataflow restarting right now" is a lower-level / operational concern, where "this dataflow is permanently failed / needs to be recreated" seems higher-level / black-box. The "stalled" status may make more or less sense depending on which mindset you have...

Yeah, I like this framing a lot!

I'm personally comfortable leaning into the "lower level" statuses. `started` and `stalled` are fundamentally about the lifecycle of the low-level dataflow, and it seems hard to rip those out now.

> though this is tricky to map to eg. the Kafka source, which does not generally need to restart on network issues... the retry loop is handled mostly within the Kafka client.

I feel like this is a good point in favor of renaming the stalled status to errored or retrying! The new status name would be less prescriptive about whether the source was making progress or not, and would allow us to emit errored statuses for Kafka sources that e.g. encountered an error on one partition but perhaps were making progress on other partitions.
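
As a rough sketch of that last point, a status like `errored` could be derived from per-partition errors without claiming anything about overall progress. The `summarize` function and its input shape (a map from partition id to an optional error message) are hypothetical and do not reflect how the Kafka source actually tracks errors.

```python
from typing import Dict, Optional, Tuple


def summarize(partition_errors: Dict[int, Optional[str]]) -> Tuple[str, bool]:
    """Return (status, making_progress) for a hypothetical multi-partition source.

    `partition_errors` maps a partition id to its latest error, or None if
    that partition is healthy.
    """
    errors = {p: e for p, e in partition_errors.items() if e is not None}
    healthy = len(partition_errors) - len(errors)
    # Any erroring partition makes the source `errored`, regardless of progress.
    status = "errored" if errors else "running"
    # Progress is tracked separately: at least one partition is still healthy.
    return status, healthy > 0
```

For example, `summarize({0: None, 1: "broker unreachable", 2: None})` yields `("errored", True)`: the source reports an error while still making progress, a combination that a name like `stalled` would misdescribe.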

benesch added the A-storage Area: storage label Apr 21, 2024

benesch mentioned this issue Apr 21, 2024 in [Epic] storage: improve precision of source/sink status reporting #20036 (Open, 5 tasks)


bkirwi commented Apr 22, 2024
