storage: rename stalled source/sink status to errored #26749

Open
benesch opened this issue Apr 21, 2024 · 4 comments
Labels
A-storage Area: storage

Comments

@benesch
Member

benesch commented Apr 21, 2024

The stalled status for a source or a sink has led to user confusion. Users often expect to see a stalled status when a source is technically running but not keeping up with ingestion (e.g., because the pod has a broken network connection but hasn't realized it, or because the source is using 100% CPU). But in fact that's not what stalled means! Stalled means that the source encountered a non-fatal error that will be retried after a 30s backoff period.

I propose that we find a name for stalled that better reflects that it indicates a transient error that will be retried. That will hopefully clarify that the statuses in mz_{source|sink}_status_history reflect the lifecycle of the source (starting, running, failed and retrying, failed forever, successfully completed forever), and that whether or not a running source is actually keeping up with ingestion or stalled requires looking elsewhere (e.g., at frontiers and statistics).
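The split between lifecycle and health can be sketched in a few lines of Python. This is an illustration only, not Materialize's implementation: the `Status` member names (including `ERRORED` as the proposed replacement for `stalled`, and `COMPLETED` for "successfully completed forever"), the `will_retry` and `is_keeping_up` helpers, and the `frontier_lag_ms` / `max_lag_ms` parameters are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    """Lifecycle states as listed in this issue (member names are hypothetical)."""

    STARTING = "starting"
    RUNNING = "running"
    ERRORED = "errored"      # today's `stalled`: non-fatal error, retried after backoff
    FAILED = "failed"        # failed forever
    COMPLETED = "completed"  # successfully completed forever


def will_retry(status: Status) -> bool:
    """Only the transient-error state implies an upcoming retry."""
    return status is Status.ERRORED


def is_keeping_up(status: Status, frontier_lag_ms: int, max_lag_ms: int = 10_000) -> bool:
    """Health is a separate question from lifecycle.

    A source can be RUNNING and still fall behind, so this check looks at
    an assumed frontier-lag measurement rather than at the status alone.
    """
    return status is Status.RUNNING and frontier_lag_ms <= max_lag_ms
```

Under these assumptions, a source with `Status.RUNNING` and a large `frontier_lag_ms` is "running" in the lifecycle sense but not keeping up, which is the case users tend to call "stalled".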

Context

See also

cc @guswynn @SangJunBak @bkirwi

@bkirwi
Contributor

bkirwi commented Apr 22, 2024

> e.g., because the pod has a broken network connection but hasn't realized it, or because the source is using 100% CPU

These are interesting examples because they feel pretty different to me! The former seems like the sort of thing that should be reported as a status issue, and if we weren't ~eventually noticing it and reporting stalled I'd feel like that were a bug. Whereas the latter doesn't feel like a status issue -- since the source is able to make progress, just more slowly than we'd want.

It's interesting that these seem to fall in the same bucket for users, though. Maybe "able to make progress" is not the right conceptual line to draw, regardless of what we call it.

Also: at a quick check, we don't seem to define these states publicly anywhere? Adding a tooltip to the console or a section in the docs might help users get the right idea even if the status name is not perfectly self-explanatory...

@benesch
Member Author

benesch commented Apr 22, 2024

> The former seems like the sort of thing that should be reported as a status issue, and if we weren't ~eventually noticing it and reporting stalled I'd feel like that were a bug.

Ah, so this is at the heart of what I think is confusing about the stalled name. Colloquially, I'd say that the source has "stalled" during the period where the network connection has broken but we haven't yet noticed, and I'd say it's "restarting" once we've noticed the failed network connection and timed out. But this doesn't match how the mz_source_statuses table reports the situation. A stalled network connection doesn't show up in the status initially, and then once it times out, the timeout error shows up as stalled in the table.
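The mismatch can be made concrete with a small Python sketch. It is a toy model under stated assumptions, not how Materialize detects failures: `reported_status`, `colloquial_description`, and the 60-second `timeout_s` are hypothetical names and values chosen for illustration.

```python
from typing import Optional


def reported_status(seconds_since_break: Optional[float], timeout_s: float = 60.0) -> str:
    """Status as the table would report it in this toy model.

    `seconds_since_break` is None while the connection is healthy. The break
    is only noticed once the (assumed) timeout elapses.
    """
    if seconds_since_break is None or seconds_since_break < timeout_s:
        return "running"
    return "stalled"


def colloquial_description(seconds_since_break: Optional[float], timeout_s: float = 60.0) -> str:
    """What a user would likely call the same moments in time."""
    if seconds_since_break is None:
        return "running"
    if seconds_since_break < timeout_s:
        return "stalled"
    return "restarting"
```

In this model, ten seconds after the break the table says `running` while a user would say "stalled", and after the timeout the table says `stalled` while a user would say "restarting".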

Totally agree that it would be a bug if we don't eventually notice that the network connection has wedged and restart the source.

> Also: at a quick check, we don't seem to define these states publicly anywhere? Adding a tooltip to the console or a section in the docs might help users get the right idea even if the status name is not perfectly self-explanatory...

I think that's right. Some in-product tooltips would definitely go a long way. Although I think stalled has proven to be confusing enough that it's worth renaming, vs just explaining further with a tooltip.

@bkirwi
Contributor

bkirwi commented Apr 23, 2024

> Colloquially, I'd say that the source has "stalled" during that period where the network connection has broken but we haven't yet noticed, and I'd say it's "restarting" once we've noticed the failed network connection and timed out.

For sure... though this is tricky to map to e.g. the Kafka source, which does not generally need to restart on network issues... the retry loop is handled mostly within the Kafka client.

I'm starting to feel like part of the difficulty here is that the statuses mix a couple layers of abstraction: "is my timely dataflow restarting right now" is a lower-level / operational concern, where "this dataflow is permanently failed / needs to be recreated" seems higher-level / black-box. The "stalled" status may make more or less sense depending on which mindset you have...

@benesch
Member Author

benesch commented Apr 28, 2024

> I'm starting to feel like part of the difficulty here is that the statuses mix a couple layers of abstraction: "is my timely dataflow restarting right now" is a lower-level / operational concern, where "this dataflow is permanently failed / needs to be recreated" seems higher-level / black-box. The "stalled" status may make more or less sense depending on which mindset you have...

Yeah, I like this framing a lot!

I'm personally comfortable leaning into the "lower level" statuses. started and stalled are fundamentally about the lifecycle of the low-level dataflow, and it seems hard to rip those out now.

> though this is tricky to map to e.g. the Kafka source, which does not generally need to restart on network issues... the retry loop is handled mostly within the Kafka client.

I feel like this is a good point in favor of renaming the stalled status to errored or retrying! The new status name would be less prescriptive about whether the source was making progress or not, and would allow us to emit errored statuses for Kafka sources that, e.g., encountered an error on one partition but were perhaps still making progress on other partitions.
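As a rough Python sketch of that idea (an illustration, not the actual Kafka source code): the `PartitionState` type and the `source_status` and `making_progress` helpers are hypothetical, and the rule "any partition error yields `errored`" is an assumption about how such a status might be derived.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PartitionState:
    """Hypothetical per-partition view of a Kafka source."""

    offset_advanced: bool        # did this partition ingest new data recently?
    error: Optional[str] = None  # last transient error, if any


def source_status(partitions: Dict[int, PartitionState]) -> str:
    """Report `errored` if any partition hit an error, regardless of progress."""
    if any(p.error is not None for p in partitions.values()):
        return "errored"
    return "running"


def making_progress(partitions: Dict[int, PartitionState]) -> bool:
    """Progress is tracked separately from the reported status."""
    return any(p.offset_advanced for p in partitions.values())


partitions = {
    0: PartitionState(offset_advanced=True),
    1: PartitionState(offset_advanced=False, error="broker unreachable"),
}
```

With the sample `partitions` above, `source_status` returns `"errored"` while `making_progress` is still true, a combination that the name `stalled` would misleadingly rule out.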
