storage: rename `stalled` source/sink status to `errored` #26749
Comments
These are interesting examples because they feel pretty different to me! The former seems like the sort of thing that should be reported as a status issue, and if we weren't ~eventually noticing it and reporting It's interesting that these seem to fall in the same bucket for users, though. Maybe "able to make progress" is not the right conceptual line to draw, regardless of what we call it. Also: at a quick check, we don't seem to define these states publicly anywhere? Adding a tooltip to the console or a section in the docs might help users get the right idea even if the status name is not perfectly self-explanatory...
Ah, so this is at the heart of what I think is confusing about the Totally agree that it would be a bug if we don't eventually notice that the network connection has wedged and restart the source.
I think that's right. Some in-product tooltips would definitely go a long way. Although I think
For sure... though this is tricky to map to, e.g., the Kafka source, which does not generally need to restart on network issues... the retry loop is handled mostly within the Kafka client. I'm starting to feel like part of the difficulty here is that the statuses mix a couple of layers of abstraction: "is my timely dataflow restarting right now" is a lower-level / operational concern, whereas "this dataflow is permanently failed / needs to be recreated" seems higher-level / black-box. The "stalled" status may make more or less sense depending on which mindset you have...
Yeah, I like this framing a lot! Personally comfortable leaning into the "lower level" statuses.
I feel like this is a good point in favor of renaming the
The `stalled` status for a source or a sink has led to user confusion. Users often expect to see a `stalled` status when a source is technically running but not keeping up with ingestion (e.g., because the pod has a broken network connection but hasn't realized it, or because the source is using 100% CPU). But in fact that's not what `stalled` means! Stalled means that the source encountered a non-fatal error that will be retried after a 30s backoff period.

I propose that we find a name for `stalled` that better reflects that it indicates a transient error that will be retried. That will hopefully clarify that the statuses in `mz_{source|sink}_status_history` reflect the lifecycle of the source (starting, running, failed and retrying, failed forever, successfully completed forever), and that whether or not a running source is actually keeping up with ingestion or stalled requires looking elsewhere (e.g., at frontiers and statistics).

Context
See also `stalled` source/sink status to `errored` #26749

cc @guswynn @SangJunBak @bkirwi
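
To make the distinction in the issue body concrete, here is a rough Python sketch that separates the lifecycle status from the "is it keeping up" question. Everything in it is an assumption for illustration: the `Status` member names, the `Progress` fields, and the `max_lag` threshold are hypothetical and are not taken from the actual implementation or from the `mz_{source|sink}_status_history` schema.

```python
from dataclasses import dataclass
from enum import Enum

# Backoff before a transient error is retried, as stated in the issue (30s).
RETRY_BACKOFF_SECS = 30


class Status(Enum):
    """Lifecycle states listed in the issue; member names are illustrative."""
    STARTING = "starting"
    RUNNING = "running"
    STALLED = "stalled"      # failed and retrying (the name proposed for change)
    FAILED = "failed"        # failed forever
    COMPLETED = "completed"  # hypothetical name: successfully completed forever


@dataclass
class Progress:
    """Hypothetical progress snapshot, e.g. derived from frontiers/statistics."""
    frontier: int  # timestamp up to which data has been ingested
    upstream: int  # latest timestamp available upstream


def will_retry(status: Status) -> bool:
    """Only the transient-error state is retried after the backoff."""
    return status is Status.STALLED


def is_keeping_up(status: Status, progress: Progress, max_lag: int) -> bool:
    """A source keeps up only if it is running and its lag is within max_lag.

    The status alone cannot answer this: a RUNNING source may still lag.
    """
    lag = progress.upstream - progress.frontier
    return status is Status.RUNNING and lag <= max_lag
```

For example, `is_keeping_up(Status.RUNNING, Progress(frontier=100, upstream=500), max_lag=60)` is false even though the status is `RUNNING`, which is the case users tend to describe as "stalled".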