Surface OOMs in mz_cluster_replica_statuses
#18796
NotReady,
/// The inner element is `None` if the reason
/// is unknown
NotReady(Option<NotReadyReason>),
Instead of Option<NotReadyReason>, what do you think about adding an Unknown variant to NotReadyReason? It would remove one layer of nesting.
The "extra" layer of nesting matches the actual value we emit to SQL: NULL corresponds to None, whereas an inhabited SQL value corresponds to Some.
Ahh that makes sense!
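For readers following the thread, here is a minimal sketch of the None/NULL correspondence described above. It is not the code from this PR: the method names `status_column` and `reason_column` are hypothetical, and the string values `ready`, `not-ready`, and `oom-killed` are assumed for illustration.

```rust
/// Why a replica is not ready. Only the OOM case from this PR is modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NotReadyReason {
    OOMKilled,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceStatus {
    Ready,
    /// The inner element is `None` if the reason is unknown.
    NotReady(Option<NotReadyReason>),
}

impl ServiceStatus {
    /// Value for the non-nullable `status` column (strings are assumed).
    pub fn status_column(&self) -> &'static str {
        match self {
            ServiceStatus::Ready => "ready",
            ServiceStatus::NotReady(_) => "not-ready",
        }
    }

    /// Value for the nullable `reason` column; `None` stands for SQL NULL.
    pub fn reason_column(&self) -> Option<&'static str> {
        match self {
            // A ready replica and an unknown reason both map to NULL.
            ServiceStatus::Ready | ServiceStatus::NotReady(None) => None,
            ServiceStatus::NotReady(Some(NotReadyReason::OOMKilled)) => Some("oom-killed"),
        }
    }
}
```

With this shape, the Rust `Option` lines up one-to-one with the nullability of the `reason` column, which is the argument made above for keeping the extra layer of nesting.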
.with_column("status", ScalarType::String.nullable(false))
.with_column("reason", ScalarType::String.nullable(true))
Do these two need to be separate, or can they be folded in together? E.g., making them separate technically allows us to represent status: "ready" and reason: "OOMKilled", which should be an invalid state?
Indeed, but it also feels weird to have ready, not-ready-oom-killed, and not-ready-other as the possible values in a single column. The reason for non-readiness feels like a logically separate piece of information.
I was thinking you could distill the reasons a bit, i.e. today you could have one of three statuses:
- ready
- oom-killed
- not-ready
Where oom-killed implies "not ready". I totally see where you're coming from, though, w.r.t. non-readiness being a logically separate piece of info, so I could go either way.
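For comparison, the single-column alternative suggested above could look roughly like the sketch below. The type `DistilledStatus` and its methods are hypothetical and are not part of this PR; they only illustrate how three statuses would rule out the invalid "ready with a reason" state at the cost of folding the reason into the status.

```rust
/// Hypothetical single-column encoding of the three statuses listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistilledStatus {
    Ready,
    OomKilled,
    NotReady,
}

impl DistilledStatus {
    /// The string that would be stored in the single `status` column.
    pub fn as_str(&self) -> &'static str {
        match self {
            DistilledStatus::Ready => "ready",
            DistilledStatus::OomKilled => "oom-killed",
            DistilledStatus::NotReady => "not-ready",
        }
    }

    /// `oom-killed` implies "not ready", so only `Ready` counts as ready.
    pub fn is_ready(&self) -> bool {
        matches!(self, DistilledStatus::Ready)
    }
}
```

A consumer that only cares about readiness would then have to know that more than one value means "not ready", which is the trade-off raised in the reply above.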
Changes look good to me! At least from a Surfaces point of view
The idea seems sound; I just have a couple of improvement suggestions.
Also note that in mz_cluster_replica_statuses you will need to be fast/lucky to see the "reason", since the replica will be restarting quickly (unless it's crashlooping). Is this relation one of the "retained metrics" ones?
src/adapter/src/notice.rs
Outdated
AdapterNotice::ClusterReplicaStatusChanged { status, .. } => {
match status {
ServiceStatus::NotReady(None) => Some("The cluster replica may be restarting or going offline.".into()),
ServiceStatus::NotReady(Some(NotReadyReason::OOMKilled)) => Some("The cluster replica may have run out of memory and been killed.".into()),
Why do you use "may" here? If k8s says the replica was OOM-killed, is there a chance that this is not correct?
I don't know -- I doubt it! I was just trying to match the language in the other similar notice (the NotReady(None) case). Probably we should let @mjibson or someone else on Surfaces decide what the correct language is here for both cases.
Eep, hang on! This needs a test, or it is virtually certain to regress. Let's figure out how to write one, possibly involving the QA team if necessary.
Yes, Philip has already helped me find something, so I'm planning to add a test today.
On Mon, Apr 17, 2023 at 12:17 PM Nikhil Benesch ***@***.***> wrote:
> Tested manually. No automated tests because I don't know a way to force pods to OOM that doesn't take several minutes.
>
> Eep, hang on! This needs a test, or it is virtually certain to regress. Let's figure out how to write one, possibly involving the QA team if necessary.
Test PR is here: #18806
Retroactively adding a release note and patching the docs after #18796.
Co-authored-by: umanwizard <brennan@umanwizard.com>
When reporting a pod status change, check whether it's due to OOM, and report this in mz_cluster_replica_statuses.
Tested manually. No automated tests because I don't know a way to force pods to OOM that doesn't take several minutes.
Partially fixes #18621
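As a rough illustration of the check described above, the sketch below derives a status from simplified container state. It is not the code in this PR: the structs `Terminated` and `ContainerStatus` and the function `status_from_containers` are hypothetical stand-ins for the Kubernetes pod status types, and the only fact relied on is that Kubernetes reports the termination reason of an OOM-killed container as the string "OOMKilled".

```rust
/// Hypothetical, simplified view of a container's last termination state.
pub struct Terminated {
    pub reason: Option<String>,
}

/// Hypothetical, simplified view of one container in a pod.
pub struct ContainerStatus {
    pub ready: bool,
    pub last_terminated: Option<Terminated>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NotReadyReason {
    OOMKilled,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceStatus {
    Ready,
    /// The inner element is `None` if the reason is unknown.
    NotReady(Option<NotReadyReason>),
}

/// Derive a service status from container states (assumed logic).
pub fn status_from_containers(containers: &[ContainerStatus]) -> ServiceStatus {
    // Treat a pod with no containers as not ready, reason unknown.
    if !containers.is_empty() && containers.iter().all(|c| c.ready) {
        return ServiceStatus::Ready;
    }
    // Report OOM if any container's last termination reason was "OOMKilled".
    let oom_killed = containers.iter().any(|c| {
        c.last_terminated
            .as_ref()
            .and_then(|t| t.reason.as_deref())
            == Some("OOMKilled")
    });
    ServiceStatus::NotReady(oom_killed.then_some(NotReadyReason::OOMKilled))
}
```

Any termination reason other than "OOMKilled" falls through to `NotReady(None)`, matching the "reason is unknown" case in the enum's doc comment.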