Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface OOMs in mz_cluster_replica_statuses #18796

Merged
merged 3 commits into from
Apr 17, 2023

Conversation

umanwizard
Copy link
Contributor

@umanwizard umanwizard commented Apr 17, 2023

When reporting a pod status change, check whether it's due to OOM, and report this in mz_cluster_replica_statuses

Tested manually. No automated tests because I don't know a way to force pods to OOM that doesn't take several minutes.

Partially fixes #18621

@umanwizard umanwizard requested a review from teskje April 17, 2023 14:07
@umanwizard umanwizard requested review from benesch and a team as code owners April 17, 2023 14:07
@umanwizard umanwizard requested a review from a team April 17, 2023 14:08
src/orchestrator-kubernetes/src/lib.rs Outdated Show resolved Hide resolved
NotReady,
/// The inner element is `None` if the reason
/// is unknown
NotReady(Option<NotReadyReason>),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of Option<NotReadyReason> what do you think about adding an Unknown variant to NotReadyReason? It would remove one layer of nesting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "extra" layer of nesting matches the actual value we emit to SQL -- NULL corresponds to None, whereas an inhabited SQL value corresponds to Some.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh that makes sense!

Comment on lines 1682 to +1683
.with_column("status", ScalarType::String.nullable(false))
.with_column("reason", ScalarType::String.nullable(true))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these two need to be separate, or can they be folded in together? e.g. making them separate technically allows us to represent status: "ready" and reason: "OOMKilled", which should be an invalid state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but it also feels weird to have ready, not-ready-oom-killed, and not-ready-other as the possible values in a single column. The reason for non-readiness feels like a logically separate piece of information.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking you could distill the reasons a bit, i.e. today you could have one of three statuses:

  1. ready
  2. oom-killed
  3. not-ready

Where oom-killed implies "not ready". I totally see where you're coming from though w.r.t. non-readiness being a logically separate piece of info, so I could go either way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me! At least from a Surfaces point of view

Comment on lines 1682 to +1683
.with_column("status", ScalarType::String.nullable(false))
.with_column("reason", ScalarType::String.nullable(true))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking you could distill the reasons a bit, i.e. today you could have one of three statuses:

  1. ready
  2. oom-killed
  3. not-ready

Where oom-killed implies "not ready". I totally see where you're coming from though w.r.t. non-readiness being a logically separate piece of info, so I could go either way

NotReady,
/// The inner element is `None` if the reason
/// is unknown
NotReady(Option<NotReadyReason>),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh that makes sense!

Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea seems sound, I just have a couple improvement suggestions.

Also note that in mz_cluster_replica_statuses you will need to be fast/lucky to see the "reason" since the replica will be restarting quickly (unless it's crashlooping). Is this relation one of the "retained metrics" ones?

src/adapter/src/catalog.rs Outdated Show resolved Hide resolved
src/orchestrator-kubernetes/src/lib.rs Outdated Show resolved Hide resolved
src/orchestrator/src/lib.rs Outdated Show resolved Hide resolved
AdapterNotice::ClusterReplicaStatusChanged { status, .. } => {
match status {
ServiceStatus::NotReady(None) => Some("The cluster replica may be restarting or going offline.".into()),
ServiceStatus::NotReady(Some(NotReadyReason::OOMKilled)) => Some("The cluster replica may have run out of memory and been killed.".into()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use "may" here? If k8s says the replica was OOM-killed, is there a chance that this is not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know -- I doubt it! I was just trying to match the language in the other similar notice (the NotReady(None) case). Probably we should let @mjibson or someone else on Surfaces decide what the correct language is here for both cases.

src/adapter/src/catalog/builtin_table_updates.rs Outdated Show resolved Hide resolved
@umanwizard
Copy link
Contributor Author

@teskje

Is this relation one of the "retained metrics" ones?

No, but as a follow-up PR, we will make it one (assuming the memory requirements are acceptable -- I'll do some math first and post it in the comment of that PR).

For more details see the discussion here.

@umanwizard umanwizard merged commit 9679528 into MaterializeInc:main Apr 17, 2023
@benesch
Copy link
Member

benesch commented Apr 17, 2023

Tested manually. No automated tests because I don't know a way to force pods to OOM that doesn't take several minutes.

Eep, hang on! This needs a test, or it is virtually certain to regress. Let's figure out how to write one, possibly involving the QA team if necessary.

@umanwizard
Copy link
Contributor Author

umanwizard commented Apr 17, 2023 via email

@umanwizard
Copy link
Contributor Author

test Pr is here: #18806

morsapaes added a commit that referenced this pull request Apr 28, 2023
Retroactively adding a release note and patching the docs after #18796.

---------

Co-authored-by: umanwizard <brennan@umanwizard.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Report OOMs in introspection
4 participants