Add test for `oom-killed` reason in `mz_cluster_replica_statuses` (#18806)
Force-pushed from 5d4e36d to 044af8a
Thank you very much for this! Deferring review to the QA team.
```python
def verify_status(status: str, reason: Optional[str]) -> None:
    while True:
        (status, reason) = mz.environmentd.sql_query(
            """
```
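A self-contained sketch of a polling helper along these lines — here `query_fn` is a hypothetical stand-in for the framework's `mz.environmentd.sql_query` call so the example runs on its own:

```python
import time
from typing import Callable, Optional, Tuple


def verify_status(
    query_fn: Callable[[], Tuple[str, Optional[str]]],
    status: str,
    reason: Optional[str],
    interval: float = 1.0,
) -> None:
    """Poll query_fn until it reports the expected (status, reason) pair."""
    while True:
        (actual_status, actual_reason) = query_fn()
        if (actual_status, actual_reason) == (status, reason):
            return
        time.sleep(interval)
```

In the real test, `query_fn` would issue the `SELECT` against `mz_cluster_replica_statuses`; the loop simply retries until the expected pair shows up.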
You may wish to use `dedent` for the SQL, for aesthetic reasons.
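The suggestion refers to `textwrap.dedent` from the Python standard library, which strips the common leading whitespace so a multi-line SQL string can be indented to match the surrounding code (the query text here is illustrative):

```python
from textwrap import dedent

# Indented to match the surrounding code; dedent removes the common prefix.
query = dedent(
    """
    SELECT status, reason
    FROM mz_internal.mz_cluster_replica_statuses
    """
)
```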
```python
verify_status("not-ready", "oom-killed")

mz.environmentd.sql("DROP VIEW v CASCADE")
# Now that we've dropped the problematic view, the replica should come back
```
You may wish to confirm that the replica in question has successfully come back by issuing a SELECT against it.
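One way to implement that check is a small retry helper that reissues the probe `SELECT` until the replica answers. The `run_select` parameter below is a hypothetical stand-in for the framework's query call:

```python
import time
from typing import Any, Callable, Optional


def wait_for_replica(
    run_select: Callable[[], Any], attempts: int = 60, interval: float = 1.0
) -> Any:
    """Retry a probe SELECT until the replica responds, or give up."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return run_select()
        except Exception as exc:  # replica still restarting
            last_error = exc
            time.sleep(interval)
    raise RuntimeError("replica did not come back") from last_error
```

In the test, `run_select` could be a lambda issuing something like `SELECT 1` on a connection pinned to the recovering replica.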
Thank you! If there is any flakiness in this test, we will address it as it happens.
This is still too slow to converge; opened a thread on Slack to discuss:
https://materializeinc.slack.com/archives/C01LKF361MZ/p1681825775251649
On Tue, Apr 18, 2023 at 2:02 AM, Philip Stoev approved this pull request:

> Thank you! If there is any flakiness in this test, we will address it as it happens.
Force-pushed from 47ad335 to 19ae3a2
Beautiful, thanks Brennan!
This test creates an index on a query that is known to OOM, repeatedly checks `mz_cluster_replica_statuses` until the OOM is observed, clears the query, and then repeatedly checks until the cluster is observed ready.

It's possible for this test to have false negatives (i.e., the test is green but the underlying behavior is broken) if certain patterns of flapping occur in the data. I think this is unlikely to be a concern in practice. It would be more precise/elegant if we could control how long it takes a pod to restart after OOMing, but there doesn't currently seem to be a way to do so.
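The control flow described above can be sketched with a simulated replica standing in for the real cluster — all of the names here are hypothetical; the real test drives Materialize through its test framework:

```python
from typing import Optional, Tuple


class FakeReplica:
    """Simulates a replica that OOMs while a problematic view exists."""

    def __init__(self) -> None:
        self.has_bad_view = False

    def create_bad_view(self) -> None:
        self.has_bad_view = True

    def drop_bad_view(self) -> None:
        self.has_bad_view = False

    def poll_status(self) -> Tuple[str, Optional[str]]:
        return ("not-ready", "oom-killed") if self.has_bad_view else ("ready", None)


def run_test(replica: FakeReplica) -> None:
    # 1. Create the query known to OOM.
    replica.create_bad_view()
    # 2. Poll until the OOM is observed in the status table.
    while replica.poll_status() != ("not-ready", "oom-killed"):
        pass
    # 3. Clear the problematic query.
    replica.drop_bad_view()
    # 4. Poll until the replica is observed ready again.
    while replica.poll_status() != ("ready", None):
        pass
```

The false-negative risk mentioned above lives in steps 2 and 4: if the status flaps between polls, a loop can observe the expected state for the wrong reason.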