Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test for oom-killed reason in mz_cluster_replica_statuses #18806

Merged
merged 10 commits into from
Apr 19, 2023

Conversation

umanwizard
Copy link
Contributor

@umanwizard umanwizard commented Apr 17, 2023

This test creates an index on a query that is known to OOM, repeatedly checks mz_cluster_replica_statuses until the OOM is observed, clears the query, and then repeatedly checks until the cluster is observed ready.

It's possible to imagine that this test can have false negatives (i.e., the test is green but the underlying behavior is broken) if certain patterns of flapping occur in the data. I think this is unlikely to be a concern in practice. It would be more precise/elegant if we could control how long it takes a pod to restart after OOMing, but there doesn't currently seem to be a way to do so.

@umanwizard umanwizard requested a review from benesch April 17, 2023 20:33
@umanwizard umanwizard changed the title . Add test for oom-killed reason in mz_cluster_replica_statuses Apr 17, 2023
@umanwizard umanwizard marked this pull request as ready for review April 17, 2023 20:37
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this! Deferring review to the QA team.

def verify_status(status: str, reason: Optional[str]) -> None:
while True:
(status, reason) = mz.environmentd.sql_query(
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may wish to use dedent` for the SQL for aeeaaesthetic reasons

verify_status("not-ready", "oom-killed")

mz.environmentd.sql("DROP VIEW v CASCADE")
# Now that we've dropped the problematic view, the replica should come back
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may wish to confirm the successful comeback of the replica in question by issuing a SELECT against that replica.

Copy link
Contributor

@philip-stoev philip-stoev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! If there is any flakiness in this test, we will address it as it happens.

@umanwizard
Copy link
Contributor Author

umanwizard commented Apr 18, 2023 via email

@umanwizard umanwizard enabled auto-merge (squash) April 19, 2023 14:05
@umanwizard umanwizard merged commit b42bb1d into MaterializeInc:main Apr 19, 2023
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful, thanks Brennan!

@def- def- mentioned this pull request May 16, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants