Add test for `oom-killed` reason in `mz_cluster_replica_statuses` (#18806)
Force-pushed from 5d4e36d to 044af8a
Thank you very much for this! Deferring review to the QA team.
```python
def verify_status(status: str, reason: Optional[str]) -> None:
    while True:
        (status, reason) = mz.environmentd.sql_query(
            """
```
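A self-contained sketch of a polling helper along these lines — here `query_fn` is a hypothetical stand-in for the framework's `mz.environmentd.sql_query` call so the example runs on its own:

```python
import time
from typing import Callable, Optional, Tuple


def verify_status(
    query_fn: Callable[[], Tuple[str, Optional[str]]],
    status: str,
    reason: Optional[str],
    interval: float = 1.0,
) -> None:
    """Poll query_fn until it reports the expected (status, reason) pair."""
    while True:
        (actual_status, actual_reason) = query_fn()
        if (actual_status, actual_reason) == (status, reason):
            return
        time.sleep(interval)
```

In the real test, `query_fn` would issue the `SELECT` against `mz_cluster_replica_statuses`; the loop simply retries until the expected pair shows up.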
You may wish to use `dedent` for the SQL, for aesthetic reasons.
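The suggestion refers to `textwrap.dedent` from the Python standard library, which strips the common leading whitespace so a multi-line SQL string can be indented to match the surrounding code (the query text here is illustrative):

```python
from textwrap import dedent

# Indented to match the surrounding code; dedent removes the common prefix.
query = dedent(
    """
    SELECT status, reason
    FROM mz_internal.mz_cluster_replica_statuses
    """
)
```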
```python
verify_status("not-ready", "oom-killed")

mz.environmentd.sql("DROP VIEW v CASCADE")
# Now that we've dropped the problematic view, the replica should come back
```
You may wish to confirm that the replica in question has successfully come back by issuing a SELECT against it.
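One way to implement that check is a small retry helper that reissues the probe `SELECT` until the replica answers. The `run_select` parameter below is a hypothetical stand-in for the framework's query call:

```python
import time
from typing import Any, Callable, Optional


def wait_for_replica(
    run_select: Callable[[], Any], attempts: int = 60, interval: float = 1.0
) -> Any:
    """Retry a probe SELECT until the replica responds, or give up."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return run_select()
        except Exception as exc:  # replica still restarting
            last_error = exc
            time.sleep(interval)
    raise RuntimeError("replica did not come back") from last_error
```

In the test, `run_select` could be a lambda issuing something like `SELECT 1` on a connection pinned to the recovering replica.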
Thank you! If there is any flakiness in this test, we will address it as it happens.
This is still too slow to converge; opened a thread on Slack to discuss:
https://materializeinc.slack.com/archives/C01LKF361MZ/p1681825775251649
On Tue, Apr 18, 2023 at 2:02 AM, Philip Stoev approved this pull request:

> Thank you! If there is any flakiness in this test, we will address it as it happens.
Force-pushed from 47ad335 to 19ae3a2
Beautiful, thanks Brennan!
This test creates an index on a query that is known to OOM, repeatedly checks `mz_cluster_replica_statuses` until the OOM is observed, clears the query, and then repeatedly checks until the cluster is observed ready.

It's possible for this test to have false negatives (i.e., the test is green but the underlying behavior is broken) if certain patterns of flapping occur in the data. I think this is unlikely to be a concern in practice. It would be more precise/elegant if we could control how long it takes a pod to restart after OOMing, but there doesn't currently seem to be a way to do so.
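The control flow described above can be sketched with a simulated replica standing in for the real cluster — all of the names here are hypothetical; the real test drives Materialize through its test framework:

```python
from typing import Optional, Tuple


class FakeReplica:
    """Simulates a replica that OOMs while a problematic view exists."""

    def __init__(self) -> None:
        self.has_bad_view = False

    def create_bad_view(self) -> None:
        self.has_bad_view = True

    def drop_bad_view(self) -> None:
        self.has_bad_view = False

    def poll_status(self) -> Tuple[str, Optional[str]]:
        return ("not-ready", "oom-killed") if self.has_bad_view else ("ready", None)


def run_test(replica: FakeReplica) -> None:
    # 1. Create the query known to OOM.
    replica.create_bad_view()
    # 2. Poll until the OOM is observed in the status table.
    while replica.poll_status() != ("not-ready", "oom-killed"):
        pass
    # 3. Clear the problematic query.
    replica.drop_bad_view()
    # 4. Poll until the replica is observed ready again.
    while replica.poll_status() != ("ready", None):
        pass
```

The false-negative risk mentioned above lives in steps 2 and 4: if the status flaps between polls, a loop can observe the expected state for the wrong reason.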