cloudtest: Fortify the test_oom_clusterd test #18861

Merged


philip-stoev
Contributor

The test was forcing a clusterd to OOM, but this could also cause the entire Buildkite instance that it is running on to become unresponsive.

Fix by running the test against a memory-constrained Mz cluster that will OOM without bringing down the entire machine.
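A minimal sketch of the idea, not the code from this PR: a size map gains one entry with a hard memory cap, so a workload that exceeds the cap fails inside the replica instead of exhausting the host. The struct `ReplicaAllocationSketch`, the size names `"1"` and `"test-mem-1gib"`, and the helper `would_oom` are all hypothetical and only stand in for whatever the real `ClusterReplicaSizeMap` stores.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a replica size definition.
/// The real type in the codebase may have different fields.
#[derive(Debug, Clone, PartialEq)]
struct ReplicaAllocationSketch {
    /// `None` means the replica may use as much memory as the host offers.
    memory_limit_bytes: Option<u64>,
    workers: usize,
}

/// Builds a default size map containing one unconstrained size and one
/// memory-capped size intended for OOM tests (both names are assumptions).
fn default_sizes_sketch() -> BTreeMap<String, ReplicaAllocationSketch> {
    let mut inner = BTreeMap::new();
    inner.insert(
        "1".to_string(),
        ReplicaAllocationSketch { memory_limit_bytes: None, workers: 1 },
    );
    inner.insert(
        "test-mem-1gib".to_string(),
        ReplicaAllocationSketch { memory_limit_bytes: Some(1 << 30), workers: 1 },
    );
    inner
}

/// Returns true when a request for `requested_bytes` exceeds the size's cap.
/// An uncapped size never reports an OOM here; in practice it would be the
/// host machine that runs out of memory instead.
fn would_oom(alloc: &ReplicaAllocationSketch, requested_bytes: u64) -> bool {
    match alloc.memory_limit_bytes {
        Some(limit) => requested_bytes > limit,
        None => false,
    }
}
```

With the capped size, a 2 GiB request is rejected at the replica level, while the uncapped size gives no such protection, which is the failure mode that was making the CI agent unresponsive.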

Motivation

  • This PR fixes a previously unreported bug.

CI was failing with an "Agent lost" error, indicating a runaway process.

@@ -132,6 +135,16 @@ impl Default for ClusterReplicaSizeMap {
workers: scale.into(),
},
);

inner.insert(
Contributor Author

The new SIZE type, along with all the other types in this file, is a default that is only used for testing. In the cloud, a completely separate set of SIZEs is installed and used.

Contributor

I'd feel more confident if we had this behind an "if testing" check.

Contributor Author

@philip-stoev philip-stoev Apr 20, 2023

This entire set of sizes is the default when executing bin/environmentd, unless the entire SIZE list is overridden with a command-line option, which is what happens in the cloud.
There is no easy way to distinguish "test" from "non-test" invocations of bin/environmentd at this time.
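To illustrate the "defaults unless overridden" behavior described in this comment, here is a rough sketch. It assumes a made-up override format of comma-separated `name=bytes` pairs. The actual command-line option and its format are not shown in this thread, and the names `builtin_sizes`, `resolve_sizes`, `small`, `mem-capped`, and `prod-a` are hypothetical.

```rust
use std::collections::BTreeMap;

/// Built-in sizes: name -> optional memory limit in bytes.
/// Both entries are illustrative assumptions, not the real defaults.
fn builtin_sizes() -> BTreeMap<String, Option<u64>> {
    let mut sizes = BTreeMap::new();
    sizes.insert("small".to_string(), None);
    sizes.insert("mem-capped".to_string(), Some(1 << 30));
    sizes
}

/// Returns the built-in sizes when no override is given. When an override
/// is given, it replaces the built-in list entirely rather than merging
/// with it, so test-only sizes never appear in an overridden deployment.
fn resolve_sizes(
    override_arg: Option<&str>,
) -> Result<BTreeMap<String, Option<u64>>, String> {
    let spec = match override_arg {
        Some(spec) => spec,
        None => return Ok(builtin_sizes()),
    };
    let mut sizes = BTreeMap::new();
    for entry in spec.split(',') {
        // Each entry must look like `name=bytes` (assumed format).
        let (name, bytes) = entry
            .split_once('=')
            .ok_or_else(|| format!("malformed entry: {}", entry))?;
        let bytes: u64 = bytes
            .trim()
            .parse()
            .map_err(|e| format!("bad byte count in {}: {}", entry, e))?;
        sizes.insert(name.trim().to_string(), Some(bytes));
    }
    Ok(sizes)
}
```

Under these assumptions, `resolve_sizes(None)` contains the test-only `mem-capped` size, while `resolve_sizes(Some("prod-a=8589934592"))` contains only `prod-a`. That is why the new default does not leak into an environment that supplies its own list.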

src/adapter/src/catalog/config.rs
@philip-stoev philip-stoev merged commit 808c114 into MaterializeInc:main Apr 20, 2023
@umanwizard
Contributor

Thanks, Philip!

@def- def- mentioned this pull request May 16, 2023
5 tasks
4 participants