conditionally create mz_system and mz_probe cluster replicas#31452

Merged
SangJunBak merged 4 commits into main from
jun/#8954/remove-system-clusters
Feb 13, 2025

Conversation

@SangJunBak
Contributor

See commit messages for details

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak SangJunBak requested a review from a team as a code owner February 10, 2025 22:13
@SangJunBak SangJunBak requested a review from aljoscha February 10, 2025 22:13
@SangJunBak SangJunBak marked this pull request as draft February 10, 2025 22:14
@SangJunBak SangJunBak requested review from jubrad and removed request for aljoscha February 10, 2025 22:14
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch 2 times, most recently from d584a05 to 067349a Compare February 11, 2025 00:40
…lication factor

This change makes the replication factor of builtin clusters configurable during bootstrap. This will be useful for disabling certain clusters in self-managed deployments without breaking any of our test infrastructure.
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from 067349a to 2381138 Compare February 11, 2025 00:59
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from 2381138 to 1369e7f Compare February 11, 2025 01:04
@SangJunBak
Contributor Author

@jubrad I'm unable to run bin/orchestratord to test this because of the following error:

  Compiling mz-ore v0.1.0 (/Users/sangjunbak/materialize/src/ore)
error[E0432]: unresolved import `chrono`
  --> src/ore/src/panic.rs:31:5
   |
31 | use chrono::Utc;
   |     ^^^^^^ use of undeclared crate or module `chrono`

For more information about this error, try `rustc --explain E0432`.
error: could not compile `mz-ore` (lib) due to 1 previous error
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/mzimage.py", line 135, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/mzimage.py", line 37, in main
    deps.acquire()
  File "/Users/sangjunbak/materialize/misc/python/materialize/mzbuild.py", line 1080, in acquire
    prep = self._prepare_batch(deps_to_build)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sangjunbak/materialize/misc/python/materialize/mzbuild.py", line 1051, in _prepare_batch
    pre_image_prep[cls] = pre_image.prepare_batch(instances)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sangjunbak/materialize/misc/python/materialize/mzbuild.py", line 525, in prepare_batch
    spawn.runv(cargo_build, cwd=rd.root)
  File "/Users/sangjunbak/materialize/misc/python/materialize/spawn.py", line 75, in runv
    return subprocess.run(
           ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['env', 'CMAKE_SYSTEM_NAME=Linux', 'CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-unknown-linux-gnu-cc', 'CARGO_TARGET_DIR=/Users/sangjunbak/materialize/target-xcompile', 'TARGET_AR=aarch64-unknown-linux-gnu-ar', 'TARGET_CPP=aarch64-unknown-linux-gnu-cpp', 'TARGET_CC=aarch64-unknown-linux-gnu-cc', 'TARGET_CXX=aarch64-unknown-linux-gnu-c++', 'TARGET_CXXSTDLIB=static=stdc++', 'TARGET_LD=aarch64-unknown-linux-gnu-ld', 'TARGET_RANLIB=aarch64-unknown-linux-gnu-ranlib', 'RUSTFLAGS=--cfg=tokio_unstable -Clink-arg=-Wl,--compress-debug-sections=zlib -Clink-arg=-Wl,-O3 -Csymbol-mangling-version=v0 --cfg=tokio_unstable -L/opt/homebrew/Cellar/aarch64-unknown-linux-gnu/0.1.0/bin/../aarch64-unknown-linux-gnu/sysroot/lib -Clink-arg=-fuse-ld=lld -Clink-arg=-B/opt/homebrew/opt/lld/bin', 'cargo', 'build', '--target', 'aarch64-unknown-linux-gnu', '--bin', 'orchestratord', '--package=mz-orchestratord', '--release']' returned non-zero exit status 101.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/orchestratord.py", line 422, in <module>
    main()
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/orchestratord.py", line 76, in main
    args.func(args)
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/orchestratord.py", line 80, in run
    acquire(
  File "/Users/sangjunbak/materialize/misc/python/materialize/cli/orchestratord.py", line 356, in acquire
    subprocess.check_call(["bin/mzimage", "acquire", image, *args])
  File "/opt/homebrew/Cellar/python@3.11/3.11.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bin/mzimage', 'acquire', 'orchestratord']' returned non-zero exit status 1.

I'm guessing it's because I added adapter-types to sqllogictest's Cargo.toml? Regardless, I don't think this should block the review.

@SangJunBak SangJunBak marked this pull request as ready for review February 11, 2025 01:07
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from 1369e7f to dba9f12 Compare February 11, 2025 04:25
@SangJunBak SangJunBak requested a review from ParkMyCar February 11, 2025 18:13
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from dba9f12 to 6993b53 Compare February 11, 2025 18:55
Contributor

@jubrad jubrad left a comment

I'm not sure about the changes in src/orchestratord/src/controller/materialize/environmentd.rs, and some of the comments need to be cleaned up.

@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from 6993b53 to 4e4e92a Compare February 11, 2025 20:49
@SangJunBak SangJunBak force-pushed the jun/#8954/remove-system-clusters branch from 4e4e92a to 3a76dd4 Compare February 12, 2025 04:36
@SangJunBak SangJunBak requested a review from jubrad February 12, 2025 04:37
Contributor

@ParkMyCar ParkMyCar left a comment

Overall LGTM, would wait to merge until you get @jubrad's approval for orchestratord changes though!

Comment on lines 1147 to 1162
let cluster_config = match cluster_name {
    name if name == mz_catalog::builtin::MZ_SYSTEM_CLUSTER.name => &self.system_cluster,
    name if name == mz_catalog::builtin::MZ_CATALOG_SERVER_CLUSTER.name => {
        &self.catalog_server_cluster
    }
    name if name == mz_catalog::builtin::MZ_PROBE_CLUSTER.name => &self.probe_cluster,
    name if name == mz_catalog::builtin::MZ_SUPPORT_CLUSTER.name => &self.support_cluster,
    name if name == mz_catalog::builtin::MZ_ANALYTICS_CLUSTER.name => {
        &self.analytics_cluster
    }
    _ => {
        return Err(mz_catalog::durable::CatalogError::Catalog(
            SqlCatalogError::UnexpectedBuiltinCluster(cluster_name.to_owned()),
        ))
    }
};
Contributor

nit: it feels like this would be better written as if-else statements? If you want to prevent all of the Ok wrapping you can do:

let cluster_config = if cluster_name == "foo" {
    &self.foo_cluster
} else {
    return ...
}
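
For illustration, the if-else shape suggested above could look roughly like the sketch below. The `ClusterConfig` and `BuiltinClusterConfigs` types, the `String` error, and the literal cluster names are stand-ins assumed for this example; the actual code compares against the `mz_catalog::builtin` constants and returns a `CatalogError`.

```rust
/// Hypothetical stand-in for the per-cluster bootstrap settings.
#[derive(Debug, PartialEq)]
pub struct ClusterConfig {
    pub replication_factor: u32,
}

/// Hypothetical holder for the builtin cluster configs (only two shown).
pub struct BuiltinClusterConfigs {
    pub system_cluster: ClusterConfig,
    pub probe_cluster: ClusterConfig,
}

impl BuiltinClusterConfigs {
    /// Look up a config by cluster name, returning early on unknown names.
    pub fn get_config(&self, cluster_name: &str) -> Result<&ClusterConfig, String> {
        let cluster_config = if cluster_name == "mz_system" {
            &self.system_cluster
        } else if cluster_name == "mz_probe" {
            &self.probe_cluster
        } else {
            // The early return keeps the other branches free of `Ok(...)` wrapping.
            return Err(format!("unexpected builtin cluster: {cluster_name}"));
        };
        Ok(cluster_config)
    }
}
```

Every branch except the last evaluates to a reference, so the result is wrapped in `Ok` exactly once at the end.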

Contributor Author

sgtm!

Comment on lines +21 to +22
pub const SUPPORT_CLUSTER_DEFAULT_REPLICATION_FACTOR: u32 = 0;
pub const ANALYTICS_CLUSTER_DEFAULT_REPLICATION_FACTOR: u32 = 0;
Contributor

While you're here, can you add a comment explaining why these have a default of 0? e.g. that they are ephemeral clusters we spin up only to scrape analytics or for support when debugging

Contributor Author

sgtm!

long,
env = "BOOTSTRAP_BUILTIN_SYSTEM_CLUSTER_REPLICATION_FACTOR",
default_value = SYSTEM_CLUSTER_DEFAULT_REPLICATION_FACTOR.to_string(),
value_parser = clap::value_parser!(u32).range(0..=1)
Contributor

nit: I would maybe allow 0..=2 for these ranges, seems like it could maybe be useful in the future
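
As a rough sketch of what the wider range means for validation, here is a standard-library-only check that accepts 0, 1, and 2 and rejects everything else. `parse_replication_factor` and `ALLOWED` are hypothetical names for this example; the PR itself relies on clap's `value_parser!(u32).range(...)` rather than a hand-written helper.

```rust
use std::ops::RangeInclusive;

/// Allowed replication factors, using the wider range suggested in review.
const ALLOWED: RangeInclusive<u32> = 0..=2;

/// Parse a flag value and reject anything outside `ALLOWED`.
/// Hypothetical helper, not the clap-generated parser.
pub fn parse_replication_factor(raw: &str) -> Result<u32, String> {
    let value: u32 = raw
        .trim()
        .parse()
        .map_err(|e| format!("`{raw}` is not a valid u32: {e}"))?;
    if ALLOWED.contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "{value} is not in {}..={}",
            ALLOWED.start(),
            ALLOWED.end()
        ))
    }
}
```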

Comment on lines +698 to +717
builtin_system_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: SYSTEM_CLUSTER_DEFAULT_REPLICATION_FACTOR,
},
builtin_catalog_server_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: CATALOG_SERVER_CLUSTER_DEFAULT_REPLICATION_FACTOR,
},
builtin_probe_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: PROBE_CLUSTER_DEFAULT_REPLICATION_FACTOR,
},
builtin_support_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: SUPPORT_CLUSTER_DEFAULT_REPLICATION_FACTOR,
},
builtin_analytics_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: ANALYTICS_CLUSTER_DEFAULT_REPLICATION_FACTOR,
},
Contributor

Specifying these 5 fields all the time feels brittle, e.g. easy to mix up the config values for two different clusters. A fix might be to use newtypes, e.g.

pub struct SystemClusterReplicationFactor(usize);
pub struct CatalogServerClusterReplicationFactor(usize);
...

But that seems quite tedious too. If you feel inspired, thinking about how we can make this more succinct might be nice, but it's definitely not blocking!

Contributor Author

@SangJunBak SangJunBak Feb 12, 2025

Specifying these 5 fields all the time feels brittle

hmm my Rust knowledge is kinda capped here! Like the cluster_config variables? Would that require a new BootstrapBuiltinClusterConfig struct per <builtin cluster>ClusterReplicationFactor(usize)? Or could you do something like:

... 
builtin_analytics_cluster_config: BootstrapBuiltinClusterConfig {
    size: replica_size.clone(),
    replication_factor: AnalyticsClusterReplicationFactor(ANALYTICS_CLUSTER_DEFAULT_REPLICATION_FACTOR),
},

If we had to make a new ...ClusterConfig struct per builtin cluster, i kinda feel like it makes each config less generic which might be bad? Might be helpful to go over this in person since I'm genuinely curious!
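
One possible answer, sketched under assumptions rather than taken from the PR: a single generic config struct tagged with a zero-sized marker type keeps the config generic while still making it a compile error to pass one cluster's config where another's is expected. All names below (`System`, `Analytics`, `ClusterConfig<C>`, `BootstrapArgs`) are hypothetical.

```rust
use std::marker::PhantomData;

/// Hypothetical marker types, one per builtin cluster. They carry no data.
pub struct System;
pub struct Analytics;

/// One generic config shared by every cluster; `C` only tags which cluster
/// the config belongs to and takes no space at runtime.
pub struct ClusterConfig<C> {
    pub size: String,
    pub replication_factor: u32,
    _cluster: PhantomData<C>,
}

impl<C> ClusterConfig<C> {
    pub fn new(size: &str, replication_factor: u32) -> Self {
        Self {
            size: size.to_owned(),
            replication_factor,
            _cluster: PhantomData,
        }
    }
}

/// Hypothetical bootstrap arguments. Assigning a `ClusterConfig<Analytics>`
/// to `system` would fail to type-check.
pub struct BootstrapArgs {
    pub system: ClusterConfig<System>,
    pub analytics: ClusterConfig<Analytics>,
}
```

The trade-off is extra type machinery for a mix-up that code review may already catch, which fits the "not blocking" framing above.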

Comment on lines +841 to +858
config
    .bootstrap_builtin_system_cluster_replication_factor
    .as_ref()
    .map(|replication_factor| {
        format!("--bootstrap-builtin-system-cluster-replication-factor={replication_factor}")
    }),
config
    .bootstrap_builtin_probe_cluster_replication_factor
    .as_ref()
    .map(|replication_factor| format!("--bootstrap-builtin-probe-cluster-replication-factor={replication_factor}")),
config
    .bootstrap_builtin_support_cluster_replication_factor
    .as_ref()
    .map(|replication_factor| format!("--bootstrap-builtin-support-cluster-replication-factor={replication_factor}")),
config
    .bootstrap_builtin_analytics_cluster_replication_factor
    .as_ref()
    .map(|replication_factor| format!("--bootstrap-builtin-analytics-cluster-replication-factor={replication_factor}")),
Contributor

Missing a config for the catalog_server replication factor, maybe that's intentional? It seems like it would be nice to have parity for all system clusters here

Contributor Author

Per Justin:

I think we should only include these options where it wouldn't be a complete footgun to set it to a value other than 1.

And I kinda agree with him here!
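
For readers skimming the thread, the optional-flag pattern in the snippet under review can be sketched in isolation as follows. The `Config` struct and its field names are simplified assumptions; only the two flag strings are taken from the diff. A flag is emitted only when a value is set, so an unset option leaves the default in place.

```rust
/// Hypothetical subset of the orchestrator-side config: `None` means
/// "do not pass the flag, let the default apply".
#[derive(Default)]
pub struct Config {
    pub system_replication_factor: Option<u32>,
    pub probe_replication_factor: Option<u32>,
}

/// Build the extra arguments, emitting a flag only when a value is set.
pub fn replication_factor_args(config: &Config) -> Vec<String> {
    vec![
        config.system_replication_factor.map(|rf| {
            format!("--bootstrap-builtin-system-cluster-replication-factor={rf}")
        }),
        config.probe_replication_factor.map(|rf| {
            format!("--bootstrap-builtin-probe-cluster-replication-factor={rf}")
        }),
    ]
    .into_iter()
    // `flatten` drops the `None` entries and unwraps the `Some` ones.
    .flatten()
    .collect()
}
```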

Contributor

@jubrad jubrad left a comment

LGTM

@jubrad jubrad added the self-managed-backport-v25.1 Needs to be backported into the v25.1 self-managed release label Feb 12, 2025
- Widens the allowed replication factor range for system clusters from 0-1 to 0-2, providing more flexibility in cluster configuration.
- Refactors the code for retrieving builtin cluster configurations to use a more concise if-else structure.
- Adds comments for builtin clusters with a default replication factor of 0.
@SangJunBak SangJunBak enabled auto-merge (squash) February 13, 2025 19:16
@SangJunBak SangJunBak merged commit b2128e7 into main Feb 13, 2025
81 checks passed
@SangJunBak SangJunBak deleted the jun/#8954/remove-system-clusters branch February 13, 2025 20:15
def- pushed a commit to def-/materialize that referenced this pull request Feb 27, 2025
…lizeInc#31452)

See commit messages for details

  * This PR adds a known-desirable feature.
MaterializeInc/database-issues#8954

Labels

self-managed-backport-v25.1-done
self-managed-backport-v25.1: Needs to be backported into the v25.1 self-managed release

4 participants