spec-sheet: add new envd-object-scalability and cluster-object-limits scenarios #36540
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.
`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.
For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
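The incremental build described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `build_incrementally`, `create_object`, and `measure` are hypothetical names, and the real driver issues DDL rather than appending to lists.

```python
from typing import Callable, Iterable


def build_incrementally(
    sizes: Iterable[int],
    create_object: Callable[[int], None],
    measure: Callable[[int], None],
) -> int:
    """Grow the catalog to each size point in turn and measure at each one.

    Returns the total number of objects created.
    """
    current = 0
    for target in sorted(sizes):
        # Only create the objects missing since the previous size point,
        # so the total build cost is O(max(sizes)), not O(len(sizes) * N).
        for i in range(current, target):
            create_object(i)
        current = target
        measure(target)
    return current


# Toy usage: record which objects get created and which sizes get measured.
created: list[int] = []
measured: list[int] = []
total = build_incrementally([1, 10, 100], created.append, measured.append)
```

Each object index is created exactly once even though three size points are visited.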
Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).
This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two new scenarios -- cluster_object_limits_indexes and cluster_object_limits_mvs -- that find, per cluster size, the maximum number of idle materializations one cluster can keep fresh. The materializations are derived from a one-row, never-updated base table so the only work the cluster has to do is keep advancing each materialization's write_frontier in step with the upstream table. Once the cluster can't keep up, freshness collapses; the driver records the largest N at which `max(local_lag) < 2s` was still achievable, with the unhealthy data point recorded too so the cliff is visible. Staging-only (rejects --target=cloud-production), to avoid burning production resources on long object-limit searches.
…lability default at 50k
When a materialization stalls completely (write_frontier never advances past the minimum timestamp), `mz_internal.mz_materialization_lag` reports `now() - 0` = current unix time in ms (~1.78e12). Recorded as-is, this crushes every healthy data point to ~0 on the plot. Cap the recorded value at 10x the healthy threshold (= 20 s), preserve the underlying truth via the `healthy` column, and label the plot to make the cap and healthy threshold explicit.
Also drop 100_000 from the envd_scalability default size list: 50_000 is a more sensible default ceiling for staging. The full size list is still overridable via --envd-scalability-sizes for ad-hoc runs.
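A minimal sketch of the capping rule, assuming the 2 s healthy threshold stated earlier in this PR. The constant and function names are hypothetical; only the `time_ms` and `healthy` column names come from the description.

```python
LAG_THRESHOLD_MS = 2_000            # healthy threshold: max(local_lag) < 2s
LAG_CAP_MS = 10 * LAG_THRESHOLD_MS  # plotted values are capped at 20 s


def record_lag(max_lag_ms: float) -> dict:
    """Build the recorded row fields for one lag probe (illustrative only)."""
    # Judge health on the raw value, so the cap never hides an unhealthy probe.
    healthy = max_lag_ms < LAG_THRESHOLD_MS
    return {"time_ms": min(max_lag_ms, LAG_CAP_MS), "healthy": int(healthy)}


stalled = record_lag(1.78e12)  # write_frontier stuck at the minimum timestamp
normal = record_lag(150.0)
```

The stalled probe is recorded at the cap with `healthy` set to 0, while the healthy probe passes through unchanged.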
…tion
The release-qualification pipeline already runs three cluster-spec-sheet groups (cluster_compute on production, source_ingestion on production, environmentd on staging). Add two more groups -- envd_scalability and cluster_object_limits -- both running against staging, since both push the catalog / cluster to limits we don't want to exercise on production.
The three "envd / cluster" groups in the cluster-spec-sheet were named inconsistently. Settle on the three concept names the cluster-spec-sheet effort uses verbally:
- environmentd -> envd_qps_scalability (QPS vs envd CPU)
- envd_scalability -> envd_objects_scalability (latency vs catalog N)
- cluster_object_limits -> cluster_object_limits (unchanged)
Renames apply to: scenario constants, scenario-name string values, group keys in SCENARIO_GROUPS, class names, the run/analyze function names, the --envd-scalability-sizes CLI flag, the result CSV suffix, and the `mode` field written into CSV rows. The pre-existing QPS scenarios keep their individual `*_envd_strong_scaling` names since only the group is renamed. Also updates the release-qualification pipeline step ids/args and the README to match.
…w start
When debugging cluster-spec-sheet runs on staging it's hard to tell which environment we're actually talking to and whether the system parameter defaults we expect (lifted via LaunchDarkly or similar) are actually applied. Add a one-shot diagnostic right after target.initialize() that prints mz_environment_id() and SHOWs the limits the test depends on (max_tables, max_materialized_views, max_objects_per_schema, max_clusters, max_credit_consumption_rate, memory_limiter_interval). Best-effort: any probe error is logged and swallowed so a transient failure does not abort the workflow.
psycopg3's execute() requires a LiteralString, so the f-string SHOW query tripped pyright in CI. Compose the statement with psycopg.sql.SQL/Identifier instead, matching the pattern already used in test/orchestratord/mzcompose.py.
A staging run of `envd_objects_scalability_mvs` (release-qualification
build 1219) aborted at N≈19800 with:
Retryable error: consuming input failed: SSL error: unexpected eof
while reading, reconnecting...
psycopg.errors.InternalError_: materialized view
"materialize.pad_schema.pad_mv_19805" already exists
The TLS connection dropped mid-statement; envd had already committed the
CREATE but the response was lost. ConnectionHandler.retryable reconnects
and replays the same statement, which then fails with "already exists".
Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so
the retry is a no-op. Affects the bulk-creation paths in both
envd_objects_scalability scenarios (tables, MVs) and both
cluster_object_limits scenarios (indexed views, MVs). Add a docstring on
_bulk_run spelling out the idempotency requirement so future CREATEs
don't reintroduce the hazard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
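The failure mode and the fix can be reproduced with a toy model. This is a sketch only: `FlakyCatalog` and `create_with_retry` are hypothetical stand-ins for envd and the `ConnectionHandler.retryable` replay, and no real connection is involved.

```python
class FlakyCatalog:
    """Toy stand-in for envd: commits the first CREATE, then drops the reply."""

    def __init__(self) -> None:
        self.objects: set[str] = set()
        self.drop_next_reply = True

    def create(self, name: str, if_not_exists: bool) -> None:
        if name in self.objects:
            if if_not_exists:
                return  # replaying the statement is a no-op
            raise RuntimeError(f'"{name}" already exists')
        self.objects.add(name)  # the CREATE commits...
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise ConnectionError("unexpected eof")  # ...but the reply is lost


def create_with_retry(
    catalog: FlakyCatalog, name: str, if_not_exists: bool, attempts: int = 2
) -> None:
    """Replay the same statement after a connection error."""
    for attempt in range(attempts):
        try:
            catalog.create(name, if_not_exists)
            return
        except ConnectionError:
            if attempt == attempts - 1:
                raise


idempotent = FlakyCatalog()
create_with_retry(idempotent, "pad_mv_19805", if_not_exists=True)

plain = FlakyCatalog()
try:
    create_with_retry(plain, "pad_mv_19805", if_not_exists=False)
    replay_failed = False
except RuntimeError:
    replay_failed = True
```

With `if_not_exists=True` the replay succeeds silently; without it, the replay hits "already exists" even though the object was created.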
The 50k scale point pushes a single Envd Objects Scalability run past the 13-hour mark on staging — adapter latency degrades so much by then that each measurement repetition takes several seconds, and the catalog build itself runs at <1/s. 30k is where the interesting signal already lives. Drop 50k from the default list; ad-hoc runs that want it can still pass --envd-objects-scalability-sizes explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits N-list defaults to a +1k linear step past N=1000, which is too coarse: a run on staging showed the cliff sits in (1000, 2000] for cluster_object_limits_indexes across every cluster size 100cc..1600cc, and we can't tell from that whether the limit is 1100 or 1900.
After the coarse N-walk hits its first unhealthy point, bisect the (last_healthy, first_unhealthy) interval --cluster-object-limits-bisect-steps times (default 4) and probe each midpoint. The bisection step adds or drops objects in place, never rebuilding the catalog, so the cost is only ~bisect_steps extra hydrate-and-probe rounds per cluster size. With the default 4 steps, the cliff narrows to ±~60 objects.
Adds:
- `remove_objects(target_n)` symmetric to `add_objects(target_n)` on both ClusterObjectLimitsScenario subclasses. Indexes scenario drops via DROP VIEW ... CASCADE (cascades to the default index); MVs scenario drops via DROP MATERIALIZED VIEW.
- `--cluster-object-limits-bisect-steps` CLI flag plumbed through to `run_scenario_cluster_object_limits`.
- Bisection block in the per-cluster-size loop that calls add+remove (one is a no-op) and records each probe under the same CSV schema, so the existing freshness-lag-vs-N plot just gets denser near the cliff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
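A minimal sketch of the bounded bisection, assuming a probe callback that reports whether the cluster is healthy at a given N. `bisect_cliff` and `is_healthy` are hypothetical names; the real driver also adds or drops objects in place before each probe.

```python
from typing import Callable


def bisect_cliff(
    last_healthy: int,
    first_unhealthy: int,
    steps: int,
    is_healthy: Callable[[int], bool],
) -> tuple[int, int]:
    """Narrow the (last_healthy, first_unhealthy) interval by probing midpoints."""
    lo, hi = last_healthy, first_unhealthy
    for _ in range(steps):
        if hi - lo <= 1:
            break  # nothing left to narrow
        mid = (lo + hi) // 2
        if is_healthy(mid):
            lo = mid  # cliff is above mid
        else:
            hi = mid  # cliff is at or below mid
    return lo, hi


# Toy cluster that stays healthy up to and including N=1300.
probed: list[int] = []


def toy_probe(n: int) -> bool:
    probed.append(n)
    return n <= 1300


interval = bisect_cliff(1000, 2000, steps=4, is_healthy=toy_probe)
```

Four halvings of a 1000-wide interval leave a window of about 62 objects, which matches the "±~60" figure above.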
If CREATE CLUSTER fails for a cluster_object_limits size — because the target region doesn't expose that replica size, or because allocating the cluster would exceed max_credit_consumption_rate — today the scenario either aborts with a traceback or (when the cluster is created but then can't actually keep up) reports a confusing "unhealthy at N=100" data point. Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a clear "size unavailable" line (with the underlying error class + message), and `continue` to the next cluster size. OperationalError is re-raised so that genuine connection failures (which run_query's retry loop has already given up on) aren't silently masked as a size problem. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits plots (max-healthy-N bars + lag-vs-N legend) ordered cluster sizes alphanumerically — "100cc, 1600cc, 200cc, 3200cc, 400cc, 800cc" — making the small→large progression unreadable. Lift the existing `extract_cluster_size` helper to module scope and use it to reindex the bar plot's index and reorder the line plot's columns. The cluster-results path was already using it for its x-axis, so the extraction is just hoisted, not duplicated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
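The ordering problem is plain string sorting. The sketch below shows the numeric-key fix, assuming size names are a decimal number followed by a suffix such as `cc`. `size_sort_key` is a hypothetical stand-in; the body of the real `extract_cluster_size` helper is not shown in this PR text.

```python
import re


def size_sort_key(size: str) -> int:
    """Return the leading integer of a size name like '1600cc' (illustrative)."""
    match = re.match(r"(\d+)", size)
    if match is None:
        raise ValueError(f"unrecognized cluster size: {size!r}")
    return int(match.group(1))


sizes = ["100cc", "1600cc", "200cc", "3200cc", "400cc", "800cc"]
alphabetical = sorted(sizes)                 # the unreadable ordering
numeric = sorted(sizes, key=size_sort_key)   # small to large
```

The same key can drive a pandas `reindex` for the bar plot or a column reorder for the line plot.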
In build 1223 (release-qualification), 1600cc/3200cc showed implausibly small max-healthy-N — 1600cc reported 0 healthy indexes / 93 healthy MVs where 100cc–800cc routinely handled 1500+ indexes and 687+ MVs. Probing the first N (=100) on a freshly-created 1600cc cluster returned local_lag values of 90+ seconds for indexes and ~unix-epoch-ms for MVs (i.e. write_frontier stuck at zero). Once that first probe declared the cluster unhealthy, the bisect could not recover: each successive sample measured more accumulated lag, not less, because the cluster never got a chance to settle. Likely cause is cold-start: bigger replicas take longer to begin serving frontiers after CREATE CLUSTER + bulk DDL, and the 60s hydration window expires before steady state. Bump it to 300s as a first diagnostic — if 1600cc/3200cc now look healthy at reasonable N, this confirms the hypothesis and we can keep the higher timeout (or make it size-dependent). If they still look broken, the issue is elsewhere (provisioning, multi-process replica semantics, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ult cap
The envd_objects_scalability default cap was reduced from 100k to 30k in bdb6607 but two comments still referenced the older shape. Update the system-parameter rationale to describe the headroom relationship between the lifted ceilings (200k) and the user-configurable cap (default 30k), and update the analyzer docstring to match.
…base_t
EnvdObjectsScalabilityMvsScenario called its one-row base table pad_schema.pad_base while ClusterObjectLimitsScenario called the same shape table pad_schema.base_t. Use BASE_TABLE = "base_t" consistently.
The new envd_objects_scalability and cluster_object_limits teardown
paths each open-coded a try/except + print("WARNING: failed to drop ...")
block around their DROP statements. Pull the pattern into a single
helper used by all four call sites.
The four DictWriter blocks in workflow_default repeated almost the same 10-field fieldname list. envd_objects_scalability claimed in a comment to "reuse the cluster-focused schema" but spelled it out anyway, and cluster_object_limits redeclared the same list plus a single extra column. Hoist CLUSTER_FIELDNAMES + ENVD_FIELDNAMES and build all four writers from them via a small _make_csv_writer helper.
The four analyze_*_results_file functions repeated the same six-line header: print banner, read CSV, empty check, derive base_name, build plot_dir, mkdir. Pull it into a helper that returns (df, plot_dir) or None when the file is empty.
hydrate_and_sample's inner probe_once helper wrapped each probe with SET cluster=<probe>; <select>; SET cluster=c — three round-trips per probe. Over a 300s hydration window plus 5 steady-state samples that adds up to ~900 redundant SETs per N. Move the two SETs to a single try/finally around the whole polling window; the per-probe work is now just the lag SELECT.
The "DROP CLUSTER IF EXISTS c CASCADE; CREATE CLUSTER c SIZE ...; SET cluster = 'c'" sequence appeared verbatim in five run_scenario_* functions (strong, envd_strong_scaling, envd_objects_scalability, cluster_object_limits, weak), and the one-row probe-table prep in three of them. Move both into helpers; the cluster_object_limits "skip if size unavailable" branch becomes a parameter on the helper rather than an open-coded try/except at the call site.
…stry
Four workflow_plot_* functions were structurally identical (parser arg, parse, glob, call one analyzer); the multi-kind workflow_plot did the same plus a 4-way if/elif suffix dispatch. Replace both shapes with a shared _plot_files helper that takes either a fixed analyzer (per-kind workflows) or dispatches via the new SUFFIX_ANALYZERS registry (the multi-kind workflow_plot). The five workflow_* functions are now one-call wrappers.
…ario subclasses
The two ClusterObjectLimits scenario subclasses differed only in (a) which
DDL statements create/drop one materialization (CREATE VIEW + CREATE
DEFAULT INDEX vs CREATE MATERIALIZED VIEW), and (b) which catalog table
mz_materialization_lag is joined against. Lift the differences into a
frozen ClusterObjectLimitsKind dataclass carrying create/drop SQL
templates plus the lag-filter join, and have a single
ClusterObjectLimitsScenario class read its kind to drive add_objects /
remove_objects / probe_lag_ms. The two scenarios are now constructed as
ClusterObjectLimitsScenario(CLUSTER_OBJECT_LIMITS_{INDEXES,MVS}_KIND).
EnvdObjectsScalability{Tables,Mvs} are left as separate subclasses: the
MV variant carries pad-cluster sharding state and a distinct init/teardown,
so collapsing them would just hide the structural difference behind
conditionals rather than remove duplication.
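The data-driven shape might look roughly like the following. This is a sketch under assumptions: `MaterializationKind` stands in for `ClusterObjectLimitsKind`, the object names `v_{i}` / `mv_{i}` are invented for illustration, and the lag-filter join is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializationKind:
    """Hypothetical stand-in for ClusterObjectLimitsKind (SQL templates only)."""

    name: str
    create_templates: tuple[str, ...]
    drop_template: str

    def create_sql(self, schema: str, base: str, i: int) -> list[str]:
        # Each template takes {schema}, {base}, and {i} placeholders.
        return [t.format(schema=schema, base=base, i=i) for t in self.create_templates]

    def drop_sql(self, schema: str, i: int) -> str:
        return self.drop_template.format(schema=schema, i=i)


INDEXES = MaterializationKind(
    name="indexes",
    create_templates=(
        "CREATE VIEW {schema}.v_{i} AS SELECT * FROM {schema}.{base} WHERE id < {i}",
        "CREATE DEFAULT INDEX ON {schema}.v_{i}",
    ),
    drop_template="DROP VIEW {schema}.v_{i} CASCADE",
)

MVS = MaterializationKind(
    name="mvs",
    create_templates=(
        "CREATE MATERIALIZED VIEW {schema}.mv_{i} AS "
        "SELECT * FROM {schema}.{base} WHERE id < {i}",
    ),
    drop_template="DROP MATERIALIZED VIEW {schema}.mv_{i}",
)

index_ddl = INDEXES.create_sql("pad_schema", "base_t", 7)
mv_drop = MVS.drop_sql("pad_schema", 7)
```

A single scenario class can then read its kind instead of overriding methods, and `frozen=True` keeps the module-level kinds from being mutated by accident.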
…ion_statuses
The freshness probe previously combined "is the dataflow running yet?"
with "is the cluster keeping up?" into a single predicate:
reporting == N AND max_local_lag_ms < lag_threshold_ms
That meant every unhealthy probe burned the full hydration timeout
(300s in build 1226, see ace6b0f) before declaring failure: the lag
on an overloaded cluster never falls under 2s, so the loop polls to
the deadline and only then captures the lag. In 1226 the 100cc
N=2000, 200cc N=3000, and 1600cc N=2000 probes each sat for ~301s
before recording lag values of 654s–675s. Bisecting an unhealthy
region pays this cost again at every step.
Split into two phases:
1. Poll `mz_internal.mz_hydration_statuses` until every test object
on `c` reports `hydrated = true`. This is a definitive per-object
signal — the dataflow has finished initial snapshotting — and
converges quickly even on cold-started 1600cc/3200cc replicas.
Timeout here means the replica is genuinely wedged.
2. Once hydrated, take the existing `CLUSTER_OBJECT_LIMITS_SAMPLES`
steady-state lag samples. Unhealthy now means "hydrated but
can't keep up", which is the property we actually want to
measure; an overloaded cluster trips the threshold in
`samples * sample_interval` (~10s) instead of in 300s.
With this decoupling the cold-start argument for the 300s timeout no
longer applies, so drop it back to 60s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
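The two phases can be sketched as below. This is an illustration under assumptions, not the PR's `hydrate_and_sample`: the probes, clock, and sleep are injected callables so the sketch runs without a database, the 1 s poll interval is invented, and the status strings are borrowed from a later commit in this PR.

```python
import time
from typing import Callable, Optional


def hydrate_then_sample(
    probe_hydrated: Callable[[], tuple[int, int]],  # returns (total, hydrated)
    probe_lag_ms: Callable[[], float],
    timeout_s: float = 60.0,
    samples: int = 5,
    sample_interval_s: float = 2.0,
    lag_threshold_ms: float = 2_000.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> tuple[Optional[float], str]:
    # Phase 1: wait until every test object reports hydrated.
    deadline = now() + timeout_s
    while True:
        total, hydrated = probe_hydrated()
        if total > 0 and hydrated == total:
            break
        if now() >= deadline:
            return None, "hydration_timeout"
        sleep(1.0)
    # Phase 2: take steady-state lag samples and keep the worst one.
    max_lag = 0.0
    for _ in range(samples):
        max_lag = max(max_lag, probe_lag_ms())
        sleep(sample_interval_s)
    return max_lag, ("healthy" if max_lag < lag_threshold_ms else "lag")


# Fake clock so the usage below finishes instantly.
clock = {"t": 0.0}


def fake_sleep(seconds: float) -> None:
    clock["t"] += seconds


def fake_now() -> float:
    return clock["t"]


ok = hydrate_then_sample(lambda: (100, 100), lambda: 12.0, sleep=fake_sleep, now=fake_now)
lagging = hydrate_then_sample(lambda: (100, 100), lambda: 654_000.0, sleep=fake_sleep, now=fake_now)
wedged = hydrate_then_sample(lambda: (100, 0), lambda: 0.0, sleep=fake_sleep, now=fake_now)
```

Keeping the hydration wait separate from the lag samples means the recorded lag reflects steady state rather than warm-up.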
…dration budget In build 1227, 3200cc indexes and 1600cc MVs both produced false cliffs at N=100: the very first probe after CREATE CLUSTER timed out with 0/100 hydrated, but every subsequent bisect step (N=50/75/87/93) hydrated cleanly with lag=0.0. The cluster works fine — the replica just isn't reporting introspection within 60s of being created on multi-process sizes. Thread a per-call `timeout_s` into `hydrate_and_sample` and let `probe_and_record` pick between the regular and a longer "first probe" budget. The coarse N-walk passes `first_probe=True` only on its first iteration, so big-replica cold start gets headroom while every other probe keeps the tight 60s budget that makes unhealthy points cheap to record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior changes; just shorter explanations where the original
prose restated the code or expanded beyond what a reader needs:
- MATERIALIZED_ADDITIONAL_SYSTEM_PARAMETER_DEFAULTS rationale: 7 → 3 lines
- _bulk_run docstring: 12 → 5 lines (keep the idempotency warning)
- hydrate_and_sample docstring: 25 → 13 lines (keep the "why split
the phases" justification)
- probe_lag_ms / probe_hydrated docstrings: drop the tuple-field
enumeration that duplicates the return type
- collapse the two "framework setup/drop unused" comments
- drop pure-label comments ("Snapshot of cluster sizes", "Outer loop")
and the init/teardown lines that restate the next statement
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`workflow_default` carried a 150-line if-ladder mapping each `SCENARIO_*`
string to a `run_scenario_*` invocation, plus four parallel sections that
open / upload / archive / analyze one CSV each. Adding a new scenario or
result kind meant touching every one of those.
Replace both with two data registries:
- `ScenarioSpec` + `SCENARIOS`: name, log label, family, factory lambda,
groups. `SCENARIOS_BY_NAME` and `SCENARIO_GROUPS` are derived from it,
so the hand-written `SCENARIOS_CLUSTERD` / `_COMPUTE` / etc. lists go
away. A `Family` literal + `FAMILY_TO_STREAM` table selects the
driver, and a small `run_spec` match replaces the if-ladder.
- `ResultStreamSpec` + `RESULT_STREAMS`: suffix, fieldnames, analyzer,
uploader. `workflow_default` opens one CSV per spec and the four
parallel close/upload/artifact/analyze blocks become single loops.
The old `SUFFIX_ANALYZERS` is now a derived alias.
The scenarios themselves, the four `run_scenario_*` drivers, and the
`Scenario` ABC are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
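A rough sketch of how the derived lookups could fall out of a single registry. The field names follow the list above, but the labels, family strings, and placeholder factories are assumptions, and only three of the scenarios are shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    label: str
    family: str                    # assumed to be a plain string here
    factory: Callable[[], object]  # placeholder; real factories build scenarios
    groups: tuple[str, ...]


SCENARIOS = (
    ScenarioSpec(
        "envd_objects_scalability_mvs", "Envd objects (MVs)",
        "envd_objects_scalability", object, ("envd_objects_scalability",),
    ),
    ScenarioSpec(
        "cluster_object_limits_indexes", "Object limits (indexes)",
        "cluster_object_limits", object, ("cluster_object_limits",),
    ),
    ScenarioSpec(
        "cluster_object_limits_mvs", "Object limits (MVs)",
        "cluster_object_limits", object, ("cluster_object_limits",),
    ),
)

# Both lookups are derived, so adding a scenario means adding one entry above.
SCENARIOS_BY_NAME = {spec.name: spec for spec in SCENARIOS}

SCENARIO_GROUPS: dict[str, list[str]] = {}
for spec in SCENARIOS:
    for group in spec.groups:
        SCENARIO_GROUPS.setdefault(group, []).append(spec.name)
```

Because the group lists are computed from `SCENARIOS`, they cannot drift out of sync with the scenario definitions the way hand-written lists can.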
The previous `Scenario` ABC pretended that all four scenario families shared a single setup/drop/materialize_views/run lifecycle, but only strong/weak/envd_qps actually used it. EnvdObjectsScalability and ClusterObjectLimits returned [] from every ABC method and were driven through entirely different protocols (init/add_objects/teardown and reset_for_cluster_size/probe_*/teardown respectively), with comments apologising that "framework-level setup/drop are unused".
Replace the single ABC with three real shapes:
* `ClusterScalingScenario` (renamed from `Scenario`) for the strong/weak/envd_qps families. `drop()` and `materialize_views()` now default to `[]` so the envd_qps subclasses no longer need empty overrides.
* `EnvdObjectsScalabilityScenario` becomes its own ABC with the methods it actually exposes; the unused `replica_size` constructor parameter is dropped.
* `ClusterObjectLimitsScenario` becomes a plain class (no inheritance) and the unused `replica_size` parameter is dropped.
`ScenarioSpec.factory` now returns the union `AnyScenario`, and `run_spec` narrows it per-family with isinstance asserts.
With ClusterObjectLimitsScenario no longer pretending to be a generic Scenario, the 220-line `run_scenario_cluster_object_limits` collapses: the `hydrate_and_sample` and `probe_and_record` closures and the N-walk + bisect loop move onto the scenario as `_hydrate_and_sample`, `_probe_and_record`, and `run_for_cluster_size`. The driver shrinks to the outer cluster-size loop plus ScenarioRunner construction and `_recreate_cluster_c` / `reset_for_cluster_size` / `teardown` bookkeeping.
Behaviour is unchanged: same 13 scenarios, same groups, same CSV output. Verified by direct module import and pyright/ruff/black.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three parallel scenario hierarchies (ClusterScalingScenario, EnvdObjectsScalabilityScenario, ClusterObjectLimitsScenario) and five run_scenario_* drivers collapsed onto one Scenario ABC with a prepare/scale_points/apply/measure/cleanup_point/teardown lifecycle. Strong/weak/envd-cpu/envd-objects become thin sweep wrappers around the existing inner workloads; ClusterObjectLimitsScenario implements Scenario directly. The result-stream choice now lives on each scenario via stream_key(), so Family, FAMILY_TO_STREAM, AnyScenario, RunContext and run_spec all go away. Also: shared _extend_incremental/_shrink_incremental helpers used by both envd_objects and cluster_object_limits scenarios, and the four per-kind workflow_plot_* entries collapse to one workflow_plot that dispatches by filename suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rop dead replica_size param ClusterObjectLimitsScenario._probe_and_record was the only caller that bypassed runner.add_result and wrote rows directly via results_writer.writerow(...), because add_result didn't know about the extra `healthy` column. Extend add_result with optional `time_ms` (for values already in ms) and `healthy` kwargs so the cluster_object_limits path matches every other scenario. The `healthy` column is silently dropped on streams whose schema doesn't include it via the existing extrasaction="ignore". Also drop the `replica_size` constructor parameter from ScenarioRunner: every sweep wrapper passes None and mutates `runner.replica_size` in `apply()`. The param was dead weight. _probe_and_record's `replica_size` argument is dropped for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
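The `extrasaction="ignore"` behaviour that makes the shared `add_result` path work is standard `csv.DictWriter` behaviour. The sketch below demonstrates it with shortened, hypothetical field lists (`BASE_FIELDS`, `LIMITS_FIELDS`) and an invented `test_name` value; the real schemas have more columns.

```python
import csv
import io

BASE_FIELDS = ["scenario", "test_name", "time_ms"]
LIMITS_FIELDS = BASE_FIELDS + ["healthy"]


def render_row(fieldnames: list[str], row: dict) -> list[str]:
    """Write a header plus one row and return the CSV lines."""
    buf = io.StringIO()
    # extrasaction="ignore" drops keys that are not in fieldnames instead of
    # raising ValueError (the default, extrasaction="raise").
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue().splitlines()


row = {
    "scenario": "cluster_object_limits_mvs",
    "test_name": "max_lag",
    "time_ms": 150,
    "healthy": 1,
}
without_healthy = render_row(BASE_FIELDS, row)
with_healthy = render_row(LIMITS_FIELDS, row)
```

The same row dictionary can therefore be handed to every stream, and only streams that declare `healthy` will persist it.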
| ) -> None: | ||
| """Write one result row. |
Do we need this dance around time/time_ms? Couldn't we just update the existing callsites to pass in time_ms because apparently here we always convert to time_ms
| These scenarios share a four-step lifecycle: `drop()` cleans up state from | ||
| a prior run, `setup()` creates load-generator clusters / sources / tables, | ||
| `materialize_views()` lists persist objects to hydrate, and `run()` | ||
| performs the measurements. The drivers (`run_scenario_strong`, |
Do we still have these per scenario drivers?
| # Can return: status 404 Not Found | ||
| pass | ||
| The scale point (number of catalog objects, N) is driven externally by | ||
| `run_scenario_envd_objects_scalability` via init/add_objects/teardown; |
we still have these scenario runners?
| def _recreate_cluster_c( | ||
| runner: ScenarioRunner, | ||
| replica_size: str, | ||
| smoke_test: bool = False, |
What do we have this smoke_test thing for?
|
|
||
| label: str | ||
| cluster_size: str | None = None |
What's the difference between cluster_size and replica_scale?
|
|
||
|
|
||
| class EnvdObjectsScalabilityMvsScenario(EnvdObjectsScalabilityScenario): | ||
| """N materialized views in the catalog, sharded across pad clusters. |
We should also mention that each MV has its own persist source (the table). This is as opposed to, say, having one table (persist source) that goes into an index, and then many materialized views off that index.
And the same holds for the index scenario below. We can note that as follow-up work we want to also do the one table into index into many indexes scenario
| self._pad_clusters_created += 1 | ||
|
|
||
| def add_objects(self, runner: ScenarioRunner, target_n: int) -> None: | ||
| # Build one shard at a time: each shard lives on its own pad cluster |
Shard naming is confusing here because persist also has the concept of shards. Audit the file for this and find something else.
| statements (indexes need a view + a default index; MVs need a single | ||
| CREATE MATERIALIZED VIEW). Statement templates take ``{schema}``, | ||
| ``{base}``, and ``{i}`` placeholders; the ``WHERE id < {i}`` predicate | ||
| is what makes each view/MV structurally distinct so plan caching can't |
I don't think it's plan caching that would do this in materialize, but we are doing it because some other things might. Let's drop this, or let's just say it's to make sure that we get separate dataflows. Audit the rest of the file for this kind of thing and also fix up those places.
| - A single one-row base table `base_t` (never updated) lives in | ||
| ``PAD_SCHEMA``. Frontiers advance over wall-clock time, so even with | ||
| no writes the cluster must keep ticking every materialization's | ||
| write_frontier — which is what saturates an undersized cluster. |
the saturation is independent of cluster size, so scratch that last bit
| """Return (total, hydrated) for our test objects on cluster ``c``. | ||
|
|
||
| Decoupled from `local_lag` so the freshness loop can wait for the | ||
| cluster to be *running* first, then judge whether it can keep up — |
running might not be the right framing here. Let's just say "wait for objects to be hydrated"
| Two-phase probe: wait for hydration, then take steady-state lag | ||
| samples. Returns (max_lag_ms, healthy, reporting, total). | ||
|
|
||
| Splitting the two phases keeps unhealthy probes fast: an overloaded |
This is not true: an overloaded cluster often won't hydrate quickly either, so we have to workshop something here
- collapse `add_result` to `time_ms` only; converting at callsites
- refresh `ClusterScalingScenario` / `EnvdObjectsScalabilityScenario` docstrings to point at the current sweep wrappers
- clarify `smoke_test` purpose on `_recreate_cluster_c`
- document each `ScalePoint` field (cluster_size vs replica_scale, etc.)
- note one-table-to-many-materializations topology on the MV and cluster_object_limits scenarios, with a follow-up TODO for the alternative one-index-fan-out topology on the indexes scenario
- rename "shard" (per-pad-cluster MV batch) to avoid collision with persist's notion of shards; talk about pad clusters directly
- drop "plan caching" framing; say we want separate dataflows
- drop the "saturates an undersized cluster" aside
- reframe `probe_hydrated` around hydration, not "running"
- correct `_hydrate_and_sample` rationale: the split distinguishes warm-up lag from steady-state lag, not "fast unhealthy probes"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 60s hydration timeout was being hit on every cluster size in the
coarse N=2000 step on staging, and on big replicas (1600cc/3200cc) it
was hit at the bisect points too — yielding a measured "cliff" that
was driven by hydration time, not steady-state capacity, with bigger
replicas paradoxically reporting smaller max-healthy N.
Bump CLUSTER_OBJECT_LIMITS_HYDRATION_TIMEOUT_S from 60s to 300s as a
blanket raise.
Also distinguish hydration-timeout failures from lag failures in the
output: `_hydrate_and_sample` now returns a status string ("healthy" /
"lag" / "hydration_timeout"), and `_probe_and_record` writes that into
a new `failure_mode` CSV column (empty when healthy). The existing
`healthy` 0/1 column is unchanged, so post-hoc analysis keeps working;
the new column lets us tell the two unhealthy modes apart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a third cluster_object_limits scenario that puts a single root view + index in front of base_t. The N test indexes read from the root view instead of base_t, so the optimizer imports the root index's arrangement and the whole topology has one persist source regardless of N. base_t is still on the data path so frontier ticks propagate through.
The existing two scenarios are dominated by persist-source / persist-sink breadth (one source per indexed view; one source + one sink per MV), which obscures the "how many compute-only dataflows can a cluster tick" axis. The new variant isolates that axis.
Rename for clarity:
- cluster_object_limits_indexes -> cluster_object_limits_indexes_from_persist_sources
- new -> cluster_object_limits_indexes_from_index
- cluster_object_limits_mvs -> cluster_object_limits_mvs_from_persist_sources
ClusterObjectLimitsKind grows a setup_statements tuple, run once per cluster-size iteration after base_t exists. Only the from_index variant uses it (to create v_root + v_root_primary_idx); the other two leave it empty. The from_index lag/hydration filters exclude v_root_primary_idx so it doesn't count as one of the N test objects.
New doc/developer/reports/ directory with a short report on the first numbers from the cluster-spec-sheet scenarios on staging.
Plots:
- cluster_object_limits: max idle materializations per cluster size, three series (indexes_from_persist_sources, indexes_from_index, mvs_from_persist_sources). Data from release-qualification build 1231.
- envd_objects_scalability: CREATE TABLE latency vs catalog size (tables and MVs). Data from release-qualification build 1229.
Assets live under static/<slug>/, matching the design-doc convention.
Findings called out: the from_index variant carries ~1.7x more indexes than from_persist_sources on small clusters; both index curves trend downward with cluster size (matches the timely-dataflow argument); the MV curve is flatter at lower N. Two single-cell outliers flagged for a repeat run before publishing absolute numbers.
Drop the "third pass" history paragraph, the from_index headline finding (the plot conveys it), the stalls aside, the MV-vs-tables baseline speculation, the "what's changed since the first cut" section, and the Caveats section. Distill the outlier note to a single sentence about the MV 400cc point and add a one-liner acknowledging flakiness with directionally-correct results.
| will tolerate fewer objects than this. The point of measuring the idle | ||
| ceiling is that everything else has to fit underneath it. | ||
|
|
||
| This report is the third pass over these scenarios; the first pass |
don't need this paragraph, readers don't need to know the history of the testing
|
|
||
| Three observations: | ||
|
|
||
| **1. The `from_index` variant unlocks ~1.7× more objects on small |
cut this paragraph
| objects is a flat-to-downward ceiling as the cluster grows, and that | ||
| is what we see. | ||
|
|
||
| **3. There are spikes worth flagging, not yet acting on.** The |
also need to distill this one, just mention that we think the one deep outlier is an outlier, leave it at that
| investigation. If they don't, our methodology needs more samples per | ||
| size before we publish absolute numbers. | ||
|
|
||
| A subtler finding from this run's per-sample data: many of the |
Also cut this one, say that measurements are still a bit flaky and we'll improve as follow-up. But directionally the results look correct
| the slope is essentially the same. The MV baseline at small `N` is | ||
| slightly lower than tables, which is mildly surprising given MVs are | ||
| the "heavier" object — could be variance, could be a real difference | ||
| in the adapter path for MVs at low cardinality, not investigated | ||
| further here. |
drop this last bit here, no need to speculate
| ## What's changed since the first cut | ||
|
|
||
| The 2026-05-18 garden post described two cluster scenarios | ||
| (`cluster_object_limits_indexes`, `cluster_object_limits_mvs`) and a | ||
| 60-second hydration window. Three things have changed since then. | ||
|
|
||
| **New variant: `indexes_from_index`.** Added a chained-index scenario | ||
| that isolates compute-only dataflow tick overhead from persist-source | ||
| overhead. Implementation: `ClusterObjectLimitsKind.setup_statements` | ||
| runs once per cluster-size iteration after `base_t` exists. The | ||
| `from_index` kind uses it to create `v_root` + `v_root_primary_idx` on | ||
| the measurement cluster; the lag/hydration filters exclude | ||
| `v_root_primary_idx` so the root index doesn't count as one of the `N` | ||
| test objects. | ||
|
|
||
| **Renames for clarity:** | ||
|
|
||
| - `cluster_object_limits_indexes` → `cluster_object_limits_indexes_from_persist_sources` | ||
| - `cluster_object_limits_mvs` → `cluster_object_limits_mvs_from_persist_sources` | ||
| - new: `cluster_object_limits_indexes_from_index` | ||
|
|
||
| The new names spell out the topology each variant exercises. | ||
| References in our own pipelines were updated; no external (non-repo) | ||
| consumers were known. | ||
|
|
||
| **Hydration timeout raised to 5 minutes; failure mode is now | ||
| recorded.** The earlier 60-second hydration window timed out | ||
| deterministically on `1600cc` / `3200cc` at the coarse `N=2000` | ||
| step, which yielded a measured "cliff" that was driven by hydration | ||
| time, not steady-state capacity — bigger replicas paradoxically | ||
| reported smaller max-healthy `N`. The window is now 300 seconds (180 s | ||
| first-probe budget after `CREATE CLUSTER c` to absorb cold-start), and | ||
| hydration-timeout failures are written to a new `failure_mode` column | ||
| in the cluster_object_limits CSV (`"hydration_timeout"` vs `"lag"` vs | ||
| empty when healthy). The existing `healthy` 0/1 column is unchanged, | ||
| so prior analysis code keeps working. |
don't need this section, readers are only interested in the final state
| ## Caveats | ||
|
|
||
| One run per data point. Every materialization here is derived from a | ||
| one-row table that never changes — real workloads will lower these | ||
| ceilings, often by a lot. The numbers are also specific to the | ||
| `cloud-staging` environment with its current `environmentd` CPU | ||
| allocation, `max_credit_consumption_rate`, and LaunchDarkly defaults; a | ||
| production environment will land somewhere else. Two of the | ||
| cluster-object-limits cells (`from_index` at `800cc`, `mvs` at `400cc`) | ||
| look like single-run outliers and should be repeated before being | ||
| quoted. |
also drop the caveats here
| - ./ci/plugins/mzcompose: | ||
| composition: cluster-spec-sheet | ||
| run: default | ||
| args: [--cleanup, --target=cloud-staging, cluster_object_limits] |
How long does all of this take now btw?
They do take quite long: https://buildkite.com/materialize/release-qualification/builds/1229, so ~6h for the envd objects scalability and ~3h for the cluster objects scalability one. Is that too long? I can cut it back or make it so we don't run these on every release but only on request? Future improvements to DDL latency will likely help, but no clear roadmap when we can land these
It's probably fine. If we get value out of the numbers, let's go for it.
Fixes SQL-222