
spec-sheet: add new envd-object-scalability and cluster-object-limits scenarios #36540

Merged
aljoscha merged 35 commits into MaterializeInc:main from aljoscha:envd-specsheet
May 29, 2026

Conversation

@aljoscha
Contributor

Fixes SQL-222

aljoscha and others added 30 commits May 13, 2026 11:30
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.

`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.
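
The pad-cluster arithmetic above can be sketched as follows. This is only an illustration: the names `MVS_PER_PAD_CLUSTER`, `pad_cluster_for`, `pad_clusters_needed`, and the `pad_cluster_{k}` naming scheme are assumptions and are not taken from the PR.

```python
import math

# From the commit message: 10000 MVs per pad cluster.
MVS_PER_PAD_CLUSTER = 10_000


def pad_cluster_for(mv_index: int) -> str:
    """Return the (hypothetical) pad cluster name hosting the MV with this 0-based index."""
    return f"pad_cluster_{mv_index // MVS_PER_PAD_CLUSTER}"


def pad_clusters_needed(n_mvs: int) -> int:
    """Number of pad clusters required to host n_mvs materialized views."""
    return math.ceil(n_mvs / MVS_PER_PAD_CLUSTER)
```

With these assumptions, `pad_clusters_needed(100_000)` evaluates to 10, matching the "100k MVs spans 10 clusters" figure.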

For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
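
A minimal sketch of the incremental build described above, assuming the size points are walked in ascending order. The helper name `incremental_steps` is hypothetical; the real driver is not shown in this conversation.

```python
from typing import Iterator


def incremental_steps(sizes: list[int]) -> Iterator[tuple[int, int]]:
    """Yield (target_n, objects_to_add) for each size point, in ascending order."""
    current = 0
    for target in sorted(sizes):
        # Only the difference to the previous size point has to be created.
        yield target, target - current
        current = target


sizes = [1, 10, 100, 1_000]
steps = list(incremental_steps(sizes))
total_created = sum(added for _, added in steps)
```

Here `total_created` equals the largest size point (1000), whereas rebuilding the catalog from scratch at every size point would create `sum(sizes)` = 1111 objects for this short list, and far more for the full list.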

Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).

This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two new scenarios -- cluster_object_limits_indexes and
cluster_object_limits_mvs -- that find, per cluster size, the maximum
number of idle materializations one cluster can keep fresh.

The materializations are derived from a one-row, never-updated base
table so the only work the cluster has to do is keep advancing each
materialization's write_frontier in step with the upstream table. Once
the cluster can't keep up, freshness collapses; the driver records the
largest N at which `max(local_lag) < 2s` was still achievable, with the
unhealthy data point recorded too so the cliff is visible.

Staging-only (rejects --target=cloud-production), to avoid burning
production resources on long object-limit searches.
…lability default at 50k

When a materialization stalls completely (write_frontier never advances
past the minimum timestamp), `mz_internal.mz_materialization_lag` reports
`now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this
crushes every healthy data point to ~0 on the plot. Cap the recorded
value at 10x the healthy threshold (= 20 s), preserve the underlying
truth via the `healthy` column, and label the plot to make the cap and
healthy threshold explicit.
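
The capping rule can be sketched like this. The constant and function names are illustrative assumptions; only the 2 s threshold and the 10x cap come from the commit message.

```python
HEALTHY_THRESHOLD_MS = 2_000  # the 2 s freshness threshold
LAG_CAP_MS = 10 * HEALTHY_THRESHOLD_MS  # 20 s cap applied to the recorded value


def record_lag(max_lag_ms: float) -> tuple[float, bool]:
    """Return (value_to_record, healthy).

    Health is judged on the raw lag, before capping, so the `healthy`
    column still tells the truth when the recorded value is clipped.
    """
    healthy = max_lag_ms < HEALTHY_THRESHOLD_MS
    return min(max_lag_ms, LAG_CAP_MS), healthy
```

A fully stalled materialization reporting roughly 1.78e12 ms is then recorded as 20000 ms with `healthy` false, so healthy points stay visible on the plot.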

Also drop 100_000 from the envd_scalability default size list: 50_000 is
a more sensible default ceiling for staging. The full size list is still
overridable via --envd-scalability-sizes for ad-hoc runs.
…tion

The release-qualification pipeline already runs three cluster-spec-sheet
groups (cluster_compute on production, source_ingestion on production,
environmentd on staging). Add two more groups -- envd_scalability and
cluster_object_limits -- both running against staging, since both push
the catalog / cluster to limits we don't want to exercise on production.

The three "envd / cluster" groups in the cluster-spec-sheet were named
inconsistently. Settle on the three concept names the cluster-spec-sheet
effort uses verbally:

  environmentd          -> envd_qps_scalability     (QPS vs envd CPU)
  envd_scalability      -> envd_objects_scalability (latency vs catalog N)
  cluster_object_limits -> cluster_object_limits    (unchanged)

Renames apply to: scenario constants, scenario-name string values, group
keys in SCENARIO_GROUPS, class names, the run/analyze function names,
the --envd-scalability-sizes CLI flag, the result CSV suffix, and the
`mode` field written into CSV rows. The pre-existing QPS scenarios keep
their individual `*_envd_strong_scaling` names since only the group is
renamed.

Also updates the release-qualification pipeline step ids/args and the
README to match.
…w start

When debugging cluster-spec-sheet runs on staging it's hard to tell which
environment we're actually talking to and whether the system parameter
defaults we expect (lifted via LaunchDarkly or similar) are actually
applied. Add a one-shot diagnostic right after target.initialize() that
prints mz_environment_id() and SHOWs the limits the test depends on
(max_tables, max_materialized_views, max_objects_per_schema, max_clusters,
max_credit_consumption_rate, memory_limiter_interval).

Best-effort: any probe error is logged and swallowed so a transient
failure does not abort the workflow.

psycopg3's execute() requires a LiteralString, so the f-string SHOW
query tripped pyright in CI. Compose the statement with
psycopg.sql.SQL/Identifier instead, matching the pattern already used in
test/orchestratord/mzcompose.py.

A staging run of `envd_objects_scalability_mvs` (release-qualification
build 1219) aborted at N≈19800 with:

    Retryable error: consuming input failed: SSL error: unexpected eof
    while reading, reconnecting...
    psycopg.errors.InternalError_: materialized view
    "materialize.pad_schema.pad_mv_19805" already exists

The TLS connection dropped mid-statement; envd had already committed the
CREATE but the response was lost. ConnectionHandler.retryable reconnects
and replays the same statement, which then fails with "already exists".

Use ``CREATE ... IF NOT EXISTS`` for every CREATE issued via _bulk_run so
the retry is a no-op. Affects the bulk-creation paths in both
envd_objects_scalability scenarios (tables, MVs) and both
cluster_object_limits scenarios (indexed views, MVs). Add a docstring on
_bulk_run spelling out the idempotency requirement so future CREATEs
don't reintroduce the hazard.
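
The failure mode and the fix can be reproduced in miniature. The sketch below uses the standard library's `sqlite3` as a stand-in for Materialize, and `LostResponse` and `run_with_retry` are hypothetical names that simulate a response lost after the statement already took effect; none of this is the PR's actual `_bulk_run` or `ConnectionHandler` code.

```python
import sqlite3


class LostResponse(Exception):
    """Stand-in for a connection drop after the server already committed."""


def run_with_retry(conn, statement, drop_first_response=True):
    """Execute a statement, replaying it once if the 'response' is lost."""
    for attempt in range(2):
        try:
            conn.execute(statement)
            conn.commit()
            if drop_first_response and attempt == 0:
                # The statement took effect, but the caller never heard back.
                raise LostResponse
            return
        except LostResponse:
            continue  # reconnect-and-replay, as a retry wrapper would


conn = sqlite3.connect(":memory:")

# Idempotent form: the replay is a no-op.
run_with_retry(conn, "CREATE TABLE IF NOT EXISTS pad_t (id INT)")

# Non-idempotent form: the replay fails because the first attempt already committed.
try:
    run_with_retry(conn, "CREATE TABLE pad_u (id INT)")
    replay_error = None
except sqlite3.OperationalError as exc:
    replay_error = str(exc)
```

The second call ends with an "already exists" error on replay, which mirrors the abort seen in the staging run.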

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50k scale point pushes a single Envd Objects Scalability run past
the 13-hour mark on staging — adapter latency degrades so much by then
that each measurement repetition takes several seconds, and the catalog
build itself runs at <1/s. 30k is where the interesting signal already
lives. Drop 50k from the default list; ad-hoc runs that want it can
still pass --envd-objects-scalability-sizes explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits N-list defaults to a +1k linear step past
N=1000, which is too coarse: a run on staging showed the cliff sits in
(1000, 2000] for cluster_object_limits_indexes across every cluster
size 100cc..1600cc, and we can't tell from that whether the limit is
1100 or 1900.

After the coarse N-walk hits its first unhealthy point, bisect the
(last_healthy, first_unhealthy) interval and probe each midpoint. The
number of bisection rounds is set by
`--cluster-object-limits-bisect-steps` (default 4). Each bisection step
adds or drops objects in place and never rebuilds the catalog, so the
cost is only ~bisect_steps extra hydrate-and-probe rounds per cluster
size. With the default 4 steps, a 1000-object interval narrows to
1000 / 2^4 ≈ 62 objects.
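
The bisection can be sketched as follows. This is an illustrative reconstruction, not the PR's code: `bisect_cliff` and `is_healthy` are hypothetical names, and the "true limit" of 1430 is a made-up number used only to exercise the loop.

```python
from typing import Callable


def bisect_cliff(
    last_healthy: int,
    first_unhealthy: int,
    is_healthy: Callable[[int], bool],
    steps: int = 4,
) -> tuple[int, int]:
    """Narrow (last_healthy, first_unhealthy) by probing midpoints `steps` times."""
    for _ in range(steps):
        if first_unhealthy - last_healthy <= 1:
            break  # nothing left to bisect
        mid = (last_healthy + first_unhealthy) // 2
        if is_healthy(mid):
            last_healthy = mid
        else:
            first_unhealthy = mid
    return last_healthy, first_unhealthy


# Pretend the cluster stays healthy up to 1430 objects (made-up value).
lo, hi = bisect_cliff(1000, 2000, lambda n: n <= 1430)
```

The probes land at N = 1500, 1250, 1375, and 1437, leaving the interval (1375, 1437): 62 objects wide after four steps.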

Adds:
- `remove_objects(target_n)` symmetric to `add_objects(target_n)` on
  both ClusterObjectLimitsScenario subclasses. Indexes scenario drops
  via DROP VIEW ... CASCADE (cascades to the default index); MVs
  scenario drops via DROP MATERIALIZED VIEW.
- `--cluster-object-limits-bisect-steps` CLI flag plumbed through to
  `run_scenario_cluster_object_limits`.
- Bisection block in the per-cluster-size loop that calls add+remove
  (one is a no-op) and records each probe under the same CSV schema, so
  the existing freshness-lag-vs-N plot just gets denser near the cliff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
If CREATE CLUSTER fails for a cluster_object_limits size — because the
target region doesn't expose that replica size, or because allocating
the cluster would exceed max_credit_consumption_rate — today the
scenario either aborts with a traceback or (when the cluster is created
but then can't actually keep up) reports a confusing "unhealthy at
N=100" data point.

Catch psycopg.errors.DatabaseError around the CREATE CLUSTER, log a
clear "size unavailable" line (with the underlying error class +
message), and `continue` to the next cluster size. OperationalError is
re-raised so that genuine connection failures (which run_query's retry
loop has already given up on) aren't silently masked as a size problem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cluster_object_limits plots (max-healthy-N bars + lag-vs-N legend)
ordered cluster sizes alphanumerically — "100cc, 1600cc, 200cc, 3200cc,
400cc, 800cc" — making the small→large progression unreadable.

Lift the existing `extract_cluster_size` helper to module scope and use
it to reindex the bar plot's index and reorder the line plot's columns.
The cluster-results path was already using it for its x-axis, so the
extraction is just hoisted, not duplicated.
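
For illustration, numeric ordering of size names can look like this. The helper `extract_cluster_size` exists in the PR, but its body is not shown here, so the regex-based implementation below is an assumption.

```python
import re


def extract_cluster_size(name: str) -> int:
    """Pull the leading integer out of a size such as '1600cc' (assumed shape)."""
    match = re.match(r"(\d+)", name)
    # Names without a leading number sort last.
    return int(match.group(1)) if match else 10**9


sizes = ["100cc", "1600cc", "200cc", "3200cc", "400cc", "800cc"]
ordered = sorted(sizes, key=extract_cluster_size)
```

Sorting with this key gives the small-to-large progression 100cc, 200cc, 400cc, 800cc, 1600cc, 3200cc.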

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In build 1223 (release-qualification), 1600cc/3200cc showed implausibly
small max-healthy-N — 1600cc reported 0 healthy indexes / 93 healthy MVs
where 100cc–800cc routinely handled 1500+ indexes and 687+ MVs. Probing
the first N (=100) on a freshly-created 1600cc cluster returned
local_lag values of 90+ seconds for indexes and ~unix-epoch-ms for MVs
(i.e. write_frontier stuck at zero). Once that first probe declared the
cluster unhealthy, the bisect could not recover: each successive sample
measured more accumulated lag, not less, because the cluster never got
a chance to settle.

Likely cause is cold-start: bigger replicas take longer to begin
serving frontiers after CREATE CLUSTER + bulk DDL, and the 60s
hydration window expires before steady state. Bump it to 300s as a
first diagnostic — if 1600cc/3200cc now look healthy at reasonable N,
this confirms the hypothesis and we can keep the higher timeout (or
make it size-dependent). If they still look broken, the issue is
elsewhere (provisioning, multi-process replica semantics, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ult cap

The envd_objects_scalability default cap was reduced from 100k to 30k in
bdb6607 but two comments still referenced the older shape. Update the
system-parameter rationale to describe the headroom relationship between
the lifted ceilings (200k) and the user-configurable cap (default 30k),
and update the analyzer docstring to match.
…base_t

EnvdObjectsScalabilityMvsScenario called its one-row base table
pad_schema.pad_base while ClusterObjectLimitsScenario called the same
shape table pad_schema.base_t. Use BASE_TABLE = "base_t" consistently.

The new envd_objects_scalability and cluster_object_limits teardown
paths each open-coded a try/except + print("WARNING: failed to drop ...")
block around their DROP statements. Pull the pattern into a single
helper used by all four call sites.

The four DictWriter blocks in workflow_default repeated almost the same
10-field fieldname list. envd_objects_scalability claimed in a comment
to "reuse the cluster-focused schema" but spelled it out anyway, and
cluster_object_limits redeclared the same list plus a single extra
column. Hoist CLUSTER_FIELDNAMES + ENVD_FIELDNAMES and build all four
writers from them via a small _make_csv_writer helper.

The four analyze_*_results_file functions repeated the same six-line
header: print banner, read CSV, empty check, derive base_name, build
plot_dir, mkdir. Pull it into a helper that returns (df, plot_dir) or
None when the file is empty.

hydrate_and_sample's inner probe_once helper wrapped each probe with
SET cluster=<probe>; <select>; SET cluster=c — three round-trips per
probe. Over a 300s hydration window plus 5 steady-state samples that
adds up to ~900 redundant SETs per N. Move the two SETs to a single
try/finally around the whole polling window; the per-probe work is
now just the lag SELECT.

The "DROP CLUSTER IF EXISTS c CASCADE; CREATE CLUSTER c SIZE ...; SET
cluster = 'c'" sequence appeared verbatim in five run_scenario_*
functions (strong, envd_strong_scaling, envd_objects_scalability,
cluster_object_limits, weak), and the one-row probe-table prep in
three of them. Move both into helpers; the cluster_object_limits
"skip if size unavailable" branch becomes a parameter on the helper
rather than an open-coded try/except at the call site.
…stry

Four workflow_plot_* functions were structurally identical (parser arg,
parse, glob, call one analyzer); the multi-kind workflow_plot did the
same plus a 4-way if/elif suffix dispatch. Replace both shapes with a
shared _plot_files helper that takes either a fixed analyzer
(per-kind workflows) or dispatches via the new SUFFIX_ANALYZERS
registry (the multi-kind workflow_plot). The five workflow_*
functions are now one-call wrappers.
…ario subclasses

The two ClusterObjectLimits scenario subclasses differed only in (a) which
DDL statements create/drop one materialization (CREATE VIEW + CREATE
DEFAULT INDEX vs CREATE MATERIALIZED VIEW), and (b) which catalog table
mz_materialization_lag is joined against. Lift the differences into a
frozen ClusterObjectLimitsKind dataclass carrying create/drop SQL
templates plus the lag-filter join, and have a single
ClusterObjectLimitsScenario class read its kind to drive add_objects /
remove_objects / probe_lag_ms. The two scenarios are now constructed as
ClusterObjectLimitsScenario(CLUSTER_OBJECT_LIMITS_{INDEXES,MVS}_KIND).

EnvdObjectsScalability{Tables,Mvs} are left as separate subclasses: the
MV variant carries pad-cluster sharding state and a distinct init/teardown,
so collapsing them would just hide the structural difference behind
conditionals rather than remove duplication.
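
A rough sketch of the "kind as data" idea described above. `ObjectKind` is a hypothetical, trimmed-down stand-in for `ClusterObjectLimitsKind`: the object names (`v_{i}`, `mv_{i}`) and the exact SQL are assumptions, and the lag-filter join is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectKind:
    """Per-kind DDL templates; a single scenario class reads these instead of subclassing."""

    name: str
    create_templates: tuple[str, ...]
    drop_template: str

    def create_sql(self, schema: str, base: str, i: int) -> list[str]:
        return [t.format(schema=schema, base=base, i=i) for t in self.create_templates]

    def drop_sql(self, schema: str, i: int) -> str:
        return self.drop_template.format(schema=schema, i=i)


# Indexes need a view plus a default index (names are illustrative).
INDEXES = ObjectKind(
    name="indexes",
    create_templates=(
        "CREATE VIEW {schema}.v_{i} AS SELECT * FROM {schema}.{base} WHERE id < {i}",
        "CREATE DEFAULT INDEX ON {schema}.v_{i}",
    ),
    drop_template="DROP VIEW {schema}.v_{i} CASCADE",
)

# MVs need a single statement.
MVS = ObjectKind(
    name="mvs",
    create_templates=(
        "CREATE MATERIALIZED VIEW {schema}.mv_{i} AS SELECT * FROM {schema}.{base} WHERE id < {i}",
    ),
    drop_template="DROP MATERIALIZED VIEW {schema}.mv_{i}",
)
```

Because the dataclass is frozen, the two kinds behave as constants, and adding a third kind means adding one more value rather than one more subclass.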
…ion_statuses

The freshness probe previously combined "is the dataflow running yet?"
with "is the cluster keeping up?" into a single predicate:

    reporting == N AND max_local_lag_ms < lag_threshold_ms

That meant every unhealthy probe burned the full hydration timeout
(300s in build 1226, see ace6b0f) before declaring failure: the lag
on an overloaded cluster never falls under 2s, so the loop polls to
the deadline and only then captures the lag. In 1226 the 100cc
N=2000, 200cc N=3000, and 1600cc N=2000 probes each sat for ~301s
before recording lag values of 654s–675s. Bisecting an unhealthy
region pays this cost again at every step.

Split into two phases:

  1. Poll `mz_internal.mz_hydration_statuses` until every test object
     on `c` reports `hydrated = true`. This is a definitive per-object
     signal — the dataflow has finished initial snapshotting — and
     converges quickly even on cold-started 1600cc/3200cc replicas.
     Timeout here means the replica is genuinely wedged.

  2. Once hydrated, take the existing `CLUSTER_OBJECT_LIMITS_SAMPLES`
     steady-state lag samples. Unhealthy now means "hydrated but
     can't keep up", which is the property we actually want to
     measure; an overloaded cluster trips the threshold in
     `samples * sample_interval` (~10s) instead of in 300s.

With this decoupling the cold-start argument for the 300s timeout no
longer applies, so drop it back to 60s.
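
The two-phase structure can be sketched with the probes injected as callables. This is a simplified illustration, not the PR's `hydrate_and_sample`: the parameter names, the 2 s sample interval, and the `(None, False)` return on timeout are assumptions.

```python
import time
from typing import Callable, Optional


def hydrate_and_sample(
    probe_hydrated: Callable[[], tuple[int, int]],  # returns (total, hydrated)
    probe_lag_ms: Callable[[], float],
    lag_threshold_ms: float = 2_000,
    samples: int = 5,
    sample_interval_s: float = 2.0,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[float], bool]:
    """Phase 1: wait until every object is hydrated. Phase 2: judge steady-state lag."""
    deadline = time.monotonic() + timeout_s
    while True:
        total, hydrated = probe_hydrated()
        if total > 0 and hydrated == total:
            break
        if time.monotonic() >= deadline:
            return None, False  # never hydrated, so no lag sample was taken
        sleep(poll_interval_s)

    max_lag = 0.0
    for _ in range(samples):
        max_lag = max(max_lag, probe_lag_ms())
        sleep(sample_interval_s)
    return max_lag, max_lag < lag_threshold_ms
```

Passing `sleep` as a parameter keeps the sketch testable without waiting; a real driver would use the default.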

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dration budget

In build 1227, 3200cc indexes and 1600cc MVs both produced false
cliffs at N=100: the very first probe after CREATE CLUSTER timed out
with 0/100 hydrated, but every subsequent bisect step (N=50/75/87/93)
hydrated cleanly with lag=0.0. The cluster works fine — the replica
just isn't reporting introspection within 60s of being created on
multi-process sizes.

Thread a per-call `timeout_s` into `hydrate_and_sample` and let
`probe_and_record` pick between the regular and a longer "first
probe" budget. The coarse N-walk passes `first_probe=True` only on
its first iteration, so big-replica cold start gets headroom while
every other probe keeps the tight 60s budget that makes unhealthy
points cheap to record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No behavior changes; just shorter explanations where the original
prose restated the code or expanded beyond what a reader needs:

- MATERIALIZED_ADDITIONAL_SYSTEM_PARAMETER_DEFAULTS rationale: 7 → 3 lines
- _bulk_run docstring: 12 → 5 lines (keep the idempotency warning)
- hydrate_and_sample docstring: 25 → 13 lines (keep the "why split
  the phases" justification)
- probe_lag_ms / probe_hydrated docstrings: drop the tuple-field
  enumeration that duplicates the return type
- collapse the two "framework setup/drop unused" comments
- drop pure-label comments ("Snapshot of cluster sizes", "Outer loop")
  and the init/teardown lines that restate the next statement

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`workflow_default` carried a 150-line if-ladder mapping each `SCENARIO_*`
string to a `run_scenario_*` invocation, plus four parallel sections that
open / upload / archive / analyze one CSV each. Adding a new scenario or
result kind meant touching every one of those.

Replace both with two data registries:

  - `ScenarioSpec` + `SCENARIOS`: name, log label, family, factory lambda,
    groups. `SCENARIOS_BY_NAME` and `SCENARIO_GROUPS` are derived from it,
    so the hand-written `SCENARIOS_CLUSTERD` / `_COMPUTE` / etc. lists go
    away. A `Family` literal + `FAMILY_TO_STREAM` table selects the
    driver, and a small `run_spec` match replaces the if-ladder.

  - `ResultStreamSpec` + `RESULT_STREAMS`: suffix, fieldnames, analyzer,
    uploader. `workflow_default` opens one CSV per spec and the four
    parallel close/upload/artifact/analyze blocks become single loops.
    The old `SUFFIX_ANALYZERS` is now a derived alias.

The scenarios themselves, the four `run_scenario_*` drivers, and the
`Scenario` ABC are unchanged.
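
A pared-down sketch of the registry idea, assuming only three fields per spec. The real `ScenarioSpec` also carries a family and a factory, and the labels and entries below are illustrative rather than copied from the PR.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    label: str
    groups: tuple[str, ...]


# Illustrative entries only; the real registry lists every scenario.
SCENARIOS = [
    ScenarioSpec("envd_objects_scalability_tables", "Envd objects (tables)", ("envd_objects_scalability",)),
    ScenarioSpec("envd_objects_scalability_mvs", "Envd objects (MVs)", ("envd_objects_scalability",)),
    ScenarioSpec("cluster_object_limits_mvs", "Object limits (MVs)", ("cluster_object_limits",)),
]

# Both lookup tables are derived, so adding a scenario touches one list.
SCENARIOS_BY_NAME = {spec.name: spec for spec in SCENARIOS}

SCENARIO_GROUPS: dict[str, list[str]] = defaultdict(list)
for spec in SCENARIOS:
    for group in spec.groups:
        SCENARIO_GROUPS[group].append(spec.name)
```

Deriving the group map from the specs is what lets the hand-written per-group lists go away.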

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous `Scenario` ABC pretended that all four scenario families
shared a single setup/drop/materialize_views/run lifecycle, but only
strong/weak/envd_qps actually used it. EnvdObjectsScalability and
ClusterObjectLimits returned [] from every ABC method and were driven
through entirely different protocols (init/add_objects/teardown and
reset_for_cluster_size/probe_*/teardown respectively), with comments
apologising that "framework-level setup/drop are unused".

Replace the single ABC with three real shapes:

* `ClusterScalingScenario` (renamed from `Scenario`) for the
  strong/weak/envd_qps families. `drop()` and `materialize_views()` now
  default to `[]` so the envd_qps subclasses no longer need empty
  overrides.
* `EnvdObjectsScalabilityScenario` becomes its own ABC with the methods
  it actually exposes; the unused `replica_size` constructor parameter
  is dropped.
* `ClusterObjectLimitsScenario` becomes a plain class (no inheritance)
  and the unused `replica_size` parameter is dropped.

`ScenarioSpec.factory` now returns the union `AnyScenario`, and
`run_spec` narrows it per-family with isinstance asserts.

With ClusterObjectLimitsScenario no longer pretending to be a generic
Scenario, the 220-line `run_scenario_cluster_object_limits` collapses:
the `hydrate_and_sample` and `probe_and_record` closures and the
N-walk + bisect loop move onto the scenario as `_hydrate_and_sample`,
`_probe_and_record`, and `run_for_cluster_size`. The driver shrinks to
the outer cluster-size loop plus ScenarioRunner construction and
`_recreate_cluster_c` / `reset_for_cluster_size` / `teardown`
bookkeeping.

Behaviour is unchanged: same 13 scenarios, same groups, same CSV
output. Verified by direct module import and pyright/ruff/black.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three parallel scenario hierarchies (ClusterScalingScenario,
EnvdObjectsScalabilityScenario, ClusterObjectLimitsScenario) and five
run_scenario_* drivers collapsed onto one Scenario ABC with a
prepare/scale_points/apply/measure/cleanup_point/teardown lifecycle.
Strong/weak/envd-cpu/envd-objects become thin sweep wrappers around
the existing inner workloads; ClusterObjectLimitsScenario implements
Scenario directly. The result-stream choice now lives on each scenario
via stream_key(), so Family, FAMILY_TO_STREAM, AnyScenario, RunContext
and run_spec all go away.

Also: shared _extend_incremental/_shrink_incremental helpers used by
both envd_objects and cluster_object_limits scenarios, and the four
per-kind workflow_plot_* entries collapse to one workflow_plot that
dispatches by filename suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rop dead replica_size param

ClusterObjectLimitsScenario._probe_and_record was the only caller that
bypassed runner.add_result and wrote rows directly via
results_writer.writerow(...), because add_result didn't know about the
extra `healthy` column. Extend add_result with optional `time_ms` (for
values already in ms) and `healthy` kwargs so the cluster_object_limits
path matches every other scenario. The `healthy` column is silently
dropped on streams whose schema doesn't include it via the existing
extrasaction="ignore".
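
The `extrasaction="ignore"` behaviour of the standard library's `csv.DictWriter` can be seen in a small example. The field lists are abbreviated and illustrative; the real `CLUSTER_FIELDNAMES` has more columns.

```python
import csv
import io

CLUSTER_FIELDNAMES = ["scenario", "test_name", "time_ms"]  # abbreviated, illustrative
LIMITS_FIELDNAMES = CLUSTER_FIELDNAMES + ["healthy"]

row = {"scenario": "cluster_object_limits_mvs", "test_name": "lag", "time_ms": 12.5, "healthy": 1}


def write_row(fieldnames: list[str]) -> str:
    """Write the same row through a writer with the given schema."""
    buf = io.StringIO()
    # Keys not listed in fieldnames are dropped instead of raising ValueError.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()


with_healthy = write_row(LIMITS_FIELDNAMES)
without_healthy = write_row(CLUSTER_FIELDNAMES)
```

The same row dictionary can therefore be handed to every stream, and only streams whose schema lists `healthy` will emit it.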

Also drop the `replica_size` constructor parameter from ScenarioRunner:
every sweep wrapper passes None and mutates `runner.replica_size`
in `apply()`. The param was dead weight. _probe_and_record's
`replica_size` argument is dropped for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
) -> None:
"""Write one result row.
Contributor Author

Do we need this dance around time/time_ms? Couldn't we just update the existing callsites to pass in time_ms, since we apparently always convert to time_ms here anyway?

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
These scenarios share a four-step lifecycle: `drop()` cleans up state from
a prior run, `setup()` creates load-generator clusters / sources / tables,
`materialize_views()` lists persist objects to hydrate, and `run()`
performs the measurements. The drivers (`run_scenario_strong`,
Contributor Author

Do we still have these per-scenario drivers?

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
# Can return: status 404 Not Found
pass
The scale point (number of catalog objects, N) is driven externally by
`run_scenario_envd_objects_scalability` via init/add_objects/teardown;
Contributor Author

Do we still have these scenario runners?

def _recreate_cluster_c(
runner: ScenarioRunner,
replica_size: str,
smoke_test: bool = False,
Contributor Author

What do we have this smoke_test thing for?


label: str
cluster_size: str | None = None
Contributor Author

What's the difference between cluster_size and replica_scale?

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated


class EnvdObjectsScalabilityMvsScenario(EnvdObjectsScalabilityScenario):
"""N materialized views in the catalog, sharded across pad clusters.
Contributor Author

We should also mention that each MV has its own persist source (the table). This is as opposed to, say, having one table (persist source) that goes into an index, and then many materialized views off that index.

Contributor Author

And the same holds for the index scenario below. We can note that, as follow-up work, we also want to do the scenario where one table goes into an index that feeds many indexes.

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
self._pad_clusters_created += 1

def add_objects(self, runner: ScenarioRunner, target_n: int) -> None:
# Build one shard at a time: each shard lives on its own pad cluster
Contributor Author

Shard naming is confusing here because persist also has the concept of shards. Audit the file for this and find something else.

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
statements (indexes need a view + a default index; MVs need a single
CREATE MATERIALIZED VIEW). Statement templates take ``{schema}``,
``{base}``, and ``{i}`` placeholders; the ``WHERE id < {i}`` predicate
is what makes each view/MV structurally distinct so plan caching can't
Contributor Author

I don't think it's plan caching that would do this in Materialize, but we are doing it because some other things might. Let's drop this, or let's just say it's to make sure that we get separate dataflows. Audit the rest of the file for this kind of thing and also fix up those places.

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
- A single one-row base table `base_t` (never updated) lives in
``PAD_SCHEMA``. Frontiers advance over wall-clock time, so even with
no writes the cluster must keep ticking every materialization's
write_frontier — which is what saturates an undersized cluster.
Contributor Author

The saturation is independent of cluster size, so scratch that last bit.

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
"""Return (total, hydrated) for our test objects on cluster ``c``.

Decoupled from `local_lag` so the freshness loop can wait for the
cluster to be *running* first, then judge whether it can keep up —
Contributor Author

"Running" might not be the right framing here. Let's just say "wait for objects to be hydrated".

Comment thread test/cluster-spec-sheet/mzcompose.py Outdated
Two-phase probe: wait for hydration, then take steady-state lag
samples. Returns (max_lag_ms, healthy, reporting, total).

Splitting the two phases keeps unhealthy probes fast: an overloaded
Contributor Author

This is not true: an overloaded cluster often won't hydrate quickly either, so we have to workshop something here.

aljoscha and others added 5 commits May 19, 2026 19:00
- collapse `add_result` to `time_ms` only; converting at callsites
- refresh `ClusterScalingScenario` / `EnvdObjectsScalabilityScenario`
  docstrings to point at the current sweep wrappers
- clarify `smoke_test` purpose on `_recreate_cluster_c`
- document each `ScalePoint` field (cluster_size vs replica_scale, etc.)
- note one-table-to-many-materializations topology on the MV and
  cluster_object_limits scenarios, with a follow-up TODO for the
  alternative one-index-fan-out topology on the indexes scenario
- rename "shard" (per-pad-cluster MV batch) to avoid collision with
  persist's notion of shards; talk about pad clusters directly
- drop "plan caching" framing; say we want separate dataflows
- drop the "saturates an undersized cluster" aside
- reframe `probe_hydrated` around hydration, not "running"
- correct `_hydrate_and_sample` rationale: the split distinguishes
  warm-up lag from steady-state lag, not "fast unhealthy probes"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 60s hydration timeout was being hit on every cluster size in the
coarse N=2000 step on staging, and on big replicas (1600cc/3200cc) it
was hit at the bisect points too — yielding a measured "cliff" that
was driven by hydration time, not steady-state capacity, with bigger
replicas paradoxically reporting smaller max-healthy N.

Bump CLUSTER_OBJECT_LIMITS_HYDRATION_TIMEOUT_S from 60s to 300s as a
blanket raise.

Also distinguish hydration-timeout failures from lag failures in the
output: `_hydrate_and_sample` now returns a status string ("healthy" /
"lag" / "hydration_timeout"), and `_probe_and_record` writes that into
a new `failure_mode` CSV column (empty when healthy). The existing
`healthy` 0/1 column is unchanged, so post-hoc analysis keeps working;
the new column lets us tell the two unhealthy modes apart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a third cluster_object_limits scenario that puts a single root view
+ index in front of base_t. The N test indexes read from the root view
instead of base_t, so the optimizer imports the root index's
arrangement and the whole topology has one persist source regardless
of N. base_t is still on the data path so frontier ticks propagate
through.

The existing two scenarios are dominated by persist-source / persist-sink
breadth (one source per indexed view; one source + one sink per MV),
which obscures the "how many compute-only dataflows can a cluster tick"
axis. The new variant isolates that axis.

Rename for clarity:
- cluster_object_limits_indexes -> cluster_object_limits_indexes_from_persist_sources
- new -> cluster_object_limits_indexes_from_index
- cluster_object_limits_mvs -> cluster_object_limits_mvs_from_persist_sources

ClusterObjectLimitsKind grows a setup_statements tuple, run once per
cluster-size iteration after base_t exists. Only the from_index variant
uses it (to create v_root + v_root_primary_idx); the other two leave
it empty. The from_index lag/hydration filters exclude
v_root_primary_idx so it doesn't count as one of the N test objects.

New doc/developer/reports/ directory with a short report on the first
numbers from the cluster-spec-sheet scenarios on staging. Plots:

- cluster_object_limits: max idle materializations per cluster size,
  three series (indexes_from_persist_sources, indexes_from_index,
  mvs_from_persist_sources). Data from release-qualification build
  1231.
- envd_objects_scalability: CREATE TABLE latency vs catalog size
  (tables and MVs). Data from release-qualification build 1229.

Assets live under static/<slug>/, matching the design-doc convention.

Findings called out: the from_index variant carries ~1.7x more
indexes than from_persist_sources on small clusters; both index
curves trend downward with cluster size (matches the timely-dataflow
argument); the MV curve is flatter at lower N. Two single-cell
outliers flagged for a repeat run before publishing absolute
numbers.

Drop the "third pass" history paragraph, the from_index headline
finding (the plot conveys it), the stalls aside, the MV-vs-tables
baseline speculation, the "what's changed since the first cut"
section, and the Caveats section. Distill the outlier note to a
single sentence about the MV 400cc point and add a one-liner
acknowledging flakiness with directionally-correct results.

will tolerate fewer objects than this. The point of measuring the idle
ceiling is that everything else has to fit underneath it.

This report is the third pass over these scenarios; the first pass

Contributor Author

don't need this paragraph, readers don't need to know the history of the testing


Three observations:

**1. The `from_index` variant unlocks ~1.7× more objects on small

Contributor Author

cut this paragraph

objects is a flat-to-downward ceiling as the cluster grows, and that
is what we see.

**3. There are spikes worth flagging, not yet acting on.** The

Contributor Author

also need to distill this one, just mention that we think the one deep outlier is an outlier, leave it at that

investigation. If they don't, our methodology needs more samples per
size before we publish absolute numbers.

A subtler finding from this run's per-sample data: many of the

Contributor Author

Also cut this one, say that measurements are still a bit flaky and we'll improve as follow-up. But directionally the results look correct

Comment on lines +141 to +145
the slope is essentially the same. The MV baseline at small `N` is
slightly lower than tables, which is mildly surprising given MVs are
the "heavier" object — could be variance, could be a real difference
in the adapter path for MVs at low cardinality, not investigated
further here.

Contributor Author

drop this last bit here, no need to speculate

Comment on lines +152 to +187
## What's changed since the first cut

The 2026-05-18 garden post described two cluster scenarios
(`cluster_object_limits_indexes`, `cluster_object_limits_mvs`) and a
60-second hydration window. Three things have changed since then.

**New variant: `indexes_from_index`.** Added a chained-index scenario
that isolates compute-only dataflow tick overhead from persist-source
overhead. Implementation: `ClusterObjectLimitsKind.setup_statements`
runs once per cluster-size iteration after `base_t` exists. The
`from_index` kind uses it to create `v_root` + `v_root_primary_idx` on
the measurement cluster; the lag/hydration filters exclude
`v_root_primary_idx` so the root index doesn't count as one of the `N`
test objects.

**Renames for clarity:**

- `cluster_object_limits_indexes` → `cluster_object_limits_indexes_from_persist_sources`
- `cluster_object_limits_mvs` → `cluster_object_limits_mvs_from_persist_sources`
- new: `cluster_object_limits_indexes_from_index`

The new names spell out the topology each variant exercises.
References in our own pipelines were updated; no external (non-repo)
consumers were known.

**Hydration timeout raised to 5 minutes; failure mode is now
recorded.** The earlier 60-second hydration window timed out
deterministically on `1600cc` / `3200cc` at the coarse `N=2000`
step, which yielded a measured "cliff" that was driven by hydration
time, not steady-state capacity — bigger replicas paradoxically
reported smaller max-healthy `N`. The window is now 300 seconds (180 s
first-probe budget after `CREATE CLUSTER c` to absorb cold-start), and
hydration-timeout failures are written to a new `failure_mode` column
in the cluster_object_limits CSV (`"hydration_timeout"` vs `"lag"` vs
empty when healthy). The existing `healthy` 0/1 column is unchanged,
so prior analysis code keeps working.

Contributor Author

don't need this section, readers are only interested in the final state
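
For readers of this thread, the failure-mode recording described in the quoted section can be sketched as follows. The 300-second window and the three `failure_mode` values (`"hydration_timeout"`, `"lag"`, empty) come from the quoted text; the function names and their inputs are hypothetical and not taken from the PR's code.

```python
from typing import Optional

# Hydration window from the quoted section, in seconds.
HYDRATION_TIMEOUT_S = 300


def classify(hydration_seconds: Optional[float], lag_ok: bool) -> str:
    """Return the failure_mode value for one probe; empty means healthy."""
    # None stands for "never hydrated while we were watching".
    if hydration_seconds is None or hydration_seconds > HYDRATION_TIMEOUT_S:
        return "hydration_timeout"
    if not lag_ok:
        return "lag"
    return ""


def healthy_flag(failure_mode: str) -> int:
    """Derive the existing 0/1 healthy column from failure_mode."""
    return 1 if failure_mode == "" else 0
```

Keeping `healthy` derivable from `failure_mode` is what lets older analysis code keep working while newer code can tell a hydration timeout from a lag failure.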

Comment on lines +189 to +199
## Caveats

One run per data point. Every materialization here is derived from a
one-row table that never changes — real workloads will lower these
ceilings, often by a lot. The numbers are also specific to the
`cloud-staging` environment with its current `environmentd` CPU
allocation, `max_credit_consumption_rate`, and LaunchDarkly defaults; a
production environment will land somewhere else. Two of the
cluster-object-limits cells (`from_index` at `800cc`, `mvs` at `400cc`)
look like single-run outliers and should be repeated before being
quoted.

Contributor Author

also drop the caveats here

@aljoscha marked this pull request as ready for review May 21, 2026 05:56
@aljoscha requested a review from a team as a code owner May 21, 2026 05:56

- ./ci/plugins/mzcompose:
    composition: cluster-spec-sheet
    run: default
    args: [--cleanup, --target=cloud-staging, cluster_object_limits]

Contributor

How long does all of this take now btw?

Contributor Author

They do take quite long (https://buildkite.com/materialize/release-qualification/builds/1229): ~6h for the envd objects scalability one and ~3h for the cluster objects scalability one. Is that too long? I can cut it back, or make it so we run these only on request rather than on every release. Future improvements to DDL latency will likely help, but there is no clear roadmap for when we can land these.

Contributor Author

Thoughts on ☝️ ? @def-

Contributor

It's probably fine. If we get value out of the numbers, let's go for it.

@aljoscha merged commit 537b0d3 into MaterializeInc:main May 29, 2026
127 checks passed
@aljoscha deleted the envd-specsheet branch May 29, 2026 08:15
2 participants