
design: add design doc for zero-downtime upgrades v2, physical isolation, high availability #34602

Merged
aljoscha merged 19 commits into MaterializeInc:main from aljoscha:design-more-zero-downtime
Mar 11, 2026

Conversation

Contributor

@aljoscha aljoscha commented Dec 22, 2025

@aljoscha aljoscha force-pushed the design-more-zero-downtime branch from b5035b2 to 262ec33 on January 6, 2026 15:55
works. We have to audit whether it would work for both the old and new version
to write at the same time. This is important for builtin tables that are not
derived from catalog state, for example `mz_sessions`, the audit log, and
probably others.
Contributor

Isn't the audit log derived from catalog state? Are there any audit events we emit not in response to DDL?

In general, it seems problematic for the old environment to keep writing to any builtin tables. Afaik, the new environment completely truncates the builtin tables (except the audit log) when it promotes, and re-populates them. If the old environment continues to write, we could quickly end up with inconsistent data. For example, entries from mz_sessions could be removed twice, or could leak.

Comment on lines +199 to +200
TODO: Figure out if we want to allow background tasks to keep writing. This
includes, but is not limited to storage usage collection.
Contributor

Also all the storage-managed collections, right?

Comment on lines +30 to +32
The lived experience for users will be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during version cutover,
which should be on the order of 10s of seconds**.
Contributor

I wonder if it would be less disruptive to indefinitely hang DDL, instead of returning an error. The client would see a very slow DDL query that eventually fails due to a connection error when the old envd shuts down. Clients should retry on connection errors and the retry would connect to the new envd and succeed. So instead of erroring DDL we would have slow DDL during an upgrade, which doesn't require any special handling. That is unless the client has a timeout configured...

Contributor Author

A bit sneaky, but could work! Erroring is more honest but this is fudging things during the upgrade

Comment on lines +185 to +190
and will serve DQL/DML workload off of the catalog snapshot that it has. An
important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in a lame-duck phase. If the since
of the catalog shard has advanced, it will not be able to restart and read the
catalog at the old version that it understands. This may require holding back
the since of the catalog shard during upgrades or a similar mechanism.
Contributor

Is there reason to assume the old envd would restart faster than the new envd? There is some migration work to be done, but hopefully that doesn't increase startup time significantly, right? If we don't think the old envd can overtake the new envd then we don't need to worry about ensuring the old envd can start up again.

Contributor Author

ah yeah, that sounds interesting!

Comment on lines +236 to +238
We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections. For
Contributor

Using the self-correcting approach for builtin tables seems dubious! Consider the case of mz_sessions: The old envd likely has a set of active sessions that it thinks should be in there, the new envd likely has no active sessions. They would fight, with the new envd continually deleting all the contents and the old envd continually inserting them again. When you query mz_sessions you might see the expected contents, or an empty table. If we simply stopped writing the builtin tables from the old envd you would always see an empty mz_sessions table, which doesn't seem much worse, but would be much less effort.

Contributor

Ah, but I guess "simply stop writing in the old envd" isn't sufficient because we also need to ensure that the old envd stops writing before the new envd has started writing to the builtin tables, which I don't think we have a way to do currently.

Contributor Author

Yes, this section is a bit handwave-y, but the general idea is that we have to do something to allow multiple envds to collaborate. For zero-downtime but also for other things, so at some point we need to think about it and figure it out.

Contributor

For builtin tables that mirror
catalog state we can use the self-correcting approach

For those builtin tables that mirror catalog state, I would imagine the situation to not be too complicated: Since DDL is not allowed during the interesting time, the only catalog change is the new envd doing catalog migrations, right? If this is true, then there is no fighting needed between the two envds: Could we simply let the new envd also change the builtin tables when it does the migrations? (And the same might be true for the audit log, if it's true that the audit log changes only when the catalog changes.)

Comment on lines +276 to +278
I think they are ready, but sources will have a bit of a fight over who gets to
read, for example for Postgres. Kafka is already fine because it already
supports concurrently running ingestion instances.
Contributor

How does the postgres thing work today? When a source starts up it takes over the replication slot and crashes the other source? So both the old and the new source would be in a restart loop but hopefully still making progress?

Seems like we could just shut down all source dataflows when we enter "lame duck" mode, no?

Contributor Author

Yep, it's the fighting and crashing. And yes, we could shut them down in lame-duck mode!

Contributor

Thinking more about this, shutting them down in lame-duck mode isn't a good idea because it still takes a bit until the sources in the new leader env start writing. The new envd only sends the AllowWrites command after bootstrapping, so there would be 10s of seconds where the old sources have shut down but the new ones can't write yet.

Better if the new sources could fence out the old ones somehow.

Contributor Author

Hmm, I had that same thought and thought I had actually commented this here, but seems not ... 🙈

@aljoscha aljoscha force-pushed the design-more-zero-downtime branch from 262ec33 to 28c37c0 on January 12, 2026 11:04
reaped, it can still serve queries but not apply writes to the catalog
anymore, so cannot process DDL queries
6. New `environmentd` re-establishes connection to clusters, brings them out of
read-only mode
Contributor

When does the new environmentd apply catalog migrations?

At that point the lame duck old environmentd can no longer serve queries, right? How do we fence out the old one at that point?

Contributor Author

Migrations are applied at step 4, when the new version opens in read-write mode. The point of lame-duck mode is that the old version notices this, but instead of halting (as we do today) it keeps a snapshot of the catalog as it has it and keeps serving queries from that. That way we can serve SELECT and INSERT traffic, but not DDL, which would require the old version to still be able to write to the catalog.

That last bit is future work: we want to add forward/backward compat so that the two versions can both keep serving all types of queries.

Contributor

I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)

Contributor

I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true.

Seems fine to me. Since DDL is not allowed, the only change after the snapshot are the catalog migrations that the new envd does, right? Even if that invalidates some operation that a client is doing on the old envd (by e.g. revoking some permission), the user can't really complain, since the user can't expect the catalog migrations to take place at a certain point in time. This is different from e.g. an explicit REVOKE command that the user can expect to take place immediately after it returns successfully to the user. In other words, before the cutover happens, permissions are as if the upgrade were just not taking place yet.

Contributor

This depends a bit on how the network rerouting is implemented. Is it possible to ensure that all clients connect to the new envd at the same time? As in: When the first client issues a query against the new envd, is it guaranteed that no other client can run a query against the old envd subsequently? If not, there is a period where clients are connected to both envds simultaneously and then Ben's revoke scenario becomes relevant.

I'm also wondering how this is going to work in the HA/use-case-isolation work. If a REVOKE was committed on one envd, the other envds will learn about that with a delay. Does that mean we'll have to sync the catalog for every SELECT query that gets executed?

Contributor

In addition to the questions raised by Ben and Jan, couldn't the migration itself cause issues? How do we know that the migration doesn't change something relied upon by the old queries? Perhaps an internal table gets its schema changed or gets deleted.

Even if that invalidates some operation that a client is doing on the old envd (by e.g. revoking some permission), the user can't really complain, since the user can't expect the catalog migrations to take place at a certain point in time.

I don't think this is true. As Jan also raised, I don't see anything preventing the old one from answering queries after the new one if, for example, the cutover to the new one happens, a user connects and issues a query against it, and then an older connection to the old environmentd issues a query.

That's why I brought up fencing. We don't seem to have anything to prevent the old one from answering queries after the new one takes over.

Contributor Author

@aljoscha aljoscha Jan 26, 2026

Does that mean we'll have to sync the catalog for every SELECT query that gets executed?

It does indeed! We wrote about that quite a while ago in the design doc about the logical architecture of pv2:

I think there is no way around it in a distributed system where you still want strict serializability. You wouldn't do the actual implementation as described there, though. What we had in mind back then is that yes, you introduce a catalog timestamp, and you change the API of the timestamp oracle to allow getting two timestamps in one operation. So you get (query ts, catalog ts) when a query comes in, instead of just a read ts as we do right now. And then you compare the catalog snapshot you already have in memory against that and only need to sync to the latest catalog changes when you're outdated.

In practice, this means that you get excellent performance when there isn't much DDL. When you see DDL right when you do your query, you might have to pay the cost of syncing to the latest catalog changes.
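A minimal sketch of that two-timestamp idea, purely for illustration: `TimestampOracle`, `read_and_catalog_ts`, and `CatalogSnapshot` are hypothetical names, not the actual adapter API, and the catalog sync step is stubbed out.

```rust
/// Hypothetical oracle handle; all names here are illustrative only.
struct TimestampOracle {
    read_ts: u64,
    catalog_ts: u64,
}

impl TimestampOracle {
    /// Return (query read timestamp, latest catalog timestamp) in one call,
    /// so a query handler can decide whether its catalog snapshot is stale.
    fn read_and_catalog_ts(&self) -> (u64, u64) {
        (self.read_ts, self.catalog_ts)
    }
}

/// In-memory catalog snapshot held by an environmentd.
struct CatalogSnapshot {
    /// The catalog timestamp this snapshot reflects.
    as_of: u64,
}

impl CatalogSnapshot {
    /// Only pay the cost of syncing when the oracle says we are behind.
    fn ensure_up_to_date(&mut self, oracle: &TimestampOracle) {
        let (_query_ts, catalog_ts) = oracle.read_and_catalog_ts();
        if self.as_of < catalog_ts {
            // Hypothetical: replay catalog updates in (self.as_of, catalog_ts]
            // before planning the query; a cheap no-op when no DDL happened.
            self.as_of = catalog_ts;
        }
    }
}

fn main() {
    let oracle = TimestampOracle { read_ts: 42, catalog_ts: 7 };
    let mut snapshot = CatalogSnapshot { as_of: 5 };
    snapshot.ensure_up_to_date(&oracle);
    assert_eq!(snapshot.as_of, 7);
}
```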

There are some more far-out ideas that we can discuss when we're all at the onsite next week.

Contributor Author

I think that implies that it's always safe to serve traffic given a stale snapshot version of the catalog, but it's not obvious to me why that would be true. Couldn't you eg. write into tables that you'd had permissions revoked for, or that were deleted, or had their schema changed? (And similar questions for reads?)

Yeah, this is the part that also makes me hesitant about the design. The hand-wavy reasoning is that you cannot have catalog changes while the lame-duck envd is serving queries. And once the new version is serving queries the lame-duck one is not getting any more queries. But it's probably true that you cannot guarantee a "100% airtight" cutover from old to new.

IMO, if we determine that we can't go with the lame-duck approach, we still have to do all the suggested work, plus the actual forward/backward compat and following catalog changes. Which will also give us most of use-case isolation as a side effect. I like that, but it's more work, so I wanted to explore alternatives for delivering something quicker. 🕊️

Contributor

And once the new version is serving queries the lame-duck one is not getting any more queries. But it's probably true that you cannot guarantee a "100% airtight" cutover from old to new.

This is definitely not something we can guarantee. It is extremely likely the lame-duck will continue getting queries.

Comment on lines +171 to +175
The observation that makes this proposal work is that neither the schema nor
the contents of user collections (so tables, sources, etc.) change between
versions. So both the old version and the new version _should_ be able to
collaborate in writing down source data, accepting INSERTs for tables and
serving SELECT queries.
Contributor

What if we want to change the persisted data format? Or is that addressed by this later part of the document?:

A jumping-off point for this work is the recent work that allows multiple
versions to work with persist shards. Here we already have forward and backward
compatibility by tracking what versions are still "touching" a shard and
deferring the equivalent of migrations to a moment when no older versions are
touching the shard.

We can even render that as a dataflow and attempt the read-then-write directly
on a cluster, if you squint this is almost a "one-shot continual task" (if
you're familiar with that). But we could initially keep that loop in
`environmentd`, closer to how the current implementation works.
Contributor


We cut over network routes once the new deployment is fully ready, so any
residual downtime is the route change itself. During that window the old
deployment still accepts connections but rejects DDL with an error message.
Contributor

It's not really clear to me what "that window" refers to in this sentence. I would imagine the change in balancerd of starting to route incoming connections to the new envd to be instantaneous. Or does the "that window" refer to the time period from the previous paragraph instead, i.e., the time of the "lame-duck mode"?

@aljoscha aljoscha force-pushed the design-more-zero-downtime branch 2 times, most recently from df05e5c to 87514e9 on January 28, 2026 11:06
catalog and **halts**
6. New `environmentd` re-establishes connection to clusters, brings them out of
read-only mode
7. Cutover: network routes updated to point at new generation, orchestrator
Contributor

@ggevay ggevay Feb 2, 2026

One more question: What if we run into a compute reconciliation failure when the new envd comes up in read/write mode, which would cause a lot of downtime? Do we delay the cutover until the rehydration finishes (possibly a long time, maybe hours)?

Asking because there can be some sneaky reasons for compute reconciliation failures. For example, we get one if an optimizer feature flag is changed in LD after the new envd is started in read-only mode but before it restarts in read-write mode.

Contributor

Another reason is indexes being dropped that still had objects depending on them.

Contributor Author

Yeah, this is a known risk that we'll need to add mitigations for as we work on this. The staged rollout helps a bit: if reconciliation fails during the read-only phase we can abort before writing anything. But if it happens later we'd need to handle it. We'll have to keep this in mind as we get into implementation.

@aljoscha aljoscha changed the title from "design: add More Zero-Downtime Upgrades design doc" to "design: add doc for enhanced zero-downtime upgrades, high availability, physical isolation" on Feb 2, 2026
@aljoscha aljoscha changed the title from "design: add doc for enhanced zero-downtime upgrades, high availability, physical isolation" to "design: add design doc for zero-downtime upgrades v2, physical isolation, high availability" on Feb 18, 2026
@aljoscha aljoscha force-pushed the design-more-zero-downtime branch from 2de7c97 to f3c76c1 on February 25, 2026 14:11
@aljoscha aljoscha marked this pull request as ready for review February 25, 2026 14:11
provides the mechanism for this.

## Alternatives

Contributor

You're missing the alternative used by most (all?) non-materialize systems (etcd, cockroach, etc...) that support zero downtime upgrades: have the new generation be able to operate in RW mode with both the older protocol/state and the newer protocol/state.

  1. Stand up the new generation in RW mode, but talking the older protocol and writing the older state formats.
  2. Wait until the new generation is totally ready (ie: clusterd hydrated).
  3. Update K8S services to point at the new generation.
  4. Wait until DNS has propagated. All new connections should go to the new generation after this point.
  5. Tear down the old generation.
  6. Send a signal to the new generation that it is safe to perform migrations to the new protocols/state formats.
  7. Perform catalog migrations and begin operating in the new protocols/state formats. Critically, this step should also fence out the old generation.
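For illustration only, the gating between these steps could be modeled roughly like the sketch below; the stage names and the `ready` signal are hypothetical and simply restate the numbered list above, not actual orchestration code.

```rust
/// Hypothetical stages of the upgrade flow sketched in the list above.
#[derive(Debug, PartialEq)]
enum UpgradeStage {
    /// Steps 1-2: new generation up in RW mode, speaking old protocol/formats.
    NewGenerationRwOldFormats,
    /// Steps 3-4: K8S services updated, DNS propagated.
    RoutesPointAtNewGeneration,
    /// Step 5: old generation torn down.
    OldGenerationTornDown,
    /// Steps 6-7: migrations applied, old generation fenced out.
    MigratedToNewFormats,
}

/// Each stage only starts once the previous one has completed successfully
/// (e.g. clusterds hydrated, DNS propagated); `ready` stands in for that check.
fn next_stage(current: &UpgradeStage, ready: bool) -> Option<UpgradeStage> {
    if !ready {
        return None; // gate: stay in the current stage and retry later
    }
    match current {
        UpgradeStage::NewGenerationRwOldFormats => Some(UpgradeStage::RoutesPointAtNewGeneration),
        UpgradeStage::RoutesPointAtNewGeneration => Some(UpgradeStage::OldGenerationTornDown),
        UpgradeStage::OldGenerationTornDown => Some(UpgradeStage::MigratedToNewFormats),
        UpgradeStage::MigratedToNewFormats => None, // terminal stage
    }
}

fn main() {
    let mut stage = UpgradeStage::NewGenerationRwOldFormats;
    while let Some(next) = next_stage(&stage, true) {
        stage = next;
    }
    assert_eq!(stage, UpgradeStage::MigratedToNewFormats);
}
```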

Contributor Author

I think that's exactly what I'm proposing in https://github.com/aljoscha/materialize/blob/design-more-zero-downtime/doc/developer/design/20251219_zero_downtime_upgrades_physical_isolation_high_availability.md#proposal, but you're right I didn't spell that out well, and didn't put the "apply migrations and fence out" part in the new upgrade flow. I will update those.

@aljoscha aljoscha force-pushed the design-more-zero-downtime branch from f3c76c1 to 8afdd0b on February 26, 2026 15:35
Comment on lines +169 to +172
4. Orchestrator triggers promotion: new `environmentd` enters read/write mode,
writes its `deploy_generation`/`version` to the catalog. The new version
constrains itself to only write catalog data that is backward-compatible
with the old version.
Contributor

What's the reason why we write the new deploy_generation/version down here already, and not in step 9 when we perform migrations? Asked differently: What purpose do these two fields serve in this new world?

A purpose I imagined the version field would serve is to tell envds which catalog version is legal to write. However, that doesn't work if we already update the version here. So it seems like we need another catalog_version field in the catalog?

Contributor

I agree with Jan, that this probably needs to be the catalog version, and only written later when we do the migrations. Do we still need the deploy generation or envd version any more? These previously fenced out the old pods, but the catalog version should do the fencing now, right?

Contributor Author

Good point. I updated the doc: deploy_generation and code version are no longer stored in the catalog. The only version-related field in the catalog is catalog_version, which is written in step 9 when migrations are applied — that's also what fences out old versions. deploy_generation remains useful for orchestration (orchestratord knowing old vs new deployment) but is not a catalog concern. Nothing is written to the catalog in step 4 anymore.
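A rough sketch of how catalog_version could serve as the fencing token, assuming hypothetical types (`Catalog`, `CatalogVersion`, `WriteError`); this only illustrates the compare-and-fence idea described above, not the actual catalog implementation.

```rust
/// Illustrative only: how a catalog_version field could double as the fencing
/// token. Names and types here are hypothetical, not the actual catalog code.
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
struct CatalogVersion(u64);

struct Catalog {
    version: CatalogVersion,
}

#[derive(Debug)]
enum WriteError {
    /// A newer version has already migrated the catalog; this writer must halt.
    Fenced { found: CatalogVersion },
}

impl Catalog {
    /// Every write validates the catalog_version the writer expects; a process
    /// still running at an older version gets fenced once migrations bump it.
    fn write(&mut self, expected: CatalogVersion) -> Result<(), WriteError> {
        if self.version > expected {
            return Err(WriteError::Fenced { found: self.version });
        }
        // ... perform the actual write ...
        Ok(())
    }

    /// Step 9: applying migrations bumps catalog_version, which is what fences
    /// out any lingering old-version environmentd.
    fn apply_migrations(&mut self, new_version: CatalogVersion) {
        self.version = new_version;
    }
}

fn main() {
    let mut catalog = Catalog { version: CatalogVersion(1) };
    assert!(catalog.write(CatalogVersion(1)).is_ok()); // old envd can still write
    catalog.apply_migrations(CatalogVersion(2)); // new envd migrates (step 9)
    assert!(catalog.write(CatalogVersion(1)).is_err()); // old envd is now fenced
}
```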

compatibility: it tracks what versions are still "touching" a shard and defers
migrations until no older versions remain. We apply the same principle at the
catalog level, giving us gating both by regular feature flags and by catalog
version.
Contributor

@teskje teskje Feb 27, 2026

This part seems pretty difficult to me, if we want to implement it in a way that is robust and doesn't impact our ability to ship changes. I'd be interested to hear your thoughts about how we'd implement this, if you have any!

Contributor Author

Yeah I agree it's difficult, but I don't really see a way around it — true zero-downtime needs forward/backward compat between versions, as Alex also pointed out. Other systems do it this way too. What I have in mind is version checks at specific code points, similar to how persist does it. Compat window would only span adjacent versions. Probably needs its own design/issue when we get to it.
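A minimal hypothetical sketch of such version checks (names like `CompatGate` and `oldest_active` are made up), loosely mirroring how persist defers migrations until no older version is touching a shard; the compat window here spans adjacent versions only.

```rust
/// A minimal, hypothetical model of a compat window spanning adjacent versions.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version(u64);

struct CompatGate {
    /// Oldest code version that may still be reading or writing this state.
    oldest_active: Version,
}

impl CompatGate {
    /// A new representation may only be written once no process older than the
    /// version that introduced it can still be active.
    fn can_write_format(&self, introduced_in: Version) -> bool {
        self.oldest_active >= introduced_in
    }
}

fn main() {
    // During an upgrade, versions N and N+1 run concurrently for a while.
    let gate = CompatGate { oldest_active: Version(25) };
    // A format introduced in version 26 must wait until version 25 is gone.
    assert!(!gate.can_write_format(Version(26)));
    // A format that version 25 already understands can be written right away.
    assert!(gate.can_write_format(Version(25)));
}
```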

Another approach would be to introduce a self correcting loop that syncs
catalog state (the desired state) to builtin tables. Similar to how we already
have such loops at the core of the materialized views sink or for "differential
sources" inside the storage controller.
Contributor

Making builtin tables views over the catalog would be yet another approach, no?

Contributor Author

Totally, and we already have work on the way for this. I updated the doc to propose that as the preferred approach and mention the self-correcting loop as an alternative.
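For reference, a minimal sketch of the self-correcting-loop alternative (desired state derived from the catalog, actual state read back from the builtin table's shard); the types here are hypothetical, and the real loops in the materialized view sink and the storage controller are considerably more involved.

```rust
use std::collections::BTreeMap;

/// Illustrative only: the core of a self-correcting loop that reconciles a
/// builtin table (actual contents) against catalog-derived contents (desired).
type Row = String;
type Diff = i64;

/// Compute the additions/retractions needed to make `actual` match `desired`.
fn correction(desired: &BTreeMap<Row, Diff>, actual: &BTreeMap<Row, Diff>) -> Vec<(Row, Diff)> {
    let mut updates = Vec::new();
    for (row, want) in desired {
        let have = actual.get(row).copied().unwrap_or(0);
        if want - have != 0 {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, have) in actual {
        if !desired.contains_key(row) {
            // Retract rows that are no longer implied by the catalog.
            updates.push((row.clone(), -have));
        }
    }
    updates
}

fn main() {
    let desired = BTreeMap::from([("cluster: quickstart".to_string(), 1)]);
    let actual = BTreeMap::from([("cluster: old".to_string(), 1)]);
    // The loop would append these updates to the builtin table's shard and
    // re-read to converge, repeating whenever the desired state changes.
    let updates = correction(&desired, &actual);
    assert_eq!(updates.len(), 2);
}
```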


## Alternatives

### Cut over to new version without halt-and-restart
Contributor

This alternative looks more attractive now that the main proposal includes gating envd behavior on catalog versions. At least this seems easier to implement to my (perhaps naive) mind.

What exactly are the things that cause seconds of downtime here? Applying the catalog migrations? How is that different from applying the catalog migrations in step (9) above?

Contributor Author

The way I think about it: the "speed of light" of a version cutover is cutting over network routes; you have to do that regardless because you're pointing clients at different processes. The proposal with two deployments running concurrently gets as close to that speed of light as possible. With an in-process transition, however long it takes to bring everything out of read-only mode (transitioning in-memory data structures, controllers, on-cluster dataflows, etc.) shows up directly in the downtime. Getting that to actual zero is very hard or impossible. On top of the time cost, successfully transitioning all in-memory data structures from read-only to read-write mode is itself a difficult and error-prone undertaking. I updated the doc with this reasoning.

Contributor

getting that to actual zero is very hard or impossible

I think this is the point I don't quite follow. It's probably just my naive view on things, but it would be helpful for me to have some more specifics about which things are hard/impossible. In my mind, we'd only have to flip a couple flags to allow things to start writing, and commit a catalog migration. The catalog migration is potentially expensive, but we'll have to do that also with the main approach proposed in this design. What else am I missing?

Contributor

We discussed this offline. A convincing (to me) argument for needing to run both environments in r/w mode at the same time is that we want to cut over connections slowly, as Materialize can only handle a limited number of connections per second, so cutting over thousands of connections at once would lead to extended unavailability for some of them. So it looks like we don't really have a choice. I'm still skeptical of our ability to implement the required compatibility mode in adapter without introducing a maintenance nightmare, but other databases do it, so there is probably a way.

The change from the current upgrade procedure would be this flow:

1. New `environmentd` starts with higher `deploy_generation`/`version`
2. Boots in read-only mode: opens catalog in read-only mode, spawns `clusterd`
Contributor

Why boot in read-only mode? Why not start in RW mode?

Contributor Author

I updated the doc to explain this. The upgrade is a staged rollout: read-only → RW with old catalog version → full RW with new catalog version. Each stage gates on the previous one succeeding. The read-only phase lets us validate that the new version can boot, open the catalog, spawn cluster processes, and hydrate before it writes anything — if something goes wrong we can abort without having touched any state.

Do you agree with this 3-staged approach or do you have concerns?

Contributor

It's totally fine to do this approach, it just seemed unnecessary to separate the first two phases, since the writes should be at the old catalog version. I wanted to confirm this was the intent, and not a hold over from earlier versions of this doc. It is absolutely OK to be more paranoid and slow roll this with RO first.

From earlier discussions, I was under the impression that we were trying to avoid supporting an RO to RW transition just to minimize the work needed. I guess that's likely similar work to an RW (old) to RW (new) transition, though, so maybe that doesn't buy us anything.

Contributor

Keep in mind that there are other things to write than just the catalog! Sources, materialized views, and any persist collection are affected by this as well. It's nice to be able to bring up an environment in a mode where it can't do harm, validate it, and then cut it over.

On the other hand, if we shipped a version of Materialize that had a bug corrupting sources or MVs, we probably wouldn't notice until after we have cut over 🙃

Comment on lines +340 to +353
### Persist Pubsub

Currently, `environmentd` acts as the persist pub-sub server. It's the
component that makes the latency between persist writes and readers noticing
changes snappy.

When we have both the old and new `environmentd` handling traffic, and both
writing at different times, we would see an increase in read latency because
writes from one `environmentd` (and its cluster processes) are not routed to
the other `environmentd` and vice-versa.

We have to somehow address this, or accept the fact that we will have degraded
latency for the short period where both the old and new `environmentd` serve
traffic.
Contributor

Talked about this offline with Aljoscha, but here's a summary of the conversation:
Jun: Is there a reason why we can't have the persist pubsub as a separate service instead of inside environmentd? I assume it's because environmentd has a handle on all the shards
Al: No reason why we can't, and the persist pubsub service doesn't need knowledge of persist shards. However it's the least of my worries. Main complications involve:

  1. Making sure the protocol of the pub-sub service is backwards compatible, although the use of protobufs makes this problem easier
  2. Another piece of orchestration

aljoscha and others added 7 commits March 5, 2026 14:02
Make explicit in the upgrade flow that step 9 (catalog upgrade) also
serves as the fencing mechanism for any lingering old-version processes,
and that old processes from step 8 may not terminate instantly.

Addresses review feedback from alex-hunt-materialize.
…n is the only version field

The deploy_generation and code version are no longer stored in the
catalog. deploy_generation remains useful for orchestration but is not
a catalog concern. The only version-related field in the catalog is
catalog_version, which controls schema/features and serves as the
fencing mechanism.

Update the upgrade flow to remove references to writing
deploy_generation/version in step 4, and clarify that step 9 is where
catalog_version is upgraded.

Addresses review feedback from teskje and alex-hunt-materialize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add paragraph explaining that the upgrade is a staged rollout
(read-only → RW with old catalog version → full RW with new catalog
version), where each stage gates on the previous one succeeding. The
read-only phase validates the new version can boot and hydrate before
writing anything.

Addresses review feedback from alex-hunt-materialize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…oblem

For builtin tables that mirror catalog state, making them views over
the catalog sidesteps the concurrent writer problem entirely. There is
already work in progress towards this approach.

Addresses review feedback from teskje.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add details about point-of-no-return marking in Materialize CR,
EndpointSlice management, and DNS TTL handling to the orchestration
cutover section.

Addresses review feedback from alex-hunt-materialize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Discussion concluded that lame-duck mode can't be done safely because
the old version can't reliably interact with the catalog after
migrations.

Addresses review feedback from alex-hunt-materialize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hand-wavy "on the order of seconds" argument with concrete
reasoning: the speed of light is the network route cutover, two
concurrent deployments get as close to that as possible, and in-process
transitions add both time and complexity from transitioning data
structures from read-only to read-write.

Addresses review feedback from teskje.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aljoscha aljoscha merged commit 6eea7a4 into MaterializeInc:main Mar 11, 2026
5 checks passed
@aljoscha aljoscha deleted the design-more-zero-downtime branch March 11, 2026 13:55
antiguru pushed a commit to antiguru/materialize that referenced this pull request Mar 26, 2026