Add freshness troubleshooting runbook #35319
kay-kim left a comment
The docs look great. I left some questions/suggestions ... If amenable, I can add a patch to the page (once I know the answers to the questions and that the suggestions are not off-the-mark).
```sql
o_next.type AS to_type,
greatest(
    to_timestamp(fn.write_frontier::text::double / 1000)
    - to_timestamp(fp.write_frontier::text::double / 1000),
```
So ... the where clause has:

```sql
AND fn.write_frontier <= fp.write_frontier
AND fp.write_frontier::text::numeric > fn.write_frontier::text::numeric
```

Would this `greatest(fn - fp, interval '0')` always return 0?
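A quick sanity check of this point, sketched with hypothetical frontier values: given the WHERE constraint `fn.write_frontier <= fp.write_frontier`, the `fn - fp` difference is never positive, so the `greatest(..., interval '0')` clamp always wins.

```python
from datetime import timedelta

# Hypothetical frontier values (ms since epoch) satisfying the WHERE
# clause constraint fn.write_frontier <= fp.write_frontier.
fn_frontier_ms = 1_700_000_000_000  # next object's frontier
fp_frontier_ms = 1_700_000_005_000  # previous object's frontier (later)

# Mirrors to_timestamp(fn/1000) - to_timestamp(fp/1000): non-positive here.
diff = timedelta(milliseconds=fn_frontier_ms - fp_frontier_ms)

# Mirrors greatest(diff, interval '0'): the clamp returns 0.
result = max(diff, timedelta(0))
print(result)  # 0:00:00
```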
big fan of this :)
> Objects with a lag of a few seconds are typical for healthy systems.
> Large lag values (minutes or hours) indicate a problem.
Should we filter out objects maintained by paused clusters here? Those will be showing up with a high lag but don't indicate anything wrong.
Ah, this is mentioned below in "Filtering noise". Perhaps move that section up or reference it from here to make it clear that there exist some reasons why high lags in fact don't indicate a problem?
Kind of got rid of the Filtering noise section:
- we now return cluster name and id ... as such, people should be able to see which non-production clusters are returning
- As for having 0-replica clusters return ... this is in case they had forgotten to add back the replica. I do footnote it and add a link to the Check for no compute section.
```sql
AND s.status <> 'running';
```

> A source with status `stalled` or `starting` will hold back all downstream objects.
I'm not sure if that's still the case, or if I'm mis-remembering, but I think it used to be that unhealthy sources were hard to identify because they'd move through the stalled/starting statuses very quickly and usually showed up as "running" even when they were restart-looping. We might need to recommend looking at the status history too.
- I checked the source status history for a source and it did just keep transitioning between starting/running
- but, the source itself and its tables (using new syntax ... so its subsources in the old syntax) were green, so I removed the blurb about checking `mz_internal.mz_source_status_history` because not sure it told me anything.
- Is it that for the `mz_internal.mz_source_status_history` query, we want to specify error/details is not null or something (in order to be actionable)? Because if not particularly actionable, there's limited value?
(I'm running a small cluster with highly inefficient mat views ... to test some of the materialization queries):
```sql
SELECT s.id, s.name, ssh.status, ssh.occurred_at, ssh.error, ssh.details
FROM mz_internal.mz_source_status_history ssh
JOIN mz_catalog.mz_sources s ON ssh.source_id = s.id
WHERE s.name = 'pg_source1'
ORDER BY ssh.occurred_at DESC;
```
| id | name | status | occurred_at | error | details |
| -- | ---------- | -------- | -------------------------- | ----- | ------- |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.52+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.52+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.494+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.494+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.465+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.465+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.431+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.431+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.4+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.4+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.361+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.361+00 | null | null |
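The flapping visible in the history above can be surfaced programmatically even though the latest status is `running`. A minimal sketch (hypothetical rows mirroring the output) counts `starting` transitions within a short window:

```python
from datetime import datetime, timedelta

# Rows shaped like the mz_source_status_history output above: (status, occurred_at).
rows = [
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 520000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 520000)),
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 494000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 494000)),
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 465000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 465000)),
]

# Count 'starting' transitions inside a short trailing window; a healthy
# source should start once, not many times per minute.
window = timedelta(minutes=1)
latest = max(t for _, t in rows)
restarts = sum(1 for s, t in rows if s == "starting" and latest - t <= window)
print(restarts)  # 3
```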
> ## Investigating historical spikes
>
> Materialize retains wallclock lag history for up to 30 days in [`mz_internal.mz_wallclock_global_lag_history`](/sql/system-catalog/mz_internal/#mz_wallclock_global_lag_history), binned by minute.
As mentioned above: It's "at least 30 days".
> **Resolution**: Scale the cluster up, or move expensive workloads to a separate cluster.
>
> ### Expensive materialized view
Why is this specific to materialized views? The same is true for indexes, no?
> ### OOM crash loop
>
> **Symptoms**: An object shows persistent lag that fluctuates. Historical lag data for the object has gaps.
How does a cluster crash loop lead to gaps in the lag data? That data is collected by the controller and shouldn't be affected by clusters crashing.
Just added a patch w. minor reorg as a starting point for myself. Once I'm back in NY with a big monitor and printer, I can absorb the content a bit better and patch it with better organization and tweaks here and there. (heh ... I'll also check that my copy+pasting to move things didn't accidentally clobber anything ... am so dependent on a monitor 😄)
> A source that is restart-looping may briefly show `running` between restarts; check `mz_internal.mz_source_status_history` for repeated transitions to confirm.
> For PostgreSQL sources, the subsources share replication state with the parent source; if one subsource lags, all subsources of that source typically lag together.
>
> Check the frontier of a specific source against wall-clock time:
FYI: The previous query above ... it returns the status. So ...
- What purpose does this query serve?
- Also, in here we query the `mz_internal.mz_frontiers` ... in the above check wallclock lag, we use `mz_wallclock_global_lag` ... What's the diff other than we need to do the subtraction ourselves?
I'm going to post up my next patch (which just focuses on this section) where I remove it for now.
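For reference, the arithmetic of the manual `mz_frontiers`-style calculation is small; a sketch using an assumed fixed wall-clock instant and a hypothetical frontier value:

```python
from datetime import datetime, timedelta, timezone

# Fixed "wall-clock" instant for a deterministic example (assumed value).
now = datetime(2026, 3, 24, 14, 50, 0, tzinfo=timezone.utc)

# A hypothetical write_frontier 5 seconds behind, stored the way
# Materialize surfaces it: milliseconds since the Unix epoch, as text.
write_frontier = str(int((now - timedelta(seconds=5)).timestamp() * 1000))

# The manual calculation done against mz_frontiers:
frontier_time = datetime.fromtimestamp(int(write_frontier) / 1000, tz=timezone.utc)
behind_wallclock = now - frontier_time
print(behind_wallclock)  # 0:00:05
```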
> restarts; check `mz_internal.mz_source_status_history` for repeated transitions to confirm.
> {{< /tip >}}
Per comment, removed the following query from this section:
```sql
SELECT
    o.name,
    to_timestamp(f.write_frontier::text::double / 1000) AS frontier_time,
    now() - to_timestamp(f.write_frontier::text::double / 1000) AS behind_wallclock
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE o.id = '<source_id>';
```
- Since the previous query returns the status and error.
- Also, unclear why we check `mz_internal.mz_frontiers` and do the calc versus at the check wallclock lag section where we check `mz_wallclock_global_lag`.
> | **`local_lag` is low but `global_lag` is high** | An upstream dependency is the bottleneck. Look at `slowest_global_input` to identify the root cause. | See [Computation bottleneck](#computation-bottleneck). |
> | **`local_lag` = 0, `global_lag` = 0, wallclock lag is high** | The root source is behind. The entire pipeline is caught up relative to its inputs, but the inputs themselves lag behind wall-clock time. | See [Source ingestion bottleneck](#source-ingestion-bottleneck). |
>
> ## Source ingestion bottleneck
kay-kim left a comment

Just leaving some comments as am about to upload the next patch.
```sql
o.name,
o.type,
ml.local_lag,
ml.global_lag
```
In the next patch, will remove ml.global_lag since we don't use it in this section.
> A cluster that repeatedly runs out of memory will have its replica crash and restart. Each restart triggers rehydration, during which no progress is made, causing recurring freshness degradation.
>
> Check the current replica status:
Just curious ... this would only show something if the replica has already restarted? ... I'm wondering if the next query, which checks the status history, is possibly the one-and-done one. Left both in the next patch ... but something to think over. (Also, am realizing I probably should be making these comments on the patch instead of the overall files changed ... but, it's a "shutting the barn door after the horses are out" kind of thing now.)
> The time between restarts indicates the severity: a replica that OOMs every few minutes is fundamentally too small for its workload.
>
> To see the full lifecycle of replicas, including how often new ones are created:
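As a rough illustration of judging severity from restart cadence, a sketch with hypothetical restart timestamps (e.g., pulled from replica status history):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps of replica restarts.
restart_times = [
    datetime(2026, 3, 24, 14, 0),
    datetime(2026, 3, 24, 14, 7),
    datetime(2026, 3, 24, 14, 13),
    datetime(2026, 3, 24, 14, 20),
]

# Gaps between consecutive restarts; a replica restarting every few
# minutes is chronically under-provisioned for its workload.
gaps = [b - a for a, b in zip(restart_times, restart_times[1:])]
mean_gap = sum(gaps, timedelta()) / len(gaps)
print(mean_gap)  # 0:06:40
```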
Removing this part in the next patch because not sure what this is telling us in terms of troubleshooting/diagnosis.
> and restart. Each restart triggers rehydration, during which no progress is made, causing recurring freshness degradation.
>
> #### Replica is currently offline
Just curious if this current status is needed or if having people check the loop is better.
> | **Diagnosis** | PostgreSQL sources use a single replication stream for all subsources/tables. If one slows down (e.g., due to a large transaction), all subsources/tables for that source are affected. |
> | **Resolution** | Wait for the subsource/table to catch up. |
>
> ## Cluster CPU or memory pressure
> | | |
> |--|--|
> | **Symptom** | Objects on the cluster do **not** have similar `local_lag`. |
> | **Diagnosis** | The dataflow is expensive, not the cluster. |
```sql
o.type,
wl.lag,
c.name as cluster_name,
c.id as cluster_id
```
Added cluster details to plant the seed for people to prompt their agent in case people want to run for a specific cluster
```sql
FROM mz_internal.mz_materialization_lag ml
JOIN mz_catalog.mz_objects o ON ml.object_id = o.id
WHERE o.id = '<object_id>'
ORDER BY ml.global_lag DESC;
```
Is `mz_materialization_lag` doing more or less the same as this query on `mz_internal.mz_frontiers`?
```sql
SELECT
    o.id, o.name, o.type,
    round(
        extract(epoch from now()) * 1000
        - f.write_frontier::text::numeric
    ) AS lag_ms
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE f.object_id LIKE 'u%'
    AND f.write_frontier IS NOT NULL
ORDER BY 4 DESC
```
```sql
(SELECT write_frontier::text::numeric
 FROM mz_internal.mz_frontiers
 WHERE object_id = '<object_id>') -- update
ORDER BY lag_ms DESC;
```
Updated to use Frank's freshness poc query.
```sql
SELECT
    o_probe.name AS object_name,
    o_prev.name AS from_name,
    o_prev.id AS from_id,
```
added the from_id ... so that it's easier to iterate on these objects.
> | `stalled` | Common causes include network partitions, credential expiration, and upstream database restarts. Check the returned `error` field and address appropriately. Once the source reconnects, downstream objects should catch up automatically. |
> | `paused` | The cluster associated with the source has no compute/replica assigned (`replication_factor = 0`). See [Check for no compute](#check-for-no-compute). |
> | `starting` | Wait for the source to transition to running. Downstream objects should catch up automatically. |
As mentioned, removed for now the mention of checking the source status ... So ... if the stalled status is difficult to find ... we will need some other actionable thing to diagnose.
I stopped here for this 3rd patch. Will do the remainder in the next patch.
Add a step-by-step guide for diagnosing freshness problems in Materialize, covering real-time diagnosis (wallclock lag, materialization lag, source health, cluster health, dependency graph attribution) and historical spike analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include mz_cluster_replica_statuses, mz_cluster_replica_status_history, and mz_cluster_replica_history queries for detecting OOM crash loops that cause recurring freshness degradation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds guidance on measuring P99.999 freshness across a deployment, including threshold-based queries that work around Materialize SQL limitations (no WITHIN GROUP, no sum(interval)). Documents common noise sources (paused sources, zero-replica clusters, static data, non-production clusters) and adds two new common patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
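The threshold-based workaround mentioned here can be sketched outside SQL; with hypothetical lag samples, the idea is to count the fraction of samples under a candidate bound rather than computing a percentile directly (since `WITHIN GROUP` isn't available):

```python
# Hypothetical per-object wallclock lag samples, in milliseconds.
lags_ms = [120, 250, 300, 450, 900, 1200, 45000]

def fraction_under(threshold_ms):
    # What fraction of samples meets the candidate freshness bound?
    return sum(1 for lag in lags_ms if lag <= threshold_ms) / len(lags_ms)

print(round(fraction_under(2000), 3))  # 0.857
```

Sweeping the threshold until the fraction reaches the target (e.g., 0.99999) approximates the percentile without a percentile aggregate.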
Adds catalog documentation for the two previously undocumented views:
- `mz_wallclock_global_lag_history`: minute-binned, 30-day retention, aggregated across replicas (min lag per object per minute)
- `mz_wallclock_global_lag_recent_history`: filtered to last 24 hours

Also fixes review feedback in the freshness runbook:
- Replace `!=` with `<>` for psql compatibility
- Add `o.id` to GROUP BY in aggregate query to prevent name collisions
- Fix paused source resolution (no ALTER SOURCE resume exists)
- Clarify P99.999 claim as per-object

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
Add new sections for distinguishing source-driven vs. computation-driven spikes, correlating spikes with DDL events via audit log, steady-state freshness analysis by excluding identified time windows, and a deploy-related freshness degradation pattern. Fix various issues flagged in review: make cause list non-exhaustive, rename Step 2, convert interpretation to table with next-step links, fix edge_delay computation in dependency graph query, remove incorrect claims about static data lag and OOM lag data gaps, fix retention wording, generalize "expensive MV" to "expensive dataflow", and correct advice for unpausing sources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
184e867 to d236653


https://preview.materialize.com/materialize/35319/transform-data/freshness-troubleshooting/
Step-by-step guide for diagnosing freshness problems, covering:
🤖 Generated with Claude Code