Add freshness troubleshooting runbook #35319
kay-kim left a comment
The docs look great. I left some questions/suggestions ... If amenable, I can add a patch to the page (once I know the answers to the questions and that the suggestions are not off-the-mark).
```sql
o_next.type AS to_type,
greatest(
    to_timestamp(fn.write_frontier::text::double / 1000)
    - to_timestamp(fp.write_frontier::text::double / 1000),
```
So ... the where clause has:

```sql
AND fn.write_frontier <= fp.write_frontier
AND fp.write_frontier::text::numeric > fn.write_frontier::text::numeric
```

Would this `greatest(fn - fp, interval '0')` always return 0?
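A quick sanity check of this point, sketched with hypothetical frontier values: given the WHERE constraint `fn.write_frontier <= fp.write_frontier`, the `fn - fp` difference is never positive, so the `greatest(..., interval '0')` clamp always wins.

```python
from datetime import timedelta

# Hypothetical frontier values (ms since epoch) satisfying the WHERE
# clause constraint fn.write_frontier <= fp.write_frontier.
fn_frontier_ms = 1_700_000_000_000  # next object's frontier
fp_frontier_ms = 1_700_000_005_000  # previous object's frontier (later)

# Mirrors to_timestamp(fn/1000) - to_timestamp(fp/1000): non-positive here.
diff = timedelta(milliseconds=fn_frontier_ms - fp_frontier_ms)

# Mirrors greatest(diff, interval '0'): the clamp returns 0.
result = max(diff, timedelta(0))
print(result)  # 0:00:00
```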
big fan of this :)
> Objects with a lag of a few seconds are typical for healthy systems.
> Large lag values (minutes or hours) indicate a problem.
Should we filter out objects maintained by paused clusters here? Those will be showing up with a high lag but don't indicate anything wrong.
Ah, this is mentioned below in "Filtering noise". Perhaps move that section up or reference it from here to make it clear that there exist some reasons why high lags in fact don't indicate a problem?
Kind of got rid of the Filtering noise section:
- we now return cluster name and id ... as such, people should be able to see which non-production clusters are returning
- As for having 0-replica clusters return ... this is in case they had forgotten to add back the replica. I do footnote it and add a link to the Check for no compute section.
```sql
AND s.status <> 'running';
```

> A source with status `stalled` or `starting` will hold back all downstream objects.
I'm not sure if that's still the case, or if I'm mis-remembering, but I think it used to be that unhealthy sources were hard to identify because they'd move through the stalled/starting statuses very quickly and usually showed up as "running" even when they were restart-looping. We might need to recommend looking at the status history too.
- I checked the source status history for a source and it did just keep transitioning between starting/running
- but, the source itself and its tables (using new syntax ... so its subsources in the old syntax) were green, so I removed the blurb about checking `mz_internal.mz_source_status_history` because not sure it told me anything.
- Is it that for the `mz_internal.mz_source_status_history` query, we want to specify error/details is not null or something (in order to be actionable)? Because if not particularly actionable, there's limited value?
(I'm running a small cluster with highly inefficient mat views ... to test some of the materialization queries):
```sql
SELECT s.id, s.name, ssh.status, ssh.occurred_at, ssh.error, ssh.details
FROM mz_internal.mz_source_status_history ssh
JOIN mz_catalog.mz_sources s ON ssh.source_id = s.id
WHERE s.name = 'pg_source1'
ORDER BY ssh.occurred_at DESC;
```
| id | name | status | occurred_at | error | details |
| -- | ---------- | -------- | -------------------------- | ----- | ------- |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.52+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.52+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.494+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.494+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.465+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.465+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.431+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.431+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.4+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.4+00 | null | null |
| u3 | pg_source1 | running | 2026-03-24 14:49:08.361+00 | null | null |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.361+00 | null | null |
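The flapping visible in the history above can be surfaced programmatically even though the latest status is `running`. A minimal sketch (hypothetical rows mirroring the output) counts `starting` transitions within a short window:

```python
from datetime import datetime, timedelta

# Rows shaped like the mz_source_status_history output above: (status, occurred_at).
rows = [
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 520000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 520000)),
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 494000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 494000)),
    ("running",  datetime(2026, 3, 24, 14, 49, 8, 465000)),
    ("starting", datetime(2026, 3, 24, 14, 49, 8, 465000)),
]

# Count 'starting' transitions inside a short trailing window; a healthy
# source should start once, not many times per minute.
window = timedelta(minutes=1)
latest = max(t for _, t in rows)
restarts = sum(1 for s, t in rows if s == "starting" and latest - t <= window)
print(restarts)  # 3
```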
> ## Investigating historical spikes
>
> Materialize retains wallclock lag history for up to 30 days in [`mz_internal.mz_wallclock_global_lag_history`](/sql/system-catalog/mz_internal/#mz_wallclock_global_lag_history), binned by minute.
As mentioned above: It's "at least 30 days".
> **Resolution**: Scale the cluster up, or move expensive workloads to a separate cluster.
>
> ### Expensive materialized view
Why is this specific to materialized views? The same is true for indexes, no?
> ### OOM crash loop
>
> **Symptoms**: An object shows persistent lag that fluctuates. Historical lag data for the object has gaps.
How does a cluster crash loop lead to gaps in the lag data? That data is collected by the controller and shouldn't be affected by clusters crashing.
Just added a patch w. minor reorg as a starting point for myself. Once I'm back in NY with a big monitor and printer, I can absorb the content a bit better and patch it with better organization and tweaks here and there. (heh ... I'll also check that my copy+pasting to move things didn't accidentally clobber anything ... am so dependent on a monitor 😄)
> A source that is restart-looping may briefly show `running` between restarts; check `mz_internal.mz_source_status_history` for repeated transitions to confirm.
> For PostgreSQL sources, the subsources share replication state with the parent source; if one subsource lags, all subsources of that source typically lag together.
>
> Check the frontier of a specific source against wall-clock time:
FYI: The previous query above ... it returns the status. So ...
- What purpose does this query serve?
- Also, in here we query the `mz_internal.mz_frontiers` ... in the above check wallclock lag, we use `mz_wallclock_global_lag` ... What's the diff other than we need to do the subtraction ourselves?
I'm going to post up my next patch (which just focuses on this section) where I remove it for now.
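For reference, the arithmetic of the manual `mz_frontiers`-style calculation is small; a sketch using an assumed fixed wall-clock instant and a hypothetical frontier value:

```python
from datetime import datetime, timedelta, timezone

# Fixed "wall-clock" instant for a deterministic example (assumed value).
now = datetime(2026, 3, 24, 14, 50, 0, tzinfo=timezone.utc)

# A hypothetical write_frontier 5 seconds behind, stored the way
# Materialize surfaces it: milliseconds since the Unix epoch, as text.
write_frontier = str(int((now - timedelta(seconds=5)).timestamp() * 1000))

# The manual calculation done against mz_frontiers:
frontier_time = datetime.fromtimestamp(int(write_frontier) / 1000, tz=timezone.utc)
behind_wallclock = now - frontier_time
print(behind_wallclock)  # 0:00:05
```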
> restarts; check `mz_internal.mz_source_status_history` for repeated transitions to confirm.
> {{< /tip >}}
Per comment, removed the following query from this section:
```sql
SELECT
    o.name,
    to_timestamp(f.write_frontier::text::double / 1000) AS frontier_time,
    now() - to_timestamp(f.write_frontier::text::double / 1000) AS behind_wallclock
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE o.id = '<source_id>';
```
- Since the previous query returns the status and error.
- Also, unclear why we check `mz_internal.mz_frontiers` and do the calc versus at the check wallclock lag section where we check `mz_wallclock_global_lag`.
> | **`local_lag` is low but `global_lag` is high** | An upstream dependency is the bottleneck. Look at `slowest_global_input` to identify the root cause. | See [Computation bottleneck](#computation-bottleneck). |
> | **`local_lag` = 0, `global_lag` = 0, wallclock lag is high** | The root source is behind. The entire pipeline is caught up relative to its inputs, but the inputs themselves lag behind wall-clock time. | See [Source ingestion bottleneck](#source-ingestion-bottleneck). |
>
> ## Source ingestion bottleneck
kay-kim left a comment

Just leaving some comments as am about to upload the next patch.
```sql
o.name,
o.type,
ml.local_lag,
ml.global_lag
```
In the next patch, will remove ml.global_lag since we don't use it in this section.
> A cluster that repeatedly runs out of memory will have its replica crash and restart. Each restart triggers rehydration, during which no progress is made, causing recurring freshness degradation.
>
> Check the current replica status:
Just curious ... this would only show something if the replica has already restarted? ... I'm wondering if the next query, which checks the status history, is possibly the one-and-done one. Left both in the next patch ... but something to think over. (Also, am realizing I probably should be making these comments on the patch instead of the overall files changed ... but, it's a "shutting the barn door after the horses are out" kind of thing now.)
> The time between restarts indicates the severity: a replica that OOMs every few minutes is fundamentally too small for its workload.
>
> To see the full lifecycle of replicas, including how often new ones are created:
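As a rough illustration of judging severity from restart cadence, a sketch with hypothetical restart timestamps (e.g., pulled from replica status history):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps of replica restarts.
restart_times = [
    datetime(2026, 3, 24, 14, 0),
    datetime(2026, 3, 24, 14, 7),
    datetime(2026, 3, 24, 14, 13),
    datetime(2026, 3, 24, 14, 20),
]

# Gaps between consecutive restarts; a replica restarting every few
# minutes is chronically under-provisioned for its workload.
gaps = [b - a for a, b in zip(restart_times, restart_times[1:])]
mean_gap = sum(gaps, timedelta()) / len(gaps)
print(mean_gap)  # 0:06:40
```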
Removing this part in the next patch because not sure what this is telling us in terms of troubleshooting/diagnosis.
> and restart. Each restart triggers rehydration, during which no progress is made, causing recurring freshness degradation.
>
> #### Replica is currently offline
Just curious if this current status is needed or if having people check the loop is better.
> | **Diagnosis** | PostgreSQL sources use a single replication stream for all subsources/tables. If one slows down (e.g., due to a large transaction), all subsources/tables for that source are affected. |
> | **Resolution** | Wait for the subsource/table to catch up. |
>
> ## Cluster CPU or memory pressure
> | | |
> |--|--|
> | **Symptom** | Objects on the cluster do **not** have similar `local_lag`. |
> | **Diagnosis** | The dataflow is expensive, not the cluster. |
```sql
o.type,
wl.lag,
c.name as cluster_name,
c.id as cluster_id
```
Added cluster details to plant the seed for people to prompt their agent in case people want to run for a specific cluster
```sql
FROM mz_internal.mz_materialization_lag ml
JOIN mz_catalog.mz_objects o ON ml.object_id = o.id
WHERE o.id = '<object_id>'
ORDER BY ml.global_lag DESC;
```
Is `mz_materialization_lag` doing more or less the same as this query on `mz_internal.mz_frontiers`?
```sql
SELECT
    o.id, o.name, o.type,
    round(
        extract(epoch from now()) * 1000
        - f.write_frontier::text::numeric
    ) AS lag_ms
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE f.object_id LIKE 'u%'
    AND f.write_frontier IS NOT NULL
ORDER BY 4 DESC
```
```sql
(SELECT write_frontier::text::numeric
 FROM mz_internal.mz_frontiers
 WHERE object_id = '<object_id>') -- update
ORDER BY lag_ms DESC;
```
Updated to use Frank's freshness poc query.
```sql
SELECT
    o_probe.name AS object_name,
    o_prev.name AS from_name,
    o_prev.id AS from_id,
```
added the from_id ... so that it's easier to iterate on these objects.
> | `stalled` | Common causes include network partitions, credential expiration, and upstream database restarts. Check the returned `error` field and address appropriately. Once the source reconnects, downstream objects should catch up automatically. |
> | `paused` | The cluster associated with the source has no compute/replica assigned (`replication_factor = 0`). See [Check for no compute](#check-for-no-compute). |
> | `starting` | Wait for the source to transition to running. Downstream objects should catch up automatically. |
As mentioned, removed for now the mention of checking the source status ... So ... if the stalled status is difficult to find ... we will need some other actionable thing to diagnose.
I stopped here for this 3rd patch. Will do the remainder in the next patch.
Add a step-by-step guide for diagnosing freshness problems in Materialize, covering real-time diagnosis (wallclock lag, materialization lag, source health, cluster health, dependency graph attribution) and historical spike analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include mz_cluster_replica_statuses, mz_cluster_replica_status_history, and mz_cluster_replica_history queries for detecting OOM crash loops that cause recurring freshness degradation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds guidance on measuring P99.999 freshness across a deployment, including threshold-based queries that work around Materialize SQL limitations (no WITHIN GROUP, no sum(interval)). Documents common noise sources (paused sources, zero-replica clusters, static data, non-production clusters) and adds two new common patterns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
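The threshold-based workaround mentioned here can be sketched outside SQL; with hypothetical lag samples, the idea is to count the fraction of samples under a candidate bound rather than computing a percentile directly (since `WITHIN GROUP` isn't available):

```python
# Hypothetical per-object wallclock lag samples, in milliseconds.
lags_ms = [120, 250, 300, 450, 900, 1200, 45000]

def fraction_under(threshold_ms):
    # What fraction of samples meets the candidate freshness bound?
    return sum(1 for lag in lags_ms if lag <= threshold_ms) / len(lags_ms)

print(round(fraction_under(2000), 3))  # 0.857
```

Sweeping the threshold until the fraction reaches the target (e.g., 0.99999) approximates the percentile without a percentile aggregate.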
Adds catalog documentation for the two previously undocumented views:
- `mz_wallclock_global_lag_history`: minute-binned, 30-day retention, aggregated across replicas (min lag per object per minute)
- `mz_wallclock_global_lag_recent_history`: filtered to last 24 hours

Also fixes review feedback in the freshness runbook:
- Replace `!=` with `<>` for psql compatibility
- Add `o.id` to GROUP BY in aggregate query to prevent name collisions
- Fix paused source resolution (no ALTER SOURCE resume exists)
- Clarify P99.999 claim as per-object

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
Add new sections for distinguishing source-driven vs. computation-driven spikes, correlating spikes with DDL events via audit log, steady-state freshness analysis by excluding identified time windows, and a deploy-related freshness degradation pattern. Fix various issues flagged in review: make cause list non-exhaustive, rename Step 2, convert interpretation to table with next-step links, fix edge_delay computation in dependency graph query, remove incorrect claims about static data lag and OOM lag data gaps, fix retention wording, generalize "expensive MV" to "expensive dataflow", and correct advice for unpausing sources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
184e867 to d236653


https://preview.materialize.com/materialize/35319/transform-data/freshness-troubleshooting/
Step-by-step guide for diagnosing freshness problems, covering:
🤖 Generated with Claude Code