
Add freshness troubleshooting runbook #35319

Merged
kay-kim merged 13 commits into MaterializeInc:main from antiguru:freshness_runbook
Mar 31, 2026

Conversation

@antiguru
Member

@antiguru antiguru commented Mar 4, 2026

https://preview.materialize.com/materialize/35319/transform-data/freshness-troubleshooting/

Step-by-step guide for diagnosing freshness problems, covering:

  • Real-time diagnosis: wallclock lag, materialization lag, source health, cluster health, dependency graph attribution, sink lag
  • Historical spike analysis: finding spikes, determining scope (single object vs cluster vs system-wide), minute-by-minute inspection
  • Common patterns: disconnected sources, overloaded clusters, expensive MVs, system-wide spikes, correlated subsource lag

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: `compute:`, `storage:`, `adapter:`, `sql:`

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

Contributor

@kay-kim kay-kim left a comment


The docs look great. I left some questions/suggestions ... If amenable, I can add a patch to the page (once I know the answers to the questions as well as the suggestions are not off-the-mark).

o_next.type AS to_type,
greatest(
to_timestamp(fn.write_frontier::text::double / 1000)
- to_timestamp(fp.write_frontier::text::double / 1000),
Contributor


So ... the where clause has:

  AND fn.write_frontier <= fp.write_frontier
  AND fp.write_frontier::text::numeric > fn.write_frontier::text::numeric

Would this greatest ( fn - fp, interval '0') always return 0?

@maheshwarip
Contributor

big fan of this :)

@antiguru antiguru marked this pull request as ready for review March 4, 2026 20:08
@antiguru antiguru requested review from a team as code owners March 4, 2026 20:08
@antiguru antiguru requested a review from aljoscha March 4, 2026 20:08
```

Objects with a lag of a few seconds are typical for healthy systems.
Large lag values (minutes or hours) indicate a problem.
Contributor


Should we filter out objects maintained by paused clusters here? Those will be showing up with a high lag but don't indicate anything wrong.
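One way to do that filtering (a sketch only — the join through `mz_objects.cluster_id` and the `mz_clusters.replication_factor` column are my assumptions, not taken from the page):

```sql
-- Sketch: wallclock lag, excluding objects on paused (zero-replica) clusters.
-- Assumes mz_objects exposes cluster_id and mz_clusters exposes replication_factor.
SELECT o.name, o.type, wl.lag
FROM mz_internal.mz_wallclock_global_lag wl
JOIN mz_catalog.mz_objects o ON wl.object_id = o.id
JOIN mz_catalog.mz_clusters c ON o.cluster_id = c.id
WHERE c.replication_factor > 0
ORDER BY wl.lag DESC;
```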

Contributor


Ah, this is mentioned below in "Filtering noise". Perhaps move that section up or reference it from here to make it clear that there exist some reasons why high lags in fact don't indicate a problem?

Contributor


Kind of got rid of the Filtering noise section:

  • We now return cluster name and id ... as such, people should be able to see when results are coming from non-production clusters.
  • As for having 0-replica clusters return ... this is in case they had forgotten to add back the replica. I do footnote it and add a link to the Check for no compute section.

AND s.status <> 'running';
```

A source with status `stalled` or `starting` will hold back all downstream objects.
Contributor


I'm not sure if that's still the case, or if I'm mis-remembering, but I think it used to be that unhealthy sources were hard to identify because they'd move through the stalled/starting statuses very quickly and usually showed up as "running" even when they were restart-looping. We might need to recommend looking at the status history too.

Contributor


  • I checked the source status history for a source and it did just keep transitioning between starting/running
  • but, the source itself and its tables (using new syntax ... so its subsources in the old syntax) were green, so I removed the blurb about checking the mz_internal.mz_source_status_history because not sure it told me anything.
    • Is it that for the mz_internal.mz_source_status_history query, we want to specify error/details is not null or something (in order to be actionable?). Because if not particularly actionable, there's limited value?

(I'm running a small cluster with highly inefficient mat views ... to test some of the materialization queries):

```sql
SELECT s.id, s.name, ssh.status, ssh.occurred_at, ssh.error, ssh.details
FROM mz_internal.mz_source_status_history ssh
JOIN mz_catalog.mz_sources s ON ssh.source_id = s.id
WHERE s.name = 'pg_source1'
ORDER BY ssh.occurred_at DESC;
```

| id | name       | status   | occurred_at                | error | details |
| -- | ---------- | -------- | -------------------------- | ----- | ------- |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.52+00  | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.52+00  | null  | null    |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.494+00 | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.494+00 | null  | null    |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.465+00 | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.465+00 | null  | null    |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.431+00 | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.431+00 | null  | null    |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.4+00   | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.4+00   | null  | null    |
| u3 | pg_source1 | running  | 2026-03-24 14:49:08.361+00 | null  | null    |
| u3 | pg_source1 | starting | 2026-03-24 14:49:08.361+00 | null  | null    |
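One way to make that history actionable when `error`/`details` are null (a sketch, reusing the columns from the query above): count transitions per minute. A healthy source shows a handful of transitions total, while a restart-looping one shows many per minute.

```sql
-- Sketch: transitions per minute; a high count suggests a restart loop
-- even when the latest status reads 'running'.
SELECT date_trunc('minute', ssh.occurred_at) AS minute,
       count(*) AS transitions
FROM mz_internal.mz_source_status_history ssh
JOIN mz_catalog.mz_sources s ON ssh.source_id = s.id
WHERE s.name = 'pg_source1'
GROUP BY 1
ORDER BY 1 DESC;
```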


## Investigating historical spikes

Materialize retains wallclock lag history for up to 30 days in [`mz_internal.mz_wallclock_global_lag_history`](/sql/system-catalog/mz_internal/#mz_wallclock_global_lag_history), binned by minute.
Contributor


As mentioned above: It's "at least 30 days".


**Resolution**: Scale the cluster up, or move expensive workloads to a separate cluster.

### Expensive materialized view
Contributor


Why is this specific to materialized views? The same is true for indexes, no?

### OOM crash loop

**Symptoms**: An object shows persistent lag that fluctuates.
Historical lag data for the object has gaps.
Contributor


How does a cluster crash loop lead to gaps in the lag data? That data is collected by the controller and shouldn't be affected by clusters crashing.

@kay-kim
Contributor

kay-kim commented Mar 12, 2026

Just added a patch w. minor reorg as a starting point for myself. Once I'm back in NY with a big monitor and printer, I can absorb the content a bit better and patch it with better organization and tweaks here and there. (heh... I'll also check that my copy+pasting to move things didn't accidentally clobber anything ... am so dependent on a monitor 😄 )

A source that is restart-looping may briefly show `running` between restarts; check `mz_internal.mz_source_status_history` for repeated transitions to confirm.
For PostgreSQL sources, the subsources share replication state with the parent source; if one subsource lags, all subsources of that source typically lag together.

Check the frontier of a specific source against wall-clock time:
Contributor


FYI: The previous query above ... it returns the status. So ...

  • What purpose does this query serve?
  • Also, in here we query the mz_frontiers ... in the above check wallclock lag, we use mz_wallclock_global_lag ... What's the diff other than we need to do the subtraction ourselves?

I'm going to post up my next patch (which just focuses on this section) where I remove it for now.

restarts; check `mz_internal.mz_source_status_history` for repeated transitions
to confirm.
{{< /tip >}}

Contributor

@kay-kim kay-kim Mar 16, 2026


Per comment, removed the following query from this section:

```sql
SELECT
    o.name,
    to_timestamp(f.write_frontier::text::double / 1000) AS frontier_time,
    now() - to_timestamp(f.write_frontier::text::double / 1000) AS behind_wallclock
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE o.id = '<source_id>';
```
  • Since the previous query returns the status and error.
  • Also, unclear why we check mz_frontiers and do the calc versus at the check wallclock lag section where we check mz_wallclock_global_lag.

| **`local_lag` is low but `global_lag` is high** | An upstream dependency is the bottleneck. Look at `slowest_global_input` to identify the root cause. | See [Computation bottleneck](#computation-bottleneck). |
| **`local_lag` = 0, `global_lag` = 0, wallclock lag is high** | The root source is behind. The entire pipeline is caught up relative to its inputs, but the inputs themselves lag behind wall-clock time. | See [Source ingestion bottleneck](#source-ingestion-bottleneck). |
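To see which row of that table applies to a given object, a single lookup suffices (a sketch; the `slowest_local_input_id`/`slowest_global_input_id` column names are my assumption based on the `slowest_global_input` mention above):

```sql
-- Sketch: local vs. global lag for one object, plus its slowest inputs,
-- to decide between the computation-bottleneck and source-bottleneck paths.
SELECT o.name, ml.local_lag, ml.global_lag,
       ml.slowest_local_input_id, ml.slowest_global_input_id
FROM mz_internal.mz_materialization_lag ml
JOIN mz_catalog.mz_objects o ON ml.object_id = o.id
WHERE o.id = '<object_id>';
```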

## Source ingestion bottleneck

Contributor

@kay-kim kay-kim left a comment


Just leaving some comments as am about to upload the next patch.

o.name,
o.type,
ml.local_lag,
ml.global_lag
Contributor


In the next patch, will remove ml.global_lag since we don't use it in this section.

A cluster that repeatedly runs out of memory will have its replica crash and restart.
Each restart triggers rehydration, during which no progress is made, causing recurring freshness degradation.

Check the current replica status:
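The check being discussed might look like this (a sketch only — the `mz_internal.mz_cluster_replica_statuses` column names here are my assumption, since the actual query is not shown in this excerpt):

```sql
-- Sketch: replica processes that are not currently online,
-- with the reason reported by the controller.
SELECT r.name AS replica, s.process_id, s.status, s.reason, s.updated_at
FROM mz_internal.mz_cluster_replica_statuses s
JOIN mz_catalog.mz_cluster_replicas r ON s.replica_id = r.id
WHERE s.status <> 'online';
```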
Contributor


Just curious ... this would only show something if the replica hasn't restarted yet? ... I'm wondering if the next query, which checks the status history, is possibly the one-and-done one. Left both in the next patch ... but something to think over. (Also, am realizing I probably should be making these comments on the patch instead of the overall files changed ... but it's a "shutting the barn door after the horses are out" kind of thing now.)

The time between restarts indicates the severity: a replica that OOMs every few minutes is fundamentally too small for its workload.

To see the full lifecycle of replicas, including how often new ones are created:

Contributor


Removing this part in the next patch because I'm not sure what this is telling us in terms of troubleshooting/diagnosis.

and restart. Each restart triggers rehydration, during which no progress is
made, causing recurring freshness degradation.

#### Replica is currently offline
Contributor

@kay-kim kay-kim Mar 17, 2026


Just curious if this current status is needed or if having people check the loop is better.

Contributor

@kay-kim kay-kim Mar 17, 2026


The reason I ask is what are the chances people will catch it when it's currently offline instead of when it has restarted; e.g.,

(screenshot attached)

| **Diagnosis** | PostgreSQL sources use a single replication stream for all subsources/tables. If one slows down (e.g., due to a large transaction), all subsources/tables for that source are affected. |
| **Resolution** | Wait for the subsource/table to catch up. |

## Cluster CPU or memory pressure

| | |
|--|--|
| **Symptom** | Objects on the cluster do **not** have similar `local_lag`. |
| **Diagnosis** | The dataflow is expensive, not the cluster. |
Contributor

@kay-kim kay-kim Mar 17, 2026


Am going to tweak this since if an object has a large local lag (and it's the only one) ... but it's so large that it is causing the cluster to OOM ... technically, yes ... the dataflow is expensive ... but ...

(screenshot attached)

o.type,
wl.lag,
c.name as cluster_name,
c.id as cluster_id
Contributor


Added cluster details to plant the seed for people to prompt their agent, in case they want to run the query for a specific cluster.

FROM mz_internal.mz_materialization_lag ml
JOIN mz_catalog.mz_objects o ON ml.object_id = o.id
WHERE o.id = '<object_id>'
ORDER BY ml.global_lag DESC;
Contributor


Is mz_materialization_lag doing more or less the same thing as this query on mz_internal.mz_frontiers?

```sql
SELECT
    o.id, o.name, o.type,
    round(
      extract(epoch from now()) * 1000
      - f.write_frontier::text::numeric
    ) AS lag_ms
FROM mz_internal.mz_frontiers f
JOIN mz_catalog.mz_objects o ON f.object_id = o.id
WHERE f.object_id LIKE 'u%'
  AND f.write_frontier IS NOT NULL
ORDER BY 4 DESC;
```

(SELECT write_frontier::text::numeric
FROM mz_internal.mz_frontiers
WHERE object_id = '<object_id>') -- update
ORDER BY lag_ms DESC;
Contributor

@kay-kim kay-kim Mar 24, 2026


Updated to use Frank's freshness poc query.

SELECT
o_probe.name AS object_name,
o_prev.name AS from_name,
o_prev.id AS from_id,
Contributor


added the from_id ... so that it's easier to iterate on these objects.

| `stalled`| Common causes include network partitions, credential expiration, and upstream database restarts. Check the returned `error` field and address appropriately. Once the source reconnects, downstream objects should catch up automatically. |
| `paused` | The cluster associated with the source has no compute/replica assigned (`replication_factor = 0`). See [Check for no compute](#check-for-no-compute). |
| `starting` | Wait for the source to transition to running. Downstream objects should catch up automatically. |
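A quick way to find sources in any of these states (a sketch; I'm assuming the `mz_internal.mz_source_statuses` view with `id`, `name`, `status`, and `error` columns, consistent with the `s.status <> 'running'` filter quoted earlier):

```sql
-- Sketch: surface sources that are not running, with their latest error,
-- so each can be matched against the table of statuses above.
SELECT id, name, status, error
FROM mz_internal.mz_source_statuses
WHERE status <> 'running';
```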

Contributor


As mentioned, removed for now the mention of checking the source status ... So ... if the stalled state is difficult to catch ... we will need some other actionable way to diagnose it.

Contributor


I stopped here for this 3rd patch. Will do the remainder in the next patch.

antiguru and others added 8 commits March 25, 2026 16:36
Add a step-by-step guide for diagnosing freshness problems in
Materialize, covering real-time diagnosis (wallclock lag, materialization
lag, source health, cluster health, dependency graph attribution) and
historical spike analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include mz_cluster_replica_statuses, mz_cluster_replica_status_history,
and mz_cluster_replica_history queries for detecting OOM crash loops
that cause recurring freshness degradation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds guidance on measuring P99.999 freshness across a deployment,
including threshold-based queries that work around Materialize SQL
limitations (no WITHIN GROUP, no sum(interval)). Documents common
noise sources (paused sources, zero-replica clusters, static data,
non-production clusters) and adds two new common patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds catalog documentation for the two previously undocumented views:
- `mz_wallclock_global_lag_history`: minute-binned, 30-day retention,
  aggregated across replicas (min lag per object per minute)
- `mz_wallclock_global_lag_recent_history`: filtered to last 24 hours

Also fixes review feedback in the freshness runbook:
- Replace `!=` with `<>` for psql compatibility
- Add `o.id` to GROUP BY in aggregate query to prevent name collisions
- Fix paused source resolution (no ALTER SOURCE resume exists)
- Clarify P99.999 claim as per-object

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
Add new sections for distinguishing source-driven vs. computation-driven
spikes, correlating spikes with DDL events via audit log, steady-state
freshness analysis by excluding identified time windows, and a
deploy-related freshness degradation pattern.

Fix various issues flagged in review: make cause list non-exhaustive,
rename Step 2, convert interpretation to table with next-step links,
fix edge_delay computation in dependency graph query, remove incorrect
claims about static data lag and OOM lag data gaps, fix retention
wording, generalize "expensive MV" to "expensive dataflow", and correct
advice for unpausing sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kay-kim kay-kim force-pushed the freshness_runbook branch from 184e867 to d236653 on March 25, 2026 20:36
@kay-kim kay-kim merged commit 22c58f7 into MaterializeInc:main Mar 31, 2026
119 checks passed
@antiguru antiguru deleted the freshness_runbook branch March 31, 2026 15:14