compute: diagnostic log when dropping per-replica read holds by teskje · Pull Request #35937 · MaterializeInc/materialize

teskje · 2026-04-10T16:41:18Z

When a replica is removed, we drop its per-replica input read holds. If the corresponding global read holds have already been released/downgraded (e.g. because the dataflow was dropped by a user), the per-replica holds are the last line of defense against compaction. The dataflow might still be in flight at the replica, and dropping its input read holds can allow the storage inputs to compact past a its as_of, causing the replica to panic when it tries to render the dataflow.

We believe that this is the cause for "cannot serve requested as_of" panics observed in the wild. This commit adds a warning log to help validate that theory.

Motivation

Part of diagnosing https://github.com/MaterializeInc/incidents-and-escalations/issues/39

When a replica is removed, we drop its per-replica input read holds. If the corresponding global read holds have already been released/downgraded (e.g. because the dataflow was dropped by a user), the per-replica holds are the last line of defense against compaction. The dataflow might still be in flight at the replica, and dropping its input read holds can allow the storage inputs to compact past a its as_of, causing the replica to panic when it tries to render the dataflow. We believe that this is the cause for "cannot serve requested as_of" panics observed in the wild. This commit adds a warning log to help validate that theory.

github-actions · 2026-04-10T16:41:27Z

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

The PR title is descriptive and will make sense in the git log.
This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

teskje · 2026-04-13T12:57:46Z

TFTR!

Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping per-replica read hold without equivalent global read hold" WARN added in PR #35937) into a pure helper `find_unprotected_replica_holds`, and add four unit tests that exercise the hold-asymmetry condition tracked under incidents-and-escalations#39. The tests are the first deterministic specification of the bug-class shape and pin down the regression contract for the eventual fix. Also extend the CLU-95 repro harness with two new workflows targeting the build 1248 manifestation more directly: * `cancelled-peek-reconnect` — slow-path SELECT (via mz_unsafe.mz_sleep) on an unmanaged cluster pinned to a standalone Clusterd, cancelled mid-render, then clusterd force-killed to provoke reconnect. * `replica-removal-under-load` — writer cluster MV + concurrent dataflow churn on a separate compute cluster, then DROP CLUSTER REPLICA on the compute side to drive `Instance::remove_replica` under load. Both workflows accumulate perturbations under one long-lived envd and then do a single ungraceful restart, mirroring the workload-replay sanity_restart sequence from build 1248. Neither reproduces the bootstrap panic over 30/40 iterations, but the harness now scans for the diagnostic WARN as a secondary signal. CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log findings, rule out the leased-expiry framing for that build, and lay out the three-pronged fix direction: upstream hold-accounting fix (#39), bootstrap report-don't-panic safety net (the CLU-95-specific recovery), and render-time report-don't-panic (the moral successor to the now-canceled CLU-34). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

teskje requested a review from a team as a code owner April 10, 2026 16:41

teskje force-pushed the read-hold-diagnostic-log branch from fd243c3 to d730b18 Compare April 10, 2026 16:41

teskje requested a review from ggevay April 13, 2026 08:05

ggevay approved these changes Apr 13, 2026

View reviewed changes

teskje merged commit 831ebab into MaterializeInc:main Apr 13, 2026
119 checks passed

teskje deleted the read-hold-diagnostic-log branch April 13, 2026 12:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compute: diagnostic log when dropping per-replica read holds#35937

compute: diagnostic log when dropping per-replica read holds#35937
teskje merged 1 commit into
MaterializeInc:mainfrom
teskje:read-hold-diagnostic-log

teskje commented Apr 10, 2026

Uh oh!

github-actions Bot commented Apr 10, 2026

Uh oh!

teskje commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teskje commented Apr 10, 2026

Motivation

Uh oh!

github-actions Bot commented Apr 10, 2026

PR title guidelines

Pre-merge checklist

Uh oh!

teskje commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants