Skip to content

compute: diagnostic log when dropping per-replica read holds#35937

Merged
teskje merged 1 commit into
MaterializeInc:mainfrom
teskje:read-hold-diagnostic-log
Apr 13, 2026
Merged

compute: diagnostic log when dropping per-replica read holds#35937
teskje merged 1 commit into
MaterializeInc:mainfrom
teskje:read-hold-diagnostic-log

Conversation

@teskje
Copy link
Copy Markdown
Contributor

@teskje teskje commented Apr 10, 2026

When a replica is removed, we drop its per-replica input read holds. If the corresponding global read holds have already been released/downgraded (e.g. because the dataflow was dropped by a user), the per-replica holds are the last line of defense against compaction. The dataflow might still be in flight at the replica, and dropping its input read holds can allow the storage inputs to compact past a its as_of, causing the replica to panic when it tries to render the dataflow.

We believe that this is the cause for "cannot serve requested as_of" panics observed in the wild. This commit adds a warning log to help validate that theory.

Motivation

Part of diagnosing https://github.com/MaterializeInc/incidents-and-escalations/issues/39

When a replica is removed, we drop its per-replica input read holds. If
the corresponding global read holds have already been
released/downgraded (e.g. because the dataflow was dropped by a user),
the per-replica holds are the last line of defense against compaction.
The dataflow might still be in flight at the replica, and dropping its
input read holds can allow the storage inputs to compact past a its
as_of, causing the replica to panic when it tries to render the
dataflow.

We believe that this is the cause for "cannot serve requested as_of"
panics observed in the wild. This commit adds a warning log to help
validate that theory.
@teskje teskje requested a review from a team as a code owner April 10, 2026 16:41
@teskje teskje force-pushed the read-hold-diagnostic-log branch from fd243c3 to d730b18 Compare April 10, 2026 16:41
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@teskje teskje requested a review from ggevay April 13, 2026 08:05
@teskje
Copy link
Copy Markdown
Contributor Author

teskje commented Apr 13, 2026

TFTR!

@teskje teskje merged commit 831ebab into MaterializeInc:main Apr 13, 2026
119 checks passed
@teskje teskje deleted the read-hold-diagnostic-log branch April 13, 2026 12:57
antiguru added a commit that referenced this pull request Jun 1, 2026
Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping
per-replica read hold without equivalent global read hold" WARN added in
PR #35937) into a pure helper `find_unprotected_replica_holds`, and add
four unit tests that exercise the hold-asymmetry condition tracked under
incidents-and-escalations#39. The tests are the first deterministic
specification of the bug-class shape and pin down the regression contract
for the eventual fix.

Also extend the CLU-95 repro harness with two new workflows targeting
the build 1248 manifestation more directly:

* `cancelled-peek-reconnect` — slow-path SELECT (via mz_unsafe.mz_sleep)
  on an unmanaged cluster pinned to a standalone Clusterd, cancelled
  mid-render, then clusterd force-killed to provoke reconnect.
* `replica-removal-under-load` — writer cluster MV + concurrent
  dataflow churn on a separate compute cluster, then DROP CLUSTER
  REPLICA on the compute side to drive `Instance::remove_replica`
  under load.

Both workflows accumulate perturbations under one long-lived envd and
then do a single ungraceful restart, mirroring the workload-replay
sanity_restart sequence from build 1248. Neither reproduces the
bootstrap panic over 30/40 iterations, but the harness now scans for
the diagnostic WARN as a secondary signal.

CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log
findings, rule out the leased-expiry framing for that build, and lay out
the three-pronged fix direction: upstream hold-accounting fix (#39),
bootstrap report-don't-panic safety net (the CLU-95-specific recovery),
and render-time report-don't-panic (the moral successor to the now-canceled
CLU-34).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants