compute: diagnostic log when dropping per-replica read holds #35937
Merged
When a replica is removed, we drop its per-replica input read holds. If the corresponding global read holds have already been released or downgraded (e.g. because the dataflow was dropped by a user), the per-replica holds are the last line of defense against compaction. The dataflow might still be in flight at the replica, and dropping its input read holds can allow the storage inputs to compact past its as_of, causing the replica to panic when it tries to render the dataflow.

We believe that this is the cause of the "cannot serve requested as_of" panics observed in the wild. This commit adds a warning log to help validate that theory.
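To make the hazard concrete, here is a minimal sketch of the hold accounting described above. It is not the actual controller code: `InputHolds`, `max_since`, and `can_serve` are hypothetical names, and timestamps are plain `u64` values rather than real frontiers.

```rust
use std::collections::BTreeMap;

/// Simplified timestamp. Real frontiers are antichains, not single numbers.
type Time = u64;

/// Hypothetical model of the read holds placed on one storage input,
/// keyed by a label naming the owner of each hold.
#[derive(Default)]
struct InputHolds {
    holds: BTreeMap<String, Time>,
}

impl InputHolds {
    fn acquire(&mut self, owner: &str, time: Time) {
        self.holds.insert(owner.to_string(), time);
    }

    fn release(&mut self, owner: &str) {
        self.holds.remove(owner);
    }

    /// The furthest the input's `since` may advance: the minimum held time,
    /// or the write frontier `upper` when no holds remain.
    fn max_since(&self, upper: Time) -> Time {
        self.holds.values().copied().min().unwrap_or(upper)
    }
}

/// A dataflow can only be rendered if its input has not compacted past `as_of`.
fn can_serve(as_of: Time, since: Time) -> bool {
    since <= as_of
}

fn main() {
    let (as_of, upper) = (10, 100);
    let mut input = InputHolds::default();
    input.acquire("global", as_of);
    input.acquire("replica r1", as_of);

    // The user drops the dataflow: the global hold is released, but the
    // per-replica hold still pins the input at `as_of`.
    input.release("global");
    println!("after dataflow drop: servable = {}", can_serve(as_of, input.max_since(upper)));

    // The replica is removed: nothing holds the input back any more, so
    // compaction may advance up to `upper`.
    input.release("replica r1");
    println!("after replica removal: servable = {}", can_serve(as_of, input.max_since(upper)));
}
```

In this model the dataflow stays servable after the global hold is released and stops being servable only once the per-replica hold is dropped as well, which is the window the new warning is meant to surface.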
fd243c3 to d730b18
ggevay approved these changes on Apr 13, 2026
Author: TFTR!
antiguru added a commit that referenced this pull request on Jun 1, 2026
Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping per-replica read hold without equivalent global read hold" WARN added in PR #35937) into a pure helper `find_unprotected_replica_holds`, and add four unit tests that exercise the hold-asymmetry condition tracked under incidents-and-escalations#39. The tests are the first deterministic specification of the bug-class shape and pin down the regression contract for the eventual fix.

Also extend the CLU-95 repro harness with two new workflows targeting the build 1248 manifestation more directly:

* `cancelled-peek-reconnect`: slow-path SELECT (via mz_unsafe.mz_sleep) on an unmanaged cluster pinned to a standalone Clusterd, cancelled mid-render, then clusterd force-killed to provoke reconnect.
* `replica-removal-under-load`: writer cluster MV + concurrent dataflow churn on a separate compute cluster, then DROP CLUSTER REPLICA on the compute side to drive `Instance::remove_replica` under load.

Both workflows accumulate perturbations under one long-lived envd and then do a single ungraceful restart, mirroring the workload-replay sanity_restart sequence from build 1248. Neither reproduces the bootstrap panic over 30/40 iterations, but the harness now scans for the diagnostic WARN as a secondary signal.

CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log findings, rule out the leased-expiry framing for that build, and lay out the three-pronged fix direction: upstream hold-accounting fix (#39), bootstrap report-don't-panic safety net (the CLU-95-specific recovery), and render-time report-don't-panic (the moral successor to the now-canceled CLU-34).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
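For readers without the diff at hand, a pure helper of this kind might look roughly like the sketch below. Only the name `find_unprotected_replica_holds` comes from the commit message. The signature, the string collection IDs, the `u64` timestamps, and the `eprintln!` stand-in for the WARN are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Simplified timestamp (assumption: the real code works with frontiers).
type Time = u64;

/// Returns every per-replica hold that no global hold covers. A global hold
/// covers a replica hold when it exists for the same input and sits at a
/// time less than or equal to the replica hold's time.
/// The signature is hypothetical, not the one used in the commit.
fn find_unprotected_replica_holds(
    replica_holds: &BTreeMap<String, Time>,
    global_holds: &BTreeMap<String, Time>,
) -> Vec<(String, Time)> {
    replica_holds
        .iter()
        .filter(|(id, replica_time)| match global_holds.get(*id) {
            // The global hold was downgraded past the replica hold.
            Some(global_time) => global_time > *replica_time,
            // The global hold was released entirely.
            None => true,
        })
        .map(|(id, time)| (id.clone(), *time))
        .collect()
}

fn main() {
    let mut replica_holds = BTreeMap::new();
    replica_holds.insert("u1".to_string(), 10);
    replica_holds.insert("u2".to_string(), 10);
    replica_holds.insert("u3".to_string(), 10);

    let mut global_holds = BTreeMap::new();
    global_holds.insert("u1".to_string(), 10); // still covers u1
    global_holds.insert("u2".to_string(), 20); // downgraded past u2's hold
                                               // u3: global hold released

    // Stand-in for the diagnostic warning emitted on replica removal.
    for (id, time) in find_unprotected_replica_holds(&replica_holds, &global_holds) {
        eprintln!("dropping per-replica read hold without equivalent global read hold: {id} at {time}");
    }
}
```

Keeping the check free of side effects is what makes it unit-testable: a test can build the two maps directly and assert on the returned list, without standing up an instance or a replica.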
Motivation
Part of diagnosing https://github.com/MaterializeInc/incidents-and-escalations/issues/39