Fix "Unexpected refresh lock in keeper" exception in RefreshTask#103849
Conversation
The chassert / abortOnFailedAssertion at RefreshTask.cpp:523 was added in PR ClickHouse#103427 as a defensive guard, but the state it asserts against is reachable through normal operation:

1. Server kill+restart inside the Keeper session timeout window: the ephemeral `/running` znode created by the previous Keeper session is still observable to the new session (with our replica name in its data) until Keeper expires the old session, ~30s by default. This is the dominant cause in stress tests with `RandomQueryKiller`.

2. DETACH+ATTACH while Keeper is unreachable during shutdown: `RefreshTask::shutdown` does not proactively call `removeRunningZnodeIfMine`; it relies on the in-flight cleanup path `updateCoordinationState(running=false)`, which cannot run if Keeper is unreachable. After ATTACH the new task observes the leftover znode and aborts.

The release-build recovery branch (clear the local stale flag, call `removeRunningZnodeIfMine`, reschedule a Keeper re-read) is correct for both scenarios — `tryGet` + data match against this replica's name + `tryRemove` are session-agnostic.

This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate so the recovery runs unconditionally, and demotes the misleading `Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that explains the actual cause.

Adds an integration test that plants a stale `/running` znode through a separate Kazoo session — the same way an old session's leftover would look from a new session — and verifies `SYSTEM REFRESH VIEW` + `SYSTEM WAIT VIEW` complete without server abort.

CIDB: ~22 hits across 7 unrelated PRs over 30 days (STIDs 2508-3e24, 2508-4754).

Per @alexey-milovidov's directive on ClickHouse#103737

Session: cron:clickhouse-ci-task-worker:20260501-091500
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
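The timeout window in scenario 1 can be illustrated with a toy model. Everything below is illustrative — only the ~30s default session timeout comes from the description above; the function and variable names are not ClickHouse or Keeper APIs:

```python
# Toy model of scenario 1: an ephemeral znode stays visible to a *new*
# session until Keeper expires the *old* (dead) session. Illustrative only.

SESSION_TIMEOUT = 30.0  # default Keeper session timeout, per the description

def ephemeral_znode_visible(owner_session_died_at: float, now: float) -> bool:
    """The znode outlives its owner until the session timeout elapses."""
    return now < owner_session_died_at + SESSION_TIMEOUT

# Server killed at t=100, restarted and reconnected at t=105: the previous
# session's /running znode (tagged with this replica's name) is still there,
# so the new RefreshTask instance sees "itself" running.
assert ephemeral_znode_visible(owner_session_died_at=100.0, now=105.0)

# Once Keeper has expired the dead session (t >= 130), the znode is gone.
assert not ephemeral_znode_visible(owner_session_died_at=100.0, now=131.0)
```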
Pre-PR validation gate (per the worker procedure):

a) Deterministic repro? Yes — the new integration test plants a stale `/running` znode and reproduces the abort deterministically.

b) Root cause explained? Yes. The chassert was added in #103427 as a defensive guard, but the state it asserts against is reachable in two normal paths: (1) an old Keeper session's ephemeral `/running` znode outliving a server restart, (2) DETACH+ATTACH while Keeper is unreachable during shutdown.

c) Fix matches root cause? Yes. The release-build branch already had a correct recovery (clear the local stale flag, call `removeRunningZnodeIfMine`, reschedule a Keeper re-read); it now runs unconditionally.

d) Test intent preserved / new tests added? Yes — added an integration test for the stale-znode scenario.

e) Both directions demonstrated? Yes — confirmed locally with a debug build: the test fails without the fix and passes with it.

f) Fix is general, not a narrow patch? Yes — the fix targets the root cause (the assertion's assumption that this state is impossible) rather than guarding against the symptom. The recovery path handles every code path that can produce the state, not just the two enumerated scenarios. No defensive null/empty guards were added.

Session: cron:clickhouse-ci-task-worker:20260501-091500
cc @al13n321 @tuanpach @alexey-milovidov — could you review this? It removes the `DEBUG_OR_SANITIZER_BUILD`-only abort and makes debug/sanitizer builds follow the recovery path that release builds already use.
(I'm changing how this lock works, will likely obsolete this PR soon.)
Thanks @al13n321 — happy to defer to your redesign. A couple of options depending on your timeline: I can close this now in favor of a draft of yours, or keep it open until the redesign lands. Either way works for me — not blocking on a reply, just flagging so you have the option.
Workflow [PR], commit [0a7174c] Summary: ❌
```cpp
#ifdef DEBUG_OR_SANITIZER_BUILD
    abortOnFailedAssertion("Unexpected refresh lock in keeper");
#else
    /// The `/running` znode is ephemeral and tagged with this replica's name in
```
This change removes a DEBUG_OR_SANITIZER_BUILD-only abort and makes debug/sanitizer builds follow the same recovery path that release builds already used. Because user-visible behavior in stable builds does not change, the PR template Changelog category should be CI Fix or Improvement (or Build/Testing/Packaging Improvement) rather than Bug Fix (user-visible misbehavior in an official stable release).
Thanks @clickhouse-gh — you're right. Updated the PR description: changelog category is now CI Fix or Improvement (changelog entry is not required) and the changelog entry is removed.
Reasoning: release builds already ran the recovery path; only debug/sanitizer builds were aborting on a state that was always reachable. So the user-visible behavior on stable builds does not change — this only stops master CI sanitizer/debug stress runs from aborting on a state the recovery path was always meant to handle.
LLVM Coverage Report
Changed lines: 11.11% (1/9) · Uncovered code
Removes the `DEBUG_OR_SANITIZER_BUILD` abort that fired `Logical error: 'Unexpected refresh lock in keeper'` from `DB::RefreshTask::refreshTask()` at `src/Storages/MaterializedView/RefreshTask.cpp:523`. The recovery branch that was already in place for release builds now runs unconditionally.
Why this is not a bug in `RefreshTask`

The assertion was added in #103427 as a defensive guard, but the state it asserts against (`/running` exists in Keeper with our replica name in its data, and this `RefreshTask` instance is in `Scheduling`) is reachable through normal operation:
1. Server kill+restart inside the Keeper session timeout window. The ephemeral `/running` znode created by the previous Keeper session of this server is still observable to the new session until Keeper expires the old session (~30s by default). The new `RefreshTask` instance reads `/running` with our replica's name in its data and wrongly concludes it is itself running a refresh. This is the dominant cause in stress tests with `RandomQueryKiller`.

2. DETACH+ATTACH while Keeper is unreachable during shutdown. `RefreshTask::shutdown` does not proactively call `removeRunningZnodeIfMine`; it relies on the in-flight cleanup path `updateCoordinationState(running=false)`. If Keeper is unreachable when shutdown runs, the cleanup cannot run, the znode lingers, and on ATTACH the new task observes the leftover.
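A minimal sketch of why the leftover survives DETACH, using a dict as a stand-in for Keeper. `shutdown_cleanup` and the path are hypothetical names for illustration, not ClickHouse's actual API:

```python
# Toy model of scenario 2: the shutdown-time cleanup needs a live Keeper
# connection, so the /running znode outlives DETACH when Keeper is down.

def shutdown_cleanup(store, path, my_name, keeper_reachable):
    """Best-effort cleanup, skipped entirely when Keeper is unreachable
    (mirrors relying on updateCoordinationState(running=false))."""
    if keeper_reachable and store.get(path) == my_name:
        del store[path]

store = {"/running": "replica_1"}          # znode set while a refresh ran
shutdown_cleanup(store, "/running", "replica_1",
                 keeper_reachable=False)   # DETACH while Keeper is down
assert store == {"/running": "replica_1"}  # leftover persists; it is what
                                           # the new task observes on ATTACH
```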
The release-build recovery branch is correct in both cases: `removeRunningZnodeIfMine` does `tryGet` + a data match against this replica's name + `tryRemove` — all session-agnostic and safe across session boundaries. `schedule_keeper_retry` requeues a Keeper re-read in 5s; the next iteration observes the cleaned-up state and proceeds normally.
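The session-agnostic property is the crux: every step works on plain path and data, not on session ownership. A sketch of that check-then-remove sequence against an in-memory stand-in for Keeper — the class and function names are illustrative, not the actual ClickHouse code:

```python
# Sketch of a session-agnostic "remove the znode only if it is ours"
# cleanup, using a dict as a stand-in for Keeper. All names illustrative.

class FakeKeeper:
    def __init__(self):
        self.znodes = {}  # path -> data

    def try_get(self, path):
        """Return (exists, data) without raising if the node is absent."""
        if path in self.znodes:
            return True, self.znodes[path]
        return False, None

    def try_remove(self, path):
        """Remove if present; never raise."""
        return self.znodes.pop(path, None) is not None

def remove_running_znode_if_mine(keeper, path, my_replica_name):
    """Delete /running only when its data names this replica.

    Each step operates on path + data, so it does not matter which
    Keeper session originally created the (ephemeral) znode.
    """
    exists, data = keeper.try_get(path)
    if exists and data == my_replica_name:
        return keeper.try_remove(path)
    return False  # absent, or owned by another replica: leave it alone

# A stale znode left by a previous session of *this* replica is cleaned up:
k = FakeKeeper()
k.znodes["/running"] = "replica_1"
assert remove_running_znode_if_mine(k, "/running", "replica_1") is True
assert "/running" not in k.znodes

# A znode that belongs to a different replica is untouched:
k.znodes["/running"] = "replica_2"
assert remove_running_znode_if_mine(k, "/running", "replica_1") is False
assert k.znodes["/running"] == "replica_2"
```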
This change drops the `#ifdef DEBUG_OR_SANITIZER_BUILD` gate so the recovery runs in every build, and demotes the misleading `Likely a bug` `LOG_ERROR` to a `LOG_WARNING` that explains the actual cause.

User-visible behavior on stable (release) builds does not change — release builds already ran the recovery path. The fix only stops debug/sanitizer builds from aborting on a state that was always reachable. Per the bot's review (#103849 (comment)), the category is therefore CI Fix or Improvement rather than Bug Fix.

Reproduction / regression test
The new integration test `test_refreshable_mv/test.py::test_refresh_recovers_from_stale_running_znode` plants a stale `/running` znode through a separate Kazoo session — exactly the way an old Keeper session's leftover would look from the new session — and verifies `SYSTEM REFRESH VIEW` + `SYSTEM WAIT VIEW` complete without server abort and the leftover znode is cleaned up.
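The planting trick works because znode data looks the same from any session. A self-contained sketch of the test's idea, with a dict standing in for the live Keeper and all names hypothetical:

```python
# Sketch of the regression test's idea: a *separate* session plants a znode
# carrying the victim replica's name, which is indistinguishable from a
# leftover of that replica's own previous session. Names are illustrative.

keeper_store = {}  # shared view: all sessions see the same znodes

def plant_stale_running_znode(store, path, victim_replica):
    # What the test does through a second Kazoo session in the real setup.
    store[path] = victim_replica

def refresh_task_iteration(store, path, my_name):
    """What the fixed task does on seeing its own name: recover, not abort."""
    if store.get(path) == my_name:
        store.pop(path, None)  # removeRunningZnodeIfMine equivalent
        return "recovered"
    return "no_stale_state"

plant_stale_running_znode(keeper_store, "/refresh/running", "node1")
assert refresh_task_iteration(keeper_store, "/refresh/running", "node1") == "recovered"
assert "/refresh/running" not in keeper_store
```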
Verified locally in a debug build: the test FAILS without this fix (`Logical error: 'Unexpected refresh lock in keeper'` aborts node1) and PASSES with it.
CIDB evidence

30-day window: ~22 hits across 7 unrelated PRs (STIDs `2508-3e24` and `2508-4754`). Primary failure modes:

Per @alexey-milovidov's directive on #103737
Changelog category (leave one): CI Fix or Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md): not required for this category.

Documentation entry for user-facing changes: none.